Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations.

castorini/pyserini

Pyserini

Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations. Retrieval using sparse representations is provided via integration with our group's Anserini IR toolkit, which is built on Lucene. Retrieval using dense representations is provided via integration with Facebook's Faiss library.

Pyserini is primarily designed to provide effective, reproducible, and easy-to-use first-stage retrieval in a multi-stage ranking architecture. Our toolkit is self-contained as a standard Python package and comes with queries, relevance judgments, prebuilt indexes, and evaluation scripts for many commonly used IR test collections. With Pyserini, it's easy to reproduce runs on a number of standard IR test collections!

For additional details, our paper in SIGIR 2021 provides a nice overview.

✨ New! Guide to working with the MS MARCO 2.1 Document Corpus for TREC 2024 RAG Track.

❗ Anserini was upgraded from JDK 11 to JDK 21 at commit 272565 (2024/04/03), which corresponds to the release of v0.35.0. Correspondingly, Pyserini was upgraded to JDK 21 at commit b2f677 (2024/04/04).

🎬 Installation

Install via PyPI:

pip install pyserini

Pyserini is built on Python 3.10 (other versions might work, but YMMV) and Java 21 (due to its dependency on Anserini). A pip installation will automatically pull in major dependencies such as PyTorch, 🤗 Transformers, and the ONNX Runtime.

The toolkit also comes with "extras":

pip install 'pyserini[extras]'

Notably, faiss-cpu, lightgbm, and nmslib are included in these "extras". Installation of these packages can be temperamental, which is why they are not included in the core dependencies. It might be a good idea to install these yourself separately.

The software ecosystem is rapidly evolving and a potential source of frustration is incompatibility among different versions of underlying dependencies. We provide additional detailed installation instructions here.

If you're planning on just using Pyserini, then the pip instruction (without "extras") should be fine. However, if you're planning on contributing to the codebase or want to work with the latest not-yet-released features, you'll need a development installation. Instructions are provided here.
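Since the toolkit targets Python 3.10 and Java 21, it can help to check both before installing. The following is a minimal sketch, not part of Pyserini: the helper names `parse_java_major` and `check_environment` are hypothetical, and treating any Java release below 21 as a problem is an assumption based on the requirement stated above.

```python
import re
import subprocess
import sys
from typing import List, Optional


def parse_java_major(version_output: str) -> Optional[int]:
    """Extract the major Java version from the text printed by `java -version`."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    # Legacy scheme: Java 8 and earlier report themselves as "1.8.0_...".
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def check_environment() -> List[str]:
    """Return human-readable warnings; an empty list means nothing looked off."""
    warnings = []
    if sys.version_info[:2] != (3, 10):
        warnings.append(
            "Python %d.%d detected; Pyserini is built on 3.10 (YMMV)."
            % sys.version_info[:2]
        )
    try:
        # `java -version` conventionally writes to stderr, so read both streams.
        result = subprocess.run(
            ["java", "-version"], capture_output=True, text=True, check=False
        )
        major = parse_java_major(result.stderr + result.stdout)
    except OSError:
        major = None  # no `java` executable on PATH
    if major is None:
        warnings.append("No usable Java runtime found on PATH.")
    elif major < 21:  # assumption: anything older than 21 is too old
        warnings.append("Java %d detected; Java 21 is expected." % major)
    return warnings
```

Calling `check_environment()` before `pip install pyserini` prints nothing by itself; it only returns the list of warnings, so the caller decides how to report them.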

🙋 How do I search?

Pyserini supports different types of retrieval models. See this guide for details on how to search common corpora in IR and NLP research (e.g., MS MARCO, NaturalQuestions, BEIR, etc.) using indexes that we have already built for you. Here are direct links into the guide:

Once you get the top-k results, you'll actually want to fetch the document text... See this guide for how.
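As a rough illustration of both steps (searching a prebuilt index, then fetching text for each hit), here is a minimal sketch. It assumes the `LuceneSearcher` class from `pyserini.search.lucene` and the prebuilt index name `msmarco-v1-passage`; the helper names `open_prebuilt` and `search_with_text` are hypothetical, and the linked guides remain the authoritative reference.

```python
from typing import List, Optional, Tuple


def open_prebuilt(index_name: str = "msmarco-v1-passage"):
    """Open a prebuilt Lucene index (downloaded on first use)."""
    # Imported lazily: this needs Pyserini and a Java runtime to be installed.
    from pyserini.search.lucene import LuceneSearcher

    return LuceneSearcher.from_prebuilt_index(index_name)


def search_with_text(
    searcher, query: str, k: int = 10, max_chars: int = 200
) -> List[Tuple[int, str, float, Optional[str]]]:
    """Run `query` and pair each hit with a prefix of its stored raw text.

    `searcher` is any object exposing `search(query, k=...)` and `doc(docid)`,
    which is how this sketch assumes `LuceneSearcher` behaves.
    """
    results = []
    for rank, hit in enumerate(searcher.search(query, k=k), start=1):
        doc = searcher.doc(hit.docid)
        # The lookup can come back empty if the index does not store documents.
        text = doc.raw() if doc is not None else None
        if text is not None:
            text = text[:max_chars]
        results.append((rank, hit.docid, hit.score, text))
    return results


# Usage (requires Pyserini, Java, and network access for the first download):
#   searcher = open_prebuilt()
#   for rank, docid, score, text in search_with_text(searcher, "what is a lichen?"):
#       print(rank, docid, round(score, 4), text)
```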

🙋 How do I index my own corpus?

Well, it depends on what type of retrieval model you want to search with:

The steps are different for different classes of models: this guide (same as the links above) describes the details.
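For the sparse (Lucene) case, one common starting point is to put the corpus into JSON Lines form, one document per line. The sketch below assumes the `id` and `contents` field names used by the JSON collection format; the helper `write_jsonl_corpus` and the file name `docs.jsonl` are hypothetical, and the indexing command in the comment should be checked against the guide before use.

```python
import json
from pathlib import Path
from typing import Iterable, Tuple


def write_jsonl_corpus(
    docs: Iterable[Tuple[object, str]], out_dir: str, filename: str = "docs.jsonl"
) -> Path:
    """Write (docid, text) pairs as one JSON object per line."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    target = out_path / filename
    with target.open("w", encoding="utf-8") as handle:
        for docid, text in docs:
            record = {"id": str(docid), "contents": text}
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
    return target


# Assumed follow-up step (verify the flags against the indexing guide):
#   python -m pyserini.index.lucene \
#     --collection JsonCollection \
#     --input <out_dir> \
#     --index <index_dir> \
#     --generator DefaultLuceneDocumentGenerator \
#     --storeRaw
```

Dense and learned sparse models need an additional encoding step before indexing, which this sketch does not cover.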

🙋 Additional FAQs

⚗️ Reproducibility

With Pyserini, it's easy to reproduce runs on a number of standard IR test collections! We provide a number of prebuilt indexes that directly support reproducibility "out of the box".

In our SIGIR 2022 paper, we introduced "two-click reproductions" that allow anyone to reproduce experimental runs with only two clicks (i.e., copy and paste). Documentation is organized into reproduction matrices for different corpora that provide a summary of different experimental conditions and query sets:

For more details, see our paper on Building a Culture of Reproducibility in Academic Research.

Additional reproduction guides below provide detailed step-by-step instructions.

Sparse Retrieval

Dense Retrieval

Hybrid Sparse-Dense Retrieval

Available Corpora

| Corpora | Size | Checksum |
|:--------|-----:|:---------|
| MS MARCO V1 passage: uniCOIL (noexp) | 2.7 GB | f17ddd8c7c00ff121c3c3b147d2e17d8 |
| MS MARCO V1 passage: uniCOIL (d2q-T5) | 3.4 GB | 78eef752c78c8691f7d61600ceed306f |
| MS MARCO V1 doc: uniCOIL (noexp) | 11 GB | 11b226e1cacd9c8ae0a660fd14cdd710 |
| MS MARCO V1 doc: uniCOIL (d2q-T5) | 19 GB | 6a00e2c0c375cb1e52c83ae5ac377ebb |
| MS MARCO V2 passage: uniCOIL (noexp) | 24 GB | d9cc1ed3049746e68a2c91bf90e5212d |
| MS MARCO V2 passage: uniCOIL (d2q-T5) | 41 GB | 1949a00bfd5e1f1a230a04bbc1f01539 |
| MS MARCO V2 doc: uniCOIL (noexp) | 55 GB | 97ba262c497164de1054f357caea0c63 |
| MS MARCO V2 doc: uniCOIL (d2q-T5) | 72 GB | c5639748c2cbad0152e10b0ebde3b804 |
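
After downloading one of these corpora, the checksum can be compared against the value listed above. This is a minimal sketch that assumes the checksums are MD5 digests (they are 32 hexadecimal digits long); the helper names and the file name in the usage comment are hypothetical.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reads keep memory use flat even for multi-gigabyte archives.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: str, expected: str) -> bool:
    """Compare a file's digest with an expected value, ignoring case and spaces."""
    return md5_of_file(path) == expected.strip().lower()


# Usage with a hypothetical local file name:
#   matches_checksum("unicoil-noexp-passage.tar",
#                    "f17ddd8c7c00ff121c3c3b147d2e17d8")
```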

📃 Additional Documentation

📜️ Release History

older... (and historic notes)

📜️ Historical Notes

⁉️ Lucene 8 to Lucene 9 Transition. In 2022, Pyserini underwent a transition from Lucene 8 to Lucene 9. Most of the prebuilt indexes have been rebuilt using Lucene 9, but there are a few still based on Lucene 8.

Explanations:

  • What's the impact? Indexes built with Lucene 8 are not fully compatible with Lucene 9 code (see Anserini #1952). The workaround is to disable consistent tie-breaking, which happens automatically if a Lucene 8 index is detected by Pyserini. However, Lucene 9 code running on Lucene 8 indexes will give slightly different results than Lucene 8 code running on Lucene 8 indexes. Note that Lucene 8 code is not able to read indexes built with Lucene 9.

  • Why is this necessary? Although disruptive, an upgrade to Lucene 9 is necessary to take advantage of Lucene's HNSW indexes, which will increase the capabilities of Pyserini and open up the design space of dense/sparse hybrids.
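
To make the tie-breaking point concrete: when several documents receive exactly the same score, their relative order is only reproducible if a fixed secondary key decides it. The sketch below illustrates the idea with the document id as that key, which is an assumption made for illustration and not a description of how Anserini implements it; the function name is hypothetical.

```python
from typing import List, Tuple


def rank_with_tie_breaking(scored: List[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Order (docid, score) pairs by descending score, then ascending docid."""
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Two runs that return the same scores in a different internal order...
run_a = [("doc7", 3.2), ("doc2", 3.2), ("doc9", 4.1)]
run_b = [("doc2", 3.2), ("doc9", 4.1), ("doc7", 3.2)]

# ...produce identical rankings once ties are broken consistently.
ranking_a = rank_with_tie_breaking(run_a)
ranking_b = rank_with_tie_breaking(run_b)
```

Without the secondary key, the order of `doc2` and `doc7` would depend on the order in which the index happened to return them, which is the kind of small difference described above.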

With v0.11.0.0 and before, Pyserini versions adopted the convention of X.Y.Z.W, where X.Y.Z tracks the version of Anserini, and W is used to distinguish different releases on the Python end. Starting with Anserini v0.12.0, Anserini and Pyserini versions have become decoupled.

At the time of this note, Anserini was designed to work with JDK 11. A JRE path change in JDK 9 and later broke pyjnius 1.2.0, as documented in this issue, also reported in Anserini here and here. This issue was fixed with pyjnius 1.2.1 (released December 2019). The original error is documented in this notebook, and this notebook documents the fix.

✨ References

If you use Pyserini, please cite the following paper:

@INPROCEEDINGS{Lin_etal_SIGIR2021_Pyserini,
   author = "Jimmy Lin and Xueguang Ma and Sheng-Chieh Lin and Jheng-Hong Yang and Ronak Pradeep and Rodrigo Nogueira",
   title = "{Pyserini}: A {Python} Toolkit for Reproducible Information Retrieval Research with Sparse and Dense Representations",
   booktitle = "Proceedings of the 44th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2021)",
   year = 2021,
   pages = "2356--2362",
}

🙏 Acknowledgments

This research is supported in part by the Natural Sciences and Engineering Research Council (NSERC) of Canada.