A Python toolkit for reproducible information retrieval research with sparse and dense representations

Project description

Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations. Retrieval using sparse representations is provided via integration with our group's Anserini IR toolkit, which is built on Lucene. Retrieval using dense representations is provided via integration with Facebook's Faiss library.

Pyserini is primarily designed to provide effective, reproducible, and easy-to-use first-stage retrieval in a multi-stage ranking architecture. Our toolkit is self-contained as a standard Python package and comes with queries, relevance judgments, pre-built indexes, and evaluation scripts for many commonly used IR test collections.

Installation

Install via PyPI:

pip install pyserini

Pyserini requires Python 3.8+ and Java 11 (due to its dependency on Anserini).
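
Before installing, it can help to confirm that the interpreter and a Java runtime are visible. The following is a minimal sketch, not part of Pyserini: `check_environment` is a hypothetical helper, and it only checks that some `java` executable is on the PATH, not that it is version 11.

```python
import shutil
import sys


def check_environment(min_python=(3, 8)):
    """Report whether the basic prerequisites appear to be present.

    Hypothetical helper for illustration; it does not verify the Java version.
    """
    return {
        # sys.version_info compares element-wise against a plain tuple.
        'python_ok': sys.version_info >= min_python,
        # shutil.which returns None when no executable is found on the PATH.
        'java_on_path': shutil.which('java') is not None,
    }


status = check_environment()
print(status)
```

If `java_on_path` is False, install a JDK and make sure it is on your PATH before importing Pyserini.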

Since dense retrieval depends on neural networks, Pyserini requires a more complex set of dependencies to use this feature. A pip installation will automatically pull in the 🤗 Transformers library to satisfy the package requirements. Pyserini also depends on PyTorch and Faiss, but since these packages may require platform-specific custom configuration, they are not explicitly listed in the package requirements. We leave the installation of these packages to you. Refer to documentation in our repo for additional details.
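
Because PyTorch and Faiss are not installed automatically, a quick check for them can save a confusing import error later. This is a sketch under the assumption that the packages are importable as `torch` and `faiss`; `missing_dense_dependencies` is a hypothetical helper, not a Pyserini function.

```python
from importlib.util import find_spec


def missing_dense_dependencies(modules=('torch', 'faiss')):
    """Return the names of top-level modules that cannot be located.

    find_spec looks a module up without importing it, so this stays cheap.
    """
    return [name for name in modules if find_spec(name) is None]


missing = missing_dense_dependencies()
if missing:
    print('Install these before using dense retrieval:', ', '.join(missing))
else:
    print('PyTorch and Faiss both appear to be installed.')
```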

Usage

The LuceneSearcher class provides the entry point for sparse retrieval using bag-of-words representations. Pyserini supports a number of pre-built indexes for common collections that it'll automatically download for you and store in ~/.cache/pyserini/indexes/. Here's how to use a pre-built index for the MS MARCO passage ranking task and issue a query interactively (using BM25 ranking):

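from pyserini.search.lucene import LuceneSearcher

# Open the pre-built MS MARCO passage index (downloaded on first use).
searcher = LuceneSearcher.from_prebuilt_index('msmarco-v1-passage')
hits = searcher.search('what is a lobster roll?')

# Print the rank, docid, and BM25 score of the top 10 hits.
for i in range(0, 10):
    print(f'{i+1:2} {hits[i].docid:7} {hits[i].score:.5f}')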

The results should be as follows:

 1 7157707 11.00830
 2 6034357 10.94310
 3 5837606 10.81740
 4 7157715 10.59820
 5 6034350 10.48360
 6 2900045 10.31190
 7 7157713 10.12300
 8 1584344 10.05290
 9 533614  9.96350
10 6234461 9.92200
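
The loop above indexes `hits[0]` through `hits[9]` directly, which raises an IndexError if a query returns fewer than 10 results. A minimal sketch of a more defensive variant follows; `format_hits` is a hypothetical helper, and `FakeHit` is a stand-in so the example runs without an index. It assumes only that each hit exposes `docid` and `score`, as in the example above.

```python
from collections import namedtuple


def format_hits(hits, k=10):
    """Format up to k hits as 'rank docid score' lines."""
    lines = []
    # Slicing never raises, even when fewer than k hits are available.
    for rank, hit in enumerate(hits[:k], start=1):
        lines.append(f'{rank:2} {hit.docid:7} {hit.score:.5f}')
    return lines


# Stand-in for real search results (hypothetical type, for illustration only).
FakeHit = namedtuple('FakeHit', ['docid', 'score'])
demo = [FakeHit('7157707', 11.0083), FakeHit('6034357', 10.9431)]

for line in format_hits(demo):
    print(line)
```

With a real searcher, `format_hits(searcher.search(query))` would replace the demo list.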

The FaissSearcher class provides the entry point for dense retrieval, and its usage is quite similar to LuceneSearcher. The only additional thing we need to specify for dense retrieval is the query encoder.

from pyserini.search.faiss import FaissSearcher, TctColBertQueryEncoder

# The query encoder maps query text to the dense vector space of the index.
encoder = TctColBertQueryEncoder('castorini/tct_colbert-msmarco')
searcher = FaissSearcher.from_prebuilt_index(
    'msmarco-passage-tct_colbert-hnsw',
    encoder
)
hits = searcher.search('what is a lobster roll')

# Print the rank, docid, and similarity score of the top 10 hits.
for i in range(0, 10):
    print(f'{i+1:2} {hits[i].docid:7} {hits[i].score:.5f}')

The results should be as follows:

 1 7157710 70.53742
 2 7157715 70.50040
 3 7157707 70.13804
 4 6034350 69.93666
 5 6321969 69.62683
 6 4112862 69.34587
 7 5515474 69.21354
 8 7157708 69.08416
 9 6321974 69.06841
10 2920399 69.01737
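
The sparse and dense runs return different rankings for essentially the same query. As a small illustration, the sketch below hard-codes the docids from the two result lists shown above and reports which passages appear in both top-10 lists. The helper name `overlap_in_order` is hypothetical.

```python
# Docids copied from the BM25 (sparse) and TCT-ColBERT (dense) results above.
sparse_docids = ['7157707', '6034357', '5837606', '7157715', '6034350',
                 '2900045', '7157713', '1584344', '533614', '6234461']
dense_docids = ['7157710', '7157715', '7157707', '6034350', '6321969',
                '4112862', '5515474', '7157708', '6321974', '2920399']


def overlap_in_order(first, second):
    """Return items of `first` that also occur in `second`, keeping `first`'s order."""
    second_set = set(second)
    return [docid for docid in first if docid in second_set]


shared = overlap_in_order(sparse_docids, dense_docids)
print(shared)  # ['7157707', '7157715', '6034350']
```

Three of the ten passages are retrieved by both methods, which suggests the two representations surface partly different evidence for this query.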

For complete documentation, please refer to our repo.

Download files

Source Distribution

pyserini-0.19.0.tar.gz (130.4 MB)

Uploaded Source

Built Distribution

pyserini-0.19.0-py3-none-any.whl (130.5 MB)

Uploaded Python 3

File details

Details for the file pyserini-0.19.0.tar.gz.

File metadata

  • Download URL: pyserini-0.19.0.tar.gz
  • Upload date:
  • Size: 130.4 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.7.1 importlib_metadata/4.10.1 pkginfo/1.8.2 requests/2.27.1 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.8.12

File hashes

Hashes for pyserini-0.19.0.tar.gz

Algorithm    Hash digest
SHA256       bd6887ce7541a9a62cf043c7ad8a428862db5bf5b34da9bdd1ff8645bb3f98d0
MD5          64f75e0ef521913f3ac8900a4e81ff1a
BLAKE2b-256  0194932b783a2c25942b6e47353f82d1e5195c9b8dcd6b515587d562260c47f3

File details

Details for the file pyserini-0.19.0-py3-none-any.whl.

File metadata

  • Download URL: pyserini-0.19.0-py3-none-any.whl
  • Upload date:
  • Size: 130.5 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.7.1 importlib_metadata/4.10.1 pkginfo/1.8.2 requests/2.27.1 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.8.12

File hashes

Hashes for pyserini-0.19.0-py3-none-any.whl

Algorithm    Hash digest
SHA256       135fc8f3da42fee9ce2a91630ffd1e8b178d97de32c4ed82afa033db28555693
MD5          7215dfee5ffc5de01c949a9a3c876949
BLAKE2b-256  ae986aca4e513fc3b5c988192bc20f544f10c3de2ae413f358f75c2fe93fd4d5
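
The SHA256 digests listed for the two files can be checked against a local download. The following is a minimal sketch using Python's standard hashlib module. The helper names `sha256_of_file` and `matches_expected` are hypothetical, and the local path you pass in is your own.

```python
import hashlib

# SHA256 digests copied from the file listings above.
EXPECTED_SHA256 = {
    'pyserini-0.19.0.tar.gz':
        'bd6887ce7541a9a62cf043c7ad8a428862db5bf5b34da9bdd1ff8645bb3f98d0',
    'pyserini-0.19.0-py3-none-any.whl':
        '135fc8f3da42fee9ce2a91630ffd1e8b178d97de32c4ed82afa033db28555693',
}


def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        # iter() with a sentinel stops once read() returns empty bytes.
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, filename):
    """Return True if the file at `path` has the digest listed for `filename`."""
    return sha256_of_file(path) == EXPECTED_SHA256[filename]
```

For example, `matches_expected('/path/to/pyserini-0.19.0.tar.gz', 'pyserini-0.19.0.tar.gz')` should return True for an intact download. Reading in chunks keeps memory use flat, which matters for files of this size.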
