pyserini-install

A Python toolkit for reproducible information retrieval research with sparse and dense representations

License
- OSI Approved :: Apache Software License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations. Retrieval using sparse representations is provided via integration with our group's Anserini IR toolkit, which is built on Lucene. Retrieval using dense representations is provided via integration with Facebook's Faiss library.

Pyserini is primarily designed to provide effective, reproducible, and easy-to-use first-stage retrieval in a multi-stage ranking architecture. Our toolkit is self-contained as a standard Python package and comes with queries, relevance judgments, pre-built indexes, and evaluation scripts for many commonly used IR test collections.

Installation

Install via PyPI:

pip install pyserini

Pyserini requires Python 3.10+ and Java 11 (due to its dependency on Anserini).

Since dense retrieval depends on neural networks, Pyserini requires a more complex set of dependencies to use this feature. A pip installation will automatically pull in the 🤗 Transformers library to satisfy the package requirements. Pyserini also depends on PyTorch and Faiss, but since these packages may require platform-specific custom configuration, they are not explicitly listed in the package requirements. We leave the installation of these packages to you. Refer to documentation in our repo for additional details.
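The prerequisites described above can be checked before running anything. The following is a minimal sketch, not part of Pyserini: the helper name check_environment is hypothetical, it assumes PyTorch and Faiss are importable as torch and faiss, and it only tests that a java executable is on the PATH without checking its version.

```python
import importlib.util
import shutil
import sys


def check_environment(min_python=(3, 10)):
    """Return a dict describing which prerequisites appear to be present.

    Hypothetical helper for illustration; it is not part of Pyserini.
    """
    return {
        # Interpreter version, compared against the stated minimum.
        'python_ok': sys.version_info >= min_python,
        # A `java` executable on PATH; the Java version is not checked.
        'java_on_path': shutil.which('java') is not None,
        # Packages needed for dense retrieval, which you install yourself.
        'torch_installed': importlib.util.find_spec('torch') is not None,
        'faiss_installed': importlib.util.find_spec('faiss') is not None,
    }


if __name__ == '__main__':
    for name, ok in check_environment().items():
        print(f'{name}: {"yes" if ok else "no"}')
```

If torch_installed or faiss_installed comes back as "no", sparse retrieval may still work, but the dense retrieval example further down should be expected to fail at import time.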

Usage

The LuceneSearcher class provides the entry point for sparse retrieval using bag-of-words representations. Anserini supports a number of pre-built indexes for common collections that it'll automatically download for you and store in ~/.cache/pyserini/indexes/. Here's how to use a pre-built index for the MS MARCO passage ranking task and issue a query interactively (using BM25 ranking):

from pyserini.search.lucene import LuceneSearcher

lucene_searcher = LuceneSearcher.from_prebuilt_index('msmarco-v1-passage')
hits = lucene_searcher.search('what is a lobster roll?')

for i in range(0, 10):
    print(f'{i+1:2} {hits[i].docid:7} {hits[i].score:.5f}')

The results should be as follows:

 1 7157707 11.00830
 2 6034357 10.94310
 3 5837606 10.81740
 4 7157715 10.59820
 5 6034350 10.48360
 6 2900045 10.31190
 7 7157713 10.12300
 8 1584344 10.05290
 9 533614  9.96350
10 6234461 9.92200
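The loop above indexes hits[0] through hits[9] and would raise an IndexError if a query returned fewer than ten results. A minimal sketch of a more defensive variant follows; format_hits is a hypothetical helper, and the demo objects are stand-ins that only mimic the docid and score attributes used above.

```python
from types import SimpleNamespace


def format_hits(hits, k=10):
    """Format up to k hits as 'rank docid score' lines.

    Works with any objects exposing `docid` and `score` attributes.
    """
    return [
        f'{rank:2} {hit.docid:7} {hit.score:.5f}'
        for rank, hit in enumerate(list(hits)[:k], start=1)
    ]


# Stand-in objects for illustration; real hits would come from
# lucene_searcher.search(...).
demo_hits = [
    SimpleNamespace(docid='7157707', score=11.0083),
    SimpleNamespace(docid='533614', score=9.9635),
]
for line in format_hits(demo_hits):
    print(line)
```

With a real searcher, the call would be format_hits(hits), and a query with fewer than ten results simply produces fewer lines.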

You can examine the actual text of the first hit, as follows:

hits[0].raw

Which is:

Cookbook: Lobster roll Media: Lobster roll A lobster-salad style roll from The Lobster Roll in Amagansett, New York on the Eastern End of Long Island A lobster roll is a fast-food sandwich native to New England made of lobster meat served on a grilled hot dog-style bun with the opening on the top rather than the side. The filling may also contain butter, lemon juice, salt and black pepper, with variants made in other parts of New England replacing the butter with mayonnaise. Others contain diced celery or scallion. Potato chips or french fries are the typical sides.
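Since pre-built indexes are stored in ~/.cache/pyserini/indexes/ after the first download, it can be useful to see what is already on disk. The sketch below uses only the standard library; list_cached_indexes is a hypothetical helper, and it makes no assumption about how entries inside that directory are named.

```python
from pathlib import Path


def list_cached_indexes(cache_dir='~/.cache/pyserini/indexes'):
    """Return sorted names of entries in the index cache, or [] if absent."""
    root = Path(cache_dir).expanduser()
    if not root.is_dir():
        # Nothing has been downloaded yet, or the cache lives elsewhere.
        return []
    return sorted(entry.name for entry in root.iterdir())


if __name__ == '__main__':
    for name in list_cached_indexes():
        print(name)
```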

The FaissSearcher class provides the entry point for dense retrieval, and its usage is quite similar to LuceneSearcher. The only additional thing we need to specify for dense retrieval is the query encoder.

from pyserini.search.faiss import FaissSearcher, TctColBertQueryEncoder

encoder = TctColBertQueryEncoder('castorini/tct_colbert-v2-hnp-msmarco')
faiss_searcher = FaissSearcher.from_prebuilt_index(
    'msmarco-v1-passage.tct_colbert-v2-hnp',
    encoder
)
hits = faiss_searcher.search('what is a lobster roll')

for i in range(0, 10):
    print(f'{i+1:2} {hits[i].docid:7} {hits[i].score:.5f}')

The results should be as follows:

 1 7157715 80.14327
 2 7157710 80.09985
 3 7157707 79.70108
 4 6321969 79.37906
 5 6034350 79.14087
 6 7157708 79.08399
 7 4112862 79.03954
 8 7157713 78.71204
 9 4112861 78.67692
10 5515474 78.54551

The Faiss index does not store the original passages, so let's use the lucene_searcher to fetch the actual text:

lucene_searcher.doc(hits[0].docid).raw()

Which is:

A Lobster Roll is a bread roll filled with bite-sized chunks of lobster meat. Lobster Rolls are made on the Atlantic coast of North America, from the New England area of the United States on up into the Maritimes areas of Canada.
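The same lookup can be applied to several dense hits at once. This is a minimal sketch under stated assumptions: fetch_texts is a hypothetical helper, doc_store is any object with a doc(docid) method whose result has a raw() method (as lucene_searcher is used above), and the None check is a defensive guess for documents that cannot be found.

```python
def fetch_texts(hits, doc_store, k=3):
    """Pair the docids of the top-k hits with their stored text.

    `doc_store` is expected to behave like lucene_searcher above:
    doc_store.doc(docid).raw() returns the passage text.
    """
    results = []
    for hit in list(hits)[:k]:
        doc = doc_store.doc(hit.docid)
        # Guard against a docid that the store cannot resolve.
        text = doc.raw() if doc is not None else None
        results.append((hit.docid, text))
    return results


# Intended usage with the objects created earlier:
#   for docid, text in fetch_texts(hits, lucene_searcher):
#       print(docid, text)
```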

For complete documentation, please refer to our repo.


Release history

- 0.23.1 (Dec 19, 2023), this version
- 0.23.0 (Dec 19, 2023)

Download files


Source Distribution

pyserini-install-0.23.1.tar.gz (349.1 kB view details)

Uploaded Dec 19, 2023 Source

Built Distribution


pyserini_install-0.23.1-py3-none-any.whl (389.2 kB view details)

Uploaded Dec 19, 2023 Python 3

File details

Details for the file pyserini-install-0.23.1.tar.gz.

File metadata

Download URL: pyserini-install-0.23.1.tar.gz
Upload date: Dec 19, 2023
Size: 349.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.11.6

File hashes

Hashes for pyserini-install-0.23.1.tar.gz
Algorithm	Hash digest
SHA256	`eeb900af8ada38cffda180baa5291035eb7fa7a83fe516d7f72506c178c94045`
MD5	`487eb76c4e41b9621a66337c03e67e40`
BLAKE2b-256	`13da66f4766509436ea541e986e97f0d44001a4619a00417fd334b0573e6adf8`


File details

Details for the file pyserini_install-0.23.1-py3-none-any.whl.

File metadata

Download URL: pyserini_install-0.23.1-py3-none-any.whl
Upload date: Dec 19, 2023
Size: 389.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.11.6

File hashes

Hashes for pyserini_install-0.23.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a951920a8821529f0d522e1a84ad03db8f200bdcce1c49e468c2a7b331037640`
MD5	`e85dc1ce64567a4cd56ed1ae6429a704`
BLAKE2b-256	`8e89f4558b6bbbee86338f2e7e5ef9e48298ed211f7ab1455e2b09e5c6048e65`
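The SHA256 digests listed above can be used to check a downloaded file before installing it. The following is a minimal sketch using only hashlib from the standard library; sha256_of and matches_expected are hypothetical helper names, and the digests are copied from the tables above.

```python
import hashlib

# SHA256 digests copied from the file details above.
EXPECTED_SHA256 = {
    'pyserini-install-0.23.1.tar.gz':
        'eeb900af8ada38cffda180baa5291035eb7fa7a83fe516d7f72506c178c94045',
    'pyserini_install-0.23.1-py3-none-any.whl':
        'a951920a8821529f0d522e1a84ad03db8f200bdcce1c49e468c2a7b331037640',
}


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, 'rb') as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, filename):
    """Compare a local file against the published digest for `filename`."""
    return sha256_of(path) == EXPECTED_SHA256[filename]
```

For example, matches_expected('pyserini-install-0.23.1.tar.gz', 'pyserini-install-0.23.1.tar.gz') should return True for an unmodified download in the current directory.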

