Skip to main content

Python interface to the Anserini IR toolkit built on Lucene

Project description

Pyserini provides a simple Python interface to the Anserini IR toolkit via pyjnius.

Installation

Install via PyPI

pip install pyserini

Fetch the Anserini fatjar from Maven Central:

wget -O anserini-0.6.0-fatjar.jar https://search.maven.org/remotecontent?filepath=io/anserini/anserini/0.6.0/anserini-0.6.0-fatjar.jar

Set the environment variable ANSERINI_CLASSPATH to the directory where the fatjar is located:

export ANSERINI_CLASSPATH="/path/to/fatjar/directory"

Here's a sample pre-built index on TREC Disks 4 & 5 to play with (used in the TREC 2004 Robust Track):

wget https://www.dropbox.com/s/mdoly9sjdalh44x/lucene-index.robust04.pos%2Bdocvectors%2Brawdocs.tar.gz
tar xvfz lucene-index.robust04.pos+docvectors+rawdocs.tar.gz

Usage

Use the SimpleSearcher for searching:

from pyserini.search import pysearch

searcher = pysearch.SimpleSearcher('lucene-index.robust04.pos+docvectors+rawdocs')
hits = searcher.search('hubble space telescope')

# Prints the first 10 hits
for i in range(0, 10):
    print('{} {} {}'.format(i+1, hits[i].docid, hits[i].score))

# Grab the actual text
hits[0].content

Configure BM25 parameters and use RM3 query expansion:

searcher.set_bm25_similarity(0.9, 0.4)
searcher.set_rm3_reranker(10, 10, 0.5)

hits2 = searcher.search('hubble space telescope')

# Prints the first 10 hits
for i in range(0, 10):
    print('{} {} {}'.format(i+1, hits2[i].docid, hits2[i].score))

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pyserini-0.6.0.0.tar.gz (7.4 kB view hashes)

Uploaded Source

Built Distribution

pyserini-0.6.0.0-py3-none-any.whl (9.9 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page