Python interface to the Anserini IR toolkit built on Lucene
Project description
Pyserini provides a simple Python interface to the Anserini IR toolkit via pyjnius.
Installation
Install via PyPI
pip install pyserini
Fetch the Anserini fatjar from Maven Central:
wget -O anserini-0.6.0-fatjar.jar "https://search.maven.org/remotecontent?filepath=io/anserini/anserini/0.6.0/anserini-0.6.0-fatjar.jar"
Set the environment variable ANSERINI_CLASSPATH to the directory where the fatjar is located:
export ANSERINI_CLASSPATH="/path/to/fatjar/directory"
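A missing or misplaced fatjar may only show up later as a Java class-loading error, so it can help to check the setting before importing Pyserini. The following is a minimal sketch using only the standard library. The helper name `find_fatjars` is hypothetical and not part of Pyserini, and the `anserini-*-fatjar.jar` pattern is an assumption based on the file name used above.

```python
import glob
import os


def find_fatjars(classpath_dir):
    """Return the sorted paths of Anserini fatjars found in classpath_dir.

    Hypothetical helper, not part of Pyserini. The file name pattern is an
    assumption based on the download command in this README.
    """
    # Escape the directory so glob characters in the path are taken literally.
    pattern = os.path.join(glob.escape(classpath_dir), 'anserini-*-fatjar.jar')
    return sorted(glob.glob(pattern))


if __name__ == '__main__':
    classpath = os.environ.get('ANSERINI_CLASSPATH')
    if not classpath:
        print('ANSERINI_CLASSPATH is not set')
    elif not find_fatjars(classpath):
        print('No fatjar found in {}'.format(classpath))
    else:
        print('Found: {}'.format(find_fatjars(classpath)[-1]))
```

Running this in the same shell where ANSERINI_CLASSPATH was exported should report the jar that was downloaded.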
Here's a sample pre-built index on TREC Disks 4 & 5 to play with (used in the TREC 2004 Robust Track):
wget https://www.dropbox.com/s/mdoly9sjdalh44x/lucene-index.robust04.pos%2Bdocvectors%2Brawdocs.tar.gz
tar xvfz lucene-index.robust04.pos+docvectors+rawdocs.tar.gz
Usage
Use the SimpleSearcher for searching:
from pyserini.search import pysearch
searcher = pysearch.SimpleSearcher('lucene-index.robust04.pos+docvectors+rawdocs')
hits = searcher.search('hubble space telescope')
# Print the first 10 hits
for i in range(0, 10):
    print('{} {} {}'.format(i+1, hits[i].docid, hits[i].score))
# Grab the actual text
hits[0].content
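The loop above assumes that at least 10 hits come back; with a shorter result list, indexing past the end would fail. A minimal sketch of a more defensive formatter follows. The helper `format_hits` is hypothetical and not part of Pyserini. It relies only on the `docid` and `score` attributes shown above, and it assumes that the returned hits support `len()` and integer indexing like a Python list.

```python
from types import SimpleNamespace


def format_hits(hits, k=10):
    """Format up to k hits as 'rank docid score' lines.

    Hypothetical helper; assumes hits supports len() and integer indexing.
    """
    lines = []
    # Stop at the end of the list if fewer than k hits were returned.
    for i in range(min(k, len(hits))):
        lines.append('{} {} {:.5f}'.format(i + 1, hits[i].docid, hits[i].score))
    return lines


# Stand-in objects with made-up values, used here instead of a real index.
fake_hits = [
    SimpleNamespace(docid='doc-a', score=2.5),
    SimpleNamespace(docid='doc-b', score=1.25),
]
for line in format_hits(fake_hits):
    print(line)
```

With real results, `format_hits(hits)` would take the place of the fixed-range loop.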
Configure BM25 parameters and use RM3 query expansion:
searcher.set_bm25_similarity(0.9, 0.4)
searcher.set_rm3_reranker(10, 10, 0.5)
hits2 = searcher.search('hubble space telescope')
# Print the first 10 hits
for i in range(0, 10):
    print('{} {} {}'.format(i+1, hits2[i].docid, hits2[i].score))
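To give a feel for what the two BM25 arguments control, here is a sketch of the textbook BM25 term weight. It assumes that the arguments to `set_bm25_similarity` are `k1` (term-frequency saturation) and `b` (length normalization), in that order; this README does not name them. The formula is the standard one, not necessarily what Lucene computes internally, and the function name and its inputs are illustrative only.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, num_docs, doc_freq, k1=0.9, b=0.4):
    """Textbook BM25 weight of one query term in one document.

    Illustrative only; k1 and b default to the values used above, on the
    assumption that they are passed in that order.
    """
    # Inverse document frequency, with 1 added inside the log to keep it positive.
    idf = math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalization: b=0 ignores document length, b=1 applies it fully.
    norm = 1 - b + b * doc_len / avg_doc_len
    # k1 controls how quickly repeated occurrences stop adding to the score.
    return idf * tf * (k1 + 1) / (tf + k1 * norm)
```

In this formula, a lower `k1` makes the score saturate sooner as the term count grows, and a lower `b` penalizes long documents less.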