A PyTerrier interface to the PISA search engine
PyTerrier PISA
PyTerrier bindings for the PISA search engine.
Getting Started
These bindings are only available for CPython 3.8-3.12 on manylinux2014_x86_64 (glibc 2.17+) platforms. They can be installed via pip:
pip install pyterrier_pisa
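Since wheels are only published for the platforms listed above, a script may want to check its environment before importing the package. The following is a minimal sketch using only the standard library. `is_supported_platform` is a hypothetical helper, not part of pyterrier_pisa, and it checks the interpreter, Python version, OS and architecture only (not the glibc version).

```python
import platform
import sys


def is_supported_platform(implementation=None, version=None, system=None, machine=None):
    """Return True if the environment matches the platforms listed above.

    Hypothetical helper for illustration; arguments default to the running interpreter.
    """
    implementation = implementation or platform.python_implementation()
    version = tuple(version or sys.version_info[:2])
    system = system or platform.system()
    machine = machine or platform.machine()
    return (
        implementation == 'CPython'
        and (3, 8) <= version <= (3, 12)
        and system == 'Linux'
        and machine == 'x86_64'
    )


if not is_supported_platform():
    print('pyterrier_pisa wheels are not published for this platform')
```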
Indexing
You can easily index corpora from PyTerrier datasets:
import pyterrier as pt
from pyterrier_pisa import PisaIndex
# from a dataset
dataset = pt.get_dataset('irds:msmarco-passage')
index = PisaIndex('./msmarco-passage-pisa')
index.index(dataset.get_corpus_iter())
You can also select which text field(s) to index. If not specified, all fields of type str will be indexed.
dataset = pt.get_dataset('irds:cord19')
index = PisaIndex('./cord19-pisa', text_field=['title', 'abstract'])
index.index(dataset.get_corpus_iter())
PisaIndex accepts various other options to configure the indexing process. Most notable are:
- stemmer: Which stemmer to use. Options: porter2 (default), krovetz, none.
- threads: How many threads to use for indexing. Default: 8.
- index_encoding: Which index encoding to use. Default: block_simdbp.
- stops: Which set of stopwords to use. Default: terrier.
# E.g.,
index = PisaIndex('./cord19-pisa', stemmer='krovetz', threads=32)
For some collections you can download pre-built indices from data.terrier.org. PISA indices are prefixed with pisa_.
index = PisaIndex.from_dataset('trec-covid')
Retrieval
From an index, you can build retrieval transformers:
dph = index.dph()
bm25 = index.bm25(k1=1.2, b=0.4)
pl2 = index.pl2(c=1.0)
qld = index.qld(mu=1000.)
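Each of the four constructors above takes its own set of parameters. As a rough sketch, the parameter names can be kept in a lookup table and checked before a retriever is built. `SCORER_PARAMS` and `check_scorer_params` are hypothetical names introduced here for illustration; they are not part of pyterrier_pisa.

```python
# Parameter names per retrieval function, taken from the calls shown above.
SCORER_PARAMS = {
    'dph': (),
    'bm25': ('k1', 'b'),
    'pl2': ('c',),
    'qld': ('mu',),
}


def check_scorer_params(scorer, **params):
    """Raise ValueError for an unknown scorer or an unexpected parameter name."""
    if scorer not in SCORER_PARAMS:
        raise ValueError(f'unknown scorer: {scorer!r}')
    unexpected = sorted(set(params) - set(SCORER_PARAMS[scorer]))
    if unexpected:
        raise ValueError(f'{scorer} does not accept: {", ".join(unexpected)}')
    return params


bm25_params = check_scorer_params('bm25', k1=1.2, b=0.4)
# With an index at hand, the checked parameters could then be forwarded:
# bm25 = index.bm25(**bm25_params)
```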
These retrievers support all the typical pipeline operations.
Search:
bm25.search('covid symptoms')
# qid query docno score
# 0 1 covid symptoms a6avr09j 6.273450
# 1 1 covid symptoms hdxs9dgu 6.272374
# 2 1 covid symptoms zxq7dl9t 6.272374
# .. .. ... ... ...
# 999 1 covid symptoms m8wggdc7 4.690651
Batch retrieval:
print(dph(dataset.get_topics('title')))
# qid query docno score
# 0 1 coronavirus origin 8ccl9aui 9.329109
# 1 1 coronavirus origin es7q6c90 9.260190
# 2 1 coronavirus origin 8l411r1w 8.862670
# ... .. ... ... ...
# 49999 50 mrna vaccine coronavirus eyitkr3s 5.610429
Experiment:
from pyterrier.measures import *
pt.Experiment(
    [dph, bm25, pl2, qld],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [nDCG@10, P@5, P(rel=2)@5, 'mrt'],
    names=['dph', 'bm25', 'pl2', 'qld']
)
# name nDCG@10 P@5 P(rel=2)@5 mrt
# 0 dph 0.623450 0.720 0.548 1.101846
# 1 bm25 0.624923 0.728 0.572 0.880318
# 2 pl2 0.536506 0.632 0.456 1.123883
# 3 qld 0.570032 0.676 0.504 0.974924
You can also build a retrieval transformer from PisaRetrieve:
from pyterrier_pisa import PisaRetrieve
# from index path:
bm25 = PisaRetrieve('./cord19-pisa', scorer='bm25', bm25_k1=1.2, bm25_b=0.4)
# from dataset
bm25 = PisaRetrieve.from_dataset('trec-covid', 'pisa_unstemmed', scorer='bm25', bm25_k1=1.2, bm25_b=0.4)
Extras
You can access PISA's tokenizer and stemmers using the tokenize function:
import pyterrier_pisa
pyterrier_pisa.tokenize('hello worlds!')
# ['hello', 'worlds']
pyterrier_pisa.tokenize('hello worlds!', stemmer='porter2')
# ['hello', 'world']
FAQ
What retrieval functions are supported?
- "dph". Parameters: (none)
- "bm25". Parameters: k1, b
- "pl2". Parameters: c
- "qld". Parameters: mu
How do I index [some other type of data]?
PisaIndex accepts an iterator over dicts, each containing a docno field and a text field. All you need to do is coerce your data into that format.
Examples:
# any iterator
def iter_docs():
    for i in range(100):
        yield {'docno': str(i), 'text': f'document {i}'}
index = PisaIndex('./dummy-pisa')
index.index(iter_docs())
# from a dataframe
import pandas as pd
docs = pd.DataFrame([
    ('1', 'test doc'),
    ('2', 'another doc'),
], columns=['docno', 'text'])
index = PisaIndex('./dummy-pisa-2')
index.index(docs.to_dict(orient="records"))
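The same coercion applies to plain files. Below is a minimal sketch, assuming a tab-separated layout with the document id in the first column and the text in the rest of the line. `iter_tsv_docs` and the `docs.tsv` file name are hypothetical and only serve as an illustration.

```python
def iter_tsv_docs(lines):
    """Yield {'docno', 'text'} dicts from tab-separated `docno<TAB>text` lines."""
    for line in lines:
        line = line.rstrip('\n')
        if not line:
            continue  # skip blank lines
        # Split on the first tab only, so tabs inside the text are preserved.
        docno, text = line.split('\t', 1)
        yield {'docno': docno, 'text': text}


sample = ['d1\tfirst document\n', '\n', 'd2\tsecond\tdocument\n']
docs = list(iter_tsv_docs(sample))

# With pyterrier_pisa installed, the iterator could be passed straight to the indexer:
# with open('docs.tsv') as f:
#     PisaIndex('./tsv-pisa').index(iter_tsv_docs(f))
```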
Can I build a doc2query index?
You can use PisaIndex with any document rewriter, such as doc2query or DeepCT.
All you need to do is build an indexing pipeline. For example:
pip install --upgrade git+https://github.com/terrierteam/pyterrier_doc2query.git
wget https://git.uwaterloo.ca/jimmylin/doc2query-data/raw/master/T5-passage/t5-base.zip
unzip t5-base.zip
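import pyterrier as pt
from pyterrier_pisa import PisaIndex
from pyterrier_doc2query import Doc2Query
doc2query = Doc2Query(out_attr="exp_terms", batch_size=8)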
dataset = pt.get_dataset('irds:vaswani')
index = PisaIndex('./vaswani-doc2query-pisa')
index_pipeline = doc2query >> pt.apply.text(lambda r: f'{r["text"]} {r["exp_terms"]}') >> index
index_pipeline.index(dataset.get_corpus_iter())
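The middle stage of the pipeline above simply appends the generated expansion terms to each document's original text. As a sketch of that step on its own, here is a named function that could replace the lambda. `merge_expansion` is a hypothetical name, and unlike the lambda it also tolerates rows without expansion terms.

```python
def merge_expansion(row):
    """Append doc2query expansion terms (if any) to the original text."""
    exp_terms = row.get('exp_terms') or ''
    return f'{row["text"]} {exp_terms}'.strip()


merged = merge_expansion({'docno': '1', 'text': 'compact memories', 'exp_terms': 'what is compact memory'})

# In the indexing pipeline this would take the place of the lambda:
# index_pipeline = doc2query >> pt.apply.text(merge_expansion) >> index
```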
Can I build a learned sparse retrieval (e.g., SPLADE) index?
Yes! Example:
import pyt_splade
splade = pyt_splade.Splade()
dataset = pt.get_dataset('irds:msmarco-passage')
index = PisaIndex('./msmarco-passage-splade', stemmer='none')
# indexing
idx_pipeline = splade.doc_encoder() >> index.toks_indexer()
idx_pipeline.index(dataset.get_corpus_iter())
# retrieval
retr_pipeline = splade.query_encoder() >> index.quantized()
msmarco-passage/trec-dl-2019 effectiveness for naver/splade-cocondenser-ensembledistil:
| System | nDCG@10 | R(rel=2)@1000 |
|---|---|---|
| PISA | 0.731 | 0.872 |
| From Paper | 0.732 | 0.875 |
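Learned sparse models assign a real-valued weight to each term, whereas an inverted index typically stores integer impacts. The sketch below illustrates that quantization step in isolation. The `toks` field name, the scale factor of 100 and the `quantize_weights` helper are assumptions made for illustration; they are not taken from pyterrier_pisa or pyt_splade.

```python
def quantize_weights(weights, scale=100):
    """Map float term weights to positive integer impacts.

    Terms whose scaled weight rounds to zero are dropped.
    """
    quantized = {}
    for term, weight in weights.items():
        impact = int(round(weight * scale))
        if impact > 0:
            quantized[term] = impact
    return quantized


# A document in a pre-tokenised form: term -> integer impact (assumed layout).
doc = {
    'docno': 'd1',
    'toks': quantize_weights({'covid': 1.234, 'virus': 0.5, 'the': 0.001}),
}
```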
What are the supported index encodings and query algorithms?
Right now we support the following index encodings: ef, single, pefuniform, pefopt, block_optpfor, block_varintg8iu, block_streamvbyte, block_maskedvbyte, block_interpolative, block_qmx, block_varintgb, block_simple8b, block_simple16, block_simdbp.
Index encodings are supplied when a PisaIndex is constructed:
index = PisaIndex('./cord19-pisa', index_encoding='ef')
We support the following query algorithms: wand, block_max_wand, block_max_maxscore, block_max_ranked_and, ranked_and, ranked_or, maxscore.
Query algorithms are supplied when you construct a retrieval transformer:
index.bm25(query_algorithm='ranked_and')
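Because both options are passed as plain strings, a typo only surfaces once the index or retriever is constructed. A small sketch of validating the names up front follows; the sets are copied from the lists above, and `check_option` is a hypothetical helper rather than part of the library.

```python
INDEX_ENCODINGS = frozenset({
    'ef', 'single', 'pefuniform', 'pefopt', 'block_optpfor', 'block_varintg8iu',
    'block_streamvbyte', 'block_maskedvbyte', 'block_interpolative', 'block_qmx',
    'block_varintgb', 'block_simple8b', 'block_simple16', 'block_simdbp',
})

QUERY_ALGORITHMS = frozenset({
    'wand', 'block_max_wand', 'block_max_maxscore', 'block_max_ranked_and',
    'ranked_and', 'ranked_or', 'maxscore',
})


def check_option(kind, value, supported):
    """Return `value` if it is supported, otherwise raise a descriptive ValueError."""
    if value not in supported:
        options = ', '.join(sorted(supported))
        raise ValueError(f'unsupported {kind} {value!r}; expected one of: {options}')
    return value


encoding = check_option('index encoding', 'ef', INDEX_ENCODINGS)
algorithm = check_option('query algorithm', 'ranked_and', QUERY_ALGORITHMS)

# The validated names could then be passed on:
# index = PisaIndex('./cord19-pisa', index_encoding=encoding)
# bm25 = index.bm25(query_algorithm=algorithm)
```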
Can I import/export from CIFF?
Yes! Use PisaIndex.from_ciff(ciff_file, index_path) to import and index.to_ciff(ciff_file) to export:
# from a CIFF export:
index = PisaIndex.from_ciff('path/to/something.ciff', 'path/to/index.pisa', stemmer='krovetz') # stemmer is optional
# to a CIFF export:
index.to_ciff('path/to/something.ciff')
Note that you need to be careful to set stemmer to match whatever was used when constructing the index; CIFF does not directly store which stemmer
was used when building the index. If it's a stemmer that's not supported by PISA, you can set stemmer='none' and apply stemming in a PyTerrier pipeline.
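Applying stemming in the pipeline means rewriting each query before it reaches the retriever. The following is a rough sketch of such a rewrite step. `stem_query` and `toy_stem` are hypothetical, and `toy_stem` is only a stand-in that strips a trailing "s"; in practice it would be replaced by the stemmer that was used to build the CIFF export.

```python
def stem_query(row, stem):
    """Return the row's query with `stem` applied to each whitespace-separated term."""
    return ' '.join(stem(term) for term in row['query'].split())


def toy_stem(term):
    # Stand-in for a real stemmer, for illustration only.
    return term[:-1] if term.endswith('s') and len(term) > 3 else term


stemmed = stem_query({'qid': '1', 'query': 'covid symptoms'}, toy_stem)

# In a PyTerrier pipeline (requires pyterrier and an index built with stemmer='none'):
# retriever = pt.apply.query(lambda row: stem_query(row, toy_stem)) >> index.bm25()
```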
References
- [Mallia19]: Antonio Mallia, Michal Siedlaczek, Joel Mackenzie, Torsten Suel. PISA: Performant Indexes and Search for Academia. Proceedings of the Open-Source IR Replicability Challenge. http://ceur-ws.org/Vol-2409/docker08.pdf
- [MacAvaney22]: Sean MacAvaney, Craig Macdonald. A Python Interface to PISA!. Proceedings of SIGIR 2022. https://dl.acm.org/doi/abs/10.1145/3477495.3531656
- [Macdonald21]: Craig Macdonald, Nicola Tonellotto, Sean MacAvaney, Iadh Ounis. PyTerrier: Declarative Experimentation in Python from BM25 to Dense Retrieval. Proceedings of CIKM 2021. https://dl.acm.org/doi/abs/10.1145/3459637.3482013
Credits
- Sean MacAvaney, University of Glasgow
- Craig Macdonald, University of Glasgow