A PyTerrier interface to the PISA search engine
PyTerrier PISA
PyTerrier bindings for the PISA search engine.
Getting Started
These bindings are only available for CPython 3.7-3.11 on manylinux2014 x86_64 (Linux)
platforms. They can be installed via pip:
pip install pyterrier_pisa
Indexing
You can easily index corpora from PyTerrier datasets:
import pyterrier as pt
if not pt.started():
    pt.init()
from pyterrier_pisa import PisaIndex
# from a dataset
dataset = pt.get_dataset('irds:msmarco-passage')
index = PisaIndex('./msmarco-passage-pisa')
index.index(dataset.get_corpus_iter())
You can also select which text field(s) to index. If not specified, all fields of type str
will be indexed.
dataset = pt.get_dataset('irds:cord19')
index = PisaIndex('./cord19-pisa', text_field=['title', 'abstract'])
index.index(dataset.get_corpus_iter())
PisaIndex accepts various other options to configure the indexing process. Most notable are:
- stemmer: Which stemmer to use. Options: porter2 (default), krovetz, none
- threads: How many threads to use for indexing. Default: 8
- index_encoding: Which index encoding to use. Default: block_simdbp
- stops: Which set of stopwords to use. Default: terrier
# E.g.,
index = PisaIndex('./cord19-pisa', stemmer='krovetz', threads=32)
For some collections you can download pre-built indices from data.terrier.org. PISA indices are prefixed with pisa_.
index = PisaIndex.from_dataset('trec-covid')
Retrieval
From an index, you can build retrieval transformers:
dph = index.dph()
bm25 = index.bm25(k1=1.2, b=0.4)
pl2 = index.pl2(c=1.0)
qld = index.qld(mu=1000.)
These retrievers support all the typical pipeline operations.
Search:
bm25.search('covid symptoms')
# qid query docno score
# 0 1 covid symptoms a6avr09j 6.273450
# 1 1 covid symptoms hdxs9dgu 6.272374
# 2 1 covid symptoms zxq7dl9t 6.272374
# .. .. ... ... ...
# 999 1 covid symptoms m8wggdc7 4.690651
Batch retrieval:
print(dph(dataset.get_topics('title')))
# qid query docno score
# 0 1 coronavirus origin 8ccl9aui 9.329109
# 1 1 coronavirus origin es7q6c90 9.260190
# 2 1 coronavirus origin 8l411r1w 8.862670
# ... .. ... ... ...
# 49999 50 mrna vaccine coronavirus eyitkr3s 5.610429
Experiment:
from pyterrier.measures import *
pt.Experiment(
    [dph, bm25, pl2, qld],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [nDCG@10, P@5, P(rel=2)@5, 'mrt'],
    names=['dph', 'bm25', 'pl2', 'qld']
)
# name nDCG@10 P@5 P(rel=2)@5 mrt
# 0 dph 0.623450 0.720 0.548 1.101846
# 1 bm25 0.624923 0.728 0.572 0.880318
# 2 pl2 0.536506 0.632 0.456 1.123883
# 3 qld 0.570032 0.676 0.504 0.974924
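To make the difference between P@5 and P(rel=2)@5 in the table above concrete, here is a minimal sketch of precision at a cutoff with a minimum relevance grade. The function name precision_at_k and the toy ranking and qrels are illustrative assumptions; the experiment itself computes these measures through pyterrier.measures.

```python
def precision_at_k(ranked_docnos, qrels, k=5, rel=1):
    """Fraction of the top-k documents whose relevance grade is at least `rel`."""
    top = ranked_docnos[:k]
    hits = sum(1 for docno in top if qrels.get(docno, 0) >= rel)
    return hits / k

# Toy data (hypothetical): a ranking and graded relevance judgments
ranking = ["a", "b", "c", "d", "e", "f"]
qrels = {"a": 2, "b": 1, "d": 2, "f": 2}

p5 = precision_at_k(ranking, qrels, k=5)              # grades >= 1 count as relevant
p5_rel2 = precision_at_k(ranking, qrels, k=5, rel=2)  # only grades >= 2 count
print(p5, p5_rel2)  # 0.6 0.4
```

Raising the relevance threshold can only keep the score the same or lower it, which is why the P(rel=2)@5 column never exceeds P@5.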
You can also build a retrieval transformer from PisaRetrieve:
from pyterrier_pisa import PisaRetrieve
# from index path:
bm25 = PisaRetrieve('./cord19-pisa', scorer='bm25', bm25_k1=1.2, bm25_b=0.4)
# from dataset
bm25 = PisaRetrieve.from_dataset('trec-covid', 'pisa_unstemmed', scorer='bm25', bm25_k1=1.2, bm25_b=0.4)
FAQ
What retrieval functions are supported?
- "dph". Parameters: (none)
- "bm25". Parameters: k1, b
- "pl2". Parameters: c
- "qld". Parameters: mu
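As an illustration of what the bm25 parameters control, here is a minimal sketch of a textbook BM25 term score in plain Python. The function name bm25_term_score and the particular IDF variant are assumptions made for this example; PISA's own implementation may differ in its details.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.4):
    """Textbook BM25 contribution of one query term to one document."""
    # Smoothed inverse document frequency (one common variant)
    idf = math.log(1.0 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # b controls how strongly the score is normalised by document length
    norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # k1 controls how quickly repeated occurrences of the term saturate
    return idf * (tf * (k1 + 1.0)) / (tf + k1 * norm)

# Same term frequency, different document lengths
short_doc = bm25_term_score(tf=3, doc_len=50, avg_doc_len=100, n_docs=1000, doc_freq=10)
long_doc = bm25_term_score(tf=3, doc_len=200, avg_doc_len=100, n_docs=1000, doc_freq=10)
print(short_doc > long_doc)  # True
```

With b=0 the document length has no effect at all, and larger k1 values let additional term occurrences keep adding to the score for longer.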
How do I index [some other type of data]?
PisaIndex accepts an iterator over dicts, each of which contains a docno field and a text
field. All you need to do is coerce the data into that format and you're set.
Examples:
# any iterator
def iter_docs():
    for i in range(100):
        yield {'docno': str(i), 'text': f'document {i}'}
index = PisaIndex('./dummy-pisa')
index.index(iter_docs())
# from a dataframe
import pandas as pd
docs = pd.DataFrame([
    ('1', 'test doc'),
    ('2', 'another doc'),
], columns=['docno', 'text'])
index = PisaIndex('./dummy-pisa-2')
index.index(docs.to_dict(orient="records"))
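The same coercion works for other sources. As a further sketch, the generator below turns JSON-lines records into docno/text dicts using only the standard library. The helper name iter_jsonl_docs and the record keys (id, title, body) are hypothetical and would need to match your own data.

```python
import json

def iter_jsonl_docs(lines, id_key="id", text_keys=("title", "body")):
    """Yield {'docno', 'text'} dicts from JSON-lines records."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Join the non-empty text fields into a single string
        text = " ".join(str(record[k]) for k in text_keys if record.get(k))
        # docno is converted to a string, as in the examples above
        yield {"docno": str(record[id_key]), "text": text}

sample = [
    '{"id": 1, "title": "PISA", "body": "a fast search engine"}',
    '',
    '{"id": 2, "title": "", "body": "body only"}',
]
docs = list(iter_jsonl_docs(sample))
print(docs[0])  # {'docno': '1', 'text': 'PISA a fast search engine'}
```

Because open file objects iterate over lines, a generator like this could be fed a file handle and its output passed to index.index(...) in the same way as iter_docs() above.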
Can I build a doc2query index?
You can use PisaIndex with any document rewriter, such as doc2query or DeepCT.
All you need to do is build an indexing pipeline. For example:
pip install --upgrade git+https://github.com/terrierteam/pyterrier_doc2query.git
wget https://git.uwaterloo.ca/jimmylin/doc2query-data/raw/master/T5-passage/t5-base.zip
unzip t5-base.zip
from pyterrier_doc2query import Doc2Query
doc2query = Doc2Query(out_attr="exp_terms", batch_size=8)
dataset = pt.get_dataset('irds:vaswani')
index = PisaIndex('./vaswani-doc2query-pisa')
index_pipeline = doc2query >> pt.apply.text(lambda r: f'{r["text"]} {r["exp_terms"]}') >> index
index_pipeline.index(dataset.get_corpus_iter())
Can I build a learned sparse retrieval (e.g., SPLADE) index?
Yes! Example:
import pyt_splade
splade = pyt_splade.Splade()
dataset = pt.get_dataset('irds:msmarco-passage')
index = PisaIndex('./msmarco-passage-splade', stemmer='none')
# indexing
idx_pipeline = splade.doc_encoder() >> index.toks_indexer()
idx_pipeline.index(dataset.get_corpus_iter())
# retrieval
retr_pipeline = splade.query_encoder() >> index.quantized()
msmarco-passage/trec-dl-2019 effectiveness for naver/splade-cocondenser-ensembledistil:
System | nDCG@10 | R(rel=2)@1000 |
---|---|---|
PISA | 0.731 | 0.872 |
From Paper | 0.732 | 0.875 |
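For intuition about what such a pipeline stores and scores, the sketch below represents a document and a query as weighted terms, quantizes the weights to integers, and scores by summing the products of matching terms. This is an assumption-laden illustration: the quantize and impact_score helpers, the scale factor, and the term weights are all hypothetical, and the actual data layout used by toks_indexer() and quantized() may differ.

```python
def quantize(weights, scale=100):
    """Map float term weights to positive integer impacts (hypothetical scheme)."""
    return {term: max(1, round(w * scale)) for term, w in weights.items() if w > 0}

def impact_score(query_toks, doc_toks):
    """Sum of query weight times document weight over the terms they share."""
    return sum(qw * doc_toks[term] for term, qw in query_toks.items() if term in doc_toks)

# Made-up weights standing in for the output of a learned sparse encoder
doc = quantize({"covid": 1.52, "symptom": 0.98, "fever": 0.31})
query = quantize({"covid": 1.10, "symptom": 0.75, "cough": 0.40})

score = impact_score(query, doc)
print(score)  # 24070
```

Only terms present in both the query and the document contribute, which is why an inverted index can serve this kind of model.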
What are the supported index encodings and query algorithms?
Right now we support the following index encodings: ef, single, pefuniform, pefopt, block_optpfor, block_varintg8iu, block_streamvbyte, block_maskedvbyte, block_interpolative, block_qmx, block_varintgb, block_simple8b, block_simple16, block_simdbp.
Index encodings are supplied when a PisaIndex is constructed:
index = PisaIndex('./cord19-pisa', index_encoding='ef')
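If the encoding name comes from a configuration file or command-line argument, it can help to validate it before starting a long indexing run. The helper below is a hypothetical convenience, not part of pyterrier_pisa; the set of names is copied from the list above.

```python
SUPPORTED_INDEX_ENCODINGS = frozenset([
    "ef", "single", "pefuniform", "pefopt", "block_optpfor",
    "block_varintg8iu", "block_streamvbyte", "block_maskedvbyte",
    "block_interpolative", "block_qmx", "block_varintgb",
    "block_simple8b", "block_simple16", "block_simdbp",
])

def check_index_encoding(name):
    """Return `name` unchanged if it is a listed encoding, else raise ValueError."""
    if name not in SUPPORTED_INDEX_ENCODINGS:
        options = ", ".join(sorted(SUPPORTED_INDEX_ENCODINGS))
        raise ValueError(f"unknown index encoding {name!r}; expected one of: {options}")
    return name

encoding = check_index_encoding("ef")
```

The validated value could then be passed as index_encoding=encoding when constructing the PisaIndex.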
We support the following query algorithms: wand, block_max_wand, block_max_maxscore, block_max_ranked_and, ranked_and, ranked_or, maxscore.
Query algorithms are supplied when you construct a retrieval transformer:
index.bm25(query_algorithm='ranked_and')
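The main behavioural difference between these algorithms is conjunctive versus disjunctive matching. The sketch below contrasts the two on a toy in-memory index; it is a conceptual illustration using the names ranked_or and ranked_and for plain Python functions, not PISA's implementation, and the postings and scores are made up.

```python
# Toy postings: term -> {docno: precomputed term score}
postings = {
    "covid": {"d1": 2.0, "d2": 1.5, "d3": 0.5},
    "symptoms": {"d2": 1.0, "d4": 3.0},
}

def _top_k(scores, k):
    # Highest score first; ties broken by docno for determinism
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:k]

def ranked_or(terms, postings, k=10):
    """Disjunctive: a document matching ANY query term is scored."""
    scores = {}
    for term in terms:
        for docno, s in postings.get(term, {}).items():
            scores[docno] = scores.get(docno, 0.0) + s
    return _top_k(scores, k)

def ranked_and(terms, postings, k=10):
    """Conjunctive: only documents matching ALL query terms are scored."""
    lists = [postings.get(term, {}) for term in terms]
    if not lists:
        return []
    common = set(lists[0]).intersection(*lists[1:])
    scores = {docno: sum(pl[docno] for pl in lists) for docno in common}
    return _top_k(scores, k)

or_results = ranked_or(["covid", "symptoms"], postings)
and_results = ranked_and(["covid", "symptoms"], postings)
print(or_results)   # [('d4', 3.0), ('d2', 2.5), ('d1', 2.0), ('d3', 0.5)]
print(and_results)  # [('d2', 2.5)]
```

Dynamic pruning algorithms such as WAND and MaxScore are generally designed to return the same top-k results as exhaustive disjunctive scoring while skipping documents that cannot reach the top k, so the choice among them mostly affects speed rather than results.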
Can I import/export from CIFF?
Yes! Use .from_ciff(ciff_file, index_path) and .to_ciff(ciff_file):
# from a CIFF export:
index = PisaIndex.from_ciff('path/to/something.ciff', 'path/to/index.pisa', stemmer='krovetz') # stemmer is optional
# to a CIFF export:
index.to_ciff('path/to/something.ciff')
Note that you need to be careful to set stemmer to match whatever was used when constructing the index; CIFF does not directly store which stemmer
was used when building the index. If it's a stemmer that's not supported by PISA, you can set stemmer='none'
and apply stemming in a PyTerrier pipeline.
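The sketch below shows why the stemmer has to match. It uses a deliberately crude toy stemmer (toy_stem is hypothetical and is not one of PISA's stemmers) to build an index vocabulary, then looks up query terms with and without the same processing.

```python
def toy_stem(term):
    """Crude suffix stripper standing in for a real stemmer (illustration only)."""
    for suffix in ("ing", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

# Terms as they would be stored in an index built with the toy stemmer
index_vocab = {toy_stem(t) for t in ["vaccines", "testing", "virus"]}

query_terms = ["vaccines", "testing"]

# Querying without stemming: the surface forms are not in the vocabulary
matched_unstemmed = [t for t in query_terms if t in index_vocab]
# Querying with the same stemmer: every term is found
matched_stemmed = [toy_stem(t) for t in query_terms if toy_stem(t) in index_vocab]

print(matched_unstemmed)  # []
print(matched_stemmed)    # ['vaccine', 'test']
```

A mismatch does not raise an error; queries simply match fewer terms, so effectiveness drops silently.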
References
- [Mallia19]: Antonio Mallia, Michal Siedlaczek, Joel Mackenzie, Torsten Suel. PISA: Performant Indexes and Search for Academia. Proceedings of the Open-Source IR Replicability Challenge. http://ceur-ws.org/Vol-2409/docker08.pdf
- [MacAvaney22]: Sean MacAvaney, Craig Macdonald. A Python Interface to PISA!. Proceedings of SIGIR 2022.
- [Macdonald21]: Craig Macdonald, Nicola Tonellotto, Sean MacAvaney, Iadh Ounis. PyTerrier: Declarative Experimentation in Python from BM25 to Dense Retrieval. Proceedings of CIKM 2021. https://dl.acm.org/doi/abs/10.1145/3459637.3482013
Credits
- Sean MacAvaney, University of Glasgow
- Craig Macdonald, University of Glasgow