A PyTerrier interface to the PISA search engine
PyTerrier PISA
PyTerrier bindings for the PISA search engine.
Getting Started
These bindings are only available for CPython 3.7-3.10 on manylinux2010_x86_64
platforms. They can be installed via pip:
pip install pyterrier_pisa
Indexing
You can easily index corpora from PyTerrier datasets:
import pyterrier as pt
if not pt.started():
    pt.init()
from pyterrier_pisa import PisaIndex
# from a dataset
dataset = pt.get_dataset('irds:msmarco-passage')
index = PisaIndex('./msmarco-passage-pisa')
index.index(dataset.get_corpus_iter())
Since PISA does not support multiple fields, all the text you want to index must be in a single field. By default, PisaIndex uses the "text" field, but this can be overridden with text_field.
dataset = pt.get_dataset('irds:cord19')
index = PisaIndex('./cord19-pisa', text_field='title_and_abstract')
# create a new field called title_and_abstract, from the title and abstract text
index_pipeline = pt.apply.title_and_abstract(lambda r: f'{r["title"]} {r["abstract"]}') >> index
index_pipeline.index(dataset.get_corpus_iter())
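Because everything to be indexed has to sit in one field, the merge step can also be done in plain Python before the documents reach the indexer. The sketch below is not part of pyterrier_pisa: `merge_fields` is a hypothetical helper and the sample records are made up for illustration. It assumes each document is a dict and that missing or empty fields should simply be skipped.

```python
from typing import Dict, Iterable, Iterator, Sequence


def merge_fields(docs: Iterable[Dict[str, str]],
                 fields: Sequence[str],
                 out_field: str = 'text') -> Iterator[Dict[str, str]]:
    """Yield copies of each doc with `fields` joined into `out_field`."""
    for doc in docs:
        # Skip fields that are missing or empty so no stray spaces are emitted.
        parts = [doc[f] for f in fields if doc.get(f)]
        merged = dict(doc)  # copy, so the caller's dicts are left untouched
        merged[out_field] = ' '.join(parts)
        yield merged


# Made-up records standing in for a real corpus iterator.
sample = [
    {'docno': 'd1', 'title': 'Covid symptoms', 'abstract': 'Fever and cough.'},
    {'docno': 'd2', 'title': 'Vaccines', 'abstract': ''},
]
merged_docs = list(merge_fields(sample, ['title', 'abstract'], 'title_and_abstract'))
print(merged_docs[0]['title_and_abstract'])
```

An iterator produced this way could then be passed to the index() method of a PisaIndex created with text_field='title_and_abstract'.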
PisaIndex accepts various other options to configure the indexing process. Most notable are:
- stemmer: Which stemmer to use? Options: porter2 (default), krovetz, none
- threads: How many threads to use for indexing? Default: 8
# E.g.,
index = PisaIndex('./cord19-pisa', stemmer='krovetz', threads=32)
For some collections you can download pre-built indices from data.terrier.org. PISA indices are prefixed with pisa_.
index = PisaIndex.from_dataset('trec-covid', 'pisa_unstemmed')
Retrieval
From an index, you can build retrieval transformers:
dph = index.dph()
bm25 = index.bm25(k1=1.2, b=0.4)
pl2 = index.pl2(c=1.0)
qld = index.qld(mu=1000.)
These retrievers support all the typical pipeline operations.
Search:
bm25.search('covid symptoms')
#      qid           query     docno     score
# 0      1  covid symptoms  a6avr09j  6.273450
# 1      1  covid symptoms  hdxs9dgu  6.272374
# 2      1  covid symptoms  zxq7dl9t  6.272374
# ..    ..             ...       ...       ...
# 999    1  covid symptoms  m8wggdc7  4.690651
Batch retrieval:
print(dph(dataset.get_topics('title')))
#        qid                     query     docno     score
# 0        1        coronavirus origin  8ccl9aui  9.329109
# 1        1        coronavirus origin  es7q6c90  9.260190
# 2        1        coronavirus origin  8l411r1w  8.862670
# ...     ..                       ...       ...       ...
# 49999   50  mrna vaccine coronavirus  eyitkr3s  5.610429
Experiment:
from pyterrier.measures import *
pt.Experiment(
    [dph, bm25, pl2, qld],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [nDCG@10, P@5, P(rel=2)@5, 'mrt'],
    names=['dph', 'bm25', 'pl2', 'qld']
)
#    name   nDCG@10    P@5  P(rel=2)@5       mrt
# 0   dph  0.623450  0.720       0.548  1.101846
# 1  bm25  0.624923  0.728       0.572  0.880318
# 2   pl2  0.536506  0.632       0.456  1.123883
# 3   qld  0.570032  0.676       0.504  0.974924
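To make the difference between the P@5 and P(rel=2)@5 columns concrete, here is a minimal sketch of precision at k for a single query. It is not how PyTerrier computes the measure internally; `precision_at_k` is a hypothetical helper, the ranking and judgments are made up, and it assumes the convention that `rel` is the minimum relevance grade counted as relevant.

```python
from typing import Dict, List


def precision_at_k(ranking: List[str], qrels: Dict[str, int],
                   k: int, rel: int = 1) -> float:
    """Fraction of the top-k docnos whose judged relevance is at least `rel`."""
    top_k = ranking[:k]
    # Unjudged documents are treated as non-relevant (grade 0).
    hits = sum(1 for docno in top_k if qrels.get(docno, 0) >= rel)
    return hits / k


# Made-up ranking and graded judgments for one query (not real TREC-COVID data).
ranking = ['a', 'b', 'c', 'd', 'e', 'f']
qrels = {'a': 2, 'b': 1, 'c': 0, 'e': 2, 'f': 2}

p5 = precision_at_k(ranking, qrels, k=5)              # a, b and e count
p5_rel2 = precision_at_k(ranking, qrels, k=5, rel=2)  # only a and e count
print(p5, p5_rel2)
```

Raising the threshold can only keep the score the same or lower it, which matches the pattern in the table, where every P(rel=2)@5 value is below the corresponding P@5 value.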
You can also build a retrieval transformer from PisaRetrieve:
from pyterrier_pisa import PisaRetrieve
# from index path:
bm25 = PisaRetrieve('./cord19-pisa', scorer='bm25', bm25_k1=1.2, bm25_b=0.4)
# from dataset
bm25 = PisaRetrieve.from_dataset('trec-covid', 'pisa_unstemmed', scorer='bm25', bm25_k1=1.2, bm25_b=0.4)
FAQ
What retrieval functions are supported?
- "dph". Parameters: (none)
- "bm25". Parameters: k1, b
- "pl2". Parameters: c
- "qld". Parameters: mu
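For intuition about what the k1 and b parameters of "bm25" control, the following is a sketch of the textbook BM25 term score. It is an assumption that this matches PISA's scorer in spirit only: the exact IDF variant and normalisation used by PISA may differ, and `bm25_term_score` is a hypothetical function, not part of pyterrier_pisa. The defaults mirror the k1=1.2, b=0.4 values used in the retrieval example.

```python
import math


def bm25_term_score(tf: float, doc_len: float, avg_doc_len: float,
                    n_docs: int, doc_freq: int,
                    k1: float = 1.2, b: float = 0.4) -> float:
    """Textbook BM25 contribution of one query term to one document."""
    # Smoothed inverse document frequency (always positive in this form).
    idf = math.log(1.0 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # b controls how strongly document length normalises the term frequency.
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # k1 controls how quickly repeated occurrences of the term saturate.
    return idf * (tf * (k1 + 1.0)) / (tf + k1 * length_norm)


# Same term statistics, but one document is twice the average length.
short_doc = bm25_term_score(tf=3, doc_len=100, avg_doc_len=100, n_docs=1000, doc_freq=10)
long_doc = bm25_term_score(tf=3, doc_len=200, avg_doc_len=100, n_docs=1000, doc_freq=10)
print(short_doc > long_doc)
```

With b=0 the length of the document no longer affects the score, and a larger k1 lets repeated occurrences of a term keep adding to the score for longer before it levels off.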
How do I index [some other type of data]?
PisaIndex accepts an iterator over dicts, each containing a docno field and a text field. All you need to do is coerce the data into that format and you're set.
Examples:
# any iterator
def iter_docs():
    for i in range(100):
        yield {'docno': str(i), 'text': f'document {i}'}
index = PisaIndex('./dummy-pisa')
index.index(iter_docs())
# from a dataframe
import pandas as pd
docs = pd.DataFrame([
    ('1', 'test doc'),
    ('2', 'another doc'),
], columns=['docno', 'text'])
index = PisaIndex('./dummy-pisa-2')
index.index(docs.to_dict(orient="records"))
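The same coercion works for other sources. As a further sketch, the generator below turns a CSV file-like object into docno/text dicts using only the standard library. `iter_csv_docs` and the column names paper_id and body are hypothetical, and an in-memory string stands in for a real file.

```python
import csv
import io
from typing import Dict, Iterator, TextIO


def iter_csv_docs(handle: TextIO, id_col: str, text_col: str) -> Iterator[Dict[str, str]]:
    """Yield {'docno', 'text'} dicts from a CSV file-like object with a header row."""
    for row in csv.DictReader(handle):
        # docno is coerced to str, mirroring the str(i) used in iter_docs().
        yield {'docno': str(row[id_col]), 'text': row[text_col]}


# In-memory stand-in for a real file; the column names are made up.
fake_csv = io.StringIO('paper_id,body\np1,first paper\np2,second paper\n')
csv_docs = list(iter_csv_docs(fake_csv, id_col='paper_id', text_col='body'))
print(csv_docs)
```

Since the generator yields dicts with docno and text keys, its output has the shape that the index() calls in the examples above consume.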
Can I build a doc2query index?
You can use PisaIndex with any document rewriter, such as doc2query or DeepCT. All you need to do is build an indexing pipeline. For example:
pip install --upgrade git+https://github.com/terrierteam/pyterrier_doc2query.git
wget https://git.uwaterloo.ca/jimmylin/doc2query-data/raw/master/T5-passage/t5-base.zip
unzip t5-base.zip
Then, in Python:
from pyterrier_doc2query import Doc2Query
doc2query = Doc2Query(out_attr="exp_terms", batch_size=8)
dataset = pt.get_dataset('irds:vaswani')
index = PisaIndex('./vaswani-doc2query-pisa')
index_pipeline = doc2query >> pt.apply.text(lambda r: f'{r["text"]} {r["exp_terms"]}') >> index
index_pipeline.index(dataset.get_corpus_iter())
What are the supported index encodings and query algorithms?
Right now only PisaIndexEncoding.block_simdbp and PisaQueryAlgorithm.block_max_wand are supported. Feel free to submit a PR to support other encodings/algorithms!
References
- [Mallia19]: Antonio Mallia, Michal Siedlaczek, Joel Mackenzie, Torsten Suel. PISA: Performant Indexes and Search for Academia. Proceedings of the Open-Source IR Replicability Challenge. http://ceur-ws.org/Vol-2409/docker08.pdf
- [Macdonald21]: Craig Macdonald, Nicola Tonellotto, Sean MacAvaney, Iadh Ounis. PyTerrier: Declarative Experimentation in Python from BM25 to Dense Retrieval. Proceedings of CIKM 2021. https://dl.acm.org/doi/abs/10.1145/3459637.3482013
Credits
- Sean MacAvaney, University of Glasgow