A PyTerrier interface to the PISA search engine

PyTerrier PISA

PyTerrier bindings for the PISA search engine.

Getting Started

These bindings are only available for CPython 3.7-3.10 on manylinux2010_x86_64 platforms. They can be installed with pip:

pip install pyterrier_pisa

Indexing

You can easily index corpora from PyTerrier datasets:

import pyterrier as pt
if not pt.started():
  pt.init()
from pyterrier_pisa import PisaIndex

# from a dataset
dataset = pt.get_dataset('irds:msmarco-passage')
index = PisaIndex('./msmarco-passage-pisa')
index.index(dataset.get_corpus_iter())

PISA does not support multiple fields, so all the text you want to index must be in a single field. By default, PisaIndex reads the "text" field; this can be overridden with the text_field argument.

dataset = pt.get_dataset('irds:cord19')
index = PisaIndex('./cord19-pisa', text_field='title_and_abstract')
# create a new field called title_and_abstract, from the title and abstract text
index_pipeline = pt.apply.title_and_abstract(lambda r: f'{r["title"]} {r["abstract"]}') >> index
index_pipeline.index(dataset.get_corpus_iter())
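
When the corpus is already a plain Python iterator, the same single-field merge can be done before the documents reach the index. The sketch below is an illustration only: merge_fields is a hypothetical helper (not part of pyterrier_pisa), and the sample records are made up. It joins the chosen fields into one new field that could then be named in text_field.

```python
from typing import Dict, Iterable, Iterator, Sequence


def merge_fields(docs: Iterable[Dict[str, str]],
                 fields: Sequence[str],
                 out_field: str) -> Iterator[Dict[str, str]]:
    """Yield copies of each doc with `fields` joined into `out_field`.

    Hypothetical helper for illustration; missing or empty fields are skipped.
    """
    for doc in docs:
        parts = [doc[f] for f in fields if doc.get(f)]
        yield {**doc, out_field: ' '.join(parts)}


# Made-up records standing in for a real corpus iterator.
sample = [
    {'docno': 'd1', 'title': 'Covid symptoms', 'abstract': 'Fever and cough.'},
    {'docno': 'd2', 'title': 'Untitled', 'abstract': ''},
]
merged = list(merge_fields(sample, ['title', 'abstract'], 'title_and_abstract'))
print(merged[0]['title_and_abstract'])
```

The resulting iterator could be passed to index.index(...) on a PisaIndex created with text_field='title_and_abstract'.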

PisaIndex accepts various other options to configure the indexing process. Most notable are:

stemmer: Which stemmer to use? Options: porter2 (default), krovetz, none
threads: How many threads to use for indexing? Default: 8

# E.g.,
index = PisaIndex('./cord19-pisa', stemmer='krovetz', threads=32)

For some collections you can download pre-built indices from data.terrier.org. PISA indices are prefixed with pisa_.

index = PisaIndex.from_dataset('trec-covid', 'pisa_unstemmed')

Retrieval

From an index, you can build retrieval transformers:

dph = index.dph()
bm25 = index.bm25(k1=1.2, b=0.4)
pl2 = index.pl2(c=1.0)
qld = index.qld(mu=1000.)
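
To give a feel for what k1, b and mu control, here is a minimal sketch of the textbook BM25 and Dirichlet-smoothed query likelihood term scores. It is not PISA's code: the function names and the statistics are made up, and PISA's implementation may differ in details such as the IDF variant. DPH and PL2 are omitted.

```python
import math


def bm25_term(tf: float, df: int, n_docs: int, doc_len: float,
              avg_doc_len: float, k1: float = 1.2, b: float = 0.4) -> float:
    """Textbook BM25 contribution of one query term to one document."""
    idf = math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
    # b controls how strongly document length is normalised (0 = not at all).
    norm = 1.0 - b + b * doc_len / avg_doc_len
    # k1 controls how quickly repeated occurrences of the term saturate.
    return idf * tf * (k1 + 1.0) / (tf + k1 * norm)


def qld_term(tf: float, cf: int, coll_len: int, doc_len: float,
             mu: float = 1000.0) -> float:
    """Log query likelihood of one term with Dirichlet smoothing."""
    p_coll = cf / coll_len  # probability of the term in the whole collection
    # Larger mu pulls the estimate towards the collection probability.
    return math.log((tf + mu * p_coll) / (doc_len + mu))


# Made-up statistics: a term occurring 3 times, found in 100 of 10,000 documents.
short_doc = bm25_term(tf=3, df=100, n_docs=10_000, doc_len=50, avg_doc_len=100)
long_doc = bm25_term(tf=3, df=100, n_docs=10_000, doc_len=400, avg_doc_len=100)
print(short_doc > long_doc)  # the same tf counts for less in a longer document
print(qld_term(tf=3, cf=500, coll_len=1_000_000, doc_len=100))
```

With b=0 the document length has no effect on the BM25 score, and with mu=0 the query likelihood score falls back to the unsmoothed tf / doc_len estimate.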

These retrievers support all the typical pipeline operations.

Search:

bm25.search('covid symptoms')
#     qid           query     docno     score
# 0     1  covid symptoms  a6avr09j  6.273450
# 1     1  covid symptoms  hdxs9dgu  6.272374
# 2     1  covid symptoms  zxq7dl9t  6.272374
# ..   ..             ...       ...       ...
# 999   1  covid symptoms  m8wggdc7  4.690651

Batch retrieval:

print(dph(dataset.get_topics('title')))
#       qid                     query     docno     score
# 0       1        coronavirus origin  8ccl9aui  9.329109
# 1       1        coronavirus origin  es7q6c90  9.260190
# 2       1        coronavirus origin  8l411r1w  8.862670
# ...    ..                       ...       ...       ...
# 49999  50  mrna vaccine coronavirus  eyitkr3s  5.610429

Experiment:

from pyterrier.measures import *
pt.Experiment(
  [dph, bm25, pl2, qld],
  dataset.get_topics('title'),
  dataset.get_qrels(),
  [nDCG@10, P@5, P(rel=2)@5, 'mrt'],
  names=['dph', 'bm25', 'pl2', 'qld']
)
#    name   nDCG@10    P@5  P(rel=2)@5       mrt
# 0   dph  0.623450  0.720       0.548  1.101846
# 1  bm25  0.624923  0.728       0.572  0.880318
# 2   pl2  0.536506  0.632       0.456  1.123883
# 3   qld  0.570032  0.676       0.504  0.974924
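
As a reading aid for the measure names, the sketch below shows what P@5 and P(rel=2)@5 compute for a single query: the fraction of the top 5 results judged at or above a relevance threshold. It is a plain-Python illustration with made-up judgments, not the code pt.Experiment runs, and it assumes unjudged documents count as non-relevant.

```python
from typing import Dict, List


def precision_at_k(ranking: List[str], qrels: Dict[str, int],
                   k: int = 5, rel: int = 1) -> float:
    """Fraction of the top-k docnos whose judged label is at least `rel`."""
    top = ranking[:k]
    hits = sum(1 for docno in top if qrels.get(docno, 0) >= rel)
    return hits / k


# Made-up ranking and graded judgments for one query.
ranking = ['a', 'b', 'c', 'd', 'e', 'f']
qrels = {'a': 2, 'b': 1, 'c': 0, 'e': 2, 'f': 2}

p5 = precision_at_k(ranking, qrels, k=5)              # a, b and e count
p5_rel2 = precision_at_k(ranking, qrels, k=5, rel=2)  # only a and e count
print(p5, p5_rel2)
```

pt.Experiment reports each measure averaged over all queries in the topic set.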

You can also build a retrieval transformer from PisaRetrieve:

from pyterrier_pisa import PisaRetrieve
# from index path:
bm25 = PisaRetrieve('./cord19-pisa', scorer='bm25', bm25_k1=1.2, bm25_b=0.4)
# from dataset
bm25 = PisaRetrieve.from_dataset('trec-covid', 'pisa_unstemmed', scorer='bm25', bm25_k1=1.2, bm25_b=0.4)

FAQ

What retrieval functions are supported?

"dph". Parameters: (none)
"bm25". Parameters: k1, b
"pl2". Parameters: c
"qld". Parameters: mu

How do I index [some other type of data]?

PisaIndex accepts an iterator over dicts, each of which contains a docno field and a text field. All you need to do is coerce your data into that format.

Examples:

# any iterator
def iter_docs():
  for i in range(100):
    yield {'docno': str(i), 'text': f'document {i}'}
index = PisaIndex('./dummy-pisa')
index.index(iter_docs())

# from a dataframe
import pandas as pd
docs = pd.DataFrame([
  ('1', 'test doc'),
  ('2', 'another doc'),
], columns=['docno', 'text'])
index = PisaIndex('./dummy-pisa-2')
index.index(docs.to_dict(orient="records"))
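
The same coercion works for data on disk. As a minimal sketch, assuming a JSON-lines input whose records carry "id" and "body" keys (both names are assumptions about your data, and iter_jsonl_docs is a hypothetical helper), the generator below yields dicts in the docno/text format described above:

```python
import io
import json
from typing import Dict, Iterator, TextIO


def iter_jsonl_docs(lines: TextIO, id_key: str = 'id',
                    text_key: str = 'body') -> Iterator[Dict[str, str]]:
    """Yield {'docno', 'text'} dicts from a JSON-lines stream.

    The key names are assumptions about the input; adjust them to your data.
    """
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        if id_key not in record or text_key not in record:
            raise ValueError(f'line {line_no}: missing {id_key!r} or {text_key!r}')
        # docno is coerced to a string, as in the examples above.
        yield {'docno': str(record[id_key]), 'text': str(record[text_key])}


# Made-up input standing in for an open file.
source = io.StringIO('{"id": 1, "body": "test doc"}\n\n{"id": 2, "body": "another doc"}\n')
docs = list(iter_jsonl_docs(source))
print(docs)
```

With a real file, the generator could be passed directly to index.index(...), for example index.index(iter_jsonl_docs(open(path))), so the whole collection never has to be held in memory.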

Can I build a doc2query index?

You can use PisaIndex with any document rewriter, such as doc2query or DeepCT. All you need to do is build an indexing pipeline. For example:

pip install --upgrade git+https://github.com/terrierteam/pyterrier_doc2query.git
wget https://git.uwaterloo.ca/jimmylin/doc2query-data/raw/master/T5-passage/t5-base.zip
unzip t5-base.zip

from pyterrier_doc2query import Doc2Query
doc2query = Doc2Query(out_attr="exp_terms", batch_size=8)
dataset = pt.get_dataset('irds:vaswani')
index = PisaIndex('./vaswani-doc2query-pisa')
index_pipeline = doc2query >> pt.apply.text(lambda r: f'{r["text"]} {r["exp_terms"]}') >> index
index_pipeline.index(dataset.get_corpus_iter())

What are the supported index encodings and query algorithms?

Right now only PisaIndexEncoding.block_simdbp and PisaQueryAlgorithm.block_max_wand are supported. Feel free to submit a PR to support other encodings/algorithms!

References

[Mallia19]: Antonio Mallia, Michal Siedlaczek, Joel Mackenzie, Torsten Suel. PISA: Performant Indexes and Search for Academia. Proceedings of the Open-Source IR Replicability Challenge. http://ceur-ws.org/Vol-2409/docker08.pdf
[Macdonald21]: Craig Macdonald, Nicola Tonellotto, Sean MacAvaney, Iadh Ounis. PyTerrier: Declarative Experimentation in Python from BM25 to Dense Retrieval. Proceedings of CIKM 2021. https://dl.acm.org/doi/abs/10.1145/3459637.3482013

Credits

Sean MacAvaney, University of Glasgow
