A PyTerrier interface to the PISA search engine

PyTerrier PISA

PyTerrier bindings for the PISA search engine.

Getting Started

These bindings are only available for CPython 3.8-3.12 on manylinux2014_x86_64 (glibc 2.17+) platforms. They can be installed via pip:

pip install pyterrier_pisa

Indexing

You can easily index corpora from PyTerrier datasets:

import pyterrier as pt
from pyterrier_pisa import PisaIndex

# from a dataset
dataset = pt.get_dataset('irds:msmarco-passage')
index = PisaIndex('./msmarco-passage-pisa')
index.index(dataset.get_corpus_iter())

You can also select which text field(s) to index. If not specified, all fields of type str will be indexed.

dataset = pt.get_dataset('irds:cord19')
index = PisaIndex('./cord19-pisa', text_field=['title', 'abstract'])
index.index(dataset.get_corpus_iter())
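
To make the default concrete, here is a rough sketch of the "all fields of type str" rule applied to a single document. The helper default_text_fields is hypothetical and only illustrates the documented behavior. It assumes that docno is treated as an identifier rather than as text, which may not match pyterrier_pisa's internal logic exactly.

```python
def default_text_fields(doc):
    """Return the names of the fields that would be treated as text by default."""
    # Assumption: 'docno' is the document identifier and is not indexed as text.
    return sorted(k for k, v in doc.items() if k != 'docno' and isinstance(v, str))

doc = {
    'docno': 'a6avr09j',
    'title': 'Covid symptoms',
    'abstract': 'Fever and cough.',
    'year': 2020,  # not a str, so it is skipped
}
fields = default_text_fields(doc)
print(fields)  # ['abstract', 'title']
```

If the default picks up fields you do not want, pass text_field explicitly as shown above.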

PisaIndex accepts various other options to configure the indexing process. Most notable are:

  • stemmer: Which stemmer to use. Options: porter2 (default), krovetz, none.
  • threads: How many threads to use for indexing. Default: 8.
  • index_encoding: Which index encoding to use. Default: block_simdbp.
  • stops: Which set of stopwords to use. Default: terrier.

# E.g.,
index = PisaIndex('./cord19-pisa', stemmer='krovetz', threads=32)

For some collections you can download pre-built indices from data.terrier.org. PISA indices are prefixed with pisa_.

index = PisaIndex.from_dataset('trec-covid')

Retrieval

From an index, you can build retrieval transformers:

dph = index.dph()
bm25 = index.bm25(k1=1.2, b=0.4)
pl2 = index.pl2(c=1.0)
qld = index.qld(mu=1000.)
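
The k1 and b arguments are the usual BM25 parameters: k1 controls how quickly repeated term occurrences saturate, and b controls how strongly long documents are penalized. The sketch below uses a common textbook formulation to show their effect. It is not taken from PISA's source, so PISA's exact variant may differ in detail, and the function name bm25_term_score is made up for this example.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, df, num_docs, k1=1.2, b=0.4):
    """Textbook BM25 contribution of one query term to one document."""
    # IDF with +1 inside the log so the value is never negative.
    idf = math.log(1.0 + (num_docs - df + 0.5) / (df + 0.5))
    # Length normalization: equals 1.0 when b == 0 or the document has average length.
    norm = 1.0 - b + b * (doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + k1 * norm)

# Same term frequency, different document lengths.
short_doc = bm25_term_score(tf=3, doc_len=50, avg_doc_len=100, df=10, num_docs=1000)
long_doc = bm25_term_score(tf=3, doc_len=200, avg_doc_len=100, df=10, num_docs=1000)
print(short_doc > long_doc)  # True: the shorter document scores higher

# With b=0, document length no longer matters.
no_norm_short = bm25_term_score(3, 50, 100, 10, 1000, b=0.0)
no_norm_long = bm25_term_score(3, 200, 100, 10, 1000, b=0.0)
print(no_norm_short == no_norm_long)  # True
```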

These retrievers support all the typical pipeline operations.

Search:

bm25.search('covid symptoms')
#     qid           query     docno     score
# 0     1  covid symptoms  a6avr09j  6.273450
# 1     1  covid symptoms  hdxs9dgu  6.272374
# 2     1  covid symptoms  zxq7dl9t  6.272374
# ..   ..             ...       ...       ...
# 999   1  covid symptoms  m8wggdc7  4.690651

Batch retrieval:

print(dph(dataset.get_topics('title')))
#       qid                     query     docno     score
# 0       1        coronavirus origin  8ccl9aui  9.329109
# 1       1        coronavirus origin  es7q6c90  9.260190
# 2       1        coronavirus origin  8l411r1w  8.862670
# ...    ..                       ...       ...       ...
# 49999  50  mrna vaccine coronavirus  eyitkr3s  5.610429

Experiment:

from pyterrier.measures import *
pt.Experiment(
  [dph, bm25, pl2, qld],
  dataset.get_topics('title'),
  dataset.get_qrels(),
  [nDCG@10, P@5, P(rel=2)@5, 'mrt'],
  names=['dph', 'bm25', 'pl2', 'qld']
)
#    name   nDCG@10    P@5  P(rel=2)@5       mrt
# 0   dph  0.623450  0.720       0.548  1.101846
# 1  bm25  0.624923  0.728       0.572  0.880318
# 2   pl2  0.536506  0.632       0.456  1.123883
# 3   qld  0.570032  0.676       0.504  0.974924
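
For readers unfamiliar with the measure notation, P@5 is the fraction of the top 5 results that are relevant, and P(rel=2)@5 counts only documents judged at relevance level 2 or higher. The following is a small hand-rolled sketch of that definition on toy data. The function precision_at_k and the docnos are invented for illustration; in practice the measures come from pyterrier.measures.

```python
def precision_at_k(ranked_docnos, qrels, k, rel=1):
    """Fraction of the top-k results whose judged relevance is at least `rel`."""
    top = ranked_docnos[:k]
    hits = sum(1 for docno in top if qrels.get(docno, 0) >= rel)
    return hits / k

ranking = ['d3', 'd1', 'd7', 'd2', 'd9']       # system output, best first
qrels = {'d1': 2, 'd2': 1, 'd3': 1, 'd4': 2}   # unjudged docs count as 0

p5 = precision_at_k(ranking, qrels, k=5)              # d3, d1, d2 are relevant
p5_rel2 = precision_at_k(ranking, qrels, k=5, rel=2)  # only d1 reaches level 2
print(p5, p5_rel2)  # 0.6 0.2
```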

You can also build a retrieval transformer from PisaRetrieve:

from pyterrier_pisa import PisaRetrieve
# from index path:
bm25 = PisaRetrieve('./cord19-pisa', scorer='bm25', bm25_k1=1.2, bm25_b=0.4)
# from dataset
bm25 = PisaRetrieve.from_dataset('trec-covid', 'pisa_unstemmed', scorer='bm25', bm25_k1=1.2, bm25_b=0.4)

Extras

You can access PISA's tokenizer and stemmers using the tokenize function:

import pyterrier_pisa
pyterrier_pisa.tokenize('hello worlds!')
# ['hello', 'worlds']
pyterrier_pisa.tokenize('hello worlds!', stemmer='porter2')
# ['hello', 'world']

FAQ

What retrieval functions are supported?

  • "dph". Parameters: (none)
  • "bm25". Parameters: k1, b
  • "pl2". Parameters: c
  • "qld". Parameters: mu

How do I index [some other type of data]?

PisaIndex accepts an iterator over dicts, each of which contains a docno field and a text field. All you need to do is coerce your data into that format.

Examples:

# any iterator
def iter_docs():
  for i in range(100):
    yield {'docno': str(i), 'text': f'document {i}'}
index = PisaIndex('./dummy-pisa')
index.index(iter_docs())

# from a dataframe
import pandas as pd
docs = pd.DataFrame([
  ('1', 'test doc'),
  ('2', 'another doc'),
], columns=['docno', 'text'])
index = PisaIndex('./dummy-pisa-2')
index.index(docs.to_dict(orient="records"))
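
As a further illustration of coercing data into docno/text dicts, here is a sketch for JSON Lines input. The record layout (id, title, body) and the helper name iter_jsonl_docs are assumptions made up for this example; adapt the keys to your own data.

```python
import io
import json

def iter_jsonl_docs(lines, id_key='id', text_keys=('title', 'body')):
    """Yield {'docno', 'text'} dicts from an iterable of JSON Lines strings."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        rec = json.loads(line)
        # Join the non-empty text fields into a single text field.
        text = ' '.join(str(rec[k]) for k in text_keys if rec.get(k))
        # docno is converted to a string, matching the examples above.
        yield {'docno': str(rec[id_key]), 'text': text}

# Stand-in for an open file; a real file handle works the same way.
raw = io.StringIO(
    '{"id": 1, "title": "First", "body": "hello world"}\n'
    '\n'
    '{"id": 2, "title": "", "body": "only a body"}\n'
)
docs = list(iter_jsonl_docs(raw))
print(docs)
# [{'docno': '1', 'text': 'First hello world'}, {'docno': '2', 'text': 'only a body'}]
```

The generator can be passed to index.index(...) in the same way as iter_docs() above.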

Can I build a doc2query index?

You can use PisaIndex with any document rewriter, such as doc2query or DeepCT. All you need to do is build an indexing pipeline. For example:

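# shell: install pyterrier_doc2query and download the T5 model
pip install --upgrade git+https://github.com/terrierteam/pyterrier_doc2query.git
wget https://git.uwaterloo.ca/jimmylin/doc2query-data/raw/master/T5-passage/t5-base.zip
unzip t5-base.zip

# python: expand each document with doc2query, then index the expanded text
import pyterrier as pt
from pyterrier_doc2query import Doc2Query
from pyterrier_pisa import PisaIndex
doc2query = Doc2Query(out_attr="exp_terms", batch_size=8)
dataset = pt.get_dataset('irds:vaswani')
index = PisaIndex('./vaswani-doc2query-pisa')
# append the generated expansion terms to the original text before indexing
index_pipeline = doc2query >> pt.apply.text(lambda r: f'{r["text"]} {r["exp_terms"]}') >> index
index_pipeline.index(dataset.get_corpus_iter())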

Can I build a learned sparse retrieval (e.g., SPLADE) index?

Yes! Example:

import pyterrier as pt
import pyt_splade
from pyterrier_pisa import PisaIndex
splade = pyt_splade.Splade()
dataset = pt.get_dataset('irds:msmarco-passage')
index = PisaIndex('./msmarco-passage-splade', stemmer='none')

# indexing
idx_pipeline = splade.doc_encoder() >> index.toks_indexer()
idx_pipeline.index(dataset.get_corpus_iter())

# retrieval
retr_pipeline = splade.query_encoder() >> index.quantized()

msmarco-passage/trec-dl-2019 effectiveness for naver/splade-cocondenser-ensembledistil:

System      nDCG@10  R(rel=2)@1000
PISA        0.731    0.872
From Paper  0.732    0.875
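
Conceptually, a learned sparse model maps each document and query to a dict of term weights, and the retrieval score is the dot product over shared terms. The toy sketch below shows that idea with integer-quantized weights. The helpers quantize and impact_score, the scale of 100, and the example weights are all assumptions for illustration; they are not pyterrier_pisa's actual implementation or defaults.

```python
def quantize(toks, scale=100):
    """Turn float term weights into positive integer impacts."""
    # Assumption: weights are scaled by a constant and rounded; zero impacts are dropped.
    scaled = {term: int(round(weight * scale)) for term, weight in toks.items()}
    return {term: impact for term, impact in scaled.items() if impact > 0}

def impact_score(query_toks, doc_toks):
    """Dot product of query and document impacts over the shared terms."""
    return sum(w * doc_toks[t] for t, w in query_toks.items() if t in doc_toks)

doc = {'covid': 1.42, 'symptom': 0.87, 'fever': 0.31, 'the': 0.001}
query = {'covid': 1.1, 'symptom': 0.9, 'cough': 0.5}

q_doc = quantize(doc)      # 'the' rounds to 0 and is dropped
q_query = quantize(query)
score = impact_score(q_query, q_doc)
print(q_doc)   # {'covid': 142, 'symptom': 87, 'fever': 31}
print(score)   # 23450  (110*142 + 90*87)
```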

What are the supported index encodings and query algorithms?

Right now we support the following index encodings: ef, single, pefuniform, pefopt, block_optpfor, block_varintg8iu, block_streamvbyte, block_maskedvbyte, block_interpolative, block_qmx, block_varintgb, block_simple8b, block_simple16, block_simdbp.

Index encodings are supplied when a PisaIndex is constructed:

index = PisaIndex('./cord19-pisa', index_encoding='ef')
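
If you are wondering what an index encoding does, the general idea is to store each posting list as small gaps between sorted document ids and then compress those gaps. The sketch below uses a generic variable-byte scheme purely as an illustration of that idea. It is not the byte layout of any specific PISA encoding, and all function names are made up for this example.

```python
def to_gaps(docids):
    """Convert sorted docids into differences between consecutive ids."""
    return [d - prev for d, prev in zip(docids, [0] + docids[:-1])]

def from_gaps(gaps):
    """Invert to_gaps with a running sum."""
    docids, total = [], 0
    for gap in gaps:
        total += gap
        docids.append(total)
    return docids

def vbyte_encode(numbers):
    """Encode non-negative ints using 7 payload bits per byte."""
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)  # low 7 bits, more bytes follow
            n >>= 7
        out.append(n | 0x80)      # high bit marks the final byte of a number
    return bytes(out)

def vbyte_decode(data):
    """Decode bytes produced by vbyte_encode."""
    numbers, n, shift = [], 0, 0
    for byte in data:
        if byte & 0x80:
            numbers.append(n | ((byte & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= byte << shift
            shift += 7
    return numbers

docids = [3, 7, 150, 151, 100000]
gaps = to_gaps(docids)           # [3, 4, 143, 1, 99849]
encoded = vbyte_encode(gaps)
decoded = from_gaps(vbyte_decode(encoded))
print(len(encoded), decoded == docids)  # 8 True
```

Different encodings trade compression ratio against decoding speed, so the best choice depends on your collection and hardware.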

We support the following query algorithms: wand, block_max_wand, block_max_maxscore, block_max_ranked_and, ranked_and, ranked_or, maxscore.

Query algorithms are supplied when you construct a retrieval transformer:

index.bm25(query_algorithm='ranked_and')
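
The main behavioral difference to be aware of is between conjunctive and disjunctive matching: ranked_and only scores documents containing every query term, while ranked_or scores documents containing any of them. The toy sketch below illustrates those semantics over hand-written postings. It is not how PISA implements them, and the scores and docnos are invented.

```python
# term -> {docno: per-term score}
postings = {
    'covid': {'d1': 2.0, 'd2': 1.5, 'd3': 0.5},
    'symptoms': {'d2': 1.0, 'd3': 2.5, 'd4': 3.0},
}

def top_k(scores, k):
    """Sort by descending score, breaking ties by docno."""
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:k]

def ranked_or(terms, postings, k):
    """Disjunctive: a document matching any term is scored."""
    scores = {}
    for term in terms:
        for docno, s in postings.get(term, {}).items():
            scores[docno] = scores.get(docno, 0.0) + s
    return top_k(scores, k)

def ranked_and(terms, postings, k):
    """Conjunctive: only documents matching every term are scored."""
    lists = [postings.get(term, {}) for term in terms]
    common = set(lists[0]).intersection(*lists[1:]) if lists else set()
    scores = {docno: sum(lst[docno] for lst in lists) for docno in common}
    return top_k(scores, k)

or_results = ranked_or(['covid', 'symptoms'], postings, k=3)
and_results = ranked_and(['covid', 'symptoms'], postings, k=3)
print(or_results)   # [('d3', 3.0), ('d4', 3.0), ('d2', 2.5)]
print(and_results)  # [('d3', 3.0), ('d2', 2.5)]
```

The remaining algorithms (wand, maxscore, and their block-max variants) are dynamic pruning strategies that aim to produce top-k results more efficiently than exhaustive scoring.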

Can I import/export from CIFF?

Yes! Use PisaIndex.from_ciff(ciff_file, index_path) to import and index.to_ciff(ciff_file) to export:

# from a CIFF export:
index = PisaIndex.from_ciff('path/to/something.ciff', 'path/to/index.pisa', stemmer='krovetz') # stemmer is optional
# to a CIFF export:
index.to_ciff('path/to/something.ciff')

Note that you need to be careful to set stemmer to match whatever was used when constructing the index; CIFF does not directly store which stemmer was used when building the index. If it's a stemmer that's not supported by PISA, you can set stemmer='none' and apply stemming in a PyTerrier pipeline.

Credits

  • Sean MacAvaney, University of Glasgow
  • Craig Macdonald, University of Glasgow
