PyTerrier wrapper for SPLADE learned sparse indexing and retrieval

pyterrier_splade

An example of SPLADE learned sparse indexing and retrieval using PyTerrier transformers.

Installation

%pip install -q git+https://github.com/cmacdonald/pyt_splade.git

Indexing

Indexing is expressed as a pipeline. The SPLADE document encoder maps the raw text of each document to a dictionary of BERT WordPiece tokens and their corresponding weights. The underlying Terrier indexer is configured with pretokenised=True, so it indexes these tokens and weights unchanged, without applying any further tokenisation.

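import pyterrier as pt
import pyterrier_splade

# the corpus to index; MS MARCO passage is the dataset used in the PISA example
dataset = pt.get_dataset('irds:msmarco-passage')

splade = pyterrier_splade.Splade()
# pretokenised=True: index the tokens and weights produced by SPLADE unchanged
indexer = pt.IterDictIndexer('./msmarco_psg', pretokenised=True)

# encode each document with SPLADE, then pass the result to the Terrier indexer
indxr_pipe = splade.doc_encoder() >> indexer
index_ref = indxr_pipe.index(dataset.get_corpus_iter(), batch_size=128)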
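
To make the intermediate representation more concrete, the following is a minimal sketch of what a pretokenised document might look like once float token weights have been turned into integer pseudo-frequencies for the indexer. The helper name `to_pretokenised`, the example weights and the `scale` factor of 100 are illustrative assumptions, not taken from the package; only the `docno` and `toks` field names follow PyTerrier's pretokenised indexing convention.

```python
from typing import Dict


def to_pretokenised(docno: str, weights: Dict[str, float], scale: int = 100) -> dict:
    """Hypothetical helper: convert float token weights to integer pseudo-frequencies."""
    toks = {}
    for token, weight in weights.items():
        tf = int(round(weight * scale))
        # tokens whose scaled weight rounds to zero carry no signal and are dropped
        if tf > 0:
            toks[token] = tf
    return {"docno": docno, "toks": toks}


# made-up weights, standing in for the output of a SPLADE document encoder
doc = to_pretokenised("d1", {"chemical": 1.25, "reaction": 0.8, "##s": 0.004})
print(doc)
```

A record of this shape can be indexed directly because the indexer treats the integer values as term frequencies rather than re-tokenising any text.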

Retrieval

Similarly, at retrieval time SPLADE encodes the query into BERT WordPiece tokens and corresponding weights. We apply this as a query encoding transformer ahead of the Terrier retriever.

splade_retr = splade.query_encoder() >> pt.terrier.Retriever('./msmarco_psg', wmodel='Tf')
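
As a rough illustration of why the 'Tf' weighting model is used here, the sketch below assumes that each matched term contributes its stored document frequency multiplied by the query term weight, so that the final score behaves like a dot product between the sparse query and document vectors. The function name `tf_score` and all token weights are made up for the example; this is not the Terrier implementation.

```python
from typing import Dict


def tf_score(query: Dict[str, float], doc_toks: Dict[str, int]) -> float:
    """Sum of (query weight * stored document frequency) over shared tokens."""
    return sum(w * doc_toks[t] for t, w in query.items() if t in doc_toks)


# made-up sparse query and documents, standing in for SPLADE encoder outputs
query = {"chemical": 2.0, "bond": 1.5}
docs = {
    "d1": {"chemical": 125, "reaction": 80},
    "d2": {"bond": 60, "chemical": 10},
}

scores = {docno: tf_score(query, toks) for docno, toks in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)
```

Under this assumption, no further term weighting (such as IDF or length normalisation) is applied at retrieval time, since the learned weights already encode term importance.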

Scoring

SPLADE can also be used as a text scoring function, for example to re-score the results of a first-stage retriever.

first_stage = ... # e.g., BM25, dense retrieval, etc.
splade_scorer = first_stage >> dataset.text_loader() >> splade.scorer()
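
The re-scoring pattern in this pipeline can be sketched in plain Python: take the candidates from a first stage, load their text, score each one, and re-sort. The `rerank` helper and the toy term-overlap scoring function below are hypothetical stand-ins (the real scorer uses the SPLADE model), and ranks are assumed to start at 0.

```python
from typing import Callable, List, Tuple


def rerank(
    candidates: List[Tuple[str, str]],
    score_fn: Callable[[str], float],
) -> List[Tuple[str, float, int]]:
    """Score each (docno, text) candidate and return (docno, score, rank) by descending score."""
    scored = [(docno, score_fn(text)) for docno, text in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [(docno, score, rank) for rank, (docno, score) in enumerate(scored)]


# toy scoring function: counts query-term occurrences (NOT how SPLADE scores text)
query_terms = {"sparse", "retrieval"}


def overlap(text: str) -> float:
    words = text.lower().split()
    return float(sum(words.count(t) for t in query_terms))


# made-up first-stage output: (docno, text) pairs
first_stage_results = [
    ("d1", "dense retrieval with transformers"),
    ("d2", "learned sparse retrieval"),
    ("d3", "cooking pasta"),
]
reranked = rerank(first_stage_results, overlap)
print(reranked)
```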

PISA

For faster retrieval with SPLADE, you can use the PISA retrieval backend provided by PyTerrier_PISA:

import pyterrier as pt
import pyterrier_splade
from pyterrier_pisa import PisaIndex

splade = pyterrier_splade.Splade()
dataset = pt.get_dataset('irds:msmarco-passage')
# stemmer='none': SPLADE tokens are indexed as-is
index = PisaIndex('./msmarco-passage-splade', stemmer='none')

# indexing
idx_pipeline = splade.doc_encoder() >> index.toks_indexer()
idx_pipeline.index(dataset.get_corpus_iter())

# retrieval
retr_pipeline = splade.query_encoder() >> index.quantized()

Demo

We have a demo of PyTerrier_SPLADE at https://huggingface.co/spaces/terrierteam/splade

Note

This package was previously named pyt_splade and is now named pyterrier_splade. It is still available under the old name, but that name may be removed in the future.

Credits

  • Craig Macdonald
  • Sean MacAvaney
  • Nicola Tonellotto
