PyTerrier wrapper for SPLADE learned sparse indexing and retrieval
pyterrier_splade
An example of SPLADE learned sparse indexing and retrieval using PyTerrier transformers.
Installation
%pip install -q git+https://github.com/cmacdonald/pyt_splade.git
Indexing
Indexing takes place as a pipeline: the SPLADE document encoder maps raw text into a dictionary of BERT WordPiece tokens and corresponding weights, and the underlying indexer, Terrier, stores them. Terrier is configured with pretokenised=True, so it accepts arbitrary word counts and indexes the tokens unchanged, without further tokenisation.
import pyterrier as pt
import pyterrier_splade
dataset = pt.get_dataset('irds:msmarco-passage')  # corpus to index
splade = pyterrier_splade.Splade()
# pretokenised=True: index the SPLADE tokens and weights as-is
indexer = pt.IterDictIndexer('./msmarco_psg', pretokenised=True)
indxr_pipe = splade.doc_encoder() >> indexer
index_ref = indxr_pipe.index(dataset.get_corpus_iter(), batch_size=128)
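To make the data flow more concrete, here is a minimal, library-free sketch of what indexing pretokenised documents amounts to. It does not call pyterrier_splade or Terrier. The token weights, the scale factor of 100, and the helper names to_pseudo_tf and build_inverted_index are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical SPLADE-style output: WordPiece token -> weight.
# The values are invented for illustration only.
encoded_docs = [
    {"docno": "d1", "weights": {"terrier": 1.82, "dog": 0.94, "##s": 0.11}},
    {"docno": "d2", "weights": {"retrieval": 1.57, "search": 1.20, "terrier": 0.46}},
]


def to_pseudo_tf(weights, scale=100):
    """Scale float weights to integer pseudo term frequencies, dropping zeros."""
    toks = {}
    for tok, weight in weights.items():
        tf = int(round(weight * scale))
        if tf > 0:
            toks[tok] = tf
    return toks


def build_inverted_index(docs, scale=100):
    """Map each token to {docno: pseudo term frequency}, with no further tokenisation."""
    index = defaultdict(dict)
    for doc in docs:
        for tok, tf in to_pseudo_tf(doc["weights"], scale).items():
            index[tok][doc["docno"]] = tf
    return dict(index)


inverted_index = build_inverted_index(encoded_docs)
print(inverted_index["terrier"])
```

Tokens such as "##s" are stored exactly as given, which is the point of skipping further tokenisation: the indexer must not split or stem WordPieces.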
Retrieval
Similarly, SPLADE encodes the query into BERT WordPieces and corresponding weights. We apply this as a query encoding transformer.
splade_retr = splade.query_encoder() >> pt.terrier.Retriever('./msmarco_psg', wmodel='Tf')
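As a rough illustration of what this pipeline computes, the sketch below assumes that the 'Tf' weighting model reduces to a dot product between the query token weights and the stored pseudo term frequencies. The index contents, the query weights, and the function name dot_product_retrieve are invented for illustration; this is not Terrier's code.

```python
from collections import defaultdict

# Toy inverted index: token -> {docno: pseudo term frequency} (invented values).
inverted_index = {
    "terrier": {"d1": 182, "d2": 46},
    "dog": {"d1": 94},
    "retrieval": {"d2": 157},
}

# Hypothetical encoded query: WordPiece token -> weight.
query_weights = {"terrier": 1.5, "retrieval": 0.8}


def dot_product_retrieve(index, query, k=10):
    """Accumulate query_weight * tf for every posting of every query token."""
    scores = defaultdict(float)
    for tok, q_weight in query.items():
        for docno, tf in index.get(tok, {}).items():
            scores[docno] += q_weight * tf
    # Highest score first; ties broken by docno for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


ranking = dot_product_retrieve(inverted_index, query_weights)
for docno, score in ranking:
    print(docno, round(score, 2))
```

Only documents that share at least one token with the query receive a score, which is why an inverted index suits this kind of sparse representation.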
Scoring
SPLADE can also be used as a text scoring function, for example to re-score the results of a first-stage retriever once their text has been loaded.
first_stage = ... # e.g., BM25, dense retrieval, etc.
splade_scorer = first_stage >> dataset.text_loader() >> splade.scorer()
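The re-scoring step can be pictured with the following library-free sketch. It assumes the scorer computes a dot product between the sparse query and document representations. The vectors and the helper names sparse_dot and rerank are hypothetical and do not reflect the package's internals.

```python
def sparse_dot(query_vec, doc_vec):
    """Dot product of two sparse vectors stored as token -> weight dicts."""
    return sum(w * doc_vec.get(tok, 0.0) for tok, w in query_vec.items())


def rerank(query_vec, candidates):
    """Re-order first-stage candidates (docno, doc_vec) by sparse dot product."""
    scored = [(docno, sparse_dot(query_vec, vec)) for docno, vec in candidates]
    return sorted(scored, key=lambda pair: -pair[1])


# Invented query and candidate representations, listed in first-stage order.
query_vec = {"terrier": 1.0, "breed": 0.5}
candidates = [
    ("d7", {"terrier": 0.2, "dog": 1.1}),
    ("d3", {"terrier": 1.4, "breed": 0.8}),
]

reranked = rerank(query_vec, candidates)
print([docno for docno, _ in reranked])
```

Unlike index-based retrieval, this path encodes the candidate texts at query time, so it needs no SPLADE index but costs one document encoding per candidate.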
PISA
For faster retrieval with SPLADE, you can use the PISA retrieval backend provided by PyTerrier_PISA:
import pyterrier as pt
import pyterrier_splade
from pyterrier_pisa import PisaIndex
splade = pyterrier_splade.Splade()
dataset = pt.get_dataset('irds:msmarco-passage')
index = PisaIndex('./msmarco-passage-splade', stemmer='none')
# indexing
idx_pipeline = splade.doc_encoder() >> index.toks_indexer()
idx_pipeline.index(dataset.get_corpus_iter())
# retrieval
retr_pipeline = splade.query_encoder() >> index.quantized()
Demo
We have a demo of PyTerrier_SPLADE at https://huggingface.co/spaces/terrierteam/splade
Note
Note that this package used to be named pyt_splade; the new name is pyterrier_splade. The package is still available under the old name, but that name may be removed in the future.
Credits
- Craig Macdonald
- Sean MacAvaney
- Nicola Tonellotto
Download files
File details
Details for the file pyterrier_splade-0.1.0.tar.gz.
File metadata
- Download URL: pyterrier_splade-0.1.0.tar.gz
- Upload date:
- Size: 11.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f7d52079939ac9301a78e87edf98db5eedb3541e05fb65acfd28d2e397c397c8 |
| MD5 | 71c5a18791f92fc3d2dc35363a87d8a1 |
| BLAKE2b-256 | d44d581707d6ed978f98137b4d4e9abf450143d39c7606d08c366c5b9ce618df |
File details
Details for the file pyterrier_splade-0.1.0-py3-none-any.whl.
File metadata
- Download URL: pyterrier_splade-0.1.0-py3-none-any.whl
- Upload date:
- Size: 12.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 61a1a1ff360cc438d5a2872089e7c345a8c4c15562f40326122f154a22d55ced |
| MD5 | 846501c34af1482c76902c2db37bf766 |
| BLAKE2b-256 | f8ac2507170b1c7aa5e5f0b1e19d061f5498128dbdf77936b8526bb289c788e3 |