
Create a PyTerrier index using any sentence-transformers model

Project description

pyterrier_sentence_transformers

A codebase derived from terrierteam/pyterrier_ance that allows encoding with any sentence_transformers model.

Installation

If running faiss on CPU:

pip install git+https://github.com/soldni/pyterrier_sentence_transformers.git
conda install -c pytorch faiss-cpu

Otherwise, for GPU support:

pip install git+https://github.com/soldni/pyterrier_sentence_transformers.git
conda install -c pytorch faiss-gpu cudatoolkit=11.3

If you need to install faiss from scratch, see instructions here.

Running

See example in scripts/contriever_scifact.py.

                          name       map  recip_rank      P.10  ndcg_cut.10
0                         BM25  0.637799    0.647941  0.091667     0.683904
1  facebook/contriever-msmarco  0.641346    0.653874  0.091667     0.682851

Note that the nDCG@10 we get for BM25 is much better than in the paper: 68.4 instead of the reported 66.5. The contriever result is also a bit better, with 68.3 instead of 67.7. Not sure what kind of magic pyterrier is doing here 🤷.

Note that, by default, this codebase uses exhaustive search when querying the dense index. This is not ideal for performance, but it is the setting contriever was evaluated on. If you want to switch to approximate search, you can do so by setting the faiss_factory_config attribute of SentenceTransformersRetriever / SentenceTransformersIndexer to any valid index factory string (or pass faiss_factory_config= to the contriever_scifact.py script). I recommend checking out the faiss docs for more info on the various approximate search options; a good starting point is probably HNSW:

python scripts/contriever_scifact.py \
    faiss_factory_config='HNSW32' \
    per_call_size=1024
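For intuition on what the default exhaustive setting does, here is a stdlib-only toy sketch of brute-force inner-product search (the document ids and embeddings are made up for illustration; the real code does this with faiss over model embeddings):

```python
# toy brute-force ("exhaustive") dense retrieval: score the query
# against every document embedding and sort by score; a flat faiss
# index does the same thing, just vectorized in C++
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def exhaustive_search(query, doc_embs, k=2):
    scores = [(doc_id, dot(query, emb)) for doc_id, emb in doc_embs.items()]
    return sorted(scores, key=lambda s: -s[1])[:k]

docs = {
    "d1": [0.9, 0.1, 0.0],
    "d2": [0.1, 0.8, 0.1],
    "d3": [0.0, 0.2, 0.9],
}
print(exhaustive_search([1.0, 0.0, 0.1], docs))
```

Approximate indexes such as HNSW trade this exhaustive scan for a graph walk that visits only a fraction of the documents, which is why they scale better at a small cost in accuracy.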

This gets you performance close to that of the exact search:

                          name       map  recip_rank      P.10  ndcg_cut.10
0                         BM25  0.637799    0.647941  0.091667     0.683904
1  facebook/contriever-msmarco  0.629594    0.642171  0.090000     0.670841

Note that sometimes you might have to increase the number of passages per batch (per_call_size): the approximate index is trained on the first batch of passages, so the more passages that batch contains, the better the search will be.

In the example above, switching to faiss_factory_config='HNSW64' gets you another point of accuracy in nDCG@10, but it will increase query time.


