Create a PyTerrier index using any sentence-transformers model
pyterrier_sentence_transformers
A codebase derived from terrierteam/pyterrier_ance that allows encoding with any sentence_transformers model.
Installation
If running FAISS on CPU:
pip install git+https://github.com/soldni/pyterrier_sentence_transformers.git
conda install -c pytorch faiss-cpu
Otherwise, for GPU support:
pip install git+https://github.com/soldni/pyterrier_sentence_transformers.git
conda install -c pytorch faiss-gpu cudatoolkit=11.3
If you need to install FAISS from scratch, see the instructions in the FAISS repository.
Running
See the example in scripts/contriever_scifact.py.
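A rough sketch of what that script wires together is shown below; the constructor arguments (model_name_or_path, index_path, num_results) and the dataset handle are assumptions for illustration, so check scripts/contriever_scifact.py for the exact parameters.

```python
import pyterrier as pt
from pyterrier_sentence_transformers import (
    SentenceTransformersIndexer,
    SentenceTransformersRetriever,
)

if not pt.started():
    pt.init()

# Hypothetical argument names; the real ones live in scripts/contriever_scifact.py.
indexer = SentenceTransformersIndexer(
    model_name_or_path="facebook/contriever-msmarco",
    index_path="./scifact_contriever_index",
)
dataset = pt.get_dataset("irds:beir/scifact")
indexer.index(dataset.get_corpus_iter())

retriever = SentenceTransformersRetriever(
    model_name_or_path="facebook/contriever-msmarco",
    index_path="./scifact_contriever_index",
    num_results=100,
)
```

On SciFact, the script produces results along these lines: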
| | name | map | recip_rank | P.10 | ndcg_cut.10 |
|---|---|---|---|---|---|
| 0 | BM25 | 0.637799 | 0.647941 | 0.091667 | 0.683904 |
| 1 | facebook/contriever-msmarco | 0.641346 | 0.653874 | 0.091667 | 0.682851 |
Note that the nDCG@10 we get for BM25 is much better than in the paper: instead of 66.5 on row 0, we get 68.4. The contriever result is also a bit better, with 68.3 instead of 67.7. Not sure what kind of magic pyterrier is doing here 🤷.
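For reference, an evaluation of this shape can be produced with PyTerrier's Experiment helper. A minimal sketch, assuming bm25 and dense_retriever are already-built PyTerrier transformers (e.g. a Terrier BM25 retriever and the dense retriever sketched above) and that the SciFact test split is loaded through ir_datasets:

```python
import pyterrier as pt

dataset = pt.get_dataset("irds:beir/scifact/test")

# Compare the two systems on the same topics/qrels with the metrics from the table.
results = pt.Experiment(
    [bm25, dense_retriever],
    dataset.get_topics(),
    dataset.get_qrels(),
    eval_metrics=["map", "recip_rank", "P.10", "ndcg_cut.10"],
    names=["BM25", "facebook/contriever-msmarco"],
)
print(results)
```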
Note that, by default, this codebase uses exhaustive search when querying the dense index. This is not ideal for performance, but it is the setting contriever was evaluated on. If you want to switch to approximate search, you can do so by setting the faiss_factory_config attribute of SentenceTransformersRetriever / SentenceTransformersIndexer to any valid index factory string (or by passing faiss_factory_config= to the contriever_scifact.py script). I recommend checking out the faiss docs for more info on the various approximate search options; a good starting point is probably HNSW:
python scripts/contriever_scifact.py \
faiss_factory_config='HNSW32' \
per_call_size=1024
This gets you performance close to that of exact search:
| | name | map | recip_rank | P.10 | ndcg_cut.10 |
|---|---|---|---|---|---|
| 0 | BM25 | 0.637799 | 0.647941 | 0.091667 | 0.683904 |
| 1 | facebook/contriever-msmarco | 0.629594 | 0.642171 | 0.090000 | 0.670841 |
Note that sometimes you might have to increase the number of passages per batch (per_call_size); this is because the approximate search index gets trained on the first batch of passages, and the more passages it sees, the better the search will be.
In the example above, switching to faiss_factory_config='HNSW64' gets you another point of accuracy in nDCG@10, but it will increase query time.
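For context, here is a minimal standalone sketch of what a factory string such as 'HNSW32' builds when handed to FAISS directly. This is plain faiss usage with random vectors, not this codebase's API, and the 768-dimensional embedding size is an assumption (contriever-sized):

```python
import numpy as np
import faiss

d = 768  # embedding dimension (assumed; contriever produces 768-d vectors)
index = faiss.index_factory(d, "HNSW32")  # same string as faiss_factory_config

# Stand-in passage embeddings; in the real pipeline these come from the
# sentence_transformers model.
passages = np.random.rand(10_000, d).astype("float32")
if not index.is_trained:   # HNSW needs no training step; IVF-style factories do
    index.train(passages)
index.add(passages)

# Stand-in query embeddings; search returns the top-10 approximate neighbours.
queries = np.random.rand(4, d).astype("float32")
scores, ids = index.search(queries, 10)
```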