lsr-benchmark
CLI • Python API • Citation
The lsr-benchmark aims to support holistic evaluations of the learned sparse retrieval paradigm to contrast efficiency and effectiveness across diverse retrieval scenarios.
Task
The learned sparse retrieval paradigm conducts retrieval in three steps:
- Documents are segmented into passages so that the passages can be processed by pre-trained transformers.
- Documents and queries are embedded into a sparse learned embedding.
- Retrieval systems create an index of the document embeddings to return a ranking for each embedded query.
You can submit solutions to step 2 (i.e., models that embed documents and queries into sparse embeddings) and/or solutions to step 3 (i.e., retrieval systems). The idea is then to validate all combinations of embeddings with all retrieval systems to identify which solutions work well for which use case, taking different notions of efficiency/effectiveness trade-offs into consideration. The passage segmentation for step 1 is open source (i.e., created via lsr-benchmark segment-corpus <IR-DATASETS-ID>) but fixed for this task.
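Steps 2 and 3 above can be sketched with toy data. This is a minimal illustration, not an actual learned model or production index: embeddings are hypothetical term-to-weight maps, and retrieval scores documents by the dot product between sparse query and document vectors.

```python
# Toy sketch of the learned-sparse-retrieval scoring step.
# The embeddings below are made up for illustration; a real system
# would produce them with a trained transformer (step 2).

def dot(query_emb, doc_emb):
    """Dot product of two sparse term->weight maps."""
    return sum(w * doc_emb.get(term, 0.0) for term, w in query_emb.items())

docs = {
    "d1": {"neural": 1.2, "retrieval": 0.8},
    "d2": {"sparse": 0.9, "retrieval": 1.1},
}
query = {"sparse": 1.0, "retrieval": 0.5}

# Step 3: rank all documents by their score against the query.
ranking = sorted(docs, key=lambda d: dot(query, docs[d]), reverse=True)
print(ranking)
```

A real retrieval system would replace the exhaustive loop with an inverted index over the non-zero terms, which is exactly the component that step-3 submissions contribute.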
Installation
You can install the lsr-benchmark via:
pip3 install lsr-benchmark
If you want the latest features, you can install from the main branch:
pip3 install git+https://github.com/reneuir/lsr-benchmark.git
Supported Corpora and Embeddings
Please run lsr-benchmark overview for an up-to-date overview of all datasets and all embeddings. Alternatively, the online overview in TIRA lists them as well.
Running Tests
We have a suite of unit tests that you can run via:
# first install the local version of the lsr-benchmark
pip3 install -e .[dev,test]
# then run the unit tests
pytest .
Documentation and Tutorials
We have a set of tutorials available.
The lsr-benchmark --help command serves as entrypoint to the documentation.
Instructions to add new datasets are available in the data directory.
- ToDo: Write how to add new datasets, embeddings, retrieval, evaluation
- short video
Data
The formats for data inputs and outputs aim to support slicing and dicing diverse query and document distributions while enabling caching, allowing for GreenIR research.
You can slice and dice the document texts and document embeddings via the API. The document texts for private corpora are only available within the TIRA sandbox, whereas the document embeddings are publicly available for all corpora (as one cannot reconstruct the original documents from sparse embeddings).
import ir_datasets
import lsr_benchmark

dataset = lsr_benchmark.load('<IR-DATASETS-ID>')

# process the document embeddings (one embedding per document):
for doc in dataset.docs_iter(embedding='<EMBEDDING-MODEL>', passage_aggregation="first-passage"):
    doc  # namedtuple<doc_id, embedding>

# process the document embeddings for all segments:
for doc in dataset.docs_iter(embedding='<EMBEDDING-MODEL>'):
    doc  # namedtuple<doc_id, segments.embedding>

# process the document texts:
for doc in dataset.docs_iter(embedding=None):
    doc  # namedtuple<doc_id, segments.text>

# process the document texts via segmented versions in ir_datasets:
lsr_benchmark.register_to_ir_datasets()
for segmented_doc in ir_datasets.load("lsr-benchmark/<IR-DATASETS-ID>/segmented").docs_iter():
    segmented_doc  # namedtuple<doc_id, segment>
Format of Document Texts
Inspired by the processing of MS MARCO v2.1, each document consists of a doc_id and a list of text segments that are short enough to be processed by pre-trained transformers. For instance, a document that consists of 4 passages (e.g., "text-of-passage-1 text-of-passage-2 text-of-passage-3 text-of-passage-4") would be represented as:
- doc_id: 12fd3396-e4d7-4c0f-b468-5a82402b5336
- segments:
- {"start": 1, "end": 2, "text": "text-of-passage-1 text-of-passage-2"}
- {"start": 2, "end": 3, "text": "text-of-passage-2 text-of-passage-3"}
- {"start": 3, "end": 4, "text": "text-of-passage-3 text-of-passage-4"}
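The segments above follow a sliding window of two consecutive passages, advanced one passage at a time. A minimal sketch of that windowing (illustrative only, not the benchmark's actual segmentation code, which is fixed via lsr-benchmark segment-corpus):

```python
# Illustrative sliding-window segmentation: a window of `window`
# consecutive passages, advanced one passage at a time.
# `start` and `end` are 1-based passage indices, as in the example above.

def segment(passages, window=2):
    segments = []
    for start in range(1, len(passages) - window + 2):
        end = start + window - 1
        text = " ".join(passages[start - 1:end])
        segments.append({"start": start, "end": end, "text": text})
    return segments

passages = [f"text-of-passage-{i}" for i in range(1, 5)]
for seg in segment(passages):
    print(seg)
```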
Format of Document Embeddings
Each document consists of a doc_id and a list of segment embeddings, one sparse term-to-weight map per text segment. For instance, a document that consists of 4 passages would be represented as:
- doc_id: 12fd3396-e4d7-4c0f-b468-5a82402b5336
- segments:
- {"start": 1, "end": 2, "embedding": {"term-1": 0.123, "term-2": 0.912}}
- {"start": 2, "end": 3, "embedding": {"term-1": 0.421, "term-3": 0.743}}
- {"start": 3, "end": 4, "embedding": {"term-2": 0.108, "term-4": 0.043}}
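Retrieval systems that want a single vector per document can aggregate the per-segment embeddings themselves. The docs_iter API above offers passage_aggregation="first-passage"; the max pooling below is just one illustrative alternative, not an aggregation the benchmark prescribes:

```python
# Aggregate per-segment sparse embeddings into one document-level
# sparse vector by taking the maximum weight per term (max pooling).

def max_pool(segments):
    doc_embedding = {}
    for seg in segments:
        for term, weight in seg["embedding"].items():
            doc_embedding[term] = max(weight, doc_embedding.get(term, 0.0))
    return doc_embedding

segments = [
    {"start": 1, "end": 2, "embedding": {"term-1": 0.123, "term-2": 0.912}},
    {"start": 2, "end": 3, "embedding": {"term-1": 0.421, "term-3": 0.743}},
    {"start": 3, "end": 4, "embedding": {"term-2": 0.108, "term-4": 0.043}},
]
print(max_pool(segments))
```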
Evaluation
The online overview in TIRA aggregates all evaluations. Alternatively, the raw data and further custom evaluations are available in the step-04-evaluation directory of this repository.
Our evaluation methodology encourages the development of diverse and novel measures for lsr models that take efficiency and effectiveness into consideration. We assume that a suitable interpretation of efficiency for a target task highly depends on the application and its context. Therefore, we aim to measure as many efficiency-oriented aspects as possible in a standardized way with the tirex-tracker to ensure that different efficiency/effectiveness interpretations can be evaluated post-hoc. This methodology and related aspects were developed as part of the ReNeuIR workshop series held at SIGIR 2022, 2023, 2024, and 2025.
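One simple post-hoc way to combine the tracked measurements is to look at which systems are Pareto-optimal under a chosen efficiency/effectiveness pair. The sketch below uses hypothetical numbers (latency in ms, nDCG@10) and is only one illustrative interpretation, not the benchmark's official measure:

```python
# A system is dominated if another system is at least as fast and at
# least as effective, and strictly better in at least one dimension.
# Pareto-optimal systems are those that are not dominated.

def pareto_frontier(systems):
    frontier = []
    for name, latency, quality in systems:
        dominated = any(
            l2 <= latency and q2 >= quality and (l2 < latency or q2 > quality)
            for _, l2, q2 in systems
        )
        if not dominated:
            frontier.append(name)
    return frontier

# Hypothetical measurements: (name, latency in ms, nDCG@10).
systems = [("A", 12.0, 0.42), ("B", 30.0, 0.45), ("C", 28.0, 0.41)]
print(pareto_frontier(systems))
```

Here C is dominated by A (faster and more effective), while A and B represent different trade-off points and both remain on the frontier.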