lsr-benchmark
CLI • Python API • Citation
The lsr-benchmark aims to support holistic evaluations of the learned sparse retrieval paradigm to contrast efficiency and effectiveness across diverse retrieval scenarios.
Task
The learned sparse retrieval paradigm conducts retrieval in three steps:
- Documents are segmented into passages so that the passages can be processed by pre-trained transformers.
- Documents and queries are embedded into a sparse learned embedding.
- Retrieval systems create an index of the document embeddings to return a ranking for each embedded query.
You can submit solutions to step 2 (i.e., models that embed documents and queries into sparse embeddings) and/or solutions to step 3 (i.e., retrieval systems). The idea is then to validate all combinations of embeddings with all retrieval systems to identify which solutions work well for which use case, taking different notions of efficiency/effectiveness trade-offs into consideration. The passage segmentation for step 1 is open source (i.e., created via lsr-benchmark segment-corpus <IR-DATASETS-ID>) but fixed for this task.
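Steps 2 and 3 above can be sketched with toy data. This is a minimal illustration, not an actual learned model or production index: embeddings are hypothetical term-to-weight maps, and retrieval scores documents by the dot product between sparse query and document vectors.

```python
# Toy sketch of the learned-sparse-retrieval scoring step.
# The embeddings below are made up for illustration; a real system
# would produce them with a trained transformer (step 2).

def dot(query_emb, doc_emb):
    """Dot product of two sparse term->weight maps."""
    return sum(w * doc_emb.get(term, 0.0) for term, w in query_emb.items())

docs = {
    "d1": {"neural": 1.2, "retrieval": 0.8},
    "d2": {"sparse": 0.9, "retrieval": 1.1},
}
query = {"sparse": 1.0, "retrieval": 0.5}

# Step 3: rank all documents by their score against the query.
ranking = sorted(docs, key=lambda d: dot(query, docs[d]), reverse=True)
print(ranking)
```

A real retrieval system would replace the exhaustive loop with an inverted index over the non-zero terms, which is exactly the component that step-3 submissions contribute.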
Installation
You can install the lsr-benchmark via:
pip3 install lsr-benchmark
If you want the latest features, you can install from the main branch:
pip3 install git+https://github.com/reneuir/lsr-benchmark.git
Supported Corpora and Embeddings
Please run lsr-benchmark overview for an up-to-date overview of all datasets and all embeddings. Alternatively, the online overview in TIRA lists them as well.
Running Tests
We have a suite of unit tests that you can run via:
# first install the local version of the lsr-benchmark
pip3 install -e .[dev,test]
# then run the unit tests
pytest .
Documentation and Tutorials
We have a set of tutorials available.
The lsr-benchmark --help command serves as entrypoint to the documentation.
Instructions to add new datasets are available in the data directory.
- ToDo: Write how to add new datasets, embeddings, retrieval, evaluation
- short video
Data
The formats for data inputs and outputs aim to support slicing and dicing diverse query and document distributions while enabling caching, allowing for GreenIR research.
You can slice and dice the document texts and document embeddings via the API. The document texts for private corpora are only available within the TIRA sandbox, whereas the document embeddings are publicly available for all corpora (as one cannot reconstruct the original documents from sparse embeddings).
import ir_datasets
import lsr_benchmark

dataset = lsr_benchmark.load('<IR-DATASETS-ID>')

# process the document embeddings (one embedding per document):
for doc in dataset.docs_iter(embedding='<EMBEDDING-MODEL>', passage_aggregation="first-passage"):
    doc  # namedtuple<doc_id, embedding>

# process the document embeddings for all segments:
for doc in dataset.docs_iter(embedding='<EMBEDDING-MODEL>'):
    doc  # namedtuple<doc_id, segments.embedding>

# process the document texts:
for doc in dataset.docs_iter(embedding=None):
    doc  # namedtuple<doc_id, segments.text>

# process the document texts via segmented versions in ir_datasets:
lsr_benchmark.register_to_ir_datasets()
for segmented_doc in ir_datasets.load("lsr-benchmark/<IR-DATASETS-ID>/segmented").docs_iter():
    segmented_doc  # namedtuple<doc_id, segment>
Format of Document Texts
Inspired by the processing of MS MARCO v2.1, each document consists of a doc_id and a list of text segments that are short enough to be processed by pre-trained transformers. For instance, a document that consists of 4 passages (e.g., "text-of-passage-1 text-of-passage-2 text-of-passage-3 text-of-passage-4") would be represented as:
- doc_id: 12fd3396-e4d7-4c0f-b468-5a82402b5336
- segments:
- {"start": 1, "end": 2, "text": "text-of-passage-1 text-of-passage-2"}
- {"start": 2, "end": 3, "text": "text-of-passage-2 text-of-passage-3"}
- {"start": 3, "end": 4, "text": "text-of-passage-3 text-of-passage-4"}
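The segments above follow a sliding window of two consecutive passages, advanced one passage at a time. A minimal sketch of that windowing (illustrative only, not the benchmark's actual segmentation code, which is fixed via lsr-benchmark segment-corpus):

```python
# Illustrative sliding-window segmentation: a window of `window`
# consecutive passages, advanced one passage at a time.
# `start` and `end` are 1-based passage indices, as in the example above.

def segment(passages, window=2):
    segments = []
    for start in range(1, len(passages) - window + 2):
        end = start + window - 1
        text = " ".join(passages[start - 1:end])
        segments.append({"start": start, "end": end, "text": text})
    return segments

passages = [f"text-of-passage-{i}" for i in range(1, 5)]
for seg in segment(passages):
    print(seg)
```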
Format of Document Embeddings
Each document consists of a doc_id and a list of segment embeddings, one sparse term-to-weight map per text segment. For instance, a document that consists of 4 passages would be represented as:
- doc_id: 12fd3396-e4d7-4c0f-b468-5a82402b5336
- segments:
- {"start": 1, "end": 2, "embedding": {"term-1": 0.123, "term-2": 0.912}}
- {"start": 2, "end": 3, "embedding": {"term-1": 0.421, "term-3": 0.743}}
- {"start": 3, "end": 4, "embedding": {"term-2": 0.108, "term-4": 0.043}}
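Retrieval systems that want a single vector per document can aggregate the per-segment embeddings themselves. The docs_iter API above offers passage_aggregation="first-passage"; the max pooling below is just one illustrative alternative, not an aggregation the benchmark prescribes:

```python
# Aggregate per-segment sparse embeddings into one document-level
# sparse vector by taking the maximum weight per term (max pooling).

def max_pool(segments):
    doc_embedding = {}
    for seg in segments:
        for term, weight in seg["embedding"].items():
            doc_embedding[term] = max(weight, doc_embedding.get(term, 0.0))
    return doc_embedding

segments = [
    {"start": 1, "end": 2, "embedding": {"term-1": 0.123, "term-2": 0.912}},
    {"start": 2, "end": 3, "embedding": {"term-1": 0.421, "term-3": 0.743}},
    {"start": 3, "end": 4, "embedding": {"term-2": 0.108, "term-4": 0.043}},
]
print(max_pool(segments))
```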
Evaluation
The online overview in TIRA aggregates all evaluations. Alternatively, the raw data and further custom evaluations are available in the step-04-evaluation directory of this repository.
Our evaluation methodology encourages the development of diverse and novel measures for lsr models that take efficiency and effectiveness into consideration. We assume that a suitable interpretation of efficiency for a target task highly depends on the application and its context. Therefore, we aim to measure as many efficiency-oriented aspects as possible in a standardized way with the tirex-tracker to ensure that different efficiency/effectiveness interpretations can be evaluated post-hoc. This methodology and related aspects were developed as part of the ReNeuIR workshop series held at SIGIR 2022, 2023, 2024, and 2025.
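One simple post-hoc way to combine the tracked measurements is to look at which systems are Pareto-optimal under a chosen efficiency/effectiveness pair. The sketch below uses hypothetical numbers (latency in ms, nDCG@10) and is only one illustrative interpretation, not the benchmark's official measure:

```python
# A system is dominated if another system is at least as fast and at
# least as effective, and strictly better in at least one dimension.
# Pareto-optimal systems are those that are not dominated.

def pareto_frontier(systems):
    frontier = []
    for name, latency, quality in systems:
        dominated = any(
            l2 <= latency and q2 >= quality and (l2 < latency or q2 > quality)
            for _, l2, q2 in systems
        )
        if not dominated:
            frontier.append(name)
    return frontier

# Hypothetical measurements: (name, latency in ms, nDCG@10).
systems = [("A", 12.0, 0.42), ("B", 30.0, 0.45), ("C", 28.0, 0.41)]
print(pareto_frontier(systems))
```

Here C is dominated by A (faster and more effective), while A and B represent different trade-off points and both remain on the frontier.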