
lsr-benchmark


CLI • Python API • Citation

The lsr-benchmark aims to support holistic evaluations of the learned sparse retrieval paradigm to contrast efficiency and effectiveness across diverse retrieval scenarios.

Task

The learned sparse retrieval paradigm conducts retrieval in three steps:

  1. Documents are segmented into passages so that the passages can be processed by pre-trained transformers.
  2. Documents and queries are embedded into sparse learned embeddings.
  3. Retrieval systems create an index of the document embeddings to return a ranking for each embedded query.

You can submit solutions to step 2 (i.e., models that embed documents and queries into sparse embeddings) and/or solutions to step 3 (i.e., retrieval systems). The idea is then to validate all combinations of embeddings with all retrieval systems to identify which solutions work well for which use case, taking different notions of efficiency/effectiveness trade-offs into consideration. The passage segmentation for step 1 is open source (i.e., created via lsr-benchmark segment-corpus <IR-DATASETS-ID>) but fixed for this task.
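
For intuition, the sketch below shows what a step-2 solution produces. It is not the official submission interface; it assumes a SPLADE-style encoder from the transformers library, and the checkpoint name is only an example. The encoder maps a text to a sparse mapping from vocabulary terms to weights.

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "naver/splade-cocondenser-ensembledistil"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

def sparse_embedding(text):
    # tokenize and run the masked language model head
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits  # (1, sequence_length, vocab_size)
    # SPLADE activation: max over positions of log(1 + ReLU(logits))
    weights = torch.log1p(torch.relu(logits)).max(dim=1).values.squeeze(0)
    term_ids = weights.nonzero().squeeze(-1).tolist()
    return {tokenizer.convert_ids_to_tokens(i): weights[i].item() for i in term_ids}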

Installation

You can install the lsr-benchmark via:

pip3 install lsr-benchmark

If you want the latest features, you can install from the main branch:

pip3 install git+https://github.com/reneuir/lsr-benchmark.git
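
Either way, the lsr-benchmark command should afterwards be on your PATH; its built-in help prints an overview of all subcommands (see Documentation and Tutorials below):

lsr-benchmark --help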

Supported Corpora and Embeddings

Please run lsr-benchmark overview for an up-to-date overview of all datasets and embeddings. Alternatively, the online overview in TIRA provides the same information.

Running Tests

We have a suite of unit tests that you can run via:

# first install the local version of the lsr-benchmark
pip3 install -e ".[dev,test]"
# then run the unit tests
pytest .

Documentation and Tutorials

We have a set of tutorials available.

The lsr-benchmark --help command serves as the entry point to the documentation.

Instructions to add new datasets are available in the data directory.

  • ToDo: Document how to add new datasets, embeddings, retrieval systems, and evaluations
    • short video

Data

The data input and output formats are designed to support slicing and dicing diverse query and document distributions while enabling caching, allowing for GreenIR research.

You can slice and dice the document texts and document embeddings via the API. The document texts for private corpora are only available within the TIRA sandbox, whereas the document embeddings are publicly available for all corpora (as one cannot reconstruct the original documents from sparse embeddings).

import ir_datasets
import lsr_benchmark

dataset = lsr_benchmark.load('<IR-DATASETS-ID>')

# process one embedding per document (first-passage aggregation):
for doc in dataset.docs_iter(embedding='<EMBEDDING-MODEL>', passage_aggregation="first-passage"):
    doc # namedtuple<doc_id, embedding>

# process the document embeddings for all segments:
for doc in dataset.docs_iter(embedding='<EMBEDDING-MODEL>'):
    doc # namedtuple<doc_id, segments.embedding>

# process the document texts:
for doc in dataset.docs_iter(embedding=None):
    doc # namedtuple<doc_id, segments.text>

# process the document texts via segmented versions in ir_datasets:
lsr_benchmark.register_to_ir_datasets()
for segmented_doc in ir_datasets.load('lsr-benchmark/<IR-DATASETS-ID>/segmented').docs_iter():
    segmented_doc # namedtuple<doc_id, segment>

Format of Document Texts

Inspired by the processing of MS MARCO v2.1, each document consists of a doc_id and a list of text segments that are short enough to be processed by pre-trained transformers. For instance, a document that consists of 4 passages (e.g., "text-of-passage-1 text-of-passage-2 text-of-passage-3 text-of-passage-4") would be represented as:

  • doc_id: 12fd3396-e4d7-4c0f-b468-5a82402b5336
  • segments:
    • {"start": 1, "end": 2, "text": "text-of-passage-1 text-of-passage-2"}
    • {"start": 2, "end": 3, "text": "text-of-passage-2 text-of-passage-3"}
    • {"start": 3, "end": 4, "text": "text-of-passage-3 text-of-passage-4"}

Format of Document Embeddings

Analogously, each document consists of a doc_id and a list of segment embeddings, where each sparse embedding maps terms to weights. For instance, a document that consists of 4 passages would be represented as:

  • doc_id: 12fd3396-e4d7-4c0f-b468-5a82402b5336
  • segments:
    • {"start": 1, "end": 2, "embedding": {"term-1": 0.123, "term-2": 0.912}}
    • {"start": 2, "end": 3, "embedding": {"term-1": 0.421, "term-3": 0.743}}
    • {"start": 3, "end": 4, "embedding": {"term-2": 0.108, "term-4": 0.043}}

Evaluation

The online overview in TIRA provides aggregated evaluation results. Alternatively, all data and further custom evaluations are available in the step-04-evaluation directory of this repository.

Our evaluation methodology encourages the development of diverse and novel measures for LSR models that take efficiency and effectiveness into consideration. We assume that a suitable interpretation of efficiency for a target task depends heavily on the application and its context. Therefore, we aim to measure as many efficiency-oriented aspects as possible in a standardized way with the tirex-tracker so that different efficiency/effectiveness interpretations can be evaluated post hoc. This methodology and related aspects were developed as part of the ReNeuIR workshop series held at SIGIR 2022, 2023, 2024, and 2025.
