CLI tools for running BM25 benchmark evals.

Project description

BM25 Benchmarks

CLI

Installation

From PyPI with pip:

pip install bm25-benchmarks

From GitHub with pip:

pip install "bm25-benchmarks @ git+https://github.com/xhluca/bm25-benchmarks.git"

With uv as a globally available tool:

uv tool install bm25-benchmarks

With uv into the current virtual environment:

uv pip install "bm25-benchmarks @ git+https://github.com/xhluca/bm25-benchmarks.git"

For local development:

pip install -e "."
# or
uv pip install -e "."

The default install includes bm25s. The bm25s backend uses the dataset helpers shipped by bm25s, so the default CLI path does not install beir. Install another backend with bm25-benchmark install rank, bm25-benchmark install bm25-pt, bm25-benchmark install pyserini, bm25-benchmark install elastic, bm25-benchmark install pisa, or bm25-benchmark install all.
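
For example, to add the Pyserini and Elasticsearch backends on top of the default install, using the install subcommands listed above:

# Add individual backends on top of the default bm25s install
bm25-benchmark install pyserini
bm25-benchmark install elastic

# Or install every supported backend at once
bm25-benchmark install all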

For rank-bm25, use the CLI installer so the pinned Git dependency is installed for you. PyPI rejects direct Git dependencies in package metadata, so the pin is kept in requirements-rank-bm25.txt and applied by the CLI installer:

bm25-benchmark install rank
bm25-benchmark install all
bm25-benchmark install rank --installer uv

Usage

The benchmark package exposes one CLI for running evals:

bm25-benchmark --help
bm25-benchmark models
bm25-benchmark datasets

Run an eval by choosing a backend and dataset:

bm25-benchmark eval bm25s -d fiqa
bm25-benchmark eval rank-bm25 -d fiqa --samples 1000
bm25-benchmark eval pyserini -d fiqa --threads 4
bm25-benchmark eval elastic -d fiqa --hostname localhost
bm25-benchmark eval pisa -d fiqa
bm25-benchmark eval bm25-pt -d fiqa --batch-size 32

Common eval options include:

bm25-benchmark eval bm25s -d fiqa -d scifact --result-dir results --save-dir datasets
bm25-benchmark eval bm25s -d fiqa,scifact --num-runs 3
bm25-benchmark eval bm25s -d fiqa --dry-run

Use bm25-benchmark eval <backend> --help to see backend-specific options.

The module form is also available:

python -m benchmark eval bm25s -d fiqa

Running Benchmarks

BM25S options

For bm25s, you can specify which scoring methods and retrieval backends to benchmark:

# Default: runs the jit scorer and numba backend
bm25-benchmark eval bm25s -d fiqa

# Specify scorers (uncompiled, legacy, jit)
bm25-benchmark eval bm25s -d fiqa --scorers legacy jit
bm25-benchmark eval bm25s -d fiqa --scorers jit
bm25-benchmark eval bm25s -d fiqa --scorers uncompiled legacy jit

# Specify backends (jax, numba, numpy)
bm25-benchmark eval bm25s -d fiqa --backends numba
bm25-benchmark eval bm25s -d fiqa --backends jax numba numpy

# Combine both
bm25-benchmark eval bm25s -d fiqa --scorers jit --backends numba

Scorer options:

  • uncompiled: Default NumPy implementation (optimized with np.add.at)
  • legacy: Legacy implementation (similar to uncompiled, kept for comparison)
  • jit: Numba JIT-compiled version (fastest after warmup)

Backend options:

  • jax: JAX-based retrieval
  • numba: Numba JIT-compiled retrieval (default)
  • numpy: Pure NumPy retrieval
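
To compare the retrieval backends under a single scorer, you can sweep them in a small shell loop. This is a minimal sketch assuming the --scorers, --backends, and --result-dir flags shown above can be combined in one invocation:

# Run the jit scorer against each retrieval backend on fiqa,
# writing each run's output to its own result directory
for backend in jax numba numpy; do
    bm25-benchmark eval bm25s -d fiqa --scorers jit --backends "$backend" \
        --result-dir "results/bm25s-$backend"
done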

Available datasets

The available datasets are public BEIR datasets: trec-covid, nfcorpus, fiqa, arguana, webis-touche2020, quora, scidocs, scifact, cqadupstack, nq, msmarco, hotpotqa, dbpedia-entity, fever, and climate-fever.

Sampling during benchmarking

For rank-bm25, due to the long runtime, we can sample queries instead of running the full query set:

bm25-benchmark eval rank-bm25 -d "<dataset>" --samples <num_samples>

Rank-bm25 variants

For rank-bm25, we can also specify the BM25 variant to use with --method:

  • rank (default)
  • bm25l
  • bm25+
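
For example, to benchmark all three variants with sampled queries, here is a minimal sketch combining the --method and --samples flags shown above (fiqa is used purely as an illustrative dataset):

# Benchmark each rank-bm25 variant on fiqa with 1000 sampled queries
for method in rank bm25l bm25+; do
    bm25-benchmark eval rank-bm25 -d fiqa --method "$method" --samples 1000
done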

Results will be saved in the results/ directory.

Elasticsearch server

If you want to use the Elasticsearch backend, you need to start the server first.

First, download Elasticsearch from the official downloads page. You will get an archive, e.g. elasticsearch-8.14.0-linux-x86_64.tar.gz. Extract it and ensure the extracted directory is in the same directory as the bm25-benchmarks directory.

wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-8.14.0-linux-x86_64.tar.gz
tar -xzf elasticsearch-8.14.0-linux-x86_64.tar.gz
# remove the tar file
rm elasticsearch-8.14.0-linux-x86_64.tar.gz

Then, start the server with the following command:

./elasticsearch-8.14.0/bin/elasticsearch -E xpack.security.enabled=false -E thread_pool.search.size=1 -E thread_pool.write.size=1
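
Once the server is running, you can sanity-check it and then point the eval at it. This is a sketch assuming the default Elasticsearch port 9200 and a local server:

# Elasticsearch listens on port 9200 by default; with security disabled
# as above, a plain HTTP request should return the cluster info JSON.
curl http://localhost:9200

# Then run the eval against the local server
bm25-benchmark eval elastic -d fiqa --hostname localhost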

Results

The results are benchmarked using Kaggle notebooks to ensure reproducibility. Each run uses a single core of an Intel Xeon CPU @ 2.20GHz with 30GB of RAM.

The shorthands used are:

  • BM25PT for bm25_pt
  • PSRN for pyserini
  • R-BM25 for rank-bm25
  • BM25S for bm25s, and BM25S+J for the Numba JIT version of bm25s (v0.2.0+)
  • ES for elasticsearch
  • PISA for the Pisa Engine (via the pyterrier_pisa Python bindings)
  • OOM for out-of-memory error
  • DNT for did not terminate (i.e. went over 12 hours)

Queries per second

dataset PISA BM25S+J BM25S ES PSRN PT R-BM25
arguana 270.53 869.95 573.91 13.67 11.95 110.51 2
climate-fever 35.95 38.49 13.09 4.02 8.06 OOM 0.03
cqadupstack 362.39 396.5 170.91 13.38 DNT OOM 0.77
dbpedia-entity 197.45 71.8 13.44 10.68 12.69 OOM 0.11
fever 81.42 53.84 20.19 7.45 10.52 OOM 0.06
fiqa 714.35 1237.39 717.78 16.96 12.51 20.52 4.46
hotpotqa 54.98 47.16 20.88 7.11 10.41 OOM 0.04
msmarco 178.65 39.18 12.2 11.88 11.01 OOM 0.07
nfcorpus 5111.72 5696.21 1196.16 45.84 32.94 256.67 224.66
nq 168.12 109.47 41.85 12.16 11.04 OOM 0.1
quora 735.20 479.71 272.04 21.8 15.58 6.49 1.18
scidocs 818.97 1448.32 767.05 17.93 14.1 41.34 9.01
scifact 1463.73 2787.84 1317.12 20.81 15.02 184.3 47.6
trec-covid 282.94 483.84 85.64 7.34 8.53 3.73 1.48
webis-touche2020 431.12 390.03 60.59 13.53 12.36 OOM 1.1

Notes:

  • For Rank-BM25, larger datasets are run with 1000 samples rather than the full dataset to ensure the run finishes within 12h (the limit for Kaggle notebooks).
  • For ES and BM25S, the number of threads can be set. However, you might not see an improvement; in the case of BM25S, throughput may even decrease due to how multi-threading is implemented. The results are shown below.
BM25S & ES multi-threaded (4T) performance (Q/s):
dataset PISA BM25S ES
arguana 590.93 211 33.37
climate-fever 91.68 22.06 8.13
cqadupstack 945.66 248.87 27.76
dbpedia-entity 478.26 26.18 15.49
fever 222.08 47.03 14.07
fiqa 1382.32 449.82 36.33
hotpotqa 134.60 45.02 10.35
msmarco 393.16 21.64 18.19
nfcorpus 6706.53 784.24 81.07
nq 423.54 77.49 19.18
quora 1892.98 308.58 43.02
scidocs 1757.44 614.23 46.36
scifact 2480.86 645.88 50.93
trec-covid 676.40 100.88 13.5
webis-touche2020 938.57 202.39 26.55
Queries per second normalized with respect to Rank-BM25:
dataset PISA BM25S ES PSRN PT Rank
arguana 135.27 286.96 6.84 5.98 55.26 1
climate-fever 1198.33 436.33 134 268.67 nan 1
cqadupstack 470.64 221.96 17.38 nan nan 1
dbpedia-entity 1795.00 122.18 97.09 115.36 nan 1
fever 1357.00 336.5 124.17 175.33 nan 1
fiqa 160.17 160.94 3.8 2.8 4.6 1
hotpotqa 1374.50 522 177.75 260.25 nan 1
msmarco 2552.14 174.29 169.71 157.29 nan 1
nfcorpus 22.75 5.32 0.2 0.15 1.14 1
nq 1681.20 418.5 121.6 110.4 nan 1
quora 623.05 230.54 18.47 13.2 5.5 1
scidocs 90.90 85.13 1.99 1.56 4.59 1
scifact 30.75 27.67 0.44 0.32 3.87 1
trec-covid 191.18 57.86 4.96 5.76 2.52 1
webis-touche2020 391.93 55.08 12.3 11.24 nan 1

Stats

dataset # Docs # Queries # Tokens
msmarco 8,841,823 6,980 340,859,891
hotpotqa 5,233,329 7,405 169,530,287
trec-covid 171,332 50 20,231,412
webis-touche2020 382,545 49 74,180,340
arguana 8,674 1,406 947,470
fiqa 57,638 648 5,189,035
nfcorpus 3,633 323 614,081
climate-fever 5,416,593 1,535 318,190,120
nq 2,681,468 3,452 148,249,808
scidocs 25,657 1,000 3,211,248
quora 522,931 10,000 4,202,123
dbpedia-entity 4,635,922 400 162,336,256
cqadupstack 457,199 13,145 44,857,487
fever 5,416,568 6,666 318,184,321
scifact 5,183 300 812,074

Indexing time (docs/s)

The following results follow the same setup as the queries/s benchmarks described above (single-core).

dataset PISA BM25S ES PSRN PT Rank
arguana 3432.50 4314.79 3591.63 1225.18 638.1 5021.3
climate-fever 5462.73 4364.43 3825.89 6880.42 nan 7085.51
cqadupstack 3963.76 4800.89 3725.43 nan nan 5370.32
dbpedia-entity 9019.62 7576.28 6333.82 8501.7 nan 9110.36
fever 4903.06 4921.88 3879.63 7007.5 nan 5482.64
fiqa 4426.92 5959.25 4035.11 3735.38 421.51 6455.53
hotpotqa 9883.85 7420.39 5455.6 10342.5 nan 9407.9
msmarco 10205.53 7480.71 5391.29 9686.07 nan 12455.9
nfcorpus 2381.11 3169.4 1688.15 692.05 442.2 3579.47
nq 7122.05 6083.86 5742.13 6652.33 nan 6048.85
quora 38512.02 28002.4 8189.75 22818.5 6251.26 47609.2
scidocs 3085.13 4107.46 3008.45 2137.64 312.72 4232.15
scifact 2449.91 3253.63 2649.57 880.53 442.61 3792.84
trec-covid 4642.59 4600.14 2966.98 3768.1 406.37 4672.62
webis-touche2020 2228.10 2971.96 2484.87 2718.41 nan 3115.96

NDCG@10

We use the following abbreviations for the BEIR benchmark datasets:

  • AG for arguana
  • CD for cqadupstack
  • CF for climate-fever
  • DB for dbpedia-entity
  • FQ for fiqa
  • FV for fever
  • HP for hotpotqa
  • MS for msmarco
  • NF for nfcorpus
  • NQ for nq
  • QR for quora
  • SD for scidocs
  • SF for scifact
  • TC for trec-covid
  • WT for webis-touche2020
k1 b method Avg. AG CD CF DB FQ FV HP MS NF NQ QR SD SF TC WT
0.9 0.4 Lucene 41.1 40.8 28.2 16.2 31.9 23.8 63.8 62.9 22.8 31.8 30.5 78.7 15.0 67.6 58.9 44.2
1.2 0.75 ATIRE 39.9 48.7 30.1 13.7 30.3 25.3 50.3 58.5 22.6 31.8 29.1 80.5 15.6 68.1 61.0 33.2
1.2 0.75 BM25+ 39.9 48.7 30.1 13.7 30.3 25.3 50.3 58.5 22.6 31.8 29.1 80.5 15.6 68.1 61.0 33.2
1.2 0.75 BM25L 39.5 49.6 29.8 13.5 29.4 25.0 46.6 55.9 21.4 32.2 28.1 80.3 15.8 68.7 62.9 33.0
1.2 0.75 Lucene 39.9 48.7 30.1 13.7 30.3 25.3 50.3 58.5 22.6 31.8 29.1 80.5 15.6 68.0 61.0 33.2
1.2 0.75 Robertson 39.9 49.2 29.9 13.7 30.3 25.4 50.3 58.5 22.6 31.9 29.2 80.4 15.5 68.3 59.0 33.8
1.5 0.75 ES 42.0 47.7 29.8 17.8 31.1 25.3 62.0 58.6 22.1 34.4 31.6 80.6 16.3 69.0 68.0 35.4
1.5 0.75 Lucene 39.7 49.3 29.9 13.6 29.9 25.1 48.1 56.9 21.9 32.1 28.5 80.4 15.8 68.7 62.3 33.1
1.5 0.75 PSRN 40.0 48.4 29.8 14.2 30.0 25.3 50.0 57.6 22.1 32.6 28.6 80.6 15.6 68.8 63.4 33.5
1.5 0.75 PT 45.0 44.9 -- -- -- 22.5 -- -- -- 31.9 -- 75.1 14.7 67.8 58.0 --
1.5 0.75 Rank 39.6 49.5 29.6 13.6 29.9 25.3 49.3 58.1 21.1 32.1 28.5 80.3 15.8 68.5 60.1 32.9
1.2 0.75 PISA 38.8 41.1 27.8 13.9 30.5 24.5 49.2 58.2 22.8 34.3 28.2 72.0 15.7 68.9 64.2 30.9

Recall@1000

k1 b method Avg. AG CD CF DB FQ FV HP MS NF NQ QR SD SF TC WT
0.9 0.4 Lucene 77.3 98.8 71.1 63.3 67.5 74.3 95.7 88.0 85.3 47.7 89.6 99.5 56.5 97.0 39.2 86.0
1.2 0.75 ATIRE 77.4 99.3 73.0 59.0 67.0 76.5 94.2 86.8 85.7 47.8 89.8 99.5 57.3 97.0 40.3 87.2
1.2 0.75 BM25+ 77.4 99.3 73.0 59.0 67.0 76.5 94.2 86.8 85.7 47.8 89.8 99.5 57.3 97.0 40.3 87.2
1.2 0.75 BM25L 77.2 99.4 73.4 57.3 66.1 77.3 93.7 85.7 85.0 47.7 89.3 99.5 57.7 97.0 40.8 87.5
1.2 0.75 Lucene 77.4 99.3 73.0 59.0 67.0 76.5 94.2 86.8 85.6 47.8 89.8 99.5 57.3 97.0 40.3 87.2
1.2 0.75 Robertson 77.4 99.3 73.2 59.1 66.7 76.8 94.2 86.8 85.9 47.5 89.8 99.5 57.3 96.7 40.2 87.4
1.5 0.75 ES 76.9 99.2 74.2 58.8 63.6 76.7 95.9 85.2 85.1 39.0 90.8 99.6 57.9 98.0 41.3 88.0
1.5 0.75 Lucene 77.2 99.3 73.3 57.8 66.3 77.2 93.8 86.1 85.2 47.7 89.5 99.6 57.5 97.0 40.6 87.4
1.5 0.75 PSRN 76.7 99.2 74.2 58.7 66.2 76.7 94.2 86.4 85.1 37.1 89.4 99.6 57.4 97.7 41.1 87.2
1.5 0.75 PT 73.0 98.3 -- -- -- 72.5 -- -- -- 51.0 -- 98.9 56.0 97.8 36.3 --
1.5 0.75 Rank 77.1 99.4 73.4 57.5 66.4 77.4 93.6 87.7 82.6 47.6 89.5 99.5 57.4 96.7 40.5 87.5
1.2 0.75 PISA 77.1 98.7 72.2 60.2 67.7 76.5 93.7 86.8 86.9 38.4 89.1 98.9 56.9 97.0 45.9 87.4

Legacy Module Entry Points

Prefer the bm25-benchmark eval ... CLI for new runs. The older module entry points are still available for existing scripts:

# For bm25_pt
python -m benchmark.on_bm25_pt -d "<dataset>"

# For rank-bm25
python -m benchmark.on_rank_bm25 -d "<dataset>"

# For Pyserini
python -m benchmark.on_pyserini -d "<dataset>"

# For elastic, after starting the server, run:
python -m benchmark.on_elastic -d "<dataset>"

# For PISA
python -m benchmark.on_pisa -d "<dataset>"

# For bm25s
python -m benchmark.on_bm25s -d "<dataset>"

where <dataset> is the name of the dataset to be used.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

bm25_benchmarks-0.0.4.tar.gz (36.3 kB)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

bm25_benchmarks-0.0.4-py3-none-any.whl (40.3 kB)

Uploaded Python 3

File details

Details for the file bm25_benchmarks-0.0.4.tar.gz.

File metadata

  • Download URL: bm25_benchmarks-0.0.4.tar.gz
  • Upload date:
  • Size: 36.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for bm25_benchmarks-0.0.4.tar.gz
Algorithm Hash digest
SHA256 c5bf0b64d785ceea8c9fbdea09022a6d6f42a95bda08f68f061048b316f2c8ae
MD5 89c6a5af83ffbe99fc32c99ed950a89d
BLAKE2b-256 39011f50c87d11d6535224b69090512190421b4abb20331ecfa8f6bdd6b8e75d

See more details on using hashes here.
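
For example, on Linux you can recompute the SHA256 digest of the downloaded sdist with the standard coreutils tool and compare it against the value listed above:

# Recompute the SHA256 digest of the published sdist and compare it
# with the SHA256 value in the table above
sha256sum bm25_benchmarks-0.0.4.tar.gz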

Provenance

The following attestation bundles were made for bm25_benchmarks-0.0.4.tar.gz:

Publisher: publish.yml on xhluca/bm25-benchmarks

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file bm25_benchmarks-0.0.4-py3-none-any.whl.

File metadata

  • Download URL: bm25_benchmarks-0.0.4-py3-none-any.whl
  • Upload date:
  • Size: 40.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for bm25_benchmarks-0.0.4-py3-none-any.whl
Algorithm Hash digest
SHA256 96ee305819d84a2cdbdce2aaec7ec82968337394052ee427e288b3185175937a
MD5 175b0755008cf21c46662bef12781a58
BLAKE2b-256 5f5680a96bbc8fa8a3b331c0c3bdaa35279d27b1d21027156584a49bef450cea

See more details on using hashes here.

Provenance

The following attestation bundles were made for bm25_benchmarks-0.0.4-py3-none-any.whl:

Publisher: publish.yml on xhluca/bm25-benchmarks

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
