Bayesian BM25 scoring and experimental validation (Rust core + Python bindings)

bb25 (Bayesian BM25)

bb25 is a fast, self-contained BM25 + Bayesian calibration implementation with a minimal Python API. It also includes a small reference corpus and experiment suite so you can validate the expected numerical properties.

Install

pip install bb25

Quick start

Use the built-in corpus and queries

import bb25 as bb

corpus = bb.build_default_corpus()
docs = corpus.documents()
queries = bb.build_default_queries()

# 1.2 and 0.75 match the conventional BM25 k1 and b values
bm25 = bb.BM25Scorer(corpus, 1.2, 0.75)
score = bm25.score(queries[0].terms, docs[0])
print("score0", score)

Build your own corpus

import bb25 as bb

corpus = bb.Corpus()
corpus.add_document("d1", "neural networks for ranking", [0.1] * 8)
corpus.add_document("d2", "bm25 is a strong baseline", [0.2] * 8)
corpus.build_index()  # must be called before creating scorers

bm25 = bb.BM25Scorer(corpus, 1.2, 0.75)
print(bm25.idf("bm25"))
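
For intuition about what the scorer computes, here is a minimal pure-Python sketch of textbook Okapi BM25. It is not the bb25 Rust implementation: the function names `idf` and `bm25_score`, the whitespace tokenization, and the Lucene-style IDF variant are assumptions made for illustration, and the values it prints may differ from what bb25 returns.

```python
import math
from collections import Counter

def idf(term, corpus_tokens):
    """Lucene-style IDF (an assumed variant); it is never negative."""
    n_docs = len(corpus_tokens)
    df = sum(1 for doc in corpus_tokens if term in doc)
    return math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))

def bm25_score(query_terms, doc_tokens, corpus_tokens, k1=1.2, b=0.75):
    """Textbook Okapi BM25 score of one document for a bag of query terms."""
    avgdl = sum(len(doc) for doc in corpus_tokens) / len(corpus_tokens)
    tf = Counter(doc_tokens)
    # Length normalization: longer-than-average documents are penalized.
    norm = k1 * (1.0 - b + b * len(doc_tokens) / avgdl)
    score = 0.0
    for term in query_terms:
        if tf[term] == 0:
            continue  # a term absent from the document contributes nothing
        score += idf(term, corpus_tokens) * tf[term] * (k1 + 1.0) / (tf[term] + norm)
    return score

# Hypothetical tokenization: lowercase text split on whitespace.
corpus_tokens = [
    "neural networks for ranking".split(),
    "bm25 is a strong baseline".split(),
]
print(idf("bm25", corpus_tokens))
print(bm25_score(["bm25"], corpus_tokens[1], corpus_tokens))
```

A term that appears in fewer documents gets a larger IDF, and a document that does not contain any query term scores exactly 0.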

Bayesian calibration + hybrid fusion

import bb25 as bb

corpus = bb.build_default_corpus()
docs = corpus.documents()
queries = bb.build_default_queries()

bm25 = bb.BM25Scorer(corpus, 1.2, 0.75)
bayes = bb.BayesianBM25Scorer(bm25, 1.0, 0.5)
vector = bb.VectorScorer()
hybrid = bb.HybridScorer(bayes, vector)

q = queries[0]
prob_or = hybrid.score_or(q.terms, q.embedding, docs[0])
prob_and = hybrid.score_and(q.terms, q.embedding, docs[0])
print("OR", prob_or, "AND", prob_and)
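
The idea behind calibration and probabilistic fusion can be sketched in plain Python. This is an illustration under stated assumptions, not the bb25 internals: it assumes the two numeric arguments to `BayesianBM25Scorer` act as a sigmoid slope and midpoint (called `alpha` and `beta` here), that cosine similarity is rescaled linearly to a probability, and that `score_or` / `score_and` correspond to noisy-OR and product combination of independent signals. The paper's formulation (for example, its treatment of priors) may differ.

```python
import math

def sigmoid_calibrate(score, alpha=1.0, beta=0.5):
    """Map a raw BM25 score to (0, 1) with a logistic curve (assumed form)."""
    return 1.0 / (1.0 + math.exp(-alpha * (score - beta)))

def cosine_to_prob(cosine):
    """Rescale cosine similarity from [-1, 1] to [0, 1] (assumed mapping)."""
    return (1.0 + cosine) / 2.0

def prob_or(p_text, p_vec):
    """Noisy-OR: high if either signal is high (independence assumed)."""
    return 1.0 - (1.0 - p_text) * (1.0 - p_vec)

def prob_and(p_text, p_vec):
    """Product rule: high only if both signals are high (independence assumed)."""
    return p_text * p_vec

p_text = sigmoid_calibrate(2.3)   # hypothetical raw BM25 score
p_vec = cosine_to_prob(0.6)       # hypothetical cosine similarity
print("OR", prob_or(p_text, p_vec), "AND", prob_and(p_text, p_vec))
```

Because both inputs are probabilities, the OR result is never below the larger input and the AND result is never above the smaller one, which is why the two calls in the quick start bracket the individual scores.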

Run the experiments

import bb25 as bb

results = bb.run_experiments()
print(all(r.passed for r in results))

Sample script

See docs/sample_usage.py for an end-to-end example using BM25, Bayesian calibration, and hybrid fusion.

Benchmarks (BM25 vs Bayesian)

See benchmarks/README.md for a lightweight runner that compares BM25 and Bayesian BM25 on your own corpora.

English Benchmark (SQuAD, 100 validation queries)

On this benchmark, the Bayesian hybrid (bb25 + dense) outperforms the classic BM25 hybrid on both metrics.

Method            NDCG@10  MRR@10  Notes
WS (BB25+Dense)   0.9149   0.8850  Best result in this comparison
WS (BM25+Dense)   0.9051   0.8717
RRF (BM25+Dense)  0.8874   0.8483  RRF underperforms weighted sum

WS = weighted sum; RRF = reciprocal rank fusion.
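
For readers unfamiliar with the two metrics, the sketch below shows how they are commonly computed for a single query. It assumes binary relevance labels and the standard log2 discount; the function names are hypothetical, and the benchmark runner may compute the metrics differently. Reported figures are presumably means over the 100 queries.

```python
import math

def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant document in the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """NDCG with binary gains: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

ranked = ["d7", "d2", "d9"]   # hypothetical system output for one query
relevant = {"d2"}             # hypothetical gold label
print(mrr_at_k(ranked, relevant))    # 0.5
print(ndcg_at_k(ranked, relevant))   # 1 / log2(3), about 0.6309
```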

Conclusion

"Bayesian BM25 (bb25) has demonstrated the potential to outperform classic BM25 in hybrid search."

On the English dataset (SQuAD), combining bb25 with a dense retriever (BGE-M3) scores higher than the BM25 + dense baseline: NDCG@10 rises from 0.9051 to 0.9149, about +1.0 percentage point. This suggests that the probabilistic score from bb25 blends more smoothly with vector scores, with less scale mismatch than raw BM25 scores in a weighted sum.
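
To make the difference between the two fusion strategies concrete, here is a small sketch of weighted-sum fusion and reciprocal rank fusion (RRF). The min-max normalization, the weight `w=0.5`, the RRF constant `k=60`, and all scores are assumptions chosen for illustration; the settings used in the benchmark are not stated here.

```python
def min_max(scores):
    """Rescale a {doc_id: score} map to [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    return {d: ((s - lo) / span if span else 0.0) for d, s in scores.items()}

def weighted_sum(sparse, dense, w=0.5):
    """Blend normalized sparse and dense scores; score magnitudes matter."""
    sparse_n, dense_n = min_max(sparse), min_max(dense)
    docs = set(sparse_n) | set(dense_n)
    return {d: w * sparse_n.get(d, 0.0) + (1.0 - w) * dense_n.get(d, 0.0) for d in docs}

def rrf(rankings, k=60):
    """Reciprocal rank fusion: only rank positions matter, not scores."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return fused

def ranked(scores):
    """Document ids sorted by descending score."""
    return sorted(scores, key=scores.get, reverse=True)

sparse = {"d1": 12.0, "d2": 7.5, "d3": 1.0}    # hypothetical BM25 scores
dense = {"d1": 0.62, "d2": 0.80, "d3": 0.10}   # hypothetical cosine scores

ws = weighted_sum(sparse, dense)
fused = rrf([ranked(sparse), ranked(dense)])
print(ranked(ws))                     # ['d1', 'd2', 'd3']
print(fused["d1"] == fused["d2"])     # True: RRF sees a tie
```

In this toy case d1 and d2 swap first and second place between the two rankings, so RRF cannot separate them, while the weighted sum uses the score margins to prefer d1. A calibrated probability in place of the raw sparse score would remove the need for the min-max step, which is one reading of the "less scale mismatch" argument.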

Original paper:

https://www.researchgate.net/publication/400212695_Bayesian_BM25_A_Probabilistic_Framework_for_Hybrid_Text_and_Vector_Search

Build from source (Rust)

make build

PyPI publishing

Build a wheel with maturin:

python -m pip install maturin
maturin build --release

For Pyodide builds, see docs/pyodide.md.
