Bayesian probability transforms for BM25 retrieval scores
Bayesian BM25
A probabilistic framework that converts raw BM25 retrieval scores into calibrated relevance probabilities using Bayesian inference.
Overview
Standard BM25 produces unbounded scores that lack consistent meaning across queries, making threshold-based filtering and multi-signal fusion unreliable. Bayesian BM25 addresses this by applying a sigmoid likelihood model with a composite prior (term frequency + document length normalization) and computing Bayesian posteriors that output well-calibrated probabilities in [0, 1].
Key capabilities:
- Score-to-probability transform -- convert raw BM25 scores into calibrated relevance probabilities via sigmoid likelihood + composite prior + Bayesian posterior
- Parameter learning -- batch gradient descent or online SGD with EMA-smoothed gradients and Polyak averaging
- Probabilistic fusion -- combine multiple probability signals using log-odds conjunction, which avoids the shrinkage problem of naive probabilistic AND (multiplying probabilities drives the result toward zero as more signals are added)
- Search integration -- drop-in scorer wrapping bm25s that returns probabilities instead of raw scores
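To make the score-to-probability step concrete, here is a minimal sketch in plain Python. It assumes the likelihood is a sigmoid of the form sigmoid(alpha * (score - beta)) and combines it with a prior through Bayes' rule for a binary relevance variable. The `composite_prior` function and all of its constants are hypothetical, chosen only for illustration; they are not taken from the library, whose actual prior may differ.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def composite_prior(tf, doc_len_ratio):
    """Hypothetical composite prior; the constants are illustrative assumptions."""
    # Higher term frequency raises the prior, saturating at tf = 10.
    tf_prior = 0.2 + 0.7 * min(1.0, tf / 10.0)
    # Documents near average length (ratio close to 1.0) get a higher prior.
    len_prior = 0.3 + 0.6 * (1.0 - min(1.0, abs(doc_len_ratio - 1.0)))
    prior = 0.7 * tf_prior + 0.3 * len_prior
    # Clamp so the prior never fully overrides the likelihood.
    return min(0.9, max(0.1, prior))


def posterior(score, tf, doc_len_ratio, alpha=1.5, beta=1.0):
    """Bayes' rule for a binary event: L*p / (L*p + (1 - L)*(1 - p))."""
    likelihood = sigmoid(alpha * (score - beta))
    prior = composite_prior(tf, doc_len_ratio)
    numerator = likelihood * prior
    return numerator / (numerator + (1.0 - likelihood) * (1.0 - prior))


# Same inputs as the Quick Start example.
scores = [0.5, 1.0, 1.5, 2.0, 3.0]
tfs = [1, 2, 3, 5, 8]
doc_len_ratios = [0.3, 0.5, 0.8, 1.0, 1.5]
probabilities = [posterior(s, t, r) for s, t, r in zip(scores, tfs, doc_len_ratios)]
print(probabilities)
```

When the score equals `beta`, the likelihood is 0.5 and the posterior reduces to the prior, which is a quick sanity check on the formula.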
Installation
pip install bayesian-bm25
To use the integrated search scorer (requires bm25s):
pip install "bayesian-bm25[scorer]"
Quick Start
Converting BM25 Scores to Probabilities
import numpy as np
from bayesian_bm25 import BayesianProbabilityTransform
transform = BayesianProbabilityTransform(alpha=1.5, beta=1.0)
scores = np.array([0.5, 1.0, 1.5, 2.0, 3.0])
tfs = np.array([1, 2, 3, 5, 8])
doc_len_ratios = np.array([0.3, 0.5, 0.8, 1.0, 1.5])
probabilities = transform.score_to_probability(scores, tfs, doc_len_ratios)
End-to-End Search with Probabilities
from bayesian_bm25 import BayesianBM25Scorer
corpus_tokens = [
    ["python", "machine", "learning"],
    ["deep", "learning", "neural", "networks"],
    ["data", "visualization", "tools"],
]
scorer = BayesianBM25Scorer(k1=1.2, b=0.75, method="lucene")
scorer.index(corpus_tokens, show_progress=False)
doc_ids, probabilities = scorer.retrieve([["machine", "learning"]], k=3)
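For context on what the scorer wraps, the sketch below computes raw Lucene-style BM25 scores in plain Python for the same toy corpus. It is an illustration of the `k1` and `b` parameters under the commonly cited Lucene formulation, not the code path used by bm25s or by `BayesianBM25Scorer`; the function name `bm25_lucene_scores` is hypothetical.

```python
import math
from collections import Counter


def bm25_lucene_scores(corpus_tokens, query_tokens, k1=1.2, b=0.75):
    """Score every document against the query with a Lucene-style BM25."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs

    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for doc in corpus_tokens:
        doc_freq.update(set(doc))

    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        # Length normalization: b controls how strongly long documents are penalized.
        length_norm = k1 * (1.0 - b + b * len(doc) / avg_len)
        score = 0.0
        for term in query_tokens:
            if tf[term] == 0:
                continue
            df = doc_freq[term]
            idf = math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
            # k1 controls how quickly repeated terms saturate.
            score += idf * tf[term] / (tf[term] + length_norm)
        scores.append(score)
    return scores


corpus_tokens = [
    ["python", "machine", "learning"],
    ["deep", "learning", "neural", "networks"],
    ["data", "visualization", "tools"],
]
raw_scores = bm25_lucene_scores(corpus_tokens, ["machine", "learning"])
print(raw_scores)
```

The raw scores depend on corpus size and query length, and they have no fixed upper bound, which is why a calibration step is needed before thresholding or fusing them with other signals.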
Combining Multiple Signals
import numpy as np
from bayesian_bm25 import log_odds_conjunction, prob_and, prob_or
signals = np.array([0.85, 0.70, 0.60])
prob_and(signals) # 0.357 (shrinkage problem)
log_odds_conjunction(signals) # 0.773 (agreement-aware)
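The shrinkage effect can be reproduced without the library. The sketch below contrasts the naive product with a simple mean of log-odds mapped back through a sigmoid. The mean-of-log-odds rule is an assumption used here only to illustrate the idea; it is not the library's `log_odds_conjunction` and does not reproduce the 0.773 value shown above.

```python
import math


def naive_and(probs):
    """Naive probabilistic AND: the product of independent probabilities."""
    result = 1.0
    for p in probs:
        result *= p
    return result


def mean_log_odds(probs):
    """Average the signals in log-odds space, then map back to a probability.

    Inputs must lie strictly between 0 and 1, otherwise the logit is undefined.
    """
    logits = [math.log(p / (1.0 - p)) for p in probs]
    mean_logit = sum(logits) / len(logits)
    return 1.0 / (1.0 + math.exp(-mean_logit))


signals = [0.85, 0.70, 0.60]
print(naive_and(signals))      # about 0.357: below every individual signal
print(mean_log_odds(signals))  # about 0.730: stays within the range of the inputs
```

The product falls below the weakest input even though all three signals favor relevance, while the log-odds average stays between the smallest and largest signal.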
Online Learning from User Feedback
from bayesian_bm25 import BayesianProbabilityTransform
transform = BayesianProbabilityTransform(alpha=1.0, beta=0.0)
# Batch warmup on historical data
# (historical_scores and historical_labels are user-supplied and not defined here)
transform.fit(historical_scores, historical_labels)
# Online refinement from live feedback
# (feedback_stream is a user-supplied iterable of (score, label) pairs)
for score, label in feedback_stream:
    transform.update(score, label, learning_rate=0.01, momentum=0.95)
# Use Polyak-averaged parameters for stable inference
alpha = transform.averaged_alpha
beta = transform.averaged_beta
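The following sketch shows one way the pieces named above (online SGD, EMA-smoothed gradients, Polyak averaging) could fit together for a two-parameter sigmoid. `OnlineSigmoidCalibrator` is a hypothetical stand-in written for this example, not the library's `BayesianProbabilityTransform`; the log-loss objective and the synthetic feedback stream are assumptions.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


class OnlineSigmoidCalibrator:
    """Hypothetical online learner for p = sigmoid(alpha * (score - beta))."""

    def __init__(self, alpha=1.0, beta=0.0):
        self.alpha, self.beta = alpha, beta
        self.averaged_alpha, self.averaged_beta = alpha, beta
        self._ema_grad_alpha = 0.0
        self._ema_grad_beta = 0.0
        self._steps = 0

    def update(self, score, label, learning_rate=0.01, momentum=0.95):
        # Gradient of the log loss with respect to alpha and beta.
        error = sigmoid(self.alpha * (score - self.beta)) - label
        grad_alpha = error * (score - self.beta)
        grad_beta = -error * self.alpha
        # Exponential moving average smooths noisy per-sample gradients.
        self._ema_grad_alpha = momentum * self._ema_grad_alpha + (1.0 - momentum) * grad_alpha
        self._ema_grad_beta = momentum * self._ema_grad_beta + (1.0 - momentum) * grad_beta
        self.alpha -= learning_rate * self._ema_grad_alpha
        self.beta -= learning_rate * self._ema_grad_beta
        # Polyak averaging: running mean of the parameter iterates.
        self._steps += 1
        self.averaged_alpha += (self.alpha - self.averaged_alpha) / self._steps
        self.averaged_beta += (self.beta - self.averaged_beta) / self._steps

    def predict(self, score):
        """Predict with the averaged parameters for stability."""
        return sigmoid(self.averaged_alpha * (score - self.averaged_beta))


# Synthetic feedback (an assumption): scores above 2.0 are labeled relevant.
rng = random.Random(0)
calibrator = OnlineSigmoidCalibrator(alpha=1.0, beta=0.0)
for _ in range(3000):
    score = rng.uniform(0.0, 4.0)
    label = 1.0 if score > 2.0 else 0.0
    calibrator.update(score, label, learning_rate=0.05)

print(calibrator.averaged_alpha, calibrator.averaged_beta)
```

Averaging the iterates trades a little responsiveness for lower variance, which is why the averaged parameters are the natural choice at inference time.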
Citation
If you use this work, please cite the following papers:
@preprint{Jeong2026BayesianBM25,
  author    = {Jeong, Jaepil},
  title     = {Bayesian {BM25}: {A} Probabilistic Framework for Hybrid Text
               and Vector Search},
  year      = {2026},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.18414940},
  url       = {https://doi.org/10.5281/zenodo.18414940}
}
@preprint{Jeong2026BayesianNeural,
  author    = {Jeong, Jaepil},
  title     = {From {Bayesian} Inference to Neural Computation: The Analytical
               Emergence of Neural Network Structure from Probabilistic
               Relevance Estimation},
  year      = {2026},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.18512411},
  url       = {https://doi.org/10.5281/zenodo.18512411}
}
License
This project is licensed under the Apache License 2.0.
Copyright (c) 2023-2026 Cognica, Inc.