Bayesian probability transforms for BM25 retrieval scores

Bayesian BM25

A probabilistic framework that converts raw BM25 retrieval scores into calibrated relevance probabilities using Bayesian inference.

Overview

Standard BM25 produces unbounded scores that lack consistent meaning across queries, making threshold-based filtering and multi-signal fusion unreliable. Bayesian BM25 addresses this by applying a sigmoid likelihood model with a composite prior (term frequency + document length normalization) and computing a Bayesian posterior, which yields calibrated probabilities in [0, 1]. A corpus-level base rate prior further reduces expected calibration error (ECE) by 68--77% on the NFCorpus and SciFact benchmarks reported below, without requiring relevance labels.
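The pipeline described above can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not the package's implementation: the likelihood is taken to be sigmoid(alpha * (score - beta)), the prior is passed in as a given number rather than derived from term frequency and document length, and the helper names sigmoid and sketch_posterior are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sketch_posterior(score, prior, alpha=1.5, beta=1.0):
    """Two-hypothesis Bayes update (hypothetical helper, not the library API)."""
    # Assumed likelihood form: a sigmoid of the shifted, scaled BM25 score.
    likelihood = sigmoid(alpha * (np.asarray(score, dtype=float) - beta))
    numerator = likelihood * prior
    return numerator / (numerator + (1.0 - likelihood) * (1.0 - prior))

scores = np.array([0.5, 1.0, 1.5, 2.0, 3.0])
neutral = sketch_posterior(scores, prior=0.5)    # a prior of 0.5 leaves the likelihood unchanged
skeptical = sketch_posterior(scores, prior=0.1)  # a lower prior pulls every probability down
```

Whatever the raw score, the result stays inside [0, 1], and a lower prior lowers the posterior for the same score.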

Key capabilities:

  • Score-to-probability transform -- convert raw BM25 scores into calibrated relevance probabilities via sigmoid likelihood + composite prior + Bayesian posterior
  • Base rate calibration -- a corpus-level base rate prior, estimated from the score distribution, decomposes the posterior into three additive log-odds terms and reduces expected calibration error by 68--77% without relevance labels
  • Parameter learning -- batch gradient descent or online SGD with EMA-smoothed gradients and Polyak averaging, with three training modes: balanced (C1), prior-aware (C2), and prior-free (C3)
  • Probabilistic fusion -- combine multiple probability signals using log-odds conjunction with optional per-signal reliability weights (Log-OP), which resolves the shrinkage problem of naive probabilistic AND
  • Hybrid search -- cosine_to_probability() converts vector similarity scores to probabilities for fusion with BM25 signals via weighted log-odds conjunction
  • WAND pruning -- wand_upper_bound() computes safe Bayesian probability upper bounds for document pruning in top-k retrieval
  • Search integration -- drop-in scorer wrapping bm25s that returns probabilities instead of raw scores
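The "three additive log-odds terms" in the base rate bullet can be written out as follows. This is a sketch under an assumption: the three terms are taken to be the sigmoid likelihood L, the composite prior p, and the base rate b, each entering as an independent two-hypothesis Bayes update. The symbols are introduced here for illustration and are not taken from the package.

```latex
\begin{aligned}
\operatorname{logit}(x) &= \log\frac{x}{1-x}, \\
P_{0} &= \frac{L\,p}{L\,p + (1-L)(1-p)}
  \;\Longleftrightarrow\;
  \operatorname{logit}(P_{0}) = \operatorname{logit}(L) + \operatorname{logit}(p), \\
\operatorname{logit}(P) &= \operatorname{logit}(L) + \operatorname{logit}(p) + \operatorname{logit}(b).
\end{aligned}
```

Under this reading, logit(b) is the same constant for every document in a corpus, so it shifts all posteriors by the same amount in log-odds space.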

Adoption

  • MTEB -- included as a baseline retrieval model (bb25) for the Massive Text Embedding Benchmark
  • txtai -- used for BM25 score normalization in hybrid search (normalize="bayesian-bm25")

Installation

pip install bayesian-bm25

To use the integrated search scorer (requires bm25s):

pip install bayesian-bm25[scorer]

Quick Start

Converting BM25 Scores to Probabilities

import numpy as np
from bayesian_bm25 import BayesianProbabilityTransform

transform = BayesianProbabilityTransform(alpha=1.5, beta=1.0, base_rate=0.01)

scores = np.array([0.5, 1.0, 1.5, 2.0, 3.0])
tfs = np.array([1, 2, 3, 5, 8])
doc_len_ratios = np.array([0.3, 0.5, 0.8, 1.0, 1.5])

probabilities = transform.score_to_probability(scores, tfs, doc_len_ratios)

End-to-End Search with Probabilities

from bayesian_bm25 import BayesianBM25Scorer

corpus_tokens = [
    ["python", "machine", "learning"],
    ["deep", "learning", "neural", "networks"],
    ["data", "visualization", "tools"],
]

scorer = BayesianBM25Scorer(k1=1.2, b=0.75, method="lucene", base_rate="auto")
scorer.index(corpus_tokens, show_progress=False)

doc_ids, probabilities = scorer.retrieve([["machine", "learning"]], k=3)

Combining Multiple Signals

import numpy as np
from bayesian_bm25 import log_odds_conjunction, prob_and, prob_or

signals = np.array([0.85, 0.70, 0.60])

prob_and(signals)                # 0.357 (shrinkage problem)
log_odds_conjunction(signals)    # 0.773 (agreement-aware)
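The shrinkage problem can be reproduced without the library. The sketch below compares the plain product with a generic weighted log-odds pool (Log-OP). The helper sketch_log_op is hypothetical and uses a simple weighted mean of logits; it is not the formula behind log_odds_conjunction and is not expected to reproduce the 0.773 shown above.

```python
import numpy as np

def logit(p):
    p = np.asarray(p, dtype=float)
    return np.log(p / (1.0 - p))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def naive_and(probs):
    """Product of probabilities: shrinks as more signals are multiplied in."""
    return float(np.prod(probs))

def sketch_log_op(probs, weights=None):
    """Weighted mean of log-odds mapped back to a probability (hypothetical helper)."""
    probs = np.asarray(probs, dtype=float)
    if weights is None:
        # Uniform weights that sum to 1.
        weights = np.full(probs.shape, 1.0 / probs.size)
    return float(sigmoid(np.sum(weights * logit(probs))))

signals = np.array([0.85, 0.70, 0.60])
product = naive_and(signals)     # 0.85 * 0.70 * 0.60 = 0.357
pooled = sketch_log_op(signals)  # stays between the smallest and largest input
```

Three signals that each favor relevance multiply down to a product below every individual input, whereas pooling in log-odds space keeps the result within the range of the inputs.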

Hybrid Text + Vector Search

import numpy as np
from bayesian_bm25 import cosine_to_probability, log_odds_conjunction

# BM25 probabilities (from Bayesian BM25)
bm25_probs = np.array([0.85, 0.60, 0.40])

# Vector search cosine similarities -> probabilities
cosine_scores = np.array([0.92, 0.35, 0.70])
vector_probs = cosine_to_probability(cosine_scores)  # [0.96, 0.675, 0.85]

# Fuse with reliability weights (BM25 weight=0.6, vector weight=0.4)
stacked = np.stack([bm25_probs, vector_probs], axis=-1)
fused = log_odds_conjunction(stacked, weights=np.array([0.6, 0.4]))
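The values in the comment above are consistent with the linear map (1 + cos) / 2. Assuming that mapping (the package source is not shown here, so treat this as an assumption), a standalone equivalent looks like this; the function name is hypothetical.

```python
import numpy as np

def sketch_cosine_to_probability(cosine):
    """Linear map from [-1, 1] to [0, 1] (assumed form, not the library source)."""
    cosine = np.clip(np.asarray(cosine, dtype=float), -1.0, 1.0)
    return (1.0 + cosine) / 2.0

cosine_scores = np.array([0.92, 0.35, 0.70])
vector_probs = sketch_cosine_to_probability(cosine_scores)  # [0.96, 0.675, 0.85]
```

A probability of exactly 0 or 1 has infinite log-odds, so a hand-rolled mapping like this one would need to clip its output slightly inside (0, 1) before any log-odds fusion.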

WAND Pruning with Bayesian Upper Bounds

from bayesian_bm25 import BayesianProbabilityTransform

transform = BayesianProbabilityTransform(alpha=1.5, beta=2.0, base_rate=0.01)

# Standard BM25 upper bound per query term
bm25_upper_bound = 5.0

# Bayesian upper bound for safe pruning -- any document's actual
# probability is guaranteed to be at most this value
bayesian_bound = transform.wand_upper_bound(bm25_upper_bound)
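Why such a bound is safe can be illustrated with a small sketch. It assumes the same sigmoid-likelihood posterior as the overview, with alpha > 0, and a known cap max_prior on the prior. Both the cap and the helper names are assumptions made for illustration; the internals of wand_upper_bound() may differ.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sketch_posterior(score, prior, alpha, beta):
    likelihood = sigmoid(alpha * (score - beta))
    return likelihood * prior / (likelihood * prior + (1.0 - likelihood) * (1.0 - prior))

def sketch_upper_bound(bm25_upper_bound, alpha, beta, max_prior):
    """Posterior at the largest possible score and prior (hypothetical helper)."""
    return sketch_posterior(bm25_upper_bound, max_prior, alpha, beta)

alpha, beta, max_prior = 1.5, 2.0, 0.9  # max_prior is an assumed cap on the prior
bound = sketch_upper_bound(5.0, alpha, beta, max_prior)

# Simulated documents whose scores and priors stay within the stated limits.
rng = np.random.default_rng(0)
doc_scores = rng.uniform(0.0, 5.0, size=1000)
doc_priors = rng.uniform(0.1, max_prior, size=1000)
doc_probs = sketch_posterior(doc_scores, doc_priors, alpha, beta)
```

The posterior increases with both the score and the prior, so evaluating it at their maxima cannot be exceeded by any document within those limits.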

Online Learning from User Feedback

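import numpy as np
from bayesian_bm25 import BayesianProbabilityTransform

transform = BayesianProbabilityTransform(alpha=1.0, beta=0.0)

# Illustrative placeholder data -- replace with your own logged BM25
# scores and binary relevance labels (assumed: 1 = relevant, 0 = not)
historical_scores = np.array([0.4, 1.1, 2.3, 3.0])
historical_labels = np.array([0, 0, 1, 1])
feedback_stream = [(2.7, 1), (0.6, 0)]

# Batch warmup on historical data
transform.fit(historical_scores, historical_labels)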

# Online refinement from live feedback
for score, label in feedback_stream:
    transform.update(score, label, learning_rate=0.01, momentum=0.95)

# Use Polyak-averaged parameters for stable inference
alpha = transform.averaged_alpha
beta = transform.averaged_beta
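The ingredients named in the capability list (online SGD, EMA-smoothed gradients, Polyak averaging) can be sketched for a two-parameter sigmoid. This is a generic illustration assuming a log-loss objective; the class SketchOnlineSigmoid is hypothetical and is not how transform.update() is implemented.

```python
import numpy as np

class SketchOnlineSigmoid:
    """Online logistic fit of p = sigmoid(alpha * (score - beta)) (hypothetical class)."""

    def __init__(self, alpha=1.0, beta=0.0):
        self.alpha, self.beta = float(alpha), float(beta)
        self.grad_ema = np.zeros(2)                         # EMA of (d/dalpha, d/dbeta)
        self.avg = np.array([alpha, beta], dtype=float)     # running mean of parameters
        self.steps = 0

    def update(self, score, label, learning_rate=0.01, momentum=0.95):
        p = 1.0 / (1.0 + np.exp(-self.alpha * (score - self.beta)))
        err = p - label                                     # d(log loss)/d(logit)
        grad = np.array([err * (score - self.beta), -err * self.alpha])
        # Smooth the noisy per-example gradient with an exponential moving average.
        self.grad_ema = momentum * self.grad_ema + (1.0 - momentum) * grad
        self.alpha -= learning_rate * self.grad_ema[0]
        self.beta -= learning_rate * self.grad_ema[1]
        # Polyak averaging: keep the mean of all parameter iterates so far.
        self.steps += 1
        self.avg += (np.array([self.alpha, self.beta]) - self.avg) / self.steps

# Synthetic feedback: documents scoring above 2.0 are treated as relevant.
rng = np.random.default_rng(0)
stream_scores = rng.uniform(0.0, 4.0, size=5000)
model = SketchOnlineSigmoid(alpha=1.0, beta=0.0)
for s in stream_scores:
    model.update(s, float(s > 2.0), learning_rate=0.05)
averaged_alpha, averaged_beta = model.avg
```

The averaged parameters move more smoothly than the latest iterate, which is the usual reason to serve predictions from the average rather than from the most recent update.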

Training Modes

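import numpy as np
from bayesian_bm25 import BayesianProbabilityTransform

transform = BayesianProbabilityTransform(alpha=1.0, beta=0.0)

# Illustrative placeholder data (labels assumed binary: 1 = relevant, 0 = not)
scores = np.array([0.5, 1.0, 1.5, 2.0, 3.0])
labels = np.array([0, 0, 1, 1, 1])
tfs = np.array([1, 2, 3, 5, 8])
ratios = np.array([0.3, 0.5, 0.8, 1.0, 1.5])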

# C1 (balanced, default): train on sigmoid likelihood
transform.fit(scores, labels, mode="balanced")

# C2 (prior-aware): train on full Bayesian posterior
transform.fit(scores, labels, mode="prior_aware", tfs=tfs, doc_len_ratios=ratios)

# C3 (prior-free): train on likelihood, inference uses prior=0.5
transform.fit(scores, labels, mode="prior_free")

Benchmarks

Evaluated on BEIR datasets (NFCorpus, SciFact) with k1=1.2, b=0.75, Lucene BM25. Queries are split 50/50 for training and evaluation. "Batch fit" uses gradient descent on training labels; all other Bayesian methods are unsupervised.

Ranking Quality

The base rate prior is a monotonic transform, so it does not change document ordering: rows with and without the base rate report identical ranking metrics.

| Method | NFCorpus NDCG@10 | NFCorpus MAP | SciFact NDCG@10 | SciFact MAP |
|---|---|---|---|---|
| Raw BM25 | 0.5023 | 0.4395 | 0.5900 | 0.5426 |
| Bayesian (auto) | 0.5050 | 0.4403 | 0.5791 | 0.5283 |
| Bayesian (auto) + base rate | 0.5050 | 0.4403 | 0.5791 | 0.5283 |
| Bayesian (batch fit) | 0.5041 | 0.4400 | 0.5826 | 0.5305 |
| Bayesian (batch fit) + base rate | 0.5041 | 0.4400 | 0.5826 | 0.5305 |
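The identical rows with and without the base rate follow from monotonicity. A minimal check, assuming the base rate enters as a constant additive log-odds term (the helper names are illustrative):

```python
import numpy as np

def logit(p):
    p = np.asarray(p, dtype=float)
    return np.log(p / (1.0 - p))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

probs = np.array([0.91, 0.40, 0.73, 0.15])
base_rate = 0.01

# Adding the same constant to every document's log-odds lowers all
# probabilities but cannot swap any two documents.
shifted = sigmoid(logit(probs) + logit(base_rate))
same_order = np.array_equal(np.argsort(-probs), np.argsort(-shifted))
```

Rank-based metrics such as NDCG@10 and MAP depend only on the ordering, so they are unaffected by the shift.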

Probability Calibration

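Expected Calibration Error (ECE) and Brier score. Lower is better. Percentages in parentheses give the change in ECE relative to the first row (Bayesian, no base rate).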

| Method | NFCorpus ECE | NFCorpus Brier | SciFact ECE | SciFact Brier |
|---|---|---|---|---|
| Bayesian (no base rate) | 0.6519 | 0.4667 | 0.7989 | 0.6635 |
| Bayesian (base_rate=auto) | 0.1461 (-77.6%) | 0.0619 | 0.2577 (-67.7%) | 0.1308 |
| Bayesian (base_rate=0.001) | 0.0081 (-98.8%) | 0.0114 | 0.0354 (-95.6%) | 0.0157 |
| Batch fit (no base rate) | 0.0093 (-98.6%) | 0.0114 | 0.0103 (-98.7%) | 0.0051 |
| Batch fit + base_rate=auto | 0.0085 (-98.7%) | 0.0096 | 0.0021 (-99.7%) | 0.0013 |
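For readers unfamiliar with the two metrics, the standard definitions can be sketched as follows. The binning used by the benchmark script is not stated here, so the 10 equal-width bins are an assumption, and the four-example data set is made up for illustration.

```python
import numpy as np

def brier_score(probs, labels):
    """Mean squared difference between predicted probability and 0/1 label."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    return float(np.mean((probs - labels) ** 2))

def expected_calibration_error(probs, labels, n_bins=10):
    """Bin-weighted gap between mean confidence and observed relevance rate."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    # Assign each prediction to one of n_bins equal-width bins over [0, 1].
    bin_ids = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(probs[mask].mean() - labels[mask].mean())
            ece += mask.mean() * gap
    return float(ece)

probs = np.array([0.9, 0.8, 0.2, 0.1])
labels = np.array([1, 1, 0, 0])
brier = brier_score(probs, labels)               # (0.01 + 0.04 + 0.04 + 0.01) / 4 = 0.025
ece = expected_calibration_error(probs, labels)  # 0.25 * (0.1 + 0.2 + 0.2 + 0.1) = 0.15
```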

Threshold Transfer

F1 scores using the best threshold found on training queries, applied to evaluation queries. Smaller gap indicates better generalization.

| Method | NFCorpus Train F1 | NFCorpus Test F1 | SciFact Train F1 | SciFact Test F1 |
|---|---|---|---|---|
| Bayesian (no base rate) | 0.1607 | 0.1511 | 0.3374 | 0.2800 |
| Batch fit (no base rate) | 0.1577 | 0.1405 | 0.2358 | 0.2294 |
| Batch fit + base_rate=auto | 0.1559 | 0.1403 | 0.3316 | 0.3341 |
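The threshold-transfer protocol described above (pick the best threshold on training queries, then apply it unchanged to evaluation queries) can be sketched as follows. The candidate grid, helper names, and toy data are assumptions made for illustration; the benchmark script may search thresholds differently.

```python
import numpy as np

def f1_at_threshold(probs, labels, threshold):
    pred = probs >= threshold
    tp = np.sum(pred & (labels == 1))
    fp = np.sum(pred & (labels == 0))
    fn = np.sum(~pred & (labels == 1))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom > 0 else 0.0

def best_threshold(probs, labels, candidates):
    """Return the candidate with the highest F1 (first one on ties)."""
    scores = [f1_at_threshold(probs, labels, t) for t in candidates]
    return float(candidates[int(np.argmax(scores))])

# Toy data standing in for per-document probabilities and relevance labels.
train_probs = np.array([0.92, 0.71, 0.43, 0.22, 0.08])
train_labels = np.array([1, 1, 0, 1, 0])
test_probs = np.array([0.81, 0.62, 0.31, 0.04])
test_labels = np.array([1, 0, 1, 0])

candidates = np.linspace(0.05, 0.95, 19)
threshold = best_threshold(train_probs, train_labels, candidates)
train_f1 = f1_at_threshold(train_probs, train_labels, threshold)
test_f1 = f1_at_threshold(test_probs, test_labels, threshold)
```

A threshold tuned on one query set only transfers when a given probability means the same thing on both sets, which is what the train/test gap in the table measures.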

Reproduce with python benchmarks/base_rate.py (requires pip install ir_datasets). The base rate benchmark also includes Platt scaling, min-max normalization, and prior-aware/prior-free training mode comparisons.

Additional benchmarks (no external datasets required):

  • python benchmarks/weighted_fusion.py -- weighted vs uniform log-odds fusion across noise scenarios
  • python benchmarks/wand_upper_bound.py -- WAND upper bound tightness and skip rate analysis

Citation

If you use this work, please cite the following papers:

@preprint{Jeong2026BayesianBM25,
  author    = {Jeong, Jaepil},
  title     = {Bayesian {BM25}: {A} Probabilistic Framework for Hybrid Text
               and Vector Search},
  year      = {2026},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.18414940},
  url       = {https://doi.org/10.5281/zenodo.18414940}
}

@preprint{Jeong2026BayesianNeural,
  author    = {Jeong, Jaepil},
  title     = {From {Bayesian} Inference to Neural Computation: The Analytical
               Emergence of Neural Network Structure from Probabilistic
               Relevance Estimation},
  year      = {2026},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.18512411},
  url       = {https://doi.org/10.5281/zenodo.18512411}
}

License

This project is licensed under the Apache License 2.0.

Copyright (c) 2023-2026 Cognica, Inc.
