Skip to main content

Bayesian probability transforms for BM25 retrieval scores

Project description

Bayesian BM25

A probabilistic framework that converts raw BM25 retrieval scores into calibrated relevance probabilities using Bayesian inference.

Overview

Standard BM25 produces unbounded scores that lack consistent meaning across queries, making threshold-based filtering and multi-signal fusion unreliable. Bayesian BM25 addresses this by applying a sigmoid likelihood model with a composite prior (term frequency + document length normalization) and computing Bayesian posteriors that output well-calibrated probabilities in [0, 1].

Key capabilities:

  • Score-to-probability transform -- convert raw BM25 scores into calibrated relevance probabilities via sigmoid likelihood + composite prior + Bayesian posterior
  • Parameter learning -- batch gradient descent or online SGD with EMA-smoothed gradients and Polyak averaging
  • Probabilistic fusion -- combine multiple probability signals using log-odds conjunction, which resolves the shrinkage problem of naive probabilistic AND
  • Search integration -- drop-in scorer wrapping bm25s that returns probabilities instead of raw scores

Installation

pip install bayesian-bm25

To use the integrated search scorer (requires bm25s):

pip install bayesian-bm25[scorer]

Quick Start

Converting BM25 Scores to Probabilities

import numpy as np
from bayesian_bm25 import BayesianProbabilityTransform

transform = BayesianProbabilityTransform(alpha=1.5, beta=1.0)

scores = np.array([0.5, 1.0, 1.5, 2.0, 3.0])
tfs = np.array([1, 2, 3, 5, 8])
doc_len_ratios = np.array([0.3, 0.5, 0.8, 1.0, 1.5])

probabilities = transform.score_to_probability(scores, tfs, doc_len_ratios)

End-to-End Search with Probabilities

from bayesian_bm25 import BayesianBM25Scorer

corpus_tokens = [
    ["python", "machine", "learning"],
    ["deep", "learning", "neural", "networks"],
    ["data", "visualization", "tools"],
]

scorer = BayesianBM25Scorer(k1=1.2, b=0.75, method="lucene")
scorer.index(corpus_tokens, show_progress=False)

doc_ids, probabilities = scorer.retrieve([["machine", "learning"]], k=3)

Combining Multiple Signals

import numpy as np
from bayesian_bm25 import log_odds_conjunction, prob_and, prob_or

signals = np.array([0.85, 0.70, 0.60])

prob_and(signals)                # 0.357 (shrinkage problem)
log_odds_conjunction(signals)    # 0.773 (agreement-aware)

Online Learning from User Feedback

from bayesian_bm25 import BayesianProbabilityTransform

transform = BayesianProbabilityTransform(alpha=1.0, beta=0.0)

# Batch warmup on historical data
transform.fit(historical_scores, historical_labels)

# Online refinement from live feedback
for score, label in feedback_stream:
    transform.update(score, label, learning_rate=0.01, momentum=0.95)

# Use Polyak-averaged parameters for stable inference
alpha = transform.averaged_alpha
beta = transform.averaged_beta

Citation

If you use this work, please cite the following papers:

@preprint{Jeong2026BayesianBM25,
  author    = {Jeong, Jaepil},
  title     = {Bayesian {BM25}: {A} Probabilistic Framework for Hybrid Text
               and Vector Search},
  year      = {2026},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.18414940},
  url       = {https://doi.org/10.5281/zenodo.18414940}
}

@preprint{Jeong2026BayesianNeural,
  author    = {Jeong, Jaepil},
  title     = {From {Bayesian} Inference to Neural Computation: The Analytical
               Emergence of Neural Network Structure from Probabilistic
               Relevance Estimation},
  year      = {2026},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.18512411},
  url       = {https://doi.org/10.5281/zenodo.18512411}
}

License

This project is licensed under the Apache License 2.0.

Copyright (c) 2023-2026 Cognica, Inc.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

bayesian_bm25-0.1.1.tar.gz (19.2 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

bayesian_bm25-0.1.1-py3-none-any.whl (14.6 kB view details)

Uploaded Python 3

File details

Details for the file bayesian_bm25-0.1.1.tar.gz.

File metadata

  • Download URL: bayesian_bm25-0.1.1.tar.gz
  • Upload date:
  • Size: 19.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for bayesian_bm25-0.1.1.tar.gz
Algorithm Hash digest
SHA256 50d169a654fbe9defbe928dfe0d103c425aeab72138a5d39ed115711b4f50c64
MD5 dee70ce00e6d2d31954d1ac33b8e937b
BLAKE2b-256 99b4b5c87e20ac0db4f04dcaf8d51c6b44a2b65abf1e1f95fc17947b1ab1a6e5

See more details on using hashes here.

File details

Details for the file bayesian_bm25-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: bayesian_bm25-0.1.1-py3-none-any.whl
  • Upload date:
  • Size: 14.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for bayesian_bm25-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 019ca6b10bed10b49218e0d0ed8124f59abe5227483f7a6c0fc6c2321b71e0db
MD5 68775acddbcd2f68c7b74d822de111f8
BLAKE2b-256 7b47a01be2fbcdc2d5a6c863cafdd3ea5e45f82a5c8ce0e6fbed592961181f82

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page