Bayesian sequential benchmarking for LLMs and agents

bayesbench

Stop evaluating when you have enough evidence — not when you run out of problems.

⚠️ Pre-Alpha: This package is under active development. APIs may change without notice. Not recommended for production use.

bayesbench applies Bayesian sequential testing to LLM evaluation. Instead of running every model on every problem, it updates a posterior after each problem and stops once the probability that one model beats the other crosses a confidence threshold, so a clear winner can be declared after evaluating only part of the dataset.

Based on "Bayesian Sequential Testing for Efficient LLM Benchmarking", submitted to the 40th International Workshop on Statistical Modelling, Oslo 2026.
Reported result: a 98.7% cost reduction on NorEval, with 410 of 31,800 problems evaluated.
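The headline figure follows directly from the two counts quoted above. A quick check, where `efficiency` is a helper name introduced here for illustration and is not part of bayesbench:

```python
def efficiency(problems_tested: int, total_problems: int) -> float:
    """Fraction of the dataset that never had to be evaluated."""
    if total_problems <= 0:
        raise ValueError("total_problems must be positive")
    return 1.0 - problems_tested / total_problems


# Counts quoted for NorEval: 410 problems evaluated out of 31,800.
saved = efficiency(410, 31_800)
print(f"{saved:.1%}")  # 98.7%
```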

How it works

Problem 1 → update posteriors → P(A>B) = 0.61  (inconclusive, continue)
Problem 2 → update posteriors → P(A>B) = 0.74  (inconclusive, continue)
Problem 3 → update posteriors → P(A>B) = 0.96  ✓ STOP — Model A wins

For each pairwise comparison, bayesbench:

1. Maintains a conjugate posterior over each model's true performance metric.
2. After every problem, computes P(Model A beats Model B) analytically or via Monte Carlo.
3. Stops as soon as that probability crosses the confidence threshold (default 0.95) for either model, once the minimum number of samples has been evaluated.
4. Skips non-discriminating tasks automatically when both models perform indistinguishably.
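The loop described above can be sketched in plain Python for binary outcomes. This is a minimal illustration, not bayesbench's implementation: it assumes uniform Beta(1, 1) priors, always uses Monte Carlo for P(A>B), and leaves out task skipping. The function names are made up for this example.

```python
import random


def prob_a_beats_b(a_ok, a_bad, b_ok, b_bad, rng, draws=20_000):
    """Monte Carlo estimate of P(theta_A > theta_B) under Beta(1, 1) priors."""
    hits = 0
    for _ in range(draws):
        theta_a = rng.betavariate(1 + a_ok, 1 + a_bad)
        theta_b = rng.betavariate(1 + b_ok, 1 + b_bad)
        hits += theta_a > theta_b
    return hits / draws


def sequential_compare(outcomes, confidence=0.95, min_samples=3, seed=0):
    """Consume (a_correct, b_correct) pairs until one model is a clear winner.

    Returns (winner, problems_tested, p_a_beats_b); winner is None if the
    data runs out before the threshold is crossed.
    """
    rng = random.Random(seed)
    a_ok = a_bad = b_ok = b_bad = 0
    tested, p = 0, 0.5
    for a_correct, b_correct in outcomes:
        tested += 1
        # Conjugate update: a Beta posterior only needs running counts.
        a_ok += int(a_correct)
        a_bad += int(not a_correct)
        b_ok += int(b_correct)
        b_bad += int(not b_correct)
        p = prob_a_beats_b(a_ok, a_bad, b_ok, b_bad, rng)
        if tested >= min_samples:
            if p >= confidence:
                return "model_a", tested, p
            if p <= 1.0 - confidence:
                return "model_b", tested, p
    return None, tested, p


# Toy data: model A is always right, model B is always wrong.
winner, tested, p = sequential_compare([(True, False)] * 100)
print(winner, tested)  # model_a 3
```

With three straight wins for A and three straight losses for B, the exact value of P(A>B) is about 0.986, so the loop stops at the `min_samples` floor and leaves the other 97 problems unevaluated.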

Installation

pip install bayesbench

Optional framework integrations:

# Quote the extras so shells such as zsh do not treat the brackets as a glob
pip install "bayesbench[openai]"        # OpenAI, Groq, Together AI, Ollama, vLLM, …
pip install "bayesbench[anthropic]"     # Anthropic (Claude)
pip install "bayesbench[huggingface]"   # HuggingFace Inference API + datasets
pip install "bayesbench[inspect]"       # AISI Inspect eval framework
pip install "bayesbench[mteb]"          # MTEB embedding benchmark
pip install "bayesbench[openclaw]"      # OpenClaw agents
pip install "bayesbench[all]"           # everything above

Quick start

Pairwise comparison

from bayesbench import benchmark

@benchmark(
    model_a=lambda p: big_llm(p["question"]),
    model_b=lambda p: small_llm(p["question"]),
    dataset=problems,
    confidence=0.95,
)
def exact_match(problem, response):
    return response.strip() == problem["answer"]

result = exact_match.run()
print(result.winner)      # "model_a", "model_b", or None
print(result.efficiency)  # e.g. 0.87 → 87% of problems saved

Multi-task suite

from bayesbench import BayesianBenchmark

bench = BayesianBenchmark(confidence=0.95)

@bench.task(dataset=gsm8k,  name="gsm8k")
def math(problem):
    return model_a(problem["q"]) == problem["a"], \
           model_b(problem["q"]) == problem["a"]

@bench.task(dataset=mmlu,   name="mmlu")
def science(problem):
    return model_a(problem["q"]) == problem["a"], \
           model_b(problem["q"]) == problem["a"]

report = bench.run(verbose=True)   # tqdm progress bar
print(report.summary())
report.to_dataframe().to_csv("results.csv")

Class-based suite

from bayesbench import suite

@suite(confidence=0.95)
class EvalSuite:
    dataset = problems

    @staticmethod
    def task_reasoning(problem):
        return model_a(problem["q"]) == problem["a"], \
               model_b(problem["q"]) == problem["a"]

    @staticmethod
    def task_coding(problem):
        return run_tests(model_a, problem), run_tests(model_b, problem)

report = EvalSuite.run()

Rank N models

from bayesbench import BayesianRanker

ranker = BayesianRanker(confidence=0.95)
ranker.add_model("gpt-4o",       gpt4_fn)
ranker.add_model("gpt-4o-mini",  mini_fn)
ranker.add_model("llama-3-70b",  llama_fn)
ranker.add_model("mistral-large", mistral_fn)

result = ranker.rank(
    dataset=problems,
    score_fn=lambda p, r: r.strip() == p["answer"],
    verbose=True,
)
print(result.summary())
# Rank 1: gpt-4o         score=0.912  95%CI=[0.881, 0.943]  P(>gpt-4o-mini)=0.981
# Rank 2: llama-3-70b    score=0.884  95%CI=[0.850, 0.918]  P(>gpt-4o-mini)=0.963
# Rank 3: gpt-4o-mini    score=0.851  95%CI=[0.814, 0.888]  P(>mistral-large)=0.971
# Rank 4: mistral-large  score=0.803  95%CI=[0.762, 0.844]
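To make the shape of that summary concrete, here is one way a ranking with "probability of beating the next model" could be computed from raw tallies. It is a sketch under stated assumptions (binary scores, Beta(1, 1) priors, Monte Carlo comparison of adjacent ranks) and not a description of how `BayesianRanker` works internally. `rank_models` and the model names are hypothetical.

```python
import random


def rank_models(counts, draws=20_000, seed=0):
    """Rank models by posterior mean accuracy under Beta(1, 1) priors.

    counts maps a model name to (correct, incorrect). Returns a list of
    (name, posterior_mean, p_beats_next), best first; p_beats_next is None
    for the last-ranked model.
    """
    rng = random.Random(seed)
    means = {name: (1 + ok) / (2 + ok + bad) for name, (ok, bad) in counts.items()}
    order = sorted(means, key=means.get, reverse=True)
    wins = {name: 0 for name in order[:-1]}
    for _ in range(draws):
        # One joint draw of every model's plausible true accuracy.
        theta = {name: rng.betavariate(1 + ok, 1 + bad) for name, (ok, bad) in counts.items()}
        for upper, lower in zip(order, order[1:]):
            wins[upper] += theta[upper] > theta[lower]
    return [(name, means[name], wins[name] / draws if name in wins else None) for name in order]


# Hypothetical tallies, not results from any real model.
ranking = rank_models({"strong": (90, 10), "middle": (70, 30), "weak": (40, 60)})
for name, mean, p_next in ranking:
    print(name, round(mean, 3), p_next)
```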

Continuous scores (BLEU, ROUGE, LLM-judge)

from bayesbench import BayesianBenchmark
from bayesbench.posteriors import NormalPosterior

bench = BayesianBenchmark(
    confidence=0.95,
    posterior_factory=NormalPosterior,   # Normal-Inverse-Gamma conjugate model
)

result = bench.compare(
    model_a=big_llm,
    model_b=small_llm,
    score_fn=lambda p, r: compute_bleu(r, p["reference"]),  # returns float
    dataset=translation_problems,
)
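For readers unfamiliar with the Normal-Inverse-Gamma model mentioned in the comment above, the standard conjugate update can be sketched as follows. The class, its field names and the prior values are illustrative choices for this example; they are not the `NormalPosterior` API.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class NormalInverseGamma:
    """Illustrative NIG posterior over a score's unknown mean and variance."""
    mu: float = 0.5      # prior mean of the score
    kappa: float = 1.0   # pseudo-observations backing the prior mean
    alpha: float = 1.0   # shape of the Inverse-Gamma prior on the variance
    beta: float = 0.01   # scale of the Inverse-Gamma prior on the variance

    def update(self, x: float) -> "NormalInverseGamma":
        """Return the posterior after observing a single score x."""
        kappa_n = self.kappa + 1.0
        mu_n = (self.kappa * self.mu + x) / kappa_n
        alpha_n = self.alpha + 0.5
        beta_n = self.beta + self.kappa * (x - self.mu) ** 2 / (2.0 * kappa_n)
        return NormalInverseGamma(mu_n, kappa_n, alpha_n, beta_n)

    def sample_mean(self, rng: random.Random) -> float:
        """Draw one plausible value of the true mean score."""
        # 1 / Gamma(shape=alpha, scale=1/beta) is Inverse-Gamma(alpha, beta).
        variance = 1.0 / rng.gammavariate(self.alpha, 1.0 / self.beta)
        return rng.gauss(self.mu, math.sqrt(variance / self.kappa))


def prob_a_beats_b(post_a, post_b, draws=20_000, seed=0):
    """Monte Carlo estimate of P(mean_A > mean_B)."""
    rng = random.Random(seed)
    hits = sum(post_a.sample_mean(rng) > post_b.sample_mean(rng) for _ in range(draws))
    return hits / draws


# Made-up BLEU-like scores on a 0-1 scale.
scores_a = [0.42, 0.45, 0.41, 0.44, 0.43, 0.46, 0.40, 0.43]
scores_b = [0.30, 0.33, 0.29, 0.32, 0.31, 0.34, 0.28, 0.31]
post_a, post_b = NormalInverseGamma(), NormalInverseGamma()
for a, b in zip(scores_a, scores_b):
    post_a, post_b = post_a.update(a), post_b.update(b)
p = prob_a_beats_b(post_a, post_b)
print(round(post_a.mu, 3), round(post_b.mu, 3), p)
```

Note how both posterior means are pulled slightly toward the prior mean of 0.5; with more observations that shrinkage fades.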

Async models

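import asyncio

async def main():
    # compare_async must be awaited from inside a coroutine
    return await bench.compare_async(
        model_a=async_big_llm,
        model_b=async_small_llm,
        score_fn=lambda p, r: r == p["answer"],
        dataset=problems,
    )

result = asyncio.run(main())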
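The practical benefit of async models is that both can be queried at the same time for each problem. A self-contained sketch of that pattern with `asyncio.gather`, using a fake model in place of a real endpoint (`fake_model` and `score_pair` are hypothetical names, and this is not how `compare_async` is necessarily implemented):

```python
import asyncio


async def fake_model(reply: str, delay: float) -> str:
    """Stand-in for a network call to an LLM endpoint (hypothetical)."""
    await asyncio.sleep(delay)
    return reply


async def score_pair(problem: dict) -> tuple:
    """Query both models concurrently, then score each reply by exact match."""
    reply_a, reply_b = await asyncio.gather(
        fake_model("4", 0.01),  # pretend model A answers correctly
        fake_model("5", 0.01),  # pretend model B answers incorrectly
    )
    return reply_a == problem["answer"], reply_b == problem["answer"]


outcome = asyncio.run(score_pair({"question": "2 + 2?", "answer": "4"}))
print(outcome)  # (True, False)
```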

Framework adapters

All adapters return a plain callable(problem) -> str that plugs into any bayesbench API.

| Adapter | Import | Works with |
| --- | --- | --- |
| OpenAI-compatible | `from bayesbench.adapters.openai_compat import openai_model` | OpenAI, Groq, Together AI, Fireworks, Ollama, vLLM, Azure OpenAI |
| Anthropic | `from bayesbench.adapters.anthropic_adapter import anthropic_model` | Claude (all versions) |
| HuggingFace | `from bayesbench.adapters.huggingface import hf_model, hf_dataset` | Any HF Inference API endpoint |
| Inspect AI | `from bayesbench.adapters.inspect_ai import inspect_model, from_inspect_dataset` | AISI Inspect `Dataset`, `Task`, `Scorer` |
| MTEB | `from bayesbench.adapters.mteb import st_model, mteb_sts_dataset` | SentenceTransformers, MTEB STS + Classification |

from bayesbench import BayesianRanker
from bayesbench.adapters.openai_compat import openai_model
from bayesbench.adapters.anthropic_adapter import anthropic_model

ranker = BayesianRanker(confidence=0.95)
ranker.add_model("gpt-4o",          openai_model("gpt-4o"))
ranker.add_model("claude-opus-4-6", anthropic_model("claude-opus-4-6"))
ranker.add_model("llama-3-groq",    openai_model("llama-3.1-70b-versatile",
                                        base_url="https://api.groq.com/openai/v1"))

result = ranker.rank(dataset=problems, score_fn=score)

Posteriors

Swap the Bayesian model to match your metric type:

| Posterior | Use when | Import |
| --- | --- | --- |
| `BetaPosterior` | Binary outcomes: exact match, pass/fail, multiple choice | `from bayesbench.posteriors import BetaPosterior` |
| `NormalPosterior` | Continuous scores: BLEU, ROUGE, cosine similarity, LLM-judge (0–1) | `from bayesbench.posteriors import NormalPosterior` |
| Custom | Any distribution; subclass `Posterior` | `from bayesbench.posteriors import Posterior` |

# Custom prior: centre the prior mean on a BLEU score of about 0.30
from bayesbench import BayesianBenchmark
from bayesbench.posteriors import NormalPosterior
bench = BayesianBenchmark(posterior_factory=lambda: NormalPosterior(mu_0=0.30))

# Per-task posterior override
@bench.task(dataset=problems, posterior_factory=NormalPosterior)
def bleu_task(problem):
    return compute_bleu(model_a(problem), problem["ref"]), \
           compute_bleu(model_b(problem), problem["ref"])

Results & export

report = bench.run()

# Text summary
print(report.summary())

# Serialise to dict / JSON
import json
print(json.dumps(report.to_dict(), indent=2))

# Pandas DataFrame (requires pandas)
df = report.to_dataframe()
df.to_csv("results.csv", index=False)

# Individual task result
result = report.task_results[0]
print(result.winner)            # "model_a" | "model_b" | None
print(result.efficiency)        # 0.0 – 1.0
print(result.p_a_beats_b)       # posterior probability
lo, hi = result.posterior_a.credible_interval()
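As a rough guide to what such an interval means for a binary metric, the sketch below computes an equal-tailed 95% credible interval for a Beta posterior by sampling. It assumes a Beta(1, 1) prior and is independent of bayesbench; `beta_credible_interval` is a name made up for this example, and the library's own `credible_interval()` may use a different method or default mass.

```python
import random


def beta_credible_interval(successes, failures, mass=0.95, draws=20_000, seed=0):
    """Equal-tailed credible interval for a Beta(1 + successes, 1 + failures) posterior."""
    rng = random.Random(seed)
    samples = sorted(rng.betavariate(1 + successes, 1 + failures) for _ in range(draws))
    tail = (1.0 - mass) / 2.0
    lower = samples[int(tail * draws)]
    upper = samples[int((1.0 - tail) * draws) - 1]
    return lower, upper


# 45 correct answers out of 50 (illustrative numbers).
ci_lo, ci_hi = beta_credible_interval(successes=45, failures=5)
print(round(ci_lo, 2), round(ci_hi, 2))
```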

CLI

# Run all tasks in a benchmark file
bayesbench my_benchmark.py

# Override stopping thresholds
bayesbench my_benchmark.py --confidence 0.99 --min-samples 10 --skip-threshold 0.90

# Print version
bayesbench --version

The benchmark file must expose a bench = BayesianBenchmark(...) instance or a @suite-decorated class.

API reference

`BayesianBenchmark`

BayesianBenchmark(
    confidence: float = 0.95,           # P(A>B) threshold to declare winner
    skip_threshold: float = 0.85,       # skip non-discriminating tasks
    min_samples: int = 3,               # minimum evaluations before stopping
    posterior_factory: Callable = BetaPosterior,
)
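How the three thresholds might interact can be sketched as a single decision function. The source does not spell out what `skip_threshold` is compared against, so the reading used here is an assumption: a task is skipped once the posterior probability that the two models differ by less than a small margin reaches `skip_threshold`. The `margin` parameter and both function names are hypothetical.

```python
import random


def prob_practically_equal(a_ok, a_bad, b_ok, b_bad, margin=0.05, draws=20_000, seed=0):
    """Estimate P(|theta_A - theta_B| < margin) under Beta(1, 1) priors.

    margin is a hypothetical tolerance introduced for this sketch.
    """
    rng = random.Random(seed)
    close = 0
    for _ in range(draws):
        theta_a = rng.betavariate(1 + a_ok, 1 + a_bad)
        theta_b = rng.betavariate(1 + b_ok, 1 + b_bad)
        close += abs(theta_a - theta_b) < margin
    return close / draws


def decide(p_a_beats_b, p_equal, n_tested, confidence=0.95, skip_threshold=0.85, min_samples=3):
    """Map the current evidence to 'model_a', 'model_b', 'skip' or 'continue'."""
    if n_tested < min_samples:
        return "continue"  # too early to trust any posterior summary
    if p_a_beats_b >= confidence:
        return "model_a"
    if p_a_beats_b <= 1.0 - confidence:
        return "model_b"
    if p_equal >= skip_threshold:
        return "skip"  # assumed meaning of a non-discriminating task
    return "continue"


# Two models that both score 480/500 look interchangeable on this task.
p_equal = prob_practically_equal(480, 20, 480, 20)
print(decide(p_a_beats_b=0.5, p_equal=p_equal, n_tested=500))  # skip
```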

| Method | Returns | Description |
| --- | --- | --- |
| `.task(name, dataset, posterior_factory)` | decorator | Register an evaluation function |
| `.compare(model_a, model_b, score_fn, dataset)` | `TaskResult` | Direct pairwise comparison |
| `.compare_async(...)` | `TaskResult` | Async pairwise comparison |
| `.run(verbose=False)` | `BenchmarkReport` | Run all registered tasks |
| `.run_async()` | `BenchmarkReport` | Async version |

`BayesianRanker`

BayesianRanker(
    confidence: float = 0.95,
    skip_threshold: float = 0.85,
    min_samples: int = 5,
    posterior_factory: Callable = BetaPosterior,
)

| Method | Returns | Description |
| --- | --- | --- |
| `.add_model(name, fn)` | `self` | Register a model (chainable) |
| `.evaluate` | decorator | Set the scoring function |
| `.rank(dataset, score_fn, verbose=False)` | `RankingResult` | Rank all models |
| `.rank_async(dataset, score_fn)` | `RankingResult` | Async version |

`TaskResult`

| Attribute | Type | Description |
| --- | --- | --- |
| `.winner` | `str \| None` | `"model_a"`, `"model_b"`, or `None` |
| `.efficiency` | `float` | Fraction of problems not evaluated |
| `.problems_tested` | `int` | Problems evaluated before stopping |
| `.total_problems` | `int` | Dataset size |
| `.p_a_beats_b` | `float` | Final P(A > B) |
| `.posterior_a`, `.posterior_b` | `Posterior` | Final posteriors |
| `.skipped` | `bool` | True if task was non-discriminating |
| `.to_dict()` | `dict` | Serialise to plain dict |

`BenchmarkReport`

| Attribute / Method | Description |
| --- | --- |
| `.task_results` | List of `TaskResult` objects |
| `.overall_efficiency` | Aggregate fraction of problems saved |
| `.winners` | `{task_name: winner}` dict |
| `.summary()` | Formatted text report |
| `.to_dict()` | Serialise to plain dict |
| `.to_dataframe()` | Returns a `pandas.DataFrame` (requires pandas) |

Contributing

See CONTRIBUTING.md. In short:

git clone https://github.com/rymarinelli/bayesbench
cd bayesbench
pip install -e ".[dev]"
pytest          # run tests
ruff check .    # lint

Citation

@inproceedings{marinelli2026bayesian,
  title     = {Bayesian Sequential Testing for Efficient {LLM} Benchmarking},
  author    = {Marinelli, Ryan},
  booktitle = {Proceedings of the 40th International Workshop on Statistical Modelling},
  year      = {2026},
  address   = {Oslo, Norway},
}

Documentation

Project docs are built with MkDocs Material and can be deployed automatically through the Docs GitHub Actions workflow.

Start here: docs/index.md
Workflow guides (LLM + agentic benchmarking): docs/workflows.md

pip install -e ".[docs]"
mkdocs serve

Version 0.4.0.dev0 (pre-release), Apr 6, 2026
