calibstats

Calibration metrics with confidence intervals — because a bare ECE number is not enough.

A model says it is 90% confident. Is it right 90% of the time? Calibration asks whether predicted probabilities match observed frequencies. The standard metric, ECE (Expected Calibration Error), is almost always reported as a single number — and that number is biased and noisy at the sample sizes real evals run on.

calibstats reports every metric as estimate ± CI, ships a bias-corrected estimator, and adds reliability diagrams with confidence bands, post-hoc recalibration, and subgroup-shift significance tests. It is framework-agnostic: feed it (predicted_prob, label) arrays from any model.

Why calibration needs confidence intervals

On a perfectly calibrated model (true ECE = 0), the plug-in ECE estimator still reports 0.12 at n = 100 and 0.05 at n = 500. That is pure estimator bias — the absolute value in |accuracy − confidence| cannot average to zero from finite, noisy bins. So:

An "ECE = 0.05" headline can mean a perfectly calibrated model at small n.
ECEs are not comparable across papers unless n and bin count match.
A point estimate with no interval hides whether you have signal or noise — at n = 200 the 95% CI for ECE is ~0.10 wide, often wider than the value itself.

The fix is not exotic: attach a bootstrap CI, correct the bias, and test differences instead of eyeballing them. That is the whole pitch. See STUDY.md for the quantitative demonstration and figures.
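The bias is easy to reproduce without the library. The following is a minimal NumPy sketch, not calibstats code: `plugin_ece` and `mean_ece_when_calibrated` are hypothetical helper names, and the uniform distribution of predicted probabilities is an assumption chosen for illustration. Labels are drawn from the predicted probabilities themselves, so the true ECE is exactly 0 and any nonzero average is estimator bias.

```python
import numpy as np


def plugin_ece(probs, labels, n_bins=15):
    """Equal-width, L1 plug-in ECE for binary P(y=1) predictions."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Digitizing against the interior edges yields bin ids 0..n_bins-1.
    idx = np.digitize(probs, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            gap = abs(labels[mask].mean() - probs[mask].mean())
            ece += mask.mean() * gap  # weight each bin by its share of samples
    return ece


def mean_ece_when_calibrated(n, n_trials=300, seed=0):
    """Average plug-in ECE over many draws from a perfectly calibrated model."""
    rng = np.random.default_rng(seed)
    values = []
    for _ in range(n_trials):
        p = rng.uniform(0.0, 1.0, size=n)            # assumed forecast distribution
        y = (rng.uniform(size=n) < p).astype(float)  # labels drawn from p: true ECE = 0
        values.append(plugin_ece(p, y))
    return float(np.mean(values))


bias_by_n = {n: mean_ece_when_calibrated(n) for n in (100, 500, 5000)}
for n, value in bias_by_n.items():
    print(f"n={n:5d}  mean plug-in ECE = {value:.3f}")
```

The average stays above zero at every n and shrinks as n grows; the exact values depend on the bin count and on how the predicted probabilities are distributed.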

[Figure: ECE bias vs sample size]

Install

pip install calibstats            # core (numpy, scipy)
pip install "calibstats[viz]"     # + matplotlib for reliability_diagram (quoted so shells like zsh do not glob the brackets)

Or with uv from a clone:

uv sync --extra viz

Quick start

import calibstats as cs

# Binary: probs is 1-D P(y=1); labels in {0,1}.
# (Here a synthetic overconfident model with a known temperature.)
data = cs.make_binary(2000, temperature=2.0, seed=0)

# The whole picture, every metric with a shared bootstrap CI:
report = cs.calibration_report(data.probs, data.labels, n_boot=2000)
print(report)

Calibration report  (n=2000, 15 uniform bins)
----------------------------------------------------------
metric            estimate                95% CI
----------------------------------------------------------
ece                 0.0966      [0.0842, 0.1171]
ace                 0.0924      [0.0803, 0.1137]
mce                 0.1928      [0.1570, 0.2812]
debiased_ece        0.0970      [0.0853, 0.1242]
brier               0.1742      [0.1620, 0.1867]
nll                 0.5648      [0.5245, 0.6071]
----------------------------------------------------------
Brier decomposition (cal - res + unc = brier):
  calibration=0.0110  resolution=0.0862  uncertainty=0.2500
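For readers who want to see where the three decomposition terms come from, here is a minimal NumPy sketch of Murphy's partition over equal-width bins. It is an illustration under stated assumptions, not the library's implementation: `murphy_decomposition` is a hypothetical name, and the binning scheme is assumed to match the report's 15 uniform bins. The identity `cal - res + unc = brier` is exact only when every forecast in a bin is replaced by the bin's mean forecast, so the sketch also returns that binned Brier score.

```python
import numpy as np


def murphy_decomposition(probs, labels, n_bins=15):
    """Return (calibration, resolution, uncertainty, binned_brier)."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.digitize(probs, edges[1:-1])  # bin ids 0..n_bins-1
    base_rate = labels.mean()

    calibration = 0.0
    resolution = 0.0
    binned = np.empty_like(probs)
    for b in range(n_bins):
        mask = idx == b
        if not mask.any():
            continue
        weight = mask.mean()          # share of samples in this bin
        f_k = probs[mask].mean()      # mean forecast in the bin
        o_k = labels[mask].mean()     # observed frequency in the bin
        calibration += weight * (f_k - o_k) ** 2
        resolution += weight * (o_k - base_rate) ** 2
        binned[mask] = f_k            # forecast collapsed to its bin mean

    uncertainty = base_rate * (1.0 - base_rate)
    binned_brier = float(np.mean((binned - labels) ** 2))
    return calibration, resolution, uncertainty, binned_brier


rng = np.random.default_rng(0)
p = rng.uniform(size=2000)
y = (rng.uniform(size=2000) < p).astype(float)
cal, res, unc, bs_binned = murphy_decomposition(p, y)
print(cal - res + unc, bs_binned)  # equal up to floating-point error
```

On raw (unbinned) forecasts the sum differs slightly from the Brier score because of within-bin forecast variance, which may be why the terms in the report above sum to 0.1748 rather than exactly 0.1742.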

Individual metrics, each CI-ready

probs, labels = data.probs, data.labels   # arrays from the Quick start
cs.ece(probs, labels)                 # Expected Calibration Error (equal-width, l1)
cs.ace(probs, labels)                 # Adaptive ECE (equal-mass / quantile bins)
cs.mce(probs, labels)                 # Maximum Calibration Error
cs.debiased_ece(probs, labels)        # bias-corrected (l2) estimator
cs.brier_score(probs, labels)         # mean squared error of probabilities
cs.brier_decomposition(probs, labels) # calibration / refinement / resolution / uncertainty
cs.nll(probs, labels)                 # negative log-likelihood (log loss)

# Wrap ANY of them in a bootstrap CI:
ci = cs.bootstrap_ci(probs, labels, cs.ece, n_boot=2000)
print(ci)                # 0.0966  [0.0842, 0.1171] (95% CI)
print(ci.estimate, ci.ci, ci.se)
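The idea behind wrapping a metric in a bootstrap CI can be sketched in a few lines of NumPy. This is a plain percentile bootstrap written for illustration; the source does not say which interval type `cs.bootstrap_ci` uses, so treat the method, the helper name `percentile_bootstrap_ci`, and its return shape as assumptions.

```python
import numpy as np


def brier(probs, labels):
    """Mean squared error between predicted probabilities and 0/1 labels."""
    return float(np.mean((probs - labels) ** 2))


def percentile_bootstrap_ci(probs, labels, metric, n_boot=1000, alpha=0.05, seed=0):
    """Return (estimate, (lo, hi), se) for any metric(probs, labels) -> float."""
    rng = np.random.default_rng(seed)
    n = len(probs)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        # Resample (prob, label) pairs together, with replacement.
        idx = rng.integers(0, n, size=n)
        stats[b] = metric(probs[idx], labels[idx])
    lo, hi = np.quantile(stats, [alpha / 2.0, 1.0 - alpha / 2.0])
    return metric(probs, labels), (float(lo), float(hi)), float(stats.std(ddof=1))


rng = np.random.default_rng(1)
p = rng.uniform(size=1000)
y = (rng.uniform(size=1000) < p).astype(float)
estimate, (lo, hi), se = percentile_bootstrap_ci(p, y, brier)
print(f"{estimate:.4f}  [{lo:.4f}, {hi:.4f}]  se={se:.4f}")
```

Resampling pairs rather than probabilities and labels separately keeps each prediction attached to its outcome, which is what any calibration metric depends on.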

Reliability diagram with confidence bands

import matplotlib.pyplot as plt
cs.reliability_diagram(probs, labels, n_bins=15)
plt.savefig("reliability.png")
# Or get the bins without plotting (Wilson 95% bands included):
bins = cs.reliability_data(probs, labels, n_bins=15)
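The Wilson score interval used for the per-bin bands is a standard formula and needs only the standard library. The sketch below is an independent illustration, with `wilson_interval` as a hypothetical helper name; how `cs.reliability_data` exposes its bands is not shown in this README.

```python
import math
from statistics import NormalDist


def wilson_interval(successes, n, level=0.95):
    """Wilson score interval for a binomial proportion successes / n."""
    if n <= 0:
        raise ValueError("n must be positive")
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # about 1.96 for a 95% interval
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# A bin with 5 positives out of 10 samples.
lo, hi = wilson_interval(5, 10)
# A bin with no positives: the interval still has nonzero width.
lo0, hi0 = wilson_interval(0, 10)
print(round(lo, 4), round(hi, 4), round(lo0, 4), round(hi0, 4))
```

Unlike the normal-approximation interval, the Wilson interval stays inside [0, 1] and does not collapse to zero width when a sparsely populated bin happens to contain only one class.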

Recalibration (fit on a holdout)

ts = cs.TemperatureScaler(input="probs").fit(probs_val, labels_val)
probs_test_cal = ts.transform(probs_test)
print("recovered T:", ts.temperature_)

# Binary-only, two-parameter alternative:
ps = cs.PlattScaler(input="probs").fit(probs_val, labels_val)

TemperatureScaler handles binary (1-D) and multiclass (2-D) inputs and accepts either input="logits" or input="probs". It is accuracy-preserving: dividing the logits by a single positive temperature never changes the argmax.
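To make the mechanics concrete, here is a minimal sketch of binary temperature scaling on probability inputs. It is not the `TemperatureScaler` implementation: the helper names `fit_temperature` and `apply_temperature` are hypothetical, and a coarse grid search stands in for whatever optimizer the library uses. The synthetic data inflates the true logits by a factor of 2, so the fitted temperature should land near 2.

```python
import numpy as np


def _logit(probs):
    probs = np.clip(np.asarray(probs, dtype=float), 1e-12, 1.0 - 1e-12)
    return np.log(probs) - np.log1p(-probs)


def fit_temperature(probs, labels, grid=None):
    """Pick the temperature T > 0 that minimizes NLL of sigmoid(logit(p) / T)."""
    logits = _logit(probs)
    labels = np.asarray(labels, dtype=float)
    if grid is None:
        grid = np.exp(np.linspace(np.log(0.1), np.log(10.0), 400))

    def nll(t):
        s = logits / t
        # Stable binary log loss: log(1 + e^s) - y * s.
        return np.mean(np.logaddexp(0.0, s) - labels * s)

    losses = np.array([nll(t) for t in grid])
    return float(grid[np.argmin(losses)])


def apply_temperature(probs, t):
    """Rescale probabilities by dividing their logits by t."""
    return 1.0 / (1.0 + np.exp(-_logit(probs) / t))


rng = np.random.default_rng(0)
z = rng.normal(0.0, 1.5, size=5000)                      # true logits
labels = (rng.uniform(size=5000) < 1.0 / (1.0 + np.exp(-z))).astype(float)
probs_over = 1.0 / (1.0 + np.exp(-2.0 * z))              # overconfident: logits doubled

t_hat = fit_temperature(probs_over, labels)
probs_cal = apply_temperature(probs_over, t_hat)
print("recovered T:", round(t_hat, 2))
```

Because the temperature is positive, every probability stays on the same side of 0.5 after scaling, so thresholded predictions are unchanged. In practice the temperature would be fit on a validation split and applied to a separate test split, as in the snippet above this paragraph.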

Calibration under shift / across subgroups

groups = cs.compare_subgroups({
    "in_domain":  (p_a, y_a),
    "out_domain": (p_b, y_b),
}, n_boot=2000)
for g in groups:
    print(g.name, g.ece)            # per-group ECE ± CI

# Is the ECE difference real? (Pass paired=True when both sets of predictions score the same examples.)
test = cs.ece_difference_test(p_a, y_a, p_b, y_b, n_boot=2000)
print(test)                         # Δ with CI and a bootstrap p-value
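One way such a difference test can work is sketched below for the unpaired case. This is an assumption-laden illustration rather than the library's method: `plugin_ece` and `ece_difference_bootstrap` are hypothetical names, the interval is a percentile bootstrap, and the two-sided p-value is obtained by checking how often the resampled difference falls on either side of zero.

```python
import numpy as np


def plugin_ece(probs, labels, n_bins=15):
    """Equal-width, L1 plug-in ECE for binary P(y=1) predictions."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.digitize(probs, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            ece += mask.mean() * abs(labels[mask].mean() - probs[mask].mean())
    return ece


def ece_difference_bootstrap(p_a, y_a, p_b, y_b, n_boot=500, alpha=0.05, seed=0):
    """Return (delta, (lo, hi), p_value) for delta = ECE(b) - ECE(a)."""
    rng = np.random.default_rng(seed)
    delta = plugin_ece(p_b, y_b) - plugin_ece(p_a, y_a)
    boots = np.empty(n_boot)
    for i in range(n_boot):
        # Unpaired: resample each group independently.
        ia = rng.integers(0, len(p_a), size=len(p_a))
        ib = rng.integers(0, len(p_b), size=len(p_b))
        boots[i] = plugin_ece(p_b[ib], y_b[ib]) - plugin_ece(p_a[ia], y_a[ia])
    lo, hi = np.quantile(boots, [alpha / 2.0, 1.0 - alpha / 2.0])
    # Two-sided p-value from the smaller tail, with a +1 correction
    # so the result is never exactly zero.
    tail = min(np.sum(boots <= 0) + 1, np.sum(boots >= 0) + 1) / (n_boot + 1)
    return delta, (float(lo), float(hi)), float(min(1.0, 2.0 * tail))


rng = np.random.default_rng(0)
z_a = rng.normal(size=2000)
z_b = rng.normal(size=2000)
y_a = (rng.uniform(size=2000) < 1.0 / (1.0 + np.exp(-z_a))).astype(float)
y_b = (rng.uniform(size=2000) < 1.0 / (1.0 + np.exp(-z_b))).astype(float)
p_a = 1.0 / (1.0 + np.exp(-z_a))        # calibrated group
p_b = 1.0 / (1.0 + np.exp(-3.0 * z_b))  # overconfident group

delta, ci, p_value = ece_difference_bootstrap(p_a, y_a, p_b, y_b)
print(f"delta={delta:.3f}  CI=[{ci[0]:.3f}, {ci[1]:.3f}]  p={p_value:.3f}")
```

For a paired comparison the same row indices would be drawn for both sets of predictions in each replicate, so that per-example noise cancels in the difference.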

Multiclass

Pass a 2-D (n, n_classes) probability matrix and integer labels. Metrics use top-label (confidence) calibration: confidence = max_c p, correct = 1[argmax == label] — the standard multiclass ECE setting.

probs, labels = cs.make_multiclass(3000, n_classes=5, temperature=2.0)
cs.calibration_report(probs, labels)
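The top-label reduction described above turns a multiclass problem into the same (confidence, correct) pairs the binned metrics consume. A minimal sketch, with `top_label_reduction` as a hypothetical helper name:

```python
import numpy as np


def top_label_reduction(probs, labels):
    """Map (n, n_classes) probabilities to per-example (confidence, correct)."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    confidence = probs.max(axis=1)                            # max_c p
    correct = (probs.argmax(axis=1) == labels).astype(float)  # 1[argmax == label]
    return confidence, correct


demo_probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.6, 0.3],
    [0.3, 0.3, 0.4],
])
demo_labels = np.array([0, 2, 2])
confidence, correct = top_label_reduction(demo_probs, demo_labels)
print(confidence, correct)
```

Note that this view only checks the probability assigned to the predicted class; miscalibration in the non-argmax classes is not visible to it.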

What's implemented (and verified against references)

| Area | Functions | Verified against |
|---|---|---|
| Binned errors | `ece`, `ace`, `mce`, `debiased_ece`, `calibration_error` | hand-computed cases |
| Proper scores | `brier_score`, `brier_decomposition`, `nll` | scikit-learn `brier_score_loss`, `log_loss`; Murphy identity |
| Uncertainty | `bootstrap_ci`, `bootstrap_metrics`, `CIResult` | coverage + `1/√n` shrinkage |
| Diagrams | `reliability_diagram`, `reliability_data` | Wilson score interval |
| Recalibration | `TemperatureScaler`, `PlattScaler` | known-T recovery; sklearn `LogisticRegression` |
| Shift | `compare_subgroups`, `ece_difference_test` | known-gap detection / null calibration |

Full numerical detail and figures: STUDY.md.

Design notes

Binary convention. confidence = probs, accuracy = label — i.e. the reliability curve of predicted P(y=1) vs observed frequency, which is more informative than the max(p, 1−p) collapse.
Binning bias. Equal-width ECE is the default for comparability with the literature; ace uses equal-mass bins; debiased_ece corrects the small-sample inflation. The right move is usually to report all three.
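Holdout discipline. Recalibrators are fit/transform objects, so the data used for fitting and the data being evaluated are passed in separate, explicit calls; this makes it harder to fit and evaluate on the same data by accident.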
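A small constructed example shows why the binary convention is more informative than the max(p, 1−p) collapse. The data and the helper `plugin_ece` below are illustrative assumptions, not library code: one group is predicted at 0.3 but is positive 20% of the time, the other is predicted at 0.7 but is positive 60% of the time. Both groups are off by 0.1, yet after the collapse they land in the same confidence bin with opposite signs and cancel.

```python
import numpy as np


def plugin_ece(confidence, outcome, n_bins=15):
    """Equal-width, L1 plug-in ECE between a score and a 0/1 outcome."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.digitize(confidence, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            ece += mask.mean() * abs(outcome[mask].mean() - confidence[mask].mean())
    return ece


# 1000 examples at p=0.3 (200 positives), 1000 at p=0.7 (600 positives).
probs = np.concatenate([np.full(1000, 0.3), np.full(1000, 0.7)])
labels = np.concatenate([
    np.repeat([1.0, 0.0], [200, 800]),
    np.repeat([1.0, 0.0], [600, 400]),
])

# Binary convention: confidence = P(y=1), accuracy = label.
ece_binary = plugin_ece(probs, labels)

# Collapsed convention: confidence = max(p, 1-p), accuracy = 1[prediction == label].
predicted = (probs >= 0.5).astype(float)
confidence = np.maximum(probs, 1.0 - probs)
correct = (predicted == labels).astype(float)
ece_collapsed = plugin_ece(confidence, correct)

print(f"binary convention: {ece_binary:.3f}   collapsed: {ece_collapsed:.3f}")
```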

What real model probabilities would add

This toolkit is exercised here on synthetic predictors — deliberately, since that gives a known ground-truth calibration to validate the estimators against. The code path for real (probability, label) arrays is identical. Plugging in real LLM or classifier outputs would extend the study in ways synthetic data cannot fully mimic:

Confidence mass piled near 1.0. Real classifiers (and LLM token probabilities) put most of their mass in the top bin, so equal-width ECE is dominated by one bin while equal-mass (ACE) and the bias correction matter far more — exactly the regime where a bare ECE is most misleading.
Genuine distribution shift. Replacing the tuned-temperature subgroups with real in-/out-of-domain slices (e.g. a model evaluated on a new dataset) would show miscalibration that isn't a single global temperature — where temperature scaling helps only partially and the subgroup tests earn their keep.
Class imbalance and rare events, where the Brier uncertainty term and the reliability/resolution split become the interesting story.
Public sources of (prob, label) pairs that need no API key (released model logits, public probabilistic-forecast archives) would let the study use real outcomes; the API needs no changes to ingest them.

None of these require new metrics — they are the same estimators on heavier-tailed, shifted data, which is precisely the setting where reporting ECE ± CI instead of a bare ECE changes the conclusion.
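The "confidence mass piled near 1.0" point can be illustrated with bin occupancy alone. In this sketch the Beta(20, 1) distribution is an assumed stand-in for a confident model's top-label probabilities; it is not drawn from any real model.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_bins = 5000, 15
conf = rng.beta(20.0, 1.0, size=n)  # assumed shape: most mass close to 1.0

# Equal-width bins, as in the default ECE.
width_edges = np.linspace(0.0, 1.0, n_bins + 1)
width_counts, _ = np.histogram(conf, bins=width_edges)

# Equal-mass (quantile) bins, as in adaptive ECE.
mass_edges = np.quantile(conf, np.linspace(0.0, 1.0, n_bins + 1))
mass_counts, _ = np.histogram(conf, bins=mass_edges)

top_share = width_counts[-1] / n
print(f"share of samples in the top equal-width bin: {top_share:.2f}")
print("equal-width counts:", width_counts.tolist())
print("equal-mass counts: ", mass_counts.tolist())
```

With equal-width bins most of the estimate rests on a single crowded bin while the remaining bins are nearly empty and noisy; quantile bins spread the samples evenly, which is the motivation for reporting ACE alongside ECE.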

Development

uv sync --extra viz
uv run pytest                 # 33 tests
uv run ruff check . && uv run ruff format --check .
uv run mypy src/calibstats
uv run python study/run_study.py

License

MIT — see LICENSE.

References

Naeini, Cooper & Hauskrecht (2015), Obtaining Well Calibrated Probabilities Using Bayesian Binning (ECE).
Guo, Pleiss, Sun & Weinberger (2017), On Calibration of Modern Neural Networks (temperature scaling, reliability diagrams).
Nixon et al. (2019), Measuring Calibration in Deep Learning (adaptive ECE).
Kumar, Liang & Ma (2019), Verified Uncertainty Calibration (debiased estimators).
Murphy (1973), A New Vector Partition of the Probability Score (Brier decomposition).
Platt (1999), Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods (Platt scaling).

calibstats is part of a statistical-rigor-for-AI-evals toolkit: deltagate (paired-delta validation for eval comparisons), agentrel (reliability stats for stochastic agent evals), leaderboard-ci (leaderboard re-ranking with CIs and tie bands). Full portfolio: github.com/yongzhe2160cs.

Version 0.1.0, released Jun 13, 2026.
