calibstats

Calibration metrics with confidence intervals — because a bare ECE number is not enough.

Python 3.10+ · License: MIT

A model says it is 90% confident. Is it right 90% of the time? Calibration asks whether predicted probabilities match observed frequencies. The standard metric, ECE (Expected Calibration Error), is almost always reported as a single number — and that number is biased and noisy at the sample sizes real evals run on.
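For reference, the binned plug-in estimator usually meant by "ECE" can be written as below. This is the standard textbook form from the references at the end of this README, not necessarily the exact variant calibstats computes. The symbols B (number of bins), B_b (the set of examples in bin b), and n_b (its size) are notation introduced here for illustration.

```latex
\mathrm{ECE}
  = \sum_{b=1}^{B} \frac{n_b}{n}\,
    \bigl|\operatorname{acc}(B_b) - \operatorname{conf}(B_b)\bigr|,
\qquad
\operatorname{acc}(B_b) = \frac{1}{n_b}\sum_{i \in B_b} y_i,
\quad
\operatorname{conf}(B_b) = \frac{1}{n_b}\sum_{i \in B_b} p_i
```

Each term is an absolute value, so per-bin sampling noise adds up instead of cancelling. That is the source of the small-sample bias discussed in the next section.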

calibstats reports every metric as estimate ± CI, ships a bias-corrected estimator, and adds reliability diagrams with confidence bands, post-hoc recalibration, and subgroup-shift significance tests. It is framework-agnostic: feed it (predicted_prob, label) arrays from any model.

Why calibration needs confidence intervals

On a perfectly calibrated model (true ECE = 0), the plug-in ECE estimator still reports roughly 0.12 at n = 100 and 0.05 at n = 500 on average. That is pure estimator bias: each bin contributes |accuracy − confidence|, which is never negative, so sampling noise in finite bins cannot cancel out and the average cannot reach zero. So:

  • An "ECE = 0.05" headline can mean a perfectly calibrated model at small n.
  • ECEs are not comparable across papers unless n and bin count match.
  • A point estimate with no interval hides whether you have signal or noise — at n = 200 the 95% CI for ECE is ~0.10 wide, often wider than the value itself.

The fix is not exotic: attach a bootstrap CI, correct the bias, and test differences instead of eyeballing them. That is the whole pitch. See STUDY.md for the quantitative demonstration and figures.
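The bias claim above can be checked with a few lines of NumPy, independent of calibstats. The sketch below assumes confidences drawn uniformly on [0, 1], 15 equal-width bins, and the binary convention (confidence = P(y=1)). `plugin_ece` and `mean_ece_when_calibrated` are hypothetical helpers written for this illustration, not the library's functions.

```python
import numpy as np


def plugin_ece(probs, labels, n_bins=15):
    """Equal-width, L1 plug-in ECE (binary convention: confidence = P(y=1))."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    # Bin index per prediction; p == 1.0 is folded into the last bin.
    idx = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    # Per-bin sum of (label - prob). Dividing |sum| by n equals
    # sum_b (n_b / n) * |acc_b - conf_b|; empty bins contribute zero.
    signed = np.bincount(idx, weights=labels - probs, minlength=n_bins)
    return float(np.abs(signed).sum() / probs.size)


def mean_ece_when_calibrated(n, n_trials=300, seed=0):
    """Average plug-in ECE over repeated draws from a perfectly calibrated model."""
    rng = np.random.default_rng(seed)
    values = []
    for _ in range(n_trials):
        p = rng.uniform(0.0, 1.0, size=n)
        y = rng.uniform(size=n) < p  # labels drawn with exactly the stated probability
        values.append(plugin_ece(p, y))
    return float(np.mean(values))


bias_100 = mean_ece_when_calibrated(100)
bias_500 = mean_ece_when_calibrated(500)
print(f"mean plug-in ECE at n=100: {bias_100:.3f}")
print(f"mean plug-in ECE at n=500: {bias_500:.3f}")
```

The true ECE is zero in both runs, so whatever is printed is bias. The exact values depend on the confidence distribution and the bin count, which is one reason bare ECEs are hard to compare across papers.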

[Figure: ECE bias vs sample size]

Install

pip install calibstats          # core (numpy, scipy)
pip install "calibstats[viz]"   # + matplotlib for reliability_diagram (quotes keep shells like zsh from expanding the brackets)

Or with uv from a clone:

uv sync --extra viz

Quick start

import calibstats as cs

# Binary: probs is 1-D P(y=1); labels in {0,1}.
# (Here a synthetic overconfident model with a known temperature.)
data = cs.make_binary(2000, temperature=2.0, seed=0)

# The whole picture, every metric with a shared bootstrap CI:
report = cs.calibration_report(data.probs, data.labels, n_boot=2000)
print(report)

Output:

Calibration report  (n=2000, 15 uniform bins)
----------------------------------------------------------
metric            estimate                95% CI
----------------------------------------------------------
ece                 0.0966      [0.0842, 0.1171]
ace                 0.0924      [0.0803, 0.1137]
mce                 0.1928      [0.1570, 0.2812]
debiased_ece        0.0970      [0.0853, 0.1242]
brier               0.1742      [0.1620, 0.1867]
nll                 0.5648      [0.5245, 0.6071]
----------------------------------------------------------
Brier decomposition (cal - res + unc = brier):
  calibration=0.0110  resolution=0.0862  uncertainty=0.2500
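The identity in the last two lines is Murphy's partition of the Brier score. A minimal sketch follows, assuming forecasts are grouped by their distinct values (the case where the identity is exact); `murphy_decomposition` is a hypothetical helper, not the calibstats implementation. With continuous forecasts grouped into bins the identity holds only approximately. That would be consistent with the report above, where 0.0110 - 0.0862 + 0.2500 = 0.1748 against a Brier score of 0.1742, though this is an inference and not something the report states.

```python
import numpy as np


def murphy_decomposition(probs, labels):
    """Group by distinct forecast value; return (calibration, resolution, uncertainty)."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    base_rate = labels.mean()
    calibration = resolution = 0.0
    for value in np.unique(probs):
        mask = probs == value
        weight = mask.mean()             # share of examples with this forecast
        observed = labels[mask].mean()   # observed frequency in this group
        calibration += weight * (value - observed) ** 2
        resolution += weight * (observed - base_rate) ** 2
    uncertainty = base_rate * (1.0 - base_rate)
    return calibration, resolution, uncertainty


rng = np.random.default_rng(0)
probs = rng.choice([0.1, 0.3, 0.5, 0.7, 0.9], size=5000)  # discrete forecasts
labels = (rng.uniform(size=probs.size) < probs).astype(float)

cal, res, unc = murphy_decomposition(probs, labels)
brier = float(np.mean((probs - labels) ** 2))
print(f"cal - res + unc = {cal - res + unc:.6f}")
print(f"brier           = {brier:.6f}")
```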

Individual metrics, each CI-ready

cs.ece(probs, labels)                 # Expected Calibration Error (equal-width, l1)
cs.ace(probs, labels)                 # Adaptive ECE (equal-mass / quantile bins)
cs.mce(probs, labels)                 # Maximum Calibration Error
cs.debiased_ece(probs, labels)        # bias-corrected (l2) estimator
cs.brier_score(probs, labels)         # mean squared error of probabilities
cs.brier_decomposition(probs, labels) # calibration / refinement / resolution / uncertainty
cs.nll(probs, labels)                 # negative log-likelihood (log loss)

# Wrap ANY of them in a bootstrap CI:
ci = cs.bootstrap_ci(probs, labels, cs.ece, n_boot=2000)
print(ci)                # 0.0966  [0.0842, 0.1171] (95% CI)
print(ci.estimate, ci.ci, ci.se)
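Conceptually, wrapping a metric in a bootstrap CI means resampling (prob, label) pairs with replacement and recomputing the metric on each resample. The sketch below is a plain percentile bootstrap in NumPy. Whether calibstats uses the percentile method or another variant is not stated in this README, so treat `percentile_bootstrap_ci` as a hypothetical stand-in for illustration only.

```python
import numpy as np


def brier(probs, labels):
    """Mean squared error between predicted probability and the 0/1 label."""
    return float(np.mean((probs - labels) ** 2))


def percentile_bootstrap_ci(probs, labels, metric, n_boot=2000, level=0.95, seed=0):
    """Return (estimate, (lo, hi), standard error) for any metric(probs, labels)."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    rng = np.random.default_rng(seed)
    n = labels.size
    reps = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample pairs together, with replacement
        reps[b] = metric(probs[idx], labels[idx])
    alpha = 1.0 - level
    lo, hi = np.quantile(reps, [alpha / 2, 1.0 - alpha / 2])
    return metric(probs, labels), (float(lo), float(hi)), float(reps.std(ddof=1))


rng = np.random.default_rng(1)
p = rng.uniform(size=1000)
y = (rng.uniform(size=1000) < p).astype(float)
estimate, (lo, hi), se = percentile_bootstrap_ci(p, y, brier, n_boot=1000)
print(f"{estimate:.4f}  [{lo:.4f}, {hi:.4f}]  se={se:.4f}")
```

Because the metric is passed in as a callable, the same wrapper works for any function with a `(probs, labels)` signature. For a biased metric such as plug-in ECE, the percentile interval inherits that bias, which is one argument for pairing the CI with a bias-corrected estimator.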

Reliability diagram with confidence bands

import matplotlib.pyplot as plt
cs.reliability_diagram(probs, labels, n_bins=15)
plt.savefig("reliability.png")
# Or get the bins without plotting (Wilson 95% bands included):
bins = cs.reliability_data(probs, labels, n_bins=15)
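The per-bin bands are Wilson score intervals on the observed frequency in each bin. The sketch below shows that formula using only the standard library. `wilson_interval` is a hypothetical helper, and how calibstats treats empty bins is not stated here, so the `n == 0` branch is an assumption.

```python
import math
from statistics import NormalDist


def wilson_interval(successes, n, level=0.95):
    """Wilson score interval for a binomial proportion."""
    if n == 0:
        return (0.0, 1.0)  # assumption: an empty bin carries no information
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    phat = successes / n
    denom = 1.0 + z * z / n
    center = (phat + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(phat * (1.0 - phat) / n + z * z / (4.0 * n * n)) / denom
    return (center - half, center + half)


# A bin holding 10 predictions, 8 of them positive.
lo, hi = wilson_interval(8, 10)
print(f"8/10 positives: [{lo:.3f}, {hi:.3f}]")
```

Unlike the simpler normal-approximation interval, the Wilson interval stays inside [0, 1] and remains usable for small bins and for observed frequencies of 0 or 1, which suits sparsely populated reliability bins.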

Recalibration (fit on a holdout)

ts = cs.TemperatureScaler(input="probs").fit(probs_val, labels_val)
probs_test_cal = ts.transform(probs_test)
print("recovered T:", ts.temperature_)

# Binary-only, two-parameter alternative:
ps = cs.PlattScaler(input="probs").fit(probs_val, labels_val)

TemperatureScaler handles binary (1-D) and multiclass (2-D) and accepts either input="logits" or input="probs". It is accuracy-preserving — argmax never changes.
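For the binary case, temperature scaling on probabilities amounts to dividing the logit by a single scalar T fitted by minimising NLL on a holdout. The sketch below illustrates the idea with a grid search in NumPy. It is not the `TemperatureScaler` implementation; the helper names are hypothetical, and the convention that T divides the logit is an assumption of this example.

```python
import numpy as np


def logit(p, eps=1e-12):
    p = np.clip(p, eps, 1.0 - eps)
    return np.log(p) - np.log1p(-p)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def nll(probs, labels, eps=1e-12):
    p = np.clip(probs, eps, 1.0 - eps)
    return float(-np.mean(labels * np.log(p) + (1.0 - labels) * np.log1p(-p)))


def fit_temperature(probs, labels):
    """Grid-search the T > 0 that minimises the NLL of sigmoid(logit(p) / T)."""
    z = logit(probs)
    grid = np.exp(np.linspace(np.log(0.25), np.log(8.0), 300))
    losses = [nll(sigmoid(z / t), labels) for t in grid]
    return float(grid[int(np.argmin(losses))])


def apply_temperature(probs, t):
    return sigmoid(logit(probs) / t)


rng = np.random.default_rng(0)
z_true = rng.normal(0.0, 1.5, size=20_000)
labels = (rng.uniform(size=z_true.size) < sigmoid(z_true)).astype(float)
probs = sigmoid(2.0 * z_true)  # overconfident: logits inflated by a factor of 2

# Fit on one half, evaluate on the other.
val, test = slice(0, 10_000), slice(10_000, None)
t_hat = fit_temperature(probs[val], labels[val])
probs_test_cal = apply_temperature(probs[test], t_hat)
print(f"recovered T: {t_hat:.2f}")
print(f"test NLL before: {nll(probs[test], labels[test]):.4f}")
print(f"test NLL after:  {nll(probs_test_cal, labels[test]):.4f}")
```

Dividing the logit by a positive T never changes its sign, so each prediction stays on the same side of 0.5. That is the binary version of the accuracy-preserving property.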

Calibration under shift / across subgroups

groups = cs.compare_subgroups({
    "in_domain":  (p_a, y_a),
    "out_domain": (p_b, y_b),
}, n_boot=2000)
for g in groups:
    print(g.name, g.ece)            # per-group ECE ± CI

# Is the ECE difference real? (paired=True when both score the same examples)
test = cs.ece_difference_test(p_a, y_a, p_b, y_b, n_boot=2000)
print(test)                         # Δ with CI and a bootstrap p-value
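One way to test an ECE difference between two independent groups is to bootstrap each group separately and look at the distribution of the difference. The sketch below covers the unpaired case only. The p-value uses one common convention (recentre the bootstrap distribution at zero) and is not claimed to match what `ece_difference_test` computes; all helper names are hypothetical.

```python
import numpy as np


def plugin_ece(probs, labels, n_bins=15):
    """Equal-width, L1 plug-in ECE (binary convention)."""
    idx = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    signed = np.bincount(idx, weights=labels - probs, minlength=n_bins)
    return float(np.abs(signed).sum() / probs.size)


def bootstrap_ece_difference(p_a, y_a, p_b, y_b, n_boot=500, level=0.95, seed=0):
    """Return (ECE_b - ECE_a, percentile CI, two-sided bootstrap p-value)."""
    rng = np.random.default_rng(seed)
    observed = plugin_ece(p_b, y_b) - plugin_ece(p_a, y_a)
    deltas = np.empty(n_boot)
    for i in range(n_boot):
        ia = rng.integers(0, p_a.size, size=p_a.size)  # resample group A
        ib = rng.integers(0, p_b.size, size=p_b.size)  # resample group B independently
        deltas[i] = plugin_ece(p_b[ib], y_b[ib]) - plugin_ece(p_a[ia], y_a[ia])
    alpha = 1.0 - level
    lo, hi = np.quantile(deltas, [alpha / 2, 1.0 - alpha / 2])
    # Recentre at zero to mimic the null; add 1 so the p-value is never exactly 0.
    extreme = np.abs(deltas - deltas.mean()) >= abs(observed)
    p_value = float((1 + extreme.sum()) / (1 + n_boot))
    return observed, (float(lo), float(hi)), p_value


rng = np.random.default_rng(0)
n = 2000
# Group A: calibrated. Group B: same kind of data, but logits inflated by 2.
p_a = rng.uniform(0.02, 0.98, size=n)
y_a = (rng.uniform(size=n) < p_a).astype(float)
q_b = rng.uniform(0.02, 0.98, size=n)
y_b = (rng.uniform(size=n) < q_b).astype(float)
p_b = 1.0 / (1.0 + np.exp(-2.0 * np.log(q_b / (1.0 - q_b))))

delta, (lo, hi), p_value = bootstrap_ece_difference(p_a, y_a, p_b, y_b)
print(f"delta ECE = {delta:.4f}  [{lo:.4f}, {hi:.4f}]  p = {p_value:.4f}")
```

When both groups score the same examples, resampling a single shared index (a paired bootstrap) would be the natural variant, since it keeps the per-example correlation between the two sets of predictions.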

Multiclass

Pass a 2-D (n, n_classes) probability matrix and integer labels. Metrics use top-label (confidence) calibration: confidence = max_c p, correct = 1[argmax == label] — the standard multiclass ECE setting.

probs, labels = cs.make_multiclass(3000, n_classes=5, temperature=2.0)
cs.calibration_report(probs, labels)
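The top-label reduction described above turns a multiclass problem into a binary (confidence, correct) problem before any binning happens. A minimal NumPy sketch of that step follows; `top_label` is a hypothetical helper, not a calibstats function.

```python
import numpy as np


def top_label(probs, labels):
    """Reduce (n, n_classes) probabilities to per-example (confidence, correct)."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    confidence = probs.max(axis=1)                           # max_c p
    correct = (probs.argmax(axis=1) == labels).astype(float)  # 1[argmax == label]
    return confidence, correct


probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.5, 0.4],
    [0.3, 0.3, 0.4],
])
labels = np.array([0, 2, 2])
confidence, correct = top_label(probs, labels)
print(confidence)
print(correct)
```

The resulting pair of 1-D arrays can be binned exactly like the binary case, with mean `correct` as the per-bin accuracy and mean `confidence` as the per-bin confidence.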

What's implemented (and verified against references)

| Area | Functions | Verified against |
|---|---|---|
| Binned errors | ece, ace, mce, debiased_ece, calibration_error | hand-computed cases |
| Proper scores | brier_score, brier_decomposition, nll | scikit-learn brier_score_loss, log_loss; Murphy identity |
| Uncertainty | bootstrap_ci, bootstrap_metrics, CIResult | coverage + 1/√n shrinkage |
| Diagrams | reliability_diagram, reliability_data | Wilson score interval |
| Recalibration | TemperatureScaler, PlattScaler | known-T recovery; sklearn LogisticRegression |
| Shift | compare_subgroups, ece_difference_test | known-gap detection / null calibration |

Full numerical detail and figures: STUDY.md.

Design notes

  • Binary convention. confidence = probs, accuracy = label — i.e. the reliability curve of predicted P(y=1) vs observed frequency, which is more informative than the max(p, 1−p) collapse.
  • Binning bias. Equal-width ECE is the default for comparability with the literature; ace uses equal-mass bins; debiased_ece corrects the small-sample inflation. The right move is usually to report all three.
  • Holdout discipline. Recalibrators are fit/transform objects, which keeps the fitting split and the evaluation split explicit and makes it harder to accidentally fit and evaluate on the same data.
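The difference between equal-width and equal-mass binning is easiest to see from bin counts. The sketch below uses a hypothetical Beta(8, 1) confidence distribution, chosen only because it piles mass near 1.0, and plain NumPy binning. It is not how `ece` or `ace` build their bins internally.

```python
import numpy as np

rng = np.random.default_rng(0)
conf = rng.beta(8.0, 1.0, size=3000)  # hypothetical confidences piled near 1.0
n_bins = 15

# Equal-width: fixed edges on [0, 1], as in the classic ECE.
width_edges = np.linspace(0.0, 1.0, n_bins + 1)
# Equal-mass: edges at empirical quantiles, so each bin holds about n / n_bins points.
mass_edges = np.quantile(conf, np.linspace(0.0, 1.0, n_bins + 1))

width_counts, _ = np.histogram(conf, bins=width_edges)
mass_counts, _ = np.histogram(conf, bins=mass_edges)

print("equal-width counts:", width_counts.tolist())
print("equal-mass counts: ", mass_counts.tolist())
```

With equal-width bins most predictions land in the top bin or two while the low bins are nearly empty, so a few bins dominate the estimate. Equal-mass bins spread the sample evenly, at the cost of bin edges that differ from dataset to dataset.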

What real model probabilities would add

This toolkit is exercised here on synthetic predictors — deliberately, since that gives a known ground-truth calibration to validate the estimators against. The code path for real (probability, label) arrays is identical. Plugging in real LLM or classifier outputs would extend the study in ways synthetic data cannot fully mimic:

  • Confidence mass piled near 1.0. Real classifiers (and LLM token probabilities) put most of their mass in the top bin, so equal-width ECE is dominated by one bin while equal-mass (ACE) and the bias correction matter far more — exactly the regime where a bare ECE is most misleading.
  • Genuine distribution shift. Replacing the tuned-temperature subgroups with real in-/out-of-domain slices (e.g. a model evaluated on a new dataset) would show miscalibration that isn't a single global temperature — where temperature scaling helps only partially and the subgroup tests earn their keep.
  • Class imbalance and rare events, where the Brier uncertainty term and the reliability/resolution split become the interesting story.
  • Keyless public sources of (prob, label) pairs, such as released model logits and public probabilistic-forecast archives, would let the study use real outcomes; the API needs no changes to ingest them.

None of these require new metrics — they are the same estimators on heavier-tailed, shifted data, which is precisely the setting where reporting ECE ± CI instead of a bare ECE changes the conclusion.

Development

uv sync --extra viz
uv run pytest                 # 33 tests
uv run ruff check . && uv run ruff format --check .
uv run mypy src/calibstats
uv run python study/run_study.py

License

MIT — see LICENSE.

References

  • Naeini, Cooper & Hauskrecht (2015), Obtaining Well Calibrated Probabilities Using Bayesian Binning (ECE).
  • Guo, Pleiss, Sun & Weinberger (2017), On Calibration of Modern Neural Networks (temperature scaling, reliability diagrams).
  • Nixon et al. (2019), Measuring Calibration in Deep Learning (adaptive ECE).
  • Kumar, Liang & Ma (2019), Verified Uncertainty Calibration (debiased estimators).
  • Murphy (1973), A New Vector Partition of the Probability Score (Brier decomposition).
  • Platt (1999), Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods (Platt scaling).

calibstats is part of a statistical-rigor-for-AI-evals toolkit: deltagate (paired-delta validation for eval comparisons), agentrel (reliability stats for stochastic agent evals), leaderboard-ci (leaderboard re-ranking with CIs and tie bands). Full portfolio: github.com/yongzhe2160cs.
