calibstats
Calibration metrics with confidence intervals — because a bare ECE number is not enough.
A model says it is 90% confident. Is it right 90% of the time? Calibration asks whether predicted probabilities match observed frequencies. The standard metric, ECE (Expected Calibration Error), is almost always reported as a single number — and that number is biased and noisy at the sample sizes real evals run on.
calibstats reports every metric as estimate ± CI, ships a bias-corrected
estimator, and adds reliability diagrams with confidence bands, post-hoc
recalibration, and subgroup-shift significance tests. It is framework-agnostic:
feed it (predicted_prob, label) arrays from any model.
Why calibration needs confidence intervals
On a perfectly calibrated model (true ECE = 0), the plug-in ECE estimator
still reports roughly 0.12 at n = 100 and 0.05 at n = 500 on average. That is pure
estimator bias: each bin contributes |accuracy − confidence| ≥ 0, so sampling
noise in finite bins cannot average out to zero. So:
- An "ECE = 0.05" headline can mean a perfectly calibrated model at small n.
- ECEs are not comparable across papers unless n and bin count match.
- A point estimate with no interval hides whether you have signal or noise — at n = 200 the 95% CI for ECE is ~0.10 wide, often wider than the value itself.
The fix is not exotic: attach a bootstrap CI, correct the bias, and test differences instead of eyeballing them. That is the whole pitch. See STUDY.md for the quantitative demonstration and figures.
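The bias described above can be reproduced with a few lines of NumPy. This is a minimal sketch, not the calibstats implementation: `plugin_ece` and `mean_ece_under_perfect_calibration` are hypothetical helper names, and the setup (uniformly distributed forecasts, 15 equal-width bins) is an assumption, so the printed values need not match the figures quoted above.

```python
import numpy as np


def plugin_ece(probs, labels, n_bins=15):
    """Equal-width binned ECE for binary P(y=1) forecasts (illustrative only)."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Digitizing against the interior edges yields bin ids 0..n_bins-1;
    # a forecast of exactly 1.0 lands in the last bin.
    idx = np.digitize(probs, edges[1:-1])
    total = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            gap = abs(labels[mask].mean() - probs[mask].mean())
            total += mask.mean() * gap  # weight by the bin's share of samples
    return total


def mean_ece_under_perfect_calibration(n, n_trials=200, seed=0):
    """Average plug-in ECE when labels are drawn from the forecasts themselves."""
    rng = np.random.default_rng(seed)
    values = []
    for _ in range(n_trials):
        p = rng.uniform(0.0, 1.0, size=n)
        y = rng.uniform(size=n) < p  # true ECE is 0 by construction
        values.append(plugin_ece(p, y))
    return float(np.mean(values))


bias_by_n = {n: mean_ece_under_perfect_calibration(n) for n in (100, 500, 5000)}
for n, value in bias_by_n.items():
    print(f"n={n:5d}  mean plug-in ECE={value:.3f}")
```

The averages shrink as n grows but stay above zero, which is the point: the estimator reports miscalibration that is not there.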
Install
pip install calibstats # core (numpy, scipy)
pip install "calibstats[viz]" # + matplotlib for reliability_diagram (quoted so shells like zsh do not expand the brackets)
Or with uv from a clone:
uv sync --extra viz
Quick start
import calibstats as cs
# Binary: probs is 1-D P(y=1); labels in {0,1}.
# (Here a synthetic overconfident model with a known temperature.)
data = cs.make_binary(2000, temperature=2.0, seed=0)
# The whole picture, every metric with a shared bootstrap CI:
report = cs.calibration_report(data.probs, data.labels, n_boot=2000)
print(report)
Calibration report (n=2000, 15 uniform bins)
----------------------------------------------------------
metric estimate 95% CI
----------------------------------------------------------
ece 0.0966 [0.0842, 0.1171]
ace 0.0924 [0.0803, 0.1137]
mce 0.1928 [0.1570, 0.2812]
debiased_ece 0.0970 [0.0853, 0.1242]
brier 0.1742 [0.1620, 0.1867]
nll 0.5648 [0.5245, 0.6071]
----------------------------------------------------------
Brier decomposition (cal - res + unc = brier):
calibration=0.0110 resolution=0.0862 uncertainty=0.2500
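For readers unfamiliar with the identity in the last two lines of the report, here is a hedged sketch of a binned Murphy decomposition. `murphy_decomposition` is a hypothetical name, and the equal-width binning is an assumption rather than a description of calibstats internals. The identity cal − res + unc = brier is exact only when every forecast in a bin has the same value; with continuous forecasts it holds approximately.

```python
import numpy as np


def murphy_decomposition(probs, labels, n_bins=15):
    """Return (calibration, resolution, uncertainty) from equal-width bins."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    n = len(labels)
    base_rate = labels.mean()
    idx = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    cal = 0.0
    res = 0.0
    for b in np.unique(idx):
        mask = idx == b
        weight = mask.sum() / n
        mean_forecast = probs[mask].mean()
        observed_rate = labels[mask].mean()
        cal += weight * (mean_forecast - observed_rate) ** 2
        res += weight * (observed_rate - base_rate) ** 2
    unc = base_rate * (1.0 - base_rate)
    return float(cal), float(res), float(unc)


# Forecasts restricted to three values, so each bin holds one forecast value
# and the identity holds up to floating-point error.
rng = np.random.default_rng(0)
f = rng.choice([0.1, 0.5, 0.9], size=2000)
y = (rng.uniform(size=2000) < f).astype(float)

cal, res, unc = murphy_decomposition(f, y)
brier = float(np.mean((f - y) ** 2))
print(f"cal - res + unc = {cal - res + unc:.6f}, brier = {brier:.6f}")
```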
Individual metrics, each CI-ready
cs.ece(probs, labels) # Expected Calibration Error (equal-width, l1)
cs.ace(probs, labels) # Adaptive ECE (equal-mass / quantile bins)
cs.mce(probs, labels) # Maximum Calibration Error
cs.debiased_ece(probs, labels) # bias-corrected (l2) estimator
cs.brier_score(probs, labels) # mean squared error of probabilities
cs.brier_decomposition(probs, labels) # calibration / refinement / resolution / uncertainty
cs.nll(probs, labels) # negative log-likelihood (log loss)
# Wrap ANY of them in a bootstrap CI:
ci = cs.bootstrap_ci(probs, labels, cs.ece, n_boot=2000)
print(ci) # 0.0966 [0.0842, 0.1171] (95% CI)
print(ci.estimate, ci.ci, ci.se)
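To make the idea behind a bootstrap CI concrete, the following is a minimal percentile-bootstrap sketch. It is an assumption that a plain percentile interval is a reasonable stand-in here; the README does not say which bootstrap variant `cs.bootstrap_ci` uses. `percentile_bootstrap_ci` and `brier` are hypothetical names written for this example.

```python
import numpy as np


def brier(probs, labels):
    """Mean squared error between forecasts and 0/1 outcomes."""
    return float(np.mean((probs - labels) ** 2))


def percentile_bootstrap_ci(probs, labels, metric, n_boot=1000, alpha=0.05, seed=0):
    """Resample (prob, label) pairs with replacement and take percentile bounds."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(labels)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # indices keep each pair together
        stats[b] = metric(probs[idx], labels[idx])
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    estimate = metric(probs, labels)
    return estimate, (float(lo), float(hi)), float(stats.std(ddof=1))


rng = np.random.default_rng(1)
p = rng.uniform(0.0, 1.0, size=1000)
y = (rng.uniform(size=1000) < p).astype(float)

estimate, (lo, hi), se = percentile_bootstrap_ci(p, y, brier)
print(f"brier = {estimate:.4f} [{lo:.4f}, {hi:.4f}], se = {se:.4f}")
```

Any function with the signature `metric(probs, labels) -> float` can be passed in, which mirrors the "wrap any metric" pattern shown above.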
Reliability diagram with confidence bands
import matplotlib.pyplot as plt
cs.reliability_diagram(probs, labels, n_bins=15)
plt.savefig("reliability.png")
# Or get the bins without plotting (Wilson 95% bands included):
bins = cs.reliability_data(probs, labels, n_bins=15)
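The Wilson score interval mentioned above has a closed form, sketched here for a single bin with k positives out of n samples. `wilson_interval` is a hypothetical helper, not a calibstats function, and the z value assumes a two-sided 95% interval.

```python
import math


def wilson_interval(k, n, z=1.959963984540054):
    """Wilson score interval for k successes in n trials (default: 95%)."""
    if n == 0:
        return (0.0, 1.0)  # no data in the bin: the interval is uninformative
    p_hat = k / n
    denom = 1.0 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return (center - half, center + half)


for k, n in [(0, 10), (5, 10), (90, 100)]:
    lo, hi = wilson_interval(k, n)
    print(f"{k}/{n}: [{lo:.4f}, {hi:.4f}]")
```

Unlike the normal-approximation interval, the Wilson interval stays inside [0, 1] and does not collapse to zero width when a bin has 0% or 100% observed frequency, which matters for sparsely populated bins.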
Recalibration (fit on a holdout)
ts = cs.TemperatureScaler(input="probs").fit(probs_val, labels_val)
probs_test_cal = ts.transform(probs_test)
print("recovered T:", ts.temperature_)
# Binary-only, two-parameter alternative:
ps = cs.PlattScaler(input="probs").fit(probs_val, labels_val)
TemperatureScaler handles binary (1-D) and multiclass (2-D) and accepts either
input="logits" or input="probs". It is accuracy-preserving — argmax never
changes.
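As an illustration of what temperature scaling does, the sketch below fits a single temperature on multiclass logits by minimizing NLL over a coarse grid. This is an assumption-laden toy, not the `TemperatureScaler` implementation: `fit_temperature` is a hypothetical name, and a real implementation would more likely use a proper 1-D optimizer than a grid.

```python
import numpy as np


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def nll_at_temperature(logits, labels, temperature):
    p = softmax(logits / temperature)
    return float(-np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12)))


def fit_temperature(logits, labels, grid=None):
    """Pick the temperature with the lowest NLL from a log-spaced grid."""
    if grid is None:
        grid = np.logspace(-1, 1, 201)  # candidate T from 0.1 to 10
    losses = [nll_at_temperature(logits, labels, t) for t in grid]
    return float(grid[int(np.argmin(losses))])


# Synthetic check: sample labels from calibrated logits, then sharpen the
# logits by a known factor so the model becomes overconfident.
rng = np.random.default_rng(0)
n, k, true_t = 5000, 5, 2.0
clean_logits = rng.normal(size=(n, k)) * 2.0
cdf = softmax(clean_logits).cumsum(axis=1)
u = rng.uniform(size=(n, 1))
labels = np.minimum((cdf < u).sum(axis=1), k - 1)  # inverse-CDF sampling
overconfident_logits = clean_logits * true_t

t_hat = fit_temperature(overconfident_logits, labels)
print("recovered T:", round(t_hat, 3))

# Dividing by a positive T does not reorder the logits, so argmax is unchanged.
same_argmax = (overconfident_logits / t_hat).argmax(axis=1) == overconfident_logits.argmax(axis=1)
print("argmax preserved:", bool(same_argmax.all()))
```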
Calibration under shift / across subgroups
groups = cs.compare_subgroups({
"in_domain": (p_a, y_a),
"out_domain": (p_b, y_b),
}, n_boot=2000)
for g in groups:
print(g.name, g.ece) # per-group ECE ± CI
# Is the ECE difference real? (Use paired=True when both sets of predictions score the same examples.)
test = cs.ece_difference_test(p_a, y_a, p_b, y_b, n_boot=2000)
print(test) # Δ with CI and a bootstrap p-value
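One way such a difference test can work, sketched for the unpaired case: bootstrap each group independently, form the ECE difference on every replicate, and read a CI and p-value off that distribution. Everything here is an assumption made for illustration, including the name `ece_difference_bootstrap`, the percentile CI, and the shift-to-null p-value convention; it is not a description of how `cs.ece_difference_test` is implemented.

```python
import numpy as np


def ece(probs, labels, n_bins=15):
    """Vectorized equal-width ECE for binary P(y=1) forecasts."""
    idx = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    conf_sum = np.bincount(idx, weights=probs, minlength=n_bins)
    acc_sum = np.bincount(idx, weights=labels, minlength=n_bins)
    # sum_b (n_b / n) * |acc_b - conf_b| simplifies to |acc_sum - conf_sum| / n
    return float(np.abs(acc_sum - conf_sum).sum() / len(probs))


def ece_difference_bootstrap(p_a, y_a, p_b, y_b, n_boot=500, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    observed = ece(p_b, y_b) - ece(p_a, y_a)
    deltas = np.empty(n_boot)
    for i in range(n_boot):
        ia = rng.integers(0, len(p_a), size=len(p_a))  # groups resampled independently
        ib = rng.integers(0, len(p_b), size=len(p_b))
        deltas[i] = ece(p_b[ib], y_b[ib]) - ece(p_a[ia], y_a[ia])
    lo, hi = np.quantile(deltas, [alpha / 2, 1 - alpha / 2])
    # Shift the bootstrap distribution to the null (zero mean) and count how
    # often it is at least as extreme as the observed difference.
    extreme = np.abs(deltas - deltas.mean()) >= abs(observed)
    p_value = (extreme.sum() + 1) / (n_boot + 1)
    return {"delta": observed, "ci": (float(lo), float(hi)), "p_value": float(p_value)}


rng = np.random.default_rng(0)
n = 2000
p_a = rng.uniform(0.01, 0.99, n)
y_a = (rng.uniform(size=n) < p_a).astype(float)  # group a: calibrated
p_b = rng.uniform(0.01, 0.99, n)
true_b = 1.0 / (1.0 + np.exp(-np.log(p_b / (1 - p_b)) / 3.0))
y_b = (rng.uniform(size=n) < true_b).astype(float)  # group b: overconfident

result = ece_difference_bootstrap(p_a, y_a, p_b, y_b)
print(result)
```

When both groups score the same examples, resampling shared indices instead of independent ones keeps the pairing and usually tightens the interval.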
Multiclass
Pass a 2-D (n, n_classes) probability matrix and integer labels. Metrics use
top-label (confidence) calibration: confidence = max_c p, correct = 1[argmax == label] — the standard multiclass ECE setting.
probs, labels = cs.make_multiclass(3000, n_classes=5, temperature=2.0)
cs.calibration_report(probs, labels)
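The top-label reduction described above turns a multiclass problem into the same (confidence, correct) pair that binary metrics consume. A minimal sketch, with `top_label_view` as a hypothetical helper name:

```python
import numpy as np


def top_label_view(prob_matrix, labels):
    """Reduce (n, n_classes) probabilities to per-example confidence and correctness."""
    prob_matrix = np.asarray(prob_matrix, dtype=float)
    labels = np.asarray(labels)
    confidence = prob_matrix.max(axis=1)  # max_c p
    correct = (prob_matrix.argmax(axis=1) == labels).astype(float)  # 1[argmax == label]
    return confidence, correct


P = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.6, 0.3],
    [0.3, 0.3, 0.4],
])
y = np.array([0, 2, 2])

conf, correct = top_label_view(P, y)
print(conf)     # [0.7 0.6 0.4]
print(correct)  # [1. 0. 1.]
```

Note that this view only checks the predicted class; it says nothing about whether the probabilities assigned to the other classes are calibrated.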
What's implemented (and verified against references)
| Area | Functions | Verified against |
|---|---|---|
| Binned errors | ece, ace, mce, debiased_ece, calibration_error | hand-computed cases |
| Proper scores | brier_score, brier_decomposition, nll | scikit-learn brier_score_loss, log_loss; Murphy identity |
| Uncertainty | bootstrap_ci, bootstrap_metrics, CIResult | coverage + 1/√n shrinkage |
| Diagrams | reliability_diagram, reliability_data | Wilson score interval |
| Recalibration | TemperatureScaler, PlattScaler | known-T recovery; sklearn LogisticRegression |
| Shift | compare_subgroups, ece_difference_test | known-gap detection / null calibration |
Full numerical detail and figures: STUDY.md.
Design notes
- Binary convention. confidence = probs, accuracy = label, i.e. the reliability curve of predicted P(y=1) vs observed frequency, which is more informative than the max(p, 1−p) collapse.
- Binning bias. Equal-width ECE is the default for comparability with the literature; ace uses equal-mass bins; debiased_ece corrects the small-sample inflation. The right move is usually to report all three.
- Holdout discipline. Recalibrators are fit/transform objects, so fitting and applying are separate, explicit steps. That makes it harder to accidentally fit and evaluate on the same data.
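The difference between equal-width and equal-mass binning in the notes above is easiest to see in bin occupancy. The sketch below is illustrative only: the helper names are hypothetical, and the Beta(8, 1) draw is an assumed stand-in for a model whose confidences pile up near 1.0.

```python
import numpy as np


def equal_width_edges(n_bins):
    return np.linspace(0.0, 1.0, n_bins + 1)


def equal_mass_edges(probs, n_bins):
    # Quantile edges put roughly the same number of samples in every bin.
    return np.quantile(probs, np.linspace(0.0, 1.0, n_bins + 1))


def bin_counts(probs, edges):
    n_bins = len(edges) - 1
    # Right-open bins; the clip sends values equal to the last edge to the final bin.
    idx = np.clip(np.searchsorted(edges, probs, side="right") - 1, 0, n_bins - 1)
    return np.bincount(idx, minlength=n_bins)


rng = np.random.default_rng(0)
confidences = rng.beta(8.0, 1.0, size=3000)  # mass concentrated near 1.0

width_counts = bin_counts(confidences, equal_width_edges(15))
mass_counts = bin_counts(confidences, equal_mass_edges(confidences, 15))

print("equal-width counts:", width_counts)
print("equal-mass counts: ", mass_counts)
```

With equal-width bins a single top bin holds a large share of the samples while the low bins are nearly empty, so the ECE is driven by one bin and the sparse bins contribute mostly noise. Equal-mass bins spread the samples evenly at the cost of bin edges that depend on the data.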
What real model probabilities would add
This toolkit is exercised here on synthetic predictors — deliberately, since
that gives a known ground-truth calibration to validate the estimators against.
The code path for real (probability, label) arrays is identical. Plugging in
real LLM or classifier outputs would extend the study in ways synthetic data
cannot fully mimic:
- Confidence mass piled near 1.0. Real classifiers (and LLM token probabilities) put most of their mass in the top bin, so equal-width ECE is dominated by one bin while equal-mass (ACE) and the bias correction matter far more — exactly the regime where a bare ECE is most misleading.
- Genuine distribution shift. Replacing the tuned-temperature subgroups with real in-/out-of-domain slices (e.g. a model evaluated on a new dataset) would show miscalibration that isn't a single global temperature — where temperature scaling helps only partially and the subgroup tests earn their keep.
- Class imbalance and rare events, where the Brier uncertainty term and the reliability/resolution split become the interesting story.
- Keyless public sources of (prob, label) (released model logits, public probabilistic-forecast archives) would let the study use real outcomes; the API needs no changes to ingest them.
None of these require new metrics — they are the same estimators on heavier-tailed,
shifted data, which is precisely the setting where reporting ECE ± CI instead of
a bare ECE changes the conclusion.
Development
uv sync --extra viz
uv run pytest # 33 tests
uv run ruff check . && uv run ruff format --check .
uv run mypy src/calibstats
uv run python study/run_study.py
License
MIT — see LICENSE.
References
- Naeini, Cooper & Hauskrecht (2015), Obtaining Well Calibrated Probabilities Using Bayesian Binning (ECE).
- Guo, Pleiss, Sun & Weinberger (2017), On Calibration of Modern Neural Networks (temperature scaling, reliability diagrams).
- Nixon et al. (2019), Measuring Calibration in Deep Learning (adaptive ECE).
- Kumar, Liang & Ma (2019), Verified Uncertainty Calibration (debiased estimators).
- Murphy (1973), A New Vector Partition of the Probability Score (Brier decomposition).
- Platt (1999), Probabilistic Outputs for Support Vector Machines (Platt scaling).
calibstats is part of a statistical-rigor-for-AI-evals toolkit: deltagate (paired-delta validation for eval comparisons), agentrel (reliability stats for stochastic agent evals), leaderboard-ci (leaderboard re-ranking with CIs and tie bands). Full portfolio: github.com/yongzhe2160cs.