Decision-grade statistics for AI evals: paired comparisons, cluster-aware uncertainty, and power analysis on top of existing eval frameworks.
evalconfidence
Decision-grade statistics for AI evals. A companion layer — not another framework — that adds paired comparisons, dependence-aware uncertainty, and power analysis on top of the eval stack you already use (Inspect AI, or anything that can produce a dataframe).
Status: v0.1.0, first public release. The full statistics layer: standard_error(), compare(), power(), the adapters, CI on Python 3.10–3.13, and a demo notebook (examples/demo.ipynb) on real GPQA Diamond results (198 items × 5 epochs × 2 models) that re-runs from the committed scores CSV with zero API keys.
The gap, stated honestly
Existing frameworks do quantify uncertainty: Inspect AI computes per-eval standard errors via the CLT, offers bootstrapping for non-mean statistics, and — since v0.3.64 (Feb 2025) — supports clustered standard errors via stderr(cluster=...) when you declare a grouping field. What they give you is a defensible standard error on a single score. What none of them give you (checked against the Inspect changelog and DeepEval metrics list, June 2026):
- Rigorous comparison between two systems — paired tests that exploit shared items, a CI on the difference, McNemar for binary scores. The universal practice is still eyeballing two separate intervals, which is an unpaired test at its maximum variance.
- Power / sample-size planning before you spend the inference budget — how many items to detect the gap you care about, or the smallest gap your benchmark can see at all.
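To make the planning question concrete, here is a minimal sketch of the textbook normal-approximation calculation for a two-sided paired test. It is not this package's power() implementation (which is described as pairing- and cluster-aware); the function names and the value of sd_diff, the standard deviation of per-item score differences, are hypothetical and would normally come from a pilot run.

```python
import math

from scipy.stats import norm


def required_pairs(mde, sd_diff, alpha=0.05, power=0.80):
    """Paired items needed to detect a mean difference of `mde`.

    Normal approximation, two-sided test; ignores clustering within items.
    """
    z_alpha = norm.ppf(1 - alpha / 2)  # critical value of the test
    z_power = norm.ppf(power)          # quantile for the target power
    return math.ceil(((z_alpha + z_power) * sd_diff / mde) ** 2)


def minimum_detectable_effect(n_pairs, sd_diff, alpha=0.05, power=0.80):
    """Smallest mean difference detectable with `n_pairs` paired items."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_power = norm.ppf(power)
    return (z_alpha + z_power) * sd_diff / math.sqrt(n_pairs)


# sd_diff=0.4 is an assumed pilot estimate, not a measured value.
n_needed = required_pairs(mde=0.06, sd_diff=0.4)
mde_at_n = minimum_detectable_effect(n_needed, sd_diff=0.4)
print(n_needed, round(mde_at_n, 4))
```

The two functions are inverses of each other, which is the n ↔ MDE relationship: halving the gap you want to detect roughly quadruples the number of items required.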
On dependence-aware uncertainty the gap is narrower and we say so: Inspect can cluster if you name the grouping up front. This package adds the diagnostic framing — naive and cluster-robust side by side with the inflation factor, epoch structure auto-detected — and works on results from any framework, not just Inspect tasks configured with custom metrics.
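As a rough illustration of what "naive and cluster-robust side by side" means, the sketch below computes both standard errors for a mean on synthetic data, clustering by item. It assumes a standard sandwich estimator with a G/(G-1) small-sample correction; the package's exact estimator may differ, and the function name and the simulated data are hypothetical.

```python
import numpy as np


def naive_and_cluster_se(scores, cluster_ids):
    """Return (naive SE, cluster-robust SE, inflation factor) for the mean."""
    scores = np.asarray(scores, dtype=float)
    n = scores.size
    residuals = scores - scores.mean()

    # Naive SE treats every observation as independent.
    naive_se = scores.std(ddof=1) / np.sqrt(n)

    # Cluster-robust SE: sum residuals within each cluster first, so
    # correlated repeats of the same item cannot masquerade as new evidence.
    _, inverse = np.unique(np.asarray(cluster_ids), return_inverse=True)
    cluster_sums = np.bincount(inverse, weights=residuals)
    g = cluster_sums.size
    cluster_var = (g / (g - 1)) * np.sum(cluster_sums**2) / n**2
    cluster_se = np.sqrt(cluster_var)

    return naive_se, cluster_se, cluster_se / naive_se


# Synthetic stand-in for "items x epochs": each item has its own pass rate,
# so the 5 epochs of one item are correlated with each other.
rng = np.random.default_rng(0)
pass_rate = rng.beta(0.5, 0.5, size=200)
flat_scores = (rng.random((200, 5)) < pass_rate[:, None]).ravel()
item_ids = np.repeat(np.arange(200), 5)

naive_se, cluster_se, inflation = naive_and_cluster_se(flat_scores, item_ids)
print(f"naive={naive_se:.4f} cluster={cluster_se:.4f} inflation={inflation:.2f}x")
```

When every observation is its own cluster, the two estimates coincide, which is a useful sanity check on any implementation.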
That's the whole scope of this package: results in, rigorous comparison out. No model calls, no orchestration, no tracing.
Capability matrix
| Capability | Existing frameworks | evalconfidence |
|---|---|---|
| Run / orchestrate / trace / score evals | Yes | No (consumes results) |
| Single-score standard error | Yes (Inspect: CLT, bootstrap) | Re-derives, reported side by side |
| Clustered standard errors | Partial (Inspect stderr(cluster=...), declared field) | Yes (auto-detected epochs, inflation factor, any framework) |
| Paired comparison of two systems | No | Yes — paired-t / McNemar, CI on the difference |
| Power / minimum detectable effect | No | Yes — n ↔ MDE, pairing- and cluster-aware |
| Judge debiasing (PPI) | No | Planned (v2) |
For the full technical argument — how dependence-blind SEs manufacture false wins at a real α of ~25–30%, how unpaired comparisons silently bury real improvements, and why underpowered evals cause both errors — see docs/why-it-works.md.
How it works: the two-stage flow
This package never makes API calls — model_id is just a grouping label, never an endpoint. The flow has two stages, and the package only lives in the second:
- Generation (upstream, not this package). An eval framework runs the model against the benchmark and grades outputs. This is where API calls, keys, and cost live. Inspect AI saves its own durable record automatically: a .eval log in ./logs/ with every prompt, response, and score per sample. A homegrown harness's CSV plays the same role.
- Analysis (this package). An adapter reads that already-existing record into normalized ItemResult rows (from_inspect() for .eval logs, from_dataframe() for anything tabular), and the statistics functions compute on those fixed numbers. No model is ever consulted again.
This separation is what makes analyses cheaply reproducible: pay for stage 1 once, keep the log/CSV, and re-run stage 2 forever for free.
What gets saved: stage-1 artifacts are saved by whoever produced them (Inspect does this automatically). Stage-2 outputs are returned as in-memory dataclasses (SEResult, ...) — print them or serialize with dataclasses.asdict(); the package deliberately doesn't persist analysis results, because the saved stage-1 record is the thing worth keeping and the statistics re-run in milliseconds.
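If you do want to keep an analysis result, the dataclasses.asdict() route looks roughly like this. ExampleResult and its field names are a hypothetical stand-in, since the real fields of SEResult are not listed here; the numbers are taken from this README's example output.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ExampleResult:
    """Hypothetical stand-in for a result dataclass such as SEResult."""

    mean: float
    naive_se: float
    cluster_se: float
    n_clusters: int


result = ExampleResult(mean=0.6758, naive_se=0.0149, cluster_se=0.0276, n_clusters=198)

# asdict() converts the dataclass (recursively) into plain dicts and lists.
payload = json.dumps(asdict(result), indent=2)
print(payload)
```

If a real result holds NumPy integer or array fields, json.dumps may need a default= hook to convert them first.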
Quick example
from evalconfidence import from_inspect, compare, power, standard_error
results_a = from_inspect("logs/full/..._gpqa-diamond_....eval") # 198 items x 5 epochs
results_b = from_inspect("logs/full/..._gpqa-diamond_....eval")
print(compare(results_a, results_b)) # pairs on shared items automatically
print(standard_error(results_a)) # naive vs cluster-robust, side by side
print(power((results_a, results_b), mde=0.06)) # items needed to detect 6 points
Output — real GPQA Diamond results, gpt-5-nano vs gpt-5.4-mini at default settings (the committed scores CSV reproduces this without keys; see the demo notebook):
openai/gpt-5-nano-2025-08-07 is estimated to outperform openai/gpt-5.4-mini-2026-03-17
by 5.9 points, 95% CI [0.4, 11.3] (A−B). The difference is significant at alpha=0.05
(p=0.0363, paired_t).
Pairing reduced the comparison variance by 2.1x: the 198 paired items deliver
the precision of ~420 unpaired items.
Mean score: 0.6758 (n=990 observations)
Naive i.i.d. SE: 0.0149 -> 95% CI [0.6465, 0.7050]
Cluster-robust SE: 0.0276 -> 95% CI [0.6213, 0.7302] (198 clusters by item)
Inflation: 1.85x (design effect 3.44)
Detecting a 6.0 points gap at alpha=0.05 with 80% power requires ~334 paired items.
The same data, compared unpaired (the eyeball-the-two-intervals test), give 95% CI [−2.1, +13.8], p = 0.15 — a real 5.9-point edge written off as noise. The full story, with figures and the pilot-based power analysis that designed the run, is in the demo notebook.
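The mechanism behind that gap can be reproduced on synthetic data with SciPy alone. This is a sketch under stated assumptions, not the package's compare(): the simulated scores, effect sizes, and variable names are all hypothetical, and the only point is that shared item difficulty cancels in the paired difference but not in the unpaired one.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_items = 200

# Both systems see the same items, so they share each item's difficulty.
item_effect = rng.normal(0.0, 0.3, n_items)
score_a = 0.70 + item_effect + rng.normal(0.0, 0.1, n_items)
score_b = 0.65 + item_effect + rng.normal(0.0, 0.1, n_items)

# Paired: the per-item difference removes the shared difficulty term.
diff = score_a - score_b
se_paired = diff.std(ddof=1) / np.sqrt(n_items)
paired = stats.ttest_rel(score_a, score_b)

# Unpaired: item difficulty stays in both variances and swamps the signal.
se_unpaired = np.sqrt(score_a.var(ddof=1) / n_items + score_b.var(ddof=1) / n_items)
unpaired = stats.ttest_ind(score_a, score_b, equal_var=False)

variance_reduction = (se_unpaired / se_paired) ** 2
print(f"paired:   SE={se_paired:.4f} p={paired.pvalue:.4g}")
print(f"unpaired: SE={se_unpaired:.4f} p={unpaired.pvalue:.4g}")
print(f"variance reduction from pairing: {variance_reduction:.1f}x")
```

The variance-reduction factor is the ratio of the two squared standard errors, which is also how many times more unpaired items you would need for the same precision.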
Not on Inspect? Use the escape hatch:
from evalconfidence import from_dataframe
results = from_dataframe(df, item_id="qid", model_id="system", score="acc")
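For orientation, a dataframe matching the column names in that call might look like the sketch below. The long layout (one row per item and system) is an assumption inferred from the item_id / model_id / score arguments, and the data are made up; the from_dataframe() call is left as a comment so the snippet runs with pandas alone.

```python
import pandas as pd

# Hypothetical results: q4 was only run on model-a, so it cannot be paired.
df = pd.DataFrame(
    {
        "qid": ["q1", "q1", "q2", "q2", "q3", "q3", "q4"],
        "system": ["model-a", "model-b", "model-a", "model-b",
                   "model-a", "model-b", "model-a"],
        "acc": [1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0],
    }
)

# Items that every system answered are the ones a paired comparison can use.
items_per_system = df.groupby("system")["qid"].apply(set)
shared_items = sorted(set.intersection(*items_per_system))
print(shared_items)

# With the package installed, the call from the README would then be:
# results = from_dataframe(df, item_id="qid", model_id="system", score="acc")
```

Checking the shared-item set before analysis shows how much of the benchmark actually supports a paired comparison.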
Install
pip install evalconfidence # core: numpy + scipy only
pip install "evalconfidence[inspect]" # + Inspect AI log reading
For development (from a clone):
pip install -e ".[dev]" # + pytest, pandas (for tests)
pip install -e ".[demo]" # + matplotlib, jupyter (for the demo notebook)
What's here (v0.1.0)
- ItemResult normalized representation + from_inspect + from_dataframe
- standard_error(): naive vs. cluster-robust side by side, inflation factor
- compare(): paired comparison of two systems (paired-t / McNemar), variance-reduction factor, unpaired fallback with warning
- power(): required n ↔ minimum detectable effect, pairing- and cluster-aware
- Demo notebook: three figures (wrong winner / false confidence / budget planning) on real GPQA Diamond data, generated for ~$4 and re-runnable from the committed CSV with no keys (examples/demo.ipynb)
On the roadmap
- PPI (prediction-powered inference) for debiasing LLM-judge scores
- Multiple-comparison correction for task suites
- Possible upstream contribution to Inspect AI (#4206 tracks a related proposal)
License: Apache-2.0