Decision-grade statistics for AI evals: paired comparisons, cluster-aware uncertainty, and power analysis on top of existing eval frameworks.

evalconfidence

Decision-grade statistics for AI evals. A companion layer — not another framework — that adds paired comparisons, dependence-aware uncertainty, and power analysis on top of the eval stack you already use (Inspect AI, or anything that can produce a dataframe).

Status: v0.1.0, first public release. It ships the full statistics layer (standard_error(), compare(), power(), and the adapters), continuous integration on Python 3.10–3.13, and a demo notebook (examples/demo.ipynb) on real GPQA Diamond results (198 items × 5 epochs × 2 models) that re-runs from the committed scores CSV without any API keys.

The gap, stated honestly

Existing frameworks do quantify uncertainty: Inspect AI computes per-eval standard errors via the CLT, offers bootstrapping for non-mean statistics, and, since v0.3.64 (Feb 2025), supports clustered standard errors via stderr(cluster=...) when you declare a grouping field. That gives you a defensible standard error on a single score. What none of them provides (checked against the Inspect changelog and DeepEval metrics list, June 2026):

Rigorous comparison between two systems: paired tests that exploit shared items, a CI on the difference, and McNemar's test for binary scores. The common practice is still eyeballing two separate intervals, which amounts to an unpaired comparison and throws away the variance reduction that shared items provide.
Power / sample-size planning before you spend the inference budget: how many items you need to detect the gap you care about, or the smallest gap your benchmark can detect at all.
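
To make the sample-size question concrete, here is a minimal sketch of the standard normal-approximation formula for a paired design, n = ((z_{1-α/2} + z_{power}) · σ_d / δ)², together with its inverse. The function names and the value of sd_diff are illustrative assumptions, not this package's API or the demo's numbers, and the package's own power() may apply corrections that this sketch omits.

```python
import math
from statistics import NormalDist

_Z = NormalDist()  # standard normal, used for critical values


def required_pairs(mde, sd_diff, alpha=0.05, power=0.80):
    """Paired items needed to detect a mean difference `mde` (two-sided z-test)."""
    z_alpha = _Z.inv_cdf(1 - alpha / 2)
    z_power = _Z.inv_cdf(power)
    return math.ceil(((z_alpha + z_power) * sd_diff / mde) ** 2)


def minimum_detectable_effect(n_pairs, sd_diff, alpha=0.05, power=0.80):
    """Smallest mean difference detectable with `n_pairs` paired items."""
    z_alpha = _Z.inv_cdf(1 - alpha / 2)
    z_power = _Z.inv_cdf(power)
    return (z_alpha + z_power) * sd_diff / math.sqrt(n_pairs)


# Assumed SD of per-item score differences (illustrative, not a measured value).
sd_diff = 0.40
n = required_pairs(mde=0.06, sd_diff=sd_diff)
print(f"pairs needed for a 6-point gap: {n}")
print(f"MDE at that n: {minimum_detectable_effect(n, sd_diff):.4f}")
```

Because n scales with 1/δ², halving the gap you want to detect roughly quadruples the number of items.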

On dependence-aware uncertainty the gap is narrower and we say so: Inspect can cluster if you name the grouping up front. This package adds the diagnostic framing — naive and cluster-robust side by side with the inflation factor, epoch structure auto-detected — and works on results from any framework, not just Inspect tasks configured with custom metrics.
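
As a rough illustration of the naive versus cluster-robust comparison, the sketch below computes both standard errors for a mean when repeated epochs of the same item form a cluster. It uses a textbook cluster-robust variance with a G/(G-1) small-sample correction; the function names and the toy scores are assumptions for illustration, and the package's standard_error() may differ in its details.

```python
import math
from collections import defaultdict
from statistics import fmean


def naive_se(scores):
    """Standard error of the mean, treating every observation as independent."""
    n = len(scores)
    mean = fmean(scores)
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)
    return math.sqrt(var / n)


def cluster_robust_se(rows):
    """Cluster-robust SE of the mean; rows are (cluster_id, score) pairs."""
    rows = list(rows)
    n = len(rows)
    mean = fmean(score for _, score in rows)
    # Sum the residuals within each cluster before squaring, so that
    # correlated epochs of one item do not count as independent evidence.
    resid_sums = defaultdict(float)
    for cluster_id, score in rows:
        resid_sums[cluster_id] += score - mean
    g = len(resid_sums)
    var = (g / (g - 1)) * sum(r ** 2 for r in resid_sums.values()) / n ** 2
    return math.sqrt(var)


# Toy data: 4 items x 3 epochs, answers mostly stable within an item.
toy = {"q1": [1, 1, 1], "q2": [1, 1, 0], "q3": [0, 0, 0], "q4": [0, 1, 0]}
rows = [(item, score) for item, scores in toy.items() for score in scores]

naive = naive_se([score for _, score in rows])
robust = cluster_robust_se(rows)
inflation = robust / naive
print(f"naive SE {naive:.4f}, cluster-robust SE {robust:.4f}")
print(f"inflation {inflation:.2f}x, design effect {inflation ** 2:.2f}")
```

With equal cluster sizes this estimator reduces to the standard deviation of the per-item means divided by the square root of the number of items.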

That's the whole scope of this package: results in, rigorous comparison out. No model calls, no orchestration, no tracing.

Capability matrix

Capability	Existing frameworks	evalconfidence
Run / orchestrate / trace / score evals	Yes	No (consumes results)
Single-score standard error	Yes (Inspect: CLT, bootstrap)	Re-derives, reported side by side
Clustered standard errors	Partial (Inspect `stderr(cluster=...)`, declared field)	Yes — auto-detected epochs, inflation factor, any framework
Paired comparison of two systems	No	Yes — paired-t / McNemar, CI on the difference
Power / minimum detectable effect	No	Yes — n ↔ MDE, pairing- and cluster-aware
Judge debiasing (PPI)	No	Planned (v2)
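
The paired-comparison row mentions McNemar's test for binary scores. A minimal sketch of the exact (binomial) form follows: only the items where the two systems disagree carry information, and under the null hypothesis each disagreement is equally likely to favor either system. The function name and the toy outcomes are assumptions for illustration, not the package's implementation.

```python
from math import comb


def mcnemar_exact(a_correct, b_correct):
    """Two-sided exact McNemar test on paired binary outcomes.

    Returns (only_a, only_b, p_value), where only_a counts items that
    system A got right and system B got wrong, and only_b the reverse.
    """
    if len(a_correct) != len(b_correct):
        raise ValueError("both systems must be scored on the same items")
    only_a = sum(1 for a, b in zip(a_correct, b_correct) if a and not b)
    only_b = sum(1 for a, b in zip(a_correct, b_correct) if b and not a)
    n = only_a + only_b
    if n == 0:
        return only_a, only_b, 1.0  # no disagreements, no evidence either way
    k = min(only_a, only_b)
    # Binomial(n, 0.5) lower tail, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return only_a, only_b, min(1.0, 2 * tail)


# Toy outcomes: 9 items only A solves, 2 only B solves, 30 where they agree.
a = [1] * 9 + [0] * 2 + [1] * 20 + [0] * 10
b = [0] * 9 + [1] * 2 + [1] * 20 + [0] * 10
only_a, only_b, p = mcnemar_exact(a, b)
print(only_a, only_b, round(p, 4))
```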

For the full technical argument — how dependence-blind SEs manufacture false wins at a real α of ~25–30%, how unpaired comparisons silently bury real improvements, and why underpowered evals cause both errors — see docs/why-it-works.md.

How it works: the two-stage flow

This package never makes API calls — model_id is just a grouping label, never an endpoint. The flow has two stages, and the package only lives in the second:

Generation (upstream, not this package). An eval framework runs the model against the benchmark and grades outputs. This is where API calls, keys, and cost live. Inspect AI saves its own durable record automatically — a .eval log in ./logs/ with every prompt, response, and score per sample. A homegrown harness's CSV plays the same role.
Analysis (this package). An adapter reads that already-existing record into the normalized ItemResult rows — from_inspect() for .eval logs, from_dataframe() for anything tabular — and the statistics functions compute on those fixed numbers. No model is ever consulted again.

This separation is what makes analyses cheaply reproducible: pay for stage 1 once, keep the log/CSV, and re-run stage 2 forever for free.
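
A small sketch of what stage 2 looks like in principle: the analysis reads a saved record and computes on fixed numbers, so re-running it is deterministic and needs no model access. The CSV contents, the epoch column, and the helper names here are hypothetical; only the qid / system / acc column names are borrowed from the from_dataframe() example in this README.

```python
import csv
import io
from collections import defaultdict
from statistics import fmean

# Toy stand-in for a saved stage-1 scores file (hypothetical values).
SAVED_SCORES = """\
qid,system,epoch,acc
q1,model-a,1,1
q1,model-a,2,1
q2,model-a,1,0
q2,model-a,2,1
q1,model-b,1,1
q1,model-b,2,0
q2,model-b,1,0
q2,model-b,2,0
"""


def load_rows(text):
    """Parse saved scores into plain dict rows with numeric fields."""
    return [
        {
            "qid": r["qid"],
            "system": r["system"],
            "epoch": int(r["epoch"]),
            "acc": float(r["acc"]),
        }
        for r in csv.DictReader(io.StringIO(text))
    ]


def mean_by_system(rows):
    """Mean score per system label; no endpoint is contacted."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["system"]].append(row["acc"])
    return {system: fmean(scores) for system, scores in grouped.items()}


rows = load_rows(SAVED_SCORES)
print(mean_by_system(rows))
```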

What gets saved: stage-1 artifacts are saved by whoever produced them (Inspect does this automatically). Stage-2 outputs are returned as in-memory dataclasses (SEResult, ...) — print them or serialize with dataclasses.asdict(); the package deliberately doesn't persist analysis results, because the saved stage-1 record is the thing worth keeping and the statistics re-run in milliseconds.
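
If you do want to keep an analysis result, serializing a dataclass with dataclasses.asdict() is a one-liner. The sketch below uses a hypothetical stand-in class, since the real SEResult fields are not listed here; the field names are assumptions, and the values are copied from the example output in this README.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ExampleSEResult:
    """Hypothetical stand-in; the package's SEResult may have other fields."""

    mean: float
    naive_se: float
    cluster_se: float
    n_clusters: int


result = ExampleSEResult(mean=0.6758, naive_se=0.0149, cluster_se=0.0276, n_clusters=198)

# asdict() converts the dataclass to a plain dict that json can serialize.
payload = json.dumps(asdict(result), indent=2)
print(payload)

# Round trip: rebuild the dataclass from the stored JSON.
restored = ExampleSEResult(**json.loads(payload))
```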

Quick example

from evalconfidence import from_inspect, compare, power, standard_error

results_a = from_inspect("logs/full/..._gpqa-diamond_....eval")  # 198 items x 5 epochs
results_b = from_inspect("logs/full/..._gpqa-diamond_....eval")

print(compare(results_a, results_b))          # pairs on shared items automatically
print(standard_error(results_a))              # naive vs cluster-robust, side by side
print(power((results_a, results_b), mde=0.06))  # items needed to detect 6 points

Output — real GPQA Diamond results, gpt-5-nano vs gpt-5.4-mini at default settings (the committed scores CSV reproduces this without keys; see the demo notebook):

openai/gpt-5-nano-2025-08-07 is estimated to outperform openai/gpt-5.4-mini-2026-03-17
by 5.9 points, 95% CI [0.4, 11.3] (A−B). The difference is significant at alpha=0.05
(p=0.0363, paired_t).
Pairing reduced the comparison variance by 2.1x: the 198 paired items deliver
the precision of ~420 unpaired items.

Mean score: 0.6758  (n=990 observations)
  Naive i.i.d. SE:    0.0149  ->  95% CI [0.6465, 0.7050]
  Cluster-robust SE:  0.0276  ->  95% CI [0.6213, 0.7302]  (198 clusters by item)
  Inflation: 1.85x  (design effect 3.44)

Detecting a 6.0 points gap at alpha=0.05 with 80% power requires ~334 paired items.

The same data, compared unpaired (the eyeball-the-two-intervals test), give 95% CI [−2.1, +13.8], p = 0.15 — a real 5.9-point edge written off as noise. The full story, with figures and the pilot-based power analysis that designed the run, is in the demo notebook.
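
The mechanism behind that contrast can be sketched on toy data: when both systems are scored on the same items, the per-item differences vary far less than the raw scores, so the paired standard error is smaller for the same point estimate. The scores below are invented for illustration, the helper names are assumptions, and the interval uses a normal critical value rather than the t quantile a paired t-test would use. With multiple epochs you would average each item's epochs first.

```python
import math
from statistics import fmean, stdev, variance

Z95 = 1.96  # normal critical value (assumption; a t quantile is slightly wider)


def paired_summary(a, b):
    """Mean difference and its SE from per-item differences."""
    diffs = [x - y for x, y in zip(a, b)]
    return fmean(diffs), stdev(diffs) / math.sqrt(len(diffs))


def unpaired_summary(a, b):
    """Mean difference and its SE, ignoring that items are shared."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return fmean(a) - fmean(b), se


# Toy per-item scores for two systems on the same 10 items.
a = [0.9, 0.8, 0.2, 0.6, 1.0, 0.4, 0.7, 0.1, 0.5, 0.9]
b = [0.8, 0.8, 0.1, 0.5, 0.9, 0.4, 0.6, 0.0, 0.5, 0.7]

diff_p, se_p = paired_summary(a, b)
diff_u, se_u = unpaired_summary(a, b)
reduction = (se_u / se_p) ** 2  # variance-reduction factor from pairing

print(f"paired:   {diff_p:.3f} +/- {Z95 * se_p:.3f}")
print(f"unpaired: {diff_u:.3f} +/- {Z95 * se_u:.3f}")
print(f"pairing is worth ~{reduction * len(a):.0f} unpaired items")
```

The last line mirrors the "precision of N unpaired items" statement in the output above: the effective unpaired sample size is the number of paired items times the variance-reduction factor.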

Not on Inspect? Use the escape hatch:

from evalconfidence import from_dataframe
results = from_dataframe(df, item_id="qid", model_id="system", score="acc")

Install

pip install evalconfidence              # core: numpy + scipy only
pip install "evalconfidence[inspect]"   # + Inspect AI log reading

For development (from a clone):

pip install -e ".[dev]"     # + pytest, pandas (for tests)
pip install -e ".[demo]"    # + matplotlib, jupyter (for the demo notebook)

What's here (v0.1.0)

ItemResult normalized representation + from_inspect + from_dataframe
standard_error() — naive vs. cluster-robust side by side, inflation factor
compare() — paired comparison of two systems (paired-t / McNemar), variance-reduction factor, unpaired fallback with warning
power() — required n ↔ minimum detectable effect, pairing- and cluster-aware
Demo notebook — three figures (wrong winner / false confidence / budget planning) on real GPQA Diamond data, generated for ~$4 and re-runnable from the committed CSV with no keys: examples/demo.ipynb

On the roadmap

PPI (prediction-powered inference) for debiasing LLM-judge scores
Multiple-comparison correction for task suites
Possible upstream contribution to Inspect AI (#4206 tracks a related proposal)

License: Apache-2.0

Release history

0.1.1 (Jun 14, 2026)

0.1.0 (Jun 14, 2026), this version

