Decision-grade statistics for AI evals: paired comparisons, cluster-aware uncertainty, and power analysis on top of existing eval frameworks.

evalconfidence

Decision-grade statistics for AI evals. A companion layer — not another framework — that adds paired comparisons, dependence-aware uncertainty, and power analysis on top of the eval stack you already use (Inspect AI, or anything that can produce a dataframe).

Status: v0.1.0, the first public release. It ships the full statistics layer (standard_error(), compare(), power()), the adapters, CI on Python 3.10–3.13, and a demo notebook (examples/demo.ipynb) on real GPQA Diamond results (198 items × 5 epochs × 2 models) that re-runs from the committed scores CSV with zero API keys.

The gap, stated honestly

Existing frameworks do quantify uncertainty: Inspect AI computes per-eval standard errors via the CLT, offers bootstrapping for non-mean statistics, and — since v0.3.64 (Feb 2025) — supports clustered standard errors via stderr(cluster=...) when you declare a grouping field. What they give you is a defensible standard error on a single score. What none of them give you (checked against the Inspect changelog and DeepEval metrics list, June 2026):

  • Rigorous comparison between two systems: paired tests that exploit shared items, a CI on the difference, McNemar for binary scores. The common practice is still eyeballing two separate intervals, which is effectively an unpaired test and so discards the variance reduction that shared items provide.
  • Power / sample-size planning before you spend the inference budget — how many items to detect the gap you care about, or the smallest gap your benchmark can see at all.
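To make the paired-versus-unpaired point concrete, here is a minimal sketch using only the standard library. The ten per-item scores are toy numbers invented for illustration, and the code is not the package's compare() implementation; it only shows why differencing on shared items shrinks the standard error.

```python
import math
import statistics

# Toy per-item scores (1 = correct) for two systems on the SAME ten items.
# Illustrative numbers only, not taken from any real benchmark.
scores_a = [1, 1, 1, 0, 1, 0, 1, 1, 0, 0]
scores_b = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
n = len(scores_a)

# Paired: analyse per-item differences, so shared item difficulty cancels out.
diffs = [a - b for a, b in zip(scores_a, scores_b)]
mean_diff = statistics.mean(diffs)
paired_se = math.sqrt(statistics.variance(diffs) / n)

# Unpaired: treat the two score lists as independent samples.
unpaired_se = math.sqrt(
    (statistics.variance(scores_a) + statistics.variance(scores_b)) / n
)

# How much variance the unpaired analysis adds for the same data.
variance_reduction = (unpaired_se / paired_se) ** 2

print(f"difference:  {mean_diff:.2f}")
print(f"paired SE:   {paired_se:.3f}")
print(f"unpaired SE: {unpaired_se:.3f}")
print(f"variance reduction from pairing: {variance_reduction:.1f}x")
```

The point estimate is identical either way; only the uncertainty around it changes. On this toy data the two systems agree on most items, so the unpaired analysis carries three times the variance of the paired one.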

On dependence-aware uncertainty the gap is narrower and we say so: Inspect can cluster if you name the grouping up front. This package adds the diagnostic framing — naive and cluster-robust side by side with the inflation factor, epoch structure auto-detected — and works on results from any framework, not just Inspect tasks configured with custom metrics.
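The naive-versus-cluster-robust contrast can be sketched in a few lines. This is an illustrative estimator for the standard error of a mean with observations clustered by item, using a G/(G−1) small-sample correction as one common choice. The function name and the toy data are assumptions, and the package's exact estimator may differ.

```python
import math
from collections import defaultdict

def naive_and_cluster_se(item_ids, scores):
    """Return (naive_se, cluster_se) for the mean of `scores`.

    Observations sharing an item id (e.g. repeated epochs) form one cluster.
    """
    n = len(scores)
    mean = sum(scores) / n

    # Naive i.i.d. SE: sample standard deviation / sqrt(n).
    naive_var = sum((s - mean) ** 2 for s in scores) / (n - 1)
    naive_se = math.sqrt(naive_var / n)

    # Cluster-robust SE: sum residuals within each cluster first, so that
    # correlated epochs of one item are not counted as independent evidence.
    resid_sums = defaultdict(float)
    for item, s in zip(item_ids, scores):
        resid_sums[item] += s - mean
    g = len(resid_sums)
    cluster_var = (g / (g - 1)) * sum(r ** 2 for r in resid_sums.values()) / n ** 2
    return naive_se, math.sqrt(cluster_var)

# Toy data: 4 items x 3 epochs, each item answered consistently across epochs.
items = ["q1"] * 3 + ["q2"] * 3 + ["q3"] * 3 + ["q4"] * 3
scores = [1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0]

naive_se, cluster_se = naive_and_cluster_se(items, scores)
inflation = cluster_se / naive_se
print(f"naive SE: {naive_se:.4f}, cluster-robust SE: {cluster_se:.4f}")
print(f"inflation: {inflation:.2f}x (design effect {inflation ** 2:.2f})")
```

With perfectly consistent epochs, the twelve observations carry only four items' worth of information, which is why the cluster-robust SE comes out much larger than the naive one.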

That's the whole scope of this package: results in, rigorous comparison out. No model calls, no orchestration, no tracing.

Capability matrix

| Capability | Existing frameworks | evalconfidence |
| --- | --- | --- |
| Run / orchestrate / trace / score evals | Yes | No (consumes results) |
| Single-score standard error | Yes (Inspect: CLT, bootstrap) | Re-derives, reported side by side |
| Clustered standard errors | Partial (Inspect stderr(cluster=...), declared field) | Yes: auto-detected epochs, inflation factor, any framework |
| Paired comparison of two systems | No | Yes: paired-t / McNemar, CI on the difference |
| Power / minimum detectable effect | No | Yes: n ↔ MDE, pairing- and cluster-aware |
| Judge debiasing (PPI) | No | Planned (v2) |

For the full technical argument — how dependence-blind SEs manufacture false wins at a real α of ~25–30%, how unpaired comparisons silently bury real improvements, and why underpowered evals cause both errors — see docs/why-it-works.md.

How it works: the two-stage flow

This package never makes API calls — model_id is just a grouping label, never an endpoint. The flow has two stages, and the package only lives in the second:

  1. Generation (upstream, not this package). An eval framework runs the model against the benchmark and grades outputs. This is where API calls, keys, and cost live. Inspect AI saves its own durable record automatically — a .eval log in ./logs/ with every prompt, response, and score per sample. A homegrown harness's CSV plays the same role.
  2. Analysis (this package). An adapter reads that already-existing record into the normalized ItemResult rows — from_inspect() for .eval logs, from_dataframe() for anything tabular — and the statistics functions compute on those fixed numbers. No model is ever consulted again.

This separation is what makes analyses cheaply reproducible: pay for stage 1 once, keep the log/CSV, and re-run stage 2 forever for free.
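As a rough picture of what stage 2 involves, the sketch below parses an already-saved scores table into normalized rows and recovers the epoch structure from repeated (model, item) pairs. The Row tuple, the inline CSV, and its column names are hypothetical stand-ins; the package's real ItemResult and adapters may look different.

```python
import csv
import io
from collections import Counter, namedtuple

# Hypothetical normalized row; the package's real ItemResult fields may differ.
Row = namedtuple("Row", ["model_id", "item_id", "score"])

# Stand-in for a saved stage-1 CSV (column names are illustrative).
SAVED_CSV = """system,qid,acc
model-a,q1,1
model-a,q1,0
model-a,q2,1
model-a,q2,1
"""

def load_rows(text):
    """Parse saved scores into normalized rows; no model is called."""
    reader = csv.DictReader(io.StringIO(text))
    return [Row(r["system"], r["qid"], float(r["acc"])) for r in reader]

rows = load_rows(SAVED_CSV)

# Repeated (model, item) pairs reveal how many epochs each item was run for.
epochs_per_item = Counter((r.model_id, r.item_id) for r in rows)
mean_score = sum(r.score for r in rows) / len(rows)

print(dict(epochs_per_item))
print(f"mean score: {mean_score:.2f}")
```

Everything here operates on fixed numbers from disk, so it can be re-run any number of times without keys or cost.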

What gets saved: stage-1 artifacts are saved by whoever produced them (Inspect does this automatically). Stage-2 outputs are returned as in-memory dataclasses (SEResult, ...) — print them or serialize with dataclasses.asdict(); the package deliberately doesn't persist analysis results, because the saved stage-1 record is the thing worth keeping and the statistics re-run in milliseconds.
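If you do want to keep an analysis result, dataclasses.asdict() plus json is enough. The sketch below uses a hypothetical stand-in dataclass, because the real SEResult fields are not listed in this README; the values are copied from the quick-example output.

```python
import json
from dataclasses import asdict, dataclass

@dataclass
class DemoSEResult:
    """Hypothetical stand-in; the real SEResult's fields are not shown here."""
    mean: float
    naive_se: float
    cluster_se: float
    n_clusters: int

result = DemoSEResult(mean=0.6758, naive_se=0.0149, cluster_se=0.0276, n_clusters=198)

# asdict() converts the dataclass into a plain dict that json can serialize.
payload = json.dumps(asdict(result), indent=2)
print(payload)

# Round-trip to confirm nothing is lost.
restored = json.loads(payload)
```

The same pattern should apply to any result object that is a plain dataclass of numbers and strings.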

Quick example

from evalconfidence import from_inspect, compare, power, standard_error

results_a = from_inspect("logs/full/..._gpqa-diamond_....eval")  # 198 items x 5 epochs
results_b = from_inspect("logs/full/..._gpqa-diamond_....eval")

print(compare(results_a, results_b))          # pairs on shared items automatically
print(standard_error(results_a))              # naive vs cluster-robust, side by side
print(power((results_a, results_b), mde=0.06))  # items needed to detect 6 points

Output — real GPQA Diamond results, gpt-5-nano vs gpt-5.4-mini at default settings (the committed scores CSV reproduces this without keys; see the demo notebook):

openai/gpt-5-nano-2025-08-07 is estimated to outperform openai/gpt-5.4-mini-2026-03-17
by 5.9 points, 95% CI [0.4, 11.3] (A−B). The difference is significant at alpha=0.05
(p=0.0363, paired_t).
Pairing reduced the comparison variance by 2.1x: the 198 paired items deliver
the precision of ~420 unpaired items.

Mean score: 0.6758  (n=990 observations)
  Naive i.i.d. SE:    0.0149  ->  95% CI [0.6465, 0.7050]
  Cluster-robust SE:  0.0276  ->  95% CI [0.6213, 0.7302]  (198 clusters by item)
  Inflation: 1.85x  (design effect 3.44)

Detecting a 6.0 points gap at alpha=0.05 with 80% power requires ~334 paired items.
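The link between sample size and minimum detectable effect (MDE) in that last line can be approximated by hand. The sketch below uses the textbook normal approximation for a two-sided paired test, not the package's power() implementation. The value sd_diff = 0.39 is a back-of-envelope assumption inferred from the width of the paired interval above; it is not a number the package prints.

```python
import math
from statistics import NormalDist

def paired_items_needed(sd_diff, mde, alpha=0.05, power=0.80):
    """Normal-approximation number of paired items for a two-sided test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(((z_alpha + z_power) * sd_diff / mde) ** 2)

def minimum_detectable_effect(sd_diff, n, alpha=0.05, power=0.80):
    """Smallest true gap that n paired items detect at the given power."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return (z_alpha + z_power) * sd_diff / math.sqrt(n)

SD_DIFF = 0.39  # assumed SD of per-item score differences (see lead-in)

n_needed = paired_items_needed(SD_DIFF, mde=0.06)
mde_at_198 = minimum_detectable_effect(SD_DIFF, n=198)
print(f"items needed for a 6-point gap: {n_needed}")
print(f"MDE with 198 paired items: {mde_at_198 * 100:.1f} points")
```

Under these assumptions the approximation gives 332 items, close to the ~334 reported above; the small difference may come from t-based or cluster-aware corrections, which this sketch omits. Note the quadratic cost: halving the gap you want to detect roughly quadruples the items required.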

The same data, compared unpaired (the eyeball-the-two-intervals test), give 95% CI [−2.1, +13.8], p = 0.15 — a real 5.9-point edge written off as noise. The full story, with figures and the pilot-based power analysis that designed the run, is in the demo notebook.

Not on Inspect? Use the escape hatch:

from evalconfidence import from_dataframe
results = from_dataframe(df, item_id="qid", model_id="system", score="acc")

Install

pip install evalconfidence              # core: numpy + scipy only
pip install "evalconfidence[inspect]"   # + Inspect AI log reading

For development (from a clone):

pip install -e ".[dev]"     # + pytest, pandas (for tests)
pip install -e ".[demo]"    # + matplotlib, jupyter (for the demo notebook)

What's here (v0.1.0)

  • ItemResult normalized representation + from_inspect + from_dataframe
  • standard_error() — naive vs. cluster-robust side by side, inflation factor
  • compare() — paired comparison of two systems (paired-t / McNemar), variance-reduction factor, unpaired fallback with warning
  • power() — required n ↔ minimum detectable effect, pairing- and cluster-aware
  • Demo notebook — three figures (wrong winner / false confidence / budget planning) on real GPQA Diamond data, generated for ~$4 and re-runnable from the committed CSV with no keys: examples/demo.ipynb
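For readers unfamiliar with McNemar's test, which compare() lists for binary scores, here is a minimal exact version in pure Python. The function names are illustrative and this is not the package's code; the package may use a different variant, such as a continuity-corrected chi-square form.

```python
from math import comb

def discordant_counts(scores_a, scores_b):
    """Count items where exactly one system is correct (binary 0/1 scores)."""
    a_only = sum(1 for a, b in zip(scores_a, scores_b) if a == 1 and b == 0)
    b_only = sum(1 for a, b in zip(scores_a, scores_b) if a == 0 and b == 1)
    return a_only, b_only

def mcnemar_exact(a_only, b_only):
    """Two-sided exact McNemar p-value from the discordant pair counts.

    Under the null hypothesis, each discordant item is equally likely to
    favour either system, so the counts follow Binomial(n, 0.5).
    """
    n = a_only + b_only
    if n == 0:
        return 1.0  # no discordant pairs: no evidence either way
    k = min(a_only, b_only)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Toy example: A wins 10 discordant items, B wins none.
print(mcnemar_exact(10, 0))
# Evenly split discordant items carry no evidence of a difference.
print(mcnemar_exact(5, 5))
```

Items that both systems get right, or both get wrong, drop out entirely: only disagreements carry information about which system is better.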

On the roadmap

  • PPI (prediction-powered inference) for debiasing LLM-judge scores
  • Multiple-comparison correction for task suites
  • Possible upstream contribution to Inspect AI (#4206 tracks a related proposal)

License: Apache-2.0
