Statistical validation for LLM/ML eval comparisons: paired delta CIs, multiple-testing correction, deflated significance, power analysis, and noise diagnostics. Most reported eval deltas are noise — this gates them.
deltagate
Statistical validation for LLM/ML eval comparisons. Most reported eval deltas are noise — this gates them.
A model is declared "better" on a one-number delta. A suite of 12 tasks gets scanned for wins. A sample size is chosen by budget, and a 1-point gap is then reported as a finding. These claims usually carry no error bars, and the three ways they go wrong are specific and fixable:
- The comparison is paired, but analysed as independent. Two models run on the same samples share per-sample difficulty; the correct standard error comes from the per-sample differences, which can be several times tighter. Getting this wrong fails in both directions — the unpaired test misses real effects, while eyeballing two means manufactures fake ones.
- Multiple comparisons. Scan 10 independent null tasks at α=0.05 and at least one "significant win" appears in about 40% of suites (1 − 0.95^10 ≈ 0.40), by construction.
- No power analysis. If the minimum detectable delta at your n is 3 points, an observed 1-point delta is unresolvable — more samples needed, not more discussion.
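The first point can be checked by hand. The sketch below is a minimal stdlib-only illustration, not deltagate's implementation: the helper names and the toy data are made up for this example. It computes the standard error both ways on two runs that agree on all but 8 of 200 samples.

```python
import math
from statistics import NormalDist, mean, stdev


def paired_vs_unpaired(a, b):
    """Return (delta, paired_se, unpaired_se) for two aligned score lists."""
    if len(a) != len(b):
        raise ValueError("scores must be aligned sample-for-sample")
    n = len(a)
    diffs = [x - y for x, y in zip(a, b)]
    delta = mean(diffs)
    # Paired: the SE comes from the per-sample differences.
    paired_se = stdev(diffs) / math.sqrt(n)
    # Unpaired: treats the two runs as independent samples.
    unpaired_se = math.sqrt(stdev(a) ** 2 / n + stdev(b) ** 2 / n)
    return delta, paired_se, unpaired_se


def two_sided_p(delta, se):
    """Large-sample normal p-value for H0: delta == 0."""
    return 2 * NormalDist().cdf(-abs(delta) / se)


# Toy data (hypothetical): B is correct on every even-numbered sample;
# A matches B everywhere and additionally fixes 8 samples B got wrong.
scores_b = [1 if i % 2 == 0 else 0 for i in range(200)]
scores_a = list(scores_b)
for i in range(1, 16, 2):
    scores_a[i] = 1

delta, paired_se, unpaired_se = paired_vs_unpaired(scores_a, scores_b)
print(f"delta={delta:+.4f}")
print(f"paired   SE={paired_se:.4f}  p={two_sided_p(delta, paired_se):.4g}")
print(f"unpaired SE={unpaired_se:.4f}  p={two_sided_p(delta, unpaired_se):.4g}")
```

On this toy pair the same 4-point delta clears α=0.05 with the paired standard error and fails it with the unpaired one, because the shared per-sample difficulty cancels in the differences.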
deltagate is a small, framework-agnostic library (numpy + stdlib, nothing
else) that gets these statistics right and hands you a decomposable verdict.
It is the sibling of edgegate (the same kind of statistical gate, for trading
backtests) and generalizes the eval-reliability toolkit the author contributed
to Inspect AI (the ci() metric and paired/multiplicity/power helpers).
Install
pip install git+https://github.com/yongzhe2160cs/eval-reliability
# or from a clone: pip install -e ".[dev]"
Sixty seconds
import deltagate as dg
# Two models, same samples — sequences already aligned, or {sample_id: score} dicts:
report = dg.evaluate_comparison(scores_a, scores_b, name="math_word_problems")
print(report.render())
# == math_word_problems ==
# n=500 mean A=0.5820 mean B=0.5260 delta=+0.0560
# paired 95% CI [+0.0298, +0.0822] p=2.712e-05 (paired SE 0.0133)
# BCa bootstrap CI [+0.0303, +0.0820]
# standardized delta=0.188 P(real)=1.000
# min detectable delta at n=500: 0.0374
# verdict: REAL at alpha=0.05: delta +0.0560
# A whole suite, with Holm / Benjamini-Hochberg correction across tasks:
sr = dg.reliability_report({task: (a, b) for task, (a, b) in suite.items()})
print(sr.render())
Real eval outputs plug in through adapters (all stdlib-only):
from deltagate.adapters import (
    LMEvalHarnessAdapter, InspectLogAdapter, RawScoresAdapter, compare_runs,
)
# lm-evaluation-harness --log_samples JSONL:
report = compare_runs(LMEvalHarnessAdapter(metric="acc"),
                      "samples_gsm8k_modelA.jsonl", "samples_gsm8k_modelB.jsonl")
# Inspect AI logs (JSON or .eval zip; C/I/P/N mapped like Inspect's value_to_float):
report = compare_runs(InspectLogAdapter(scorer="match"), "logs/a.eval", "logs/b.eval")
# Anything that can dump an (id, score) CSV or JSON:
report = compare_runs(RawScoresAdapter(), "a.csv", "b.csv")
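Pairing only works when both runs scored exactly the same samples. As a rough illustration of what strict alignment of two {sample_id: score} mappings can look like, here is a minimal sketch. `align_by_id` is a hypothetical helper written for this example; it is not deltagate's align_paired, whose exact signature is not shown here.

```python
def align_by_id(scores_a, scores_b):
    """Align two {sample_id: score} dicts; refuse mismatched id sets.

    Hypothetical illustration only. Returns (ids, a_scores, b_scores)
    with all three lists in the same id order.
    """
    ids_a, ids_b = set(scores_a), set(scores_b)
    if ids_a != ids_b:
        only_a = sorted(ids_a - ids_b, key=str)
        only_b = sorted(ids_b - ids_a, key=str)
        raise ValueError(
            f"id sets differ: {len(only_a)} only in A, {len(only_b)} only in B "
            f"(examples: {(only_a + only_b)[:3]})"
        )
    ids = sorted(ids_a, key=str)
    return ids, [scores_a[i] for i in ids], [scores_b[i] for i in ids]


ids, a, b = align_by_id({"q2": 1.0, "q1": 0.0}, {"q1": 1.0, "q2": 1.0})
print(ids, a, b)
```

Raising on a mismatch instead of intersecting the id sets keeps a truncated or partially failed run from quietly shrinking the comparison.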
The demo: "A beats B on 5 of 12 tasks!"
python examples/demo.py synthesizes realistic lm-evaluation-harness output
for two models across 12 tasks × 500 samples. The generator mirrors how real
model pairs behave: A answers like B on most samples and differs on a few
(fixing some wrong answers, breaking some right ones). Ground truth: A is
genuinely better on exactly 2 tasks (+5 points); the other 10 are identical
models, so every other gap is sampling noise. Four readings of the same
files:
reading 1 — the leaderboard (compare two means):
'A beats B' on 5/12 tasks: ['math_word_problems', 'code_completion',
'reading_comp', 'translation_fr', 'instruction_following']
reading 2 — unpaired t-test (the textbook test, WRONG for shared samples):
'significant' on 0/12 tasks: []
-> MISSES both real effects: ignoring the pairing throws away the per-sample
difficulty both models share, so the error bars are several times too wide.
reading 3 — paired tests, NO multiplicity control (p < .05 each):
'significant' on 3/12 tasks: ['math_word_problems', 'code_completion', 'translation_fr']
-> includes a false positive: scan 10 null tasks at alpha=.05 and flukes
are expected — this is what suite-level correction is for.
reading 4 — deltagate (paired CIs + Holm/BH across the suite):
== suite: 12 tasks, alpha=0.05 ==
naive per-task 'wins' : 3 ['math_word_problems', 'code_completion', 'translation_fr']
survive Holm (family-wise) : 2 ['math_word_problems', 'code_completion']
survive Benjamini-Hochberg : 2 ['math_word_problems', 'code_completion']
Exactly the two ground-truth effects survive; the fluke dies at the suite level; and the nulls are reported honestly, with the noise floor attached:
== translation_fr == <- the naive false positive (p=0.027)
n=500 mean A=0.7460 mean B=0.7200 delta=+0.0260
paired 95% CI [+0.0029, +0.0491] p=0.02739
-> killed by Holm/BH across the 12-task family
== table_qa == <- an honest null
n=500 delta=-0.0080 p=0.4144
min detectable delta at n=500: 0.0275 ** observed delta is below this **
verdict: UNRESOLVED: ... — more samples needed, not more discussion
The seed is fixed for reproducibility, not mined: across 40 seeds the unpaired test misses at least one real effect in ~90% of runs, and ~5% of null tasks per run clear uncorrected significance — exactly what α predicts.
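For readers who want to see what the suite-level step does, the following is a minimal sketch of the standard Holm and Benjamini-Hochberg adjusted p-values. It is a textbook version written for illustration, not deltagate's holm_bonferroni / benjamini_hochberg code. The 12 p-values are illustrative: only 2.712e-05, 0.02739 and 0.4144 appear in the output shown in this document, and the other nine are made up.

```python
def holm_adjust(pvals):
    """Holm step-down adjusted p-values (controls family-wise error)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1),
        # then forced to be non-decreasing in rank.
        running_max = max(running_max, (m - rank) * pvals[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted


def bh_adjust(pvals):
    """Benjamini-Hochberg step-up adjusted p-values (controls FDR)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m - 1, -1, -1):
        i = order[rank]
        # The k-th smallest p-value is scaled by m / k,
        # then forced to be non-decreasing in rank.
        running_min = min(running_min, pvals[i] * m / (rank + 1))
        adjusted[i] = running_min
    return adjusted


# Illustrative suite of 12 per-task p-values (mostly hypothetical, see above).
pvals = [2.712e-05, 0.0004, 0.02739, 0.4144, 0.11, 0.23,
         0.35, 0.52, 0.61, 0.74, 0.88, 0.95]
alpha = 0.05
naive = sum(p < alpha for p in pvals)
holm = sum(p <= alpha for p in holm_adjust(pvals))
bh = sum(p <= alpha for p in bh_adjust(pvals))
print(f"naive wins: {naive}, survive Holm: {holm}, survive BH: {bh}")
```

With these inputs a p-value of 0.027 is significant on its own but not after either correction, which is the pattern the demo reports for translation_fr.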
Run it on your own files
examples/compare_runs.py is the reusable entry point — point it at any two
per-sample score files for the same task:
python examples/compare_runs.py # bundled sample data (lm-eval-shaped)
python examples/compare_runs.py A.jsonl B.jsonl --metric acc # lm-eval --log_samples
python examples/compare_runs.py a.eval b.eval --format inspect # Inspect AI logs
python examples/compare_runs.py a.csv b.csv --format raw # plain id,score files
python examples/compare_runs.py A.jsonl B.jsonl --n-trials 25 \
    --trial-deltas "0.02,-0.01,..." # best-of-N selection correction
On the bundled sample pair (400 samples, a real +5.5-point effect):
== samples_gsm8k_modelA vs samples_gsm8k_modelB ==
n=400 mean A=0.6325 mean B=0.5775 delta=+0.0550
paired 95% CI [+0.0207, +0.0893] p=0.001657 (paired SE 0.0175)
BCa bootstrap CI [+0.0225, +0.0900]
standardized delta=0.157 P(real)=0.999
min detectable delta at n=400: 0.0490
verdict: REAL at alpha=0.05: delta +0.0550
If you claim a best-of-N selection correction without supplying the other trials' deltas, the verdict says so explicitly ("UNCORRECTED for selection") rather than silently pretending — the library refuses to guess the trial variance.
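Why the trial variance matters can be seen from the expected-maximum approximation used in Bailey & López de Prado's Deflated Sharpe Ratio, which this library says it adapts to standardized deltas. The sketch below is an assumption-laden illustration: the function name and signature are hypothetical, and it is not deltagate's expected_max_std_delta.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def expected_max_under_null(n_trials, trial_variance):
    """Approximate E[max] of n_trials independent null statistics.

    Hypothetical helper: the expected-maximum approximation from the
    Deflated Sharpe Ratio, scaled by the spread across trials.
    """
    if n_trials < 2:
        return 0.0  # a single trial involves no selection
    inv = NormalDist().inv_cdf
    return math.sqrt(trial_variance) * (
        (1 - EULER_GAMMA) * inv(1 - 1 / n_trials)
        + EULER_GAMMA * inv(1 - 1 / (n_trials * math.e))
    )


# With unit variance across trials, the best of 25 null variants is
# already expected to look like a z-score of roughly 2.
threshold = expected_max_under_null(25, trial_variance=1.0)
print(f"expected best-of-25 under the null: {threshold:.3f}")
```

The threshold scales with the square root of the trial variance, so without the other trials' deltas there is no principled value to deflate against.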
What's in the box
| API | What it gives you |
|---|---|
| paired_delta, align_paired | Paired CI + significance on per-sample differences (the correctness point), with strict id alignment |
| holm_bonferroni, benjamini_hochberg | Suite-level corrections (family-wise error / false discovery rate) with adjusted p-values |
| min_samples_for_delta, power_for_samples | Power analysis (textbook check: d/σ = 0.5 at 80% power ⇒ n = 32) |
| bootstrap_ci, bootstrap_delta_ci, percentile_stat | Percentile & BCa bootstrap CIs, incl. tail percentiles (p95 score, worst-decile delta) |
| probabilistic_delta, deflated_delta, expected_max_std_delta | Selection-bias-aware significance: "you tried 25 prompt variants and report the best; is the delta still real?" |
| variance_components, minimum_detectable_delta, red_flags | Noise diagnostics: clustered SE + design effect, the eval's noise floor, contamination red flags (identical runs, constant shifts, saturated benchmarks) |
| evaluate_comparison, reliability_report | One call from two runs (or a suite) to a decomposable verdict |
| deltagate.adapters | LMEvalHarnessAdapter, InspectLogAdapter, RawScoresAdapter, and a ScoresAdapter protocol for new frameworks |
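The textbook power check in the table can be reproduced in a few lines. This is a minimal sketch using the usual two-sided normal approximation; the function names are hypothetical and are not the library's min_samples_for_delta or minimum_detectable_delta. The second helper assumes the noise floor is (z for 1 − α/2 plus z for power) times the paired SE, which matches the sample output earlier in this document (paired SE 0.0175 gives about 0.0490).

```python
import math
from statistics import NormalDist


def samples_needed(effect_over_sigma, alpha=0.05, power=0.80):
    """Pairs needed to detect a standardized paired effect (two-sided)."""
    z = NormalDist().inv_cdf
    z_total = z(1 - alpha / 2) + z(power)
    return math.ceil((z_total / effect_over_sigma) ** 2)


def detectable_delta(paired_se, alpha=0.05, power=0.80):
    """Smallest delta detectable at the given power for a known paired SE."""
    z = NormalDist().inv_cdf
    return (z(1 - alpha / 2) + z(power)) * paired_se


n_needed = samples_needed(0.5)      # d/sigma = 0.5 at 80% power
floor = detectable_delta(0.0175)    # paired SE from the n=400 sample pair
print(n_needed, round(floor, 4))
```

Halving the target effect roughly quadruples the required sample count, which is why a 1-point claim usually needs far more samples than a 3-point one.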
Design choices worth knowing:
- Paired everywhere. Comparison APIs take per-sample scores aligned by id, and align_paired refuses mismatched id sets rather than silently intersecting them (a mismatch usually means a broken run, not a choice).
- BCa with tie-aware bias correction. Binary accuracy makes bootstrap distributions lumpy; the z0 estimate half-weights exact ties so discrete metrics don't pick up a spurious bias correction.
- Deflation refuses to guess. deflated_delta with n_trials > 1 returns NaN unless you supply the trial variance/deltas; a silently guessed selection correction would be worse than none.
- Verdicts decompose. ComparisonReport exposes every number behind the verdict (paired stats, BCa bounds, P(real), minimum detectable delta, red flags); "trust me" is the failure mode this library exists to end.
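The tie-aware bias correction in the second design choice can be sketched as follows. This is an illustration of the idea under a stated assumption (count replicates strictly below the estimate, plus half of the exact ties), not the library's code; the function name, the clamping rule, and the toy replicates are made up for this example.

```python
from statistics import NormalDist


def bias_correction_z0(replicates, estimate, tie_weight=0.5):
    """BCa bias-correction term z0 with exact ties weighted by tie_weight."""
    n = len(replicates)
    below = sum(r < estimate for r in replicates)
    ties = sum(r == estimate for r in replicates)
    frac = (below + tie_weight * ties) / n
    # Keep the fraction strictly inside (0, 1) so inv_cdf is defined.
    eps = 0.5 / n
    frac = min(max(frac, eps), 1 - eps)
    return NormalDist().inv_cdf(frac)


# Toy bootstrap replicates of a discrete metric: symmetric around the
# estimate, with most of the mass landing exactly on it.
replicates = [0.02, 0.04, 0.04, 0.04, 0.06]
estimate = 0.04

z0_half = bias_correction_z0(replicates, estimate)                    # ties count half
z0_strict = bias_correction_z0(replicates, estimate, tie_weight=0.0)  # ties ignored
z0_inclusive = bias_correction_z0(replicates, estimate, tie_weight=1.0)
print(z0_half, z0_strict, z0_inclusive)
```

On a symmetric but lumpy distribution, half-weighting gives a z0 of zero, while counting ties as all below or all above produces a nonzero correction in opposite directions.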
Statistical provenance
The math is ported from two bodies of prior, separately-verified work by the same author, not invented here:
- the Inspect AI eval-reliability contribution: paired delta, Holm/Benjamini-Hochberg, power, variance components, each validated against hand computations and scipy.stats references in that work's test suite;
- the edgegate trading-validation library: the Probabilistic/Deflated Sharpe Ratio machinery (Bailey & López de Prado, with the full skew/kurtosis correction) and the normal inverse-CDF, here adapted from return series to standardized score deltas.
This package's own 40-test suite re-asserts the same reference numbers: the hand-computed paired case (δ=0.75, SE=0.25, p=2Φ(−3)), the Holm/BH reference p-sets, the textbook power n=32, BCa-vs-normal agreement on symmetric data, BCa tail-percentile coverage, and DSR ≤ PSR monotonicity.
Honest limits: intervals and p-values use large-sample normal approximations
(fine at eval scale, n ≳ 50 pairs); per-sample scores are assumed exchangeable
within a task (use variance_components' cluster support when they aren't);
and red flags are heuristics to investigate, not verdicts.
Development
uv venv && uv pip install -e ".[dev]"
pytest -q # 40 tests
ruff check . && ruff format --check .
python examples/demo.py
Publishing
The package is PyPI-ready (python -m build produces a wheel + sdist that
pass twine check; the name deltagate was available at the time of
writing). Publishing requires the maintainer's PyPI token:
python -m build && twine upload dist/*
MIT license.
deltagate is part of a statistical-rigor-for-AI-evals toolkit: agentrel (reliability stats for stochastic agent evals), calibstats (calibration metrics with confidence intervals), leaderboard-ci (leaderboard re-ranking with CIs and tie bands). Full portfolio: github.com/yongzhe2160cs.