Statistical validation for LLM/ML eval comparisons: paired delta CIs, multiple-testing correction, deflated significance, power analysis, and noise diagnostics. Most reported eval deltas are noise — this gates them.

These details have not been verified by PyPI

Project description

deltagate

Statistical validation for LLM/ML eval comparisons. Most reported eval deltas are noise — this gates them.

A model is declared "better" on a one-number delta. A suite of 12 tasks gets scanned for wins. A sample size is chosen by budget, and a 1-point gap is then reported as a finding. These claims usually carry no error bars, and the three ways they go wrong are specific and fixable:

The comparison is paired, but analysed as independent. Two models run on the same samples share per-sample difficulty; the correct standard error comes from the per-sample differences, which can be several times tighter. Getting this wrong fails in both directions — the unpaired test misses real effects, while eyeballing two means manufactures fake ones.
Multiple comparisons. Scan 10 null tasks at α=0.05 and at least one "significant win" appears in about 40% of suites (1 − 0.95^10 ≈ 0.40), by construction.
No power analysis. If the minimum detectable delta at your n is 3 points, an observed 1-point delta is unresolvable — more samples needed, not more discussion.
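
The first failure mode is easy to see numerically. The sketch below is not deltagate's implementation: it uses only the standard library and synthetic scores (the 10% flip rate and the names `paired_se` and `unpaired_se` are assumptions made for illustration) to compare the two standard errors on the same data.

```python
import math
import random
import statistics


def paired_se(a, b):
    """Standard error of mean(a) - mean(b), from the per-sample differences."""
    if len(a) != len(b):
        raise ValueError("paired scores must have the same length")
    diffs = [x - y for x, y in zip(a, b)]
    return statistics.stdev(diffs) / math.sqrt(len(diffs))


def unpaired_se(a, b):
    """Standard error that (wrongly) treats the two runs as independent."""
    return math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))


# Hypothetical data: model A answers like model B on roughly 90% of samples.
rng = random.Random(0)
scores_b = [1.0 if rng.random() < 0.55 else 0.0 for _ in range(500)]
scores_a = [1.0 - s if rng.random() < 0.10 else s for s in scores_b]

se_p = paired_se(scores_a, scores_b)
se_u = unpaired_se(scores_a, scores_b)
print(f"paired SE   = {se_p:.4f}")
print(f"unpaired SE = {se_u:.4f}  ({se_u / se_p:.1f}x wider)")
```

The more the two models agree sample by sample, the larger the gap between the two numbers becomes.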

deltagate is a small, framework-agnostic library (numpy + stdlib, nothing else) that computes these statistics correctly and hands you a decomposable verdict. It is the sibling of edgegate (the same kind of statistical gate, for trading backtests) and generalizes the eval-reliability toolkit the author contributed to Inspect AI (the ci() metric and paired/multiplicity/power helpers).

Install

pip install git+https://github.com/yongzhe2160cs/eval-reliability
# or from a clone:  pip install -e ".[dev]"

Sixty seconds

import deltagate as dg

# Two models, same samples — sequences already aligned, or {sample_id: score} dicts:
report = dg.evaluate_comparison(scores_a, scores_b, name="math_word_problems")
print(report.render())
# == math_word_problems ==
#   n=500  mean A=0.5820  mean B=0.5260  delta=+0.0560
#   paired 95% CI [+0.0298, +0.0822]  p=2.712e-05  (paired SE 0.0133)
#   BCa bootstrap CI [+0.0303, +0.0820]
#   standardized delta=0.188  P(real)=1.000
#   min detectable delta at n=500: 0.0374
#   verdict: REAL at alpha=0.05: delta +0.0560

# A whole suite (suite maps task name -> (scores_a, scores_b)), with Holm / Benjamini-Hochberg correction across tasks:
sr = dg.reliability_report({task: (a, b) for task, (a, b) in suite.items()})
print(sr.render())

Real eval outputs plug in through adapters (all stdlib-only):

from deltagate.adapters import (
    LMEvalHarnessAdapter, InspectLogAdapter, RawScoresAdapter, compare_runs,
)

# lm-evaluation-harness --log_samples JSONL:
report = compare_runs(LMEvalHarnessAdapter(metric="acc"),
                      "samples_gsm8k_modelA.jsonl", "samples_gsm8k_modelB.jsonl")

# Inspect AI logs (JSON or .eval zip; C/I/P/N mapped like Inspect's value_to_float):
report = compare_runs(InspectLogAdapter(scorer="match"), "logs/a.eval", "logs/b.eval")

# Anything that can dump an (id, score) CSV or JSON:
report = compare_runs(RawScoresAdapter(), "a.csv", "b.csv")
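
Whatever the source format, a paired comparison needs both runs keyed by the same sample ids. The following is a minimal sketch of strict alignment on two `{sample_id: score}` dicts. The function name `align_strict` and the error message are hypothetical, and this is not the library's `align_paired`; it only illustrates refusing a mismatch instead of silently intersecting.

```python
def align_strict(run_a, run_b):
    """Align two {sample_id: score} dicts; raise on any id mismatch."""
    only_a = run_a.keys() - run_b.keys()
    only_b = run_b.keys() - run_a.keys()
    if only_a or only_b:
        raise ValueError(
            f"id mismatch: {len(only_a)} only in A, {len(only_b)} only in B"
        )
    ids = sorted(run_a)  # assumes ids are mutually comparable, e.g. all strings
    return ids, [run_a[i] for i in ids], [run_b[i] for i in ids]


# Same ids in a different order: alignment succeeds.
ids, a, b = align_strict({"q1": 1.0, "q2": 0.0}, {"q2": 1.0, "q1": 1.0})
print(ids, a, b)
```

Raising on a mismatch is the conservative choice here: a missing id usually points at a truncated or misconfigured run, which no statistical test can repair.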

The demo: "A beats B on 5 of 12 tasks!"

python examples/demo.py synthesizes realistic lm-evaluation-harness output for two models across 12 tasks × 500 samples. The generator mirrors how real model pairs behave: A answers like B on most samples and differs on a few (fixing some wrong answers, breaking some right ones). Ground truth: A is genuinely better on exactly 2 tasks (+5 points); the other 10 are identical models, so every other gap is sampling noise. Four readings of the same files:

reading 1 — the leaderboard (compare two means):
  'A beats B' on 5/12 tasks: ['math_word_problems', 'code_completion',
                              'reading_comp', 'translation_fr', 'instruction_following']

reading 2 — unpaired t-test (the textbook test, WRONG for shared samples):
  'significant' on 0/12 tasks: []
  -> MISSES both real effects: ignoring the pairing throws away the per-sample
     difficulty both models share, so the error bars are several times too wide.

reading 3 — paired tests, NO multiplicity control (p < .05 each):
  'significant' on 3/12 tasks: ['math_word_problems', 'code_completion', 'translation_fr']
  -> includes a false positive: scan 10 null tasks at alpha=.05 and flukes
     are expected — this is what suite-level correction is for.

reading 4 — deltagate (paired CIs + Holm/BH across the suite):
  == suite: 12 tasks, alpha=0.05 ==
  naive per-task 'wins'        : 3 ['math_word_problems', 'code_completion', 'translation_fr']
  survive Holm (family-wise)   : 2 ['math_word_problems', 'code_completion']
  survive Benjamini-Hochberg   : 2 ['math_word_problems', 'code_completion']

Exactly the two ground-truth effects survive; the fluke dies at the suite level; and the nulls are reported honestly, with the noise floor attached:

== translation_fr ==                  <- the naive false positive (p=0.027)
  n=500  mean A=0.7460  mean B=0.7200  delta=+0.0260
  paired 95% CI [+0.0029, +0.0491]  p=0.02739
  -> killed by Holm/BH across the 12-task family

== table_qa ==                        <- an honest null
  n=500  delta=-0.0080  p=0.4144
  min detectable delta at n=500: 0.0275  ** observed delta is below this **
  verdict: UNRESOLVED: ... — more samples needed, not more discussion

The seed is fixed for reproducibility, not mined: across 40 seeds the unpaired test misses at least one real effect in ~90% of runs, and ~5% of null tasks per run clear uncorrected significance — exactly what α predicts.
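
The suite-level step can be sketched in a few lines of standard-library Python. This is an illustration of the textbook Holm step-down and Benjamini-Hochberg step-up procedures, not deltagate's own `holm_bonferroni` / `benjamini_hochberg` code; the names `holm_reject` and `bh_reject` are hypothetical. Three of the p-values below appear in the outputs on this page (2.712e-05, 0.02739, 0.4144); the other nine are made-up placeholders standing in for the remaining tasks.

```python
def holm_reject(pvals, alpha=0.05):
    """Holm step-down: controls the family-wise error rate."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for rank, i in enumerate(order):  # rank 0 is the smallest p-value
        if pvals[i] > alpha / (m - rank):
            break  # the first failure stops the procedure
        reject[i] = True
    return reject


def bh_reject(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up: controls the false discovery rate."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    cutoff = -1
    for rank, i in enumerate(order):
        if pvals[i] <= alpha * (rank + 1) / m:
            cutoff = rank  # remember the largest passing rank
    reject = [False] * m
    for rank, i in enumerate(order):
        reject[i] = rank <= cutoff
    return reject


# 12 tasks; values not quoted from this page are illustrative placeholders.
pvals = [2.712e-05, 4.0e-04, 0.02739, 0.4144, 0.12, 0.31,
         0.55, 0.62, 0.70, 0.81, 0.90, 0.97]

naive = sum(p < 0.05 for p in pvals)
fwer_10_nulls = 1 - (1 - 0.05) ** 10  # chance of at least one fluke in 10 nulls

print("naive wins      :", naive)
print("survive Holm    :", sum(holm_reject(pvals)))
print("survive BH      :", sum(bh_reject(pvals)))
print(f"P(fluke in 10 nulls) = {fwer_10_nulls:.3f}")
```

With these inputs the third-smallest p-value would need to fall below 0.05/10 = 0.005 to survive Holm, which 0.02739 does not.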

Run it on your own files

examples/compare_runs.py is the reusable entry point — point it at any two per-sample score files for the same task:

python examples/compare_runs.py                      # bundled sample data (lm-eval-shaped)
python examples/compare_runs.py A.jsonl B.jsonl --metric acc            # lm-eval --log_samples
python examples/compare_runs.py a.eval b.eval --format inspect          # Inspect AI logs
python examples/compare_runs.py a.csv  b.csv  --format raw              # plain id,score files
python examples/compare_runs.py A.jsonl B.jsonl --n-trials 25 \
       --trial-deltas "0.02,-0.01,..."               # best-of-N selection correction

On the bundled sample pair (400 samples, a real +5.5-point effect):

== samples_gsm8k_modelA vs samples_gsm8k_modelB ==
  n=400  mean A=0.6325  mean B=0.5775  delta=+0.0550
  paired 95% CI [+0.0207, +0.0893]  p=0.001657  (paired SE 0.0175)
  BCa bootstrap CI [+0.0225, +0.0900]
  standardized delta=0.157  P(real)=0.999
  min detectable delta at n=400: 0.0490
  verdict: REAL at alpha=0.05: delta +0.0550

If you claim a best-of-N selection correction without supplying the other trials' deltas, the verdict says so explicitly ("UNCORRECTED for selection") rather than silently pretending — the library refuses to guess the trial variance.
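
To see why the trial variance matters, consider how large the best of N pure-noise deltas is expected to be. The sketch below uses the expected-maximum approximation from the Bailey & López de Prado line of work cited under Statistical provenance. Whether deltagate's `expected_max_std_delta` uses exactly this form is an assumption, and the name `expected_max_delta` is hypothetical.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def expected_max_delta(n_trials, trial_std):
    """Approximate E[max] of n_trials independent N(0, trial_std^2) deltas."""
    if n_trials < 2:
        return 0.0  # a single trial involves no selection
    inv = NormalDist().inv_cdf
    return trial_std * (
        (1 - EULER_GAMMA) * inv(1 - 1 / n_trials)
        + EULER_GAMMA * inv(1 - 1 / (n_trials * math.e))
    )


# With 25 trials the best null delta sits roughly 2 trial standard deviations above zero.
for n in (1, 5, 25, 100):
    print(n, round(expected_max_delta(n, trial_std=1.0), 3))
```

The result scales linearly with `trial_std`, which is why a correction cannot be computed when the spread of the other trials is unknown.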

What's in the box

API	What it gives you
`paired_delta`, `align_paired`	Paired CI + significance on per-sample differences (the correctness point), with strict id alignment
`holm_bonferroni`, `benjamini_hochberg`	Suite-level corrections — family-wise error / false discovery rate — with adjusted p-values
`min_samples_for_delta`, `power_for_samples`	Power analysis (textbook check: d/σ = 0.5 at 80% power ⇒ n = 32)
`bootstrap_ci`, `bootstrap_delta_ci`, `percentile_stat`	Percentile & BCa bootstrap CIs, incl. tail percentiles (p95 score, worst-decile delta)
`probabilistic_delta`, `deflated_delta`, `expected_max_std_delta`	Selection-bias-aware significance: "you tried 25 prompt variants and report the best — is the delta still real?"
`variance_components`, `minimum_detectable_delta`, `red_flags`	Noise diagnostics: clustered SE + design effect, the eval's noise floor, contamination red flags (identical runs, constant shifts, saturated benchmarks)
`evaluate_comparison`, `reliability_report`	One call from two runs (or a suite) to a decomposable verdict
`deltagate.adapters`	`LMEvalHarnessAdapter`, `InspectLogAdapter`, `RawScoresAdapter`, and a `ScoresAdapter` protocol for new frameworks
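
The textbook power check in the table can be reproduced with the standard normal-approximation formula, n = ((z_{1−α/2} + z_{power}) / (δ/σ))². The sketch below is stdlib-only; the function names `min_samples` and `min_detectable` are hypothetical, and whether the library's `min_samples_for_delta` and `minimum_detectable_delta` use exactly this formula is an assumption.

```python
import math
from statistics import NormalDist


def min_samples(effect_over_sigma, alpha=0.05, power=0.80):
    """Pairs needed to detect a standardized delta with a two-sided test."""
    z = NormalDist().inv_cdf
    z_total = z(1 - alpha / 2) + z(power)
    return math.ceil((z_total / effect_over_sigma) ** 2)


def min_detectable(se, alpha=0.05, power=0.80):
    """Smallest delta detectable at the given power, for a known paired SE."""
    z = NormalDist().inv_cdf
    return (z(1 - alpha / 2) + z(power)) * se


print(min_samples(0.5))                   # 32
print(round(min_detectable(0.0175), 4))   # 0.049
```

The second call uses the paired SE printed for the bundled sample pair (0.0175) and lands on the 0.0490 shown in that output, which is at least consistent with this formula.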

Design choices worth knowing:

Paired everywhere. Comparison APIs take per-sample scores aligned by id, and align_paired refuses mismatched id sets rather than silently intersecting them (a mismatch usually means a broken run, not a choice).
BCa with tie-aware bias correction. Binary accuracy makes bootstrap distributions lumpy; the z0 estimate half-weights exact ties so discrete metrics don't pick up a spurious bias correction.
Deflation refuses to guess. deflated_delta with n_trials > 1 returns NaN unless you supply the trial variance/deltas — a silently-guessed selection correction would be worse than none.
Verdicts decompose. ComparisonReport exposes every number behind the verdict (paired stats, BCa bounds, P(real), minimum detectable delta, red flags) — "trust me" is the failure mode this library exists to end.
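
The tie-aware bias correction can be illustrated on a toy bootstrap distribution. This is a sketch of the idea as described above (count exact ties with weight one half), not the library's code; `z0_tie_aware` and `z0_strict` are hypothetical names, and the clamping of the fraction away from 0 and 1 is an added assumption to keep the inverse CDF finite.

```python
from statistics import NormalDist


def _inv_cdf_clamped(frac, b):
    # Keep the fraction strictly inside (0, 1) so inv_cdf is defined.
    frac = min(max(frac, 0.5 / b), 1 - 0.5 / b)
    return NormalDist().inv_cdf(frac)


def z0_tie_aware(boot_stats, observed):
    """BCa bias correction where exact ties count as half below, half above."""
    b = len(boot_stats)
    below = sum(s < observed for s in boot_stats)
    ties = sum(s == observed for s in boot_stats)
    return _inv_cdf_clamped((below + 0.5 * ties) / b, b)


def z0_strict(boot_stats, observed):
    """Common strict-inequality variant, shown for comparison."""
    b = len(boot_stats)
    below = sum(s < observed for s in boot_stats)
    return _inv_cdf_clamped(below / b, b)


# A lumpy, symmetric bootstrap distribution, as binary accuracy tends to produce.
boot = [0.1, 0.2, 0.2, 0.2, 0.3]
print(z0_tie_aware(boot, 0.2))  # 0.0: symmetric around the estimate, no correction
print(z0_strict(boot, 0.2))     # negative: a spurious bias correction from the ties
```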

Statistical provenance

The math is ported from two bodies of prior, separately-verified work by the same author, not invented here:

the Inspect AI eval-reliability contribution — paired delta, Holm/Benjamini-Hochberg, power, variance components, each validated against hand computations and scipy.stats references in that work's test suite;
the edgegate trading-validation library — the Probabilistic/Deflated Sharpe Ratio machinery (Bailey & López de Prado, with the full skew/kurtosis correction) and the normal inverse-CDF, here adapted from return series to standardized score deltas.

This package's own 40-test suite re-asserts the same reference numbers: the hand-computed paired case (δ=0.75, SE=0.25, p=2Φ(−3)), the Holm/BH reference p-sets, the textbook power n=32, BCa-vs-normal agreement on symmetric data, BCa tail-percentile coverage, and DSR ≤ PSR monotonicity.

Honest limits: intervals and p-values use large-sample normal approximations (fine at eval scale, n ≳ 50 pairs); per-sample scores are assumed exchangeable within a task (use variance_components' cluster support when they aren't); and red flags are heuristics to investigate, not verdicts.
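
When samples are not exchangeable (for example, several questions drawn from the same passage), a cluster-robust standard error is the usual remedy. The sketch below shows one common estimator and the resulting design effect. It is an illustration under stated assumptions, not the implementation behind `variance_components`; the function names and the G/(G−1) small-sample factor are choices made here.

```python
import math
from collections import defaultdict


def iid_se(diffs):
    """Ordinary standard error of the mean, assuming independent samples."""
    n = len(diffs)
    mean = sum(diffs) / n
    return math.sqrt(sum((d - mean) ** 2 for d in diffs) / (n - 1) / n)


def clustered_se(diffs, cluster_ids):
    """Cluster-robust SE of the mean: residuals are summed within clusters first."""
    n = len(diffs)
    mean = sum(diffs) / n
    sums = defaultdict(float)
    for d, c in zip(diffs, cluster_ids):
        sums[c] += d - mean
    g = len(sums)
    var = sum(s * s for s in sums.values()) / (n * n)
    return math.sqrt(var * g / (g - 1))  # G/(G-1) small-sample correction


# Hypothetical per-sample deltas: 4 clusters of 3 perfectly correlated samples.
diffs = [1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0]
clusters = ["p1"] * 3 + ["p2"] * 3 + ["p3"] * 3 + ["p4"] * 3

design_effect = (clustered_se(diffs, clusters) / iid_se(diffs)) ** 2
print(f"design effect = {design_effect:.2f}")  # well above 1: far fewer effective samples
```

With one sample per cluster the two estimators coincide, so the design effect is 1 and nothing is lost by using the clustered form.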

Development

uv venv && uv pip install -e ".[dev]"
pytest -q                              # 40 tests
ruff check . && ruff format --check .
python examples/demo.py

Publishing

The package is PyPI-ready (python -m build produces a wheel + sdist that pass twine check; the name deltagate was available at the time of writing). Publishing requires the maintainer's PyPI token:

python -m build && twine upload dist/*

MIT license.

deltagate is part of a statistical-rigor-for-AI-evals toolkit: agentrel (reliability stats for stochastic agent evals), calibstats (calibration metrics with confidence intervals), leaderboard-ci (leaderboard re-ranking with CIs and tie bands). Full portfolio: github.com/yongzhe2160cs.

Project details

Release history

0.1.1 (this version), Jun 14, 2026
0.1.0, Jun 13, 2026

Download files

Source Distribution

deltagate-0.1.1.tar.gz (40.3 kB view details)

Uploaded Jun 14, 2026 Source

Built Distribution

deltagate-0.1.1-py3-none-any.whl (35.5 kB view details)

Uploaded Jun 14, 2026 Python 3

File details

Details for the file deltagate-0.1.1.tar.gz.

File metadata

Download URL: deltagate-0.1.1.tar.gz
Upload date: Jun 14, 2026
Size: 40.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for deltagate-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`e7b9a2e63ce9d041779fbea7dacba8d819f2a38127c86baf42d9875577dd42be`
MD5	`56659eba084cf48d66d792943155fd29`
BLAKE2b-256	`28d468a36286ddd6ed7fd55df7aa40846c4ed40a33cab793eeb536547cfcc166`

File details

Details for the file deltagate-0.1.1-py3-none-any.whl.

File metadata

Download URL: deltagate-0.1.1-py3-none-any.whl
Upload date: Jun 14, 2026
Size: 35.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for deltagate-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`9fb422b74abddbb24207eff40c79ba530d9dd05b44558aacc2ea6859c6f140c7`
MD5	`ea99ff910081eb660462e1b11af1f3d0`
BLAKE2b-256	`0efe94a0ea037900312efbfbc89998310f651b0d9da4287cb15d2c5fdd12e767`
