
Statistical validation for LLM/ML eval comparisons: paired delta CIs, multiple-testing correction, deflated significance, power analysis, and noise diagnostics. Most reported eval deltas are noise — this gates them.


deltagate

Statistical validation for LLM/ML eval comparisons. Most reported eval deltas are noise — this gates them.


A model is declared "better" on a one-number delta. A suite of 12 tasks gets scanned for wins. A sample size is chosen by budget, and a 1-point gap is then reported as a finding. These claims usually carry no error bars, and the three ways they go wrong are specific and fixable:

  1. The comparison is paired, but analysed as independent. Two models run on the same samples share per-sample difficulty, so the correct standard error comes from the per-sample differences and can be several times smaller than the unpaired one. Getting this wrong fails in both directions: the unpaired test misses real effects, while eyeballing two means manufactures fake ones.
  2. Multiple comparisons. Scan 10 null tasks at α=0.05 and at least one "significant win" appears in about 40% of suites (1 − 0.95^10 ≈ 0.40, assuming independent tasks), by construction.
  3. No power analysis. If the minimum detectable delta at your n is 3 points, an observed 1-point delta is unresolvable — more samples needed, not more discussion.
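
To make the first point concrete, here is a minimal sketch on synthetic data. It is not deltagate's code; the generator and every variable name are assumptions for illustration. Model B copies model A's answer on most samples, so the per-sample differences are mostly zero and the paired standard error comes out much smaller than the unpaired one:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical generator: each sample has its own difficulty, shared by both models.
difficulty = rng.uniform(0.2, 0.9, size=n)            # P(correct) per sample
a = (rng.uniform(size=n) < difficulty).astype(float)

# B repeats A's answer on ~90% of samples and answers independently on the rest.
copies_a = rng.uniform(size=n) < 0.9
b_independent = (rng.uniform(size=n) < difficulty).astype(float)
b = np.where(copies_a, a, b_independent)

# Paired SE: standard error of the mean per-sample difference.
d = a - b
se_paired = d.std(ddof=1) / np.sqrt(n)

# Unpaired SE: treats the two runs as independent samples.
se_unpaired = np.sqrt(a.var(ddof=1) / n + b.var(ddof=1) / n)

print(f"paired SE={se_paired:.4f}  unpaired SE={se_unpaired:.4f}")
```

On data shaped like this the unpaired error bar is a few times wider than the paired one, which is how an unpaired test ends up missing an effect that a paired test resolves.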

deltagate is a small, framework-agnostic library (numpy + stdlib, nothing else) that does these statistics correctly and hands you a decomposable verdict. It is the sibling of edgegate (the same kind of statistical gate, for trading backtests) and generalizes the eval-reliability toolkit the author contributed to Inspect AI (the ci() metric and paired/multiplicity/power helpers).

Install

pip install git+https://github.com/yongzhe2160cs/eval-reliability
# or from a clone:  pip install -e ".[dev]"

Sixty seconds

import deltagate as dg

# Two models, same samples — sequences already aligned, or {sample_id: score} dicts:
report = dg.evaluate_comparison(scores_a, scores_b, name="math_word_problems")
print(report.render())
# == math_word_problems ==
#   n=500  mean A=0.5820  mean B=0.5260  delta=+0.0560
#   paired 95% CI [+0.0298, +0.0822]  p=2.712e-05  (paired SE 0.0133)
#   BCa bootstrap CI [+0.0303, +0.0820]
#   standardized delta=0.188  P(real)=1.000
#   min detectable delta at n=500: 0.0374
#   verdict: REAL at alpha=0.05: delta +0.0560

# A whole suite, with Holm / Benjamini-Hochberg correction across tasks:
sr = dg.reliability_report({task: (a, b) for task, (a, b) in suite.items()})
print(sr.render())

Real eval outputs plug in through adapters (all stdlib-only):

from deltagate.adapters import (
    LMEvalHarnessAdapter, InspectLogAdapter, RawScoresAdapter, compare_runs,
)

# lm-evaluation-harness --log_samples JSONL:
report = compare_runs(LMEvalHarnessAdapter(metric="acc"),
                      "samples_gsm8k_modelA.jsonl", "samples_gsm8k_modelB.jsonl")

# Inspect AI logs (JSON or .eval zip; C/I/P/N mapped like Inspect's value_to_float):
report = compare_runs(InspectLogAdapter(scorer="match"), "logs/a.eval", "logs/b.eval")

# Anything that can dump an (id, score) CSV or JSON:
report = compare_runs(RawScoresAdapter(), "a.csv", "b.csv")
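
If your framework has no adapter yet, the shape a paired comparison needs is simple: one score per sample id, with both runs covering the same ids. A rough stdlib-only sketch of that idea follows. read_scores and align_strict are hypothetical helpers, not deltagate functions, and the id,score header row is an assumption about the file layout:

```python
import csv
import io


def read_scores(text: str) -> dict[str, float]:
    """Parse 'id,score' CSV text (header row assumed) into {sample_id: score}."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["id"]: float(row["score"]) for row in reader}


def align_strict(a: dict[str, float], b: dict[str, float]) -> tuple[list[float], list[float]]:
    """Return both score lists in one shared id order; raise if the id sets differ."""
    if a.keys() != b.keys():
        unmatched = sorted(a.keys() ^ b.keys())
        raise ValueError(f"id sets differ; unmatched ids: {unmatched}")
    ids = sorted(a)
    return [a[i] for i in ids], [b[i] for i in ids]


# Two tiny runs whose rows arrive in different orders.
run_a = read_scores("id,score\nq1,1\nq2,0\nq3,1\n")
run_b = read_scores("id,score\nq3,1\nq1,0\nq2,0\n")
scores_a, scores_b = align_strict(run_a, run_b)
print(scores_a, scores_b)  # [1.0, 0.0, 1.0] [0.0, 0.0, 1.0]
```

Raising on a mismatch, rather than intersecting the ids, keeps a truncated or half-failed run from quietly turning into a smaller comparison.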

The demo: "A beats B on 5 of 12 tasks!"

python examples/demo.py synthesizes realistic lm-evaluation-harness output for two models across 12 tasks × 500 samples. The generator mirrors how real model pairs behave: A answers like B on most samples and differs on a few (fixing some wrong answers, breaking some right ones). Ground truth: A is genuinely better on exactly 2 tasks (+5 points); the other 10 are identical models, so every other gap is sampling noise. Four readings of the same files:

reading 1 — the leaderboard (compare two means):
  'A beats B' on 5/12 tasks: ['math_word_problems', 'code_completion',
                              'reading_comp', 'translation_fr', 'instruction_following']

reading 2 — unpaired t-test (the textbook test, WRONG for shared samples):
  'significant' on 0/12 tasks: []
  -> MISSES both real effects: ignoring the pairing throws away the per-sample
     difficulty both models share, so the error bars are several times too wide.

reading 3 — paired tests, NO multiplicity control (p < .05 each):
  'significant' on 3/12 tasks: ['math_word_problems', 'code_completion', 'translation_fr']
  -> includes a false positive: scan 10 null tasks at alpha=.05 and flukes
     are expected — this is what suite-level correction is for.

reading 4 — deltagate (paired CIs + Holm/BH across the suite):
  == suite: 12 tasks, alpha=0.05 ==
  naive per-task 'wins'        : 3 ['math_word_problems', 'code_completion', 'translation_fr']
  survive Holm (family-wise)   : 2 ['math_word_problems', 'code_completion']
  survive Benjamini-Hochberg   : 2 ['math_word_problems', 'code_completion']

Exactly the two ground-truth effects survive; the fluke dies at the suite level; and the nulls are reported honestly, with the noise floor attached:

== translation_fr ==                  <- the naive false positive (p=0.027)
  n=500  mean A=0.7460  mean B=0.7200  delta=+0.0260
  paired 95% CI [+0.0029, +0.0491]  p=0.02739
  -> killed by Holm/BH across the 12-task family

== table_qa ==                        <- an honest null
  n=500  delta=-0.0080  p=0.4144
  min detectable delta at n=500: 0.0275  ** observed delta is below this **
  verdict: UNRESOLVED: ... — more samples needed, not more discussion
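
The suite-level step that removes translation_fr can be sketched with the standard Holm and Benjamini-Hochberg procedures. This is a plain-Python illustration, not the library's holm_bonferroni or benjamini_hochberg. Three p-values are taken from the reports printed in this README; the code_completion value and the eight null_task_* entries are hypothetical fillers for the 12-task family:

```python
def holm_reject(pvals, alpha=0.05):
    """Holm step-down: compare the k-th smallest p to alpha / (m - k + 1)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for rank, i in enumerate(order):          # rank 0 is the smallest p
        if pvals[i] > alpha / (m - rank):
            break                             # stop at the first failure
        reject[i] = True
    return reject


def bh_reject(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up: reject the k smallest p, k = max rank with p <= k/m * alpha."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * alpha:
            k_max = rank
    reject = [False] * m
    for i in order[:k_max]:
        reject[i] = True
    return reject


pvals = {
    "math_word_problems": 2.712e-05,   # printed in the Sixty seconds example
    "code_completion": 4e-04,          # hypothetical value
    "translation_fr": 0.02739,         # printed in the demo report
    "table_qa": 0.4144,                # printed in the demo report
}
# Eight hypothetical null tasks to fill out the 12-task family.
for k, p in enumerate([0.12, 0.21, 0.33, 0.48, 0.57, 0.66, 0.78, 0.91], start=1):
    pvals[f"null_task_{k}"] = p

names, ps = list(pvals), list(pvals.values())
naive = [t for t, p in pvals.items() if p < 0.05]
holm = [t for t, r in zip(names, holm_reject(ps)) if r]
bh = [t for t, r in zip(names, bh_reject(ps)) if r]
print(len(naive), holm, bh)
```

With 12 tasks, the third-smallest p-value has to clear 0.05/10 = 0.005 under Holm and 3/12 × 0.05 = 0.0125 under Benjamini-Hochberg, so p = 0.027 fails both.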

The seed is fixed for reproducibility, not mined: across 40 seeds the unpaired test misses at least one real effect in ~90% of runs, and ~5% of null tasks per run clear uncorrected significance — exactly what α predicts.
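
The null-task rate is easy to check independently. The simulation below is a sketch under stated assumptions, not the demo's generator: null per-sample differences are drawn symmetrically from {-1, 0, +1}, and each task gets a normal-approximation paired test at α=0.05.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)
n_tasks, n = 4000, 500

# Null tasks: A and B disagree on ~10% of samples, with no direction preferred.
d = rng.choice([-1.0, 0.0, 1.0], size=(n_tasks, n), p=[0.05, 0.90, 0.05])

# Paired z statistic per task: mean difference over its standard error.
z = d.mean(axis=1) / (d.std(axis=1, ddof=1) / np.sqrt(n))
significant = np.abs(z) > NormalDist().inv_cdf(0.975)

false_positive_rate = float(significant.mean())

# Group the null tasks into 400 suites of 10 and ask how often any task "wins".
p_any = float(significant.reshape(-1, 10).any(axis=1).mean())

print(f"per-task false positives: {false_positive_rate:.3f}")
print(f"suites with at least one: {p_any:.3f}")
```

The per-task rate should land near α, and the per-suite rate near 1 − 0.95^10 ≈ 0.40, which is the gap that suite-level correction closes.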

Run it on your own files

examples/compare_runs.py is the reusable entry point — point it at any two per-sample score files for the same task:

python examples/compare_runs.py                      # bundled sample data (lm-eval-shaped)
python examples/compare_runs.py A.jsonl B.jsonl --metric acc            # lm-eval --log_samples
python examples/compare_runs.py a.eval b.eval --format inspect          # Inspect AI logs
python examples/compare_runs.py a.csv  b.csv  --format raw              # plain id,score files
python examples/compare_runs.py A.jsonl B.jsonl --n-trials 25 \
       --trial-deltas "0.02,-0.01,..."               # best-of-N selection correction

On the bundled sample pair (400 samples, a real +5.5-point effect):

== samples_gsm8k_modelA vs samples_gsm8k_modelB ==
  n=400  mean A=0.6325  mean B=0.5775  delta=+0.0550
  paired 95% CI [+0.0207, +0.0893]  p=0.001657  (paired SE 0.0175)
  BCa bootstrap CI [+0.0225, +0.0900]
  standardized delta=0.157  P(real)=0.999
  min detectable delta at n=400: 0.0490
  verdict: REAL at alpha=0.05: delta +0.0550
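
The interval, p-value, and minimum detectable delta in that report are consistent with the usual normal-approximation formulas applied to the paired SE. The sketch below recomputes them from the two printed inputs (delta and paired SE). paired_summary is a hypothetical helper, and the 80% power used for the minimum detectable delta is an assumption that happens to match the printed figure:

```python
from statistics import NormalDist

ND = NormalDist()


def paired_summary(delta, se, alpha=0.05, power=0.80):
    """Normal-approximation CI, two-sided p, and minimum detectable delta from a paired SE."""
    z_alpha = ND.inv_cdf(1 - alpha / 2)
    half_width = z_alpha * se
    return {
        "ci": (delta - half_width, delta + half_width),
        "p": 2 * ND.cdf(-abs(delta) / se),
        # Smallest true delta detected with the given power at this SE.
        "min_detectable_delta": (z_alpha + ND.inv_cdf(power)) * se,
    }


s = paired_summary(delta=0.0550, se=0.0175)
print(f"CI [{s['ci'][0]:+.4f}, {s['ci'][1]:+.4f}]")
print(f"min detectable delta {s['min_detectable_delta']:.4f}")
```

Because the printed SE is rounded to four decimals, the recomputed p-value agrees with the report only to about two significant digits.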

If you claim a best-of-N selection correction without supplying the other trials' deltas, the verdict says so explicitly ("UNCORRECTED for selection") rather than silently pretending — the library refuses to guess the trial variance.

What's in the box

API | What it gives you
paired_delta, align_paired | Paired CI + significance on per-sample differences (the correctness point), with strict id alignment
holm_bonferroni, benjamini_hochberg | Suite-level corrections (family-wise error / false discovery rate) with adjusted p-values
min_samples_for_delta, power_for_samples | Power analysis (textbook check: d/σ = 0.5 at 80% power ⇒ n = 32)
bootstrap_ci, bootstrap_delta_ci, percentile_stat | Percentile & BCa bootstrap CIs, incl. tail percentiles (p95 score, worst-decile delta)
probabilistic_delta, deflated_delta, expected_max_std_delta | Selection-bias-aware significance: "you tried 25 prompt variants and report the best; is the delta still real?"
variance_components, minimum_detectable_delta, red_flags | Noise diagnostics: clustered SE + design effect, the eval's noise floor, contamination red flags (identical runs, constant shifts, saturated benchmarks)
evaluate_comparison, reliability_report | One call from two runs (or a suite) to a decomposable verdict
deltagate.adapters | LMEvalHarnessAdapter, InspectLogAdapter, RawScoresAdapter, and a ScoresAdapter protocol for new frameworks
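
The textbook power check quoted in the table can be reproduced in a few lines. This is the standard normal-approximation sample-size formula for a two-sided test on paired differences, written as a sketch. n_for_effect and approx_power are hypothetical names, not the signatures of min_samples_for_delta or power_for_samples:

```python
import math
from statistics import NormalDist

ND = NormalDist()


def n_for_effect(std_effect, alpha=0.05, power=0.80):
    """Pairs needed to detect a standardized effect d/sigma with a two-sided z-test."""
    z_total = ND.inv_cdf(1 - alpha / 2) + ND.inv_cdf(power)
    return math.ceil((z_total / std_effect) ** 2)


def approx_power(std_effect, n, alpha=0.05):
    """Approximate power at n pairs (ignores the negligible opposite-tail term)."""
    return ND.cdf(std_effect * math.sqrt(n) - ND.inv_cdf(1 - alpha / 2))


print(n_for_effect(0.5))  # 32
print(round(approx_power(0.5, 32), 3))
```

The raw value is about 31.4 pairs, which rounds up to the n = 32 quoted in the table.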

Design choices worth knowing:

  • Paired everywhere. Comparison APIs take per-sample scores aligned by id, and align_paired refuses mismatched id sets rather than silently intersecting them (a mismatch usually means a broken run, not a choice).
  • BCa with tie-aware bias correction. Binary accuracy makes bootstrap distributions lumpy; the z0 estimate half-weights exact ties so discrete metrics don't pick up a spurious bias correction.
  • Deflation refuses to guess. deflated_delta with n_trials > 1 returns NaN unless you supply the trial variance/deltas — a silently-guessed selection correction would be worse than none.
  • Verdicts decompose. ComparisonReport exposes every number behind the verdict (paired stats, BCa bounds, P(real), minimum detectable delta, red flags) — "trust me" is the failure mode this library exists to end.
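
The tie-aware bias correction can be illustrated as follows. This is a sketch of the idea as described in the bullet above (count exact ties at half weight when estimating z0), not the library's implementation. bias_correction_z0 is a hypothetical name, and the clamping step is an added assumption to keep the inverse CDF finite:

```python
import numpy as np
from statistics import NormalDist


def bias_correction_z0(boot_stats, observed):
    """BCa z0 with exact ties counted at half weight."""
    boot = np.asarray(boot_stats, dtype=float)
    below = np.count_nonzero(boot < observed)
    ties = np.count_nonzero(boot == observed)
    frac = (below + 0.5 * ties) / boot.size
    # Clamp away from 0 and 1 so inv_cdf stays finite (assumption, not from the README).
    eps = 0.5 / boot.size
    frac = min(max(frac, eps), 1 - eps)
    return NormalDist().inv_cdf(frac)


# A lumpy, symmetric bootstrap distribution, as binary accuracy tends to produce.
boot = [0.04] * 300 + [0.06] * 400 + [0.08] * 300
z0_tie_aware = bias_correction_z0(boot, observed=0.06)

# The strict "fraction below" estimate ignores the 400 ties entirely.
z0_strict = NormalDist().inv_cdf(300 / 1000)

print(z0_tie_aware, round(z0_strict, 3))
```

The distribution is symmetric around the observed value, so the half-weighted estimate gives z0 = 0, while the strict count reports a bias of about -0.52 that exists only because of the ties.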

Statistical provenance

The math is ported from two bodies of prior, separately-verified work by the same author, not invented here:

  • the Inspect AI eval-reliability contribution — paired delta, Holm/Benjamini-Hochberg, power, variance components, each validated against hand computations and scipy.stats references in that work's test suite;
  • the edgegate trading-validation library — the Probabilistic/Deflated Sharpe Ratio machinery (Bailey & López de Prado, with the full skew/kurtosis correction) and the normal inverse-CDF, here adapted from return series to standardized score deltas.
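
For readers unfamiliar with the Probabilistic/Deflated Sharpe Ratio, the sketch below shows the published Bailey & López de Prado formulas with a standardized delta standing in for the Sharpe ratio. It is an illustration under assumptions, not deltagate's code: the function names are hypothetical, the skew and kurtosis defaults are the normal-distribution values, and trial_std = 0.05 is a made-up spread across 25 trials.

```python
import math
from statistics import NormalDist

ND = NormalDist()
EULER_GAMMA = 0.5772156649015329


def expected_max_std(n_trials, trial_std):
    """Approximate expected maximum of n_trials null standardized deltas."""
    if n_trials < 2:
        return 0.0
    return trial_std * (
        (1 - EULER_GAMMA) * ND.inv_cdf(1 - 1 / n_trials)
        + EULER_GAMMA * ND.inv_cdf(1 - 1 / (n_trials * math.e))
    )


def prob_exceeds(std_delta, n, benchmark=0.0, skew=0.0, kurt=3.0):
    """PSR-style probability that the true standardized delta exceeds the benchmark."""
    denom = math.sqrt(1 - skew * std_delta + (kurt - 1) / 4 * std_delta**2)
    return ND.cdf((std_delta - benchmark) * math.sqrt(n - 1) / denom)


std_delta, n = 0.157, 400                  # values printed for the bundled sample pair
psr = prob_exceeds(std_delta, n)           # benchmark 0: no selection

# Deflated version: the benchmark is what the best of 25 null trials would reach.
hurdle = expected_max_std(n_trials=25, trial_std=0.05)   # trial_std is hypothetical
dsr = prob_exceeds(std_delta, n, benchmark=hurdle)

print(f"PSR={psr:.3f}  hurdle={hurdle:.3f}  DSR={dsr:.3f}")
```

The hurdle depends directly on the spread across trials, which is why a selection correction cannot be computed from n_trials alone.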

This package's own 40-test suite re-asserts the same reference numbers: the hand-computed paired case (δ=0.75, SE=0.25, p=2Φ(−3)), the Holm/BH reference p-sets, the textbook power n=32, BCa-vs-normal agreement on symmetric data, BCa tail-percentile coverage, and DSR ≤ PSR monotonicity.

Honest limits: intervals and p-values use large-sample normal approximations (fine at eval scale, n ≳ 50 pairs); per-sample scores are assumed exchangeable within a task (use variance_components' cluster support when they aren't); and red flags are heuristics to investigate, not verdicts.
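
When samples come in groups (several questions per passage, several variants per prompt), exchangeability fails and the naive SE is too small. The sketch below uses one common cluster-robust estimator for the SE of a mean difference. Whether variance_components uses this exact form is not stated here, so treat clustered_se as a hypothetical stand-in; the data are synthetic.

```python
import numpy as np


def clustered_se(diffs, clusters):
    """Cluster-robust SE of the mean of diffs (sums residuals within each cluster)."""
    d = np.asarray(diffs, dtype=float)
    _, inv = np.unique(np.asarray(clusters), return_inverse=True)
    inv = inv.ravel()
    n_clusters = int(inv.max()) + 1
    resid = d - d.mean()
    cluster_sums = np.bincount(inv, weights=resid, minlength=n_clusters)
    var = n_clusters / (n_clusters - 1) * np.sum(cluster_sums**2) / d.size**2
    return float(np.sqrt(var))


# Synthetic differences: 50 clusters of 10 samples sharing a cluster-level shift.
rng = np.random.default_rng(0)
n_clusters, per_cluster = 50, 10
clusters = np.repeat(np.arange(n_clusters), per_cluster)
cluster_shift = rng.normal(0.0, 0.5, size=n_clusters)[clusters]
diffs = cluster_shift + rng.normal(0.0, 0.5, size=clusters.size)

naive_se = diffs.std(ddof=1) / np.sqrt(diffs.size)
design_effect = (clustered_se(diffs, clusters) / naive_se) ** 2

print(f"naive SE={naive_se:.4f}  design effect={design_effect:.2f}")
```

A design effect well above 1 means the effective sample size is much smaller than the row count, and intervals built on the naive SE are too narrow.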

Development

uv venv && uv pip install -e ".[dev]"
pytest -q                              # 40 tests
ruff check . && ruff format --check .
python examples/demo.py

Publishing

The package is PyPI-ready (python -m build produces a wheel + sdist that pass twine check; the name deltagate was available at the time of writing). Publishing requires the maintainer's PyPI token:

python -m build && twine upload dist/*

MIT license.


deltagate is part of a statistical-rigor-for-AI-evals toolkit: agentrel (reliability stats for stochastic agent evals), calibstats (calibration metrics with confidence intervals), leaderboard-ci (leaderboard re-ranking with CIs and tie bands). Full portfolio: github.com/yongzhe2160cs.

