An independent significance referee for LLM & agent evals — is your improvement real, or noise?

holdout

An independent significance referee for LLM & agent evals. Is your improvement real — or noise, multiple-comparisons inflation, or a model that quietly memorized your test set?

Most eval "wins" don't survive a paired significance test. holdout runs the three checks your eval dashboard skips, in your code or in CI:

  1. Is it signal? A paired test (exact McNemar for pass/fail, paired permutation for graded scores) with a real confidence interval — not a naked delta.
  2. Or did you just try a lot of things? The bar rises with how many variants you tried. The max of 37 noisy attempts is expected to look like a win.
  3. What would change the verdict? Power analysis: how many tasks you'd actually need.
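
For pass/fail scores, the paired test in step 1 only looks at the tasks where the two systems disagree. The following is a minimal sketch of a two-sided exact McNemar test, not the library's own code; the helper name `mcnemar_exact` and its arguments are illustrative. It uses the paired counts from the CLI example further down this page (15 fixed, 7 broke).

```python
from math import comb


def mcnemar_exact(fixed, broke):
    """Two-sided exact McNemar p-value from the two discordant-pair counts.

    fixed: tasks the candidate passes and the baseline fails
    broke: tasks the baseline passes and the candidate fails
    Under the null hypothesis each discordant task is a fair coin flip.
    """
    n = fixed + broke
    if n == 0:
        return 1.0  # no disagreements, so no evidence either way
    k = min(fixed, broke)
    # Probability of a split at least this lopsided in one direction...
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    # ...doubled for a two-sided test and capped at 1.
    return min(1.0, 2 * tail)


print(round(mcnemar_exact(15, 7), 3))  # 0.134
```

Tasks that both systems pass, or both fail, do not enter the calculation at all, which is why a 200-task eval can carry only 22 tasks' worth of evidence.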

The stats are open source (this repo). The hosted service (holdout.dev) adds the parts code can't promise: independence, a write-once holdout you can't re-tune against, a contamination scan, and a verifiable badge.

Install

pip install holdout-evals      # PyPI package name; the module is still imported as `holdout`

Quickstart — Python

from holdout import compare

# per-task scores for the SAME tasks, in the same order (0/1 for pass-fail, or floats)
res = compare(baseline_scores, candidate_scores, variants_tried=37)

print(res.report())
print(res.significant)        # False — gate on this
print(res.p_value, res.ci)    # the honest numbers
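
When the scores are floats rather than 0/1, the test named above is a paired permutation test. One common form is the sign-flip test sketched here. This is an illustration under assumptions, not the implementation behind `compare`: the function name, the resample count, and the seed are all hypothetical choices.

```python
import random


def paired_permutation_p(baseline, candidate, n_resamples=10_000, seed=0):
    """Two-sided sign-flip permutation p-value for paired per-task scores."""
    if len(baseline) != len(candidate):
        raise ValueError("score lists must cover the same tasks in the same order")
    diffs = [c - b for b, c in zip(baseline, candidate)]
    observed = abs(sum(diffs))
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        # Under the null, swapping the two systems on any task is equally
        # likely, so each per-task difference keeps or flips its sign.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped) >= observed - 1e-12:
            hits += 1
    # Counting the observed arrangement itself keeps the estimate above zero.
    return (hits + 1) / (n_resamples + 1)


# Hypothetical graded scores for six tasks, same order in both lists.
baseline_scores = [0.2, 0.5, 0.9, 0.4, 0.7, 0.3]
candidate_scores = [0.3, 0.5, 0.8, 0.6, 0.7, 0.4]
print(paired_permutation_p(baseline_scores, candidate_scores))
```

Because the test works on per-task differences, it only makes sense when both lists are aligned task by task, which is the requirement stated in the quickstart comment.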

Quickstart — CLI (drop it in CI)

python examples/make_example.py     # writes a +4-point "win" that is actually noise

holdout check examples/v2.jsonl --baseline examples/v1.jsonl --variants 37
  Holdout - significance check                                  [FAIL]
  baseline 73.0%  ->  candidate 77.0%   (n = 200 tasks)
  effect          +4.0 pts        95% CI [-0.5, +8.5]
  test            mcnemar_exact   p = 0.134
  variants tried  37   ->  adjusted p = 1.000   (any-false-win risk 85%)
  paired counts   +15 fixed / -7 broke (net +8)

  VERDICT: WITHIN NOISE - not statistically significant.
  -> Don't ship on this alone; the gain is indistinguishable from sampling noise.
     You'd need ~967 tasks for an effect this size to be detectable.
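
The "variants tried" line in the report above can be reproduced by hand. The sketch below assumes a Bonferroni-style adjustment and independent attempts at a 5% threshold; those assumptions match the printed numbers, but the tool's actual correction method is not stated here, and both function names are illustrative.

```python
def bonferroni(p_value, variants_tried):
    """Bonferroni-adjusted p-value: multiply by the number of attempts, cap at 1."""
    return min(1.0, p_value * variants_tried)


def any_false_win_risk(variants_tried, alpha=0.05):
    """Chance that at least one of several independent null attempts
    clears the unadjusted threshold alpha by luck alone."""
    return 1.0 - (1.0 - alpha) ** variants_tried


print(f"{bonferroni(0.134, 37):.3f}")    # 1.000
print(f"{any_false_win_risk(37):.0%}")   # 85%
```

With 37 attempts and no real effect anywhere, an unadjusted p < 0.05 somewhere is the expected outcome rather than a surprise.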

holdout check exits non-zero when the improvement isn't a significant gain — so it blocks a "ship the noise" merge. As a GitHub Action:

- run: holdout check evals/candidate.jsonl --baseline evals/baseline.jsonl --variants ${{ env.N_VARIANTS }}

Input is JSONL of { "task_id": ..., "score": ... } (also accepts correct/pass/reward; booleans and 0/1 become 0.0/1.0). One file per system, joined on task_id — or a single --paired file with baseline and candidate columns.
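
As an illustration of that input shape, the sketch below writes two small JSONL files with made-up records, reads them back, and joins them on task_id into aligned score lists. The helpers `load_scores` and `join_on_task_id` are hypothetical and stand in for whatever the CLI does internally.

```python
import json
import tempfile
from pathlib import Path

# Score field names the README says are accepted.
SCORE_KEYS = ("score", "correct", "pass", "reward")


def load_scores(path):
    """Read a JSONL file into a {task_id: float score} mapping."""
    scores = {}
    for line in Path(path).read_text().splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        key = next(k for k in SCORE_KEYS if k in record)
        scores[record["task_id"]] = float(record[key])  # True/False become 1.0/0.0
    return scores


def join_on_task_id(baseline, candidate):
    """Return two score lists over the shared task_ids, in one fixed order."""
    shared = sorted(set(baseline) & set(candidate))
    return [baseline[t] for t in shared], [candidate[t] for t in shared]


# Made-up records; note the two files list the tasks in different orders.
with tempfile.TemporaryDirectory() as tmp:
    v1 = Path(tmp) / "v1.jsonl"
    v2 = Path(tmp) / "v2.jsonl"
    v1.write_text('{"task_id": "t1", "score": 1}\n{"task_id": "t2", "score": 0}\n')
    v2.write_text('{"task_id": "t2", "pass": true}\n{"task_id": "t1", "score": 0.5}\n')
    base, cand = join_on_task_id(load_scores(v1), load_scores(v2))

print(base, cand)  # [1.0, 0.0] [0.5, 1.0]
```

Joining on task_id rather than relying on file order is what keeps the comparison paired when the two runs emit tasks in different sequences.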

How many tasks do I need?

holdout power --baseline-acc 0.75 --effect 0.03 --variants 37
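
For a rough sense of scale, a back-of-envelope estimate is sketched below. It is not the formula behind `holdout power`: it assumes a normal approximation, a Bonferroni-style threshold of alpha divided by the number of variants, 80% power, and uncorrelated outcomes between the two systems. That last assumption overstates the requirement when the systems tend to pass the same tasks, as paired evals usually do. The function name and defaults are illustrative.

```python
from math import ceil
from statistics import NormalDist


def tasks_needed(baseline_acc, effect, variants_tried=1, alpha=0.05, power=0.80):
    """Approximate task count needed to detect an accuracy gain of `effect`."""
    candidate_acc = baseline_acc + effect
    adjusted_alpha = alpha / variants_tried  # assumed Bonferroni-style bar
    z_alpha = NormalDist().inv_cdf(1 - adjusted_alpha / 2)  # two-sided threshold
    z_power = NormalDist().inv_cdf(power)
    # Variance of the per-task difference if the two outcomes were independent.
    variance = (baseline_acc * (1 - baseline_acc)
                + candidate_acc * (1 - candidate_acc))
    return ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


print(tasks_needed(0.75, 0.03))                     # a single pre-planned comparison
print(tasks_needed(0.75, 0.03, variants_tried=37))  # same effect after 37 attempts
```

The second number is larger than the first because a stricter threshold needs more data to clear, which is the practical cost of trying many variants.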

Why not just compute it yourself?

You can — that's why the math is free. The point of the hosted service is the four things a local script can't credibly promise: an independent verdict (we didn't build the agent), a write-once holdout scored exactly once per config (no quiet re-tuning), a variants bar that spans your whole team's submissions, and a verifiable badge.

Reading

The methodology follows the published literature on eval rigor — paired tests (Dietterich 1998), multiple-comparisons control (Benjamini–Hochberg 1995), benchmark contamination (Zhang et al. 2024, GSM1k), and power for evals (Miller 2024, Adding Error Bars to Evals).

MIT licensed. Contributions and corrections welcome — that's the whole point.
