An independent significance referee for LLM & agent evals — is your improvement real, or noise?

Project description

holdout

An independent significance referee for LLM & agent evals. Is your improvement real — or noise, multiple-comparisons inflation, or a model that quietly memorized your test set?

Most eval "wins" don't survive a paired significance test. holdout runs the three checks your eval dashboard skips, in your code or in CI:

Is it signal? A paired test (exact McNemar for pass/fail, paired permutation for graded scores) with a real confidence interval — not a naked delta.
Or did you just try a lot of things? The bar rises with how many variants you tried. The max of 37 noisy attempts is expected to look like a win.
What would change the verdict? Power analysis: how many tasks you'd actually need.
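For graded (non-binary) scores, the first check names a paired permutation test. The sketch below shows one common form of that idea, a Monte Carlo sign-flip test on per-task differences, using only the standard library. The function name `paired_permutation_p`, the resample count, and the add-one correction are assumptions made for illustration; holdout's own implementation may differ.

```python
import random


def paired_permutation_p(baseline, candidate, n_resamples=10_000, seed=0):
    """Two-sided sign-flip permutation test on per-task score differences.

    Hypothetical helper for illustration; not part of the holdout API.
    """
    if len(baseline) != len(candidate):
        raise ValueError("scores must be paired: same tasks, same order")
    diffs = [c - b for b, c in zip(baseline, candidate)]
    observed = abs(sum(diffs))
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        # Under the null, each task's difference is equally likely to
        # carry either sign, so flip each sign with probability 1/2.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the Monte Carlo p-value strictly positive.
    return (hits + 1) / (n_resamples + 1)
```

Because the test works on within-task differences, tasks where both systems score the same contribute nothing, which is what makes the paired design more sensitive than comparing two independent averages.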

The stats are open source (this repo). The hosted service (holdout.dev) adds the parts code can't promise: independence, a write-once holdout you can't re-tune against, a contamination scan, and a verifiable badge.

Install

pip install holdout-evals      # the import name is still `import holdout`

Quickstart — Python

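from holdout import compare

# Per-task scores for the SAME tasks, in the same order (0/1 for pass-fail, or floats).
# Illustrative data mirroring the CLI example below: 200 tasks, 15 fixed, 7 broke.
baseline_scores = [1] * 146 + [0] * 54
candidate_scores = [1] * 139 + [0] * 7 + [1] * 15 + [0] * 39
res = compare(baseline_scores, candidate_scores, variants_tried=37)

print(res.report())
print(res.significant)        # False: gate on this
print(res.p_value, res.ci)    # the honest numbers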

Quickstart — CLI (drop it in CI)

python examples/make_example.py     # writes a +4-point "win" that is actually noise

holdout check examples/v2.jsonl --baseline examples/v1.jsonl --variants 37

  Holdout - significance check                                  [FAIL]
  baseline 73.0%  ->  candidate 77.0%   (n = 200 tasks)
  effect          +4.0 pts        95% CI [-0.5, +8.5]
  test            mcnemar_exact   p = 0.134
  variants tried  37   ->  adjusted p = 1.000   (any-false-win risk 85%)
  paired counts   +15 fixed / -7 broke (net +8)

  VERDICT: WITHIN NOISE - not statistically significant.
  -> Don't ship on this alone; the gain is indistinguishable from sampling noise.
     You'd need ~967 tasks for an effect this size to be detectable.
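Several figures in the report above can be checked by hand from the paired counts. The sketch below computes a two-sided exact McNemar p-value from the +15 / -7 discordant pairs, then applies two textbook multiple-comparison calculations for 37 variants. Treating "any-false-win risk" as the family-wise error rate at alpha = 0.05 for independent tests, and the adjusted p as a Bonferroni correction, are assumptions: they happen to reproduce the printed values, but the source does not state which correction holdout applies. The function names are hypothetical.

```python
from math import comb


def mcnemar_exact_p(fixed, broke):
    """Two-sided exact McNemar p-value from the discordant pair counts."""
    n = fixed + broke
    if n == 0:
        return 1.0  # no discordant pairs: no evidence either way
    k = min(fixed, broke)
    # Binomial(n, 1/2) tail probability of seeing k or fewer in the rarer cell.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def any_false_win_risk(variants, alpha=0.05):
    """Chance that at least one of `variants` independent null tests passes alpha."""
    return 1 - (1 - alpha) ** variants


def bonferroni(p, variants):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p * variants)


p = mcnemar_exact_p(fixed=15, broke=7)
print(round(p, 3))                       # 0.134
print(round(any_false_win_risk(37), 2))  # 0.85
print(bonferroni(p, 37))                 # 1.0
```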

holdout check exits non-zero when the improvement isn't a significant gain — so it blocks a "ship the noise" merge. As a GitHub Action:

- run: holdout check evals/candidate.jsonl --baseline evals/baseline.jsonl --variants ${{ env.N_VARIANTS }}

Input is JSONL of { "task_id": ..., "score": ... } (also accepts correct/pass/reward; booleans and 0/1 become 0.0/1.0). One file per system, joined on task_id — or a single --paired file with baseline and candidate columns.
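To make the input format concrete, here is a minimal sketch of reading two such JSONL files and pairing them on task_id. The helper names, the precedence among the accepted score keys, the assumption that task ids are strings, and the choice to drop tasks missing from either file are all illustrative assumptions; the source does not say how holdout handles those cases.

```python
import json

# Accepted score field names, per the description above. The lookup
# order is an assumption made for this sketch.
SCORE_KEYS = ("score", "correct", "pass", "reward")


def parse_scores(lines):
    """Map task_id -> float score from an iterable of JSONL lines."""
    scores = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        for key in SCORE_KEYS:
            if key in row:
                # float() turns True/False and 0/1 into 1.0/0.0.
                scores[row["task_id"]] = float(row[key])
                break
        else:
            raise KeyError(f"no score field in row for task {row.get('task_id')!r}")
    return scores


def pair_by_task(baseline, candidate):
    """Return two score lists aligned on the task ids present in both inputs."""
    shared = sorted(baseline.keys() & candidate.keys())
    return [baseline[t] for t in shared], [candidate[t] for t in shared]


base = parse_scores(['{"task_id": "a", "pass": true}', '{"task_id": "b", "score": 0}'])
cand = parse_scores(['{"task_id": "b", "score": 1}', '{"task_id": "a", "correct": false}'])
print(pair_by_task(base, cand))
```

Joining on task_id rather than relying on file order is what keeps the comparison paired even when the two runs wrote their results in different orders.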

How many tasks do I need?

holdout power --baseline-acc 0.75 --effect 0.03 --variants 37
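For intuition about what drives the required task count, the sketch below uses a standard normal-approximation sample-size formula for McNemar's test (Connor, 1987), with a Bonferroni-adjusted alpha for the number of variants. This is not necessarily how `holdout power` computes its answer, so the numbers need not agree with the tool. The function name `tasks_needed` is hypothetical, and the `discordant_rate` input (the fraction of tasks where the two systems disagree) is an assumption that does not correspond to any CLI flag shown above.

```python
from math import ceil, sqrt
from statistics import NormalDist


def tasks_needed(discordant_rate, effect, variants=1, alpha=0.05, power=0.8):
    """Approximate number of paired tasks to detect `effect` (as a proportion).

    discordant_rate: fraction of tasks where the two systems disagree.
    effect: candidate accuracy minus baseline accuracy.
    """
    if effect == 0 or discordant_rate < abs(effect):
        raise ValueError("need a nonzero effect no larger than the discordant rate")
    z = NormalDist().inv_cdf
    z_alpha = z(1 - (alpha / variants) / 2)  # two-sided, Bonferroni-adjusted
    z_beta = z(power)
    root = (z_alpha * sqrt(discordant_rate)
            + z_beta * sqrt(discordant_rate - effect ** 2))
    return ceil((root / effect) ** 2)


# Same effect and disagreement rate, with and without a 37-variant penalty.
print(tasks_needed(0.11, 0.04))
print(tasks_needed(0.11, 0.04, variants=37))
```

The required count grows roughly with the inverse square of the effect, so halving the effect you want to detect roughly quadruples the tasks needed, and a larger variant count raises the bar further.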

Why not just compute it yourself?

You can — that's why the math is free. The point of the hosted service is the four things a local script can't credibly promise: an independent verdict (we didn't build the agent), a write-once holdout scored exactly once per config (no quiet re-tuning), a variants bar that spans your whole team's submissions, and a verifiable badge.

Reading

The methodology follows the published literature on eval rigor — paired tests (Dietterich 1998), multiple-comparisons control (Benjamini–Hochberg 1995), benchmark contamination (Zhang et al. 2024, GSM1k), and power for evals (Miller 2024, Adding Error Bars to Evals).

MIT licensed. Contributions and corrections welcome — that's the whole point.

Project details

This version

0.1.0

Jun 24, 2026

Download files

Source Distribution

holdout_evals-0.1.0.tar.gz (14.2 kB view details)

Uploaded Jun 24, 2026 Source

Built Distribution

holdout_evals-0.1.0-py3-none-any.whl (13.5 kB view details)

Uploaded Jun 24, 2026 Python 3

File details

Details for the file holdout_evals-0.1.0.tar.gz.

File metadata

Download URL: holdout_evals-0.1.0.tar.gz
Upload date: Jun 24, 2026
Size: 14.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for holdout_evals-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`e45854c09fa1bed156261efe8827a3e2db5388c5ec3f0699e7afea9f6d8f3642`
MD5	`5cb9e7ab59b1761c8230dabb511ffc94`
BLAKE2b-256	`1e8f5c48ae194c36c1f7d350d03244b04fdbca9f4a970960b083bccd49429ace`

File details

Details for the file holdout_evals-0.1.0-py3-none-any.whl.

File metadata

Download URL: holdout_evals-0.1.0-py3-none-any.whl
Upload date: Jun 24, 2026
Size: 13.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for holdout_evals-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2daa8bb5eefe751cf5a550cd03470e5e77db86cfcd7611718dbac006f1d250e0`
MD5	`3fc9b4ce3aa5b34bda7dd590d16ec78a`
BLAKE2b-256	`eacac78aeecbb8181a933203ffca9f065c6656ab4e492d5594f33d6c1c41ffab`
