The vendor-neutral significance gate for AI evals. Most eval tools show a number moving; siggate tells you — in CI — whether the move is statistically real or within noise.

significance-gate (`siggate`)

The vendor-neutral significance gate for AI evals. Most eval tools show a number moving; none tell you whether the move is real. siggate does — in CI.

You bump a prompt, swap a checkpoint, change a decoding config. The eval moves +1.8%. Ship it? Every dashboard, harness, and leaderboard will happily render that +1.8% as a green number. None of them answer the only question that matters:

Is +1.8% a real improvement, or is it noise?

siggate answers it, on every eval run, as a CI check:

🟡 UNDERPOWERED: +1.8% is within noise (p=0.34); you'd need ~420 more samples to call it.

Three verdicts, one line each:

	verdict	meaning	what to do
🟢	REAL	the change survives the statistics	ship it
⚪	NOISE	you had the power to find an effect; there isn't one	don't ship; it's not a change
🟡	UNDERPOWERED	you can't tell yet at this sample size	collect N more samples (siggate computes N)

It runs as a CLI, a CI exit-code gate, and a GitHub Action that posts the verdict as a PR comment. It is framework-neutral: adapters for lm-evaluation-harness (--log_samples), Inspect AI (.eval logs), and a generic CSV/JSON of per-item scores.
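
For the generic CSV case, pairing comes down to matching per-item scores by id. The sketch below is a minimal illustration using only the standard library, not siggate's adapter code: `read_scores` and `pair_scores` are hypothetical helper names, and the `id,score` column layout is taken from the CLI example in this README. It raises on mismatched ids rather than quietly intersecting the two runs.

```python
import csv
import io


def read_scores(text: str) -> dict[str, float]:
    """Parse CSV text with `id,score` columns into {id: score}."""
    rows = csv.DictReader(io.StringIO(text))
    return {row["id"]: float(row["score"]) for row in rows}


def pair_scores(baseline: dict[str, float], candidate: dict[str, float]):
    """Return (baseline, candidate) score lists aligned by id.

    Raises ValueError instead of silently intersecting mismatched runs.
    """
    if baseline.keys() != candidate.keys():
        mismatched = sorted(baseline.keys() ^ candidate.keys())
        raise ValueError(f"runs do not share the same ids: {mismatched}")
    ids = sorted(baseline)
    return [baseline[i] for i in ids], [candidate[i] for i in ids]


# Hypothetical inline data; real runs would be read from a.csv and b.csv.
base = read_scores("id,score\nq1,1\nq2,0\nq3,1\n")
cand = read_scores("id,score\nq3,1\nq1,1\nq2,1\n")  # same ids, different order
base_scores, cand_scores = pair_scores(base, cand)
```

Row order in the two files does not matter here because alignment is by id, which is the property a paired analysis depends on.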

The moat: it doesn't reinvent the statistics

The hard part of this is the statistics, and they are already built and tested. siggate is a thin productization layer over deltagate — the eval-reliability toolkit, which generalizes the paired/multiplicity/power helpers the author contributed to Inspect AI. deltagate supplies, verbatim:

the paired-delta confidence interval (the correct paired analysis on shared samples — not the unpaired test everyone reaches for),
Holm-Bonferroni / Benjamini-Hochberg multiplicity correction across a suite,
power / minimum-sample math (the "N more samples" number), and
the selection-bias-aware "is this real?" probability (deflated, for best-of-N).

siggate wraps that decomposable report in a single labelled verdict, a one-line human summary, exit codes, a PR comment, and a config file. The stats are not reimplemented here — see gate.py.
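
To see why the paired analysis matters, here is a small sketch comparing a paired and an unpaired standard error on the same scores. It is not deltagate's implementation: it uses a plain normal approximation, and the twelve per-item scores are made up for illustration.

```python
import math
import statistics

# Made-up per-item correctness (1 = correct) for the same 12 items in both runs.
baseline = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0]
candidate = [1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
n = len(baseline)

# Paired analysis: work on per-item differences, so shared item difficulty cancels.
diffs = [c - b for b, c in zip(baseline, candidate)]
delta = statistics.mean(diffs)
paired_se = math.sqrt(statistics.variance(diffs) / n)

# Unpaired analysis: treats the two runs as independent samples.
unpaired_se = math.sqrt(
    (statistics.variance(baseline) + statistics.variance(candidate)) / n
)

# Normal-approximation 95% intervals around the same delta.
z = statistics.NormalDist().inv_cdf(0.975)
paired_ci = (delta - z * paired_se, delta + z * paired_se)
unpaired_ci = (delta - z * unpaired_se, delta + z * unpaired_se)

print(f"delta={delta:+.3f}  paired SE={paired_se:.3f}  unpaired SE={unpaired_se:.3f}")
```

Because both runs tend to miss the same items, the per-item differences vary far less than the two score lists do separately, so the paired interval is the narrower one.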

Install

pip install git+https://github.com/yongzhe2160cs/significance-gate
# brings in deltagate (the stats engine) automatically.
# from a clone:  pip install -e ".[dev]"

CLI in sixty seconds

# Two lm-eval --log_samples files (auto-detected by extension):
siggate compare baseline.jsonl candidate.jsonl --metric acc

# Inspect logs / raw scores:
siggate compare run_a.eval run_b.eval --adapter inspect --metric match
siggate compare a.csv b.csv            # id,score columns

# A whole suite (two directories, paired by filename) with multiplicity control:
siggate compare runs_baseline/ runs_candidate/ --correction holm

🟡 underpowered: UNDERPOWERED
   baseline 66.11%  candidate 69.44%  delta +3.33pp  (n=180)
   paired CI [-3.73pp, +10.39pp]  p=0.3547  P(real)=0.82
   min detectable @ n=180: 10.09%   -> ~1,470 more samples for an effect this size
   => 🟡 UNDERPOWERED: +3.33pp is within noise (p=0.355); you'd need ~1,470 more samples (1,650 total) to call it at 80% power.

The exit code reflects the gate, so this is your CI check:

siggate compare baseline.jsonl candidate.jsonl   # exit 0 = passed, 1 = blocked, 2 = error

By default the gate fails (exit 1) on an improvement that is not statistically supported (unsupported-improvement), on a real regression (regression), or on an invalid comparison (invalid). Tune it in siggate.toml or with --fail-on (see Configuration).

--format {text,json,markdown,github} switches output; --output comment.md writes the rendered PR comment to a file.

GitHub Action quickstart

Add .github/workflows/significance-gate.yml (full sample here):

permissions:
  contents: read
  pull-requests: write   # so the Action can post the PR comment

jobs:
  significance-gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # ... your eval step produces baseline + candidate per-sample scores ...
      - uses: yongzhe2160cs/significance-gate@main
        with:
          baseline: eval-out/baseline.jsonl
          candidate: eval-out/candidate.jsonl
          adapter: lm-eval
          metric: acc
          fail-on: unsupported-improvement,regression,invalid
          github-token: ${{ secrets.GITHUB_TOKEN }}   # standard token; nothing hardcoded

On every PR the Action runs the gate, posts (and updates) a single comment, writes the verdict to the job summary, and fails the check when the gate trips:

🟡 Significance Gate — UNDERPOWERED

arc_challenge — +3.33pp is within noise (p=0.355); you'd need ~1,470 more samples (1,650 total) to call it at 80% power.

🟡 Underpowered. This delta is below what n=180 can resolve. You'd need ~1,470 more samples (1,650 total) to detect an effect this size at 80% power.

metric	value
delta (candidate − baseline)	+3.33pp
baseline → candidate	66.11% → 69.44%
95% CI	[-3.73pp, +10.39pp]
p-value	0.355
samples (n)	180
min detectable @ n=180	10.09%

It finds its prior comment by a hidden marker and edits in place, so a PR gets one living verdict instead of a pile of bot comments. The token is the standard ${{ secrets.GITHUB_TOKEN }} — no secrets are baked into the Action.
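
The find-and-edit behaviour can be pictured with the sketch below. It is an assumption-laden illustration, not the Action's code: the marker string and the `plan_upsert` helper are hypothetical, and comments are modelled as plain dicts with the `id` and `body` fields that GitHub's issue-comment API returns.

```python
# Hypothetical hidden marker; the real Action's marker text is not documented here.
MARKER = "<!-- siggate-verdict -->"


def plan_upsert(existing_comments, verdict_markdown):
    """Decide whether to update a prior verdict comment or create a new one."""
    body = f"{MARKER}\n{verdict_markdown}"
    for comment in existing_comments:
        if MARKER in comment["body"]:
            # A previous verdict exists: edit it in place.
            return {"action": "update", "comment_id": comment["id"], "body": body}
    # No previous verdict on this PR: post a fresh comment.
    return {"action": "create", "comment_id": None, "body": body}


comments = [
    {"id": 11, "body": "LGTM"},
    {"id": 12, "body": f"{MARKER}\nold verdict"},
]
plan = plan_upsert(comments, "UNDERPOWERED: +3.33pp is within noise")
first_run = plan_upsert([], "REAL: candidate +3.67pp vs baseline")
```

An HTML comment works as a marker because it is invisible in the rendered PR comment but still present in the comment body.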

The three-way demo (the whole pitch)

python examples/make_demo_data.py   # regenerate the committed synthetic data
python examples/demo.py             # prints CLI output + PR comment for all cases

Bundled, realistic, lm-eval-shaped synthetic data drives three single-comparison cases and one suite. The data is paired (a shared per-item difficulty makes both models miss the same hard questions) — exactly the correlation a naive unpaired eyeball gets wrong.

(a) A real effect — 🟢 REAL (gate: PASS, exit 0)

🟢 REAL: candidate +3.67pp vs baseline (p=0.000666, 95% CI [+1.56pp, +5.78pp]) — a real change at α=0.05.

(b) A noise delta — ⚪ NOISE (gate: FAIL, exit 1)

⚪ NOISE: +0.73pp is within noise (p=0.488); n=1,500 was enough to detect a 5.00% effect and none is there.

(c) An underpowered case — 🟡 UNDERPOWERED (gate: FAIL, exit 1)

🟡 UNDERPOWERED: +3.33pp is within noise (p=0.355); you'd need ~1,470 more samples (1,650 total) to call it at 80% power.

(d) A suite of 4 tasks — multiplicity correction in action. Naively two tasks look like wins; after Holm correction only one survives:

suite: 4 tasks  alpha=0.05  correction=holm
   naive per-task wins : 2 ['hellaswag', 'mmlu']
   survive correction : 1 ['mmlu']

task	delta	p	adj p	verdict
`arc_challenge`	-1.67pp	0.091	0.181	⚪ NOISE
`gsm8k`	+0.42pp	0.731	0.731	⚪ NOISE
`hellaswag`	+1.92pp	0.029	0.086	⚪ NOISE
`mmlu`	+6.58pp	3.2e-08	1.3e-07	🟢 REAL

hellaswag clears a naive p<0.05 — and would be reported as a win on any dashboard — but doesn't survive correcting for four simultaneous comparisons. That single distinction is most of what siggate is for.
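
The Holm step-down adjustment behind that table can be sketched in a few lines. This is a generic textbook version, not deltagate's code; `holm_adjust` is a hypothetical name, and the inputs are the rounded p-values from the table above, so the adjusted values can differ from the printed ones in the last digit.

```python
def holm_adjust(p_values: dict[str, float]) -> dict[str, float]:
    """Holm-Bonferroni adjusted p-values (step-down, monotone, capped at 1)."""
    m = len(p_values)
    ordered = sorted(p_values.items(), key=lambda kv: kv[1])
    adjusted = {}
    running_max = 0.0
    for rank, (name, p) in enumerate(ordered):
        # The smallest p is multiplied by m, the next by m - 1, and so on.
        adj = min(1.0, (m - rank) * p)
        # Adjusted p-values must not decrease along the ordering.
        running_max = max(running_max, adj)
        adjusted[name] = running_max
    return adjusted


alpha = 0.05
raw = {"arc_challenge": 0.091, "gsm8k": 0.731, "hellaswag": 0.029, "mmlu": 3.2e-08}
adj = holm_adjust(raw)

naive = sorted(task for task, p in raw.items() if p < alpha)
survivors = sorted(task for task, p in adj.items() if p < alpha)
print("naive:", naive)
print("survive correction:", survivors)
```

With four tasks, hellaswag's p of 0.029 is multiplied by 3, which moves it above 0.05, while mmlu's p is small enough to survive the factor of 4.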

How a verdict is decided

REAL          significant (paired p < α, bootstrap CI excludes 0, and — if
              best-of-N — the selection-deflated probability clears 1-α);
              in a suite, also survives Holm/BH correction.
NOISE         not significant, AND the eval was powered to find a meaningful
              effect (min-detectable ≤ mei) — you looked, there's nothing there.
UNDERPOWERED  not significant, AND the eval could NOT resolve a meaningful
              effect (min-detectable > mei) — collect ~N more samples.
INVALID       a degenerate comparison (e.g. the two runs are byte-identical).

The NOISE-vs-UNDERPOWERED split needs an equivalence margin — the minimum effect of interest mei (default 0.05, i.e. 5 accuracy points; configurable in your metric's units). This is the honest distinction between "confidently no effect" and "can't tell yet", and it's the right knob for a team to set once.
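
As a rough illustration of the rules above for a single comparison, the sketch below maps a p-value, a confidence interval, and a minimum detectable effect to a label. `decide` is a hypothetical function, not siggate's API, and it leaves out the INVALID check, the best-of-N deflation, and suite correction.

```python
def decide(p_value, ci, min_detectable, alpha=0.05, mei=0.05):
    """Label one comparison as REAL, NOISE, or UNDERPOWERED (simplified)."""
    ci_low, ci_high = ci
    ci_excludes_zero = ci_low > 0 or ci_high < 0
    if p_value < alpha and ci_excludes_zero:
        return "REAL"
    if min_detectable <= mei:
        # The eval could have resolved a meaningful effect and found none.
        return "NOISE"
    # Not significant, and the eval could not have resolved a meaningful effect.
    return "UNDERPOWERED"


# Case (c) from the demo: every number here appears in this README.
underpowered = decide(0.3547, (-0.0373, 0.1039), min_detectable=0.1009)

# Case (a): p and CI are from the README; min_detectable is a placeholder.
real = decide(0.000666, (0.0156, 0.0578), min_detectable=0.03)

# Case (b): p and min_detectable are from the README; the CI is a placeholder.
noise = decide(0.488, (-0.013, 0.028), min_detectable=0.05)
```

The only difference between NOISE and UNDERPOWERED in this sketch is the comparison of the minimum detectable effect against mei, which is why mei is the setting worth agreeing on as a team.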

Configuration

Drop a siggate.toml at your repo root (every field optional, shown with defaults):

[gate]
alpha   = 0.05
power   = 0.80
mei     = 0.05   # minimum effect of interest (metric units) → NOISE vs UNDERPOWERED
fail_on = ["unsupported-improvement", "regression", "invalid"]

[metric]
name    = "acc"   # lm-eval metric key / Inspect scorer
adapter = "auto"  # auto | lm-eval | inspect | raw

[suite]
correction = "holm"   # holm (family-wise) | bh (false-discovery-rate)

[selection]
n_trials = 1      # best-of-N variants tried before reporting this run

fail_on rules (any match blocks the build): unsupported-improvement (claimed a win that isn't real), regression (a real change in the wrong direction), not-real (strict: anything not REAL), noise, underpowered, invalid. CLI flags override the file; --fail-on none makes it report-only.
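
One way to read these rules is as predicates over the verdict label and the sign of the delta. The sketch below is an interpretation of the descriptions above, not siggate's implementation: `gate_exit_code` is a hypothetical name, it assumes a higher-is-better metric, and it treats "claimed a win that isn't real" as a positive delta labelled NOISE or UNDERPOWERED.

```python
DEFAULT_FAIL_ON = ["unsupported-improvement", "regression", "invalid"]


def gate_exit_code(label, delta, fail_on=DEFAULT_FAIL_ON):
    """Return 1 if any fail_on rule matches the verdict, else 0."""
    triggered = {
        "unsupported-improvement": delta > 0 and label in {"NOISE", "UNDERPOWERED"},
        "regression": label == "REAL" and delta < 0,
        "not-real": label != "REAL",
        "noise": label == "NOISE",
        "underpowered": label == "UNDERPOWERED",
        "invalid": label == "INVALID",
    }
    return 1 if any(triggered[rule] for rule in fail_on) else 0


# The three demo cases under the default rules.
exit_real = gate_exit_code("REAL", +0.0367)
exit_noise = gate_exit_code("NOISE", +0.0073)
exit_underpowered = gate_exit_code("UNDERPOWERED", +0.0333)

# An empty rule list models report-only mode (--fail-on none).
exit_report_only = gate_exit_code("UNDERPOWERED", +0.0333, fail_on=[])
```

Under this reading, the default rules reproduce the exit codes shown for demo cases (a), (b), and (c).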

Library use

import siggate

v = siggate.compare(baseline_scores, candidate_scores, name="mmlu")
print(v.label)        # "REAL" | "NOISE" | "UNDERPOWERED" | "INVALID"
print(v.summary())    # one-line human verdict
print(v.n_more)       # samples still needed (None when already resolved)

suite = siggate.compare_suite({"mmlu": (base, cand)}, method="holm")   # one (baseline, candidate) pair per task
print([t.name for t in suite.survivors])   # tasks that survive correction

What's MVP vs. what a hosted product needs

This open-core CLI + Action is genuinely useful today and is honest about its edges:

Works now: all three adapters, paired CI + multiplicity + power + deflated verdict (via deltagate), exit-code gate, siggate.toml, the PR-comment Action (comment upsert via the standard GITHUB_TOKEN), suite mode, the three-way demo.

MVP limitations (documented, not hidden):

mei defaults to 0.05 and is calibrated for accuracy-style proportion metrics; set it in your metric's units for anything else.
Pairing requires aligned sample ids across runs (the adapters key on doc_id / Inspect id / your id column); siggate refuses to silently intersect mismatched runs.
The power / min-samples math uses the normal approximation deltagate ships (excellent for the sample sizes evals actually use; not exact for small n).
The Action installs from a git ref and runs the CLI; it is not yet a published Marketplace Action or a pinned PyPI release.
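
The normal-approximation power math mentioned in the list above can be sketched with the standard library. This is a back-of-envelope reconstruction, not deltagate's code: it assumes the reported 95% interval is symmetric and normal-based, recovers the standard error from it, and plugs in the rounded numbers from the underpowered example (n=180, delta +3.33pp, CI [-3.73pp, +10.39pp]).

```python
import math
from statistics import NormalDist

alpha, power = 0.05, 0.80
n = 180
delta = 0.0333                      # observed delta, as a proportion
ci_low, ci_high = -0.0373, 0.1039   # reported 95% CI

z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided critical value
z_power = NormalDist().inv_cdf(power)           # quantile for the target power

# Recover the standard error of the paired delta from the interval half-width.
se = (ci_high - ci_low) / (2 * z_alpha)

# Smallest effect detectable at this n with the requested power.
min_detectable = (z_alpha + z_power) * se

# The standard error shrinks like 1/sqrt(n), so the required n scales
# with the square of (min_detectable / delta).
n_total = math.ceil(n * (min_detectable / delta) ** 2)
n_more = n_total - n

print(f"min detectable @ n={n}: {min_detectable:.2%}")
print(f"about {n_more} more samples ({n_total} total)")
```

With these rounded inputs the minimum detectable effect comes out at about 10.09%, and the sample counts land within a few samples of the ~1,470 more (1,650 total) that siggate reports.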

What the hosted "Significance Gate" app would add (the commercial layer, not in this repo): a GitHub App (org-wide install, no per-repo token wiring), a dashboard of eval history and power over time, status checks with required-gate branch protection, trend/regression alerting, storage of eval artifacts, and billing. None of that changes the math — the moat is the statistical rigor, and that lives in the open core.

Roadmap

Publish deltagate and siggate to PyPI; pin a Marketplace Action.
More adapters (OpenAI evals, HELM, raw HF datasets columns).
Per-metric mei presets and an equivalence-test (TOST) mode for "prove no regression".
Bootstrapped power for tiny-n / heavy-tailed metrics.
Hosted app: GitHub App auth, dashboard, required status checks, billing.

Related work by the same author

The eval-rigor portfolio this productizes and sits beside:

deltagate / eval-reliability — the statistics engine under siggate.
agent-eval-reliability — ICC, pass@k CIs, paired tests with multiplicity for agent evals.
calibration-toolkit — ECE±CI, debiased estimator, Brier decomposition, temperature/Platt scaling.
leaderboard-reliability — re-ranks LLM leaderboards with Wilson CIs and tie bands.

License

Release history

0.1.1 (Jun 14, 2026)
0.1.0 (Jun 13, 2026)
