Eval-gated CI/CD for AI products: gate merges on the confidence-interval lower bound of a bias-corrected LLM-judge score, per failure-mode axis.
🚦 CIGate
Gate your CI/CD on the confidence interval, not the vibes.
Eval-gated CI/CD for AI products: block a merge when answer quality statistically regresses — per failure-mode, with the LLM-judge's bias corrected for, under a cost budget.
The problem (a real, expensive one)
Teams shipping LLM and agent products change prompts, retrieval, tools, and model versions constantly — and those changes have non-local effects: fixing one answer silently breaks ten others. The industry default is "vibes-based" shipping, and it produces silent quality regressions that reach production and cost real money.
The system-design case study this implements opens with the canonical disaster: a well-meaning prompt tweak quietly degraded answer quality on contract questions, an LLM-judge metric drifted 11 points undetected, and the company lost a $4M renewal before anyone noticed.
The gap nobody fills
The popular eval tools — Promptfoo, Braintrust, Langfuse, DeepEval — gate on raw LLM-judge scores. But in-domain LLM-judge accuracy is only ~75–88%, so the judge's observed pass rate is biased. Gating on it means you either:
- over-block (false alarms → developers route around the gate), or
- under-block (real regressions still ship).
CIGate gates on the bias-corrected pass rate's confidence-interval lower bound, per failure-mode axis. That's the part the case study makes its centerpiece and the mainstream tools skip. (The "CI" pun is the whole thesis: gate your Continuous Integration on a Confidence Interval.)
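To make the bias concrete, here is a minimal sketch of the forward model behind it, assuming the judge's errors are fully described by a sensitivity (TPR) and a specificity (TNR). The function name and the 85%/80% figures are illustrative assumptions, not values measured by CIGate.

```python
def observed_pass_rate(true_rate: float, tpr: float, tnr: float) -> float:
    """Pass rate an imperfect judge reports when the true pass rate is `true_rate`.

    True passes are recognised with probability `tpr`; true failures are
    wrongly passed with probability `1 - tnr`.
    """
    return true_rate * tpr + (1.0 - true_rate) * (1.0 - tnr)


# Hypothetical judge: 85% sensitivity, 80% specificity.
for true_rate in (0.95, 0.80, 0.50):
    seen = observed_pass_rate(true_rate, tpr=0.85, tnr=0.80)
    print(f"true={true_rate:.2f}  judge reports={seen:.4f}")
```

With these assumed numbers, a true 15-point drop (0.95 to 0.80) shows up as roughly a 10-point drop in the raw judge score (0.8175 to 0.72): the judge compresses every difference by a factor of TPR + TNR − 1.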
What it looks like on a PR
A one-line prompt change ships answer_v2 ("be helpful and complete, citations
optional"). CIGate runs on the PR and posts this — then blocks the merge:
❌ CIGate: merge blocked — 2 axis regression(s)
prompt=answer_v2·judge=mock·sample=60/300·cost=$0.00
| Axis | Raw judge | Corrected | 95% CI | Baseline | Δ | Verdict |
|---|---|---|---|---|---|---|
| 🔴 hallucination | 45.0% | 37.0% | [11.1%, 59.0%] | 98.9% | −61.9 pp | REGRESSED |
| 🟢 retrieval_miss | 85.0% | 91.1% | [69.8%, 100%] | 100% | −8.9 pp | ok |
| 🔴 citation_error | 61.7% | 65.5% | [47.6%, 82.3%] | 100% | −34.5 pp | REGRESSED |
| 🟢 refusal | 71.7% | 76.5% | [55.9%, 91.9%] | 93.2% | −16.7 pp | ok |
| 🟢 format_violation | 100% | 95.0% | [88.2%, 100%] | 98.9% | −3.9 pp | ok |
The regression is isolated to the two axes the change actually hurt — a single
composite score would have hidden it. A clean change goes green and merges. Full samples:
docs/samples/.
How it works
flowchart LR
PR[Pull request] --> RUN[Run SUT over a<br/>stratified golden-set sample]
RUN --> CODE[Code evaluators<br/>schema · citations · retrieval]
RUN --> JUDGE[LLM judge<br/>hallucination · refusal · ...]
CODE --> DET[Per-axis detector]
JUDGE --> DET
CAL[Human-labeled<br/>calibration set] --> CORR
DET --> CORR[Statistical correction<br/>Rogan–Gladen + CI]
CORR --> GATE{Drop vs main<br/>baseline > tolerance?}
GATE -- yes --> BLOCK[🔴 block + per-axis report]
GATE -- no --> PASS[🟢 merge allowed]
- Sample the golden set, stratified so every failure-mode axis is represented (cost control — a per-PR run touches a fraction, nightly runs the full set).
- Score each case two ways: cheap deterministic code checks (citations, schema, retrieval) and an LLM-as-judge for subjective axes.
- Correct the judge's bias: using its sensitivity/specificity measured on a human-labeled calibration set, recover the true pass rate with a confidence interval. Deterministic axes skip correction (they're unbiased) and use an exact binomial interval.
- Gate per axis with a one-sided two-sample drop test vs the committed main baseline (Bonferroni-corrected across axes): block only when we're confident the drop exceeds tolerance. Identical builds never false-block, regardless of CI width.
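The gating step can be sketched as follows. This is an illustration under stated assumptions, not the code in src/cigate/gate.py: it assumes a normal approximation for the difference of two independent estimates, and the function name and parameters are hypothetical.

```python
import math
from statistics import NormalDist


def axis_regressed(
    base_rate: float,
    base_se: float,
    cand_rate: float,
    cand_se: float,
    tolerance: float,
    alpha: float,
    n_axes: int,
) -> bool:
    """One-sided test of H0: (baseline - candidate) <= tolerance.

    Returns True only when H0 is rejected at the Bonferroni-adjusted
    level alpha / n_axes, i.e. when the drop confidently exceeds tolerance.
    """
    drop = base_rate - cand_rate
    se = math.hypot(base_se, cand_se)  # SE of the difference of independent estimates
    if se == 0.0:
        return drop > tolerance
    z_crit = NormalDist().inv_cdf(1.0 - alpha / n_axes)
    return (drop - tolerance) / se > z_crit


# Identical builds: the drop is 0, so the statistic is negative and never blocks.
print(axis_regressed(0.95, 0.05, 0.95, 0.05, tolerance=0.02, alpha=0.05, n_axes=5))
# A large drop relative to its uncertainty blocks.
print(axis_regressed(0.99, 0.01, 0.37, 0.12, tolerance=0.02, alpha=0.05, n_axes=5))
```

Because the test asks whether the drop exceeds tolerance rather than whether the candidate's interval is narrow, a wide interval on its own cannot trigger a block; it only makes a block harder to reach.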
The statistical core (the part that matters)
Raw judge pass rate p_obs is biased. With judge sensitivity (TPR) and specificity
(TNR) measured on a labeled calibration set, the Rogan–Gladen estimator recovers the
true rate:
θ̂ = clip( (p_obs + TNR − 1) / (TPR + TNR − 1), 0, 1 )
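As a minimal sketch, the point estimate above translates directly into a few lines of Python. The function name is an assumption for illustration and is not necessarily what src/cigate/stats.py exposes.

```python
def rogan_gladen(p_obs: float, tpr: float, tnr: float) -> float:
    """Bias-corrected pass rate from an observed judge pass rate.

    `tpr` and `tnr` are the judge's sensitivity and specificity as
    measured on a human-labeled calibration set.
    """
    denom = tpr + tnr - 1.0
    if denom <= 0.0:
        # A judge at or below chance carries no usable signal.
        raise ValueError("judge is no better than chance (TPR + TNR <= 1)")
    theta = (p_obs + tnr - 1.0) / denom
    return min(1.0, max(0.0, theta))  # clip to [0, 1]


# Illustrative numbers: the judge passes 72% of answers, with TPR=0.85, TNR=0.80.
print(round(rogan_gladen(0.72, tpr=0.85, tnr=0.80), 4))
```

With these illustrative inputs the corrected rate is about 0.80, noticeably higher than the raw 0.72 the judge reported.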
The confidence interval uses the adjusted-Wald delta method (Lee, Zeng et al.,
arXiv:2511.21140), combining all three uncertainty
sources — evaluated-sample, sensitivity, and specificity — with correct ~95% coverage
even on small calibration sets. We cross-check it against the
judgy library in the test suite, and the
implementation reproduces the paper's worked example exactly. See
docs/METHODOLOGY.md.
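To show how the three uncertainty sources combine, here is a sketch of a plain delta-method (Wald) interval for the Rogan–Gladen estimate. It is deliberately not the adjusted-Wald variant from the cited paper, whose small-sample adjustments are not reproduced here; the function name and argument layout are assumptions for illustration.

```python
import math
from statistics import NormalDist


def rogan_gladen_wald_ci(k_pass, n, tp, n_pos, tn, n_neg, level=0.95):
    """Return (estimate, lower, upper) for the corrected pass rate.

    k_pass / n  : judge passes on the evaluated sample
    tp / n_pos  : judge passes among human-labeled passes (sensitivity)
    tn / n_neg  : judge fails among human-labeled fails (specificity)
    """
    p_obs, tpr, tnr = k_pass / n, tp / n_pos, tn / n_neg
    denom = tpr + tnr - 1.0
    if denom <= 0.0:
        raise ValueError("judge is no better than chance (TPR + TNR <= 1)")
    theta = (p_obs + tnr - 1.0) / denom
    # First-order variance: one term per uncertainty source.
    var = (
        p_obs * (1.0 - p_obs) / n                        # evaluated sample
        + theta**2 * tpr * (1.0 - tpr) / n_pos           # sensitivity
        + (1.0 - theta) ** 2 * tnr * (1.0 - tnr) / n_neg  # specificity
    ) / denom**2
    half = NormalDist().inv_cdf(0.5 + level / 2.0) * math.sqrt(var)

    def clip(x):
        return min(1.0, max(0.0, x))

    return clip(theta), clip(theta - half), clip(theta + half)


est, lo, hi = rogan_gladen_wald_ci(72, 100, 85, 100, 80, 100)
print(f"estimate={est:.3f}  95% CI=[{lo:.3f}, {hi:.3f}]")
```

Note how the division by (TPR + TNR − 1)² inflates the variance: the weaker the judge, the wider the interval, even when the evaluated sample is large.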
If the judge is no better than chance (TPR + TNR ≤ 1) or the CI is too wide, CIGate
refuses to gate that axis rather than guess.
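The refusal rule could look like the sketch below. The `max_ci_width` threshold and its default are hypothetical; the source does not state what width CIGate treats as "too wide".

```python
from typing import Optional


def gating_refusal_reason(
    tpr: float,
    tnr: float,
    ci_low: float,
    ci_high: float,
    max_ci_width: float = 0.40,  # hypothetical threshold, not from CIGate
) -> Optional[str]:
    """Return why an axis should not be gated, or None if gating is safe."""
    if tpr + tnr <= 1.0:
        return "judge is no better than chance"
    if ci_high - ci_low > max_ci_width:
        return "confidence interval too wide"
    return None


print(gating_refusal_reason(0.50, 0.50, 0.40, 0.60))
print(gating_refusal_reason(0.90, 0.90, 0.70, 0.90))
```

Returning a reason string rather than a bare boolean lets the per-axis report say why an axis was skipped instead of silently passing it.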
Try it in 60 seconds ($0, offline)
Everything runs in a deterministic mock mode — no API key, no spend — which is also what powers the test suite and the demo CI.
git clone https://github.com/awesome-pro/cigate && cd cigate
pip install -e ".[dev]"
cigate baseline --promote # establish a 'good' baseline (full run)
BUILD_FLAVOR=regressed cigate gate # → blocks: hallucination + citation_error red
BUILD_FLAVOR=good cigate gate # → passes: all axes within tolerance
pytest -q # 26 tests, all green, $0
Want the real thing? Set ANTHROPIC_API_KEY and install the extra — the same pipeline
now uses Claude as generator and judge:
pip install -e ".[real]"
export ANTHROPIC_API_KEY=sk-...
cigate gate # real Claude answers, scored + corrected
Two datasets: synthetic + real
- synthetic_contract: a generated contract/insurance support set (300 cases, 50 policy docs). Fully controlled; powers the deterministic demo and tests.
- cuad_real: built from CUAD (real commercial contracts with expert clause annotations, CC BY 4.0). Its headline use: the LLM judge is calibrated against real human expert labels, so the correction's confusion matrix is measured from real bias, not assumed. (evalconfig_cuad.yaml)
Use it on your own product
CIGate is product-agnostic. Point evalconfig.yaml at any callable
(question) -> SUTOutput and bring your own golden set:
sut: "yourapp.bot:answer" # module:callable
goldenset: "goldensets/yours.yaml"
axes: [hallucination, citation_error, ...]
gate: { tolerance: 0.02, confidence_level: 0.95 }
Then drop the GitHub Action into your pipeline (see .github/):
- uses: awesome-pro/cigate/.github/actions/eval-gate@v0.1
  with:
    config: evalconfig.yaml
    anthropic-api-key: ${{ secrets.ANTHROPIC_API_KEY }}  # omit -> $0 mock mode
What's in here
| Path | What |
|---|---|
| src/cigate/stats.py | Rogan–Gladen + adjusted-Wald CI: the correction core |
| src/cigate/gate.py | per-axis two-sample drop test vs baseline |
| src/cigate/{runner,evaluators,calibrate}.py | eval execution, code+judge scoring, drift |
| src/refbot/ | the demo RAG bot (BM25 + Claude/mock generator) |
| .github/ | composite Action + PR / nightly workflows |
| dashboard/app.py | Streamlit dashboard (per-axis, calibration, live gate) |
| docs/ | architecture, methodology, auditor pack, article, demo script |
More
- 🏗 Architecture · 📐 Methodology · 🧾 Auditor pack sample · 🎬 Demo script
- Built from the "Eval-Gated CI/CD" system-design case study. Grounded in the evals work of Hamel Husain, Shreya Shankar, and Eugene Yan.
License
MIT. CUAD data under CC BY 4.0 — see data/cuad/ATTRIBUTION.md.