Eval-gated CI/CD for AI products: gate merges on the confidence-interval lower bound of a bias-corrected LLM-judge score, per failure-mode axis.

Maintainers

abhibuilds

Project description

🚦 CIGate

Gate your CI/CD on the confidence interval, not the vibes.

Eval-gated CI/CD for AI products: block a merge when answer quality statistically regresses — per failure-mode, with the LLM-judge's bias corrected for, under a cost budget.

The problem (a real, expensive one)

Teams shipping LLM and agent products change prompts, retrieval, tools, and model versions constantly — and those changes have non-local effects: fixing one answer silently breaks ten others. The industry default is "vibes-based" shipping, and it produces silent quality regressions that reach production and cost real money.

The system-design case study this implements opens with the canonical disaster: a well-meaning prompt tweak quietly degraded answer quality on contract questions, an LLM-judge metric drifted 11 points undetected, and the company lost a $4M renewal before anyone noticed.

The gap nobody fills

The popular eval tools — Promptfoo, Braintrust, Langfuse, DeepEval — gate on raw LLM-judge scores. But in-domain LLM-judge accuracy is only ~75–88%, so the judge's observed pass rate is biased. Gating on it means you either:

- over-block (false alarms → developers route around the gate), or
- under-block (real regressions still ship).

CIGate gates on the bias-corrected pass rate's confidence-interval lower bound, per failure-mode axis. That's the part the case study makes its centerpiece and the mainstream tools skip. (The "CI" pun is the whole thesis: gate your Continuous Integration on a Confidence Interval.)

What it looks like on a PR

A one-line prompt change ships answer_v2 ("be helpful and complete, citations optional"). CIGate runs on the PR and posts this — then blocks the merge:

❌ CIGate: merge blocked — 2 axis regression(s)

prompt=answer_v2 · judge=mock · sample=60/300 · cost=$0.00

	Axis	Raw judge	Corrected	95% CI	Baseline	Δ	Verdict
🔴	`hallucination`	45.0%	37.0%	[11.1%, 59.0%]	98.9%	−61.9 pp	REGRESSED
🟢	`retrieval_miss`	85.0%	91.1%	[69.8%, 100%]	100%	−8.9 pp	ok
🔴	`citation_error`	61.7%	65.5%	[47.6%, 82.3%]	100%	−34.5 pp	REGRESSED
🟢	`refusal`	71.7%	76.5%	[55.9%, 91.9%]	93.2%	−16.7 pp	ok
🟢	`format_violation`	100%	95.0%	[88.2%, 100%]	98.9%	−3.9 pp	ok

The regression is isolated to the two axes the change actually hurt — a single composite score would have hidden it. A clean change goes green and merges. Full samples: docs/samples/.
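
The verdict column can be sanity-checked by hand. The sketch below is one simplified reading, not CIGate's actual test: it flags an axis only when the upper end of the corrected 95% CI still sits more than the tolerance below the baseline. The 2 pp tolerance is an assumption borrowed from the sample evalconfig.yaml later in this README, and the rule ignores both the baseline's own uncertainty and the Bonferroni correction.

```python
# Rows copied from the PR comment above:
# (axis, ci_upper_pct, baseline_pct, reported_verdict)
ROWS = [
    ("hallucination", 59.0, 98.9, "REGRESSED"),
    ("retrieval_miss", 100.0, 100.0, "ok"),
    ("citation_error", 82.3, 100.0, "REGRESSED"),
    ("refusal", 91.9, 93.2, "ok"),
    ("format_violation", 100.0, 98.9, "ok"),
]

TOLERANCE_PP = 2.0  # assumption: tolerance 0.02 from the sample config


def simplified_verdict(ci_upper: float, baseline: float,
                       tol: float = TOLERANCE_PP) -> str:
    """Flag a regression only if even the optimistic end of the CI
    is below baseline - tol."""
    return "REGRESSED" if ci_upper < baseline - tol else "ok"


for axis, ci_upper, baseline, reported in ROWS:
    verdict = simplified_verdict(ci_upper, baseline)
    print(f"{axis:18s} {verdict:10s} (reported: {reported})")
```

Under this reading, refusal stays green despite a −16.7 pp point estimate because its interval reaches 91.9%, which is above 93.2% − 2 pp = 91.2%. The sample is too small to be confident the drop is real.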

How it works

flowchart LR
    PR[Pull request] --> RUN[Run SUT over a<br/>stratified golden-set sample]
    RUN --> CODE[Code evaluators<br/>schema · citations · retrieval]
    RUN --> JUDGE[LLM judge<br/>hallucination · refusal · ...]
    CODE --> DET[Per-axis detector]
    JUDGE --> DET
    CAL[Human-labeled<br/>calibration set] --> CORR
    DET --> CORR[Statistical correction<br/>Rogan–Gladen + CI]
    CORR --> GATE{Drop vs main<br/>baseline > tolerance?}
    GATE -- yes --> BLOCK[🔴 block + per-axis report]
    GATE -- no --> PASS[🟢 merge allowed]

1. Sample the golden set, stratified so every failure-mode axis is represented (cost control: a per-PR run touches a fraction, the nightly run covers the full set).
2. Score each case two ways: cheap deterministic code checks (citations, schema, retrieval) and an LLM-as-judge for subjective axes.
3. Correct the judge's bias: using its sensitivity/specificity measured on a human-labeled calibration set, estimate the true pass rate with a confidence interval. Deterministic axes skip correction (they're unbiased) and use an exact binomial interval.
4. Gate per axis with a one-sided two-sample drop test vs the committed main baseline (Bonferroni-corrected across axes): block only when we're confident the drop exceeds tolerance. Identical builds never false-block, regardless of CI width.
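
The per-axis gate in the last step can be sketched as a one-sided z-test on the drop. This is a minimal illustration under stated assumptions, not CIGate's implementation: the function name `gate_axis`, the normal approximation, and the use of standard errors as inputs are all hypothetical. Only the one-sided test against a tolerance and the Bonferroni split across axes come from the description above.

```python
from statistics import NormalDist


def gate_axis(baseline: float, baseline_se: float,
              current: float, current_se: float,
              tolerance: float = 0.02, confidence_level: float = 0.95,
              n_axes: int = 5) -> bool:
    """Return True (block) when the drop is confidently larger than `tolerance`.

    H0: baseline - current <= tolerance.
    Rejected at level alpha / n_axes (Bonferroni across axes).
    """
    alpha = (1.0 - confidence_level) / n_axes
    z_crit = NormalDist().inv_cdf(1.0 - alpha)
    drop = baseline - current
    se = (baseline_se ** 2 + current_se ** 2) ** 0.5
    if se == 0.0:
        # No sampling noise at all: compare the drop to the tolerance directly.
        return drop > tolerance
    return (drop - tolerance) / se > z_crit


# Identical builds: the drop is 0, below tolerance, so the statistic is
# negative and the gate stays open however wide the intervals are.
print(gate_axis(0.95, 0.10, 0.95, 0.10))

# A large, well-measured drop is blocked.
print(gate_axis(0.95, 0.01, 0.60, 0.05))
```

Testing against `tolerance` rather than against zero is what keeps identical builds from false-blocking: a noisy estimate widens the standard error but cannot push a zero drop past the threshold.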

The statistical core (the part that matters)

Raw judge pass rate p_obs is biased. With judge sensitivity (TPR) and specificity (TNR) measured on a labeled calibration set, the Rogan–Gladen estimator recovers the true rate:

            p_obs + TNR − 1
θ̂  =  clip( ───────────────── , 0, 1 )
             TPR + TNR − 1
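
As a minimal sketch of the estimator above (the function name and the example numbers are illustrative, not taken from src/cigate/stats.py):

```python
def rogan_gladen(p_obs: float, tpr: float, tnr: float) -> float:
    """Bias-corrected pass rate: clip((p_obs + TNR - 1) / (TPR + TNR - 1), 0, 1)."""
    denom = tpr + tnr - 1.0
    if denom <= 0.0:
        raise ValueError("judge is no better than chance (TPR + TNR <= 1)")
    theta = (p_obs + tnr - 1.0) / denom
    return min(1.0, max(0.0, theta))


# Hypothetical judge: passes 90% of true passes, fails 80% of true failures.
# If the true pass rate were 0.70, the judge would report
#   0.70 * 0.90 + 0.30 * (1 - 0.80) = 0.69,
# and the estimator inverts that mixture.
print(round(rogan_gladen(0.69, tpr=0.90, tnr=0.80), 4))  # 0.7
```

The clip matters in practice: with a small sample the raw ratio can land slightly below 0 or above 1 even when the judge is well calibrated.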

The confidence interval uses the adjusted-Wald delta method (Lee, Zeng et al., arXiv:2511.21140), combining all three uncertainty sources — evaluated-sample, sensitivity, and specificity — with correct ~95% coverage even on small calibration sets. We cross-check it against the judgy library in the test suite, and the implementation reproduces the paper's worked example exactly. See docs/METHODOLOGY.md.
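
To show how the three uncertainty sources combine, here is a plain first-order delta-method interval for the Rogan–Gladen estimate. It is a simplified stand-in, not the adjusted-Wald variant the paper describes and not CIGate's code: the function name, the unadjusted binomial variances, and the example counts are all assumptions.

```python
from statistics import NormalDist


def _clip(x: float) -> float:
    return min(1.0, max(0.0, x))


def corrected_ci(k_pass: int, n: int, tp: int, n_pos: int, tn: int, n_neg: int,
                 level: float = 0.95) -> tuple[float, float, float]:
    """Point estimate and plain delta-method CI for the corrected pass rate.

    k_pass / n : judge passes on the evaluated sample
    tp / n_pos : judge passes among human-labeled passes  (TPR)
    tn / n_neg : judge fails among human-labeled failures (TNR)
    """
    p, tpr, tnr = k_pass / n, tp / n_pos, tn / n_neg
    j = tpr + tnr - 1.0
    if j <= 0.0:
        raise ValueError("judge is no better than chance")
    theta = (p + tnr - 1.0) / j
    # Partial derivatives of theta:
    #   d/dp = 1/j,  d/dTPR = -theta/j,  d/dTNR = (1 - theta)/j
    var = (p * (1 - p) / n
           + theta ** 2 * tpr * (1 - tpr) / n_pos
           + (1 - theta) ** 2 * tnr * (1 - tnr) / n_neg) / j ** 2
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half = z * var ** 0.5
    return _clip(theta), _clip(theta - half), _clip(theta + half)


# Hypothetical counts: 60 evaluated cases, 100 + 100 calibration labels.
est, lo, hi = corrected_ci(k_pass=42, n=60, tp=90, n_pos=100, tn=80, n_neg=100)
print(f"{est:.3f} [{lo:.3f}, {hi:.3f}]")
```

Dividing by (TPR + TNR − 1)² is why a weak judge inflates the interval: as the judge approaches chance, the same sampling noise maps to a much wider range of plausible true rates.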

If the judge is no better than chance (TPR + TNR ≤ 1) or the CI is too wide, CIGate refuses to gate that axis rather than guess.
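
A hedged sketch of that refusal rule. The `max_ci_width` threshold and the function name are hypothetical, since this README does not state how wide is too wide:

```python
from typing import Optional


def gating_skip_reason(tpr: float, tnr: float, ci_low: float, ci_high: float,
                       max_ci_width: float = 0.5) -> Optional[str]:
    """Return why an axis should not be gated, or None if gating can proceed."""
    if tpr + tnr <= 1.0:
        return "judge no better than chance (TPR + TNR <= 1)"
    if ci_high - ci_low > max_ci_width:
        return "confidence interval too wide"
    return None


# A judge at or below chance is rejected before the interval is even looked at.
print(gating_skip_reason(0.55, 0.40, 0.2, 0.6))
# A usable judge with an uninformative interval is also skipped.
print(gating_skip_reason(0.90, 0.80, 0.1, 0.9))
# A usable judge with a reasonably tight interval can be gated.
print(gating_skip_reason(0.90, 0.80, 0.6, 0.8))
```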

Try it in 60 seconds ($0, offline)

Everything runs in a deterministic mock mode — no API key, no spend — which is also what powers the test suite and the demo CI.

git clone https://github.com/awesome-pro/cigate && cd cigate
pip install -e ".[dev]"

cigate baseline --promote                 # establish a 'good' baseline (full run)
BUILD_FLAVOR=regressed cigate gate        # → blocks: hallucination + citation_error red
BUILD_FLAVOR=good      cigate gate        # → passes: all axes within tolerance
pytest -q                                 # 26 tests, all green, $0

Want the real thing? Set ANTHROPIC_API_KEY and install the extra — the same pipeline now uses Claude as generator and judge:

pip install -e ".[real]"
export ANTHROPIC_API_KEY=sk-...
cigate gate                               # real Claude answers, scored + corrected

Two datasets: synthetic + real

- synthetic_contract: a generated contract/insurance support set (300 cases, 50 policy docs). Fully controlled; powers the deterministic demo and tests.
- cuad_real: built from CUAD (real commercial contracts with expert clause annotations, CC BY 4.0). Its headline use: the LLM judge is calibrated against real human expert labels, so the correction's confusion matrix is measured from real bias, not assumed. (evalconfig_cuad.yaml)
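
Measuring the judge's confusion matrix from human labels reduces to counting agreement. A minimal sketch, assuming calibration data arrives as (human_pass, judge_pass) pairs; that data shape and the function name are illustrative, not the interface of CIGate's calibrate.py:

```python
def judge_confusion(pairs: list[tuple[bool, bool]]) -> tuple[float, float]:
    """Sensitivity (TPR) and specificity (TNR) from (human_pass, judge_pass) pairs."""
    tp = sum(1 for human, judge in pairs if human and judge)
    fn = sum(1 for human, judge in pairs if human and not judge)
    tn = sum(1 for human, judge in pairs if not human and not judge)
    fp = sum(1 for human, judge in pairs if not human and judge)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("calibration set needs both passing and failing human labels")
    return tp / (tp + fn), tn / (tn + fp)


# Toy calibration set: 4 human-labeled passes, 4 human-labeled failures,
# with the judge disagreeing once in each group.
pairs = [(True, True)] * 3 + [(True, False)] + [(False, False)] * 3 + [(False, True)]
tpr, tnr = judge_confusion(pairs)
print(tpr, tnr)  # 0.75 0.75
```

Both label classes have to be present: without human-labeled failures there is no way to estimate specificity, and the correction is undefined.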

Use it on your own product

CIGate is product-agnostic. Point evalconfig.yaml at any callable (question) -> SUTOutput and bring your own golden set:

sut: "yourapp.bot:answer"          # module:callable
goldenset: "goldensets/yours.yaml"
axes: [hallucination, citation_error, ...]
gate: { tolerance: 0.02, confidence_level: 0.95 }
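
The module:callable string in the sut field follows the common entry-point style. How CIGate resolves it internally is not shown in this README; the sketch below is one conventional way to do it with importlib, and `load_callable` is a hypothetical name.

```python
from importlib import import_module
from typing import Any, Callable


def load_callable(spec: str) -> Callable[..., Any]:
    """Resolve a 'package.module:attribute' string to the object it names."""
    module_name, sep, attr = spec.partition(":")
    if not sep or not module_name or not attr:
        raise ValueError(f"expected 'module:callable', got {spec!r}")
    obj = getattr(import_module(module_name), attr)
    if not callable(obj):
        raise TypeError(f"{spec!r} does not name a callable")
    return obj


# Demonstrated on a standard-library target so the snippet runs anywhere;
# in a real config the spec would be something like "yourapp.bot:answer".
sqrt = load_callable("math:sqrt")
print(sqrt(9.0))  # 3.0
```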

Then drop the GitHub Action into your pipeline (see .github/):

- uses: awesome-pro/cigate/.github/actions/eval-gate@v0.1
  with:
    config: evalconfig.yaml
    anthropic-api-key: ${{ secrets.ANTHROPIC_API_KEY }}   # omit -> $0 mock mode

What's in here

Path	What
`src/cigate/stats.py`	Rogan–Gladen + adjusted-Wald CI — the correction core
`src/cigate/gate.py`	per-axis two-sample drop test vs baseline
`src/cigate/{runner,evaluators,calibrate}.py`	eval execution, code+judge scoring, drift
`src/refbot/`	the demo RAG bot (BM25 + Claude/mock generator)
`.github/`	composite Action + PR / nightly workflows
`dashboard/app.py`	Streamlit dashboard (per-axis, calibration, live gate)
`docs/`	architecture, methodology, auditor pack, article, demo script

🏗 Architecture · 📐 Methodology · 🧾 Auditor pack sample · 🎬 Demo script
Built from the "Eval-Gated CI/CD" system-design case study. Grounded in the evals work of Hamel Husain, Shreya Shankar, and Eugene Yan.

License

MIT. CUAD data under CC BY 4.0 — see data/cuad/ATTRIBUTION.md.

Project details

Release history

This version

0.1.0

Jun 28, 2026

Download files

Source Distribution

cigate-0.1.0.tar.gz (1.1 MB)

Uploaded Jun 28, 2026 Source

Built Distribution

cigate-0.1.0-py3-none-any.whl (65.4 kB)

Uploaded Jun 28, 2026 Python 3

File details

Details for the file cigate-0.1.0.tar.gz.

File metadata

Download URL: cigate-0.1.0.tar.gz
Upload date: Jun 28, 2026
Size: 1.1 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for cigate-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`803bd2e1f682e22480795182b18aaf5d4d7b1757ca78b0c9f2e201cf1e5b270b`
MD5	`dbbf1a4ba85f8eb56349e7bd2ac59403`
BLAKE2b-256	`58d554a49603a77cf8c043d774240b2b15fdff09dbb3b7156da046dd9e09ce4c`

Provenance

The following attestation bundles were made for cigate-0.1.0.tar.gz:

Publisher: pypi.yml on awesome-pro/cigate

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: cigate-0.1.0.tar.gz
- Subject digest: 803bd2e1f682e22480795182b18aaf5d4d7b1757ca78b0c9f2e201cf1e5b270b
- Sigstore transparency entry: 1998492851
- Sigstore integration time: Jun 28, 2026
Source repository:
- Permalink: awesome-pro/cigate@17c8b84745dd2358c89d73e11ab3196af994d6a5
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/awesome-pro
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pypi.yml@17c8b84745dd2358c89d73e11ab3196af994d6a5
- Trigger Event: release

File details

Details for the file cigate-0.1.0-py3-none-any.whl.

File metadata

Download URL: cigate-0.1.0-py3-none-any.whl
Upload date: Jun 28, 2026
Size: 65.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for cigate-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a32acd30efc1f57b07aebc7645928e87f6fc6708ad25f0df877ec5281c4e74fc`
MD5	`5724b6be7b41210140b1c0faa95e0f66`
BLAKE2b-256	`d46b421417845569a8ac90183ede0c76f71ce7ade1255c0e8b1879bbf45c8d19`

Provenance

The following attestation bundles were made for cigate-0.1.0-py3-none-any.whl:

Publisher: pypi.yml on awesome-pro/cigate

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: cigate-0.1.0-py3-none-any.whl
- Subject digest: a32acd30efc1f57b07aebc7645928e87f6fc6708ad25f0df877ec5281c4e74fc
- Sigstore transparency entry: 1998492958
- Sigstore integration time: Jun 28, 2026
Source repository:
- Permalink: awesome-pro/cigate@17c8b84745dd2358c89d73e11ab3196af994d6a5
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/awesome-pro
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pypi.yml@17c8b84745dd2358c89d73e11ab3196af994d6a5
- Trigger Event: release
