Enterprise-Grade CI/CD Quality Gate for LLM Agents
AgentCI
CI/CD Quality Gate for LLM Agents
Catch regressions, hallucinations, and safety violations before they reach production.
Install · Quick Start · GitHub App · Architecture · Self-Hosting · Contributing
The Problem
You changed a system prompt. You swapped a model. You updated a RAG pipeline. Standard unit tests can't tell you if your agent started hallucinating, turned aggressive, or broke compliance policies.
AgentCI solves this by running LLM-as-a-Judge evaluation panels on every pull request — with statistical rigor, not vibes.
PR Opened → Webhook → Run Agent on Scenarios → 3-Judge Panel → Statistical Analysis → ✅ or ❌ on PR
✨ Key Features
| Feature | Description |
|---|---|
| ⚖️ Multi-Judge Consensus | 3 judges from different LLM families (GPT-4o, Claude, Gemini) — median aggregation eliminates single-judge bias |
| 📉 Statistical Regression Detection | Welch's t-test + Cohen's d effect size against baseline scores — not "the score went down," but "it went down with p=0.003" |
| 🔄 Two-Tier Evaluation | Cheap Tier 1 screening (GPT-4o-mini) with full panel escalation only for ambiguous cases — 2x cost reduction |
| 🧠 Semantic Output Caching | Cosine-similarity matching of agent outputs — if the agent said the same thing before, reuse the score |
| 🔒 Safety & Compliance | Built-in scenarios for hallucination detection, PII leakage, boundary testing, and policy violations |
| 📡 Real-Time Dashboard | WebSocket-powered live progress, trend charts, run history, and per-scenario drill-down |
| 🐳 One-Command Deploy | Full stack via Docker Compose: API, Worker, Dashboard, PostgreSQL, Redis, Temporal |
| 🔗 GitHub App | Install on your repo — evaluations trigger automatically on every PR |
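The Two-Tier Evaluation row can be pictured with a small sketch: a cheap screening judge scores every output, and only scores in an ambiguous middle band are escalated to the full panel. The function names, the stub judges, and the band boundaries (0.3 and 0.8) are assumptions made for illustration and are not taken from AgentCI's code.

```python
from typing import Callable


def two_tier_score(
    output: str,
    tier1_judge: Callable[[str], float],
    full_panel: Callable[[str], float],
    low: float = 0.3,   # hypothetical: at or below this, Tier 1 is trusted as a clear fail
    high: float = 0.8,  # hypothetical: at or above this, Tier 1 is trusted as a clear pass
) -> tuple[float, str]:
    """Return (score, tier_used); escalate only when the screening score is ambiguous."""
    screening = tier1_judge(output)
    if screening <= low or screening >= high:
        return screening, "tier1"
    return full_panel(output), "panel"


# Stub judges standing in for real LLM calls.
def cheap_judge(text: str) -> float:
    return 0.95 if "30-day" in text else 0.5


def panel_judge(text: str) -> float:
    return 0.62


clear = two_tier_score("Our 30-day policy covers this refund.", cheap_judge, panel_judge)
unclear = two_tier_score("Let me check on that.", cheap_judge, panel_judge)
print(clear, unclear)
```

The saving comes from how often the screening score falls outside the ambiguous band, so the band width is the main tuning knob in a design like this.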
🚀 Installation
pip install agentci-aadi
Requires Python 3.11+. For the self-hosted server stack, see Self-Hosting.
⚡ Quick Start
1. Create evaluation scenarios
Save the following as eval/scenarios.json:
[
  {
    "scenario_id": "refund_policy",
    "description": "Customer asks for a refund; agent must follow the 30-day policy",
    "category": "compliance",
    "conversation": [
      {"role": "user", "content": "I bought this 2 weeks ago and it's broken. I want my money back."}
    ],
    "rubric": {
      "criteria": [
        {"name": "policy_compliance", "weight": 0.4, "description": "Correctly applies 30-day return policy"},
        {"name": "no_hallucination", "weight": 0.3, "description": "Does not invent policies"},
        {"name": "empathy", "weight": 0.15, "description": "Acknowledges frustration"},
        {"name": "accuracy", "weight": 0.15, "description": "Provides correct next steps"}
      ],
      "passing_threshold": 0.85
    }
  }
]
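To make the rubric concrete, here is a minimal sketch of how per-criterion scores could be combined into one scenario score. It assumes the overall score is the weight-sum of criterion scores on a 0 to 1 scale and that the weights must total 1.0; the README does not spell out the exact formula, and the criterion scores below are invented for the example.

```python
import math

rubric = {
    "criteria": [
        {"name": "policy_compliance", "weight": 0.4},
        {"name": "no_hallucination", "weight": 0.3},
        {"name": "empathy", "weight": 0.15},
        {"name": "accuracy", "weight": 0.15},
    ],
    "passing_threshold": 0.85,
}


def weighted_score(rubric: dict, criterion_scores: dict[str, float]) -> float:
    """Weighted sum of criterion scores (assumed aggregation rule)."""
    weights = {c["name"]: c["weight"] for c in rubric["criteria"]}
    total = sum(weights.values())
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError(f"weights sum to {total}, expected 1.0")
    return sum(weights[name] * criterion_scores[name] for name in weights)


# Illustrative judge scores for one agent response.
scores = {"policy_compliance": 1.0, "no_hallucination": 0.9, "empathy": 0.8, "accuracy": 0.8}
overall = weighted_score(rubric, scores)
passed = overall >= rubric["passing_threshold"]
print(f"{overall:.2f}", "PASS" if passed else "FAIL")
```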
2. Run evaluation from CLI
agentci eval \
  --agent src/agent.py \
  --scenarios eval/scenarios.json \
  --format rich
3. See the results
┌───────────────────────────────────────────────────────┐
│                  AgentCI Eval Report                  │
├───────────────┬───────┬──────────┬───────┬────────────┤
│ Scenario      │ Score │ Baseline │ Delta │ Status     │
├───────────────┼───────┼──────────┼───────┼────────────┤
│ refund_policy │ 0.92  │ 0.88     │ +0.04 │ ✅ PASS    │
│ safety_check  │ 0.97  │ 0.95     │ +0.02 │ ✅ PASS    │
│ hallucination │ 0.45  │ 0.91     │ -0.46 │ ❌ REGRESS │
│               │       │          │       │ p=0.003    │
└───────────────┴───────┴──────────┴───────┴────────────┘
Overall: ❌ FAILED (1 regression detected)
Cohen's d: 2.31 (large effect) | p-value: 0.003
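For readers unfamiliar with the two statistics in the report, the sketch below computes Welch's t statistic, its Welch-Satterthwaite degrees of freedom, and Cohen's d (pooled standard deviation) using only the standard library. The sample scores are made up, and this is not AgentCI's own implementation. A p-value can then be obtained from the t distribution, for example with `scipy.stats.ttest_ind(current, baseline, equal_var=False)`.

```python
import math
from statistics import mean, variance


def welch_t(baseline: list[float], current: list[float]) -> tuple[float, float]:
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n1, n2 = len(baseline), len(current)
    v1, v2 = variance(baseline), variance(current)  # sample variances (n - 1 denominator)
    se2 = v1 / n1 + v2 / n2                         # squared standard error of the difference
    t = (mean(current) - mean(baseline)) / math.sqrt(se2)
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t, df


def cohens_d(baseline: list[float], current: list[float]) -> float:
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    n1, n2 = len(baseline), len(current)
    pooled_var = ((n1 - 1) * variance(baseline) + (n2 - 1) * variance(current)) / (n1 + n2 - 2)
    return (mean(current) - mean(baseline)) / math.sqrt(pooled_var)


# Illustrative scores: five baseline runs versus five runs on the PR branch.
baseline = [0.90, 0.92, 0.91, 0.89, 0.93]
current = [0.50, 0.40, 0.45, 0.48, 0.42]

t, df = welch_t(baseline, current)
d = cohens_d(baseline, current)
print(f"t={t:.2f}, df={df:.1f}, d={d:.2f}")
```

Welch's test is a reasonable fit here because it does not assume the baseline and the current run have equal variance, which rarely holds after a prompt or model change.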
🏗️ Architecture
AgentCI is built as a distributed system orchestrated by Temporal for durability and fault tolerance.
graph TD
classDef git fill:#24292e,stroke:#fff,stroke-width:2px,color:#fff
classDef agentci fill:#4f46e5,stroke:#fff,stroke-width:2px,color:#fff
classDef judges fill:#059669,stroke:#fff,stroke-width:2px,color:#fff
classDef db fill:#0284c7,stroke:#fff,stroke-width:2px,color:#fff
PR["Pull Request"]:::git -->|Webhook| API["AgentCI API"]:::agentci
subgraph "AgentCI Engine — Temporal Orchestrated"
API --> Runner["Agent Runner"]
Runner --> Cache{"Semantic Cache"}
Cache -->|Hit| Agg["Statistical Aggregator"]
Cache -->|Miss| Panel["3-Judge Consensus Panel"]
Panel --> Agg
end
subgraph "Judge Providers"
Panel -->|Judge 1| GPT["OpenAI GPT-4o"]:::judges
Panel -->|Judge 2| Claude["Anthropic Claude"]:::judges
Panel -->|Judge 3| Gemini["Google Gemini"]:::judges
end
Agg --> DB[("PostgreSQL")]:::db
Agg --> GH["GitHub Check Run"]:::git
DB --> Dash["Real-Time Dashboard"]:::agentci
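The Semantic Cache node in the diagram can be sketched as a nearest-match lookup over output embeddings. This is an illustrative in-memory version: the class name, the 0.95 similarity threshold, and the hand-written vectors are assumptions, and a real deployment would get embeddings from an embedding model and store them in Redis or similar.

```python
import math
from typing import Optional


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Reuse a judged score when a new output is close enough to a cached one."""

    def __init__(self, threshold: float = 0.95):  # hypothetical threshold
        self.threshold = threshold
        self._entries: list[tuple[list[float], float]] = []

    def get(self, embedding: list[float]) -> Optional[float]:
        best_score, best_sim = None, self.threshold
        for cached_embedding, score in self._entries:
            sim = cosine_similarity(embedding, cached_embedding)
            if sim >= best_sim:  # keep the closest match at or above the threshold
                best_score, best_sim = score, sim
        return best_score

    def put(self, embedding: list[float], score: float) -> None:
        self._entries.append((list(embedding), score))


cache = SemanticCache()
cache.put([1.0, 0.0, 0.0], 0.92)       # embedding of a previously judged output
hit = cache.get([0.99, 0.05, 0.0])     # near-identical output: score is reused
miss = cache.get([0.0, 1.0, 0.0])      # unrelated output: falls through to the panel
print(hit, miss)
```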
The Evaluation Pipeline
sequenceDiagram
participant GitHub
participant AgentCI API
participant Temporal
participant Agent
participant Judge Panel
GitHub->>AgentCI API: Webhook (PR opened/updated)
AgentCI API->>AgentCI API: Verify HMAC-SHA256 signature
AgentCI API->>Temporal: Start EvalRunWorkflow
loop For each scenario
Temporal->>Agent: Run scenario
Agent-->>Temporal: Output + trace
Temporal->>Judge Panel: Evaluate (3 judges in parallel)
Judge Panel-->>Temporal: Consensus scores
end
Temporal->>Temporal: Welch's t-test vs baseline
Temporal->>GitHub: Post Check Run + PR comment
Temporal->>AgentCI API: Update dashboard via WebSocket
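The signature check in the second step follows GitHub's documented scheme: the `X-Hub-Signature-256` header carries `sha256=` followed by the hex HMAC-SHA256 of the raw request body, keyed with the webhook secret. The sketch below shows that check with the standard library; the function name and the sample secret are placeholders, not AgentCI's actual handler.

```python
import hashlib
import hmac


def verify_github_signature(secret: str, body: bytes, signature_header: str) -> bool:
    """Compare the X-Hub-Signature-256 header against our own HMAC of the raw body."""
    if not signature_header.startswith("sha256="):
        return False
    expected = "sha256=" + hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    # compare_digest runs in constant time, which avoids leaking the digest via timing.
    return hmac.compare_digest(expected.encode(), signature_header.encode())


secret = "dev-webhook-secret"  # placeholder value
body = b'{"action": "opened"}'
good_header = "sha256=" + hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()

valid = verify_github_signature(secret, body, good_header)
tampered = verify_github_signature(secret, body + b" ", good_header)
print(valid, tampered)
```

The check has to run on the raw bytes as received; re-serializing parsed JSON first can change whitespace and invalidate an otherwise good signature.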
How the Judge Panel Works
              ┌─────────────┐
              │    Agent    │
              │   Output    │
              └──────┬──────┘
                     │
        ┌────────────┼────────────┐
        ▼            ▼            ▼
   ┌─────────┐  ┌─────────┐  ┌─────────┐
   │ GPT-4o  │  │ Claude  │  │ Gemini  │
   │ Judge 1 │  │ Judge 2 │  │ Judge 3 │
   └────┬────┘  └────┬────┘  └────┬────┘
        │            │            │
        └────────────┼────────────┘
                     ▼
            Median Aggregation
                     │
                IJA < 0.7?
                 ╱       ╲
              Yes         No
             ╱               ╲
      Tiebreaker         Final Score
        Judge            (consensus)
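Cross-family composition reduces self-enhancement bias, where a judge favors outputs from its own model family. The median, unlike the mean, resists a single outlier judge. When Inter-Judge Agreement (IJA) falls below the threshold (0.7 in the diagram), a tiebreaker judge is called in.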
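The flow above can be sketched in a few lines. Note the assumptions: the README does not define how IJA is computed, so this example uses one minus the spread of the judge scores, and it resolves a disagreement by adding the tiebreaker's score and taking the median again. Both choices are illustrative only.

```python
from statistics import median
from typing import Callable, Optional


def inter_judge_agreement(scores: list[float]) -> float:
    """Assumed definition: 1 minus the spread of scores on a 0-1 scale."""
    return 1.0 - (max(scores) - min(scores))


def consensus(
    scores: list[float],
    ija_threshold: float = 0.7,
    tiebreaker: Optional[Callable[[], float]] = None,
) -> float:
    """Median of the panel, with an extra judge when agreement is too low."""
    final = list(scores)
    if inter_judge_agreement(final) < ija_threshold and tiebreaker is not None:
        final.append(tiebreaker())  # hypothetical fourth judge
    return median(final)


agree = consensus([0.90, 0.88, 0.92])                           # IJA 0.96, no tiebreaker
split = consensus([0.90, 0.30, 0.85], tiebreaker=lambda: 0.80)  # IJA 0.40, tiebreaker added
print(agree, split)
```

In the second call the outlier score of 0.30 barely moves the result: the median of the three panel scores alone is 0.85, whereas their mean would be about 0.68.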
🔗 GitHub App
Install the GitHub App on your repository to get automatic evaluations on every pull request.
Once installed, AgentCI will:
- Receive webhook events when PRs are opened or updated
- Run your agent against all evaluation scenarios
- Judge the outputs using a 3-model consensus panel
- Post results as a Check Run and PR comment with full score breakdown
What You'll See on Your PR
AgentCI posts a detailed markdown report:
## 🔍 AgentCI Eval Report
**Commit:** `a1b2c3d` | **Suite:** `full` | **Duration:** 2m 34s
### 📊 Overall: ❌ FAILED (0.76)
| Scenario | Score | Baseline | Delta | Status |
|---------------|-------|----------|--------|-----------------|
| refund_policy | 0.92 | 0.88 | +0.04 | ✅ |
| safety_check | 0.97 | 0.95 | +0.02 | ✅ |
| hallucination | 0.45 | 0.91 | -0.46 | ❌ (p=0.003) |
### ❌ Failed Scenarios
<details>
<summary><b>hallucination</b> — Score: 0.45</summary>
- ❌ **no_hallucination**: 0.20
- ⚠️ **accuracy**: 0.55
- ✅ **helpfulness**: 0.85
</details>
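A report table like the one above is straightforward to generate from per-scenario results. The following is a rough sketch rather than AgentCI's reporter: the row fields (`scenario`, `score`, `baseline`, `regressed`, `p_value`) are hypothetical names chosen for the example.

```python
def render_report(rows: list[dict]) -> str:
    """Render per-scenario results as a markdown table for a PR comment."""
    lines = [
        "| Scenario | Score | Baseline | Delta | Status |",
        "|---|---|---|---|---|",
    ]
    for row in rows:
        delta = row["score"] - row["baseline"]
        if row["regressed"]:
            status = f"❌ (p={row['p_value']:.3f})"
        else:
            status = "✅"
        lines.append(
            f"| {row['scenario']} | {row['score']:.2f} | {row['baseline']:.2f} "
            f"| {delta:+.2f} | {status} |"
        )
    return "\n".join(lines)


rows = [
    {"scenario": "refund_policy", "score": 0.92, "baseline": 0.88, "regressed": False, "p_value": None},
    {"scenario": "hallucination", "score": 0.45, "baseline": 0.91, "regressed": True, "p_value": 0.003},
]
report = render_report(rows)
print(report)
```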
🐳 Self-Hosting
Prerequisites
- Docker & Docker Compose v2+
- At least one LLM API key (OpenAI, Anthropic, or Google)
- ngrok for webhook tunneling (development)
One-Command Deployment
# Clone and configure
git clone https://github.com/aaditya8979/AgentCI.git
cd AgentCI
cp .env.example .env
# Edit .env — set your API keys, webhook secret, etc.
# Start everything
cd docker
docker compose up -d --build
This starts 7 services:
| Service | Port | Purpose |
|---|---|---|
| API | 8000 | REST API + webhook receiver |
| Worker | — | Temporal activity executor |
| Dashboard | 3000 | Next.js real-time UI |
| PostgreSQL | 5432 | Eval runs, scenarios, baselines |
| Redis | 6379 | Pub/sub, caching, rate limiting |
| Temporal | 7233 | Workflow orchestration |
| Temporal UI | 8080 | Workflow inspector |
Health Check
curl http://localhost:8000/health | python3 -m json.tool
{
"status": "ok",
"checks": {
"api": "ok",
"database": "ok",
"redis": "ok",
"temporal": "ok"
}
}
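In a deployment script it can help to turn that response into a list of failing components. The sketch below parses a health payload of the shape shown above; the `degraded` and `down` values in the second sample are invented, since the README only shows the all-ok case.

```python
import json


def unhealthy_components(payload: str) -> list[str]:
    """Return the names of checks whose state is anything other than "ok"."""
    data = json.loads(payload)
    failing = [name for name, state in data.get("checks", {}).items() if state != "ok"]
    if data.get("status") != "ok" and not failing:
        failing.append("status")  # overall status is bad but no individual check explains it
    return failing


healthy = '{"status": "ok", "checks": {"api": "ok", "database": "ok", "redis": "ok", "temporal": "ok"}}'
# Hypothetical failure payload for illustration.
degraded = '{"status": "degraded", "checks": {"api": "ok", "database": "ok", "redis": "down", "temporal": "ok"}}'

print(unhealthy_components(healthy))
print(unhealthy_components(degraded))
```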
Connecting to GitHub
# Start a tunnel for webhooks
ngrok http 8000
# Run the verification script
./scripts/verify_webhook.sh
See the full Self-Hosting Guide for GitHub App creation, environment configuration, and production deployment.
📊 CLI Reference
# Run evaluation
agentci eval --agent src/agent.py --scenarios eval/scenarios.json --format rich
# JSON output for CI pipelines
agentci eval --agent src/agent.py --scenarios eval/scenarios.json --format json --output results.json
# Generate scenarios from a system prompt
agentci generate --prompt src/prompts/system.txt --count 10 --output eval/scenarios.json
# Compare two evaluation runs (regression detection)
agentci compare baseline.json current.json
# Check system status
agentci status
🔧 Configuration
Create a .agentci.yml in your repo root:
# .agentci.yml
version: "1"
agent_entry: src/agent.py        # Path to your agent
agent_function: run              # Function to call
scenarios_path: eval/scenarios   # Scenarios dir or file
num_runs: 3                      # Runs per scenario for stability
judges:
  models:
    - gpt-4o
    - claude-sonnet-4-20250514
    - gemini-2.5-pro
  temperature: 0.1
  ija_threshold: 0.7             # Tiebreaker if judges disagree
baselines:
  min_score: 0.85                # Minimum passing score
  comparison: last_5_runs        # Compare against recent history
  statistical_test: welch_t_test
  significance_level: 0.05
triggers:
  paths:
    - "**/*.py"                  # Only eval when Python files change
🧪 Testing
# Install dev dependencies
pip install -e ".[dev]"
# Run the full test suite (164 tests)
python -m pytest tests/ -v
# Run with coverage
python -m pytest tests/ --cov=agentci --cov-report=html
# Lint
ruff check src/ tests/
📦 Project Structure
AgentCI/
├── src/agentci/
│ ├── api/ # FastAPI server (webhook, REST, WebSocket)
│ │ ├── main.py # App lifecycle, middleware, health checks
│ │ ├── webhook.py # GitHub webhook handler (HMAC-SHA256)
│ │ ├── routes.py # REST API (/api/runs, /api/stats, /api/trends)
│ │ └── ws.py # WebSocket for live eval progress
│ ├── judge/ # LLM-as-a-Judge engine
│ │ ├── llm_judge.py # Single judge implementation
│ │ ├── async_judge.py # Async judge with cost tracking
│ │ ├── consensus.py # Multi-judge median consensus
│ │ └── async_consensus.py # Parallel consensus + tiered eval
│ ├── workflows/ # Temporal orchestration
│ │ ├── eval_workflow.py # EvalRunWorkflow + ScenarioEvalWorkflow
│ │ ├── activities.py # DB writes, agent runs, judge calls
│ │ └── worker.py # Worker with graceful shutdown
│ ├── db/ # PostgreSQL (asyncpg)
│ │ ├── connection.py # Singleton pool management
│ │ ├── queries.py # All SQL queries (typed)
│ │ └── migrations/ # Schema migrations
│ ├── stats/ # Statistical analysis
│ │ ├── significance.py # Welch's t-test, Cohen's d
│ │ └── baseline.py # Baseline comparison strategies
│ ├── reporter/ # Output formatting
│ │ ├── github.py # GitHub App client (JWT + installation tokens)
│ │ ├── markdown.py # PR comment generator
│ │ └── console.py # Rich terminal output
│ ├── cache/ # Redis + semantic caching
│ ├── runner/ # Agent execution sandbox
│ ├── models/ # Pydantic models
│ └── cli.py # Click CLI
├── dashboard/ # Next.js real-time dashboard
├── docker/ # Docker Compose stack
├── tests/ # 164 tests (unit + integration)
└── scripts/ # Deployment & verification scripts
🤝 Contributing
We welcome contributions! Please see our Contributing Guide for setup instructions, code style, and PR guidelines.
git clone https://github.com/aaditya8979/AgentCI.git
cd AgentCI
python -m venv .venv && source .venv/bin/activate
pip install -e ".[all]"
pytest tests/ -v
📄 License
AgentCI is released under the MIT License.