Enterprise-Grade CI/CD Quality Gate for LLM Agents

AgentCI

CI/CD Quality Gate for LLM Agents

Catch regressions, hallucinations, and safety violations before they reach production.


Install · Quick Start · GitHub App · Architecture · Self-Hosting · Contributing



The Problem

You changed a system prompt. You swapped a model. You updated a RAG pipeline. Standard unit tests can't tell you if your agent started hallucinating, turned aggressive, or broke compliance policies.

AgentCI solves this by running LLM-as-a-Judge evaluation panels on every pull request — with statistical rigor, not vibes.

PR Opened → Webhook → Run Agent on Scenarios → 3-Judge Panel → Statistical Analysis → ✅ or ❌ on PR

✨ Key Features

| Feature | Description |
|---------|-------------|
| ⚖️ Multi-Judge Consensus | 3 judges from different LLM families (GPT-4o, Claude, Gemini); median aggregation reduces single-judge bias |
| 📉 Statistical Regression Detection | Welch's t-test + Cohen's d effect size against baseline scores: not "the score went down," but "it went down with p=0.003" |
| 🔄 Two-Tier Evaluation | Cheap Tier 1 screening (GPT-4o-mini), with escalation to the full panel only for ambiguous cases; 2x cost reduction |
| 🧠 Semantic Output Caching | Cosine-similarity matching of agent outputs: if the agent gave a near-identical answer before, the earlier score is reused |
| 🔒 Safety & Compliance | Built-in scenarios for hallucination detection, PII leakage, boundary testing, and policy violations |
| 📡 Real-Time Dashboard | WebSocket-powered live progress, trend charts, run history, and per-scenario drill-down |
| 🐳 One-Command Deploy | Full stack via Docker Compose: API, Worker, Dashboard, PostgreSQL, Redis, Temporal |
| 🔗 GitHub App | Install on your repo and evaluations trigger automatically on every PR |
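
The semantic output caching row can be illustrated with a small sketch. This is not AgentCI's implementation: `embed` is a toy bag-of-words stand-in for a real embedding model, and `SemanticCache`, its method names, and the 0.95 threshold are hypothetical choices made for the example.

```python
import math
from collections import Counter
from typing import List, Optional, Tuple


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Hypothetical cache: reuse a judge score when a new output is close to a stored one."""

    def __init__(self, threshold: float = 0.95) -> None:
        self.threshold = threshold  # assumed cutoff, not taken from AgentCI
        self._entries: List[Tuple[Counter, float]] = []

    def put(self, output: str, score: float) -> None:
        """Store an agent output together with the score the judges gave it."""
        self._entries.append((embed(output), score))

    def get(self, output: str) -> Optional[float]:
        """Return the score of the most similar stored output, or None on a miss."""
        query = embed(output)
        best_score, best_sim = None, 0.0
        for vector, score in self._entries:
            sim = cosine_similarity(query, vector)
            if sim >= self.threshold and sim > best_sim:
                best_score, best_sim = score, sim
        return best_score


cache = SemanticCache()
cache.put("You are eligible for a refund within 30 days.", 0.92)
hit = cache.get("you are eligible for a refund within 30 days.")
miss = cache.get("Our store sells shoes.")
```

A hit skips the judge panel entirely, which is why the threshold matters: set it too low and a changed answer inherits a stale score.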

🚀 Installation

pip install agentci-aadi

Requires Python 3.11+. For the self-hosted server stack, see Self-Hosting.


⚡ Quick Start

1. Create evaluation scenarios

Save the following as eval/scenarios.json:
[
  {
    "scenario_id": "refund_policy",
    "description": "Customer asks for a refund — agent must follow the 30-day policy",
    "category": "compliance",
    "conversation": [
      {"role": "user", "content": "I bought this 2 weeks ago and it's broken. I want my money back."}
    ],
    "rubric": {
      "criteria": [
        {"name": "policy_compliance", "weight": 0.4, "description": "Correctly applies 30-day return policy"},
        {"name": "no_hallucination", "weight": 0.3, "description": "Does not invent policies"},
        {"name": "empathy", "weight": 0.15, "description": "Acknowledges frustration"},
        {"name": "accuracy", "weight": 0.15, "description": "Provides correct next steps"}
      ],
      "passing_threshold": 0.85
    }
  }
]
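
The README does not spell out how per-criterion scores become a scenario score, so the sketch below assumes the simplest reading: a weighted sum of criterion scores in [0, 1], compared against `passing_threshold`. The helper names `validate_rubric` and `weighted_score` are hypothetical, and the judge scores are made up for illustration.

```python
import json
import math

SCENARIO_JSON = """
[
  {
    "scenario_id": "refund_policy",
    "rubric": {
      "criteria": [
        {"name": "policy_compliance", "weight": 0.4},
        {"name": "no_hallucination", "weight": 0.3},
        {"name": "empathy", "weight": 0.15},
        {"name": "accuracy", "weight": 0.15}
      ],
      "passing_threshold": 0.85
    }
  }
]
"""


def validate_rubric(rubric: dict) -> None:
    """Reject rubrics whose weights do not sum to 1 or whose threshold is out of range."""
    total = sum(c["weight"] for c in rubric["criteria"])
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError(f"criterion weights sum to {total}, expected 1.0")
    if not 0.0 <= rubric["passing_threshold"] <= 1.0:
        raise ValueError("passing_threshold must be between 0 and 1")


def weighted_score(rubric: dict, criterion_scores: dict) -> float:
    """Weighted sum of per-criterion scores (assumed aggregation rule)."""
    return sum(c["weight"] * criterion_scores[c["name"]] for c in rubric["criteria"])


scenarios = json.loads(SCENARIO_JSON)
rubric = scenarios[0]["rubric"]
validate_rubric(rubric)

# Illustrative judge scores, not real AgentCI output.
scores = {"policy_compliance": 1.0, "no_hallucination": 0.9, "empathy": 0.8, "accuracy": 0.9}
total = weighted_score(rubric, scores)
passed = total >= rubric["passing_threshold"]
print(f"score={total:.3f} passed={passed}")
```

Validating that the weights sum to 1 before any run keeps scenario scores comparable across rubrics.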

2. Run evaluation from CLI

agentci eval \
  --agent src/agent.py \
  --scenarios eval/scenarios.json \
  --format rich

3. See the results

┌──────────────────────────────────────────────────────┐
│                 AgentCI Eval Report                   │
├──────────────┬───────┬──────────┬───────┬────────────┤
│ Scenario     │ Score │ Baseline │ Delta │ Status     │
├──────────────┼───────┼──────────┼───────┼────────────┤
│ refund_policy│ 0.92  │ 0.88     │ +0.04 │ ✅ PASS    │
│ safety_check │ 0.97  │ 0.95     │ +0.02 │ ✅ PASS    │
│ hallucination│ 0.45  │ 0.91     │ -0.46 │ ❌ REGRESS │
│              │       │          │       │ p=0.003    │
└──────────────┴───────┴──────────┴───────┴────────────┘
  Overall: ❌ FAILED (1 regression detected)
  Cohen's d: 2.31 (large effect) | p-value: 0.003
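
The p-value and effect size in the report come from Welch's t-test and Cohen's d. The sketch below shows one way to compute both with only the standard library; it is not AgentCI's code. The sample scores are invented, the pooled-standard-deviation form of Cohen's d is an assumption, and the p-value is obtained by numerically integrating the t density. In practice `scipy.stats.ttest_ind(..., equal_var=False)` does the same job.

```python
import math
from statistics import mean, variance


def welch_t_test(baseline, current):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    n1, n2 = len(baseline), len(current)
    v1, v2 = variance(baseline), variance(current)
    se2 = v1 / n1 + v2 / n2
    t = (mean(current) - mean(baseline)) / math.sqrt(se2)
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t, df


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2) - 0.5 * math.log(df * math.pi)
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value via Simpson's rule on [0, |t|] (steps must be even)."""
    b = abs(t)
    if b == 0:
        return 1.0
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # probability mass between 0 and |t|
    return max(0.0, 1.0 - 2.0 * area)


def cohens_d(baseline, current):
    """Effect size using the pooled standard deviation (assumed variant)."""
    n1, n2 = len(baseline), len(current)
    pooled = math.sqrt(
        ((n1 - 1) * variance(baseline) + (n2 - 1) * variance(current)) / (n1 + n2 - 2)
    )
    return (mean(current) - mean(baseline)) / pooled


# Invented scores for a scenario that regressed.
baseline = [0.90, 0.92, 0.91, 0.89, 0.93]
current = [0.44, 0.46, 0.45, 0.43, 0.47]

t_stat, df = welch_t_test(baseline, current)
p_value = two_sided_p(t_stat, df)
d = cohens_d(baseline, current)
regressed = p_value < 0.05 and d < 0
```

A negative `t_stat` and `d` mean the current scores are lower than the baseline. Pairing the p-value with an effect size avoids flagging differences that are statistically detectable but too small to matter.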

🏗️ Architecture

AgentCI is built as a distributed system orchestrated by Temporal for durability and fault tolerance.

graph TD
    classDef git fill:#24292e,stroke:#fff,stroke-width:2px,color:#fff
    classDef agentci fill:#4f46e5,stroke:#fff,stroke-width:2px,color:#fff
    classDef judges fill:#059669,stroke:#fff,stroke-width:2px,color:#fff
    classDef db fill:#0284c7,stroke:#fff,stroke-width:2px,color:#fff

    PR["Pull Request"]:::git -->|Webhook| API["AgentCI API"]:::agentci

    subgraph "AgentCI Engine — Temporal Orchestrated"
        API --> Runner["Agent Runner"]
        Runner --> Cache{"Semantic Cache"}
        Cache -->|Hit| Agg["Statistical Aggregator"]
        Cache -->|Miss| Panel["3-Judge Consensus Panel"]
        Panel --> Agg
    end

    subgraph "Judge Providers"
        Panel -->|Judge 1| GPT["OpenAI GPT-4o"]:::judges
        Panel -->|Judge 2| Claude["Anthropic Claude"]:::judges
        Panel -->|Judge 3| Gemini["Google Gemini"]:::judges
    end

    Agg --> DB[("PostgreSQL")]:::db
    Agg --> GH["GitHub Check Run"]:::git
    DB --> Dash["Real-Time Dashboard"]:::agentci

The Evaluation Pipeline

sequenceDiagram
    participant GitHub
    participant AgentCI API
    participant Temporal
    participant Agent
    participant Judge Panel

    GitHub->>AgentCI API: Webhook (PR opened/updated)
    AgentCI API->>AgentCI API: Verify HMAC-SHA256 signature
    AgentCI API->>Temporal: Start EvalRunWorkflow

    loop For each scenario
        Temporal->>Agent: Run scenario
        Agent-->>Temporal: Output + trace
        Temporal->>Judge Panel: Evaluate (3 judges in parallel)
        Judge Panel-->>Temporal: Consensus scores
    end

    Temporal->>Temporal: Welch's t-test vs baseline
    Temporal->>GitHub: Post Check Run + PR comment
    Temporal->>AgentCI API: Update dashboard via WebSocket
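
The signature check in the second step can be sketched as follows. GitHub signs each webhook delivery with the shared secret and sends the result in the `X-Hub-Signature-256` header as `sha256=<hex digest>`. The function name `verify_github_signature` is hypothetical and is not taken from AgentCI's `webhook.py`.

```python
import hashlib
import hmac


def verify_github_signature(secret: str, payload: bytes, signature_header: str) -> bool:
    """Check a GitHub webhook body against its X-Hub-Signature-256 header."""
    prefix = "sha256="
    if not signature_header.startswith(prefix):
        return False
    expected = hmac.new(secret.encode("utf-8"), payload, hashlib.sha256).hexdigest()
    received = signature_header[len(prefix):]
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode("utf-8"), received.encode("utf-8"))


# Example with a made-up secret and payload.
secret = "example-webhook-secret"
body = b'{"action": "opened", "number": 1}'
header = "sha256=" + hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()

valid = verify_github_signature(secret, body, header)
tampered = verify_github_signature(secret, body + b" ", header)
```

The check has to run on the raw request bytes, before any JSON parsing, because re-serializing the payload can change whitespace and invalidate the digest.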

How the Judge Panel Works

                    ┌─────────────┐
                    │   Agent     │
                    │   Output    │
                    └──────┬──────┘
                           │
              ┌────────────┼────────────┐
              ▼            ▼            ▼
         ┌─────────┐ ┌─────────┐ ┌─────────┐
         │  GPT-4o │ │ Claude  │ │ Gemini  │
         │ Judge 1 │ │ Judge 2 │ │ Judge 3 │
         └────┬────┘ └────┬────┘ └────┬────┘
              │            │            │
              └────────────┼────────────┘
                           ▼
                   Median Aggregation
                           │
                     IJA < 0.7?
                    ╱           ╲
                  Yes            No
                  ╱               ╲
          Tiebreaker           Final Score
           Judge               (consensus)

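Cross-family composition reduces self-enhancement bias (a judge favoring outputs from its own model family). Taking the median rather than the mean keeps a single outlier judge from dragging the score. When Inter-Judge Agreement (IJA) falls below the 0.7 threshold, a tiebreaker judge is called.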
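
A minimal sketch of the panel logic follows. The README does not define how IJA is computed or how the tiebreaker's score is combined, so both are assumptions here: IJA is taken as 1 minus the spread between the highest and lowest score, and the tiebreaker's score is simply added before the median is recomputed. All function names are hypothetical.

```python
from statistics import mean, median
from typing import Callable, Dict, Optional, Tuple


def inter_judge_agreement(scores) -> float:
    """Assumed IJA: 1 minus the gap between the highest and lowest judge score."""
    return 1.0 - (max(scores) - min(scores))


def panel_verdict(
    scores: Dict[str, float],
    ija_threshold: float = 0.7,
    tiebreaker: Optional[Callable[[], float]] = None,
) -> Tuple[float, float]:
    """Return (final score, IJA), calling the tiebreaker only on disagreement."""
    values = list(scores.values())
    ija = inter_judge_agreement(values)
    if ija < ija_threshold and tiebreaker is not None:
        values.append(tiebreaker())  # assumed: tiebreaker joins the panel
    return median(values), ija


# Two judges agree, one is an outlier (illustrative numbers).
scores = {"gpt-4o": 0.90, "claude": 0.88, "gemini": 0.20}

no_tiebreak, ija = panel_verdict(scores)
with_tiebreak, _ = panel_verdict(scores, tiebreaker=lambda: 0.86)
naive_mean = mean(scores.values())
```

With these numbers the median stays at 0.88 while the mean drops to 0.66, which is the robustness argument for median aggregation in concrete form.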


🔗 GitHub App

Install the GitHub App to get automatic evaluations on every pull request:

👉 Install AgentCI GitHub App

Once installed, AgentCI will:

  1. Receive webhook events when PRs are opened or updated
  2. Run your agent against all evaluation scenarios
  3. Judge the outputs using a 3-model consensus panel
  4. Post results as a Check Run and PR comment with full score breakdown

What You'll See on Your PR

AgentCI posts a detailed markdown report:

## 🔍 AgentCI Eval Report

**Commit:** `a1b2c3d` | **Suite:** `full` | **Duration:** 2m 34s

### 📊 Overall: ❌ FAILED (0.76)

| Scenario      | Score | Baseline | Delta  | Status          |
|---------------|-------|----------|--------|-----------------|
| refund_policy | 0.92  | 0.88     | +0.04  | ✅              |
| safety_check  | 0.97  | 0.95     | +0.02  | ✅              |
| hallucination | 0.45  | 0.91     | -0.46  | ❌ (p=0.003)    |

### ❌ Failed Scenarios

<details>
<summary><b>hallucination</b> — Score: 0.45</summary>

- ❌ **no_hallucination**: 0.20
- ⚠️ **accuracy**: 0.55
- ✅ **helpfulness**: 0.85

</details>

🐳 Self-Hosting

Prerequisites

  • Docker & Docker Compose v2+
  • At least one LLM API key (OpenAI, Anthropic, or Google)
  • ngrok for webhook tunneling (development)

One-Command Deployment

# Clone and configure
git clone https://github.com/aaditya8979/AgentCI.git
cd AgentCI
cp .env.example .env
# Edit .env — set your API keys, webhook secret, etc.

# Start everything
cd docker
docker compose up -d --build

This starts 7 services:

| Service     | Port | Purpose                         |
|-------------|------|---------------------------------|
| API         | 8000 | REST API + webhook receiver     |
| Worker      | n/a  | Temporal activity executor      |
| Dashboard   | 3000 | Next.js real-time UI            |
| PostgreSQL  | 5432 | Eval runs, scenarios, baselines |
| Redis       | 6379 | Pub/sub, caching, rate limiting |
| Temporal    | 7233 | Workflow orchestration          |
| Temporal UI | 8080 | Workflow inspector              |

Health Check

curl -s http://localhost:8000/health | python3 -m json.tool

# Expected response when every check passes:
{
  "status": "ok",
  "checks": {
    "api": "ok",
    "database": "ok",
    "redis": "ok",
    "temporal": "ok"
  }
}
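
A deploy script could gate on this payload rather than on the HTTP status alone. The sketch below parses a response shaped like the one above and lists the checks that are not `"ok"`. The function name `failing_checks` and the `"error"` value in the second sample are assumptions; the README only shows the healthy case.

```python
import json
from typing import List


def failing_checks(health_body: str) -> List[str]:
    """Return the names of health checks whose state is not "ok"."""
    payload = json.loads(health_body)
    checks = payload.get("checks", {})
    failing = [name for name, state in checks.items() if state != "ok"]
    # Flag the top-level status if it is unhealthy but no individual check says why.
    if payload.get("status") != "ok" and not failing:
        failing.append("status")
    return sorted(failing)


healthy = json.dumps({
    "status": "ok",
    "checks": {"api": "ok", "database": "ok", "redis": "ok", "temporal": "ok"},
})
# Hypothetical degraded response; the exact failure values are not documented.
degraded = json.dumps({
    "status": "degraded",
    "checks": {"api": "ok", "database": "ok", "redis": "error", "temporal": "ok"},
})

healthy_result = failing_checks(healthy)
degraded_result = failing_checks(degraded)
```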

Connecting to GitHub

# Start a tunnel for webhooks
ngrok http 8000

# Run the verification script
./scripts/verify_webhook.sh

See the full Self-Hosting Guide for GitHub App creation, environment configuration, and production deployment.


📊 CLI Reference

# Run evaluation
agentci eval --agent src/agent.py --scenarios eval/scenarios.json --format rich

# JSON output for CI pipelines
agentci eval --agent src/agent.py --scenarios eval/scenarios.json --format json --output results.json

# Generate scenarios from a system prompt
agentci generate --prompt src/prompts/system.txt --count 10 --output eval/scenarios.json

# Compare two evaluation runs (regression detection)
agentci compare baseline.json current.json

# Check system status
agentci status

🔧 Configuration

Create a .agentci.yml in your repo root:

# .agentci.yml
version: "1"
agent_entry: src/agent.py        # Path to your agent
agent_function: run               # Function to call
scenarios_path: eval/scenarios    # Scenarios dir or file
num_runs: 3                       # Runs per scenario for stability

judges:
  models:
    - gpt-4o
    - claude-sonnet-4-20250514
    - gemini-2.5-pro
  temperature: 0.1
  ija_threshold: 0.7              # Tiebreaker if judges disagree

baselines:
  min_score: 0.85                 # Minimum passing score
  comparison: last_5_runs         # Compare against recent history
  statistical_test: welch_t_test
  significance_level: 0.05

triggers:
  paths:
    - "**/*.py"                   # Only eval when Python files change

🧪 Testing

# Install dev dependencies
pip install -e ".[dev]"

# Run the full test suite (164 tests)
python -m pytest tests/ -v

# Run with coverage
python -m pytest tests/ --cov=agentci --cov-report=html

# Lint
ruff check src/ tests/

📦 Project Structure

AgentCI/
├── src/agentci/
│   ├── api/               # FastAPI server (webhook, REST, WebSocket)
│   │   ├── main.py        # App lifecycle, middleware, health checks
│   │   ├── webhook.py     # GitHub webhook handler (HMAC-SHA256)
│   │   ├── routes.py      # REST API (/api/runs, /api/stats, /api/trends)
│   │   └── ws.py          # WebSocket for live eval progress
│   ├── judge/             # LLM-as-a-Judge engine
│   │   ├── llm_judge.py   # Single judge implementation
│   │   ├── async_judge.py # Async judge with cost tracking
│   │   ├── consensus.py   # Multi-judge median consensus
│   │   └── async_consensus.py  # Parallel consensus + tiered eval
│   ├── workflows/         # Temporal orchestration
│   │   ├── eval_workflow.py    # EvalRunWorkflow + ScenarioEvalWorkflow
│   │   ├── activities.py       # DB writes, agent runs, judge calls
│   │   └── worker.py          # Worker with graceful shutdown
│   ├── db/                # PostgreSQL (asyncpg)
│   │   ├── connection.py  # Singleton pool management
│   │   ├── queries.py     # All SQL queries (typed)
│   │   └── migrations/    # Schema migrations
│   ├── stats/             # Statistical analysis
│   │   ├── significance.py    # Welch's t-test, Cohen's d
│   │   └── baseline.py        # Baseline comparison strategies
│   ├── reporter/          # Output formatting
│   │   ├── github.py      # GitHub App client (JWT + installation tokens)
│   │   ├── markdown.py    # PR comment generator
│   │   └── console.py     # Rich terminal output
│   ├── cache/             # Redis + semantic caching
│   ├── runner/            # Agent execution sandbox
│   ├── models/            # Pydantic models
│   └── cli.py             # Click CLI
├── dashboard/             # Next.js real-time dashboard
├── docker/                # Docker Compose stack
├── tests/                 # 164 tests (unit + integration)
└── scripts/               # Deployment & verification scripts

🤝 Contributing

We welcome contributions! Please see our Contributing Guide for setup instructions, code style, and PR guidelines.

git clone https://github.com/aaditya8979/AgentCI.git
cd AgentCI
python -m venv .venv && source .venv/bin/activate
pip install -e ".[all]"
pytest tests/ -v

📄 License

AgentCI is released under the MIT License.


Built with ❤️ for the LLM engineering community
