Open-source CI contract testing for tool-using AI agents.
AgentGuard CI
AgentGuard CI helps teams block risky prompt, tool, and agent changes before they reach production. Define trace-level contracts, mock tools, run impacted tests only, enforce latency/token budgets, and fail CI when agent behavior regresses.
Why AgentGuard CI Exists
Final-answer evals are not enough for production agents. A response can look acceptable while the trace is unsafe: the wrong tool was called, a required policy lookup was skipped, a refund tool ran before confirmation, or token usage doubled.
AgentGuard CI focuses on deterministic engineering contracts:
- Required and forbidden tool calls.
- Tool call ordering.
- Routing decisions.
- Structured output schemas.
- Latency, token, cost, and tool-call budgets.
- Mocked tool results for safe offline CI.
- Baseline comparison for regression prevention.
LLM-as-judge is available as an optional path, but deterministic assertions are the default.
Installation
pip install agentguard
For local development from this repository:
pip install -e ".[dev]"
OpenAI judge support is optional:
pip install "agentguard[openai]"
Quick Start
agentguard init
agentguard test
Starter test:
suite: "calendar-agent"
tests:
  - id: "uses_calendar_tool"
    input: "Book a meeting tomorrow at 5 PM."
    assert:
      trace:
        must_call:
          - tool: "calendar.create_event"
      output:
        contains:
          - "meeting"
CLI
agentguard --help
agentguard --version
agentguard init
agentguard test
agentguard test --config agentguard.yml
agentguard test --suite calendar-agent
agentguard test --case uses_calendar_tool
agentguard test --tag smoke
agentguard test --changed-only --base origin/main
agentguard test --report json --report junit
agentguard test --fail-fast
agentguard test --update-baseline
agentguard test --strict
agentguard list
agentguard diff
agentguard baseline update
agentguard baseline list
agentguard validate
python -m agentguard --help
Exit codes:
- 0: all blocking tests passed.
- 1: at least one blocking test failed.
- 2: configuration error.
- 3: agent runtime error.
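A CI wrapper script may want to tell a failed contract apart from a broken setup. The sketch below only restates the exit codes listed above; the helper names `describe_exit_code` and `is_setup_problem` are hypothetical and not part of AgentGuard.

```python
# Exit-code meanings copied from the list above.
EXIT_CODES = {
    0: "all blocking tests passed",
    1: "at least one blocking test failed",
    2: "configuration error",
    3: "agent runtime error",
}


def describe_exit_code(code: int) -> str:
    """Return a human-readable meaning for an exit code (hypothetical helper)."""
    return EXIT_CODES.get(code, f"unexpected exit code {code}")


def is_setup_problem(code: int) -> bool:
    """True when the run broke before contracts could be judged (codes 2 and 3)."""
    return code in (2, 3)
```

A wrapper could, for example, retry or page an owner on a setup problem while treating exit code 1 as a genuine regression.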
Agent Entrypoint Contract
Your agent exposes a sync or async Python function:
from agentguard import AgentRequest, AgentResult, TraceEvent, Usage


async def run_agent(request: AgentRequest) -> AgentResult:
    # Record every tool call so trace contracts can inspect it.
    trace = [
        TraceEvent(
            type="tool_call",
            name="calendar.create_event",
            input={"time": "5 PM"},
        )
    ]
    return AgentResult(
        output="Meeting booked for 5 PM.",
        trace=trace,
        usage=Usage(total_tokens=512, latency_ms=1200),
    )
Configure it in agentguard.yml:
version: "0.1"
project:
  name: "customer-support-agents"
agent:
  entrypoint: "my_package.agent:run_agent"
  timeout_seconds: 30
paths:
  tests: "agentguard-tests"
  baselines: ".agentguard/baselines"
  reports: ".agentguard/reports"
mocks:
  require_registered: false
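The `entrypoint` value follows the common `module:function` convention. As a rough illustration of how such a string can be resolved and how a sync or async function can be called under a timeout, here is a minimal standard-library sketch. `load_entrypoint` and `call_entrypoint` are hypothetical names; AgentGuard's own loader may work differently.

```python
import asyncio
import importlib
import inspect
from typing import Any, Callable


def load_entrypoint(spec: str) -> Callable[..., Any]:
    """Resolve a 'module:function' string to a callable (hypothetical helper)."""
    module_name, sep, attr = spec.partition(":")
    if not sep or not module_name or not attr:
        raise ValueError(f"expected 'module:function', got {spec!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attr)


def call_entrypoint(func: Callable[..., Any], request: Any, timeout_seconds: float) -> Any:
    """Call a sync or async entrypoint.

    Assumption: the timeout is enforced for coroutines only; a sync function
    runs to completion in this sketch.
    """
    if inspect.iscoroutinefunction(func):
        return asyncio.run(asyncio.wait_for(func(request), timeout_seconds))
    return func(request)
```

With the configuration above, `load_entrypoint("my_package.agent:run_agent")` would import `my_package.agent` and return its `run_agent` function.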
Tool Mocking
AgentGuard does not monkeypatch tools in the MVP. The agent voluntarily reads mocks from the
request or uses get_mock.
mocks:
  orders.get_order:
    match_args:
      order_id: "ORD-123"
    output:
      order_id: "ORD-123"
      status: "delivered"

from agentguard import get_mock

tool_output = get_mock(
    request,
    "orders.get_order",
    args={"order_id": "ORD-123"},
)
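To make the matching idea concrete, the sketch below looks up a mock from a plain dictionary shaped like the YAML above. It is not the `get_mock` implementation: `find_mock` is a hypothetical helper, and treating `match_args` as a subset match (extra call arguments are ignored) is an assumption.

```python
from typing import Any, Dict, Optional

# Same data as the YAML mock definition above, as a plain dict.
MOCKS: Dict[str, Dict[str, Any]] = {
    "orders.get_order": {
        "match_args": {"order_id": "ORD-123"},
        "output": {"order_id": "ORD-123", "status": "delivered"},
    }
}


def find_mock(mocks: Dict[str, Dict[str, Any]], tool: str, args: Dict[str, Any]) -> Optional[Any]:
    """Return the mocked output when the tool name and match_args agree, else None."""
    entry = mocks.get(tool)
    if entry is None:
        return None
    expected = entry.get("match_args", {})
    # Assumption: every key in match_args must equal the call argument.
    if all(args.get(key) == value for key, value in expected.items()):
        return entry["output"]
    return None
```

An agent could call such a helper first and fall back to the real tool only when it returns `None`, which keeps CI runs offline.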
Assertions
Trace assertions:
assert:
  trace:
    must_call:
      - tool: "orders.get_order"
    must_not_call:
      - tool: "payments.issue_refund"
    ordered:
      - tool: "orders.get_order"
      - tool: "policy.lookup_refund_policy"
    max_tool_calls: 5
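The trace checks above are deterministic list operations. The following sketch shows one way they could be evaluated over the ordered list of tool names taken from a trace. `check_trace` is a hypothetical function, and reading `ordered` as a subsequence check (other calls may be interleaved) is an assumption about the intended semantics.

```python
from typing import List, Optional, Sequence


def check_trace(
    tools: List[str],
    must_call: Sequence[str] = (),
    must_not_call: Sequence[str] = (),
    ordered: Sequence[str] = (),
    max_tool_calls: Optional[int] = None,
) -> List[str]:
    """Return a list of failure messages; an empty list means the trace passes."""
    failures = []
    for tool in must_call:
        if tool not in tools:
            failures.append(f"must_call: {tool} was never called")
    for tool in must_not_call:
        if tool in tools:
            failures.append(f"must_not_call: {tool} was called")
    # Subsequence check: each `in` consumes the iterator up to the match,
    # so later tools must appear after earlier ones.
    remaining = iter(tools)
    if not all(tool in remaining for tool in ordered):
        failures.append("ordered: tools were not called in the required order")
    if max_tool_calls is not None and len(tools) > max_tool_calls:
        failures.append(f"max_tool_calls: {len(tools)} > {max_tool_calls}")
    return failures
```

For a refund agent, this is the kind of check that catches `payments.issue_refund` running at all, or the policy lookup happening before the order lookup.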
Output assertions:
assert:
  output:
    contains:
      - "confirmation"
    not_contains:
      - "refunded"
    json_schema:
      type: object
      required: ["category"]
      properties:
        category:
          type: string
          enum: ["billing", "technical", "account", "other"]
    jsonpath:
      - path: "$.category"
        equals: "billing"
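As an illustration of what these output checks amount to, here is a standard-library sketch. It is deliberately simplified: `check_output` is a hypothetical function, substring matching is assumed to be case-sensitive, and `json_equals` only handles top-level keys (so `$.category` becomes the key `category`) rather than full JSONPath or JSON Schema.

```python
import json
from typing import Any, Dict, List, Optional, Sequence


def check_output(
    output: str,
    contains: Sequence[str] = (),
    not_contains: Sequence[str] = (),
    json_equals: Optional[Dict[str, Any]] = None,
) -> List[str]:
    """Return failure messages for substring and top-level JSON equality checks."""
    failures = []
    for needle in contains:
        if needle not in output:
            failures.append(f"contains: missing {needle!r}")
    for needle in not_contains:
        if needle in output:
            failures.append(f"not_contains: found {needle!r}")
    if json_equals:
        try:
            data = json.loads(output)
        except json.JSONDecodeError:
            return failures + ["output is not valid JSON"]
        for key, expected in json_equals.items():
            actual = data.get(key) if isinstance(data, dict) else None
            if actual != expected:
                failures.append(f"$.{key}: expected {expected!r}, got {actual!r}")
    return failures
```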
Budget assertions:
assert:
  budgets:
    max_latency_ms: 5000
    max_total_tokens: 3000
    max_cost_usd: 0.05
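A budget check is a comparison of reported usage against each configured limit. The sketch below uses plain dictionaries; `check_budgets` is a hypothetical function, and the `cost_usd` usage field is an assumption, since the `Usage` example earlier only shows `total_tokens` and `latency_ms`. Metrics the agent does not report are skipped rather than failed, which is also an assumption.

```python
from typing import Dict, List

# Maps each budget key from the YAML above to the usage field it limits.
# "cost_usd" is an assumed field name.
METRIC_FOR_LIMIT = {
    "max_latency_ms": "latency_ms",
    "max_total_tokens": "total_tokens",
    "max_cost_usd": "cost_usd",
}


def check_budgets(usage: Dict[str, float], budgets: Dict[str, float]) -> List[str]:
    """Return one failure message per exceeded budget."""
    failures = []
    for limit_key, metric in METRIC_FOR_LIMIT.items():
        limit = budgets.get(limit_key)
        value = usage.get(metric)
        if limit is not None and value is not None and value > limit:
            failures.append(f"{metric}={value} exceeds {limit_key}={limit}")
    return failures
```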
Impact-Aware Testing
Create agentguard-impact.yml:
mappings:
  - files:
      - "agents/refund/**"
      - "prompts/refund/**"
      - "tools/refund.py"
    tests:
      - "refund-agent"
      - "refund-regression"
Run impacted suites only:
agentguard test --changed-only --base origin/main
If no mapping matches, AgentGuard runs smoke-tagged tests when present.
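The mapping logic can be pictured as glob matching of changed file paths against each mapping, with a smoke fallback when nothing matches. This is a sketch under assumptions: `impacted_suites` is a hypothetical function, patterns are matched with `fnmatch.fnmatchcase` (where `*` also matches `/`), and the list of changed files is assumed to come from something like `git diff --name-only` against the base ref.

```python
from fnmatch import fnmatchcase
from typing import Dict, List, Sequence

# Same data as agentguard-impact.yml above, as plain Python.
MAPPINGS = [
    {
        "files": ["agents/refund/**", "prompts/refund/**", "tools/refund.py"],
        "tests": ["refund-agent", "refund-regression"],
    }
]


def impacted_suites(changed_files: Sequence[str], mappings: List[Dict[str, List[str]]]) -> List[str]:
    """Return suites whose file patterns match at least one changed file."""
    suites: List[str] = []
    for mapping in mappings:
        hit = any(
            fnmatchcase(path, pattern)
            for path in changed_files
            for pattern in mapping["files"]
        )
        if hit:
            for suite in mapping["tests"]:
                if suite not in suites:  # keep order, avoid duplicates
                    suites.append(suite)
    return suites


def plan_run(changed_files: Sequence[str], mappings: List[Dict[str, List[str]]]) -> Dict[str, object]:
    """Pick impacted suites, or fall back to smoke-tagged tests when none match."""
    suites = impacted_suites(changed_files, mappings)
    if suites:
        return {"suites": suites, "tag": None}
    return {"suites": [], "tag": "smoke"}
```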
Baseline Regression Testing
Store baselines for passing tests:
agentguard test --update-baseline
Enable baseline comparison:
assert:
  regression:
    compare_to_baseline: true
    allowed_output_similarity_drop: 0.1
    allowed_extra_tool_calls: 1
    allowed_latency_increase_pct: 30
Baselines are stored under:
.agentguard/baselines/{suite}/{test_id}.json
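To show how the three tolerances could be applied, here is a minimal comparison sketch. The record shape (`output`, `tool_calls`, `latency_ms`), the function names, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions; the source does not say which similarity metric AgentGuard uses.

```python
from difflib import SequenceMatcher
from pathlib import PurePosixPath
from typing import Dict, List


def baseline_path(suite: str, test_id: str, root: str = ".agentguard/baselines") -> PurePosixPath:
    """Build the baseline location using the {suite}/{test_id}.json layout above."""
    return PurePosixPath(root) / suite / f"{test_id}.json"


def compare_to_baseline(
    baseline: Dict[str, object],
    current: Dict[str, object],
    allowed_output_similarity_drop: float = 0.1,
    allowed_extra_tool_calls: int = 1,
    allowed_latency_increase_pct: float = 30,
) -> List[str]:
    """Return failure messages for each tolerance that the current run exceeds."""
    failures = []
    # Assumed metric: character-level similarity ratio in [0, 1].
    similarity = SequenceMatcher(None, baseline["output"], current["output"]).ratio()
    if 1.0 - similarity > allowed_output_similarity_drop:
        failures.append(f"output similarity dropped to {similarity:.2f}")
    extra_calls = current["tool_calls"] - baseline["tool_calls"]
    if extra_calls > allowed_extra_tool_calls:
        failures.append(f"{extra_calls} extra tool calls")
    if baseline["latency_ms"] > 0:
        increase_pct = (current["latency_ms"] - baseline["latency_ms"]) * 100 / baseline["latency_ms"]
        if increase_pct > allowed_latency_increase_pct:
            failures.append(f"latency increased by {increase_pct:.0f}%")
    return failures
```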
Reports
AgentGuard prints a terminal report and can write machine-readable reports:
agentguard test --report json
agentguard test --report junit
Outputs:
.agentguard/reports/report.json
.agentguard/reports/junit.xml
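Downstream tooling can read the JUnit file with the standard library. The sketch below assumes the report follows the conventional JUnit layout (`testsuite` elements containing `testcase` elements with optional `failure` or `error` children); the sample XML is made up for illustration and is not actual AgentGuard output.

```python
import xml.etree.ElementTree as ET
from typing import List

# Hand-written sample in the conventional JUnit shape (not real AgentGuard output).
SAMPLE_JUNIT = """<testsuite name="calendar-agent" tests="2" failures="1">
  <testcase classname="calendar-agent" name="uses_calendar_tool"/>
  <testcase classname="calendar-agent" name="stays_in_budget">
    <failure message="latency_ms=6000 exceeds max_latency_ms=5000"/>
  </testcase>
</testsuite>"""


def failed_cases(junit_xml: str) -> List[str]:
    """Return the names of test cases that contain a failure or error element."""
    root = ET.fromstring(junit_xml)
    failed = []
    for case in root.iter("testcase"):
        if case.find("failure") is not None or case.find("error") is not None:
            failed.append(case.get("name"))
    return failed
```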
GitHub Actions
name: AgentGuard CI
on:
  pull_request:
    branches: [main]
jobs:
  agentguard:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install agentguard
      - name: Run impacted agent tests
        run: |
          agentguard test --changed-only --base origin/main --report junit
Branch protection setup:
- Open repository settings.
- Enable branch protection for main.
- Require status checks before merging.
- Select the AgentGuard CI workflow.
Optional LLM Judge
LLM judge tests are opt-in and should be used selectively because provider calls can be slower, more expensive, and less deterministic than trace contracts.
assert:
  llm_judge:
    enabled: true
    rubric: |
      Score whether the answer correctly explains the refund policy and does not invent
      unsupported exceptions.
    threshold: 0.85
Use the fake provider for deterministic local tests or configure OpenAI with OPENAI_API_KEY.
Example
cd examples/simple_agent
agentguard test
The example agent uses a mocked calendar.create_event tool and validates the trace, output,
latency, token budget, and tool-call budget.
Roadmap
v0.1 Alpha
- YAML tests.
- CLI runner.
- Python agent entrypoint.
- Trace assertions.
- Output assertions.
- Budgets.
- Mocks.
- Reports.
- GitHub Actions example.
v0.2 Core Hardening
- Public model contract cleanup.
- Complete deterministic assertion coverage.
- Strict validation mode.
- Clear domain exceptions and failure messages.
- Expanded unit coverage.
v0.3 Baselines and Reports
- Versioned baseline artifacts.
- Baseline diff improvements.
- Stable JSON report schema.
- Improved JUnit output.
- Baseline CLI subcommands.
v0.4 Mocking and Impact
- Stricter mock argument matching.
- Registered mock failure mode.
- Hardened changed-only behavior.
- Smoke fallback rules.
v0.5 Adapters
- Default custom Python adapter.
- OpenAI Agents SDK adapter.
- LangChain adapter.
- Optional extras and adapter docs.
v1.0 Release Candidate
- Frozen public models and JSON report schema.
- Complete docs.
- Clean install verification.
- TestPyPI package validation.
- Stable local and CI runtime behavior.
Contributing
Keep the open-source core CI-first, deterministic by default, and framework-agnostic. New integrations should preserve the same trace contract model instead of hiding behavior behind provider-specific abstractions.
Release
Release steps are documented in docs/RELEASE.md.