AgentGuard CI

Open-source CI contract testing for tool-using AI agents.

Early Beta · Python 3.11+

English · 简体中文 · 日本語 · 한국어

AgentGuard CI helps teams block risky prompt, tool, and agent changes before they reach production. Define trace-level contracts, mock tools, run impacted tests only, enforce latency/token budgets, and fail CI when agent behavior regresses.

Why AgentGuard CI Exists

Final-answer evals are not enough for production agents. A response can look acceptable while the trace is unsafe: the wrong tool was called, a required policy lookup was skipped, a refund tool ran before confirmation, or token usage doubled.

AgentGuard CI focuses on deterministic engineering contracts:

  • Required and forbidden tool calls.
  • Tool call ordering.
  • Routing decisions.
  • Structured output schemas.
  • Latency, token, cost, and tool-call budgets.
  • Mocked tool results for safe offline CI.
  • Baseline comparison for regression prevention.

LLM-as-judge is available as an optional path, but deterministic assertions are the default.

Installation

pip install agentguard

For local development from this repository:

pip install -e ".[dev]"

OpenAI judge support is optional:

pip install "agentguard[openai]"

Quick Start

agentguard init
agentguard test

Starter test:

suite: "calendar-agent"

tests:
  - id: "uses_calendar_tool"
    input: "Book a meeting tomorrow at 5 PM."
    assert:
      trace:
        must_call:
          - tool: "calendar.create_event"
      output:
        contains:
          - "meeting"

CLI

agentguard --help
agentguard --version
agentguard init
agentguard test
agentguard test --config agentguard.yml
agentguard test --suite calendar-agent
agentguard test --case uses_calendar_tool
agentguard test --tag smoke
agentguard test --changed-only --base origin/main
agentguard test --report json --report junit
agentguard test --fail-fast
agentguard test --update-baseline
agentguard test --strict
agentguard list
agentguard diff
agentguard baseline update
agentguard baseline list
agentguard validate
python -m agentguard --help

Exit codes:

  • 0: all blocking tests passed.
  • 1: at least one blocking test failed.
  • 2: configuration error.
  • 3: agent runtime error.
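A wrapper script can branch on these exit codes, for example to tell a genuine regression apart from a broken setup. The sketch below is illustrative only: `describe_exit_code` and `run_agentguard` are hypothetical helper names, not part of AgentGuard. Only the `agentguard test` command and the four codes come from the list above.

```python
import subprocess

# Exit codes as documented in the list above.
EXIT_MEANINGS = {
    0: "all blocking tests passed",
    1: "at least one blocking test failed",
    2: "configuration error",
    3: "agent runtime error",
}


def describe_exit_code(code: int) -> str:
    """Return a human-readable meaning for an AgentGuard exit code."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_agentguard(*extra_args: str) -> int:
    """Run `agentguard test` and return its exit code (hypothetical helper)."""
    completed = subprocess.run(["agentguard", "test", *extra_args], check=False)
    return completed.returncode
```

A caller might run `code = run_agentguard("--report", "junit")`, log `describe_exit_code(code)`, and then retry or alert only on codes 2 and 3, since those point at setup problems rather than agent regressions.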

Agent Entrypoint Contract

Your agent must expose a sync or async Python function that accepts an AgentRequest and returns an AgentResult:

from agentguard import AgentRequest, AgentResult, TraceEvent, Usage


async def run_agent(request: AgentRequest) -> AgentResult:
    trace = [
        TraceEvent(
            type="tool_call",
            name="calendar.create_event",
            input={"time": "5 PM"},
        )
    ]
    return AgentResult(
        output="Meeting booked for 5 PM.",
        trace=trace,
        usage=Usage(total_tokens=512, latency_ms=1200),
    )

Configure it in agentguard.yml:

version: "0.1"

project:
  name: "customer-support-agents"

agent:
  entrypoint: "my_package.agent:run_agent"
  timeout_seconds: 30

paths:
  tests: "agentguard-tests"
  baselines: ".agentguard/baselines"
  reports: ".agentguard/reports"

mocks:
  require_registered: false
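To make the `entrypoint` and `timeout_seconds` settings concrete, here is a minimal sketch of how a harness could resolve a "module:function" string and call either a sync or an async function. It is an assumption about the general technique, not AgentGuard's actual loader; `load_entrypoint` and `call_entrypoint` are hypothetical names.

```python
import asyncio
import importlib
import inspect
from typing import Any, Callable


def load_entrypoint(spec: str) -> Callable[..., Any]:
    """Resolve a "package.module:function" string to a callable."""
    module_name, sep, attr = spec.partition(":")
    if not sep or not module_name or not attr:
        raise ValueError(f"expected 'module:function', got {spec!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attr)


def call_entrypoint(func: Callable[..., Any], request: Any, timeout_seconds: float) -> Any:
    """Call a sync or async entrypoint.

    In this sketch the timeout is enforced only for async functions;
    sync functions run to completion.
    """
    if inspect.iscoroutinefunction(func):
        async def _run() -> Any:
            return await asyncio.wait_for(func(request), timeout=timeout_seconds)

        return asyncio.run(_run())
    return func(request)
```

With the configuration above, such a harness would call `load_entrypoint("my_package.agent:run_agent")` once and then `call_entrypoint(func, request, 30)` per test case.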

Tool Mocking

AgentGuard does not monkeypatch tools in the MVP. Mocking is opt-in on the agent side: the agent reads mocks from the request or calls get_mock.

mocks:
  orders.get_order:
    match_args:
      order_id: "ORD-123"
    output:
      order_id: "ORD-123"
      status: "delivered"

Read the mock inside the agent:

from agentguard import get_mock

tool_output = get_mock(
    request,
    "orders.get_order",
    args={"order_id": "ORD-123"},
)
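The matching rule behind `match_args` can be pictured with plain dictionaries. The sketch below is a stand-in written for illustration, assuming a mock applies when every `match_args` entry equals the corresponding call argument; `find_mock` and `get_order` are hypothetical names, and the real `get_mock` may behave differently.

```python
from typing import Any, Optional

# Mock table mirroring the YAML above (tool name -> mock definition).
MOCKS = {
    "orders.get_order": {
        "match_args": {"order_id": "ORD-123"},
        "output": {"order_id": "ORD-123", "status": "delivered"},
    }
}


def find_mock(mocks: dict, tool: str, args: dict) -> Optional[Any]:
    """Return the mocked output if every match_args entry equals the call argument."""
    mock = mocks.get(tool)
    if mock is None:
        return None
    expected = mock.get("match_args", {})
    if all(args.get(key) == value for key, value in expected.items()):
        return mock["output"]
    return None


def get_order(order_id: str) -> dict:
    """Tool wrapper: prefer the mock; the real API call would go in the fallback."""
    mocked = find_mock(MOCKS, "orders.get_order", {"order_id": order_id})
    if mocked is not None:
        return mocked
    raise RuntimeError("no mock registered; a real API call would go here")
```

Because the agent opts in at the tool boundary, the same wrapper runs unchanged in production and in offline CI.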

Assertions

Trace assertions:

assert:
  trace:
    must_call:
      - tool: "orders.get_order"
    must_not_call:
      - tool: "payments.issue_refund"
    ordered:
      - tool: "orders.get_order"
      - tool: "policy.lookup_refund_policy"
    max_tool_calls: 5
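For intuition, the four trace checks above can be expressed over a plain list of tool names. This is a sketch under stated assumptions, not AgentGuard's evaluator: `check_trace` is a hypothetical name, and `ordered` is interpreted here as "appears in this order, with other calls allowed in between".

```python
# Contract mirroring the YAML above, flattened to tool names.
CONTRACT = {
    "must_call": ["orders.get_order"],
    "must_not_call": ["payments.issue_refund"],
    "ordered": ["orders.get_order", "policy.lookup_refund_policy"],
    "max_tool_calls": 5,
}


def is_subsequence(expected: list[str], calls: list[str]) -> bool:
    """True if `expected` appears in `calls` in order, gaps allowed."""
    remaining = iter(calls)
    # `tool in remaining` advances the iterator, so order is enforced.
    return all(tool in remaining for tool in expected)


def check_trace(calls: list[str], contract: dict) -> list[str]:
    """Return a list of failure messages; an empty list means the trace passes."""
    failures = []
    for tool in contract.get("must_call", []):
        if tool not in calls:
            failures.append(f"missing required call: {tool}")
    for tool in contract.get("must_not_call", []):
        if tool in calls:
            failures.append(f"forbidden call: {tool}")
    ordered = contract.get("ordered", [])
    if ordered and not is_subsequence(ordered, calls):
        failures.append("ordered calls out of sequence")
    limit = contract.get("max_tool_calls")
    if limit is not None and len(calls) > limit:
        failures.append(f"too many tool calls: {len(calls)} > {limit}")
    return failures
```

Every check is a pure function of the recorded trace, which is what makes these contracts deterministic across CI runs.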

Output assertions:

assert:
  output:
    contains:
      - "confirmation"
    not_contains:
      - "refunded"
    json_schema:
      type: object
      required: ["category"]
      properties:
        category:
          type: string
          enum: ["billing", "technical", "account", "other"]
    jsonpath:
      - path: "$.category"
        equals: "billing"
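The output checks can be approximated with the standard library alone. The sketch below hand-rolls the `contains`/`not_contains` checks and the `category` constraint instead of using a JSON Schema or JSONPath library; `check_output` and `check_category` are hypothetical names chosen for this illustration.

```python
import json

ALLOWED_CATEGORIES = ["billing", "technical", "account", "other"]


def check_output(output: str, contains=(), not_contains=()) -> list[str]:
    """Substring checks corresponding to `contains` and `not_contains`."""
    failures = [f"missing text: {text!r}" for text in contains if text not in output]
    failures += [f"forbidden text: {text!r}" for text in not_contains if text in output]
    return failures


def check_category(output: str, allowed: list[str], expected: str) -> list[str]:
    """Stand-in for the json_schema and jsonpath checks on `category`."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(data, dict) or "category" not in data:
        return ["missing required field: category"]
    failures = []
    if data["category"] not in allowed:
        failures.append(f"category not in enum: {data['category']!r}")
    if data["category"] != expected:
        failures.append(f"category != {expected!r}")
    return failures
```

Substring checks suit free-text answers, while the schema-style check suits agents that return structured JSON such as a ticket classifier.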

Budget assertions:

assert:
  budgets:
    max_latency_ms: 5000
    max_total_tokens: 3000
    max_cost_usd: 0.05
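A budget check reduces to comparing reported usage against limits. The sketch below assumes each `max_<metric>` budget maps to a usage field named `<metric>`, which matches `total_tokens` and `latency_ms` in the entrypoint example; `cost_usd` as a usage field and the `check_budgets` helper are assumptions made for illustration.

```python
def check_budgets(usage: dict, budgets: dict) -> list[str]:
    """Return one message per exceeded budget; metrics missing from usage are skipped."""
    failures = []
    for key, limit in budgets.items():
        metric = key.removeprefix("max_")  # e.g. "max_latency_ms" -> "latency_ms"
        actual = usage.get(metric)
        if actual is not None and actual > limit:
            failures.append(f"{metric} {actual} exceeds {key} {limit}")
    return failures


BUDGETS = {"max_latency_ms": 5000, "max_total_tokens": 3000, "max_cost_usd": 0.05}
```

With the usage from the entrypoint example (512 tokens, 1200 ms), this returns an empty list. A run that took 6000 ms would report a single latency violation.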

Impact-Aware Testing

Create agentguard-impact.yml:

mappings:
  - files:
      - "agents/refund/**"
      - "prompts/refund/**"
      - "tools/refund.py"
    tests:
      - "refund-agent"
      - "refund-regression"

Run impacted suites only:

agentguard test --changed-only --base origin/main

If no mapping matches, AgentGuard runs smoke-tagged tests when present.
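The mapping logic can be sketched as glob matching over the list of changed files (which could come from something like `git diff --name-only origin/main...HEAD`). This is an illustration, not AgentGuard's implementation: `select_suites` is a hypothetical name, and `fnmatchcase` is used as an approximation of the glob semantics, where `*` also matches path separators.

```python
from fnmatch import fnmatchcase

# Mapping mirroring agentguard-impact.yml above.
MAPPINGS = [
    {
        "files": ["agents/refund/**", "prompts/refund/**", "tools/refund.py"],
        "tests": ["refund-agent", "refund-regression"],
    }
]


def select_suites(changed_files: list[str], mappings: list[dict]) -> list[str]:
    """Return suites whose file patterns match any changed path, in mapping order."""
    selected: list[str] = []
    for mapping in mappings:
        hit = any(
            fnmatchcase(path, pattern)
            for path in changed_files
            for pattern in mapping["files"]
        )
        if hit:
            for suite in mapping["tests"]:
                if suite not in selected:
                    selected.append(suite)
    return selected
```

An empty result corresponds to the "no mapping matches" case, where the runner would fall back to smoke-tagged tests.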

Baseline Regression Testing

Store baselines for passing tests:

agentguard test --update-baseline

Enable baseline comparison:

assert:
  regression:
    compare_to_baseline: true
    allowed_output_similarity_drop: 0.1
    allowed_extra_tool_calls: 1
    allowed_latency_increase_pct: 30

Baselines are stored under:

.agentguard/baselines/{suite}/{test_id}.json
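The three regression thresholds can be read as tolerances around the stored baseline. The sketch below is one plausible interpretation rather than AgentGuard's algorithm: the baseline record layout, the `compare_to_baseline` helper, and the use of `difflib.SequenceMatcher` as the similarity metric are all assumptions.

```python
from difflib import SequenceMatcher

LIMITS = {
    "allowed_output_similarity_drop": 0.1,
    "allowed_extra_tool_calls": 1,
    "allowed_latency_increase_pct": 30,
}


def compare_to_baseline(baseline: dict, current: dict, limits: dict) -> list[str]:
    """Return the names of the regression checks that failed."""
    failures = []

    # Similarity of 1.0 means identical output; the drop is measured from 1.0.
    similarity = SequenceMatcher(None, baseline["output"], current["output"]).ratio()
    if 1.0 - similarity > limits["allowed_output_similarity_drop"]:
        failures.append("output_similarity")

    extra_calls = current["tool_calls"] - baseline["tool_calls"]
    if extra_calls > limits["allowed_extra_tool_calls"]:
        failures.append("extra_tool_calls")

    # Skip the percentage check when the baseline latency is zero.
    if baseline["latency_ms"] > 0:
        delta = current["latency_ms"] - baseline["latency_ms"]
        increase_pct = delta / baseline["latency_ms"] * 100
        if increase_pct > limits["allowed_latency_increase_pct"]:
            failures.append("latency_increase")

    return failures
```

Under these assumptions, a run with one extra tool call and 20% more latency than the baseline still passes, while two extra calls or a 50% latency increase fails.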

Reports

AgentGuard prints a terminal report and can write machine-readable reports:

agentguard test --report json
agentguard test --report junit

Outputs:

.agentguard/reports/report.json
.agentguard/reports/junit.xml

GitHub Actions

name: AgentGuard CI

on:
  pull_request:
    branches: [main]

jobs:
  agentguard:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install agentguard

      - name: Run impacted agent tests
        run: |
          agentguard test --changed-only --base origin/main --report junit

Branch protection setup:

  1. Open repository settings.
  2. Enable branch protection for main.
  3. Require status checks before merging.
  4. Select the agentguard status check (the job ID from the workflow above) as a required check.

Optional LLM Judge

LLM judge tests are opt-in and should be used selectively because provider calls can be slower, more expensive, and less deterministic than trace contracts.

assert:
  llm_judge:
    enabled: true
    rubric: |
      Score whether the answer correctly explains the refund policy and does not invent
      unsupported exceptions.
    threshold: 0.85

Use the fake provider for deterministic local tests, or configure OpenAI by setting the OPENAI_API_KEY environment variable.

Example

cd examples/simple_agent
agentguard test

The example agent uses a mocked calendar.create_event tool and validates the trace, output, latency, token budget, and tool-call budget.

Roadmap

v0.1 Alpha

  • YAML tests.
  • CLI runner.
  • Python agent entrypoint.
  • Trace assertions.
  • Output assertions.
  • Budgets.
  • Mocks.
  • Reports.
  • GitHub Actions example.

v0.2 Core Hardening

  • Public model contract cleanup.
  • Complete deterministic assertion coverage.
  • Strict validation mode.
  • Clear domain exceptions and failure messages.
  • Expanded unit coverage.

v0.3 Baselines and Reports

  • Versioned baseline artifacts.
  • Baseline diff improvements.
  • Stable JSON report schema.
  • Improved JUnit output.
  • Baseline CLI subcommands.

v0.4 Mocking and Impact

  • Stricter mock argument matching.
  • Registered mock failure mode.
  • Hardened changed-only behavior.
  • Smoke fallback rules.

v0.5 Adapters

  • Default custom Python adapter.
  • OpenAI Agents SDK adapter.
  • LangChain adapter.
  • Optional extras and adapter docs.

v1.0 Release Candidate

  • Frozen public models and JSON report schema.
  • Complete docs.
  • Clean install verification.
  • TestPyPI package validation.
  • Stable local and CI runtime behavior.

Contributing

Keep the open-source core CI-first, deterministic by default, and framework-agnostic. New integrations should preserve the same trace contract model instead of hiding behavior behind provider-specific abstractions.

Release

Release steps are documented in docs/RELEASE.md.
