Skip to main content

Unit tests for agents. Define an agent with capa's capabilities.yaml, write scenarios and assertions, mock MCP servers, run in CI.

Project description

agenteval: Unit Tests for AI Agents

PyPI CI Python License: MIT

agenteval treats an agent run like a unit test. You declare the agent once with capa's capabilities.yaml, write scenarios (a prompt, fake MCP servers seeded with known state, and assertions), and run the whole suite from one CI-friendly command. Providers are pluggable, and Claude Code ships first.

Why agenteval?

Testing an AI agent is awkward. It calls real SaaS APIs, burns tokens on every run, and answers a little differently each time. So most agent "tests" are a human eyeballing the output once and calling it good. There is no red/green, nothing to gate a PR on, nothing that tells you a prompt edit quietly broke the Slack integration.

agenteval gives you the parts that are tedious to build yourself. You declare the agent once in capabilities.yaml, then write scenarios next to it: a prompt, a set of mock MCP servers seeded with state you control, and assertions about what should happen. Each run gets its own throwaway workspace, talks only to local mocks, and is checked against those assertions. The mocks record every tool call, so you can assert the agent posted to #releases exactly once, or never touched a delete_user tool at all.

One command runs the whole matrix of scenarios, models, and repeats, aggregates the results, and exits nonzero when something regresses. JUnit XML drops straight into CI.

What it does

  • MCP mock infrastructure: stand up fake MCP servers from a mock.yaml with verbatim tool schemas, seed state, declarative responses, and an optional Python escape hatch for stateful tools.
  • Every tool call is recorded, so assertions can check what the agent did, not just what it said.
  • Ephemeral workspace per run, capa install to compile capabilities.yaml into provider config, the agent invocation, repeats, and result aggregation.
  • Declarative assertions (mock_state, tool_called, file and output checks) plus an optional LLM judge scored against a rubric.
  • Reports as JSON, JUnit XML for native CI gating, markdown for PR comments, and a console summary.
  • Pluggable providers: a new harness is a Provider subclass that picks a different capa install -p target, with no changes to the runner.

How it works

For every (scenario, model, repeat) cell, agenteval builds an isolated environment, lets capa bootstrap the agent, runs it against local mocks, then checks the side effects. The mocks start before install so capa can validate tools against a live server.

flowchart TD
  A["agenteval run"] --> B["load config + scenarios"]
  B --> C{{"for each scenario x model x repeat"}}

  subgraph Cell["one cell"]
    direction TB
    D["start MCP mocks + workspace"] --> E["capa install -p claude-code"]
    E --> F["run agent"]
    F --> G["assertions + optional LLM judge"]
    G --> H["record result, teardown"]
  end

  C --> Cell
  H --> C
  C -->|all done| I["aggregate + emit reports"]
  I --> J["exit nonzero on failure"]

The framework owns the workspace, the mocks, and the URL rewiring that connects them; it delegates all harness provisioning to capa install -p <provider>.

Installation

pip install agenteval-framework

The distribution is agenteval-framework; the import package and the CLI are both agenteval.

Prerequisites (external, documented by their projects):

  • The capa binary on PATH. agenteval uses it to compile capabilities.yaml.
  • The provider CLI. For Claude that is the claude CLI (Claude Code), authenticated via subscription or ANTHROPIC_API_KEY.

Quick start

1. Scaffold a project

agenteval init my-eval
cd my-eval

This drops an agenteval.yaml, an agent, and one example scenario next to your code.

2. Add your credentials

cp .env.example .env        # add ANTHROPIC_API_KEY if your CLI needs it

3. Validate and run

agenteval validate          # structural check, no model calls
agenteval run -v            # run the suite

Or run the bundled, fully working example:

agenteval run --root examples/my-eval -v

Expected shape of the output:

  scenario 'slack-release-note'  provider=claude  model=claude-opus-4-8
    trial 1/1: PASS  $0.1783  turns=4  13.2s  judge=1.00
  suite: 1/1 runs passed  ->  PASS

Project layout

my-eval/
  agenteval.yaml              # provider, models, defaults, judge, report config
  .env                        # ANTHROPIC_API_KEY etc (you bring; never committed)
  agents/
    release-bot/
      capabilities.yaml       # capa spec -> provider config
      CLAUDE.md               # agent base prompt (referenced by capabilities.yaml)
      skills/slack/SKILL.md   # local skills (optional)
  scenarios/
    slack-release-note/
      scenario.yaml           # agent ref, prompt, mcp list, assertions, judge
      input/prompt.md         # the task prompt
      rubric.md               # LLM-judge rubric (optional)
      assets/                 # files copied into the ephemeral workspace (optional)
      mcp/
        slack/
          mock.yaml           # schema ref + seed state + declarative responses
          schema.json         # verbatim tools/list (optional; can inline)
          handler.py          # optional Python hooks for stateful tools
  reports/                    # output (gitignored)

Defining a scenario

scenario.yaml:

id: slack-release-note
agent: release-bot
prompt_file: input/prompt.md
mcp: [slack]                  # which mocks (dirs under mcp/) to start

run:
  repeats: 1
  max_turns: 15

assertions:
- kind: mock_state            # a message landed in #releases
  server: slack
  jsonpath: "$.messages[?(@.channel_name=='releases')]"
  min_count: 1
- kind: tool_called           # the agent actually posted
  server: slack
  tool: slack_send_message
  min_count: 1

judge:
  rubric_file: rubric.md      # LLM-judged message quality
  min_score: 0.6

Assertion kinds

kind params checks
mock_state server, jsonpath, min_count, max_count JSONPath matches against a mock's final state
tool_called tool, server, min_count, max_count a tool was (or was never, with max_count: 0) called
file_exists path a file exists in the workspace
file_contains path, values, mode file contains all/any of values
dir_has_new_file path, matches, contents_include a matching file with given content exists
output_contains values, mode, ignore_case final answer contains all/any of values
output_matches pattern, ignore_case final answer matches a regex

Each assertion is required: true by default; a required failure fails the run and sets a nonzero exit code. The jsonpath for mock_state supports the Goessner filter form ($.coll[?(@.field=='value')]) and plain paths natively, and falls back to jsonpath-ng for richer expressions.

Defining a mock

mcp/<server>/mock.yaml is hybrid: declarative by default, with a Python escape hatch for stateful behaviour.

name: slack
schema: schema.json          # verbatim tools/list, or inline `tools:`
handler: handler.py          # optional; functions override named tools

seed:                        # initial state, reset before every run
  channels:
  - { id: C1, name: releases }
  messages: []

responses:                   # declarative dispatch (used when no handler matches)
  list_channels:
    result: { ok: true, channels: "{{ state.channels }}" }
  post_message:
    mutate:
    - append: { path: messages, value: { channel_name: releases, text: "{{ args.text }}" } }
    result: { ok: true, ts: "{{ now }}" }

Templating is Jinja2 over { args, state, now, uuid }. A value that is a single {{ expr }} keeps its native JSON type. Mutations support append, extend, set, and increment. A handler.py may export a HANDLERS dict (or tool_<name> functions); each handler is fn(args, ctx) and mutates ctx.state.data.

CLI

agenteval run        run the suite, emit reports, exit nonzero on failure
agenteval list       list discovered scenarios
agenteval validate   check structure without calling any model
agenteval init       scaffold a new eval project

run flags: --root, --filter <substr>, --provider, --model (repeatable), --repeat N, --report-dir, --no-judge, --keep-workspace, -v.

Reports

agenteval run writes to report.dir (default reports/):

  • results.json: full per-run records plus per-cell aggregates (mean/stddev/median per metric, pass rate, judge score).
  • junit.xml: one testcase per repeat, for native CI gating.
  • report.md: a table plus a failure breakdown for PR comments.

The judge defaults to the claude-cli backend, which reuses the Claude Code CLI's auth (so it works without exporting ANTHROPIC_API_KEY). Set judge.backend: anthropic-api in agenteval.yaml to use the Anthropic SDK with ANTHROPIC_API_KEY instead.

CI

See examples/my-eval/.github/workflows/agenteval.yml for a template. Copy it to your repo root .github/workflows/, provide ANTHROPIC_API_KEY as a secret, install capa and the claude CLI, then run agenteval run. The nonzero exit on failure gates the PR; upload reports/ as an artifact and publish junit.xml.

Releasing

CI (.github/workflows/ci.yml) runs on every push and PR: a hermetic smoke test (agenteval --help, validate, init) across Python 3.9 to 3.12, plus a build and twine check. None of it calls a model, so it needs no secrets.

Releasing is tag driven. Pushing a tag like v0.1.0 runs .github/workflows/publish.yml, which derives the version from the tag with setuptools-scm, publishes to PyPI via Trusted Publishing (OIDC) so no API tokens are stored, and creates the GitHub Release with autogenerated notes and the built artifacts attached.

git tag v0.1.0
git push origin v0.1.0
# that's it

One-time setup (done once, then never again):

  1. On PyPI, add a pending trusted publisher for the project agenteval-framework (Account -> Publishing): owner Minitour, repo agenteval, workflow publish.yml, environment pypi.
  2. In the GitHub repo, create an Environment named pypi.

You never edit a version by hand; the git tag is the single source of truth. To cut a release locally without CI you can still python -m build and twine upload dist/* with your own credentials.

Adding a provider

Subclass agenteval.providers.base.Provider (implement install, run, preflight), then register it:

from agenteval.providers.registry import register
register(MyProvider)

install rewrites the servers: URLs in capabilities.yaml to the local mock endpoints and compiles them for the target harness; run invokes the agent and returns a normalized ProviderRunOutput. No core changes required.

Contributing

See CONTRIBUTING.md. Security reports go through SECURITY.md.

License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

agenteval_framework-0.1.0.tar.gz (59.9 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

agenteval_framework-0.1.0-py3-none-any.whl (44.1 kB view details)

Uploaded Python 3

File details

Details for the file agenteval_framework-0.1.0.tar.gz.

File metadata

  • Download URL: agenteval_framework-0.1.0.tar.gz
  • Upload date:
  • Size: 59.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for agenteval_framework-0.1.0.tar.gz
Algorithm Hash digest
SHA256 17130ed5c1337bff52d07e78100e8a7318a8691ba9a821d39cca87c6ea461ca2
MD5 b53e1f6fff8385a329d4d071f9b27415
BLAKE2b-256 33a940a3dd8f4be9fb840a8693dd850c23aa45f03b159c53904d1fff75041fc2

See more details on using hashes here.

Provenance

The following attestation bundles were made for agenteval_framework-0.1.0.tar.gz:

Publisher: publish.yml on Minitour/agenteval

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file agenteval_framework-0.1.0-py3-none-any.whl.

File metadata

File hashes

Hashes for agenteval_framework-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 b2dc91bd0b24344e09be964c1111c9d14cd4e77798fd252cd89346aa58227062
MD5 874b61896d79453c53196881c7a22bfc
BLAKE2b-256 fc34da822441888ab9c68ce717922037ef9373357f9496c79de3a93b13577b2e

See more details on using hashes here.

Provenance

The following attestation bundles were made for agenteval_framework-0.1.0-py3-none-any.whl:

Publisher: publish.yml on Minitour/agenteval

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page