# agenteval
Your agent passes Monday, fails Wednesday. agenteval tells you why.
Statistical evaluation framework for AI agents. Like pytest, but for non-deterministic systems.
## Why agenteval?
- Statistical rigor: Every test runs N trials with Wilson confidence intervals — not "it passed once"
- Trajectory analysis: Step-by-step failure attribution identifies exactly which tool call diverged
- Cost tracking: Real API costs from model metadata, not estimates
- CI/CD ready: GitHub Action that blocks PRs when reliability drops
## Quick Start

```bash
pip install agentrial
```
Create a test file `agenteval.yml`:

```yaml
suite: my-agent-tests
agent: my_agent.agent  # Python import path
trials: 10
threshold: 0.85

cases:
  - name: basic-math
    input:
      query: "What is 15 * 37?"
    expected:
      output_contains: ["555"]
      tool_calls:
        - tool: calculate
```
Run:

```bash
agentrial run
```
Output:

```
╭──────────────────────────────────────────────────────────────────╮
│ my-agent-tests - PASSED                                          │
╰────────────────────────────────────────────── Threshold: 85.0% ──╯
┏━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Test Case  ┃ Pass Rate ┃ 95% CI         ┃ Avg Cost ┃ Avg Latency ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ basic-math │ 100.0%    │ (72.2%-100.0%) │ $0.0005  │ 1.59s       │
└────────────┴───────────┴────────────────┴──────────┴─────────────┘

Overall Pass Rate: 100.0% (72.2%-100.0%)
Total Cost: $0.005
```
## Real-World Results
Tested with Claude 3 Haiku on a 3-tool agent (calculator, country lookup, temperature conversion):
| Test Complexity | Pass Rate | 95% CI | Avg Cost | Avg Latency | Avg Tokens |
|---|---|---|---|---|---|
| Easy (direct tool call) | 100% | 72.2%-100% | $0.0005 | 1.6s | 1,513 |
| Medium (inference + tool) | 100% | 72.2%-100% | $0.0006 | 2.6s | 1,926 |
| Hard (multi-step reasoning) | 100% | 72.2%-100% | $0.0010 | 3.5s | 2,986 |
100 trials total, $0.06 total cost, full trajectory capture.
## Test Case Options
```yaml
cases:
  - name: my-test
    input:
      query: "User question"
      context: {}  # Optional context dict
    expected:
      # All strings must be present (AND logic)
      output_contains: ["expected", "words"]
      # At least one string must be present (OR logic)
      output_contains_any: ["option1", "option2", "option3"]
      # Regex pattern
      regex: "\\d+ results found"
      # Expected tool calls
      tool_calls:
        - tool: search
          params_contain: {query: "expected"}
```
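The matching rules for the output fields can be sketched in plain Python. This is a hypothetical checker illustrating the AND/OR/regex semantics described above, not agenteval's actual implementation; the dict keys simply mirror the config keys:

```python
import re

def output_matches(output: str, expected: dict) -> bool:
    """Sketch of the output-matching semantics: AND, OR, and regex checks."""
    # output_contains: every listed string must appear (AND logic)
    if any(s not in output for s in expected.get("output_contains", [])):
        return False
    # output_contains_any: at least one listed string must appear (OR logic)
    any_of = expected.get("output_contains_any")
    if any_of is not None and not any(s in output for s in any_of):
        return False
    # regex: pattern must match somewhere in the output
    pattern = expected.get("regex")
    if pattern is not None and re.search(pattern, output) is None:
        return False
    return True

print(output_matches("12 results found", {"regex": r"\d+ results found"}))  # True
print(output_matches("no hits", {"output_contains": ["results"]}))          # False
```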
## CI/CD Integration

Add to `.github/workflows/agent-eval.yml`:
```yaml
name: Agent Evaluation
on: [push, pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install agentrial
      - run: agentrial run --threshold 0.85
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
```
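To compare a PR's run against a known-good baseline, the `compare` command from the CLI reference can be wired into the same job. This sketch assumes a `baseline.json` committed to the repository (how you store and refresh the baseline is up to you; it is not prescribed by the workflow above):

```yaml
      - run: agentrial run -o results.json
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
      - run: agentrial compare -b baseline.json results.json
```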
## LangGraph Integration
```python
from langchain_anthropic import ChatAnthropic
from langgraph.prebuilt import create_react_agent

from agenteval.runner.adapters import wrap_langgraph_agent

# Your LangGraph agent
llm = ChatAnthropic(model="claude-3-haiku-20240307")
graph = create_react_agent(llm, tools=[...])

# Wrap for agenteval
agent = wrap_langgraph_agent(graph)
```
Then reference it in your test file:

```yaml
agent: my_module.agent
```
## CLI Reference

```bash
agentrial run [PATH]                             # Run tests (default: current directory)
agentrial run --trials 20                        # Override trial count
agentrial run --threshold 0.9                    # Override pass threshold
agentrial run -o results.json                    # Export JSON report
agentrial compare -b baseline.json current.json  # Compare runs
agentrial init                                   # Initialize project
```
## Supported Frameworks
- LangGraph (native adapter with full trajectory capture)
- CrewAI (coming soon)
- AutoGen (coming soon)
- Pydantic AI (coming soon)
## Statistical Methods
- Pass rate CI: Wilson score interval (accurate for small N and extreme proportions)
- Cost/latency CI: Bootstrap resampling (500 iterations)
- Regression detection: Fisher exact test for pass rate, Mann-Whitney U for metrics
- Failure attribution: Trajectory divergence analysis between passed/failed trials
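The interval methods above are standard and easy to reproduce. The following is a standalone sketch of both (not agenteval's internal code) that recovers the 72.2%-100% bound shown in the tables, which is exactly the Wilson interval for 10 passes out of 10 trials:

```python
import math
import random

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial pass rate (95% for z = 1.96)."""
    if trials == 0:
        return (0.0, 1.0)
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials))
    return (max(0.0, center - half), min(1.0, center + half))

def bootstrap_mean_ci(samples, iterations=500, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of cost/latency samples."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(samples, k=len(samples))) / len(samples)
        for _ in range(iterations)
    )
    return means[int(alpha / 2 * iterations)], means[int((1 - alpha / 2) * iterations) - 1]

# 10/10 passes is not "100% reliable": the interval stays wide at small N
lo, hi = wilson_interval(10, 10)
print(f"{lo:.1%}-{hi:.1%}")  # 72.2%-100.0%
```

This is why per-case pass rates of 100% in the results tables still carry a lower confidence bound of 72.2%: with only 10 trials, a perfect run cannot rule out a true pass rate in the 70s.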
## License
MIT