# agentrial

Statistical evaluation framework for AI agents - the pytest for agent trajectories. Run your agent 100 times, get confidence intervals instead of anecdotes.

Your agent passes Monday, fails Wednesday. Same prompt, same model. agentrial tells you why.
## Quickstart

```bash
pip install agentrial
agentrial init
agentrial run
```

That's it. You'll see real results in seconds:
```text
╭──────────────────────────────────────────────────────────────────────╮
│ sample-demo - PASSED                                                 │
╰───────────────────────────────────────────────── Threshold: 80% ────╯
┏━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Test Case      ┃ Pass Rate ┃ 95% CI         ┃ Avg Cost ┃ Avg Latency ┃
┡━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ greeting       │ 100.0%    │ (56.6%-100.0%) │ $0.0000  │ 0ms         │
│ capital-france │ 100.0%    │ (56.6%-100.0%) │ $0.0000  │ 0ms         │
│ capital-japan  │ 100.0%    │ (56.6%-100.0%) │ $0.0000  │ 0ms         │
│ basic-math     │ 100.0%    │ (56.6%-100.0%) │ $0.0000  │ 0ms         │
└────────────────┴───────────┴────────────────┴──────────┴─────────────┘

Overall Pass Rate: 100.0% (85.0%-100.0%)
Total Cost: $0.0000
```
Replace `sample_agent.py` with your own agent, update `tests/test_sample.yml`, and you're evaluating real agents.
## Why this exists
Every agent framework ships with benchmarks showing 90%+ accuracy. But run those same agents 100 times on the same task, and you'll see pass rates drop to 60-80% with wide variance. The benchmarks measure one run; production sees thousands.
No existing tool gives you statistically rigorous, framework-agnostic agent testing that runs in CI/CD. LangSmith requires a paid account and locks you to LangChain. Promptfoo doesn't do multi-trial with confidence intervals. DeepEval and Arize don't do trajectory-level failure attribution. agentrial fills that gap: open-source, free, local-first, works with any agent framework.
## What it does
- Multi-trial execution — Run every test N times automatically. A single pass means nothing for non-deterministic agents.
- Wilson confidence intervals — Statistically accurate pass rates, even with small samples and extreme proportions (0% or 100%).
- Step-level failure attribution — Pinpoints which tool call diverges between passing and failing runs using Fisher exact test.
- Real cost tracking — Actual API costs from model metadata, 45+ models supported across Anthropic, OpenAI, Google, Mistral, Meta, DeepSeek.
- Regression detection — Fisher exact test catches reliability drops between versions. Blocks PRs in CI when quality degrades.
- Local-first — Your data never leaves your machine. No accounts, no SaaS, no telemetry.
- Agent Reliability Score — A single 0-100 composite metric that combines accuracy, consistency, cost efficiency, latency, trajectory quality, and failure recovery. Weighted scoring with transparent breakdown — one number to track across releases.
- Production monitoring — Deploy `agentrial monitor` as a cron job or sidecar. CUSUM and Page-Hinkley detectors catch drift in pass rate, cost, and latency. A Kolmogorov-Smirnov test detects distribution shifts. Alerts before users notice.
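The Wilson interval behind those pass-rate bounds has a compact closed form. A minimal sketch of the standard Wilson score interval (agentrial's internal implementation may differ in details):

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial pass rate."""
    if trials == 0:
        return (0.0, 1.0)
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials))
    return (max(0.0, center - half), min(1.0, center + half))


# 5 passes out of 5 trials yields roughly (56.6%, 100%) - the interval
# stays wide, unlike a naive point estimate of "100%".
lo, hi = wilson_interval(5, 5)
```

Note how five straight passes still leave a lower bound near 57%: a handful of clean runs tells you very little about a non-deterministic agent.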
## Writing Tests

Tests are YAML files. Define what your agent receives and what it should produce:

```yaml
suite: my-agent-tests
agent: my_module.agent  # Python import path to your wrapped agent
trials: 10
threshold: 0.85         # Minimum pass rate

cases:
  - name: basic-math
    input:
      query: "What is 15 * 37?"
    expected:
      output_contains: ["555"]
      tool_calls:
        - tool: calculate

  - name: capital-lookup
    input:
      query: "What is the capital of Japan?"
    expected:
      output_contains: ["Tokyo"]

  - name: error-handling
    input:
      query: "Divide 10 by 0"
    expected:
      output_contains_any: ["undefined", "cannot", "error"]
    max_cost: 0.05
    max_latency_ms: 5000
```
### All assertion types

```yaml
expected:
  output_contains: ["word1", "word2"]          # AND - all must be present
  output_contains_any: ["option1", "option2"]  # OR - at least one
  exact_match: "exact output string"
  regex: "\\d+ results found"
  tool_calls:
    - tool: search
      params_contain:
        query: "expected term"

  # Per-step expectations
  step_expectations:
    - step_index: 0
      tool_name: search
      params_contain:
        query: "search term"
      output_contains: ["result"]
```
### Test discovery

agentrial auto-discovers test files:

```bash
agentrial run tests/          # Finds test_*.yml, test_*.yaml
agentrial run agentrial.yml   # Run a specific file
```
## Wrapping Your Agent

agentrial needs a callable: `AgentInput -> AgentOutput`. Use an adapter for your framework.

### LangGraph

```python
from langchain_anthropic import ChatAnthropic
from langchain_core.tools import tool
from langgraph.prebuilt import create_react_agent

from agentrial.runner.adapters import wrap_langgraph_agent


@tool
def calculate(expression: str) -> str:
    """Evaluate a math expression."""
    return str(eval(expression))  # fine for a demo tool; avoid eval in production


llm = ChatAnthropic(model="claude-3-haiku-20240307", temperature=0)
graph = create_react_agent(llm, tools=[calculate])

# This is what your YAML points to
agent = wrap_langgraph_agent(graph)
```
The LangGraph adapter automatically captures the full trajectory, token usage, real API cost, and execution duration.
### Custom agents

Implement the protocol directly:

```python
from agentrial.types import AgentInput, AgentOutput, AgentMetadata


def agent(input: AgentInput) -> AgentOutput:
    # Your agent logic
    return AgentOutput(
        output="result",
        steps=[],
        metadata=AgentMetadata(total_tokens=100, cost=0.001, duration_ms=500.0),
        success=True,
    )
```
## Fluent Assertion API

For Python-defined tests:

```python
from agentrial import expect

result = agent(AgentInput(query="Book a flight to Rome"))

e = expect(result).succeeded() \
    .tool_called("search_flights", params_contain={"destination": "FCO"}) \
    .cost_below(0.15) \
    .latency_below(5000)

# Output checks return OutputExpectation (separate chain)
e.output.contains("confirmed", "Rome")

# Step checks return StepExpectation (separate chain)
e.step(0).tool_name("search_flights").params_contain(destination="FCO")

assert e.passed()
```
| Method | Description |
|---|---|
| `.succeeded()` | Agent completed without error |
| `.output.contains(*strings)` | Output contains all substrings |
| `.output.equals(string)` | Exact match |
| `.output.matches(regex)` | Regex match |
| `.tool_called(name, params_contain={})` | Tool was called with params |
| `.step(i).tool_name(name)` | Step i called named tool |
| `.step(i).params_contain(**kw)` | Step i had params matching kw |
| `.cost_below(max_usd)` | Cost under threshold |
| `.latency_below(max_ms)` | Latency under threshold |
| `.tokens_below(max_tokens)` | Tokens under threshold |
| `.trajectory_length(min, max)` | Step count within bounds |
| `.passed()` | Returns True if all pass |
| `.get_failures()` | Returns failure messages |
## CI/CD Integration

### GitHub Actions

```yaml
name: Agent Evaluation
on: [push, pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install agentrial && pip install -e .
      - run: agentrial run --trials 10 --threshold 0.85 -o results.json
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: agentrial-results
          path: results.json
```
### Regression detection in CI

```yaml
- run: agentrial run -o results.json
- run: agentrial compare results.json --baseline baseline.json
```

Fisher's exact test (p < 0.05) detects statistically significant regressions. Exit code 1 blocks the PR.
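The comparison reduces to a 2x2 contingency table of passes and failures for the two runs. As an illustration, here is a from-scratch two-sided Fisher exact test that enumerates the hypergeometric distribution (agentrial ships its own implementation; this sketch just shows the statistic):

```python
from math import comb


def fisher_exact_p(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]],
    e.g. a/b = passes/failures on the baseline, c/d = on the new run."""
    n = a + b + c + d
    row1, col1 = a + b, a + c

    def prob(k: int) -> float:  # hypergeometric P(X = k)
        return comb(col1, k) * comb(n - col1, row1 - k) / comb(n, row1)

    p_obs = prob(a)
    lo, hi = max(0, row1 + col1 - n), min(row1, col1)
    # two-sided: sum over all tables at least as extreme as the observed one
    return sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= p_obs * (1 + 1e-9))


# baseline 9/10 passing vs. new run 3/10 passing
p = fisher_exact_p(9, 1, 3, 7)  # ~0.02 -> significant regression at p < 0.05
```

With only ten trials per side, a drop from 9/10 to 3/10 already clears the 0.05 bar; smaller drops need more trials to distinguish regression from noise.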
## Advanced features

### Trajectory flame graphs

Visualize agent execution paths across trials. Identify where passing and failing runs diverge.

```bash
agentrial run --flamegraph                         # Terminal visualization
agentrial run --flamegraph --html flamegraph.html  # Interactive HTML export
```
### LLM-as-Judge

Use a second LLM to evaluate response quality with calibrated scoring.

```bash
agentrial run --judge   # Add judge evaluation
```

Implements Krippendorff's alpha for inter-rater reliability and a t-distribution CI for score estimates. A calibration protocol ensures judge consistency before scoring.
### Snapshot testing

Capture baseline behavior and detect regressions automatically.

```bash
agentrial snapshot update   # Save current behavior as baseline
agentrial snapshot check    # Compare against baseline
```

Uses the Fisher exact test on pass rates and Mann-Whitney U on cost/latency, with Benjamini-Hochberg correction across all comparisons.
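The cost/latency half of that pipeline can be sketched end to end: a normal-approximation Mann-Whitney U test (no tie correction, unlike a production implementation), then Benjamini-Hochberg selection over the resulting p-values. Function names here are illustrative, not agentrial's API:

```python
import math


def mann_whitney_p(x: list[float], y: list[float]) -> float:
    """Two-sided Mann-Whitney U p-value via the normal approximation,
    with average ranks for ties (tie correction omitted)."""
    pooled = sorted((v, src) for src in (0, 1) for v in (x, y)[src])
    ranks, i = {}, 0
    while i < len(pooled):  # assign 1-based average ranks to tied runs
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        for k in range(i, j):
            ranks[k] = (i + j + 1) / 2
        i = j
    r1 = sum(ranks[k] for k, (_, src) in enumerate(pooled) if src == 0)
    n1, n2 = len(x), len(y)
    u1 = r1 - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mu) / sigma
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability


def benjamini_hochberg(pvals: list[float], alpha: float = 0.05) -> set[int]:
    """Indices of hypotheses rejected while controlling FDR at alpha."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, 1):
        if pvals[i] <= rank / m * alpha:
            k = rank
    return set(order[:k])
```

The BH step matters because a snapshot check compares several metrics at once; without it, running many tests at alpha = 0.05 inflates the false-alarm rate.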
### MCP security scanner

Audit MCP server configurations for 6 vulnerability classes: prompt injection, tool shadowing, data exfiltration, permission escalation, rug pull, and configuration weakness.

```bash
agentrial security scan --mcp-config servers.json
```
### Multi-agent evaluation

Evaluate multi-agent systems with delegation accuracy, handoff fidelity, redundancy rate, and cascade failure metrics.
### Pareto frontier analysis

Find the optimal cost-accuracy trade-off across models.

```bash
agentrial pareto --models claude-3-haiku,gpt-4o-mini,gemini-flash
```
### Prompt version control

Track, diff, and manage prompt versions with statistical comparison between versions.

```bash
agentrial prompt track prompts/v2.txt
agentrial prompt diff v1 v2
agentrial prompt list
```
### Agent Reliability Score

A composite 0-100 metric combining 6 weighted components: accuracy (40%), consistency (20%), cost efficiency (10%), latency (10%), trajectory quality (10%), and recovery (10%).

```bash
agentrial ars results.json
agentrial ars results.json --cost-ceiling 0.5
```
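Given those published weights, the composite is a plain weighted sum. A sketch assuming each component arrives already normalized to 0-1 (agentrial's actual normalization and API may differ):

```python
# Weights as documented: accuracy 40%, consistency 20%, the rest 10% each.
WEIGHTS = {
    "accuracy": 0.40,
    "consistency": 0.20,
    "cost_efficiency": 0.10,
    "latency": 0.10,
    "trajectory_quality": 0.10,
    "recovery": 0.10,
}


def reliability_score(components: dict[str, float]) -> float:
    """0-100 composite from six 0-1 component scores."""
    assert set(components) == set(WEIGHTS), "all six components required"
    return 100 * sum(WEIGHTS[k] * components[k] for k in WEIGHTS)
```

Because accuracy carries 40% of the weight, dropping it from 1.0 to 0.5 alone costs 20 points, while the same drop in latency costs only 5.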
### Benchmark registry

Publish evaluation results as verifiable, shareable benchmark files with SHA-256 integrity checksums.

```bash
agentrial publish results.json --agent-name my-agent --agent-version 1.0.0
agentrial verify --agent-name my-agent --agent-version 1.0.0 --suite-name my-suite
```
### Eval packs

Domain-specific evaluation packages distributed as Python packages via entry points. Install a pack, get specialized test suites and evaluators.

```bash
agentrial packs list   # Show installed packs
```
### Dashboard

Local FastAPI dashboard for browsing results, comparing runs, and tracking trends.

```bash
agentrial dashboard   # Start at http://localhost:8080
```
### VS Code extension
Browse test suites, run evaluations, view flame graphs, and compare snapshots — all from your editor.
Install from the VS Code Marketplace or search "agentrial" in VS Code extensions.
Features:
- Suite explorer sidebar with test case tree
- Run suites and individual test cases with one click
- Interactive trajectory flame graph visualization
- Snapshot comparison for regression detection
- MCP security scan integration
- Auto-refresh on YAML file changes
## Statistical Methods
agentrial uses real statistical tests, not simple averages.
| Method | What it does |
|---|---|
| Wilson score interval | Confidence intervals for pass rates — accurate at boundaries (0%, 100%) and small samples |
| Bootstrap resampling | CI for cost/latency — non-parametric, no normality assumption (500 iterations) |
| Fisher exact test | Regression detection — compares pass rates between two runs (p < 0.05) |
| Mann-Whitney U test | Compares cost/latency distributions between versions |
| Benjamini-Hochberg | Controls false discovery rate when comparing multiple metrics |
| CUSUM / Page-Hinkley | Sequential change-point detection for production monitoring |
| Kolmogorov-Smirnov | Distribution shift detection for cost and latency |
| Krippendorff's alpha | Inter-rater reliability for LLM-as-Judge with t-distribution CI |
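The bootstrap row in the table reduces to resampling with replacement and taking percentiles. A minimal percentile bootstrap (agentrial's version may add refinements such as bias correction):

```python
import random


def bootstrap_ci(samples, stat=lambda xs: sum(xs) / len(xs),
                 iters=500, alpha=0.05, seed=0):
    """Percentile bootstrap CI for a statistic (default: the mean).
    No normality assumption - useful for skewed latency distributions."""
    rng = random.Random(seed)  # fixed seed for reproducible intervals
    stats = sorted(
        stat([rng.choice(samples) for _ in samples]) for _ in range(iters)
    )
    lo = stats[int(iters * alpha / 2)]
    hi = stats[int(iters * (1 - alpha / 2)) - 1]
    return lo, hi


# hypothetical per-trial latencies (ms) with a heavy tail
latencies = [820, 950, 1100, 870, 2400, 910, 1005, 880, 940, 3100]
lo, hi = bootstrap_ci(latencies)  # CI for the mean latency
```

The heavy-tailed sample above is exactly where a normal-theory interval misleads and the bootstrap does not.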
### Failure attribution
When tests fail, agentrial analyzes trajectory divergence:
- Groups trials by pass/fail
- At each step, compares distribution of tool calls
- Fisher exact test identifies the step with significant divergence
- Reports the divergent step with a recommendation
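A toy version of that divergence search, using a two-proportion z-score as a stand-in for the Fisher exact test agentrial applies, with hypothetical trajectory and tool names:

```python
import math


def divergent_step(pass_trajs, fail_trajs):
    """Locate the step where tool choice differs most between passing and
    failing trials. Each trajectory is a list of tool names, one per step.
    Stand-in statistic: two-proportion z-score (agentrial uses a Fisher
    exact test at this point instead)."""
    depth = max(len(t) for t in pass_trajs + fail_trajs)
    best = None  # (z, step index, tool name)
    for i in range(depth):
        p_tools = [t[i] for t in pass_trajs if len(t) > i]
        f_tools = [t[i] for t in fail_trajs if len(t) > i]
        if not p_tools or not f_tools:
            continue
        for tool in set(p_tools) | set(f_tools):
            x1, n1 = p_tools.count(tool), len(p_tools)
            x2, n2 = f_tools.count(tool), len(f_tools)
            p = (x1 + x2) / (n1 + n2)
            if p in (0.0, 1.0):  # identical usage on both sides
                continue
            z = abs(x1 / n1 - x2 / n2) / math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
            if best is None or z > best[0]:
                best = (z, i, tool)
    return best


# hypothetical runs: passing trials book after searching, failing ones cancel
passing = [["search_flights", "book"]] * 8
failing = [["search_flights", "cancel"]] * 8
```

Here step 0 is identical across groups and contributes nothing; step 1 is where the trajectories fork, so that is the step a report would flag.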
## CLI Reference

```bash
agentrial init                                # Scaffold sample project
agentrial run                                 # Run all tests
agentrial run tests/ --trials 20              # Custom trials
agentrial run -o results.json                 # JSON export
agentrial run --flamegraph                    # Trajectory flame graphs
agentrial run --judge                         # LLM-as-Judge evaluation
agentrial compare results.json -b base.json   # Regression detection
agentrial baseline results.json               # Save baseline
agentrial snapshot update / check             # Snapshot testing
agentrial security scan --mcp-config c.json   # MCP security scan
agentrial pareto --models m1,m2,m3            # Cost-accuracy Pareto frontier
agentrial prompt track/diff/list              # Prompt version control
agentrial monitor --baseline snap.json        # Production drift detection
agentrial ars results.json                    # Agent Reliability Score
agentrial publish results.json --agent-name me --agent-version 1.0   # Publish benchmark
agentrial verify --agent-name me --agent-version 1.0 --suite-name s  # Verify integrity
agentrial packs list                          # List installed eval packs
agentrial dashboard                           # Start local dashboard
agentrial config                              # Show configuration
```
| Flag | Short | Description | Default |
|---|---|---|---|
| `--config` | `-c` | Config file path | `agentrial.yml` |
| `--trials` | `-n` | Trials per test case | 10 |
| `--threshold` | `-t` | Min pass rate (0-1) | 0.85 |
| `--output` | `-o` | JSON output path | — |
| `--json` | | JSON to stdout | false |
| `--flamegraph` | | Show trajectory flame graphs | false |
| `--html` | | Export flame graph HTML | — |
| `--judge` | | Enable LLM-as-Judge | false |
| `--update-snapshots` | | Save as snapshot baseline | false |
## How it compares

| | agentrial | Promptfoo | LangSmith | DeepEval | Arize Phoenix |
|---|---|---|---|---|---|
| Multi-trial with CI | Free | — | $39/mo | — | — |
| Confidence intervals | Wilson CI | — | — | — | — |
| Step-level failure attribution | Fisher exact | — | — | — | Partial |
| Framework-agnostic | 6 adapters + OTel | Yes | LangChain only | Yes | Yes |
| Cost-per-correct-answer | Yes | — | — | — | — |
| LLM-as-Judge with calibration | Krippendorff α | — | Yes | Yes | — |
| Composite reliability score | ARS (0-100) | — | — | — | — |
| MCP security scanning | 6 vuln classes | — | — | — | — |
| Production drift detection | CUSUM + PH + KS | — | — | — | Partial |
| VS Code extension | Yes | — | — | — | — |
| Local-first | Yes | Yes | No | No | Self-host option |
## Supported Frameworks
| Framework | Status | Notes |
|---|---|---|
| LangGraph | Native adapter | Full trajectory, callbacks, token tracking |
| CrewAI | Native adapter | Task-level trajectory, crew cost tracking |
| AutoGen | Native adapter | v0.4+ (autogen-agentchat) and legacy pyautogen |
| Pydantic AI | Native adapter | Tool calls, response parts, token usage |
| OpenAI Agents SDK | Native adapter | Runner integration, tool call capture |
| smolagents (HF) | Native adapter | Dict and object log formats |
| Any OTel-instrumented agent | Supported | Automatic span capture via OTel SDK |
| Custom | Supported | AgentInput -> AgentOutput protocol |
## Supported Models (cost tracking)
| Provider | Models |
|---|---|
| Anthropic | Claude 3 Haiku/Sonnet/Opus, Claude 3.5, Claude Sonnet 4.5, Claude Opus 4 |
| OpenAI | GPT-4o, GPT-4o-mini, GPT-4 Turbo, GPT-3.5 Turbo, o1, o3-mini |
| Gemini 2.0 Flash, Gemini 1.5 Pro/Flash, Gemini 1.0 Pro | |
| Mistral | Large, Medium, Small, Codestral, Pixtral |
| Meta | Llama 3.3 70B, Llama 3.1 405B/70B |
| DeepSeek | DeepSeek Chat, DeepSeek Reasoner |
## Contributing

```bash
git clone https://github.com/alepot55/agentrial.git
cd agentrial
pip install -e ".[dev]"

pytest            # 438 tests
ruff check .
mypy agentrial/
```

See CONTRIBUTING.md for details.