
agentrial

The pytest for AI agents. Statistical evaluation with confidence intervals and failure attribution.

PyPI · License: MIT · Python 3.11+

Your agent passes Monday, fails Wednesday. Same prompt, same model. agentrial tells you why.


Quickstart

pip install agentrial
agentrial init
agentrial run

That's it. You'll see real results in seconds:

╭──────────────────────────────────────────────────────────────────────╮
│ sample-demo - PASSED                                                 │
╰───────────────────────────────────────────────────── Threshold: 80% ─╯
┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Test Case        ┃ Pass Rate ┃ 95% CI           ┃ Avg Cost ┃ Avg Latency ┃
┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ greeting         │    100.0% │ (56.6%-100.0%)   │  $0.0000 │         0ms │
│ capital-france   │    100.0% │ (56.6%-100.0%)   │  $0.0000 │         0ms │
│ capital-japan    │    100.0% │ (56.6%-100.0%)   │  $0.0000 │         0ms │
│ basic-math       │    100.0% │ (56.6%-100.0%)   │  $0.0000 │         0ms │
└──────────────────┴───────────┴──────────────────┴──────────┴─────────────┘

Overall Pass Rate: 100.0% (85.0%-100.0%)
Total Cost: $0.0000

Replace sample_agent.py with your own agent, update tests/test_sample.yml, and you're evaluating real agents.


What it does

  • Multi-trial execution — Run every test N times automatically. A single pass means nothing for non-deterministic agents.
  • Wilson confidence intervals — Statistically accurate pass rates, even with small samples and extreme proportions (0% or 100%).
  • Step-level failure attribution — Pinpoints which tool call diverges between passing and failing runs using Fisher's exact test.
  • Real cost tracking — Actual API costs pulled from model metadata; 40+ models supported across Anthropic, OpenAI, Google, and Mistral.
  • Regression detection — Fisher's exact test catches reliability drops between versions and blocks PRs in CI when quality degrades.
  • Local-first — Your data never leaves your machine. No accounts, no SaaS, no telemetry.

Why this exists

Every agent framework ships with benchmarks showing 90%+ accuracy. But run those same agents 100 times on the same task, and you'll see pass rates drop to 60-80% with wide variance. The benchmarks measure one run; production sees thousands.

No existing tool gives you statistically rigorous, framework-agnostic agent testing that runs in CI/CD. LangSmith requires a paid account and locks you into LangChain. Promptfoo doesn't do multi-trial runs with confidence intervals. DeepEval and Arize don't do trajectory-level failure attribution. agentrial fills that gap: open-source, free, local-first, and it works with any agent framework.


How it compares

Feature                        agentrial   Promptfoo   LangSmith   DeepEval   Arize
Multi-trial with CI            Free        No          $39/mo      No         No
Confidence intervals           Yes         No          No          No         No
Trajectory step analysis       Yes         No          Partial     No         Yes
Failure attribution            Yes         No          No          No         No
Framework-agnostic (OTel)      Yes         Yes         No          Yes        Yes
Free CI/CD integration         Yes         Yes         No          No         No
Local-first (no data leaves)   Yes         Yes         No          No         No
Cost-per-correct-answer        Yes         No          No          No         No

Writing Tests

Tests are YAML files. Define what your agent receives and what it should produce:

suite: my-agent-tests
agent: my_module.agent       # Python import path to your wrapped agent
trials: 10
threshold: 0.85              # Minimum pass rate

cases:
  - name: basic-math
    input:
      query: "What is 15 * 37?"
    expected:
      output_contains: ["555"]
      tool_calls:
        - tool: calculate

  - name: capital-lookup
    input:
      query: "What is the capital of Japan?"
    expected:
      output_contains: ["Tokyo"]

  - name: error-handling
    input:
      query: "Divide 10 by 0"
    expected:
      output_contains_any: ["undefined", "cannot", "error"]
    max_cost: 0.05
    max_latency_ms: 5000

All assertion types

expected:
  output_contains: ["word1", "word2"]        # AND — all must be present
  output_contains_any: ["option1", "option2"] # OR — at least one
  exact_match: "exact output string"
  regex: "\\d+ results found"
  tool_calls:
    - tool: search
      params_contain:
        query: "expected term"

# Per-step expectations
step_expectations:
  - step_index: 0
    tool_name: search
    params_contain:
      query: "search term"
    output_contains: ["result"]

Test discovery

agentrial auto-discovers test files:

agentrial run tests/          # Finds test_*.yml, test_*.yaml, test_*.py
agentrial run agentrial.yml   # Run a specific file

Wrapping Your Agent

agentrial needs a callable: AgentInput -> AgentOutput. Use an adapter for your framework.

LangGraph

from langchain_anthropic import ChatAnthropic
from langchain_core.tools import tool
from langgraph.prebuilt import create_react_agent
from agentrial.runner.adapters import wrap_langgraph_agent

@tool
def calculate(expression: str) -> str:
    """Evaluate a math expression."""
    return str(eval(expression))  # demo only: eval is unsafe on untrusted input

llm = ChatAnthropic(model="claude-3-haiku-20240307", temperature=0)
graph = create_react_agent(llm, tools=[calculate])

# This is what your YAML points to
agent = wrap_langgraph_agent(graph)

The LangGraph adapter automatically captures the full trajectory, token usage, real API cost, and execution duration.

Custom agents

Implement the protocol directly:

from agentrial.types import AgentInput, AgentOutput, AgentMetadata

def agent(input: AgentInput) -> AgentOutput:
    # Your agent logic
    return AgentOutput(
        output="result",
        steps=[],
        metadata=AgentMetadata(total_tokens=100, cost=0.001, duration_ms=500.0),
        success=True,
    )

Fluent Assertion API

For Python-defined tests:

from agentrial import expect

result = agent(AgentInput(query="Book a flight to Rome"))

e = expect(result).succeeded() \
    .tool_called("search_flights", params_contain={"destination": "FCO"}) \
    .cost_below(0.15) \
    .latency_below(5000)

# Output checks return OutputExpectation (separate chain)
e.output.contains("confirmed", "Rome")

# Step checks return StepExpectation (separate chain)
e.step(0).tool_name("search_flights").params_contain(destination="FCO")

assert e.passed()

Method                                  Description
.succeeded()                            Agent completed without error
.output.contains(*strings)              Output contains all substrings
.output.equals(string)                  Exact match
.output.matches(regex)                  Regex match
.tool_called(name, params_contain={})   Tool was called with params
.step(i).tool_name(name)                Step i called the named tool
.step(i).params_contain(**kw)           Step i had params matching kw
.cost_below(max_usd)                    Cost under threshold
.latency_below(max_ms)                  Latency under threshold
.tokens_below(max_tokens)               Tokens under threshold
.trajectory_length(min, max)            Step count within bounds
.passed()                               Returns True if all checks pass
.get_failures()                         Returns failure messages

CLI Reference

agentrial init                          # Create sample project (ready to run)
agentrial run                           # Run all tests in current directory
agentrial run tests/                    # Run tests in specific directory
agentrial run --trials 20 --threshold 0.9  # Override settings
agentrial run -o results.json           # Export JSON report
agentrial run --json                    # JSON to stdout
agentrial run --flamegraph              # Show trajectory flame graphs
agentrial run --html flamegraph.html    # Export flame graph as HTML
agentrial run --judge                   # Enable LLM-as-Judge evaluation
agentrial run --update-snapshots        # Save snapshot baseline
agentrial compare results.json -b baseline.json  # Regression detection
agentrial baseline results.json         # Save baseline
agentrial config                        # Show configuration
agentrial snapshot update               # Run and save snapshot
agentrial snapshot check                # Compare against snapshot
agentrial security scan --mcp-config c.json  # MCP security scan
agentrial pareto --models m1,m2,m3      # Pareto frontier analysis
agentrial prompt track prompt.txt       # Track prompt version
agentrial prompt diff v1 v2             # Diff prompt versions
agentrial prompt list                   # List prompt versions
agentrial monitor --baseline snap.json  # Configure drift monitoring
agentrial dashboard                     # Launch web dashboard

Flag                 Short   Description                     Default
--config             -c      Config file path                agentrial.yml
--trials             -n      Trials per test case            10
--threshold          -t      Min pass rate (0-1)             0.85
--output             -o      JSON output path
--json                       JSON to stdout                  false
--flamegraph                 Show trajectory flame graphs    false
--html                       Export flame graph HTML
--judge                      Enable LLM-as-Judge             false
--update-snapshots           Save as snapshot baseline       false

CI/CD Integration

GitHub Actions

name: Agent Evaluation
on: [push, pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install agentrial && pip install -e .
      - run: agentrial run --trials 10 --threshold 0.85 -o results.json
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: agentrial-results
          path: results.json

Regression detection in CI

      - run: agentrial run -o results.json
      - run: agentrial compare results.json --baseline baseline.json

Fisher's exact test (p < 0.05) detects statistically significant regressions. Exit code 1 blocks the PR.
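
Conceptually, the comparison is a Fisher's exact test on a 2x2 table of pass/fail counts from the two runs. A minimal sketch with hypothetical counts, using scipy.stats.fisher_exact (agentrial's own implementation lives in metrics/statistical.py):

from scipy.stats import fisher_exact

# Hypothetical counts: baseline passed 19/20 trials, current run passed 11/20.
table = [[19, 1],   # baseline: pass, fail
         [11, 9]]   # current:  pass, fail
_, p_value = fisher_exact(table)
print(p_value)         # ~0.008: the drop is statistically significant
print(p_value < 0.05)  # True -> agentrial compare would exit 1 and block the PR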


Statistical Methods

agentrial uses real statistical tests, not simple averages.

Method                  What it does
Wilson score interval   Confidence intervals for pass rates — accurate at boundaries (0%, 100%) and with small samples
Bootstrap resampling    CIs for cost/latency — non-parametric, no normality assumption (500 iterations)
Fisher's exact test     Regression detection — compares pass rates between two runs (p < 0.05)
Mann-Whitney U test     Compares cost/latency distributions between versions
Benjamini-Hochberg      Controls the false discovery rate when comparing multiple metrics
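
For intuition, here is a self-contained Wilson score interval: the standard closed form with z = 1.96 for 95% confidence, not agentrial's internal code (that lives in metrics/statistical.py):

import math

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion."""
    if trials == 0:
        return (0.0, 1.0)
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return (max(0.0, center - margin), min(1.0, center + margin))

print(wilson_interval(5, 5))  # (0.5655..., 1.0): 5/5 passes still leaves real uncertainty

Note that five passes out of five gives exactly the (56.6%-100.0%) interval shown in the quickstart table above: a perfect score on a handful of trials is weak evidence.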

Failure attribution

When tests fail, agentrial analyzes trajectory divergence (sketched below):

  1. Groups trials by pass/fail
  2. At each step, compares the distribution of tool calls
  3. Fisher's exact test identifies the step with significant divergence
  4. Reports the divergent step with a recommendation
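
A simplified sketch of that loop, using hypothetical tool sequences and scipy.stats.fisher_exact (the real implementation lives in metrics/trajectory.py and is more general):

from collections import Counter
from scipy.stats import fisher_exact

# Hypothetical tool sequences from 6 passing and 6 failing trials.
passed = [["search", "calculate", "respond"]] * 6
failed = [["search", "respond", "respond"]] * 6

for step in range(3):
    # Most common tool among passing trials at this step.
    tool = Counter(t[step] for t in passed).most_common(1)[0][0]
    table = [
        [sum(t[step] == tool for t in passed), sum(t[step] != tool for t in passed)],
        [sum(t[step] == tool for t in failed), sum(t[step] != tool for t in failed)],
    ]
    _, p = fisher_exact(table)
    if p < 0.05:
        print(f"step {step}: passing runs call '{tool}', failing runs do not (p={p:.4f})")
# -> step 1: passing runs call 'calculate', failing runs do not (p=0.0022)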

Architecture

agentrial/
├── cli.py                  # Click CLI — run, compare, baseline, config, init, etc.
├── config.py               # YAML config loading and test file discovery
├── types.py                # AgentInput, AgentOutput, TestCase, Suite, etc.
├── snapshots.py            # Statistical snapshot testing and comparison
├── pareto.py               # Cost-accuracy Pareto frontier analysis
├── prompts.py              # Prompt version control (track, diff, list)
├── monitor.py              # Production drift detection (CUSUM, Page-Hinkley, KS)
├── pytest_plugin.py        # @agent_test decorator for pytest integration
├── runner/
│   ├── engine.py           # MultiTrialEngine — orchestrates N trials per test
│   ├── trajectory.py       # TrajectoryRecorder — captures steps, tokens, cost
│   ├── otel.py             # OpenTelemetry span capture for any framework
│   └── adapters/
│       ├── base.py         # BaseAdapter protocol + FunctionAdapter
│       ├── langgraph.py    # LangGraph adapter (callbacks + trajectory)
│       ├── crewai.py       # CrewAI adapter
│       ├── autogen.py      # AutoGen adapter (v0.4+ and legacy)
│       ├── pydantic_ai.py  # Pydantic AI adapter
│       ├── openai_agents.py # OpenAI Agents SDK adapter
│       ├── smolagents.py   # Hugging Face smolagents adapter
│       └── pricing.py      # Model pricing for 40+ LLMs
├── evaluators/
│   ├── exact.py            # contains, regex, tool_called, exact_match
│   ├── expect.py           # Fluent assertion API
│   ├── functional.py       # Custom check functions, range checks
│   ├── llm_judge.py        # Calibrated LLM-as-Judge evaluator
│   ├── multi_agent.py      # Multi-agent evaluation
│   └── step_eval.py        # Per-step and trajectory evaluation
├── metrics/
│   ├── basic.py            # Pass rate, cost, latency, token efficiency
│   ├── statistical.py      # Wilson CI, bootstrap, Fisher, Mann-Whitney, BH
│   └── trajectory.py       # Failure attribution via divergence analysis
├── reporters/
│   ├── terminal.py         # Rich terminal output
│   ├── json_report.py      # JSON export, load, comparison
│   └── flamegraph.py       # Trajectory flame graphs (terminal + HTML)
├── security/
│   └── scanner.py          # MCP security scanner (5 vulnerability classes)
└── dashboard/
    ├── app.py              # FastAPI cloud dashboard
    ├── models.py           # Dashboard data models
    └── store.py            # Persistent storage backend

Supported Frameworks

Framework                    Status           Notes
LangGraph                    Native adapter   Full trajectory, callbacks, token tracking
CrewAI                       Native adapter   Task-level trajectory, crew cost tracking
AutoGen                      Native adapter   v0.4+ (autogen-agentchat) and legacy pyautogen
Pydantic AI                  Native adapter   Tool calls, response parts, token usage
OpenAI Agents SDK            Native adapter   Runner integration, tool call capture
smolagents (HF)              Native adapter   Dict and object log formats
Any OTel-instrumented agent  Supported        Automatic span capture via OTel SDK
Custom                       Supported        AgentInput -> AgentOutput protocol
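
For frameworks without a native adapter, agentrial captures OpenTelemetry spans (runner/otel.py in the architecture above). A minimal sketch of a tool emitting spans via the standard OTel SDK; the span and attribute names here are illustrative assumptions, not a documented agentrial convention:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Standard OTel SDK setup; agentrial's span capture hooks in at this layer.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("my_agent")

def search(query: str) -> str:
    # Span and attribute names are assumptions for illustration only.
    with tracer.start_as_current_span("tool.search") as span:
        span.set_attribute("tool.params.query", query)
        result = f"results for {query}"
        span.set_attribute("tool.output", result)
        return result

search("capital of Japan")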

Supported Models (cost tracking)

Provider    Models
Anthropic   Claude 3 Haiku/Sonnet/Opus, Claude 3.5, Claude 4
OpenAI      GPT-4o, GPT-4o-mini, GPT-4 Turbo, GPT-3.5 Turbo
Google      Gemini 1.5 Pro/Flash, Gemini 1.0 Pro
Mistral     Large, Medium, Small
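
Cost tracking is token counts multiplied by per-model prices (runner/adapters/pricing.py in the architecture above). A sketch of the arithmetic with placeholder prices; these numbers are hypothetical, not agentrial's actual pricing table:

# Hypothetical USD prices per 1M tokens (placeholders, not real rates).
PRICING = {"example-model": {"input": 0.25, "output": 1.25}}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    p = PRICING[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

print(f"${estimate_cost('example-model', 1200, 300):.6f}")  # $0.000675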

Contributing

git clone https://github.com/alepot55/agentrial.git
cd agentrial
pip install -e ".[dev]"
pytest
ruff check .
mypy agentrial/

See CONTRIBUTING.md for details.


License

MIT
