
agentrial

Statistical evaluation framework for AI agents - pytest for agent trajectories

License: MIT | Python 3.11+

Your agent passes Monday, fails Wednesday. agentrial tells you why.

AI agents are non-deterministic. A single test run tells you nothing. agentrial runs your agent N times, computes confidence intervals on pass rates, tracks real API costs, and pinpoints which step in the trajectory causes failures — so you ship agents that work reliably, not just once.

╭─────────────────────────────────────────────────────────────────────────────╮
│ my-agent-tests - PASSED                                                     │
╰───────────────────────────────────────────────────────── Threshold: 85.0% ─╯
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Test Case              ┃ Pass Rate┃ 95% CI          ┃ Avg Cost ┃ Avg Latency┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━┩
│ easy-multiply          │   100.0% │ (72.2%-100.0%)  │  $0.0005 │      1.59s │
│ medium-inference       │   100.0% │ (72.2%-100.0%)  │  $0.0006 │      2.61s │
│ hard-multi-step        │   100.0% │ (72.2%-100.0%)  │  $0.0010 │      3.52s │
└────────────────────────┴──────────┴─────────────────┴──────────┴────────────┘

Overall Pass Rate: 100.0% (72.2%-100.0%)
Total Cost: $0.06

Why agentrial?

Testing AI agents is fundamentally different from testing deterministic software. The same input can produce different tool calls, different reasoning paths, and different outputs. A single "it passed" run is meaningless.

agentrial solves this with:

| Feature | What it does |
| --- | --- |
| Multi-trial execution | Runs every test N times automatically |
| Wilson confidence intervals | Statistically accurate pass rates, even with small sample sizes |
| Step-level failure attribution | Identifies which tool call diverges between passed and failed runs |
| Real cost tracking | Computes actual API costs from model metadata (supports 40+ models) |
| Regression detection | Fisher exact test catches reliability drops between versions |
| CI/CD integration | GitHub Action that blocks PRs when agent quality degrades |

Quick Start

Install

pip install agentrial

For LangGraph support:

pip install agentrial[langgraph]

Create a test file

Create agentrial.yml in your project root:

suite: my-agent-tests
agent: my_module.agent       # Python import path to your wrapped agent
trials: 10
threshold: 0.85              # Minimum pass rate to consider the suite passing

cases:
  - name: basic-math
    input:
      query: "What is 15 * 37?"
    expected:
      output_contains: ["555"]
      tool_calls:
        - tool: calculate

Run

agentrial run

Or initialize a sample project:

agentrial init

Writing Tests

Tests are defined in YAML files (agentrial.yml, or any test_*.yml / test_*.yaml file).

Suite-level configuration

suite: my-suite              # Suite name
agent: my_module.agent       # Python import path to wrapped agent callable
trials: 10                   # Number of trials per test case
threshold: 0.85              # Minimum overall pass rate (0.0 - 1.0)

Test case options

Each test case supports a range of assertion types:

cases:
  - name: test-name
    input:
      query: "User question"
      context:                   # Optional context dict passed to agent
        user_id: "123"

    expected:
      # String matching (AND logic — all must be present)
      output_contains: ["expected", "words"]

      # String matching (OR logic — at least one must be present)
      output_contains_any: ["option1", "option2"]

      # Regex pattern
      regex: "\\d+ results found"

      # Tool call assertions
      tool_calls:
        - tool: search
          params_contain:
            query: "expected search term"

    # Cost and latency constraints (per trial)
    max_cost: 0.05
    max_latency: 5000          # milliseconds

    # Step-level expectations
    step_expectations:
      - step_index: 0
        tool_name: search
        params_contain:
          query: "search term"
        output_contains: ["result"]

Multiple test files

agentrial auto-discovers test files in the given directory:

agentrial run tests/          # Discovers test_*.yml, test_*.yaml, test_*.py
agentrial run agentrial.yml   # Run a specific file

Wrapping Your Agent

agentrial needs a callable that takes AgentInput and returns AgentOutput. Use an adapter to wrap your framework's agent.

LangGraph

from langchain_anthropic import ChatAnthropic
from langchain_core.tools import tool
from langgraph.prebuilt import create_react_agent
from agentrial.runner.adapters import wrap_langgraph_agent

@tool
def calculate(expression: str) -> str:
    """Evaluate a math expression."""
    return str(eval(expression))

llm = ChatAnthropic(model="claude-3-haiku-20240307", temperature=0)
graph = create_react_agent(llm, tools=[calculate])

# Export this — it's what agentrial.yml points to
agent = wrap_langgraph_agent(graph)

Then reference it in your YAML:

agent: my_module.agent

The LangGraph adapter automatically captures the following (a quick smoke test follows this list):

  • Full trajectory (every tool call and LLM response)
  • Token usage per step
  • Real API cost from model pricing data
  • Execution duration
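
To sanity-check the wrapper before running a full suite, you can call it directly. A minimal sketch, assuming my_module exports the agent defined above (the field names follow the AgentOutput and AgentMetadata types shown in the next section):

from agentrial.types import AgentInput
from my_module import agent  # the wrap_langgraph_agent(...) callable above

# Run a single trial by hand and inspect the captured metadata.
result = agent(AgentInput(query="What is 15 * 37?"))
print(result.output)                   # final answer text
print(result.success)                  # True if the run completed
print(f"${result.metadata.cost:.4f}")  # real API cost for this run
print(f"{result.metadata.duration_ms:.0f} ms, {result.metadata.total_tokens} tokens")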

Custom agents

Implement the agent protocol directly:

from agentrial.types import AgentInput, AgentOutput, AgentMetadata

def agent(input: AgentInput) -> AgentOutput:
    # Your agent logic here
    return AgentOutput(
        output="result",
        steps=[],
        metadata=AgentMetadata(
            total_tokens=100,
            cost=0.001,
            duration_ms=500.0,
        ),
        success=True,
    )

CLI Reference

# Run all tests in current directory
agentrial run

# Run specific file or directory
agentrial run tests/
agentrial run agentrial.yml

# Override trial count and threshold
agentrial run --trials 20 --threshold 0.9

# Export results to JSON
agentrial run -o results.json

# Output JSON to stdout
agentrial run --json

# Compare current results against a baseline
agentrial compare results.json --baseline baseline.json

# Save a baseline
agentrial baseline results.json -o baseline.json

# Show current configuration
agentrial config

# Initialize sample project
agentrial init

Options

| Flag | Short | Description | Default |
| --- | --- | --- | --- |
| --config | -c | Path to config file | agentrial.yml |
| --trials | -n | Number of trials per test case | 10 |
| --threshold | -t | Minimum pass rate (0.0-1.0) | 0.85 |
| --output | -o | JSON output file path | (none) |
| --json | | Output JSON to stdout | false |

Fluent Assertion API

For Python-defined tests, agentrial provides a chainable assertion builder:

from agentrial import expect

result = agent(AgentInput(query="Book a flight to Rome"))

# Chain assertions fluently
expect(result) \
    .succeeded() \
    .output.contains("confirmed", "Rome") \
    .tool_called("search_flights") \
    .step(0).params_contain(destination="FCO") \
    .cost_below(0.15) \
    .latency_below(5000) \
    .tokens_below(3000) \
    .trajectory_length(min_steps=2, max_steps=10)

Available assertions

| Method | Description |
| --- | --- |
| .succeeded() | Agent execution completed without error |
| .output.contains(*strings) | Output contains all substrings (AND) |
| .output.equals(string) | Output exactly matches string |
| .output.matches(regex) | Output matches regex pattern |
| .output.length_between(min, max) | Output length within bounds |
| .tool_called(name, params_contain={}) | Specific tool was called with params |
| .step(i).tool_name(name) | Step at index is a call to the named tool |
| .step(i).params_contain(**kv) | Step parameters contain expected values |
| .step(i).output_contains(*strings) | Step output contains substrings |
| .cost_below(max_usd) | Total cost under threshold |
| .latency_below(max_ms) | Total latency under threshold |
| .tokens_below(max_tokens) | Total tokens under threshold |
| .trajectory_length(min, max) | Number of steps within bounds |
| .passed() | Returns True if all assertions passed |
| .get_failures() | Returns list of failure messages |
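
For scripted checks, passed() and get_failures() let you collect outcomes instead of asserting. A minimal sketch, assuming the chained calls return the builder (as the example above suggests) and reusing the result from that example:

from agentrial import expect

# `result` is an AgentOutput from a wrapped agent call, as in the example above.
check = (
    expect(result)
    .succeeded()
    .tool_called("search_flights")
    .cost_below(0.15)
)

if not check.passed():
    for failure in check.get_failures():
        print(f"FAIL: {failure}")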

CI/CD Integration

GitHub Actions (simple)

Add to .github/workflows/agent-eval.yml:

name: Agent Evaluation
on: [push, pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install agentrial[langgraph] langchain-anthropic
      - run: pip install -e .
      - run: agentrial run --trials 10 --threshold 0.85 -o results.json
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: agentrial-results
          path: results.json

Regression detection in CI

Run against a saved baseline to catch reliability drops:

      - run: agentrial run -o results.json
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
      - run: agentrial compare results.json --baseline baseline.json

The compare command uses Fisher's exact test (p < 0.05) to detect statistically significant regressions, and exits with code 1 when one is found, failing the CI job.


Statistical Methods

agentrial uses real statistical tests, not simple averages.

Pass rate confidence intervals

Wilson score interval — accurate for small sample sizes and extreme proportions (0% or 100%), unlike the normal approximation, which fails at the boundaries.

10 trials, 9 passes → 90.0% (59.6% - 98.2%)
10 trials, 10 passes → 100.0% (72.2% - 100.0%)
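
The interval itself is a closed-form expression. A standalone sketch (not agentrial's internal code) that reproduces the numbers above:

import math

def wilson_ci(passes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial pass rate."""
    p = passes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return max(0.0, center - half), min(1.0, center + half)

print(wilson_ci(9, 10))   # ~ (0.596, 0.982)
print(wilson_ci(10, 10))  # ~ (0.722, 1.000)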

Cost and latency confidence intervals

Bootstrap resampling (500 iterations) — non-parametric, no normality assumption required. Reports mean with 95% CI for cost and latency metrics.
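
The idea in miniature, as a standalone sketch with made-up latency numbers (agentrial's implementation may differ in detail):

import random
import statistics

def bootstrap_mean_ci(values: list[float], iterations: int = 500, alpha: float = 0.05):
    """Percentile bootstrap CI for the mean; no normality assumption."""
    means = sorted(
        statistics.mean(random.choices(values, k=len(values)))  # resample with replacement
        for _ in range(iterations)
    )
    lo = means[int(alpha / 2 * iterations)]
    hi = means[int((1 - alpha / 2) * iterations) - 1]
    return statistics.mean(values), (lo, hi)

latencies_ms = [1590.0, 1720.0, 1480.0, 2010.0, 1650.0, 1540.0, 1880.0, 1600.0, 1710.0, 1570.0]
mean, (lo, hi) = bootstrap_mean_ci(latencies_ms)
print(f"mean {mean:.0f} ms, 95% CI ({lo:.0f}, {hi:.0f})")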

Regression detection

| Test | Use case |
| --- | --- |
| Fisher exact test | Compares pass rates between two runs (p < 0.05) |
| Mann-Whitney U test | Compares cost/latency distributions |
| Benjamini-Hochberg correction | Controls false discovery rate when comparing multiple metrics |
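
What the pass-rate comparison looks like in isolation, sketched with scipy (the compare command performs this kind of test; exact internals may differ):

from scipy.stats import fisher_exact

# Pass/fail counts for one test case across two runs:
# baseline 10/10 passed, current 4/10 passed.
table = [[10, 0],  # baseline: passes, failures
         [4, 6]]   # current:  passes, failures

_, p_value = fisher_exact(table)  # two-sided by default
print(f"p = {p_value:.3f}")       # ~0.011 here
if p_value < 0.05:
    print("statistically significant regression")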

Step-level failure attribution

When tests fail, agentrial analyzes trajectory divergence between passing and failing trials (a simplified sketch follows these steps):

  1. Groups trials by pass/fail
  2. At each step, compares the distribution of tool calls
  3. Uses Fisher exact test to identify the step with statistically significant divergence
  4. Reports the divergent step with a human-readable recommendation
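
A toy version of the core idea (the real analysis lives in metrics/trajectory.py and handles more than tool names):

from collections import Counter
from scipy.stats import fisher_exact

def divergent_step(passed: list[list[str]], failed: list[list[str]], alpha: float = 0.05):
    """Each trajectory is a list of tool names, one per step. Returns the
    first (step, tool, p) where tool usage differs significantly by outcome."""
    max_len = max(len(t) for t in passed + failed)
    for i in range(max_len):
        p_tools = Counter(t[i] for t in passed if len(t) > i)
        f_tools = Counter(t[i] for t in failed if len(t) > i)
        for tool in set(p_tools) | set(f_tools):
            # 2x2 table: trials that used `tool` at step i vs. not, by outcome
            table = [
                [p_tools[tool], sum(p_tools.values()) - p_tools[tool]],
                [f_tools[tool], sum(f_tools.values()) - f_tools[tool]],
            ]
            _, p = fisher_exact(table)
            if p < alpha:
                return i, tool, p
    return None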

Architecture

agentrial/
├── cli.py                  # Click CLI — run, compare, baseline, config, init
├── config.py               # YAML config loading and test file discovery
├── types.py                # Dataclasses: AgentInput, AgentOutput, TestCase, etc.
├── runner/
│   ├── engine.py           # MultiTrialEngine — orchestrates N trials per test
│   ├── trajectory.py       # TrajectoryRecorder — captures steps, tokens, cost
│   ├── otel.py             # OpenTelemetry span capture for any framework
│   └── adapters/
│       ├── base.py         # BaseAdapter protocol
│       ├── langgraph.py    # LangGraph adapter with callback-based capture
│       └── pricing.py      # Model pricing for 40+ LLMs
├── evaluators/
│   ├── exact.py            # contains, regex, tool_called, exact_match
│   ├── expect.py           # Fluent assertion API
│   ├── functional.py       # Custom check functions, range checks
│   └── step_eval.py        # Per-step and trajectory evaluation
├── metrics/
│   ├── basic.py            # Pass rate, cost, latency, token efficiency
│   ├── statistical.py      # Wilson CI, bootstrap, Fisher, Mann-Whitney, BH
│   └── trajectory.py       # Failure attribution via divergence analysis
└── reporters/
    ├── terminal.py         # Rich terminal output with colored tables
    └── json_report.py      # JSON export, load, and comparison

Real-World Results

Tested with Claude 3 Haiku on a 3-tool agent (calculator, country lookup, temperature conversion) — 100 trials:

| Test complexity | Pass Rate | 95% CI | Avg Cost | Avg Latency | Avg Tokens |
| --- | --- | --- | --- | --- | --- |
| Easy (direct tool call) | 100% | 72.2% - 100% | $0.0005 | 1.6s | 1,513 |
| Medium (inference + tool) | 100% | 72.2% - 100% | $0.0006 | 2.6s | 1,926 |
| Hard (multi-step reasoning) | 100% | 72.2% - 100% | $0.0010 | 3.5s | 2,986 |

100 trials total. $0.06 total cost. Full trajectory capture.

See the complete example in examples/langgraph_haiku/.


Supported Models

agentrial has built-in pricing data for cost tracking across major providers:

| Provider | Models |
| --- | --- |
| Anthropic | Claude 3 Haiku, Sonnet, Opus (all versions), Claude 3.5, Claude 4 |
| OpenAI | GPT-4o, GPT-4o-mini, GPT-4 Turbo, GPT-4, GPT-3.5 Turbo |
| Google | Gemini 1.5 Pro, Gemini 1.5 Flash, Gemini 1.0 Pro |
| Mistral | Large, Medium, Small |

Cost is extracted automatically from model response metadata. No configuration needed.
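
The arithmetic behind a per-trial cost is just token counts times per-token rates (agentrial ships its own table in runner/adapters/pricing.py). An illustrative sketch with a hypothetical pricing entry — check your provider for current rates:

# Hypothetical rates in USD per million tokens -- illustrative values only.
PRICING = {"claude-3-haiku-20240307": {"input": 0.25, "output": 1.25}}

def trial_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one trial from token counts and per-million-token rates."""
    rates = PRICING[model]
    return (input_tokens * rates["input"] + output_tokens * rates["output"]) / 1_000_000

print(f"${trial_cost('claude-3-haiku-20240307', 1200, 300):.4f}")  # $0.0007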


Supported Frameworks

| Framework | Status | Notes |
| --- | --- | --- |
| LangGraph | Native adapter | Full trajectory capture, callbacks, token tracking |
| Any OpenTelemetry-instrumented agent | Supported | Automatic span capture via OTel SDK |
| Custom | Supported | Implement AgentInput -> AgentOutput protocol |

Contributing

# Clone and install
git clone https://github.com/alepot55/agentrial.git
cd agentrial
pip install -e ".[dev]"

# Run tests
pytest

# Lint
ruff check .

# Type check
mypy agentrial/

License

MIT
