# agentrial

Statistical evaluation framework for AI agents: the pytest for agent trajectories, with confidence intervals and failure attribution.

Your agent passes Monday, fails Wednesday. Same prompt, same model. agentrial tells you why.
## Quickstart

```bash
pip install agentrial
agentrial init
agentrial run
```

That's it. You'll see real results in seconds:
```text
╭──────────────────────────────────────────────────────────────────────╮
│ sample-demo - PASSED                                                 │
╰───────────────────────────────────────────────────── Threshold: 80% ─╯
┏━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Test Case      ┃ Pass Rate ┃ 95% CI         ┃ Avg Cost ┃ Avg Latency ┃
┡━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ greeting       │ 100.0%    │ (56.6%-100.0%) │ $0.0000  │ 0ms         │
│ capital-france │ 100.0%    │ (56.6%-100.0%) │ $0.0000  │ 0ms         │
│ capital-japan  │ 100.0%    │ (56.6%-100.0%) │ $0.0000  │ 0ms         │
│ basic-math     │ 100.0%    │ (56.6%-100.0%) │ $0.0000  │ 0ms         │
└────────────────┴───────────┴────────────────┴──────────┴─────────────┘

Overall Pass Rate: 100.0% (85.0%-100.0%)
Total Cost: $0.0000
```
Replace `sample_agent.py` with your own agent, update `tests/test_sample.yml`, and you're evaluating real agents.
## What it does

- **Multi-trial execution** — Run every test N times automatically. A single pass means nothing for non-deterministic agents (see the sketch after this list).
- **Wilson confidence intervals** — Statistically accurate pass rates, even with small samples and extreme proportions (0% or 100%).
- **Step-level failure attribution** — Pinpoints which tool call diverges between passing and failing runs using the Fisher exact test.
- **Real cost tracking** — Actual API costs from model metadata, with 40+ models supported across Anthropic, OpenAI, Google, and Mistral.
- **Regression detection** — The Fisher exact test catches reliability drops between versions and blocks PRs in CI when quality degrades.
- **Local-first** — Your data never leaves your machine. No accounts, no SaaS, no telemetry.
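To see why multiple trials matter, here is a minimal, self-contained sketch (plain Python, independent of agentrial) that simulates a roughly 70%-reliable agent: any single run is close to a coin flip, while repeated trials reveal the real pass rate and its spread.

```python
import random

random.seed(0)

def flaky_agent() -> bool:
    """Stand-in for a non-deterministic agent that succeeds ~70% of the time."""
    return random.random() < 0.7

# One trial tells you almost nothing.
print("single run:", flaky_agent())

# Ten independent runs of 20 trials each show the spread you'd see in CI.
for run in range(10):
    passes = sum(flaky_agent() for _ in range(20))
    print(f"run {run}: pass rate {passes / 20:.0%}")
```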
## Why this exists
Every agent framework ships with benchmarks showing 90%+ accuracy. But run those same agents 100 times on the same task, and you'll see pass rates drop to 60-80% with wide variance. The benchmarks measure one run; production sees thousands.
No existing tool gives you statistically rigorous, framework-agnostic agent testing that runs in CI/CD. LangSmith requires a paid account and locks you to LangChain. Promptfoo doesn't do multi-trial with confidence intervals. DeepEval and Arize don't do trajectory-level failure attribution. agentrial fills that gap: open-source, free, local-first, works with any agent framework.
## How it compares
| Feature | agentrial | Promptfoo | LangSmith | DeepEval | Arize |
|---|---|---|---|---|---|
| Multi-trial with CI | Free | No | $39/mo | No | No |
| Confidence intervals | Yes | No | No | No | No |
| Trajectory step analysis | Yes | No | Partial | No | Yes |
| Failure attribution | Yes | No | No | No | No |
| Framework-agnostic (OTel) | Yes | Yes | No | Yes | Yes |
| Free CI/CD integration | Yes | Yes | No | No | No |
| Local-first (no data leaves) | Yes | Yes | No | No | No |
| Cost-per-correct-answer | Yes | No | No | No | No |
## Writing Tests

Tests are YAML files. Define what your agent receives and what it should produce:

```yaml
suite: my-agent-tests
agent: my_module.agent  # Python import path to your wrapped agent
trials: 10
threshold: 0.85  # Minimum pass rate

cases:
  - name: basic-math
    input:
      query: "What is 15 * 37?"
    expected:
      output_contains: ["555"]
      tool_calls:
        - tool: calculate

  - name: capital-lookup
    input:
      query: "What is the capital of Japan?"
    expected:
      output_contains: ["Tokyo"]

  - name: error-handling
    input:
      query: "Divide 10 by 0"
    expected:
      output_contains_any: ["undefined", "cannot", "error"]
    max_cost: 0.05
    max_latency_ms: 5000
```
### All assertion types

```yaml
expected:
  output_contains: ["word1", "word2"]          # AND — all must be present
  output_contains_any: ["option1", "option2"]  # OR — at least one
  exact_match: "exact output string"
  regex: "\\d+ results found"
  tool_calls:
    - tool: search
      params_contain:
        query: "expected term"
  # Per-step expectations
  step_expectations:
    - step_index: 0
      tool_name: search
      params_contain:
        query: "search term"
      output_contains: ["result"]
```
### Test discovery

agentrial auto-discovers test files:

```bash
agentrial run tests/          # Finds test_*.yml, test_*.yaml, test_*.py
agentrial run agentrial.yml   # Run a specific file
```
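Discovered `test_*.py` files can exercise an agent with ordinary pytest tests built on the fluent assertion API documented below. A minimal sketch, where `my_module` is a placeholder for your own agent module:

```python
# tests/test_capitals.py
from agentrial import expect
from agentrial.types import AgentInput

from my_module import agent  # placeholder: your wrapped agent


def test_capital_of_japan():
    result = agent(AgentInput(query="What is the capital of Japan?"))
    e = expect(result).succeeded()
    e.output.contains("Tokyo")
    assert e.passed(), e.get_failures()
```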
## Wrapping Your Agent

agentrial needs a callable: `AgentInput -> AgentOutput`. Use an adapter for your framework.

### LangGraph

```python
from langchain_anthropic import ChatAnthropic
from langchain_core.tools import tool
from langgraph.prebuilt import create_react_agent

from agentrial.runner.adapters import wrap_langgraph_agent


@tool
def calculate(expression: str) -> str:
    """Evaluate a math expression."""
    return str(eval(expression))


llm = ChatAnthropic(model="claude-3-haiku-20240307", temperature=0)
graph = create_react_agent(llm, tools=[calculate])

# This is what your YAML points to
agent = wrap_langgraph_agent(graph)
```

The LangGraph adapter automatically captures the full trajectory, token usage, real API cost, and execution duration.
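If the snippet above lives in a module such as `my_agent.py` (a hypothetical name), the suite YAML references the wrapped callable by import path, just as in the test-writing example earlier:

```yaml
suite: langgraph-tests
agent: my_agent.agent  # import path to the wrapped agent above
trials: 10
```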
### Custom agents

Implement the protocol directly:

```python
from agentrial.types import AgentInput, AgentOutput, AgentMetadata


def agent(input: AgentInput) -> AgentOutput:
    # Your agent logic
    return AgentOutput(
        output="result",
        steps=[],
        metadata=AgentMetadata(total_tokens=100, cost=0.001, duration_ms=500.0),
        success=True,
    )
```
## Fluent Assertion API

For Python-defined tests:

```python
from agentrial import expect
from agentrial.types import AgentInput

result = agent(AgentInput(query="Book a flight to Rome"))

e = expect(result).succeeded() \
    .tool_called("search_flights", params_contain={"destination": "FCO"}) \
    .cost_below(0.15) \
    .latency_below(5000)

# Output checks return OutputExpectation (separate chain)
e.output.contains("confirmed", "Rome")

# Step checks return StepExpectation (separate chain)
e.step(0).tool_name("search_flights").params_contain(destination="FCO")

assert e.passed()
```
| Method | Description |
|---|---|
| `.succeeded()` | Agent completed without error |
| `.output.contains(*strings)` | Output contains all substrings |
| `.output.equals(string)` | Exact match |
| `.output.matches(regex)` | Regex match |
| `.tool_called(name, params_contain={})` | Tool was called with params |
| `.step(i).tool_name(name)` | Step i called named tool |
| `.step(i).params_contain(**kw)` | Step i had params matching kw |
| `.cost_below(max_usd)` | Cost under threshold |
| `.latency_below(max_ms)` | Latency under threshold |
| `.tokens_below(max_tokens)` | Tokens under threshold |
| `.trajectory_length(min, max)` | Step count within bounds |
| `.passed()` | Returns True if all checks pass |
| `.get_failures()` | Returns failure messages |
## CLI Reference

```bash
agentrial init                                   # Create sample project (ready to run)
agentrial run                                    # Run all tests in current directory
agentrial run tests/                             # Run tests in specific directory
agentrial run --trials 20 --threshold 0.9        # Override settings
agentrial run -o results.json                    # Export JSON report
agentrial run --json                             # JSON to stdout
agentrial run --flamegraph                       # Show trajectory flame graphs
agentrial run --html flamegraph.html             # Export flame graph as HTML
agentrial run --judge                            # Enable LLM-as-Judge evaluation
agentrial run --update-snapshots                 # Save snapshot baseline
agentrial compare results.json -b baseline.json  # Regression detection
agentrial baseline results.json                  # Save baseline
agentrial config                                 # Show configuration
agentrial snapshot update                        # Run and save snapshot
agentrial snapshot check                         # Compare against snapshot
agentrial security scan --mcp-config c.json      # MCP security scan
agentrial pareto --models m1,m2,m3               # Pareto frontier analysis
agentrial prompt track prompt.txt                # Track prompt version
agentrial prompt diff v1 v2                      # Diff prompt versions
agentrial prompt list                            # List prompt versions
agentrial monitor --baseline snap.json           # Configure drift monitoring
agentrial dashboard                              # Launch web dashboard
```
| Flag | Short | Description | Default |
|---|---|---|---|
| `--config` | `-c` | Config file path | `agentrial.yml` |
| `--trials` | `-n` | Trials per test case | 10 |
| `--threshold` | `-t` | Min pass rate (0-1) | 0.85 |
| `--output` | `-o` | JSON output path | — |
| `--json` | | JSON to stdout | false |
| `--flamegraph` | | Show trajectory flame graphs | false |
| `--html` | | Export flame graph HTML | — |
| `--judge` | | Enable LLM-as-Judge | false |
| `--update-snapshots` | | Save as snapshot baseline | false |
## CI/CD Integration

### GitHub Actions

```yaml
name: Agent Evaluation
on: [push, pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install agentrial && pip install -e .
      - run: agentrial run --trials 10 --threshold 0.85 -o results.json
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: agentrial-results
          path: results.json
```
### Regression detection in CI

```yaml
- run: agentrial run -o results.json
- run: agentrial compare results.json --baseline baseline.json
```

Fisher's exact test (p < 0.05) detects statistically significant regressions. Exit code 1 blocks the PR.
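For intuition, here is a minimal sketch of the kind of comparison this performs, using `scipy.stats.fisher_exact` on the pass/fail counts of two runs (the counts below are illustrative, and this is not agentrial's internal code):

```python
from scipy.stats import fisher_exact

# Illustrative counts: baseline passed 47/50 trials, candidate 38/50.
baseline_pass, baseline_fail = 47, 3
candidate_pass, candidate_fail = 38, 12

# 2x2 contingency table of pass/fail counts for the two runs.
table = [[baseline_pass, baseline_fail],
         [candidate_pass, candidate_fail]]

_, p_value = fisher_exact(table)
if p_value < 0.05:
    print(f"regression detected (p = {p_value:.4f})")  # CI would exit 1 here
```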
## Statistical Methods
agentrial uses real statistical tests, not simple averages.
| Method | What it does |
|---|---|
| Wilson score interval | Confidence intervals for pass rates — accurate at boundaries (0%, 100%) and small samples |
| Bootstrap resampling | CI for cost/latency — non-parametric, no normality assumption (500 iterations) |
| Fisher exact test | Regression detection — compares pass rates between two runs (p < 0.05) |
| Mann-Whitney U test | Compares cost/latency distributions between versions |
| Benjamini-Hochberg | Controls false discovery rate when comparing multiple metrics |
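As a reference point, the Wilson score interval is straightforward to compute directly. A self-contained sketch using the standard formula (not agentrial's internal code):

```python
import math

def wilson_interval(passes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion."""
    if trials == 0:
        return (0.0, 1.0)
    p = passes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return (max(0.0, center - half), min(1.0, center + half))

# 5/5 passes reproduces the per-case (56.6%-100.0%) interval from the
# quickstart output, consistent with the demo running 5 trials per case.
lo, hi = wilson_interval(5, 5)
print(f"({lo:.1%}-{hi:.1%})")
```

Note that even a perfect 5/5 score only supports a lower bound of 56.6%, which is exactly why single-digit trial counts deserve interval estimates rather than point estimates.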
### Failure attribution

When tests fail, agentrial analyzes trajectory divergence (sketched below):

- Groups trials by pass/fail
- At each step, compares the distribution of tool calls
- The Fisher exact test identifies the step with significant divergence
- Reports the divergent step with a recommendation
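A simplified sketch of the idea (again not agentrial's internal code): build a per-step contingency table of which tool was called in passing versus failing trials, and flag the step where the distributions diverge.

```python
from collections import Counter
from scipy.stats import fisher_exact

# Toy trajectories: tool called at each step, grouped by trial outcome.
passing = [["search", "book"]] * 10
failing = [["search", "cancel"]] * 8

for step in range(2):
    # Most common tool among passing trials at this step.
    tool = Counter(t[step] for t in passing).most_common(1)[0][0]
    # 2x2 table: called that tool vs. something else, passing vs. failing.
    a = sum(t[step] == tool for t in passing)
    b = len(passing) - a
    c = sum(t[step] == tool for t in failing)
    d = len(failing) - c
    _, p = fisher_exact([[a, b], [c, d]])
    flag = "  <-- divergent step" if p < 0.05 else ""
    print(f"step {step}: p = {p:.4g}{flag}")
```

Here step 0 is identical across groups (p = 1), while step 1 is where passing and failing runs part ways, so it gets flagged.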
## Architecture

```text
agentrial/
├── cli.py               # Click CLI — run, compare, baseline, config, init, etc.
├── config.py            # YAML config loading and test file discovery
├── types.py             # AgentInput, AgentOutput, TestCase, Suite, etc.
├── snapshots.py         # Statistical snapshot testing and comparison
├── pareto.py            # Cost-accuracy Pareto frontier analysis
├── prompts.py           # Prompt version control (track, diff, list)
├── monitor.py           # Production drift detection (CUSUM, Page-Hinkley, KS)
├── pytest_plugin.py     # @agent_test decorator for pytest integration
├── runner/
│   ├── engine.py        # MultiTrialEngine — orchestrates N trials per test
│   ├── trajectory.py    # TrajectoryRecorder — captures steps, tokens, cost
│   ├── otel.py          # OpenTelemetry span capture for any framework
│   └── adapters/
│       ├── base.py             # BaseAdapter protocol + FunctionAdapter
│       ├── langgraph.py        # LangGraph adapter (callbacks + trajectory)
│       ├── crewai.py           # CrewAI adapter
│       ├── autogen.py          # AutoGen adapter (v0.4+ and legacy)
│       ├── pydantic_ai.py      # Pydantic AI adapter
│       ├── openai_agents.py    # OpenAI Agents SDK adapter
│       ├── smolagents.py       # Hugging Face smolagents adapter
│       └── pricing.py          # Model pricing for 40+ LLMs
├── evaluators/
│   ├── exact.py         # contains, regex, tool_called, exact_match
│   ├── expect.py        # Fluent assertion API
│   ├── functional.py    # Custom check functions, range checks
│   ├── llm_judge.py     # Calibrated LLM-as-Judge evaluator
│   ├── multi_agent.py   # Multi-agent evaluation
│   └── step_eval.py     # Per-step and trajectory evaluation
├── metrics/
│   ├── basic.py         # Pass rate, cost, latency, token efficiency
│   ├── statistical.py   # Wilson CI, bootstrap, Fisher, Mann-Whitney, BH
│   └── trajectory.py    # Failure attribution via divergence analysis
├── reporters/
│   ├── terminal.py      # Rich terminal output
│   ├── json_report.py   # JSON export, load, comparison
│   └── flamegraph.py    # Trajectory flame graphs (terminal + HTML)
├── security/
│   └── scanner.py       # MCP security scanner (5 vulnerability classes)
└── dashboard/
    ├── app.py           # FastAPI cloud dashboard
    ├── models.py        # Dashboard data models
    └── store.py         # Persistent storage backend
```
## Supported Frameworks
| Framework | Status | Notes |
|---|---|---|
| LangGraph | Native adapter | Full trajectory, callbacks, token tracking |
| CrewAI | Native adapter | Task-level trajectory, crew cost tracking |
| AutoGen | Native adapter | v0.4+ (autogen-agentchat) and legacy pyautogen |
| Pydantic AI | Native adapter | Tool calls, response parts, token usage |
| OpenAI Agents SDK | Native adapter | Runner integration, tool call capture |
| smolagents (HF) | Native adapter | Dict and object log formats |
| Any OTel-instrumented agent | Supported | Automatic span capture via OTel SDK |
| Custom | Supported | AgentInput -> AgentOutput protocol |
## Supported Models (cost tracking)

| Provider | Models |
|---|---|
| Anthropic | Claude 3 Haiku/Sonnet/Opus, Claude 3.5, Claude 4 |
| OpenAI | GPT-4o, GPT-4o-mini, GPT-4 Turbo, GPT-3.5 Turbo |
| Google | Gemini 1.5 Pro/Flash, Gemini 1.0 Pro |
| Mistral | Large, Medium, Small |
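Cost tracking multiplies the token counts reported in the model's response metadata by per-model prices. A sketch of the idea, where the prices are illustrative placeholders rather than agentrial's actual pricing table:

```python
# USD per million tokens (input, output) — illustrative values only.
PRICING = {
    "claude-3-haiku-20240307": (0.25, 1.25),
    "gpt-4o-mini": (0.15, 0.60),
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate API cost from token usage reported in response metadata."""
    price_in, price_out = PRICING[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000

print(f"${estimate_cost('claude-3-haiku-20240307', 1200, 300):.6f}")
```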
## Contributing

```bash
git clone https://github.com/alepot55/agentrial.git
cd agentrial
pip install -e ".[dev]"
pytest
ruff check .
mypy agentrial/
```

See CONTRIBUTING.md for details.
## License