# agentrial

Statistical evaluation framework for AI agents - the pytest for agent trajectories. Run your agent 100 times, get confidence intervals instead of anecdotes.

Your agent passes Monday, fails Wednesday. Same prompt, same model. agentrial tells you why.
## Quickstart

```bash
pip install agentrial
agentrial init
agentrial run
```

That's it. You'll see real results in seconds:
```text
╭──────────────────────────────────────────────────────────────────────╮
│ sample-demo - PASSED                                                 │
╰───────────────────────────────────────────────── Threshold: 80% ────╯
┏━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Test Case      ┃ Pass Rate ┃ 95% CI         ┃ Avg Cost ┃ Avg Latency ┃
┡━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ greeting       │ 100.0%    │ (56.6%-100.0%) │ $0.0000  │ 0ms         │
│ capital-france │ 100.0%    │ (56.6%-100.0%) │ $0.0000  │ 0ms         │
│ capital-japan  │ 100.0%    │ (56.6%-100.0%) │ $0.0000  │ 0ms         │
│ basic-math     │ 100.0%    │ (56.6%-100.0%) │ $0.0000  │ 0ms         │
└────────────────┴───────────┴────────────────┴──────────┴─────────────┘

Overall Pass Rate: 100.0% (85.0%-100.0%)
Total Cost: $0.0000
```
Replace `sample_agent.py` with your own agent, update `tests/test_sample.yml`, and you're evaluating real agents.
## Why this exists
Every agent framework ships with benchmarks showing 90%+ accuracy. But run those same agents 100 times on the same task, and you'll see pass rates drop to 60-80% with wide variance. The benchmarks measure one run; production sees thousands.
No existing tool gives you statistically rigorous, framework-agnostic agent testing that runs in CI/CD. LangSmith requires a paid account and locks you to LangChain. Promptfoo doesn't do multi-trial with confidence intervals. DeepEval and Arize don't do trajectory-level failure attribution. agentrial fills that gap: open-source, free, local-first, works with any agent framework.
## What it does
- Multi-trial execution — Run every test N times automatically. A single pass means nothing for non-deterministic agents.
- Wilson confidence intervals — Statistically accurate pass rates, even with small samples and extreme proportions (0% or 100%).
- Step-level failure attribution — Pinpoints which tool call diverges between passing and failing runs using Fisher exact test.
- Real cost tracking — Actual API costs from model metadata, 45+ models supported across Anthropic, OpenAI, Google, Mistral, Meta, DeepSeek.
- Regression detection — Fisher exact test catches reliability drops between versions. Blocks PRs in CI when quality degrades.
- Local-first — Your data never leaves your machine. No accounts, no SaaS, no telemetry.
- Agent Reliability Score — A single 0-100 composite metric that combines accuracy, consistency, cost efficiency, latency, trajectory quality, and failure recovery. Weighted scoring with transparent breakdown — one number to track across releases.
- Production monitoring — Deploy `agentrial monitor` as a cron job or sidecar. CUSUM and Page-Hinkley detectors catch drift in pass rate, cost, and latency. A Kolmogorov-Smirnov test detects distribution shifts. Alerts before users notice.
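The Wilson interval behind those pass-rate bounds has a compact closed form. A minimal sketch of the standard Wilson score interval (agentrial's internal implementation may differ in details):

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial pass rate."""
    if trials == 0:
        return (0.0, 1.0)
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials))
    return (max(0.0, center - half), min(1.0, center + half))


# 5 passes out of 5 trials yields roughly (56.6%, 100%) - the interval
# stays wide, unlike a naive point estimate of "100%".
lo, hi = wilson_interval(5, 5)
```

Note how five straight passes still leave a lower bound near 57%: a handful of clean runs tells you very little about a non-deterministic agent.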
## Writing Tests

Tests are YAML files. Define what your agent receives and what it should produce:

```yaml
suite: my-agent-tests
agent: my_module.agent  # Python import path to your wrapped agent
trials: 10
threshold: 0.85         # Minimum pass rate

cases:
  - name: basic-math
    input:
      query: "What is 15 * 37?"
    expected:
      output_contains: ["555"]
      tool_calls:
        - tool: calculate

  - name: capital-lookup
    input:
      query: "What is the capital of Japan?"
    expected:
      output_contains: ["Tokyo"]

  - name: error-handling
    input:
      query: "Divide 10 by 0"
    expected:
      output_contains_any: ["undefined", "cannot", "error"]
    max_cost: 0.05
    max_latency_ms: 5000
```
### All assertion types

```yaml
expected:
  output_contains: ["word1", "word2"]          # AND - all must be present
  output_contains_any: ["option1", "option2"]  # OR - at least one
  exact_match: "exact output string"
  regex: "\\d+ results found"
  tool_calls:
    - tool: search
      params_contain:
        query: "expected term"

  # Per-step expectations
  step_expectations:
    - step_index: 0
      tool_name: search
      params_contain:
        query: "search term"
      output_contains: ["result"]
```
### Test discovery

agentrial auto-discovers test files:

```bash
agentrial run tests/          # Finds test_*.yml, test_*.yaml
agentrial run agentrial.yml   # Run a specific file
```
## Wrapping Your Agent

agentrial needs a callable: `AgentInput -> AgentOutput`. Use an adapter for your framework.

### LangGraph

```python
from langchain_anthropic import ChatAnthropic
from langchain_core.tools import tool
from langgraph.prebuilt import create_react_agent

from agentrial.runner.adapters import wrap_langgraph_agent


@tool
def calculate(expression: str) -> str:
    """Evaluate a math expression."""
    return str(eval(expression))  # fine for a demo tool; avoid eval in production


llm = ChatAnthropic(model="claude-3-haiku-20240307", temperature=0)
graph = create_react_agent(llm, tools=[calculate])

# This is what your YAML points to
agent = wrap_langgraph_agent(graph)
```
The LangGraph adapter automatically captures the full trajectory, token usage, real API cost, and execution duration.
### Custom agents

Implement the protocol directly:

```python
from agentrial.types import AgentInput, AgentOutput, AgentMetadata


def agent(input: AgentInput) -> AgentOutput:
    # Your agent logic
    return AgentOutput(
        output="result",
        steps=[],
        metadata=AgentMetadata(total_tokens=100, cost=0.001, duration_ms=500.0),
        success=True,
    )
```
## Fluent Assertion API

For Python-defined tests:

```python
from agentrial import expect

result = agent(AgentInput(query="Book a flight to Rome"))

e = expect(result).succeeded() \
    .tool_called("search_flights", params_contain={"destination": "FCO"}) \
    .cost_below(0.15) \
    .latency_below(5000)

# Output checks return OutputExpectation (separate chain)
e.output.contains("confirmed", "Rome")

# Step checks return StepExpectation (separate chain)
e.step(0).tool_name("search_flights").params_contain(destination="FCO")

assert e.passed()
```
| Method | Description |
|---|---|
| `.succeeded()` | Agent completed without error |
| `.output.contains(*strings)` | Output contains all substrings |
| `.output.equals(string)` | Exact match |
| `.output.matches(regex)` | Regex match |
| `.tool_called(name, params_contain={})` | Tool was called with params |
| `.step(i).tool_name(name)` | Step i called named tool |
| `.step(i).params_contain(**kw)` | Step i had params matching kw |
| `.cost_below(max_usd)` | Cost under threshold |
| `.latency_below(max_ms)` | Latency under threshold |
| `.tokens_below(max_tokens)` | Tokens under threshold |
| `.trajectory_length(min, max)` | Step count within bounds |
| `.passed()` | Returns True if all pass |
| `.get_failures()` | Returns failure messages |
## CI/CD Integration

### GitHub Actions

```yaml
name: Agent Evaluation
on: [push, pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install agentrial && pip install -e .
      - run: agentrial run --trials 10 --threshold 0.85 -o results.json
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: agentrial-results
          path: results.json
```
### Regression detection in CI

```yaml
- run: agentrial run -o results.json
- run: agentrial compare results.json --baseline baseline.json
```

Fisher's exact test (p < 0.05) detects statistically significant regressions. Exit code 1 blocks the PR.
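The comparison reduces to a 2x2 contingency table of passes and failures for the two runs. As an illustration, here is a from-scratch two-sided Fisher exact test that enumerates the hypergeometric distribution (agentrial ships its own implementation; this sketch just shows the statistic):

```python
from math import comb


def fisher_exact_p(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]],
    e.g. a/b = passes/failures on the baseline, c/d = on the new run."""
    n = a + b + c + d
    row1, col1 = a + b, a + c

    def prob(k: int) -> float:  # hypergeometric P(X = k)
        return comb(col1, k) * comb(n - col1, row1 - k) / comb(n, row1)

    p_obs = prob(a)
    lo, hi = max(0, row1 + col1 - n), min(row1, col1)
    # two-sided: sum over all tables at least as extreme as the observed one
    return sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= p_obs * (1 + 1e-9))


# baseline 9/10 passing vs. new run 3/10 passing
p = fisher_exact_p(9, 1, 3, 7)  # ~0.02 -> significant regression at p < 0.05
```

With only ten trials per side, a drop from 9/10 to 3/10 already clears the 0.05 bar; smaller drops need more trials to distinguish regression from noise.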
## Advanced features

### Trajectory flame graphs

Visualize agent execution paths across trials. Identify where passing and failing runs diverge.

```bash
agentrial run --flamegraph                         # Terminal visualization
agentrial run --flamegraph --html flamegraph.html  # Interactive HTML export
```
### LLM-as-Judge

Use a second LLM to evaluate response quality with calibrated scoring.

```bash
agentrial run --judge   # Add judge evaluation
```

Implements Krippendorff's alpha for inter-rater reliability and a t-distribution CI for score estimates. A calibration protocol ensures judge consistency before scoring.
### Snapshot testing

Capture baseline behavior and detect regressions automatically.

```bash
agentrial snapshot update   # Save current behavior as baseline
agentrial snapshot check    # Compare against baseline
```

Uses the Fisher exact test on pass rates and Mann-Whitney U on cost/latency, with Benjamini-Hochberg correction across all comparisons.
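The cost/latency half of that pipeline can be sketched end to end: a normal-approximation Mann-Whitney U test (no tie correction, unlike a production implementation), then Benjamini-Hochberg selection over the resulting p-values. Function names here are illustrative, not agentrial's API:

```python
import math


def mann_whitney_p(x: list[float], y: list[float]) -> float:
    """Two-sided Mann-Whitney U p-value via the normal approximation,
    with average ranks for ties (tie correction omitted)."""
    pooled = sorted((v, src) for src in (0, 1) for v in (x, y)[src])
    ranks, i = {}, 0
    while i < len(pooled):  # assign 1-based average ranks to tied runs
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        for k in range(i, j):
            ranks[k] = (i + j + 1) / 2
        i = j
    r1 = sum(ranks[k] for k, (_, src) in enumerate(pooled) if src == 0)
    n1, n2 = len(x), len(y)
    u1 = r1 - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mu) / sigma
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability


def benjamini_hochberg(pvals: list[float], alpha: float = 0.05) -> set[int]:
    """Indices of hypotheses rejected while controlling FDR at alpha."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, 1):
        if pvals[i] <= rank / m * alpha:
            k = rank
    return set(order[:k])
```

The BH step matters because a snapshot check compares several metrics at once; without it, running many tests at alpha = 0.05 inflates the false-alarm rate.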
### MCP security scanner

Audit MCP server configurations for 6 vulnerability classes: prompt injection, tool shadowing, data exfiltration, permission escalation, rug pull, and configuration weakness.

```bash
agentrial security scan --mcp-config servers.json
```
### Multi-agent evaluation

Evaluate multi-agent systems with delegation accuracy, handoff fidelity, redundancy rate, and cascade failure metrics.
### Pareto frontier analysis

Find the optimal cost-accuracy trade-off across models.

```bash
agentrial pareto --models claude-3-haiku,gpt-4o-mini,gemini-flash
```
### Prompt version control

Track, diff, and manage prompt versions with statistical comparison between versions.

```bash
agentrial prompt track prompts/v2.txt
agentrial prompt diff v1 v2
agentrial prompt list
```
### Agent Reliability Score

A composite 0-100 metric combining 6 weighted components: accuracy (40%), consistency (20%), cost efficiency (10%), latency (10%), trajectory quality (10%), and recovery (10%).

```bash
agentrial ars results.json
agentrial ars results.json --cost-ceiling 0.5
```
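Given those published weights, the composite is a plain weighted sum. A sketch assuming each component arrives already normalized to 0-1 (agentrial's actual normalization and API may differ):

```python
# Weights as documented: accuracy 40%, consistency 20%, the rest 10% each.
WEIGHTS = {
    "accuracy": 0.40,
    "consistency": 0.20,
    "cost_efficiency": 0.10,
    "latency": 0.10,
    "trajectory_quality": 0.10,
    "recovery": 0.10,
}


def reliability_score(components: dict[str, float]) -> float:
    """0-100 composite from six 0-1 component scores."""
    assert set(components) == set(WEIGHTS), "all six components required"
    return 100 * sum(WEIGHTS[k] * components[k] for k in WEIGHTS)
```

Because accuracy carries 40% of the weight, dropping it from 1.0 to 0.5 alone costs 20 points, while the same drop in latency costs only 5.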
### Benchmark registry

Publish evaluation results as verifiable, shareable benchmark files with SHA-256 integrity checksums.

```bash
agentrial publish results.json --agent-name my-agent --agent-version 1.0.0
agentrial verify --agent-name my-agent --agent-version 1.0.0 --suite-name my-suite
```
### Eval packs

Domain-specific evaluation packages distributed as Python packages via entry points. Install a pack, get specialized test suites and evaluators.

```bash
agentrial packs list   # Show installed packs
```
### Dashboard

Local FastAPI dashboard for browsing results, comparing runs, and tracking trends.

```bash
agentrial dashboard   # Start at http://localhost:8080
```
### VS Code extension
Browse test suites, run evaluations, view flame graphs, and compare snapshots — all from your editor.
Install from the VS Code Marketplace or search "agentrial" in VS Code extensions.
Features:
- Suite explorer sidebar with test case tree
- Run suites and individual test cases with one click
- Interactive trajectory flame graph visualization
- Snapshot comparison for regression detection
- MCP security scan integration
- Auto-refresh on YAML file changes
## Statistical Methods
agentrial uses real statistical tests, not simple averages.
| Method | What it does |
|---|---|
| Wilson score interval | Confidence intervals for pass rates — accurate at boundaries (0%, 100%) and small samples |
| Bootstrap resampling | CI for cost/latency — non-parametric, no normality assumption (500 iterations) |
| Fisher exact test | Regression detection — compares pass rates between two runs (p < 0.05) |
| Mann-Whitney U test | Compares cost/latency distributions between versions |
| Benjamini-Hochberg | Controls false discovery rate when comparing multiple metrics |
| CUSUM / Page-Hinkley | Sequential change-point detection for production monitoring |
| Kolmogorov-Smirnov | Distribution shift detection for cost and latency |
| Krippendorff's alpha | Inter-rater reliability for LLM-as-Judge with t-distribution CI |
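The bootstrap row in the table reduces to resampling with replacement and taking percentiles. A minimal percentile bootstrap (agentrial's version may add refinements such as bias correction):

```python
import random


def bootstrap_ci(samples, stat=lambda xs: sum(xs) / len(xs),
                 iters=500, alpha=0.05, seed=0):
    """Percentile bootstrap CI for a statistic (default: the mean).
    No normality assumption - useful for skewed latency distributions."""
    rng = random.Random(seed)  # fixed seed for reproducible intervals
    stats = sorted(
        stat([rng.choice(samples) for _ in samples]) for _ in range(iters)
    )
    lo = stats[int(iters * alpha / 2)]
    hi = stats[int(iters * (1 - alpha / 2)) - 1]
    return lo, hi


# hypothetical per-trial latencies (ms) with a heavy tail
latencies = [820, 950, 1100, 870, 2400, 910, 1005, 880, 940, 3100]
lo, hi = bootstrap_ci(latencies)  # CI for the mean latency
```

The heavy-tailed sample above is exactly where a normal-theory interval misleads and the bootstrap does not.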
### Failure attribution
When tests fail, agentrial analyzes trajectory divergence:
- Groups trials by pass/fail
- At each step, compares distribution of tool calls
- Fisher exact test identifies the step with significant divergence
- Reports the divergent step with a recommendation
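A toy version of that divergence search, using a two-proportion z-score as a stand-in for the Fisher exact test agentrial applies, with hypothetical trajectory and tool names:

```python
import math


def divergent_step(pass_trajs, fail_trajs):
    """Locate the step where tool choice differs most between passing and
    failing trials. Each trajectory is a list of tool names, one per step.
    Stand-in statistic: two-proportion z-score (agentrial uses a Fisher
    exact test at this point instead)."""
    depth = max(len(t) for t in pass_trajs + fail_trajs)
    best = None  # (z, step index, tool name)
    for i in range(depth):
        p_tools = [t[i] for t in pass_trajs if len(t) > i]
        f_tools = [t[i] for t in fail_trajs if len(t) > i]
        if not p_tools or not f_tools:
            continue
        for tool in set(p_tools) | set(f_tools):
            x1, n1 = p_tools.count(tool), len(p_tools)
            x2, n2 = f_tools.count(tool), len(f_tools)
            p = (x1 + x2) / (n1 + n2)
            if p in (0.0, 1.0):  # identical usage on both sides
                continue
            z = abs(x1 / n1 - x2 / n2) / math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
            if best is None or z > best[0]:
                best = (z, i, tool)
    return best


# hypothetical runs: passing trials book after searching, failing ones cancel
passing = [["search_flights", "book"]] * 8
failing = [["search_flights", "cancel"]] * 8
```

Here step 0 is identical across groups and contributes nothing; step 1 is where the trajectories fork, so that is the step a report would flag.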
## CLI Reference

```bash
agentrial init                                # Scaffold sample project
agentrial run                                 # Run all tests
agentrial run tests/ --trials 20              # Custom trials
agentrial run -o results.json                 # JSON export
agentrial run --flamegraph                    # Trajectory flame graphs
agentrial run --judge                         # LLM-as-Judge evaluation
agentrial compare results.json -b base.json   # Regression detection
agentrial baseline results.json               # Save baseline
agentrial snapshot update / check             # Snapshot testing
agentrial security scan --mcp-config c.json   # MCP security scan
agentrial pareto --models m1,m2,m3            # Cost-accuracy Pareto frontier
agentrial prompt track/diff/list              # Prompt version control
agentrial monitor --baseline snap.json        # Production drift detection
agentrial ars results.json                    # Agent Reliability Score
agentrial publish results.json --agent-name me --agent-version 1.0   # Publish benchmark
agentrial verify --agent-name me --agent-version 1.0 --suite-name s  # Verify integrity
agentrial packs list                          # List installed eval packs
agentrial dashboard                           # Start local dashboard
agentrial config                              # Show configuration
```
| Flag | Short | Description | Default |
|---|---|---|---|
| `--config` | `-c` | Config file path | `agentrial.yml` |
| `--trials` | `-n` | Trials per test case | 10 |
| `--threshold` | `-t` | Min pass rate (0-1) | 0.85 |
| `--output` | `-o` | JSON output path | — |
| `--json` | | JSON to stdout | false |
| `--flamegraph` | | Show trajectory flame graphs | false |
| `--html` | | Export flame graph HTML | — |
| `--judge` | | Enable LLM-as-Judge | false |
| `--update-snapshots` | | Save as snapshot baseline | false |
## How it compares

| | agentrial | Promptfoo | LangSmith | DeepEval | Arize Phoenix |
|---|---|---|---|---|---|
| Multi-trial with CI | Free | — | $39/mo | — | — |
| Confidence intervals | Wilson CI | — | — | — | — |
| Step-level failure attribution | Fisher exact | — | — | — | Partial |
| Framework-agnostic | 6 adapters + OTel | Yes | LangChain only | Yes | Yes |
| Cost-per-correct-answer | Yes | — | — | — | — |
| LLM-as-Judge with calibration | Krippendorff α | — | Yes | Yes | — |
| Composite reliability score | ARS (0-100) | — | — | — | — |
| MCP security scanning | 6 vuln classes | — | — | — | — |
| Production drift detection | CUSUM + PH + KS | — | — | — | Partial |
| VS Code extension | Yes | — | — | — | — |
| Local-first | Yes | Yes | No | No | Self-host option |
## Supported Frameworks
| Framework | Status | Notes |
|---|---|---|
| LangGraph | Native adapter | Full trajectory, callbacks, token tracking |
| CrewAI | Native adapter | Task-level trajectory, crew cost tracking |
| AutoGen | Native adapter | v0.4+ (autogen-agentchat) and legacy pyautogen |
| Pydantic AI | Native adapter | Tool calls, response parts, token usage |
| OpenAI Agents SDK | Native adapter | Runner integration, tool call capture |
| smolagents (HF) | Native adapter | Dict and object log formats |
| Any OTel-instrumented agent | Supported | Automatic span capture via OTel SDK |
| Custom | Supported | AgentInput -> AgentOutput protocol |
## Supported Models (cost tracking)
| Provider | Models |
|---|---|
| Anthropic | Claude 3 Haiku/Sonnet/Opus, Claude 3.5, Claude Sonnet 4.5, Claude Opus 4 |
| OpenAI | GPT-4o, GPT-4o-mini, GPT-4 Turbo, GPT-3.5 Turbo, o1, o3-mini |
| Gemini 2.0 Flash, Gemini 1.5 Pro/Flash, Gemini 1.0 Pro | |
| Mistral | Large, Medium, Small, Codestral, Pixtral |
| Meta | Llama 3.3 70B, Llama 3.1 405B/70B |
| DeepSeek | DeepSeek Chat, DeepSeek Reasoner |
## Contributing

```bash
git clone https://github.com/alepot55/agentrial.git
cd agentrial
pip install -e ".[dev]"

pytest            # 438 tests
ruff check .
mypy agentrial/
```

See CONTRIBUTING.md for details.