Statistical evaluation framework for AI agents - pytest for agent trajectories

agenteval

Your agent passes Monday, fails Wednesday. agenteval tells you why.

Statistical evaluation framework for AI agents. Like pytest, but for non-deterministic systems.

Why agenteval?

Statistical rigor: Every test runs N trials with Wilson confidence intervals — not "it passed once"
Trajectory analysis: Step-by-step failure attribution identifies exactly which tool call diverged
Cost tracking: Real API costs from model metadata, not estimates
CI/CD ready: GitHub Action that blocks PRs when reliability drops

Quick Start

pip install agentrial

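The PyPI distribution is named agentrial; the command-line tool and the import package used below are agenteval.

Create a test file agenteval.yml: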

suite: my-agent-tests
agent: my_agent.agent  # Python import path
trials: 10
threshold: 0.85

cases:
  - name: basic-math
    input:
      query: "What is 15 * 37?"
    expected:
      output_contains: ["555"]
      tool_calls:
        - tool: calculate

Run:

agenteval run

Output:

╭──────────────────────────────────────────────────────────────────────────────╮
│ my-agent-tests - PASSED                                                      │
╰────────────────────────────────────────────────────── Threshold: 85.0% ──────╯
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Test Case              ┃ Pass Rate┃ 95% CI          ┃ Avg Cost ┃ Avg Latency┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━┩
│ basic-math             │   100.0% │ (72.2%-100.0%)  │  $0.0005 │      1.59s │
└────────────────────────┴──────────┴─────────────────┴──────────┴────────────┘

Overall Pass Rate: 100.0% (72.2%-100.0%)
Total Cost: $0.005
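
The 72.2% lower bound in the report is what a Wilson score interval gives for 10 passes out of 10 trials at 95% confidence. The sketch below reproduces that arithmetic with the standard formula. `wilson_interval` is a hypothetical helper written for illustration, not a function exported by agenteval.

```python
import math


def wilson_interval(passes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (hypothetical helper, z=1.96 for 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = passes / trials
    denom = 1 + z**2 / trials
    # Center is pulled toward 0.5, which keeps the interval sensible at p = 0 or 1.
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot.
    return max(0.0, center - half), min(1.0, center + half)


low, high = wilson_interval(10, 10)
print(f"{low:.1%}-{high:.1%}")  # 72.2%-100.0%
```

Even a perfect 10/10 run only supports the claim that the true pass rate is above roughly 72%, which is why a higher trial count tightens the interval.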

Real-World Results

Tested with Claude 3 Haiku on a 3-tool agent (calculator, country lookup, temperature conversion):

Test Complexity	Pass Rate	95% CI	Avg Cost	Avg Latency	Avg Tokens
Easy (direct tool call)	100%	72.2%-100%	$0.0005	1.6s	1,513
Medium (inference + tool)	100%	72.2%-100%	$0.0006	2.6s	1,926
Hard (multi-step reasoning)	100%	72.2%-100%	$0.0010	3.5s	2,986

100 trials total, $0.06 total cost, full trajectory capture.

Test Case Options

cases:
  - name: my-test
    input:
      query: "User question"
      context: {}  # Optional context dict
    expected:
      # All strings must be present (AND logic)
      output_contains: ["expected", "words"]

      # At least one string must be present (OR logic)
      output_contains_any: ["option1", "option2", "option3"]

      # Regex pattern
      regex: "\\d+ results found"

      # Expected tool calls
      tool_calls:
        - tool: search
          params_contain: {query: "expected"}
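
To make the AND/OR semantics above concrete, here is a minimal sketch of how the output matchers could be evaluated against an agent's final answer. `check_output` is a hypothetical function, not agenteval's implementation. It assumes case-sensitive substring matching, `re.search` semantics for `regex`, and that every matcher present in `expected` must succeed; none of these details are confirmed by the documentation above.

```python
import re


def check_output(output: str, expected: dict) -> bool:
    """Evaluate the output matchers from a test case (hypothetical sketch)."""
    all_of = expected.get("output_contains", [])
    any_of = expected.get("output_contains_any", [])
    pattern = expected.get("regex")

    # output_contains: every string must be present (AND logic).
    if not all(s in output for s in all_of):
        return False
    # output_contains_any: at least one string must be present (OR logic).
    if any_of and not any(s in output for s in any_of):
        return False
    # regex: the pattern must match somewhere in the output (assumed re.search).
    if pattern is not None and re.search(pattern, output) is None:
        return False
    return True


expected = {
    "output_contains": ["555"],
    "output_contains_any": ["equals", "is"],
    "regex": r"\d+",
}
print(check_output("15 * 37 is 555", expected))   # True
print(check_output("I am not sure", expected))    # False
```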

CI/CD Integration

Add to .github/workflows/agent-eval.yml:

name: Agent Evaluation
on: [push, pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install agentrial
      - run: agenteval run --threshold 0.85
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

LangGraph Integration

from langchain_anthropic import ChatAnthropic
from langgraph.prebuilt import create_react_agent
from agenteval.runner.adapters import wrap_langgraph_agent

# Your LangGraph agent
llm = ChatAnthropic(model="claude-3-haiku-20240307")
graph = create_react_agent(llm, tools=[...])

# Wrap for agenteval
agent = wrap_langgraph_agent(graph)

Then reference in your test file:

agent: my_module.agent

CLI Reference

agenteval run [PATH]           # Run tests (default: current directory)
agenteval run --trials 20      # Override trial count
agenteval run --threshold 0.9  # Override pass threshold
agenteval run -o results.json  # Export JSON report

agenteval compare -b baseline.json current.json  # Compare runs
agenteval init                                   # Initialize project
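
The Statistical Methods section lists the Fisher exact test for pass-rate regression detection, which is presumably what `agenteval compare` applies to the two reports. The sketch below shows the test itself using only the standard library. `fisher_exact_two_sided` is a hypothetical helper, and the two-sided form is an assumption: the documentation does not say whether agenteval uses a one-sided or two-sided test.

```python
from math import comb


def fisher_exact_two_sided(base_pass: int, base_fail: int,
                           cur_pass: int, cur_fail: int) -> float:
    """Two-sided Fisher exact test p-value for a 2x2 pass/fail table (sketch)."""
    row1 = base_pass + base_fail      # baseline trials
    row2 = cur_pass + cur_fail        # current trials
    col1 = base_pass + cur_pass       # total passes across both runs
    total = row1 + row2

    def prob(k: int) -> float:
        # Hypergeometric probability that the baseline holds k of the passes.
        return comb(row1, k) * comb(row2, col1 - k) / comb(total, col1)

    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    observed = prob(base_pass)
    # Sum every table that is at least as unlikely as the observed one.
    p = sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= observed * (1 + 1e-9))
    return min(1.0, p)


# Baseline: 10/10 passes. Current run: 5/10 passes.
p_value = fisher_exact_two_sided(10, 0, 5, 5)
print(f"p = {p_value:.4f}")  # p = 0.0325
```

With 10 trials per run, a drop from 10/10 to 5/10 is significant at the 0.05 level, while a drop to 8/10 would not be; small trial counts can only detect large regressions.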

Supported Frameworks

LangGraph (native adapter with full trajectory capture)
CrewAI (coming soon)
AutoGen (coming soon)
Pydantic AI (coming soon)

Statistical Methods

Pass rate CI: Wilson score interval (accurate for small N and extreme proportions)
Cost/latency CI: Bootstrap resampling (500 iterations)
Regression detection: Fisher exact test for pass rate, Mann-Whitney U for metrics
Failure attribution: Trajectory divergence analysis between passed/failed trials
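
As an illustration of the bootstrap step, the sketch below computes a 95% confidence interval for mean latency with 500 resamples. `bootstrap_mean_ci` is a hypothetical helper, the latency values are made up for the example, and the percentile method is an assumption; the documentation above states only the resampling count.

```python
import random
import statistics


def bootstrap_mean_ci(values: list[float], iterations: int = 500,
                      confidence: float = 0.95, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap CI for the mean (hypothetical sketch)."""
    if not values:
        raise ValueError("values must not be empty")
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(iterations)
    )
    alpha = (1 - confidence) / 2
    lo_idx = int(alpha * iterations)
    hi_idx = min(iterations - 1, int((1 - alpha) * iterations))
    return means[lo_idx], means[hi_idx]


# Illustrative per-trial latencies in seconds (not real measurements).
latencies = [1.4, 1.6, 1.5, 1.9, 1.3, 1.7, 1.6, 2.1, 1.5, 1.4]
low, high = bootstrap_mean_ci(latencies)
print(f"mean latency 95% CI: {low:.2f}s-{high:.2f}s")
```

Bootstrapping makes no normality assumption, which suits cost and latency distributions that tend to be skewed by occasional slow or expensive trials.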

License

MIT

Release history

0.2.0: Feb 6, 2026
0.2.0a2 (pre-release): Feb 6, 2026
0.2.0a1 (pre-release): Feb 6, 2026
0.1.4: Feb 5, 2026
0.1.3: Feb 5, 2026
0.1.2: Feb 5, 2026
0.1.1: Feb 5, 2026
0.1.0 (this version): Feb 5, 2026

Download files

Source Distribution

agentrial-0.1.0.tar.gz (56.8 kB)

Uploaded Feb 5, 2026 Source

Built Distribution

agentrial-0.1.0-py3-none-any.whl (50.8 kB)

Uploaded Feb 5, 2026 Python 3

File details

Details for the file agentrial-0.1.0.tar.gz.

File metadata

Download URL: agentrial-0.1.0.tar.gz
Upload date: Feb 5, 2026
Size: 56.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for agentrial-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`a9bb9d1e00d7e83721c5a2cbbbb508232231077d7f05b9a105001b8f778b5881`
MD5	`2fc5eff21b0d65baa7b7c807a49bc762`
BLAKE2b-256	`0496e2c7673610a0569cac922ad9d73a88f30d8ebac67b86423c13dcfc80f9e7`

File details

Details for the file agentrial-0.1.0-py3-none-any.whl.

File metadata

Download URL: agentrial-0.1.0-py3-none-any.whl
Upload date: Feb 5, 2026
Size: 50.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for agentrial-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`881029304f2946a2526e12604544e88b582cbe156da7df2f685311aa0d6ce10a`
MD5	`48971e1e55b9f115c56f0d927dea64e0`
BLAKE2b-256	`80aae00118e2fae215bd21328d6f815320cf84eb1837b0dfc2ea020d23d18c9f`
