Statistical evaluation framework for AI agents - pytest for agent trajectories

agenteval

License: MIT | Python 3.11+

Your agent passes Monday, fails Wednesday. agenteval tells you why.

Statistical evaluation framework for AI agents. Like pytest, but for non-deterministic systems.

Why agenteval?

  • Statistical rigor: Every test runs N trials with Wilson confidence intervals — not "it passed once"
  • Trajectory analysis: Step-by-step failure attribution identifies exactly which tool call diverged
  • Cost tracking: Real API costs from model metadata, not estimates
  • CI/CD ready: GitHub Action that blocks PRs when reliability drops

Quick Start

pip install agentrial

Create a test file agenteval.yml:

suite: my-agent-tests
agent: my_agent.agent  # Python import path
trials: 10
threshold: 0.85

cases:
  - name: basic-math
    input:
      query: "What is 15 * 37?"
    expected:
      output_contains: ["555"]
      tool_calls:
        - tool: calculate

Run:

agentrial run

Output:

╭──────────────────────────────────────────────────────────────────────────────╮
│ my-agent-tests - PASSED                                                      │
╰────────────────────────────────────────────────────── Threshold: 85.0% ──────╯
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Test Case              ┃ Pass Rate┃ 95% CI          ┃ Avg Cost ┃ Avg Latency┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━┩
│ basic-math             │   100.0% │ (72.2%-100.0%)  │  $0.0005 │      1.59s │
└────────────────────────┴──────────┴─────────────────┴──────────┴────────────┘

Overall Pass Rate: 100.0% (72.2%-100.0%)
Total Cost: $0.005
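
The 95% CI column may look surprising for a test that passed every trial. A minimal sketch of the standard Wilson score interval shows where (72.2%-100.0%) comes from when 10 of 10 trials pass. The function name `wilson_interval` is illustrative, not part of the package's API.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(passes: int, trials: int, confidence: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for a binomial pass rate (illustrative helper)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p = passes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


low, high = wilson_interval(passes=10, trials=10)
print(f"({low:.1%}-{high:.1%})")  # (72.2%-100.0%)
```

With only 10 trials, even a perfect record is consistent with a true pass rate as low as roughly 72%, which is why raising the trial count tightens the interval.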

Real-World Results

Tested with Claude 3 Haiku on a 3-tool agent (calculator, country lookup, temperature conversion):

Test Complexity               Pass Rate   95% CI        Avg Cost   Avg Latency   Avg Tokens
Easy (direct tool call)       100%        72.2%-100%    $0.0005    1.6s          1,513
Medium (inference + tool)     100%        72.2%-100%    $0.0006    2.6s          1,926
Hard (multi-step reasoning)   100%        72.2%-100%    $0.0010    3.5s          2,986

100 trials total, $0.06 total cost, full trajectory capture.

Test Case Options

cases:
  - name: my-test
    input:
      query: "User question"
      context: {}  # Optional context dict
    expected:
      # All strings must be present (AND logic)
      output_contains: ["expected", "words"]

      # At least one string must be present (OR logic)
      output_contains_any: ["option1", "option2", "option3"]

      # Regex pattern
      regex: "\\d+ results found"

      # Expected tool calls
      tool_calls:
        - tool: search
          params_contain: {query: "expected"}
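
The AND/OR/regex semantics of the output expectations can be sketched in a few lines of Python. This is an illustration of the documented logic, not the package's implementation: `check_output` is a hypothetical helper, and it assumes case-sensitive substring matching and a regex that may match anywhere in the output.

```python
import re


def check_output(output: str, expected: dict) -> bool:
    """Hypothetical checker mirroring the documented output expectations."""
    # output_contains: every listed string must appear (AND logic).
    if not all(s in output for s in expected.get("output_contains", [])):
        return False
    # output_contains_any: at least one listed string must appear (OR logic).
    any_of = expected.get("output_contains_any")
    if any_of and not any(s in output for s in any_of):
        return False
    # regex: assumed to match anywhere in the output (re.search).
    pattern = expected.get("regex")
    if pattern and re.search(pattern, output) is None:
        return False
    return True


expected = {
    "output_contains": ["expected", "words"],
    "output_contains_any": ["option1", "option2", "option3"],
    "regex": r"\d+ results found",
}
print(check_output("expected words: option2, 12 results found", expected))  # True
print(check_output("expected words: 12 results found", expected))  # False: no option matched
```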

CI/CD Integration

Add to .github/workflows/agent-eval.yml:

name: Agent Evaluation
on: [push, pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install agentrial
      - run: agentrial run --threshold 0.85
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

LangGraph Integration

from langchain_anthropic import ChatAnthropic
from langgraph.prebuilt import create_react_agent
from agenteval.runner.adapters import wrap_langgraph_agent

# Your LangGraph agent
llm = ChatAnthropic(model="claude-3-haiku-20240307")
graph = create_react_agent(llm, tools=[...])

# Wrap for agenteval
agent = wrap_langgraph_agent(graph)

Then reference the wrapped agent in your test file by its Python import path:

agent: my_module.agent

CLI Reference

agentrial run [PATH]           # Run tests (default: current directory)
agentrial run --trials 20      # Override trial count
agentrial run --threshold 0.9  # Override pass threshold
agentrial run -o results.json  # Export JSON report

agentrial compare -b baseline.json current.json  # Compare runs
agentrial init                                   # Initialize project
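
The internals of `compare` are not shown here, but the Statistical Methods section names the Fisher exact test for pass-rate regressions. A minimal sketch of that test on two runs' pass/fail counts follows. The function name and the example counts (a 10/10 baseline against a 5/10 current run) are assumptions for illustration, and the JSON report format is not modeled.

```python
from math import comb


def fisher_exact_two_sided(pass_a: int, fail_a: int, pass_b: int, fail_b: int) -> float:
    """Two-sided Fisher exact p-value for a 2x2 pass/fail table (illustrative)."""
    n_a = pass_a + fail_a
    n_b = pass_b + fail_b
    total_pass = pass_a + pass_b
    total = n_a + n_b

    def prob(k: int) -> float:
        # Hypergeometric probability that run A has k passes, margins fixed.
        return comb(n_a, k) * comb(n_b, total_pass - k) / comb(total, total_pass)

    observed = prob(pass_a)
    k_min = max(0, total_pass - n_b)
    k_max = min(n_a, total_pass)
    probs = [prob(k) for k in range(k_min, k_max + 1)]
    # Sum every table that is at least as unlikely as the observed one.
    p_value = sum(p for p in probs if p <= observed * (1 + 1e-9))
    return min(1.0, p_value)


# Hypothetical counts: baseline passed 10/10, current run passed 5/10.
p = fisher_exact_two_sided(pass_a=10, fail_a=0, pass_b=5, fail_b=5)
print(f"p = {p:.4f}")  # p = 0.0325
```

At a conventional 0.05 significance level (an assumption, not a documented default), this drop would be flagged as a regression. `scipy.stats.fisher_exact` computes the same two-sided p-value if SciPy is available.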

Supported Frameworks

  • LangGraph (native adapter with full trajectory capture)
  • CrewAI (coming soon)
  • AutoGen (coming soon)
  • Pydantic AI (coming soon)

Statistical Methods

  • Pass rate CI: Wilson score interval (accurate for small N and extreme proportions)
  • Cost/latency CI: Bootstrap resampling (500 iterations)
  • Regression detection: Fisher exact test for pass rate, Mann-Whitney U for metrics
  • Failure attribution: Trajectory divergence analysis between passed/failed trials
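
As a rough illustration of the cost/latency bullet, a percentile bootstrap over per-trial measurements might look like the sketch below. The helper name, the fixed seed, the percentile method, and the sample latencies are all assumptions. Only the 500-iteration count comes from the list above.

```python
import random
from statistics import mean


def bootstrap_mean_ci(samples, iterations=500, confidence=0.95, seed=0):
    """Percentile bootstrap CI for the mean of per-trial measurements."""
    if not samples:
        raise ValueError("samples must not be empty")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(samples)
    # Resample with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(samples, k=n)) for _ in range(iterations))
    alpha = 1 - confidence
    lo_idx = int((alpha / 2) * iterations)
    hi_idx = min(iterations - 1, int((1 - alpha / 2) * iterations))
    return means[lo_idx], means[hi_idx]


# Made-up latencies in seconds for 10 trials, for illustration only.
latencies = [1.4, 1.7, 1.5, 1.6, 1.9, 1.5, 1.6, 1.4, 1.8, 1.5]
low, high = bootstrap_mean_ci(latencies)
print(f"mean={mean(latencies):.2f}s, 95% CI=({low:.2f}s-{high:.2f}s)")
```

Bootstrap resampling makes no normality assumption, which suits skewed quantities such as latency and cost.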

License

MIT

