Statistical evaluation framework for AI agents - pytest for agent trajectories

agenteval

License: MIT | Python 3.11+

Your agent passes Monday, fails Wednesday. agenteval tells you why.

Statistical evaluation framework for AI agents. Like pytest, but for non-deterministic systems.

Why agenteval?

  • Statistical rigor: Every test runs N trials and reports a Wilson confidence interval on the pass rate, rather than "it passed once"
  • Trajectory analysis: Step-by-step failure attribution identifies exactly which tool call diverged
  • Cost tracking: Real API costs from model metadata, not estimates
  • CI/CD ready: GitHub Action that blocks PRs when reliability drops

Quick Start

pip install agentrial

Create a test file agenteval.yml:

suite: my-agent-tests
agent: my_agent.agent  # Python import path
trials: 10
threshold: 0.85

cases:
  - name: basic-math
    input:
      query: "What is 15 * 37?"
    expected:
      output_contains: ["555"]
      tool_calls:
        - tool: calculate

Run:

agenteval run

Output:

╭──────────────────────────────────────────────────────────────────────────────╮
│ my-agent-tests - PASSED                                                      │
╰────────────────────────────────────────────────────── Threshold: 85.0% ──────╯
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Test Case              ┃ Pass Rate┃ 95% CI          ┃ Avg Cost ┃ Avg Latency┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━┩
│ basic-math             │   100.0% │ (72.2%-100.0%)  │  $0.0005 │      1.59s │
└────────────────────────┴──────────┴─────────────────┴──────────┴────────────┘

Overall Pass Rate: 100.0% (72.2%-100.0%)
Total Cost: $0.005
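
To make the 95% CI column concrete, here is a minimal sketch of the standard Wilson score interval. The function name `wilson_interval` is hypothetical and is not taken from agenteval's API. With 10 passes out of 10 trials it reproduces the interval shown above. Note that in the sample output the suite passes even though the lower bound (72.2%) is below the 85% threshold, which suggests the threshold is compared with the observed pass rate rather than with the lower bound. That reading is an inference from the output, not a documented rule.

```python
import math


def wilson_interval(passes: int, trials: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.959964 for 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = passes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    centre = (p + z2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


low, high = wilson_interval(passes=10, trials=10)
print(f"{low:.1%}-{high:.1%}")  # 72.2%-100.0%
```

Unlike the normal approximation, which collapses to a zero-width interval at 100%, the Wilson interval stays informative when every trial passes.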

Real-World Results

Tested with Claude 3 Haiku on a 3-tool agent (calculator, country lookup, temperature conversion):

Test Complexity              Pass Rate  95% CI       Avg Cost  Avg Latency  Avg Tokens
Easy (direct tool call)      100%       72.2%-100%   $0.0005   1.6s         1,513
Medium (inference + tool)    100%       72.2%-100%   $0.0006   2.6s         1,926
Hard (multi-step reasoning)  100%       72.2%-100%   $0.0010   3.5s         2,986

100 trials total, $0.06 total cost, full trajectory capture.

Test Case Options

cases:
  - name: my-test
    input:
      query: "User question"
      context: {}  # Optional context dict
    expected:
      # All strings must be present (AND logic)
      output_contains: ["expected", "words"]

      # At least one string must be present (OR logic)
      output_contains_any: ["option1", "option2", "option3"]

      # Regex pattern
      regex: "\\d+ results found"

      # Expected tool calls
      tool_calls:
        - tool: search
          params_contain: {query: "expected"}
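
The matching rules in the comments above can be sketched in a few lines of Python. This is an illustration of the semantics only, not agenteval's implementation. The function names, the shape of the recorded tool calls (`{"tool": ..., "params": {...}}`), case-sensitive matching, and reading `params_contain` as a substring match are all assumptions.

```python
import re


def check_output(output: str, expected: dict) -> bool:
    """Apply output_contains (AND), output_contains_any (OR), and regex."""
    if not all(s in output for s in expected.get("output_contains", [])):
        return False
    any_of = expected.get("output_contains_any", [])
    if any_of and not any(s in output for s in any_of):
        return False
    pattern = expected.get("regex")
    if pattern is not None and re.search(pattern, output) is None:
        return False
    return True


def check_tool_calls(calls: list[dict], expected_calls: list[dict]) -> bool:
    """Every expected call must be matched by at least one recorded call."""

    def matches(call: dict, want: dict) -> bool:
        if call.get("tool") != want["tool"]:
            return False
        params = call.get("params", {})
        # Assumption: each expected value must appear as a substring
        # of the corresponding recorded parameter.
        return all(
            str(value) in str(params.get(key, ""))
            for key, value in want.get("params_contain", {}).items()
        )

    return all(any(matches(c, w) for c in calls) for w in expected_calls)


expected = {"output_contains": ["555"], "tool_calls": [{"tool": "calculate"}]}
recorded = [{"tool": "calculate", "params": {"expression": "15 * 37"}}]
ok = check_output("15 * 37 = 555", expected) and check_tool_calls(
    recorded, expected["tool_calls"]
)
print(ok)  # True
```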

CI/CD Integration

Add to .github/workflows/agent-eval.yml:

name: Agent Evaluation
on: [push, pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install agentrial
      - run: agenteval run --threshold 0.85
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

LangGraph Integration

from langchain_anthropic import ChatAnthropic
from langgraph.prebuilt import create_react_agent
from agenteval.runner.adapters import wrap_langgraph_agent

# Your LangGraph agent
llm = ChatAnthropic(model="claude-3-haiku-20240307")
graph = create_react_agent(llm, tools=[...])

# Wrap for agenteval
agent = wrap_langgraph_agent(graph)

Then reference it in your test file:

agent: my_module.agent

CLI Reference

agenteval run [PATH]           # Run tests (default: current directory)
agenteval run --trials 20      # Override trial count
agenteval run --threshold 0.9  # Override pass threshold
agenteval run -o results.json  # Export JSON report

agenteval compare -b baseline.json current.json  # Compare runs
agenteval init                                   # Initialize project

Supported Frameworks

  • LangGraph (native adapter with full trajectory capture)
  • CrewAI (coming soon)
  • AutoGen (coming soon)
  • Pydantic AI (coming soon)

Statistical Methods

  • Pass rate CI: Wilson score interval (accurate for small N and extreme proportions)
  • Cost/latency CI: Bootstrap resampling (500 iterations)
  • Regression detection: Fisher exact test for pass rate, Mann-Whitney U for metrics
  • Failure attribution: Trajectory divergence analysis between passed/failed trials
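
For readers who want to see what two of these methods compute, the following standard-library sketch shows a percentile bootstrap interval for a mean (500 resamples, as listed above) and a two-sided Fisher exact test on pass/fail counts. It is not agenteval's code. The function names, the fixed seed, and the choice of the percentile bootstrap variant are assumptions, and the latency values are made up for illustration.

```python
import math
import random
import statistics


def bootstrap_mean_ci(values, iterations=500, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(iterations)
    )
    alpha = 1 - confidence
    lo_idx = math.floor(alpha / 2 * iterations)
    hi_idx = min(iterations - 1, math.ceil((1 - alpha / 2) * iterations) - 1)
    return means[lo_idx], means[hi_idx]


def fisher_exact_two_sided(pass_a, fail_a, pass_b, fail_b):
    """Two-sided Fisher exact p-value for a 2x2 pass/fail table."""
    row_a, row_b = pass_a + fail_a, pass_b + fail_b
    total_pass = pass_a + pass_b
    total = row_a + row_b

    def prob(k):
        # Hypergeometric probability of k passes in run A, margins fixed.
        return (
            math.comb(row_a, k)
            * math.comb(row_b, total_pass - k)
            / math.comb(total, total_pass)
        )

    observed = prob(pass_a)
    p_value = 0.0
    for k in range(max(0, total_pass - row_b), min(row_a, total_pass) + 1):
        p_k = prob(k)
        # Sum every table that is at least as unlikely as the observed one.
        if p_k <= observed * (1 + 1e-9):
            p_value += p_k
    return min(1.0, p_value)


# Illustrative data only: per-trial latencies in seconds.
latencies = [1.4, 1.5, 1.6, 1.6, 1.7, 1.5, 1.8, 1.6, 1.5, 1.7]
lo, hi = bootstrap_mean_ci(latencies)
print(f"mean latency CI: {lo:.2f}s-{hi:.2f}s")

# Baseline run: 10/10 passed. Current run: 5/10 passed.
p = fisher_exact_two_sided(10, 0, 5, 5)
print(f"p = {p:.4f}")  # p = 0.0325
```

With these counts, a drop from 10/10 to 5/10 gives p ≈ 0.0325, which would be flagged as a regression at the conventional 0.05 level.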

License

MIT
