Statistical evaluation framework for AI agents - pytest for agent trajectories

agentrial

License: MIT | Python 3.11+

Your agent passes Monday, fails Wednesday. agentrial tells you why.

Statistical evaluation framework for AI agents. Like pytest, but for non-deterministic systems.

Why agentrial?

  • Statistical rigor: Every test runs N trials with Wilson confidence intervals — not "it passed once"
  • Trajectory analysis: Step-by-step failure attribution identifies exactly which tool call diverged
  • Cost tracking: Real API costs from model metadata, not estimates
  • CI/CD ready: GitHub Action that blocks PRs when reliability drops

Quick Start

pip install agentrial

Create a test file agentrial.yml:

suite: my-agent-tests
agent: my_agent.agent  # Python import path
trials: 10
threshold: 0.85

cases:
  - name: basic-math
    input:
      query: "What is 15 * 37?"
    expected:
      output_contains: ["555"]
      tool_calls:
        - tool: calculate

Run:

agentrial run

Output:

╭──────────────────────────────────────────────────────────────────────────────╮
│ my-agent-tests - PASSED                                                      │
╰────────────────────────────────────────────────────── Threshold: 85.0% ──────╯
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Test Case              ┃ Pass Rate┃ 95% CI          ┃ Avg Cost ┃ Avg Latency┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━┩
│ basic-math             │   100.0% │ (72.2%-100.0%)  │  $0.0005 │      1.59s │
└────────────────────────┴──────────┴─────────────────┴──────────┴────────────┘

Overall Pass Rate: 100.0% (72.2%-100.0%)
Total Cost: $0.005
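
The 95% CI column above can be reproduced by hand. The following is a minimal sketch that assumes the standard Wilson score formula with z ≈ 1.96; the function name wilson_interval is illustrative and is not part of agentrial's API.

```python
import math


def wilson_interval(passes: int, trials: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a pass rate of passes/trials."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = passes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


lo, hi = wilson_interval(passes=10, trials=10)
print(f"({lo:.1%}-{hi:.1%})")  # (72.2%-100.0%), matching the basic-math row
```

With only 10 trials, even a perfect run cannot rule out a true pass rate as low as about 72%, which is below the 85% threshold; raising the trial count narrows the interval.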

Real-World Results

Tested with Claude 3 Haiku on a 3-tool agent (calculator, country lookup, temperature conversion):

Test Complexity              Pass Rate  95% CI       Avg Cost  Avg Latency  Avg Tokens
Easy (direct tool call)      100%       72.2%-100%   $0.0005   1.6s         1,513
Medium (inference + tool)    100%       72.2%-100%   $0.0006   2.6s         1,926
Hard (multi-step reasoning)  100%       72.2%-100%   $0.0010   3.5s         2,986

100 trials total, $0.06 total cost, full trajectory capture.

Test Case Options

cases:
  - name: my-test
    input:
      query: "User question"
      context: {}  # Optional context dict
    expected:
      # All strings must be present (AND logic)
      output_contains: ["expected", "words"]

      # At least one string must be present (OR logic)
      output_contains_any: ["option1", "option2", "option3"]

      # Regex pattern
      regex: "\\d+ results found"

      # Expected tool calls
      tool_calls:
        - tool: search
          params_contain: {query: "expected"}
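
To make the AND/OR/regex semantics above concrete, here is a rough sketch of how the three output expectations could be combined. It is not agentrial's implementation: check_output is a hypothetical helper, and treating the regex as a search anywhere in the output (rather than a full match) is an assumption.

```python
import re


def check_output(output: str, expected: dict) -> bool:
    """Return True if `output` satisfies every output expectation present."""
    all_of = expected.get("output_contains", [])
    any_of = expected.get("output_contains_any", [])
    pattern = expected.get("regex")

    # AND logic: every listed string must appear.
    if not all(s in output for s in all_of):
        return False
    # OR logic: at least one listed string must appear (skipped if the list is empty).
    if any_of and not any(s in output for s in any_of):
        return False
    # Regex: assumed to match anywhere in the output.
    if pattern is not None and re.search(pattern, output) is None:
        return False
    return True


expected = {
    "output_contains": ["expected", "words"],
    "output_contains_any": ["option1", "option2", "option3"],
    "regex": r"\d+ results found",
}
print(check_output("expected words, option2: 12 results found", expected))
```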

CI/CD Integration

Add to .github/workflows/agent-eval.yml:

name: Agent Evaluation
on: [push, pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install agentrial
      - run: agentrial run --threshold 0.85
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

LangGraph Integration

from langchain_anthropic import ChatAnthropic
from langgraph.prebuilt import create_react_agent
from agentrial.runner.adapters import wrap_langgraph_agent

# Your LangGraph agent
llm = ChatAnthropic(model="claude-3-haiku-20240307")
graph = create_react_agent(llm, tools=[...])

# Wrap for agentrial
agent = wrap_langgraph_agent(graph)

Then reference in your test file:

agent: my_module.agent

CLI Reference

agentrial run [PATH]           # Run tests (default: current directory)
agentrial run --trials 20      # Override trial count
agentrial run --threshold 0.9  # Override pass threshold
agentrial run -o results.json  # Export JSON report

agentrial compare -b baseline.json current.json  # Compare runs
agentrial init                                   # Initialize project

Supported Frameworks

  • LangGraph (native adapter with full trajectory capture)
  • CrewAI (coming soon)
  • AutoGen (coming soon)
  • Pydantic AI (coming soon)

Statistical Methods

  • Pass rate CI: Wilson score interval (accurate for small N and extreme proportions)
  • Cost/latency CI: Bootstrap resampling (500 iterations)
  • Regression detection: Fisher exact test for pass rate, Mann-Whitney U for metrics
  • Failure attribution: Trajectory divergence analysis between passed/failed trials
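
As an illustration of the cost/latency CI and regression-detection methods listed above, the sketch below implements a percentile bootstrap for a mean and a two-sided Fisher exact test using only the standard library. The function names, the percentile method, the fixed seed, and the two-sided alternative are all assumptions made for the example; agentrial's own code may differ.

```python
import random
from math import comb
from statistics import mean


def bootstrap_mean_ci(samples, iterations=500, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of `samples`."""
    rng = random.Random(seed)  # seeded so the example is repeatable
    n = len(samples)
    means = sorted(mean(rng.choices(samples, k=n)) for _ in range(iterations))
    lo = means[int((alpha / 2) * iterations)]
    hi = means[min(iterations - 1, int((1 - alpha / 2) * iterations))]
    return lo, hi


def fisher_exact_two_sided(pass_a, fail_a, pass_b, fail_b):
    """Two-sided Fisher exact p-value for a 2x2 pass/fail table."""
    row_a = pass_a + fail_a
    total_pass = pass_a + pass_b
    total = row_a + pass_b + fail_b
    total_fail = total - total_pass

    def weight(k):
        # Unnormalised hypergeometric weight, kept as an exact integer.
        return comb(total_pass, k) * comb(total_fail, row_a - k)

    k_min = max(0, row_a - total_fail)
    k_max = min(row_a, total_pass)
    observed = weight(pass_a)
    # Sum every table that is at most as likely as the observed one.
    extreme = sum(w for w in map(weight, range(k_min, k_max + 1)) if w <= observed)
    return extreme / comb(total, row_a)


latencies = [1.4, 1.5, 1.6, 1.6, 1.7, 1.5, 1.8, 1.6, 1.5, 1.7]  # made-up sample data
print(bootstrap_mean_ci(latencies))

# Baseline: 10/10 passed. Current: 5/10 passed.
print(f"{fisher_exact_two_sided(10, 0, 5, 5):.4f}")  # 0.0325
```

In this made-up comparison the p-value falls below 0.05, so a drop from 10/10 to 5/10 would be flagged as a regression at that significance level.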

License

MIT
