Statistical evaluation framework for AI agents - pytest for agent trajectories

agentrial

License: MIT | Python 3.11+

Your agent passes Monday, fails Wednesday. agentrial tells you why.

Statistical evaluation framework for AI agents. Like pytest, but for non-deterministic systems.

Why agentrial?

  • Statistical rigor: Every test runs N trials and reports a Wilson confidence interval on the pass rate, rather than "it passed once"
  • Trajectory analysis: Step-by-step failure attribution identifies exactly which tool call diverged
  • Cost tracking: Real API costs from model metadata, not estimates
  • CI/CD ready: GitHub Action that blocks PRs when reliability drops

Quick Start

pip install agentrial

Create a test file agentrial.yml:

suite: my-agent-tests
agent: my_agent.agent  # Python import path
trials: 10
threshold: 0.85

cases:
  - name: basic-math
    input:
      query: "What is 15 * 37?"
    expected:
      output_contains: ["555"]
      tool_calls:
        - tool: calculate

Run:

agentrial run

Output:

╭──────────────────────────────────────────────────────────────────────────────╮
│ my-agent-tests - PASSED                                                      │
╰────────────────────────────────────────────────────── Threshold: 85.0% ──────╯
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Test Case              ┃ Pass Rate┃ 95% CI          ┃ Avg Cost ┃ Avg Latency┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━┩
│ basic-math             │   100.0% │ (72.2%-100.0%)  │  $0.0005 │      1.59s │
└────────────────────────┴──────────┴─────────────────┴──────────┴────────────┘

Overall Pass Rate: 100.0% (72.2%-100.0%)
Total Cost: $0.005
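
The 95% CI column can be reproduced by hand. The following is a minimal sketch of the standard Wilson score interval using only the Python standard library; the function name `wilson_interval` is illustrative and is not taken from agentrial's API. With 10 passes out of 10 trials it gives the same lower bound shown in the table above.

```python
import math
from statistics import NormalDist


def wilson_interval(passes: int, trials: int, confidence: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for a binomial pass rate (illustrative helper)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p_hat = passes / trials
    denom = 1 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half_width = z * math.sqrt(
        p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials)
    ) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, center - half_width), min(1.0, center + half_width)


low, high = wilson_interval(passes=10, trials=10)
print(f"{low:.1%}-{high:.1%}")  # 72.2%-100.0%
```

Even a perfect 10/10 run only supports a lower bound of about 72%, which is why raising `trials` tightens the interval.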

Real-World Results

Tested with Claude 3 Haiku on a 3-tool agent (calculator, country lookup, temperature conversion):

Test Complexity               Pass Rate   95% CI         Avg Cost   Avg Latency   Avg Tokens
Easy (direct tool call)       100%        72.2%-100%     $0.0005    1.6s          1,513
Medium (inference + tool)     100%        72.2%-100%     $0.0006    2.6s          1,926
Hard (multi-step reasoning)   100%        72.2%-100%     $0.0010    3.5s          2,986

100 trials total, $0.06 total cost, full trajectory capture.

Test Case Options

cases:
  - name: my-test
    input:
      query: "User question"
      context: {}  # Optional context dict
    expected:
      # All strings must be present (AND logic)
      output_contains: ["expected", "words"]

      # At least one string must be present (OR logic)
      output_contains_any: ["option1", "option2", "option3"]

      # Regex pattern
      regex: "\\d+ results found"

      # Expected tool calls
      tool_calls:
        - tool: search
          params_contain: {query: "expected"}
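
To make the matching semantics above concrete, here is a rough Python sketch of how a single trial could be checked against an `expected` block. It is not agentrial's implementation: the function names, the shape of the recorded tool calls (`{"tool": ..., "params": {...}}`), the use of `re.search` for `regex`, and substring matching for `params_contain` are all assumptions made for illustration.

```python
import re


def _call_matches(call: dict, want: dict) -> bool:
    """True if a recorded tool call satisfies one expected entry (assumed semantics)."""
    if call.get("tool") != want["tool"]:
        return False
    params = call.get("params", {})
    # Assumption: each expected value must appear as a substring of the actual value.
    return all(
        str(value) in str(params.get(key, ""))
        for key, value in want.get("params_contain", {}).items()
    )


def check_expected(output: str, tool_calls: list[dict], expected: dict) -> bool:
    """Evaluate one trial's output and tool calls against an `expected` block."""
    # output_contains: every string must be present (AND logic).
    if not all(s in output for s in expected.get("output_contains", [])):
        return False
    # output_contains_any: at least one string must be present (OR logic).
    any_of = expected.get("output_contains_any")
    if any_of and not any(s in output for s in any_of):
        return False
    # regex: assumed to match anywhere in the output.
    pattern = expected.get("regex")
    if pattern and re.search(pattern, output) is None:
        return False
    # tool_calls: each expected call must be matched by some recorded call.
    return all(
        any(_call_matches(call, want) for call in tool_calls)
        for want in expected.get("tool_calls", [])
    )


expected = {"output_contains": ["555"], "tool_calls": [{"tool": "calculate"}]}
calls = [{"tool": "calculate", "params": {"expression": "15 * 37"}}]
print(check_expected("15 * 37 = 555", calls, expected))
```

Under these assumptions all listed checks are combined with AND, so a case that sets both `output_contains` and `tool_calls` passes only when both hold.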

CI/CD Integration

Add to .github/workflows/agent-eval.yml:

name: Agent Evaluation
on: [push, pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install agentrial
      - run: agentrial run --threshold 0.85
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

LangGraph Integration

from langchain_anthropic import ChatAnthropic
from langgraph.prebuilt import create_react_agent
from agentrial.runner.adapters import wrap_langgraph_agent

# Your LangGraph agent
llm = ChatAnthropic(model="claude-3-haiku-20240307")
graph = create_react_agent(llm, tools=[...])

# Wrap for agentrial
agent = wrap_langgraph_agent(graph)

Then reference the wrapped agent by its Python import path in your test file:

agent: my_module.agent
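
The `tools=[...]` list in the snippet above is left elided. As one illustration of what could go there, the sketch below defines a `calculate` tool like the one asserted in Quick Start. It is a hypothetical example, not the tool used for the published results, and it assumes the agent builder accepts a plain typed function with a docstring as a tool; check the LangGraph documentation for your version. It evaluates arithmetic through the `ast` module rather than `eval`, so the model cannot run arbitrary code.

```python
import ast
import operator

# Only these binary operators are allowed; anything else is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _evaluate(node: ast.AST) -> float:
    """Recursively evaluate a restricted arithmetic expression tree."""
    if (
        isinstance(node, ast.Constant)
        and isinstance(node.value, (int, float))
        and not isinstance(node.value, bool)
    ):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
        return _BINARY_OPS[type(node.op)](_evaluate(node.left), _evaluate(node.right))
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -_evaluate(node.operand)
    raise ValueError("unsupported expression")


def calculate(expression: str) -> str:
    """Evaluate a basic arithmetic expression such as '15 * 37'."""
    tree = ast.parse(expression, mode="eval")
    return str(_evaluate(tree.body))


print(calculate("15 * 37"))  # 555
```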

CLI Reference

agentrial run [PATH]           # Run tests (default: current directory)
agentrial run --trials 20      # Override trial count
agentrial run --threshold 0.9  # Override pass threshold
agentrial run -o results.json  # Export JSON report

agentrial compare -b baseline.json current.json  # Compare runs
agentrial init                                   # Initialize project

Supported Frameworks

  • LangGraph (native adapter with full trajectory capture)
  • CrewAI (coming soon)
  • AutoGen (coming soon)
  • Pydantic AI (coming soon)

Statistical Methods

  • Pass rate CI: Wilson score interval (accurate for small N and extreme proportions)
  • Cost/latency CI: Bootstrap resampling (500 iterations)
  • Regression detection: Fisher exact test for pass rate, Mann-Whitney U for metrics
  • Failure attribution: Trajectory divergence analysis between passed/failed trials
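
Two of the methods listed above can be sketched with the standard library alone. The code below shows a percentile bootstrap for a mean (500 resamples, matching the count stated above) and a two-sided Fisher exact test on pass/fail counts. Function names, the fixed seed, the choice of the percentile variant, and the sample latencies are assumptions for illustration; agentrial's internals may differ.

```python
import math
import random
from statistics import fmean


def bootstrap_mean_ci(samples, iterations=500, confidence=0.95, seed=0):
    """Percentile bootstrap CI for the mean of per-trial costs or latencies."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    means = sorted(
        fmean(rng.choices(samples, k=len(samples))) for _ in range(iterations)
    )
    alpha = 1.0 - confidence
    lower_idx = math.floor((alpha / 2) * (iterations - 1))
    upper_idx = math.ceil((1 - alpha / 2) * (iterations - 1))
    return means[lower_idx], means[upper_idx]


def fisher_exact_two_sided(pass_a, fail_a, pass_b, fail_b):
    """Two-sided Fisher exact p-value for a 2x2 table of pass/fail counts."""
    row_a, row_b = pass_a + fail_a, pass_b + fail_b
    total_pass, total = pass_a + pass_b, row_a + row_b

    def prob(x):
        # Hypergeometric probability of x passes in run A with margins fixed.
        return (
            math.comb(row_a, x)
            * math.comb(row_b, total_pass - x)
            / math.comb(total, total_pass)
        )

    observed = prob(pass_a)
    lo, hi = max(0, total_pass - row_b), min(row_a, total_pass)
    # Sum every table at least as unlikely as the observed one.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= observed * (1 + 1e-9))
    return min(1.0, p)


latencies = [1.4, 1.6, 1.5, 1.9, 1.7, 1.6, 1.5, 1.8, 1.6, 1.3]  # hypothetical seconds
low, high = bootstrap_mean_ci(latencies)
print(f"mean latency CI: {low:.2f}s-{high:.2f}s")

# Baseline passed 10/10, current run passed 5/10.
p_value = fisher_exact_two_sided(10, 0, 5, 5)
print(f"p = {p_value:.4f}")  # p = 0.0325
```

With 10 trials per run, a drop from 10/10 to 5/10 is just significant at the 5% level, while smaller drops are not; this is a practical reason to run more trials when regression detection matters.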

License

MIT
