Testing and evaluation framework for AI agents

AgentEval 🧪

Python 3.9+ | License: MIT

Testing and evaluation framework for AI agents. Define test suites in YAML, grade agent outputs with 6 pluggable graders, track results over time, and detect regressions with statistical comparison.


Why AgentEval?

AI agents are hard to test: they are non-deterministic, they call tools, and their outputs vary between runs. Traditional unit tests, which assert a single fixed output, don't cut it.

  • 🎯 YAML-based test suites — Define inputs, expected outputs, and grading criteria declaratively
  • 📊 Statistical regression detection — Welch's t-test across multiple runs, not just pass/fail
  • 🔌 6 built-in graders — Exact match, contains, regex, tool-check, LLM-judge, and custom
  • 🔗 AgentLens integration — Import real production sessions as test cases
  • 💰 Cost & latency tracking — Know what each eval costs in tokens and dollars
  • 🗄️ SQLite result storage — Every run is persisted for historical comparison

Quick Start

pip install agentevalkit

1. Define a test suite

# suite.yaml
name: my-agent-tests
agent: my_agent:run

cases:
  - name: basic-math
    input: "What is 2 + 2?"
    expected:
      output_contains: ["4"]
    grader: contains

  - name: tool-usage
    input: "Search for the weather in NYC"
    expected:
      tools_called: ["web_search"]
    grader: tool-check

  - name: format-check
    input: "List 3 colors"
    expected:
      pattern: "\\d\\.\\s+\\w+"
    grader: regex

2. Create your agent callable

# my_agent.py
from agenteval.models import AgentResult

def run(input_text: str) -> AgentResult:
    # Your agent logic here
    return AgentResult(
        output="The answer is 4.",
        tools_called=[{"name": "web_search", "args": {"query": "weather NYC"}}],
        tokens_in=12,
        tokens_out=8,
        cost_usd=0.0003,
    )

3. Run the eval

$ agenteval run --suite suite.yaml --verbose

============================================================
Suite: my-agent-tests  |  Run: c1c6493118d5
============================================================
  PASS  basic-addition (score=1.00, 150ms)
  PASS  capital-city (score=1.00, 200ms)
  PASS  quantum-summary (score=1.00, 350ms)
  PASS  tool-usage (score=1.00, 280ms)
  PASS  list-format (score=1.00, 120ms)

Total: 5  Passed: 5  Failed: 0  Pass rate: 100%
Cost: $0.0023  Avg latency: 220ms

Features

🎯 6 Built-in Graders

Grader       What it checks          Expected fields
-----------------------------------------------------------------------
exact        Exact string match      output
contains     Substring presence      output_contains: [list]
regex        Pattern matching        pattern
tool-check   Tools were called       tools_called: [list]
llm-judge    LLM evaluates quality   criteria (free-form)
custom       Your own function       grader_config: {function: "mod:fn"}

📊 Statistical Comparison

Compare runs with Welch's t-test to detect statistically significant regressions:

$ agenteval compare c1c6493118d5,d17a2dce0222 4ee7e40601e3,ba5b0dde212b

============================================================================
Comparing: c1c6493118d5,d17a2dce0222 vs 4ee7e40601e3,ba5b0dde212b
Alpha: 0.05  Regression threshold: 0.0
============================================================================

Case                          Base   Target     Diff   p-value  Sig Status
----------------------------------------------------------------------------
  basic-addition             1.000    1.000   +0.000         —
  capital-city               1.000    0.500   -0.500    0.4533
  quantum-summary            1.000    0.500   -0.500    0.4533
  tool-usage                 1.000    0.000   -1.000    0.0000    * ▼ regressed
  list-format                1.000    0.500   -0.500    0.4533

Summary: 0 improved, 1 regressed, 4 unchanged

⚠ 1 regression(s) detected!

To compare groups, run the same suite several times and pass each group as comma-separated run IDs: agenteval compare RUN_A1,RUN_A2 vs RUN_B1,RUN_B2. The test uses SciPy when it is installed and falls back to a pure-Python implementation otherwise.
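For intuition, here is a minimal pure-Python sketch of the statistic behind Welch's t-test. It is not AgentEval's implementation: the function name welch_t and the scores are made up for illustration, and it stops at the t statistic and degrees of freedom (turning those into a p-value needs the t-distribution CDF, which is what SciPy provides).

```python
import math
from statistics import mean, variance


def welch_t(base, target):
    """Return (t, df) for Welch's unequal-variance t-test, or None if undefined."""
    n_a, n_b = len(base), len(target)
    if n_a < 2 or n_b < 2:
        return None  # sample variance needs at least two scores per group
    # Squared standard error of each group mean (sample variance / n).
    se_a = variance(base) / n_a
    se_b = variance(target) / n_b
    if se_a + se_b == 0:
        return None  # both groups are constant, so t is undefined
    t = (mean(target) - mean(base)) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t, df


# Hypothetical scores for one case across two groups of four runs each.
base_scores = [1.0, 1.0, 0.5, 1.0]
target_scores = [0.5, 0.0, 0.5, 0.5]
t, df = welch_t(base_scores, target_scores)
print(f"t = {t:.3f}, df = {df:.1f}")  # t = -2.828, df = 6.0
```

A negative t means the target group scored lower on average. With only two runs per group the degrees of freedom are very small, so more runs per group give the test more power to separate a real regression from noise.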

🔗 AgentLens Integration

Import real agent sessions from AgentLens as test suites:

agenteval import --from agentlens --db sessions.db --output suite.yaml --grader contains
# Imported 42 cases → suite.yaml

Turn production traffic into regression tests — no manual test writing needed.

💰 Cost & Latency Tracking

Every eval tracks tokens and cost. Your agent callable returns AgentResult with tokens_in, tokens_out, and cost_usd, and AgentEval aggregates them per run.
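As a rough illustration of per-run aggregation, the sketch below sums the three fields over a list of results. The AgentResult class here is a stand-in that carries only the fields named above, and summarize_run is a hypothetical helper, not part of AgentEval.

```python
from dataclasses import dataclass


@dataclass
class AgentResult:
    """Stand-in with only the fields named above; the real class is in agenteval.models."""
    output: str
    tokens_in: int = 0
    tokens_out: int = 0
    cost_usd: float = 0.0


def summarize_run(results):
    """Sum token counts and cost over the results of one run."""
    return {
        "tokens_in": sum(r.tokens_in for r in results),
        "tokens_out": sum(r.tokens_out for r in results),
        # Round to avoid float noise such as 0.0005000000000000001.
        "cost_usd": round(sum(r.cost_usd for r in results), 6),
    }


results = [
    AgentResult(output="The answer is 4.", tokens_in=12, tokens_out=8, cost_usd=0.0003),
    AgentResult(output="Paris", tokens_in=10, tokens_out=2, cost_usd=0.0002),
]
print(summarize_run(results))
# {'tokens_in': 22, 'tokens_out': 10, 'cost_usd': 0.0005}
```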


YAML Suite Format

Full annotated example:

name: my-agent-tests           # Suite name (shown in reports)
agent: my_module:my_agent      # Default agent callable (module:function)

defaults:                       # Defaults applied to all cases
  grader: contains
  grader_config:
    ignore_case: true

cases:
  - name: basic-math            # Unique case name
    input: "What is 2 + 2?"     # Input passed to agent
    expected:                    # Grader-specific expected values
      output_contains: ["4"]
    grader: contains             # Override default grader
    tags: [math, basic]          # Tags for filtering (--tag math)

  - name: tool-usage
    input: "Search for weather"
    expected:
      tools_called: ["web_search"]
    grader: tool-check

  - name: quality-check
    input: "Explain gravity"
    expected:
      criteria: "Should mention Newton or Einstein, be scientifically accurate"
    grader: llm-judge
    grader_config:
      model: gpt-4o-mini         # LLM judge model
      api_base: https://api.openai.com/v1

  - name: custom-validation
    input: "Generate a JSON object"
    expected: {}
    grader: custom
    grader_config:
      function: my_graders:validate_json  # Your grader function
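
The agent and function values use the module:function convention. How AgentEval resolves them internally is not shown here; a common way to do it with the standard library looks like the sketch below, where resolve_callable is a hypothetical name.

```python
import importlib


def resolve_callable(spec: str):
    """Resolve a "module:function" string to the callable it names."""
    module_name, sep, attr = spec.partition(":")
    if not sep or not module_name or not attr:
        raise ValueError(f"expected 'module:function', got {spec!r}")
    module = importlib.import_module(module_name)
    target = getattr(module, attr)
    if not callable(target):
        raise TypeError(f"{spec!r} does not name a callable")
    return target


# Demonstrated with a standard-library function so the snippet runs anywhere.
dumps = resolve_callable("json:dumps")
print(dumps({"ok": True}))  # {"ok": true}
```

In practice this means the module (my_module, my_graders) has to be importable from wherever you run the CLI, for example by running it from the directory that contains the file.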

CLI Reference

agenteval run

agenteval run --suite suite.yaml [--agent module:fn] [--verbose] [--tag math] [--timeout 30] [--db agenteval.db]

  • --suite — Path to YAML suite file (required)
  • --agent — Override the agent callable from the suite
  • --verbose / -v — Show per-case pass/fail details
  • --tag — Filter cases by tag (repeatable)
  • --timeout — Per-case timeout in seconds (default: 30)
  • --db — SQLite database path (default: agenteval.db)

Exit code is 1 if any case fails.
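
To make the --timeout option concrete, here is one way a per-case timeout can be enforced around an agent callable. This is a sketch under assumptions, not AgentEval's actual mechanism: run_with_timeout and slow_agent are hypothetical, and a thread-based timeout only stops waiting for the call, it cannot cancel it.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def run_with_timeout(agent, input_text, timeout_s):
    """Call agent(input_text) and return (result, timed_out)."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(agent, input_text)
    try:
        return future.result(timeout=timeout_s), False
    except FutureTimeout:
        return None, True
    finally:
        # Do not block on a stuck call; the worker thread finishes on its own.
        executor.shutdown(wait=False)


def slow_agent(text):
    """Hypothetical agent that takes longer than the timeout allows."""
    time.sleep(0.5)
    return text.upper()


print(run_with_timeout(str.upper, "hi", timeout_s=1.0))     # ('HI', False)
print(run_with_timeout(slow_agent, "hi", timeout_s=0.05))   # (None, True)
```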

agenteval list

agenteval list [--suite-filter name] [--limit 20] [--db agenteval.db]

$ agenteval list --limit 5

ID             Suite                Passed   Failed   Rate     Created
--------------------------------------------------------------------------------
aeccd5e53f03   math-agent-demo      2        3        40%      2026-02-12T21:12:12
4f3e380f622c   math-agent-demo      3        2        60%      2026-02-12T21:12:12
bd4ef3a0727b   math-agent-demo      1        4        20%      2026-02-12T21:12:12
e2ca43e99852   math-agent-demo      3        2        60%      2026-02-12T21:12:11
32ed650cab6d   math-agent-demo      2        3        40%      2026-02-12T21:12:11

agenteval compare

agenteval compare RUN_A RUN_B [--alpha 0.05] [--threshold 0.0] [--stats/--no-stats]
agenteval compare RUN_A1,RUN_A2 vs RUN_B1,RUN_B2   # Multi-run comparison

agenteval import

agenteval import --from agentlens --db sessions.db --output suite.yaml [--grader contains] [--limit 100]

Grader Reference

exact

Compares result.output exactly with expected.output. Config: ignore_case: bool.

expected:
  output: "The answer is 42."
grader: exact
grader_config:
  ignore_case: true

contains

Checks that all substrings in expected.output_contains appear in the output.

expected:
  output_contains: ["Paris", "France"]
grader: contains
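
The check described above amounts to an all-substrings test. A minimal sketch, assuming ignore_case lowercases both sides before comparing (grade_contains is a hypothetical name, not the library's function):

```python
def grade_contains(output, substrings, ignore_case=False):
    """Return (passed, missing); passed is True only if every substring is present."""
    haystack = output.lower() if ignore_case else output
    missing = [
        s for s in substrings
        if (s.lower() if ignore_case else s) not in haystack
    ]
    return not missing, missing


print(grade_contains("The capital of France is Paris.", ["Paris", "France"]))
# (True, [])
print(grade_contains("the capital is paris", ["Paris", "France"], ignore_case=True))
# (False, ['France'])
```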

regex

Matches result.output against expected.pattern (Python regex). Config: flags: [IGNORECASE, DOTALL, MULTILINE].

expected:
  pattern: "\\d+\\.\\d+"
grader: regex
grader_config:
  flags: [IGNORECASE]
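
Note that inside a double-quoted YAML string, \\d is the two characters \d, so the pattern above reaches Python as \d+\.\d+. The sketch below shows how flag names can be mapped onto the re module. It assumes the pattern may match anywhere in the output (re.search); whether AgentEval uses search or a full match is not stated here, and grade_regex is a hypothetical name.

```python
import re


def grade_regex(output, pattern, flags=()):
    """Return True if pattern is found anywhere in output."""
    combined = 0
    for name in flags:
        # getattr(re, "IGNORECASE") and friends yield the matching re flag.
        combined |= getattr(re, name)
    return re.search(pattern, output, combined) is not None


print(grade_regex("Pi is roughly 3.14", r"\d+\.\d+"))               # True
print(grade_regex("HELLO world", r"hello", flags=["IGNORECASE"]))   # True
print(grade_regex("no digits here", r"\d+\.\d+"))                   # False
```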

tool-check

Verifies that the expected tools were called. Config: ordered: bool to also require that they were called in the listed order.

expected:
  tools_called: ["web_search", "calculator"]
grader: tool-check
grader_config:
  ordered: true
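
One plausible reading of ordered is a subsequence check: the expected tools must appear in that order, but other calls may sit between them. The sketch below implements that reading as an assumption. grade_tools is a hypothetical name, the call dictionaries follow the {"name": ..., "args": ...} shape from the Quick Start, and the calculator arguments are invented.

```python
def grade_tools(called, expected, ordered=False):
    """Check that every expected tool name appears among the calls made."""
    names = [call["name"] for call in called]
    if not ordered:
        return all(tool in names for tool in expected)
    # Ordered: expected must be a subsequence of names. Each `in` test
    # consumes the iterator up to and including the match.
    remaining = iter(names)
    return all(tool in remaining for tool in expected)


calls = [
    {"name": "web_search", "args": {"query": "weather NYC"}},
    {"name": "calculator", "args": {"expr": "72 - 32"}},
]
print(grade_tools(calls, ["web_search", "calculator"], ordered=True))   # True
print(grade_tools(calls, ["calculator", "web_search"], ordered=True))   # False
print(grade_tools(calls, ["calculator", "web_search"]))                 # True
```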

llm-judge

Sends the input, output, and criteria to an LLM for evaluation. Requires OPENAI_API_KEY or an OpenAI-compatible API; the endpoint can be set with api_base in grader_config.

expected:
  criteria: "Response should be helpful, accurate, and concise"
grader: llm-judge
grader_config:
  model: gpt-4o-mini

custom

Imports and calls your own grader function, which must have the signature (case: EvalCase, result: AgentResult) -> GradeResult.

grader: custom
grader_config:
  function: my_module:my_grader
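
As an example of that signature, here is a sketch of the validate_json grader referenced in the suite format. The three dataclasses are stand-ins so the snippet runs on its own; in a real project you would import them from agenteval, and the GradeResult fields used here (passed, score, reason) are assumptions rather than the documented API.

```python
import json
from dataclasses import dataclass, field


# Stand-ins for the agenteval types; field names are assumptions.
@dataclass
class EvalCase:
    name: str
    input: str
    expected: dict = field(default_factory=dict)


@dataclass
class AgentResult:
    output: str


@dataclass
class GradeResult:
    passed: bool
    score: float
    reason: str = ""


def validate_json(case: EvalCase, result: AgentResult) -> GradeResult:
    """Pass if the agent output parses as a JSON object."""
    try:
        parsed = json.loads(result.output)
    except json.JSONDecodeError as exc:
        return GradeResult(passed=False, score=0.0, reason=f"invalid JSON: {exc.msg}")
    if not isinstance(parsed, dict):
        return GradeResult(passed=False, score=0.0, reason="not a JSON object")
    return GradeResult(passed=True, score=1.0, reason="valid JSON object")


case = EvalCase(name="custom-validation", input="Generate a JSON object")
print(validate_json(case, AgentResult(output='{"color": "red"}')).passed)  # True
print(validate_json(case, AgentResult(output="not json")).passed)          # False
```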

Part of the AgentKit Ecosystem

AgentEval works great standalone, and even better with:

  • AgentLens — Session recording & observability for AI agents
  • Lore — Prompt management and versioning
  • AgentGate — API gateway for AI agents
  • FormBridge — Form automation with AI

Contributing

Contributions welcome! This project uses:

  • pytest for testing (127 tests passing)
  • ruff for linting
  • src layout (src/agenteval/)

git clone https://github.com/amitpaz1/agenteval.git
cd agenteval
pip install -e ".[dev]"
pytest

License

MIT — see LICENSE.
