AgentEval 🧪
Testing and evaluation framework for AI agents. Define test suites in YAML, grade agent outputs with 6 pluggable graders, track results over time, and detect regressions with statistical comparison.
Why AgentEval?
AI agents are hard to test. They're non-deterministic, they call tools, and their outputs vary between runs. Traditional unit tests don't cut it.
- 🎯 YAML-based test suites — Define inputs, expected outputs, and grading criteria declaratively
- 📊 Statistical regression detection — Welch's t-test across multiple runs, not just pass/fail
- 🔌 6 built-in graders — Exact match, contains, regex, tool-check, LLM-judge, and custom
- 🔗 AgentLens integration — Import real production sessions as test cases
- 💰 Cost & latency tracking — Know what each eval costs in tokens and dollars
- 🗄️ SQLite result storage — Every run is persisted for historical comparison
Quick Start
pip install agentevalkit
1. Define a test suite
# suite.yaml
name: my-agent-tests
agent: my_agent:run
cases:
  - name: basic-math
    input: "What is 2 + 2?"
    expected:
      output_contains: ["4"]
    grader: contains
  - name: tool-usage
    input: "Search for the weather in NYC"
    expected:
      tools_called: ["web_search"]
    grader: tool-check
  - name: format-check
    input: "List 3 colors"
    expected:
      pattern: "\\d\\.\\s+\\w+"
    grader: regex
2. Create your agent callable
# my_agent.py
from agenteval.models import AgentResult


def run(input_text: str) -> AgentResult:
    # Your agent logic here
    return AgentResult(
        output="The answer is 4.",
        tools_called=[{"name": "web_search", "args": {"query": "weather NYC"}}],
        tokens_in=12,
        tokens_out=8,
        cost_usd=0.0003,
    )
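The `agent: my_agent:run` entry in the suite names the callable in `module:function` form. As a rough illustration of how such a string can be turned into a function, here is a minimal sketch using `importlib`. The helper name `resolve_callable` is hypothetical and is not claimed to be AgentEval's own loader.

```python
import importlib
from typing import Callable


def resolve_callable(spec: str) -> Callable:
    """Resolve a 'module:function' string to the function object it names."""
    module_name, sep, attr = spec.partition(":")
    if not sep or not module_name or not attr:
        raise ValueError(f"expected 'module:function', got {spec!r}")
    module = importlib.import_module(module_name)
    fn = getattr(module, attr)
    if not callable(fn):
        raise TypeError(f"{spec!r} does not name a callable")
    return fn


# Demonstrated with a standard-library function so the sketch runs anywhere.
sqrt = resolve_callable("math:sqrt")
print(sqrt(9.0))
```

If the import fails in practice, the usual cause is that the module is not on `sys.path` from the directory where the command is run.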
3. Run the eval
$ agenteval run --suite suite.yaml --verbose
============================================================
Suite: my-agent-tests | Run: c1c6493118d5
============================================================
PASS basic-addition (score=1.00, 150ms)
PASS capital-city (score=1.00, 200ms)
PASS quantum-summary (score=1.00, 350ms)
PASS tool-usage (score=1.00, 280ms)
PASS list-format (score=1.00, 120ms)
Total: 5 Passed: 5 Failed: 0 Pass rate: 100%
Cost: $0.0023 Avg latency: 220ms
Features
🎯 6 Built-in Graders
| Grader | What it checks | Expected fields |
|---|---|---|
| `exact` | Exact string match | output |
| `contains` | Substring presence | output_contains: [list] |
| `regex` | Pattern matching | pattern |
| `tool-check` | Tools were called | tools_called: [list] |
| `llm-judge` | LLM evaluates quality | criteria (free-form) |
| `custom` | Your own function | grader_config: {function: "mod:fn"} |
📊 Statistical Comparison
Compare runs with Welch's t-test to detect statistically significant regressions:
$ agenteval compare c1c6493118d5,d17a2dce0222 4ee7e40601e3,ba5b0dde212b
============================================================================
Comparing: c1c6493118d5,d17a2dce0222 vs 4ee7e40601e3,ba5b0dde212b
Alpha: 0.05 Regression threshold: 0.0
============================================================================
Case Base Target Diff p-value Sig Status
----------------------------------------------------------------------------
basic-addition 1.000 1.000 +0.000 —
capital-city 1.000 0.500 -0.500 0.4533
quantum-summary 1.000 0.500 -0.500 0.4533
tool-usage 1.000 0.000 -1.000 0.0000 * ▼ regressed
list-format 1.000 0.500 -0.500 0.4533
Summary: 0 improved, 1 regressed, 4 unchanged
⚠ 1 regression(s) detected!
Run the same suite multiple times and compare the two groups of runs: agenteval compare RUN_A1,RUN_A2 vs RUN_B1,RUN_B2. The comparison uses scipy when it is available and falls back to a pure-Python implementation otherwise.
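To make the statistic concrete, here is a minimal pure-Python sketch of Welch's t statistic and the Welch-Satterthwaite degrees of freedom for two groups of per-case scores. It is an illustration of the textbook formula, not AgentEval's internal code. The function name `welch_t` and the handling of constant groups are assumptions, and the p-value step (which needs the t-distribution CDF) is left out.

```python
import math
from statistics import mean, variance


def welch_t(base, target):
    """Return (t, df) for Welch's unequal-variance t-test of target vs base."""
    if len(base) < 2 or len(target) < 2:
        raise ValueError("each group needs at least two scores")
    # Squared standard error contributed by each group.
    vb = variance(base) / len(base)
    vt = variance(target) / len(target)
    se2 = vb + vt
    diff = mean(target) - mean(base)
    if se2 == 0:
        # Both groups are constant, so the statistic is degenerate (assumed handling).
        t = 0.0 if diff == 0 else math.copysign(math.inf, diff)
        return t, None
    t = diff / math.sqrt(se2)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = se2 ** 2 / (vb ** 2 / (len(base) - 1) + vt ** 2 / (len(target) - 1))
    return t, df


base_scores = [1.0, 1.0, 0.5, 1.0]
target_scores = [0.5, 0.0, 0.5, 0.5]
t, df = welch_t(base_scores, target_scores)
print(f"t={t:.4f}, df={df:.1f}")
```

A negative `t` means the target group scored lower than the base group. With only two runs per group the test has very little power, which is one reason to run the suite several times before comparing.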
🔗 AgentLens Integration
Import real agent sessions from AgentLens as test suites:
# From AgentLens SQLite database
agenteval import --from agentlens --db sessions.db --output suite.yaml --grader contains
# From AgentLens server API
agenteval import-agentlens --url http://localhost:3000 --output suite.yaml --grader contains
# With filtering and interactive review
agenteval import --from agentlens --db sessions.db --output suite.yaml --filter-tag production --auto-assertions --interactive
Import modes:
- SQLite mode (`import --from agentlens --db path`): reads directly from an AgentLens database file
- Server mode (`import-agentlens --url URL`): fetches sessions via the AgentLens HTTP API
Sessions are converted to eval cases with input/output mapping and optional tool-call assertions. Use --auto-assertions to automatically generate expected fields from session data, and --interactive to review each case before saving.
Turn production traffic into regression tests — no manual test writing needed.
💰 Cost & Latency Tracking
Every eval tracks tokens and cost. Your agent callable returns AgentResult with tokens_in, tokens_out, and cost_usd, and AgentEval aggregates them per run.
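As an illustration of per-run aggregation, the sketch below sums the three fields named above over a list of results. `ResultStub` is a hypothetical stand-in that carries only those fields, and `aggregate` is an assumed helper name. Neither is part of the library.

```python
from dataclasses import dataclass


@dataclass
class ResultStub:
    """Hypothetical stand-in with the three tracked fields."""
    tokens_in: int
    tokens_out: int
    cost_usd: float


def aggregate(results):
    """Sum token counts and cost across all results of one run."""
    return {
        "tokens_in": sum(r.tokens_in for r in results),
        "tokens_out": sum(r.tokens_out for r in results),
        # Round to damp floating-point noise in the dollar total.
        "cost_usd": round(sum(r.cost_usd for r in results), 6),
    }


run_results = [ResultStub(12, 8, 0.0003), ResultStub(40, 25, 0.0011)]
totals = aggregate(run_results)
print(totals)
```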
YAML Suite Format
Full annotated example:
name: my-agent-tests                # Suite name (shown in reports)
agent: my_module:my_agent           # Default agent callable (module:function)
defaults:                           # Defaults applied to all cases
  grader: contains
  grader_config:
    ignore_case: true
cases:
  - name: basic-math                # Unique case name
    input: "What is 2 + 2?"         # Input passed to agent
    expected:                       # Grader-specific expected values
      output_contains: ["4"]
    grader: contains                # Override default grader
    tags: [math, basic]             # Tags for filtering (--tag math)
  - name: tool-usage
    input: "Search for weather"
    expected:
      tools_called: ["web_search"]
    grader: tool-check
  - name: quality-check
    input: "Explain gravity"
    expected:
      criteria: "Should mention Newton or Einstein, be scientifically accurate"
    grader: llm-judge
    grader_config:
      model: gpt-4o-mini            # LLM judge model
      api_base: https://api.openai.com/v1
  - name: custom-validation
    input: "Generate a JSON object"
    expected: {}
    grader: custom
    grader_config:
      function: my_graders:validate_json  # Your grader function
CLI Reference
agenteval run
agenteval run --suite suite.yaml [--agent module:fn] [--verbose] [--tag math] [--timeout 30] [--db agenteval.db]
- `--suite`: Path to YAML suite file (required)
- `--agent`: Override the agent callable from the suite
- `--verbose` / `-v`: Show per-case pass/fail details
- `--tag`: Filter cases by tag (repeatable)
- `--timeout`: Per-case timeout in seconds (default: 30)
- `--db`: SQLite database path (default: `agenteval.db`)
Exit code is 1 if any case fails.
agenteval list
agenteval list [--suite-filter name] [--limit 20] [--db agenteval.db]
$ agenteval list --limit 5
ID Suite Passed Failed Rate Created
--------------------------------------------------------------------------------
aeccd5e53f03 math-agent-demo 2 3 40% 2026-02-12T21:12:12
4f3e380f622c math-agent-demo 3 2 60% 2026-02-12T21:12:12
bd4ef3a0727b math-agent-demo 1 4 20% 2026-02-12T21:12:12
e2ca43e99852 math-agent-demo 3 2 60% 2026-02-12T21:12:11
32ed650cab6d math-agent-demo 2 3 40% 2026-02-12T21:12:11
agenteval compare
agenteval compare RUN_A RUN_B [--alpha 0.05] [--threshold 0.0] [--stats/--no-stats]
agenteval compare RUN_A1,RUN_A2 vs RUN_B1,RUN_B2 # Multi-run comparison
agenteval import
agenteval import --from agentlens --db sessions.db --output suite.yaml [--grader contains] [--limit 100]
Grader Reference
exact
Compares result.output exactly with expected.output. Config: ignore_case: bool.
expected:
  output: "The answer is 42."
grader: exact
grader_config:
  ignore_case: true
contains
Checks that all substrings in expected.output_contains appear in the output.
expected:
  output_contains: ["Paris", "France"]
grader: contains
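The check described above amounts to an "all substrings present" test. Here is a minimal sketch of that logic. The function name `contains_all` is hypothetical, and the `ignore_case` parameter is an assumption based on the suite-level defaults shown in the YAML Suite Format section.

```python
def contains_all(output: str, substrings: list[str], ignore_case: bool = False) -> bool:
    """Return True only if every expected substring occurs in the output."""
    if ignore_case:
        # Lower-case both sides so the comparison ignores letter case.
        output = output.lower()
        substrings = [s.lower() for s in substrings]
    return all(s in output for s in substrings)


print(contains_all("Paris is the capital of France.", ["Paris", "France"]))
```

Note that a single missing substring fails the whole case under this reading, so broad lists make for strict tests.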
regex
Matches result.output against expected.pattern (Python regex). Config: flags: [IGNORECASE, DOTALL, MULTILINE].
expected:
  pattern: "\\d+\\.\\d+"
grader: regex
grader_config:
  flags: [IGNORECASE]
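The flag names in `grader_config` correspond to constants in Python's `re` module. The sketch below shows one way such names could be combined and applied. The helper `regex_matches` is hypothetical, and the use of `re.search` (match anywhere in the output) rather than `re.match` is an assumption about the grader's behavior.

```python
import re

# The three flag names listed above, mapped to their re-module constants.
ALLOWED_FLAGS = {
    "IGNORECASE": re.IGNORECASE,
    "DOTALL": re.DOTALL,
    "MULTILINE": re.MULTILINE,
}


def regex_matches(output: str, pattern: str, flags=()) -> bool:
    """Return True if the pattern is found anywhere in the output."""
    combined = 0
    for name in flags:
        combined |= ALLOWED_FLAGS[name]  # KeyError on an unknown flag name
    return re.search(pattern, output, combined) is not None


# In double-quoted YAML, "\\d+\\.\\d+" is the regex \d+\.\d+
print(regex_matches("Pi is roughly 3.14", r"\d+\.\d+"))
```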
tool-check
Verifies expected tools were called. Config: ordered: bool for sequence matching.
expected:
  tools_called: ["web_search", "calculator"]
grader: tool-check
grader_config:
  ordered: true
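The difference between the unordered and ordered modes can be sketched as follows. This assumes that `ordered: true` means the expected tools must appear in the given order, with other calls allowed in between (a subsequence check). The source does not spell out the exact semantics, and `tools_match` is a hypothetical name.

```python
def tools_match(called, expected, ordered=False):
    """Check expected tool names against the agent's recorded tool calls."""
    # Each call is a dict with a "name" key, as in the AgentResult example.
    names = [call["name"] for call in called]
    if not ordered:
        return all(tool in names for tool in expected)
    # Subsequence check: each `in` consumes the iterator up to the match,
    # so later expected tools must appear after earlier ones.
    remaining = iter(names)
    return all(tool in remaining for tool in expected)


calls = [
    {"name": "web_search", "args": {"query": "weather NYC"}},
    {"name": "calculator", "args": {"expr": "2 + 2"}},
]
print(tools_match(calls, ["web_search", "calculator"], ordered=True))
```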
llm-judge
Sends the input, output, and criteria to an LLM for evaluation. Requires OPENAI_API_KEY or compatible API.
expected:
  criteria: "Response should be helpful, accurate, and concise"
grader: llm-judge
grader_config:
  model: gpt-4o-mini
custom
Imports and calls your own grader function, which must have the signature (case: EvalCase, result: AgentResult) -> GradeResult.
grader: custom
grader_config:
  function: my_module:my_grader
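As a rough sketch of what a custom grader with this signature might look like, the example below validates that the agent output is a JSON object. The three `*Stub` dataclasses are hypothetical stand-ins so the snippet runs on its own. In a real grader you would use the library's `EvalCase`, `AgentResult`, and `GradeResult`, whose fields may differ from the ones assumed here (`passed`, `score`, `reason`).

```python
import json
from dataclasses import dataclass, field


@dataclass
class CaseStub:
    """Hypothetical stand-in for EvalCase."""
    name: str
    input: str
    expected: dict = field(default_factory=dict)


@dataclass
class ResultStub:
    """Hypothetical stand-in for AgentResult (output only)."""
    output: str


@dataclass
class GradeStub:
    """Hypothetical stand-in for GradeResult; field names are assumptions."""
    passed: bool
    score: float
    reason: str = ""


def validate_json(case: CaseStub, result: ResultStub) -> GradeStub:
    """Pass only if the agent output parses as a JSON object."""
    try:
        parsed = json.loads(result.output)
    except json.JSONDecodeError as exc:
        return GradeStub(False, 0.0, f"invalid JSON: {exc.msg}")
    if not isinstance(parsed, dict):
        return GradeStub(False, 0.0, "valid JSON, but not an object")
    return GradeStub(True, 1.0, "valid JSON object")


case = CaseStub(name="custom-validation", input="Generate a JSON object")
print(validate_json(case, ResultStub('{"color": "red"}')))
```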
Adapters
Adapters let you test agents built with popular frameworks without writing a custom callable.
pip install agentevalkit[langchain] # LangChain
pip install agentevalkit[crewai] # CrewAI
pip install agentevalkit[autogen] # AutoGen
| Adapter | Framework Method | Install Extra |
|---|---|---|
| `langchain` | `agent.invoke(input)` | `[langchain]` |
| `crewai` | `crew.kickoff(inputs={"input": ...})` | `[crewai]` |
| `autogen` | `agent.run(input)` or `agent.initiate_chat(message=...)` | `[autogen]` |
Usage with YAML suite defaults:
# suite.yaml
name: my-tests
agent: my_module:my_chain
defaults:
  adapter: langchain
Or via CLI:
agenteval run --suite suite.yaml --adapter langchain
Each adapter extracts output, tool calls, and token usage from the framework's response format into a standard AgentResult.
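Conceptually, an adapter wraps a framework object in a plain callable that returns a standard result. The sketch below illustrates the idea for an object exposing `invoke(input)`. `FakeChain`, `ResultStub`, `adapt_invoke`, and the `{"output": ...}` response shape are all assumptions made for illustration. The real adapters and framework responses may look different.

```python
from dataclasses import dataclass, field


@dataclass
class ResultStub:
    """Hypothetical stand-in for the standard result object."""
    output: str
    tools_called: list = field(default_factory=list)


class FakeChain:
    """Stand-in for a framework object that exposes invoke(input)."""

    def invoke(self, input):
        return {"output": f"echo: {input}"}


def adapt_invoke(agent):
    """Wrap an invoke-style object in a callable that takes input text."""

    def run(input_text: str) -> ResultStub:
        response = agent.invoke(input_text)
        # Assumed response shape: a dict with an "output" key, else any object.
        if isinstance(response, dict):
            text = str(response.get("output", ""))
        else:
            text = str(response)
        return ResultStub(output=text)

    return run


agent_callable = adapt_invoke(FakeChain())
print(agent_callable("hello").output)
```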
Distributed Execution
Scale eval suites across multiple workers using Redis as a broker.
Setup
pip install agentevalkit[distributed]
Start Workers
# Terminal 1: Start a worker
agenteval worker --broker redis://localhost:6379 --agent my_module:my_agent
# Terminal 2: Start another worker
agenteval worker --broker redis://localhost:6379 --agent my_module:my_agent
Run with Workers
agenteval run --suite suite.yaml --workers redis://localhost:6379 --worker-timeout 60
How It Works
- The coordinator pushes eval cases to a Redis queue
- Workers pop cases, execute the agent, and push results back
- The coordinator collects results and builds the final `EvalRun`
- If no workers are detected, execution falls back to local mode automatically
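The push/pop/collect flow above can be sketched in-process, with `queue.Queue` standing in for the Redis queues and threads standing in for worker processes. This is a simplified illustration of the pattern only: it has no heartbeats, timeouts, or local fallback, and the names `worker` and `run_distributed` are hypothetical.

```python
import queue
import threading


def worker(tasks, results, agent):
    """Pop cases, run the agent, and push (name, output) pairs back."""
    while True:
        case = tasks.get()
        if case is None:  # sentinel: no more work for this worker
            break
        results.put((case["name"], agent(case["input"])))


def run_distributed(cases, agent, n_workers=2):
    """Coordinator: enqueue cases, wait for workers, collect the results."""
    tasks, results = queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(tasks, results, agent))
        for _ in range(n_workers)
    ]
    for thread in threads:
        thread.start()
    for case in cases:
        tasks.put(case)
    for _ in threads:
        tasks.put(None)  # one sentinel per worker
    for thread in threads:
        thread.join()
    collected = {}
    while not results.empty():
        name, output = results.get()
        collected[name] = output
    return collected


cases = [{"name": "a", "input": "x"}, {"name": "b", "input": "y"}]
print(run_distributed(cases, str.upper))
```

Results can arrive in any order, which is why they are keyed by case name rather than by position.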
Configuration
- `--workers URL`: Redis broker URL (supports `redis://` and `rediss://` for TLS)
- `--worker-timeout N`: Seconds to wait for worker results (default: 30)
- Workers register heartbeats and are automatically detected by the coordinator

Security: Use `rediss://` URLs with authentication for production deployments. See docs/troubleshooting.md for Redis security guidance.
Troubleshooting
See docs/troubleshooting.md for solutions to common issues including:
- Agent callable import errors (`module:function` format)
- Missing dependency extras (`[distributed]`, `[langchain]`, etc.)
- OpenAI API key setup for the `llm-judge` grader
- Compare command syntax
- Redis connection issues for distributed execution
Contributing
Contributions welcome! This project uses:
- pytest for testing
- ruff for linting
- src layout (`src/agenteval/`)
git clone https://github.com/amitpaz1/agenteval.git
cd agenteval
pip install -e ".[dev]"
pytest
🧰 AgentKit Ecosystem
| Project | Description | |
|---|---|---|
| AgentLens | Observability & audit trail for AI agents | |
| Lore | Cross-agent memory and lesson sharing | |
| AgentGate | Human-in-the-loop approval gateway | |
| FormBridge | Agent-human mixed-mode forms | |
| AgentEval | Testing & evaluation framework | ⬅️ you are here |
| agentkit-mesh | Agent discovery & delegation | |
| agentkit-cli | Unified CLI orchestrator | |
| agentkit-guardrails | Reactive policy guardrails | |
License
MIT — see LICENSE.