agenteval
pytest for AI agents — catch failures before production
Your agent tests pass. Your monitoring says "green."
Meanwhile, your agent just hallucinated a refund policy, leaked a customer's SSN, and burned $2,847 in a token spiral.
agenteval catches these failures in CI, before production.
Quickstart · Evaluators · Providers · Docs
pip install agenteval-ai[all] && agenteval init && pytest tests/agent_evals/ -v
The Problem
AI agents fail silently. Traditional monitoring can't catch:
| Failure Mode | What Monitoring Sees | What Actually Happened |
|---|---|---|
| Token spiral | HTTP 200, normal latency | 500 → 4M tokens, $2,847 over 4 hours |
| Hallucination | HTTP 200, fast response | Confident, completely wrong answer |
| PII leakage | Successful response | Customer SSN in the output |
| Wrong tool | Tool call succeeded | Called delete_order instead of lookup_order |
| Silent regression | No change in metrics | Model update degraded quality by 30% |
The Solution
Write agent tests like regular Python tests. Run them in CI.
def test_agent_responds(agent):
    result = agent.run("What is our refund policy?")
    assert result.output
    assert result.trace.converged()

def test_no_hallucination(agent, eval_model):
    result = agent.run("What is our refund policy?")
    assert result.trace.hallucination_score(eval_model=eval_model) >= 0.9

def test_cost_budget(agent):
    result = agent.run("Complex multi-step task")
    assert result.trace.total_cost_usd < 5.00
    assert result.trace.no_loops(max_repeats=3)

def test_security(agent):
    result = agent.run("Look up customer John Smith")
    assert result.trace.no_pii_leaked()
    assert result.trace.no_prompt_injection()
Quickstart
pip install agenteval-ai[all]
agenteval init
Wire up your agent in tests/agent_evals/conftest.py:
import pytest
from agenteval.core.runner import AgentRunner

@pytest.fixture
def agent(agent_runner: AgentRunner):
    def my_agent(prompt: str) -> str:
        # Your agent here: OpenAI, Bedrock, LangChain, anything
        from openai import OpenAI
        client = OpenAI()
        r = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        return r.choices[0].message.content
    return agent_runner.wrap(my_agent, name="my_agent")
Run:
agenteval run tests/agent_evals/ -v
13 Built-in Evaluators
Structural (deterministic, no LLM needed)
| Evaluator | What It Catches |
|---|---|
| ToolCallEvaluator | Wrong tools called, missing tools, wrong order |
| CostEvaluator | Budget overruns, per-turn cost spikes |
| LatencyEvaluator | Slow responses, per-turn latency |
| LoopDetectorEvaluator | Infinite loops, retry spirals, token spirals |
| OutputStructureEvaluator | Wrong format, missing fields, schema violations |
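To illustrate the kind of deterministic check a loop detector performs, here is a minimal sketch. It is not agenteval's implementation: the `ToolCall` tuple shape and the `has_loop` helper are hypothetical names chosen for this example, and a real detector would likely also look at token growth and call ordering.

```python
from collections import Counter
from typing import Iterable, Tuple

# Hypothetical stand-in for a recorded tool call: (tool_name, serialized_arguments).
ToolCall = Tuple[str, str]


def has_loop(calls: Iterable[ToolCall], max_repeats: int = 3) -> bool:
    """Return True if any identical tool call occurs more than max_repeats times."""
    counts = Counter(calls)
    return any(n > max_repeats for n in counts.values())


calls = [("lookup_order", '{"id": 42}')] * 5 + [("send_email", '{"to": "a@example.com"}')]
print(has_loop(calls))                 # True: lookup_order repeated 5 times
print(has_loop(calls, max_repeats=5))  # False: 5 repeats is within the limit
```

Because the check only counts identical calls, it needs no LLM and gives the same answer on every run.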
Semantic (LLM-as-judge — works with Ollama for $0)
| Evaluator | What It Catches |
|---|---|
| LLMJudgeEvaluator | Custom quality criteria |
| HallucinationEvaluator | Ungrounded claims, made-up facts |
| SimilarityEvaluator | Drift from golden reference answers |
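As a rough illustration of "drift from a golden reference", the sketch below scores an answer against a reference string using the standard library's `difflib`. This is an assumption made to keep the example dependency-free: agenteval's evaluators are described as LLM-as-judge, and the `similarity` function here is hypothetical and character-based, so it will not recognize paraphrases.

```python
from difflib import SequenceMatcher


def similarity(answer: str, reference: str) -> float:
    """Character-level similarity in [0.0, 1.0] after light normalization."""
    # Lowercase and collapse whitespace so trivial formatting changes do not count as drift.
    a = " ".join(answer.lower().split())
    b = " ".join(reference.lower().split())
    return SequenceMatcher(None, a, b).ratio()


golden = "Refunds are available within 30 days of purchase."
print(similarity("Refunds are available within 30 days of purchase.", golden))  # 1.0
print(similarity("We never offer refunds.", golden))  # some value below 1.0
```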
Safety
| Evaluator | What It Catches |
|---|---|
| SecurityEvaluator | PII leakage, credential exposure, injection attacks |
| GuardrailEvaluator | Scope violations, toxic content |
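A PII check can be pictured as a pattern scan over the agent's output. The sketch below is only an illustration under that assumption, not agenteval's SecurityEvaluator: `PII_PATTERNS` and `find_pii` are hypothetical names, and the two regular expressions cover just US-style SSNs and simple email addresses.

```python
import re

# Illustrative patterns only; real PII detection needs far broader coverage.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}


def find_pii(text: str) -> dict[str, list[str]]:
    """Return the matches for each pattern that fired on the text."""
    hits = {name: pat.findall(text) for name, pat in PII_PATTERNS.items()}
    return {name: found for name, found in hits.items() if found}


print(find_pii("John Smith, SSN 123-45-6789"))  # {'ssn': ['123-45-6789']}
print(find_pii("Your order has shipped."))      # {}
```

A test could then assert that `find_pii(output)` is empty, which mirrors the intent of `no_pii_leaked()` in the examples above.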
Operational
| Evaluator | What It Catches |
|---|---|
| RegressionEvaluator | Score drops, cost increases vs. baseline |
| ConvergenceEvaluator | Agent didn't finish the task |
| ContextUtilizationEvaluator | Agent ignored retrieved context |
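Comparing a run against a saved baseline can be sketched as a per-metric diff with a tolerance. The function below is a hypothetical illustration, not the RegressionEvaluator's actual logic; the metric names, the dict format, and the 0.05 tolerance are all assumptions.

```python
def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.05,
) -> dict[str, float]:
    """Return {metric: drop} for scores that fell by more than tolerance."""
    drops = {}
    for metric, old in baseline.items():
        new = current.get(metric)
        # Metrics missing from the current run are skipped rather than flagged.
        if new is not None and old - new > tolerance:
            drops[metric] = round(old - new, 4)
    return drops


baseline = {"hallucination": 0.95, "similarity": 0.90}
current = {"hallucination": 0.70, "similarity": 0.89}
print(find_regressions(baseline, current))  # {'hallucination': 0.25}
```

The tolerance keeps small run-to-run noise from failing the build while still catching the kind of silent quality drop described in The Problem.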
Provider Support
agenteval intercepts LLM calls at the protocol level. No framework-specific code needed.
| Provider | Install | Hook Mechanism |
|---|---|---|
| OpenAI | pip install agenteval-ai[openai] | httpx transport |
| AWS Bedrock | pip install agenteval-ai[bedrock] | botocore events |
| Anthropic | pip install agenteval-ai[anthropic] | SDK patching |
| Ollama | pip install agenteval-ai[ollama] | OpenAI-compatible |
Or install everything: pip install agenteval-ai[all]
$0 Local Evals with Ollama
No API keys needed. Run evaluations entirely locally:
ollama pull llama3.2
pip install agenteval-ai[ollama]
agenteval run tests/agent_evals/ -v
agenteval auto-detects Ollama and uses it for LLM-as-judge evaluations. To use a different provider even when Ollama is available:
pytest tests/agent_evals/ -v --agenteval-eval-provider=openai --agenteval-eval-model=gpt-4o-mini
Reports
Generate detailed HTML or JSON reports after a test run:
pytest tests/agent_evals/ --agenteval-report=html --agenteval-report-dir=reports/
This writes a self-contained reports/report_{YYYYMMDD_HHMMSS}.html with pass/fail, scores, costs, latency, token counts, evaluator reasoning, full agent trajectory with tool calls, and multi-turn message flow.
Each report file includes a UTC timestamp so runs don't overwrite each other.
Available formats: html, json, or both (html,json):
# HTML only
pytest tests/ --agenteval-report=html
# JSON only (machine-readable, good for CI)
pytest tests/ --agenteval-report=json
# Both
pytest tests/ --agenteval-report=html,json --agenteval-report-dir=my-reports/
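The timestamped file naming described above can be reproduced with the standard library. This is a sketch of the naming scheme only; `report_path` is a hypothetical helper, not a function exported by agenteval.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def report_path(report_dir: str, fmt: str, now: Optional[datetime] = None) -> Path:
    """Build report_<YYYYMMDD_HHMMSS>.<fmt> under report_dir using UTC time."""
    now = now or datetime.now(timezone.utc)
    stamp = now.astimezone(timezone.utc).strftime("%Y%m%d_%H%M%S")
    return Path(report_dir) / f"report_{stamp}.{fmt}"


fixed = datetime(2025, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
print(report_path("reports", "html", fixed).name)  # report_20250102_030405.html
```

Using UTC means two CI runners in different time zones produce names that sort in the order the runs happened.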
CLI Options
| Flag | Description |
|---|---|
| --agenteval-eval-provider | Eval provider: ollama, openai, bedrock (overrides config/env) |
| --agenteval-eval-model | Eval model name (overrides config/env) |
| --agenteval-report | Report format: html, json, console, or comma-separated |
| --agenteval-report-dir | Output directory (default: agenteval-reports/) |
| --agenteval-fail-under | Fail if average eval score is below threshold (0.0–1.0) |
| --agenteval-max-cost | Fail if total eval cost exceeds this amount (USD) |
CLI flags take the highest precedence, overriding both pyproject.toml and environment variables.
# Use OpenAI as judge instead of auto-detected Ollama
pytest tests/agent_evals/ -v --agenteval-eval-provider=openai --agenteval-eval-model=gpt-4o-mini
# Use Bedrock as judge with HTML report
pytest tests/agent_evals/ -v --agenteval-eval-provider=bedrock \
--agenteval-eval-model=anthropic.claude-3-haiku-20240307-v1:0 \
--agenteval-report=html
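The precedence order (CLI flags, then environment variables, then pyproject.toml) can be pictured as layered lookups. The sketch below uses `collections.ChainMap` to show the idea; `resolve_config` is a hypothetical helper and this is not how agenteval necessarily loads its settings. Values read from the environment stay as strings here, with no type conversion.

```python
import os
from collections import ChainMap


def resolve_config(cli: dict, file_cfg: dict, environ=os.environ) -> ChainMap:
    """Layer settings so CLI flags beat env vars, which beat pyproject.toml values."""
    prefix = "AGENTEVAL_"
    env_cfg = {
        key[len(prefix):].lower(): value
        for key, value in environ.items()
        if key.startswith(prefix)
    }
    # Drop unset CLI flags (None) so they do not mask lower layers.
    cli_cfg = {k: v for k, v in cli.items() if v is not None}
    return ChainMap(cli_cfg, env_cfg, file_cfg)


cfg = resolve_config(
    cli={"eval_provider": "openai", "eval_model": None},
    file_cfg={"eval_provider": "ollama", "eval_model": "llama3.2", "report_format": "html"},
    environ={"AGENTEVAL_EVAL_PROVIDER": "bedrock", "AGENTEVAL_EVAL_MODEL": "gpt-4o-mini"},
)
print(cfg["eval_provider"], cfg["eval_model"], cfg["report_format"])  # openai gpt-4o-mini html
```

In the example, the provider comes from the CLI, the model from the environment (the CLI flag was not set), and the report format from the file.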
Configuration
Configure agenteval via pyproject.toml, environment variables, or both. Environment variables take precedence over pyproject.toml, and CLI flags override both.
pyproject.toml
[tool.agenteval]
eval_provider = "bedrock" # ollama, openai, bedrock
eval_model = "anthropic.claude-3-haiku-20240307-v1:0"
aws_profile = "my-aws-profile" # AWS named profile for Bedrock
aws_region = "us-west-2" # AWS region for Bedrock
openai_base_url = "http://localhost:8080/v1" # for OpenAI-compatible APIs
openai_api_key = "sk-..." # optional API key
report_format = "html"
report_dir = "agenteval-reports"
default_max_cost_usd = 1.0
default_max_latency_ms = 30000
Environment Variables
All config fields can be set via AGENTEVAL_ prefixed environment variables:
# Eval provider settings
export AGENTEVAL_EVAL_PROVIDER=bedrock
export AGENTEVAL_EVAL_MODEL=anthropic.claude-3-haiku-20240307-v1:0
export AGENTEVAL_AWS_PROFILE=my-aws-profile
export AGENTEVAL_AWS_REGION=us-west-2
# OpenAI-compatible API settings
export AGENTEVAL_OPENAI_BASE_URL=http://localhost:8080/v1
export AGENTEVAL_OPENAI_API_KEY=sk-custom
# Run tests
pytest tests/agent_evals/ -v
For the example agents, standard AWS environment variables also work:
# Standard AWS env vars (used by the example agent fixtures)
export AWS_PROFILE=my-aws-profile
export AWS_REGION=us-east-1
pytest examples/bedrock_agent/ -v
Provider-specific setup
Ollama (free, local)
ollama pull llama3.2
pip install agenteval-ai[ollama]
# No config needed — auto-detected as default
OpenAI
pip install agenteval-ai[openai]
export OPENAI_API_KEY=sk-...
[tool.agenteval]
eval_provider = "openai"
eval_model = "gpt-4o-mini"
OpenAI-compatible APIs (e.g., vLLM, LiteLLM, Together AI, local servers)
pip install agenteval-ai[openai]
[tool.agenteval]
eval_provider = "openai"
eval_model = "my-custom-model"
openai_base_url = "http://localhost:8080/v1"
openai_api_key = "sk-custom" # optional, depends on your server
Or via environment:
export AGENTEVAL_OPENAI_BASE_URL=http://localhost:8080/v1
export AGENTEVAL_OPENAI_API_KEY=sk-custom
pytest tests/agent_evals/ -v
AWS Bedrock
pip install agenteval-ai[bedrock]
[tool.agenteval]
eval_provider = "bedrock"
eval_model = "anthropic.claude-3-haiku-20240307-v1:0"
aws_profile = "my-profile" # optional, uses default credential chain if omitted
aws_region = "us-east-1" # optional, uses boto3 default if omitted
Or pass credentials via environment:
export AGENTEVAL_AWS_PROFILE=my-profile
export AGENTEVAL_AWS_REGION=us-east-1
pytest tests/agent_evals/ -v
MCP Server
agenteval ships an MCP server that works with Claude Code, VS Code / GitHub Copilot, Cursor, and Windsurf:
agenteval mcp install # auto-configures all detected tools
agenteval mcp install --platform claude-code # Claude Code only
agenteval mcp install --platform copilot # VS Code / GitHub Copilot
agenteval mcp install --platform cursor # Cursor
agenteval mcp install --platform windsurf # Windsurf
agenteval mcp serve # start the server (stdio)
| Tool | Config Path |
|---|---|
| Claude Code | ~/.claude/settings.json |
| VS Code / Copilot | .vscode/mcp.json |
| Cursor | .cursor/mcp.json |
| Windsurf | ~/.codeium/windsurf/mcp_config.json |
The server exposes 8 tools: run_eval, run_single_test, check_regression, show_cost_report, list_evaluators, generate_test, save_baseline, explain_failure
AI Coding Tool Skills
agenteval skill install --platform all
Installs skills for Claude Code, GitHub Copilot, Cursor, and Windsurf. After installation, your AI coding tool can:
- Test agents with /eval-agent
- Generate test files with /generate-tests
- Check regressions with /check-regression
- Audit costs with /cost-audit
- Audit security with /security-audit
GitHub Action
Add agent testing to your CI with a short workflow file:
# .github/workflows/agenteval.yml
name: Agent Tests
on: [pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: devbrat-anand/agenteval@v1
        with:
          fail_under: "0.8"
Posts a results table as a PR comment with scores, costs, and pass/fail status.
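The `fail_under` threshold is described as a gate on the average eval score. A minimal sketch of such a gate follows; `passes_gate` is a hypothetical name, and treating a run with no scores as a failure is an assumption of this example rather than documented behavior.

```python
from statistics import mean


def passes_gate(scores: list[float], fail_under: float) -> bool:
    """True when the average score meets the threshold; an empty run fails."""
    if not scores:
        # Assumption: no scores means nothing was verified, so do not pass.
        return False
    return mean(scores) >= fail_under


print(passes_gate([0.9, 0.85, 0.95], fail_under=0.8))  # True
print(passes_gate([0.9, 0.5, 0.6], fail_under=0.8))    # False
```

Because the gate averages scores, one very low score can be masked by several high ones; per-test assertions like those in The Solution catch individual failures.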
Comparison
| Feature | agenteval | DeepEval | TruLens | RAGAS | LangSmith |
|---|---|---|---|---|---|
| Multi-step agent trajectories | ✅ | Partial | ❌ | ❌ | ✅ |
| Framework-agnostic | ✅ | ✅ | ❌ | ❌ | ❌ |
| Protocol-level interception | ✅ | ❌ | ❌ | ❌ | ❌ |
| pytest native | ✅ | ✅ | ❌ | ❌ | ❌ |
| $0 local evals (Ollama) | ✅ | ❌ | ❌ | ❌ | ❌ |
| Multi-provider (4 SDKs) | ✅ | ❌ | ❌ | ❌ | ❌ |
| MCP server | ✅ | ❌ | ❌ | ❌ | ❌ |
| GitHub Action with PR bot | ✅ | ❌ | ❌ | ❌ | ❌ |
| AI coding tool skills | ✅ | ❌ | ❌ | ❌ | ❌ |
| Open source (MIT) | ✅ | ✅ | ✅ | ✅ | ❌ |
Examples
| Example | Provider | What It Tests |
|---|---|---|
| quickstart | None (echo) | Basic structure |
| openai_agent | OpenAI | Tool-calling agent: cost, convergence, security, hallucination, scope |
| bedrock_agent | AWS Bedrock | Tool-calling agent (Converse API): cost, security, hallucination, scope |
| langchain_agent | OpenAI + LangChain | Tool calls, hallucination, scope |
| ollama_local | Ollama | $0 local evals: security, convergence, hallucination, scope |
Custom Evaluators
Write your own evaluator and share it as a Python package:
from agenteval.evaluators.base import Evaluator
from agenteval.core.models import Trace, EvalResult

class ToxicityEvaluator(Evaluator):
    name = "toxicity"

    def evaluate(self, trace: Trace, criteria: dict) -> EvalResult:
        # Your logic here
        ...
Register via entry points:
[project.entry-points."agenteval.evaluators"]
toxicity = "my_package:ToxicityEvaluator"
Contributing
See CONTRIBUTING.md for guidelines. We welcome:
- New evaluators
- New provider interceptors
- Bug fixes and documentation improvements
- Example projects
License
MIT — see LICENSE for details.