pytest for AI agents — catch failures before production

Maintainers

devbrat

agenteval

Your agent tests pass. Your monitoring says "green."
Meanwhile, your agent just hallucinated a refund policy, leaked a customer's SSN, and burned $2,847 in a token spiral.

agenteval catches these failures in CI, before production.

pip install agenteval-ai[all] && agenteval init && pytest tests/agent_evals/ -v

The Problem

AI agents fail silently. Traditional monitoring can't catch:

Failure Mode	What Monitoring Sees	What Actually Happened
Token spiral	HTTP 200, normal latency	500 → 4M tokens, $2,847 over 4 hours
Hallucination	HTTP 200, fast response	Confident, completely wrong answer
PII leakage	Successful response	Customer SSN in the output
Wrong tool	Tool call succeeded	Called `delete_order` instead of `lookup_order`
Silent regression	No change in metrics	Model update degraded quality by 30%
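
The "wrong tool" row in the table above is the easiest of these to check deterministically once the sequence of tool calls has been recorded. A minimal sketch, assuming the trace can be reduced to a plain list of tool names; `check_tool_calls` and its parameters are hypothetical and not part of agenteval's API:

```python
from typing import Iterable, List


def check_tool_calls(
    called: Iterable[str],
    required: Iterable[str] = (),
    forbidden: Iterable[str] = (),
) -> List[str]:
    """Return a list of human-readable problems; empty means the calls look fine."""
    called_list = list(called)
    problems = []
    for name in required:
        if name not in called_list:
            problems.append(f"missing required tool: {name}")
    for name in forbidden:
        if name in called_list:
            problems.append(f"forbidden tool was called: {name}")
    return problems


# The agent was supposed to look an order up, but deleted it instead.
print(check_tool_calls(
    ["delete_order"],
    required=["lookup_order"],
    forbidden=["delete_order"],
))
```

Monitoring sees only that the tool call succeeded; a check like this compares what was called with what should have been called.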

The Solution

Write agent tests like regular Python tests. Run them in CI.

def test_agent_responds(agent):
    result = agent.run("What is our refund policy?")
    assert result.output
    assert result.trace.converged()

def test_no_hallucination(agent, eval_model):
    result = agent.run("What is our refund policy?")
    assert result.trace.hallucination_score(eval_model=eval_model) >= 0.9

def test_cost_budget(agent):
    result = agent.run("Complex multi-step task")
    assert result.trace.total_cost_usd < 5.00
    assert result.trace.no_loops(max_repeats=3)

def test_security(agent):
    result = agent.run("Look up customer John Smith")
    assert result.trace.no_pii_leaked()
    assert result.trace.no_prompt_injection()
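
A cost assertion such as `total_cost_usd < 5.00` ultimately rests on token counts multiplied by per-token prices. The sketch below shows that arithmetic in isolation; the `Call` dataclass and the prices are placeholders chosen for illustration, not agenteval internals or real provider pricing:

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Call:
    """One LLM call, reduced to its token usage (hypothetical shape)."""
    input_tokens: int
    output_tokens: int


def total_cost_usd(
    calls: Iterable[Call],
    usd_per_1k_input: float,
    usd_per_1k_output: float,
) -> float:
    """Sum the cost of all calls given per-1K-token prices."""
    return sum(
        c.input_tokens / 1000 * usd_per_1k_input
        + c.output_tokens / 1000 * usd_per_1k_output
        for c in calls
    )


calls = [Call(1000, 500), Call(2000, 1000)]
# Placeholder prices: $0.50 per 1K input tokens, $1.00 per 1K output tokens.
cost = total_cost_usd(calls, usd_per_1k_input=0.5, usd_per_1k_output=1.0)
assert cost < 5.00
```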

Quickstart

pip install agenteval-ai[all]
agenteval init

Wire up your agent in tests/agent_evals/conftest.py:

import pytest
from agenteval.core.runner import AgentRunner

@pytest.fixture
def agent(agent_runner: AgentRunner):
    def my_agent(prompt: str) -> str:
        # Your agent here — OpenAI, Bedrock, LangChain, anything
        from openai import OpenAI
        client = OpenAI()
        r = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        return r.choices[0].message.content

    return agent_runner.wrap(my_agent, name="my_agent")
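
For a first run without network access or API keys, any callable with the same `(prompt: str) -> str` shape can stand in for the OpenAI call. A minimal sketch: `echo_agent` is a hypothetical name, and passing it to `agent_runner.wrap` in place of `my_agent` is an assumption based on the fixture above:

```python
def echo_agent(prompt: str) -> str:
    """Deterministic stand-in agent: no network, no API key, same signature."""
    return f"You asked: {prompt}"


# In conftest.py this would take the place of my_agent, e.g.
# agent_runner.wrap(echo_agent, name="echo_agent")
print(echo_agent("What is our refund policy?"))
```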

Run:

agenteval run tests/agent_evals/ -v

13 Built-in Evaluators

Structural (deterministic, no LLM needed)

Evaluator	What It Catches
`ToolCallEvaluator`	Wrong tools called, missing tools, wrong order
`CostEvaluator`	Budget overruns, per-turn cost spikes
`LatencyEvaluator`	Slow responses, per-turn latency
`LoopDetectorEvaluator`	Infinite loops, retry spirals, token spirals
`OutputStructureEvaluator`	Wrong format, missing fields, schema violations
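
To make "retry spirals" concrete, here is one way a deterministic loop check could work: find the longest run of identical consecutive steps and compare it with a limit. This is an illustrative sketch, not agenteval's implementation; `longest_repeat` and `no_loops` are hypothetical helpers, and the meaning given to `max_repeats` is assumed:

```python
from itertools import groupby
from typing import Hashable, Sequence


def longest_repeat(steps: Sequence[Hashable]) -> int:
    """Length of the longest run of identical consecutive steps."""
    return max((sum(1 for _ in run) for _, run in groupby(steps)), default=0)


def no_loops(steps: Sequence[Hashable], max_repeats: int = 3) -> bool:
    """True if no step repeats back-to-back more than max_repeats times."""
    return longest_repeat(steps) <= max_repeats


healthy = ["search", "lookup_order", "respond"]
spiral = ["search"] + ["retry_lookup"] * 5
assert no_loops(healthy)
assert not no_loops(spiral)
```

A run-length check only catches back-to-back repeats; alternating cycles such as A, B, A, B need a different approach.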

Semantic (LLM-as-judge — works with Ollama for $0)

Evaluator	What It Catches
`LLMJudgeEvaluator`	Custom quality criteria
`HallucinationEvaluator`	Ungrounded claims, made-up facts
`SimilarityEvaluator`	Drift from golden reference answers
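
Drift from a golden answer can be approximated without any model at all. The sketch below uses `difflib.SequenceMatcher` from the standard library as a crude, character-level stand-in; a semantic evaluator would more plausibly compare embeddings or ask a judge model, and nothing here reflects how `SimilarityEvaluator` is implemented:

```python
from difflib import SequenceMatcher


def similarity(answer: str, golden: str) -> float:
    """Character-level similarity in [0.0, 1.0], case-insensitive."""
    return SequenceMatcher(None, answer.lower(), golden.lower()).ratio()


golden = "Refunds are available within 30 days of purchase."
close = "Refunds are available within 30 days of purchase"
far = "We never offer refunds."
assert similarity(close, golden) > similarity(far, golden)
```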

Safety

Evaluator	What It Catches
`SecurityEvaluator`	PII leakage, credential exposure, injection attacks
`GuardrailEvaluator`	Scope violations, toxic content
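
As an illustration of the pattern matching a PII check can start from, the sketch below flags US Social Security numbers and email addresses with regular expressions. The patterns and the `find_pii` helper are hypothetical and deliberately simple; they are not the rules `SecurityEvaluator` uses, and real detectors need much broader coverage:

```python
import re
from typing import Dict, List

# Deliberately simple patterns, for illustration only.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def find_pii(text: str) -> Dict[str, List[str]]:
    """Return matches per PII category, omitting categories with no hits."""
    hits = {name: pat.findall(text) for name, pat in PII_PATTERNS.items()}
    return {name: found for name, found in hits.items() if found}


output = "John Smith, SSN 123-45-6789, reachable at john@example.com."
print(find_pii(output))
```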

Operational

Evaluator	What It Catches
`RegressionEvaluator`	Score drops, cost increases vs. baseline
`ConvergenceEvaluator`	Agent didn't finish the task
`ContextUtilizationEvaluator`	Agent ignored retrieved context
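
A regression check boils down to comparing the current run with a stored baseline while tolerating some noise. A minimal sketch under assumed conventions: scores lie in 0.0 to 1.0 with higher being better, costs are in USD, and the `regressed` helper with its tolerance parameters is hypothetical, not agenteval's API:

```python
def regressed(
    baseline_score: float,
    current_score: float,
    baseline_cost: float,
    current_cost: float,
    max_score_drop: float = 0.05,
    max_cost_increase: float = 0.20,
) -> bool:
    """True if the score fell or the cost rose beyond the allowed tolerance.

    max_score_drop is absolute (score points); max_cost_increase is
    relative to the baseline cost (0.20 means +20%).
    """
    score_dropped = baseline_score - current_score > max_score_drop
    cost_rose = current_cost > baseline_cost * (1 + max_cost_increase)
    return score_dropped or cost_rose


# A large quality drop with unchanged cost is flagged.
assert regressed(0.90, 0.63, baseline_cost=1.00, current_cost=1.00)
# Small noise in both directions is tolerated.
assert not regressed(0.90, 0.88, baseline_cost=1.00, current_cost=1.10)
```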

Provider Support

agenteval intercepts LLM calls at the protocol level. No framework-specific code needed.

Provider	Install	Hook Mechanism
OpenAI	`pip install agenteval-ai[openai]`	httpx transport
AWS Bedrock	`pip install agenteval-ai[bedrock]`	botocore events
Anthropic	`pip install agenteval-ai[anthropic]`	SDK patching
Ollama	`pip install agenteval-ai[ollama]`	OpenAI-compatible

Or install everything: pip install agenteval-ai[all]

$0 Local Evals with Ollama

No API keys needed. Run evaluations entirely locally:

ollama pull llama3.2
pip install agenteval-ai[ollama]
agenteval run tests/agent_evals/ -v

agenteval auto-detects Ollama and uses it for LLM-as-judge evaluations. To use a different provider even when Ollama is available:

pytest tests/agent_evals/ -v --agenteval-eval-provider=openai --agenteval-eval-model=gpt-4o-mini

Reports

Generate detailed HTML or JSON reports after a test run:

pytest tests/agent_evals/ --agenteval-report=html --agenteval-report-dir=reports/

This writes a self-contained reports/report_{YYYYMMDD_HHMMSS}.html with pass/fail, scores, costs, latency, token counts, evaluator reasoning, full agent trajectory with tool calls, and multi-turn message flow.

Each report filename includes a UTC timestamp, so successive runs don't overwrite each other.
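
The `{YYYYMMDD_HHMMSS}` placeholder corresponds to a standard `strftime` pattern. The sketch below shows how such a name could be built from a UTC timestamp; `report_filename` is a hypothetical helper, not a function agenteval exposes:

```python
from datetime import datetime, timezone
from typing import Optional


def report_filename(ext: str = "html", now: Optional[datetime] = None) -> str:
    """Build report_{YYYYMMDD_HHMMSS}.{ext} from a UTC timestamp."""
    now = now or datetime.now(timezone.utc)
    return f"report_{now.astimezone(timezone.utc):%Y%m%d_%H%M%S}.{ext}"


fixed = datetime(2026, 4, 9, 14, 30, 5, tzinfo=timezone.utc)
print(report_filename("html", now=fixed))  # report_20260409_143005.html
```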

Available file formats: html, json, or both as a comma-separated list (html,json). The flag also accepts console, per the CLI Options table:

# HTML only
pytest tests/ --agenteval-report=html

# JSON only (machine-readable, good for CI)
pytest tests/ --agenteval-report=json

# Both
pytest tests/ --agenteval-report=html,json --agenteval-report-dir=my-reports/

CLI Options

Flag	Description
`--agenteval-eval-provider`	Eval provider: `ollama`, `openai`, `bedrock` (overrides config/env)
`--agenteval-eval-model`	Eval model name (overrides config/env)
`--agenteval-report`	Report format: `html`, `json`, `console`, or comma-separated
`--agenteval-report-dir`	Output directory (default: `agenteval-reports/`)
`--agenteval-fail-under`	Fail if average eval score is below threshold (0.0–1.0)
`--agenteval-max-cost`	Fail if total eval cost exceeds this amount (USD)

CLI flags take the highest precedence, overriding both pyproject.toml and environment variables.

# Use OpenAI as judge instead of auto-detected Ollama
pytest tests/agent_evals/ -v --agenteval-eval-provider=openai --agenteval-eval-model=gpt-4o-mini

# Use Bedrock as judge with HTML report
pytest tests/agent_evals/ -v --agenteval-eval-provider=bedrock \
  --agenteval-eval-model=anthropic.claude-3-haiku-20240307-v1:0 \
  --agenteval-report=html
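
Taken together with the Configuration section, the precedence order is CLI flags, then environment variables, then pyproject.toml. That layering maps naturally onto `collections.ChainMap`. The dictionaries below are hypothetical stand-ins for values parsed from each source, and this is not a claim about how agenteval merges them internally:

```python
from collections import ChainMap

# Hypothetical values from each source; earlier maps win in a ChainMap.
cli_flags = {"eval_provider": "openai"}
env_vars = {
    "eval_provider": "bedrock",
    "eval_model": "anthropic.claude-3-haiku-20240307-v1:0",
}
pyproject = {"eval_provider": "ollama", "report_format": "html"}

settings = ChainMap(cli_flags, env_vars, pyproject)

assert settings["eval_provider"] == "openai"            # CLI flag wins
assert settings["eval_model"].startswith("anthropic.")  # from the environment
assert settings["report_format"] == "html"              # only in pyproject.toml
```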

Configuration

Configure agenteval via pyproject.toml, environment variables, or both. Environment variables take precedence over pyproject.toml, and CLI flags override both.

pyproject.toml

[tool.agenteval]
eval_provider = "bedrock"              # ollama, openai, bedrock
eval_model = "anthropic.claude-3-haiku-20240307-v1:0"
aws_profile = "my-aws-profile"         # AWS named profile for Bedrock
aws_region = "us-west-2"               # AWS region for Bedrock
openai_base_url = "http://localhost:8080/v1"  # for OpenAI-compatible APIs
openai_api_key = "sk-..."             # optional API key
report_format = "html"
report_dir = "agenteval-reports"
default_max_cost_usd = 1.0
default_max_latency_ms = 30000

Environment Variables

All config fields can be set via AGENTEVAL_ prefixed environment variables:

# Eval provider settings
export AGENTEVAL_EVAL_PROVIDER=bedrock
export AGENTEVAL_EVAL_MODEL=anthropic.claude-3-haiku-20240307-v1:0
export AGENTEVAL_AWS_PROFILE=my-aws-profile
export AGENTEVAL_AWS_REGION=us-west-2

# OpenAI-compatible API settings
export AGENTEVAL_OPENAI_BASE_URL=http://localhost:8080/v1
export AGENTEVAL_OPENAI_API_KEY=sk-custom

# Run tests
pytest tests/agent_evals/ -v
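
The naming convention above suggests that each field maps to `AGENTEVAL_` plus the upper-cased field name. A minimal sketch of reading settings that way; `load_env_config` is a hypothetical helper, and the assumption that every field follows this exact mapping comes from the examples above rather than from agenteval's source:

```python
import os
from typing import Dict, Mapping, Optional

PREFIX = "AGENTEVAL_"


def load_env_config(environ: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Collect AGENTEVAL_* variables into lower-cased config field names."""
    environ = os.environ if environ is None else environ
    return {
        key[len(PREFIX):].lower(): value
        for key, value in environ.items()
        if key.startswith(PREFIX)
    }


fake_env = {
    "AGENTEVAL_EVAL_PROVIDER": "bedrock",
    "AGENTEVAL_AWS_REGION": "us-west-2",
    "AWS_PROFILE": "ignored-here",
}
print(load_env_config(fake_env))
```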

For the example agents, standard AWS environment variables also work:

# Standard AWS env vars (used by the example agent fixtures)
export AWS_PROFILE=my-aws-profile
export AWS_REGION=us-east-1

pytest examples/bedrock_agent/ -v

Provider-specific setup

Ollama (free, local)

ollama pull llama3.2
pip install agenteval-ai[ollama]
# No config needed — auto-detected as default

OpenAI

pip install agenteval-ai[openai]
export OPENAI_API_KEY=sk-...

[tool.agenteval]
eval_provider = "openai"
eval_model = "gpt-4o-mini"

OpenAI-compatible APIs (e.g., vLLM, LiteLLM, Together AI, local servers)

pip install agenteval-ai[openai]

[tool.agenteval]
eval_provider = "openai"
eval_model = "my-custom-model"
openai_base_url = "http://localhost:8080/v1"
openai_api_key = "sk-custom"     # optional, depends on your server

Or via environment:

export AGENTEVAL_OPENAI_BASE_URL=http://localhost:8080/v1
export AGENTEVAL_OPENAI_API_KEY=sk-custom
pytest tests/agent_evals/ -v

AWS Bedrock

pip install agenteval-ai[bedrock]

[tool.agenteval]
eval_provider = "bedrock"
eval_model = "anthropic.claude-3-haiku-20240307-v1:0"
aws_profile = "my-profile"    # optional, uses default credential chain if omitted
aws_region = "us-east-1"      # optional, uses boto3 default if omitted

Or set the profile and region via environment variables:

export AGENTEVAL_AWS_PROFILE=my-profile
export AGENTEVAL_AWS_REGION=us-east-1
pytest tests/agent_evals/ -v

MCP Server

Works with Claude Code, VS Code / GitHub Copilot, Cursor, and Windsurf:

agenteval mcp install                        # auto-configures all detected tools
agenteval mcp install --platform claude-code # Claude Code only
agenteval mcp install --platform copilot     # VS Code / GitHub Copilot
agenteval mcp install --platform cursor      # Cursor
agenteval mcp install --platform windsurf    # Windsurf
agenteval mcp serve                          # start the server (stdio)

Tool	Config Path
Claude Code	`~/.claude/settings.json`
VS Code / Copilot	`.vscode/mcp.json`
Cursor	`.cursor/mcp.json`
Windsurf	`~/.codeium/windsurf/mcp_config.json`

The server exposes 8 tools: run_eval, run_single_test, check_regression, show_cost_report, list_evaluators, generate_test, save_baseline, explain_failure

AI Coding Tool Skills

agenteval skill install --platform all

Installs skills for Claude Code, GitHub Copilot, Cursor, and Windsurf. After installation, your AI coding tool can:

Test agents with /eval-agent
Generate test files with /generate-tests
Check regressions with /check-regression
Audit costs with /cost-audit
Audit security with /security-audit

GitHub Action

Add agent testing to your CI with a short workflow file:

# .github/workflows/agenteval.yml
name: Agent Tests
on: [pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: devbrat-anand/agenteval@v1
        with:
          fail_under: "0.8"

Posts a results table as a PR comment with scores, costs, and pass/fail status.
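
The `fail_under` input appears to correspond to the `--agenteval-fail-under` flag, which fails the run when the average eval score is below the threshold. A sketch of that gate, assuming a plain arithmetic mean over per-test scores; `passes_threshold` is hypothetical, and treating an empty run as a failure is a choice made here, not documented behavior:

```python
from statistics import mean
from typing import Sequence


def passes_threshold(scores: Sequence[float], fail_under: float) -> bool:
    """True if the mean score meets the threshold; an empty run fails."""
    if not scores:
        return False
    return mean(scores) >= fail_under


assert passes_threshold([0.9, 0.85, 0.8], fail_under=0.8)
assert not passes_threshold([0.9, 0.5, 0.8], fail_under=0.8)
```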

Comparison

Feature	agenteval	DeepEval	TruLens	RAGAS	LangSmith
Multi-step agent trajectories	✅	Partial	❌	❌	✅
Framework-agnostic	✅	✅	❌	❌	❌
Protocol-level interception	✅	❌	❌	❌	❌
pytest native	✅	✅	❌	❌	❌
$0 local evals (Ollama)	✅	❌	❌	❌	❌
Multi-provider (4 SDKs)	✅	❌	❌	❌	❌
MCP server	✅	❌	❌	❌	❌
GitHub Action with PR bot	✅	❌	❌	❌	❌
AI coding tool skills	✅	❌	❌	❌	❌
Open source (MIT)	✅	✅	✅	✅	❌

Examples

Example	Provider	What It Tests
quickstart	None (echo)	Basic structure
openai_agent	OpenAI	Tool-calling agent: cost, convergence, security, hallucination, scope
bedrock_agent	AWS Bedrock	Tool-calling agent (Converse API): cost, security, hallucination, scope
langchain_agent	OpenAI + LangChain	Tool calls, hallucination, scope
ollama_local	Ollama	$0 local evals: security, convergence, hallucination, scope

Custom Evaluators

Write your own evaluator and share it as a Python package:

from agenteval.evaluators.base import Evaluator
from agenteval.core.models import Trace, EvalResult

class ToxicityEvaluator(Evaluator):
    name = "toxicity"

    def evaluate(self, trace: Trace, criteria: dict) -> EvalResult:
        # Your logic here
        ...

# pyproject.toml of your evaluator package
[project.entry-points."agenteval.evaluators"]
toxicity = "my_package:ToxicityEvaluator"

Contributing

See CONTRIBUTING.md for guidelines. We welcome:

New evaluators
New provider interceptors
Bug fixes and documentation improvements
Example projects

License

MIT — see LICENSE for details.

Release history

0.1.1

Apr 9, 2026

This version

0.1.0

Apr 9, 2026

Download files

Source Distribution

agenteval_ai-0.1.0.tar.gz (4.1 MB)

Uploaded Apr 9, 2026 Source

Built Distribution

agenteval_ai-0.1.0-py3-none-any.whl (65.9 kB)

Uploaded Apr 9, 2026 Python 3

File details

Details for the file agenteval_ai-0.1.0.tar.gz.

File metadata

Download URL: agenteval_ai-0.1.0.tar.gz
Upload date: Apr 9, 2026
Size: 4.1 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for agenteval_ai-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`68a58bb6db4c42dd401cc9dd5b038b794bb40868580c224379149ad912b8092e`
MD5	`b1ce2b1358106ef84ca5177816fbbf45`
BLAKE2b-256	`c35c547229a0ef19c32b027c92e9e58dce7cedce0d9ff7ad5a2889ea5d6457f6`

Provenance

The following attestation bundles were made for agenteval_ai-0.1.0.tar.gz:

Publisher: release.yml on devbrat-anand/agenteval

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: agenteval_ai-0.1.0.tar.gz
- Subject digest: 68a58bb6db4c42dd401cc9dd5b038b794bb40868580c224379149ad912b8092e
- Sigstore transparency entry: 1265542260
- Sigstore integration time: Apr 9, 2026
Source repository:
- Permalink: devbrat-anand/agenteval@c6978c4f6835eff8006c65951db89565f5c07489
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/devbrat-anand
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@c6978c4f6835eff8006c65951db89565f5c07489
- Trigger Event: push

File details

Details for the file agenteval_ai-0.1.0-py3-none-any.whl.

File metadata

Download URL: agenteval_ai-0.1.0-py3-none-any.whl
Upload date: Apr 9, 2026
Size: 65.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for agenteval_ai-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2b72185a191eafb27d606c59a1ef3f19548fb3c3cdfa088577b94ee8d2c979bb`
MD5	`fff4365baebd6f18cea8f5f3ddf7641a`
BLAKE2b-256	`0ac17d6eb78e82b296afd49aed3877a2dffb6d666bf78d4530b4dd12924361fc`

Provenance

The following attestation bundles were made for agenteval_ai-0.1.0-py3-none-any.whl:

Publisher: release.yml on devbrat-anand/agenteval

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: agenteval_ai-0.1.0-py3-none-any.whl
- Subject digest: 2b72185a191eafb27d606c59a1ef3f19548fb3c3cdfa088577b94ee8d2c979bb
- Sigstore transparency entry: 1265542380
- Sigstore integration time: Apr 9, 2026
Source repository:
- Permalink: devbrat-anand/agenteval@c6978c4f6835eff8006c65951db89565f5c07489
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/devbrat-anand
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@c6978c4f6835eff8006c65951db89565f5c07489
- Trigger Event: push
