A simple evaluation harness for AI agents.
agentbench
agentbench is an evaluation framework for AI agents. It provides core abstractions for tasks, scenarios, agents, and judges, with an async runner that supports retries, rate limiting, and comprehensive reporting.
Features
- Core abstractions: Task, Scenario, AgentAdapter, Judge, and CompositeJudge
- Async execution: Built-in retry logic, fail-fast behavior, and provider-based rate limiting
- Lifecycle hooks: Setup, reset, and teardown methods for agent management
- Comprehensive reporting: HTML reports, JSONL results, traces, and metadata
- Run comparison: Compare multiple runs to identify regressions and improvements
- Presets: Ready-to-use scenarios and judges for quick evaluation
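To make the "retry logic" and "provider-based rate limiting" features more concrete, here is a minimal sketch of how such behavior could be built with plain asyncio. It is not agentbench's implementation: `ProviderLimiter`, `call_with_retries`, and all of their parameters are hypothetical names chosen for illustration, and the rate limit is approximated as a cap on concurrent calls per provider key.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Optional


class ProviderLimiter:
    """Hypothetical helper: at most `limit` calls in flight per provider key."""

    def __init__(self, limit: int = 2) -> None:
        self._limit = limit
        self._semaphores: Dict[str, asyncio.Semaphore] = {}

    def for_provider(self, provider_key: Optional[str]) -> asyncio.Semaphore:
        # Agents without a provider key share one "default" bucket (an assumption).
        key = provider_key or "default"
        if key not in self._semaphores:
            self._semaphores[key] = asyncio.Semaphore(self._limit)
        return self._semaphores[key]


async def call_with_retries(
    make_call: Callable[[], Awaitable[object]],
    limiter: ProviderLimiter,
    provider_key: Optional[str] = None,
    max_attempts: int = 3,
    base_delay: float = 0.01,
) -> object:
    """Await make_call() up to max_attempts times with exponential backoff."""
    last_error: Optional[BaseException] = None
    for attempt in range(max_attempts):
        # Hold the provider slot only while the call is running.
        async with limiter.for_provider(provider_key):
            try:
                return await make_call()
            except Exception as exc:
                last_error = exc
        if attempt < max_attempts - 1:
            await asyncio.sleep(base_delay * 2 ** attempt)
    assert last_error is not None
    raise last_error


async def demo():
    attempts = 0

    async def flaky():
        # Fails twice, then succeeds, to exercise the retry path.
        nonlocal attempts
        attempts += 1
        if attempts < 3:
            raise RuntimeError("transient failure")
        return "ok"

    limiter = ProviderLimiter(limit=1)
    result = await call_with_retries(flaky, limiter, provider_key="example")
    return result, attempts


result, attempts = asyncio.run(demo())
print(result, attempts)  # ok 3
```

Releasing the semaphore before sleeping keeps a backing-off task from blocking other calls to the same provider.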
Quick start
From a checkout of the source tree, install in editable mode:
pip install -e .
Run a simple example using presets:
python examples/math_with_presets.py
Example
Here's a minimal example that evaluates a simple math agent:
from agentbench import (
    RunConfig,
    run,
    build_math_basic_scenario,
    ExactMatchJudge,
)


class SimpleMathAgent:
    name = "simple_math_agent"
    version = "0.0.1"
    provider_key = None

    # Lifecycle hooks (setup, reset, teardown); no-ops for this stateless agent.
    async def setup(self) -> None:
        return None

    async def reset(self) -> None:
        return None

    async def teardown(self) -> None:
        return None

    async def run_task(self, task, context=None):
        # Answer the two prompts this toy agent recognizes.
        prompt = task.input["prompt"]
        if "2 + 3" in prompt:
            response = "5"
        elif "4 * 7" in prompt:
            response = "28"
        else:
            response = "I do not know yet"
        return {"response": response}


# Wire the preset scenario, the agent, and the judge into one run.
scenario = build_math_basic_scenario()
agent = SimpleMathAgent()
judge = ExactMatchJudge()

config = RunConfig(
    name="math_example",
    agents=[agent],
    scenarios=[scenario],
    judges=[judge],
)

results = run(config)
print(f"Got {sum(1 for r in results if r.passed)} / {len(results)} passing results.")
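The agent above exposes setup, reset, teardown, and run_task, but the example does not show when a runner would call them. The sketch below shows one plausible ordering: setup once, reset before every task, teardown always. It does not import agentbench; `drive`, `EchoMathAgent`, the `expected` attribute on each task, and the prompt strings are all assumptions made for illustration, and the real runner may order or scope these hooks differently.

```python
import asyncio
from types import SimpleNamespace


class EchoMathAgent:
    """Stand-in agent with the same hooks as SimpleMathAgent; records each call."""

    name = "echo_math_agent"

    def __init__(self) -> None:
        self.calls = []

    async def setup(self) -> None:
        self.calls.append("setup")

    async def reset(self) -> None:
        self.calls.append("reset")

    async def teardown(self) -> None:
        self.calls.append("teardown")

    async def run_task(self, task, context=None):
        self.calls.append("run_task")
        answers = {"What is 2 + 3?": "5", "What is 4 * 7?": "28"}
        return {"response": answers.get(task.input["prompt"], "I do not know yet")}


async def drive(agent, tasks):
    """Hypothetical driver: setup once, reset per task, teardown even on error."""
    outcomes = []
    await agent.setup()
    try:
        for task in tasks:
            await agent.reset()
            output = await agent.run_task(task)
            # Exact-match check against a hypothetical `expected` attribute.
            outcomes.append(output["response"].strip() == task.expected)
    finally:
        await agent.teardown()
    return outcomes


# Hypothetical task objects; only `input["prompt"]` appears in the example above.
tasks = [
    SimpleNamespace(input={"prompt": "What is 2 + 3?"}, expected="5"),
    SimpleNamespace(input={"prompt": "What is 4 * 7?"}, expected="28"),
    SimpleNamespace(input={"prompt": "What is 9 / 3?"}, expected="3"),
]

agent = EchoMathAgent()
outcomes = asyncio.run(drive(agent, tasks))
print(outcomes)     # [True, True, False]
print(agent.calls)
```

Putting teardown in a `finally` block means resources are released even if one task raises.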
Run artifacts
Running an evaluation creates a directory under runs/<run_id>/ with:
- results.jsonl: All evaluation results in JSONL format
- traces.jsonl: Event traces for debugging and analysis
- run_metadata.json: Run configuration, environment, and version information
- report.html: Interactive HTML report with statistics, per-agent performance, and failing tasks
Open report.html in your browser to view the detailed evaluation report.
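Because results.jsonl holds one JSON object per line, it can also be inspected with the standard library alone. The sketch below loads two such files and lists regressions and improvements between them. The record fields `task_id` and `passed` are assumptions (the actual schema is not documented here), and `load_results` and `compare` are hypothetical helpers, not agentbench's run-comparison API. The demo writes its own sample files rather than reading a real run directory.

```python
import json
import tempfile
from pathlib import Path


def load_results(path):
    """Map each task identifier to its pass/fail flag (field names assumed)."""
    results = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            results[record["task_id"]] = bool(record["passed"])
    return results


def compare(baseline, candidate):
    """Return (regressions, improvements) over tasks present in both runs."""
    shared = baseline.keys() & candidate.keys()
    regressions = sorted(t for t in shared if baseline[t] and not candidate[t])
    improvements = sorted(t for t in shared if not baseline[t] and candidate[t])
    return regressions, improvements


# Demo with two small hand-written files standing in for two runs.
with tempfile.TemporaryDirectory() as tmp:
    base_path = Path(tmp) / "baseline.jsonl"
    cand_path = Path(tmp) / "candidate.jsonl"
    base_path.write_text(
        '{"task_id": "add", "passed": true}\n{"task_id": "mul", "passed": false}\n',
        encoding="utf-8",
    )
    cand_path.write_text(
        '{"task_id": "add", "passed": false}\n{"task_id": "mul", "passed": true}\n',
        encoding="utf-8",
    )
    regressions, improvements = compare(
        load_results(base_path), load_results(cand_path)
    )

print(regressions, improvements)  # ['add'] ['mul']
```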
Testing
agentbench includes a comprehensive test suite with 43+ tests covering all major components:
- Core types (Task, Cost, EvaluationResult, Trace)
- Scenarios and presets
- Judges (ExactMatchJudge, CompositeJudge)
- Runner and lifecycle hooks
- Storage and reporting functions
- Integration tests
Run tests:
# Install dev dependencies
pip install -e ".[dev]"
# Run all tests
pytest
# Generate HTML test report
pytest --html=test-results/report.html --self-contained-html
# Generate coverage report
pytest --cov=src/agentbench --cov-report=html:htmlcov
Test reports are generated in test-results/ and coverage reports in htmlcov/.
Project status
agentbench is in active development. The core framework is functional and ready for evaluation use cases.
Current version: 0.0.1