A simple evaluation harness for AI agents.

agentbench

agentbench is an evaluation framework for AI agents. It provides core abstractions for tasks, scenarios, agents, and judges, with an async runner that supports retries, rate limiting, and comprehensive reporting.
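
To make the retry and limiting behaviour concrete, here is a minimal standard-library sketch of the general pattern. It is not agentbench's implementation: `call_with_retries`, `run_limited`, and their parameters are hypothetical names chosen for illustration, and the semaphore caps concurrency rather than enforcing a true requests-per-second limit.

```python
import asyncio


async def call_with_retries(func, *, max_attempts=3, base_delay=0.01):
    """Await func() up to max_attempts times, backing off exponentially."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await func()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))


async def run_limited(funcs, *, max_concurrent=2, **retry_kwargs):
    """Run all callables with at most max_concurrent in flight at once."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(func):
        async with semaphore:
            return await call_with_retries(func, **retry_kwargs)

    # gather() returns results in the same order as the input callables.
    return await asyncio.gather(*(guarded(f) for f in funcs))
```

Each element of `funcs` is a zero-argument coroutine function, so a failed attempt can be retried by simply calling it again.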

Features

Core abstractions: Task, Scenario, AgentAdapter, Judge, and CompositeJudge
Async execution: Built-in retry logic, fail-fast behavior, and provider-based rate limiting
Lifecycle hooks: Setup, reset, and teardown methods for agent management
Comprehensive reporting: HTML reports, JSONL results, traces, and metadata
Run comparison: Compare multiple runs to identify regressions and improvements
Presets: Ready-to-use scenarios and judges for quick evaluation
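
As an illustration of the run comparison idea in the list above, the sketch below classifies tasks whose pass/fail status changed between two runs. It is a standalone example, not agentbench's comparison API: `compare_runs` and the task-id-to-boolean input shape are assumptions made for this sketch.

```python
def compare_runs(baseline, candidate):
    """Classify tasks by how their pass/fail status changed between two runs.

    Both arguments map a task id to True (passed) or False (failed).
    Only tasks present in both runs are compared.
    """
    shared = baseline.keys() & candidate.keys()
    return {
        "regressions": sorted(t for t in shared if baseline[t] and not candidate[t]),
        "improvements": sorted(t for t in shared if not baseline[t] and candidate[t]),
    }


baseline = {"add": True, "multiply": False, "divide": True}
candidate = {"add": True, "multiply": True, "divide": False}
print(compare_runs(baseline, candidate))
# prints {'regressions': ['divide'], 'improvements': ['multiply']}
```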

Quick start

From a local checkout of the project source, install in editable mode:

pip install -e .

Run a simple example using presets:

python examples/math_with_presets.py

Example

Here's a minimal example that evaluates a simple math agent:

from agentbench import (
    RunConfig,
    run,
    build_math_basic_scenario,
    ExactMatchJudge,
)


class SimpleMathAgent:
    # Identifying attributes for this example agent.
    name = "simple_math_agent"
    version = "0.0.1"
    provider_key = None

    # Lifecycle hooks: this agent keeps no state, so each one is a no-op.
    async def setup(self) -> None:
        return None

    async def reset(self) -> None:
        return None

    async def teardown(self) -> None:
        return None

    async def run_task(self, task, context=None):
        # Answer the two prompts this toy agent recognizes;
        # anything else gets a fallback string.
        prompt = task.input["prompt"]
        if "2 + 3" in prompt:
            response = "5"
        elif "4 * 7" in prompt:
            response = "28"
        else:
            response = "I do not know yet"
        return {"response": response}


scenario = build_math_basic_scenario()
agent = SimpleMathAgent()
judge = ExactMatchJudge()

config = RunConfig(
    name="math_example",
    agents=[agent],
    scenarios=[scenario],
    judges=[judge],
)

results = run(config)
print(f"Got {sum(1 for r in results if r.passed)} / {len(results)} passing results.")
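
The agent above defines setup, reset, teardown, and run_task, but the example does not show when each is called. The sketch below shows one plausible ordering, under the assumption that setup runs once, reset runs before every task, and teardown always runs at the end. `RecordingAgent` and `drive` are hypothetical stand-ins written for this illustration; agentbench's actual runner may order or wrap these calls differently.

```python
import asyncio


class RecordingAgent:
    """Stand-in agent that logs each hook call so the ordering is visible."""

    def __init__(self):
        self.calls = []

    async def setup(self):
        self.calls.append("setup")

    async def reset(self):
        self.calls.append("reset")

    async def teardown(self):
        self.calls.append("teardown")

    async def run_task(self, task, context=None):
        self.calls.append(f"run_task:{task}")
        return {"response": task.upper()}


async def drive(agent, tasks):
    """Hypothetical driver: setup once, reset before each task, always teardown."""
    await agent.setup()
    try:
        outputs = []
        for task in tasks:
            await agent.reset()
            outputs.append(await agent.run_task(task))
        return outputs
    finally:
        # Runs even if a task raises, so resources are released.
        await agent.teardown()


agent = RecordingAgent()
outputs = asyncio.run(drive(agent, ["a", "b"]))
print(agent.calls)
# prints ['setup', 'reset', 'run_task:a', 'reset', 'run_task:b', 'teardown']
```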

Run artifacts

Running an evaluation creates a directory under runs/<run_id>/ with:

results.jsonl: All evaluation results in JSONL format
traces.jsonl: Event traces for debugging and analysis
run_metadata.json: Run configuration, environment, and version information
report.html: Interactive HTML report with statistics, per-agent performance, and failing tasks

Open report.html in your browser to view the detailed evaluation report.
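
Because results.jsonl holds one JSON object per line, it can also be inspected with the standard library alone. The sketch below counts passing results. The "passed" field name is an assumption based on the `r.passed` attribute used in the example above, and `summarize_results` is a hypothetical helper, so check an actual results file before relying on it.

```python
import json
from pathlib import Path


def summarize_results(path):
    """Return (passed, total) for a JSONL file with one result object per line."""
    passed = total = 0
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            total += 1
            passed += bool(record.get("passed"))  # "passed" is an assumed field name
    return passed, total
```

Call it with the path to a run's results file, for example the results.jsonl inside the run directory described above.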

Testing

agentbench includes a test suite of more than 43 tests covering its major components:

Core types (Task, Cost, EvaluationResult, Trace)
Scenarios and presets
Judges (ExactMatchJudge, CompositeJudge)
Runner and lifecycle hooks
Storage and reporting functions
Integration tests

Run tests:

# Install dev dependencies
pip install -e ".[dev]"

# Run all tests
pytest

# Generate HTML test report
pytest --html=test-results/report.html --self-contained-html

# Generate coverage report
pytest --cov=src/agentbench --cov-report=html:htmlcov

With the options shown above, the HTML test report is written to test-results/ and the coverage report to htmlcov/.

Project status

agentbench is in active development. The core framework is functional and ready for evaluation use cases.

Current version: 0.0.1

Project details

This version: 0.0.1 (Jan 7, 2026)

Download files

Source Distribution

agentbench-0.0.1.tar.gz (20.3 kB)

Uploaded Jan 7, 2026 Source

Built Distribution

agentbench-0.0.1-py3-none-any.whl (18.4 kB)

Uploaded Jan 7, 2026 Python 3

File details

Details for the file agentbench-0.0.1.tar.gz.

File metadata

Download URL: agentbench-0.0.1.tar.gz
Upload date: Jan 7, 2026
Size: 20.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.14

File hashes

Hashes for agentbench-0.0.1.tar.gz
Algorithm	Hash digest
SHA256	`ed2e0540cecc7a97b658abb2a90ce3ba6931416fac22c88d033b58f6954dbc1c`
MD5	`a9ea87dbf20ad78cf9650667998b8b7b`
BLAKE2b-256	`2071f250ebd6f248a7cfcc0eb2a1c9ad9c5e0f3cd85438d653fd24a0edf58f9c`

File details

Details for the file agentbench-0.0.1-py3-none-any.whl.

File metadata

Download URL: agentbench-0.0.1-py3-none-any.whl
Upload date: Jan 7, 2026
Size: 18.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.14

File hashes

Hashes for agentbench-0.0.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a9d8862afda4148a0e9073d549d76be2e5cefa6141412475098a892a61f98e87`
MD5	`7ef85c2b31c1faef2a676f3f60de3d09`
BLAKE2b-256	`6c4c3dcbca61160c263c554485e215a1686019fb1c22c78d1af7286d30f544ac`
