A simple evaluation harness for AI agents.

agentbench

agentbench is an evaluation framework for AI agents. It provides core abstractions for tasks, scenarios, agents, and judges, with an async runner that supports retries, rate limiting, and comprehensive reporting.

Features

  • Core abstractions: Task, Scenario, AgentAdapter, Judge, and CompositeJudge
  • Async execution: Built-in retry logic, fail-fast behavior, and provider-based rate limiting
  • Lifecycle hooks: Setup, reset, and teardown methods for agent management
  • Comprehensive reporting: HTML reports, JSONL results, traces, and metadata
  • Run comparison: Compare multiple runs to identify regressions and improvements
  • Presets: Ready-to-use scenarios and judges for quick evaluation
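The retry and rate-limiting behavior listed above can be pictured with plain asyncio. This is a minimal sketch, not agentbench's implementation: `call_with_retries`, `flaky_call`, and all parameters are hypothetical names, and an `asyncio.Semaphore` stands in for provider-based rate limiting by capping how many calls are in flight at once.

```python
import asyncio


async def call_with_retries(make_call, limiter, max_retries=3, base_delay=0.01):
    """Await make_call() under a shared concurrency cap, retrying on failure.

    Hypothetical helper for illustration; not part of agentbench.
    """
    last_error = None
    for attempt in range(max_retries + 1):
        async with limiter:  # at most N calls in flight for this provider
            try:
                return await make_call()
            except Exception as exc:  # a real runner would catch narrower errors
                last_error = exc
        if attempt < max_retries:
            # Exponential backoff, slept outside the limiter so other calls can proceed.
            await asyncio.sleep(base_delay * 2 ** attempt)
    raise last_error


async def demo():
    limiter = asyncio.Semaphore(2)  # assumed cap of two concurrent calls
    attempts = {"count": 0}

    async def flaky_call():
        # Fails twice, then succeeds, to exercise the retry path.
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise RuntimeError("transient failure")
        return "ok"

    result = await call_with_retries(flaky_call, limiter)
    return result, attempts["count"]


demo_result = asyncio.run(demo())
```

A semaphore limits concurrency rather than requests per second; a token bucket would be the closer match if a provider enforces a per-minute quota.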

Quick start

Install in editable mode from the root of a source checkout:

pip install -e .

Run a simple example using presets:

python examples/math_with_presets.py

Example

Here's a minimal example that evaluates a simple math agent:

from agentbench import (
    RunConfig,
    run,
    build_math_basic_scenario,
    ExactMatchJudge,
)


class SimpleMathAgent:
    name = "simple_math_agent"
    version = "0.0.1"
    provider_key = None

    async def setup(self) -> None:
        return None

    async def reset(self) -> None:
        return None

    async def teardown(self) -> None:
        return None

    async def run_task(self, task, context=None):
        prompt = task.input["prompt"]
        if "2 + 3" in prompt:
            response = "5"
        elif "4 * 7" in prompt:
            response = "28"
        else:
            response = "I do not know yet"
        return {"response": response}


scenario = build_math_basic_scenario()
agent = SimpleMathAgent()
judge = ExactMatchJudge()

config = RunConfig(
    name="math_example",
    agents=[agent],
    scenarios=[scenario],
    judges=[judge],
)

results = run(config)
print(f"Got {sum(1 for r in results if r.passed)} / {len(results)} passing results.")
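Before handing an adapter to the runner, it can help to drive its hooks by hand. The sketch below is self-contained and does not import agentbench: `RecordingAgent` and `drive` are hypothetical, the task is a stand-in object exposing only the `input["prompt"]` field used in the example above, and the hook order (setup once, reset before each task, teardown at the end) is an assumption about how a runner might call them, not a documented guarantee.

```python
import asyncio
from types import SimpleNamespace


class RecordingAgent:
    """Hypothetical adapter that records which hooks were called."""

    name = "recording_agent"
    version = "0.0.1"
    provider_key = None

    def __init__(self):
        self.calls = []

    async def setup(self) -> None:
        self.calls.append("setup")

    async def reset(self) -> None:
        self.calls.append("reset")

    async def teardown(self) -> None:
        self.calls.append("teardown")

    async def run_task(self, task, context=None):
        self.calls.append("run_task")
        prompt = task.input["prompt"]
        return {"response": "5" if "2 + 3" in prompt else "I do not know yet"}


async def drive(agent, tasks):
    """Assumed lifecycle: setup, then reset + run_task per task, then teardown."""
    outputs = []
    await agent.setup()
    try:
        for task in tasks:
            await agent.reset()
            outputs.append(await agent.run_task(task))
    finally:
        await agent.teardown()  # runs even if a task raises
    return outputs


# Stand-in tasks; real Task objects come from a scenario.
tasks = [
    SimpleNamespace(input={"prompt": "What is 2 + 3?"}),
    SimpleNamespace(input={"prompt": "What is 9 - 1?"}),
]
agent = RecordingAgent()
outputs = asyncio.run(drive(agent, tasks))
```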

Run artifacts

Running an evaluation creates a directory under runs/<run_id>/ with:

  • results.jsonl: All evaluation results in JSONL format
  • traces.jsonl: Event traces for debugging and analysis
  • run_metadata.json: Run configuration, environment, and version information
  • report.html: Interactive HTML report with statistics, per-agent performance, and failing tasks

Open report.html in your browser to view the detailed evaluation report.
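Because results.jsonl is plain JSONL, it can also be inspected with the standard library. The sketch below assumes each line is a JSON object with a `task_id` field and a boolean `passed` field; both field names are assumptions (only `passed` is suggested by the example above), and the sample files are written to a temporary directory rather than read from a real run. The `regressions` helper is a hypothetical illustration of comparing two runs, not agentbench's own comparison feature.

```python
import json
import tempfile
from pathlib import Path


def load_results(path):
    """Parse one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def pass_rate(records):
    """Fraction of records whose (assumed) "passed" field is true."""
    if not records:
        return 0.0
    return sum(1 for r in records if r.get("passed")) / len(records)


def regressions(baseline, candidate):
    """Task ids that pass in the baseline run but fail in the candidate run."""
    candidate_passed = {r["task_id"]: bool(r.get("passed")) for r in candidate}
    return sorted(
        r["task_id"]
        for r in baseline
        if r.get("passed") and candidate_passed.get(r["task_id"]) is False
    )


def write_jsonl(path, records):
    """Write sample records, one JSON object per line."""
    path.write_text("".join(json.dumps(r) + "\n" for r in records), encoding="utf-8")


# Fabricated sample data for illustration only; real files live under runs/<run_id>/.
with tempfile.TemporaryDirectory() as tmp:
    baseline_path = Path(tmp) / "baseline_results.jsonl"
    candidate_path = Path(tmp) / "candidate_results.jsonl"
    write_jsonl(baseline_path, [
        {"task_id": "add", "passed": True},
        {"task_id": "mul", "passed": True},
        {"task_id": "other", "passed": False},
    ])
    write_jsonl(candidate_path, [
        {"task_id": "add", "passed": True},
        {"task_id": "mul", "passed": False},
        {"task_id": "other", "passed": False},
    ])
    baseline = load_results(baseline_path)
    candidate = load_results(candidate_path)

baseline_rate = pass_rate(baseline)
regressed = regressions(baseline, candidate)
```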

Testing

agentbench includes a test suite of 43+ tests covering the major components:

  • Core types (Task, Cost, EvaluationResult, Trace)
  • Scenarios and presets
  • Judges (ExactMatchJudge, CompositeJudge)
  • Runner and lifecycle hooks
  • Storage and reporting functions
  • Integration tests

Run tests:

# Install dev dependencies
pip install -e ".[dev]"

# Run all tests
pytest

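# Generate HTML test report (the --html option comes from the pytest-html plugin)
pytest --html=test-results/report.html --self-contained-html

# Generate coverage report (the --cov options come from the pytest-cov plugin)
pytest --cov=src/agentbench --cov-report=html:htmlcov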

Test reports are generated in test-results/ and coverage reports in htmlcov/.

Project status

agentbench is in active development. The core framework is functional and ready for evaluation use cases.

Current version: 0.0.1
