A simple evaluation harness for AI agents.

agentbench

agentbench is an evaluation framework for AI agents. It provides core abstractions for tasks, scenarios, agents, and judges, with an async runner that supports retries, rate limiting, and comprehensive reporting.

Features

  • Core abstractions: Task, Scenario, AgentAdapter, Judge, and CompositeJudge
  • Async execution: Built-in retry logic, fail-fast behavior, and provider-based rate limiting
  • Lifecycle hooks: Setup, reset, and teardown methods for agent management
  • Comprehensive reporting: HTML reports, JSONL results, traces, and metadata
  • Run comparison: Compare multiple runs to identify regressions and improvements
  • Presets: Ready-to-use scenarios and judges for quick evaluation
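To make the "async execution" bullet concrete, here is a minimal standalone sketch of retrying a coroutine under a per-provider concurrency cap. It does not use agentbench's own runner: ProviderLimiter, call_with_retry, and their parameters are hypothetical names chosen for illustration, and the real implementation may limit by rate rather than by concurrency.

```python
import asyncio


class ProviderLimiter:
    """Caps concurrent calls per provider key (hypothetical helper, not agentbench API)."""

    def __init__(self, max_concurrent=2):
        self._max = max_concurrent
        self._semaphores = {}

    def for_provider(self, key):
        # Create the semaphore lazily so each provider key gets its own cap.
        if key not in self._semaphores:
            self._semaphores[key] = asyncio.Semaphore(self._max)
        return self._semaphores[key]


async def call_with_retry(func, *, limiter, provider_key, max_attempts=3, base_delay=0.01):
    """Await func() under the provider's cap, retrying failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            async with limiter.for_provider(provider_key):
                return await func()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))


async def demo():
    calls = {"n": 0}

    async def flaky():
        # Fails on the first two calls, then succeeds.
        calls["n"] += 1
        if calls["n"] < 3:
            raise RuntimeError("transient failure")
        return "ok"

    limiter = ProviderLimiter(max_concurrent=2)
    result = await call_with_retry(flaky, limiter=limiter, provider_key="example")
    return result, calls["n"]


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Keying the semaphores by provider means agents that share a backend share one cap, while agents on different backends do not block each other.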

Quick start

From a checkout of the repository, install the package in editable mode:

pip install -e .

Run a simple example using presets:

python examples/math_with_presets.py

Example

Here's a minimal example that evaluates a simple math agent:

from agentbench import (
    RunConfig,
    run,
    build_math_basic_scenario,
    ExactMatchJudge,
)


class SimpleMathAgent:
    """Toy agent that answers two hard-coded arithmetic prompts."""

    name = "simple_math_agent"
    version = "0.0.1"
    provider_key = None

    # Lifecycle hooks: this agent keeps no state, so all three are no-ops.
    async def setup(self) -> None:
        return None

    async def reset(self) -> None:
        return None

    async def teardown(self) -> None:
        return None

    async def run_task(self, task, context=None):
        # The prompt text is read from the task's input mapping.
        prompt = task.input["prompt"]
        if "2 + 3" in prompt:
            response = "5"
        elif "4 * 7" in prompt:
            response = "28"
        else:
            response = "I do not know yet"
        return {"response": response}


scenario = build_math_basic_scenario()
agent = SimpleMathAgent()
judge = ExactMatchJudge()

config = RunConfig(
    name="math_example",
    agents=[agent],
    scenarios=[scenario],
    judges=[judge],
)

results = run(config)
print(f"Got {sum(1 for r in results if r.passed)} / {len(results)} passing results.")
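Before wiring an agent into a full run, it can help to exercise it on its own. The sketch below repeats the agent from the example in compact form and drives its hooks by hand. The SimpleNamespace object is only a stand-in for a real Task (it provides the input mapping that run_task reads), and the hook order shown here (setup once, reset before each task, teardown at the end) is an assumption about how the runner behaves, not documented behavior.

```python
import asyncio
from types import SimpleNamespace


class SimpleMathAgent:
    """Compact copy of the agent from the example above."""

    name = "simple_math_agent"
    version = "0.0.1"
    provider_key = None

    async def setup(self) -> None:
        return None

    async def reset(self) -> None:
        return None

    async def teardown(self) -> None:
        return None

    async def run_task(self, task, context=None):
        prompt = task.input["prompt"]
        if "2 + 3" in prompt:
            return {"response": "5"}
        if "4 * 7" in prompt:
            return {"response": "28"}
        return {"response": "I do not know yet"}


async def check_agent(agent, prompts):
    """Call the lifecycle hooks by hand and collect one response per prompt."""
    await agent.setup()
    try:
        responses = []
        for prompt in prompts:
            await agent.reset()  # assumed: state is reset between tasks
            task = SimpleNamespace(input={"prompt": prompt})  # stand-in for Task
            output = await agent.run_task(task)
            responses.append(output["response"])
        return responses
    finally:
        await agent.teardown()  # runs even if a task raises


if __name__ == "__main__":
    prompts = ["What is 2 + 3?", "What is 4 * 7?", "What is 9 - 1?"]
    print(asyncio.run(check_agent(SimpleMathAgent(), prompts)))
```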

Run artifacts

Running an evaluation creates a directory under runs/<run_id>/ with:

  • results.jsonl: All evaluation results in JSONL format
  • traces.jsonl: Event traces for debugging and analysis
  • run_metadata.json: Run configuration, environment, and version information
  • report.html: Interactive HTML report with statistics, per-agent performance, and failing tasks

Open report.html in your browser to view the detailed evaluation report.
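Because results.jsonl is plain JSONL, it can also be inspected with the standard library. The sketch below loads two result files and lists tasks that passed in a baseline run but fail in a later one. The field names task_id and passed are assumptions about the record layout, and the run directory names in the usage lines are placeholders; check an actual results.jsonl before relying on them. This is separate from agentbench's own run comparison feature.

```python
import json
from pathlib import Path


def load_results(path):
    """Read one JSON object per non-empty line of a JSONL file."""
    with Path(path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def find_regressions(baseline, candidate, key="task_id"):
    """Return the keys that passed in baseline but do not pass in candidate.

    Assumes each record has a `passed` flag and a `task_id` field;
    both names are guesses about the file layout.
    """
    passed_before = {row[key] for row in baseline if row.get("passed")}
    return sorted(
        row[key]
        for row in candidate
        if row[key] in passed_before and not row.get("passed")
    )


if __name__ == "__main__":
    # Placeholder run ids; substitute two real directories under runs/.
    old = load_results("runs/baseline/results.jsonl")
    new = load_results("runs/candidate/results.jsonl")
    print(find_regressions(old, new))
```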

Testing

agentbench includes a test suite of 43+ tests covering:

  • Core types (Task, Cost, EvaluationResult, Trace)
  • Scenarios and presets
  • Judges (ExactMatchJudge, CompositeJudge)
  • Runner and lifecycle hooks
  • Storage and reporting functions
  • Integration tests

Run tests:

# Install dev dependencies
pip install -e ".[dev]"

# Run all tests
pytest

# Generate HTML test report
pytest --html=test-results/report.html --self-contained-html

# Generate coverage report
pytest --cov=src/agentbench --cov-report=html:htmlcov

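With the commands above, the test report is written to test-results/ and the coverage report to htmlcov/. The --html and --cov options are provided by the pytest-html and pytest-cov plugins.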

Project status

agentbench is in active development. The core framework is functional and ready for evaluation use cases.

Current version: 0.0.1
