Benchmark autonomous AI agents on task completion, tool use, goal adherence, and safety. Works with any agent — just provide a callable.

agent-bench


Why agent-bench?

Agents are hard to evaluate. Unlike single LLM calls, agents take multiple steps, call tools, and can drift from their purpose. Most evaluation frameworks require you to restructure your agent. agent-bench doesn't. Wrap your agent in a callable and pass it in.

Five evaluation dimensions:

| Dimension       | Weight | What it measures                    |
|-----------------|--------|-------------------------------------|
| Task completion | 35%    | Did it satisfy success criteria?    |
| Tool use        | 20%    | Did it call the right tools?        |
| Goal adherence  | 20%    | Did it stay on task?                |
| Safety          | 15%    | Was the output safe?                |
| Efficiency      | 10%    | Did it complete within step budget? |
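
To make the weights concrete, here is a minimal sketch of how five per-dimension scores could roll up into one number. It assumes each dimension is scored in [0, 1] and that the overall score is a plain weighted sum; agent-bench's actual formula is not stated here, and `WEIGHTS`, `weighted_score`, and the example scores are illustrative names and values, not part of the library.

```python
# Weights copied from the table above; everything else is an illustrative assumption.
WEIGHTS = {
    "task_completion": 0.35,
    "tool_use": 0.20,
    "goal_adherence": 0.20,
    "safety": 0.15,
    "efficiency": 0.10,
}


def weighted_score(scores: dict[str, float]) -> float:
    """Combine per-dimension scores in [0, 1] into one weighted total."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical agent: solved the task safely, called half of the expected
# tools, and overran its step budget.
example = {
    "task_completion": 1.0,
    "tool_use": 0.5,
    "goal_adherence": 1.0,
    "safety": 1.0,
    "efficiency": 0.0,
}

total = weighted_score(example)
passed = total >= 0.7  # 0.7 mirrors pass_threshold in the quick start
print(f"overall={total:.2f} passed={passed}")
```

Under this reading, an agent that fully completes the task but is careless with tools and steps can still clear a 0.7 threshold, because task completion carries the largest weight.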

Install

pip install llm-agent-bench

The distribution is published on PyPI as llm-agent-bench; the import name is agent_bench.

Quick start

from agent_bench import AgentBench, Task, AgentResponse

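def my_agent(instruction: str) -> AgentResponse:
    # run_my_agent is a placeholder: replace it with the call that runs your own agent.
    result = run_my_agent(instruction)
    return AgentResponse(
        output=result.text,              # final text the agent produced
        tools_called=result.tools_used,  # names of the tools it invoked
        steps=result.step_count,         # number of steps it took
    )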

bench = AgentBench(pass_threshold=0.7)

report = bench.run(
    agent=my_agent,
    tasks=[
        Task(
            id="research_task",
            instruction="Find the current UK base interest rate",
            expected_tools=["search"],
            success_criteria=["base rate", "Bank of England", "%"],
            max_steps=5,
        ),
    ],
)
print(report.summary())
print(f"Pass rate: {report.pass_rate:.0%}")
print(f"Weakest dimension: {report.weakest_dimension.value}")
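
If your agent returns something other than an AgentResponse, a thin adapter is usually enough. The sketch below is self-contained so it runs without agent-bench installed: `AgentResponse` here is a local stand-in with the three fields used in the quick start, and `ToyAgent`, its trace format, and `make_callable` are hypothetical. With the real package, import `AgentResponse` from `agent_bench` instead.

```python
from dataclasses import dataclass, field


@dataclass
class AgentResponse:
    """Local stand-in with the three fields used in the quick start."""
    output: str
    tools_called: list[str] = field(default_factory=list)
    steps: int = 0


class ToyAgent:
    """Hypothetical agent that returns an answer plus a (tool, observation) trace."""

    def run(self, instruction: str) -> dict:
        trace = [("search", "toy observation")]
        return {"answer": f"Answer to: {instruction}", "trace": trace}


def make_callable(agent: ToyAgent):
    """Wrap the agent in a function that maps an instruction to an AgentResponse."""

    def call(instruction: str) -> AgentResponse:
        raw = agent.run(instruction)
        return AgentResponse(
            output=raw["answer"],
            tools_called=[tool for tool, _ in raw["trace"]],
            # Assumption: one step per tool call, plus one final answer step.
            steps=len(raw["trace"]) + 1,
        )

    return call


my_agent = make_callable(ToyAgent())
response = my_agent("Find the current UK base interest rate")
print(response.tools_called, response.steps)
```

The adapter keeps the agent itself untouched; only the mapping from its native result to the three response fields lives in the wrapper.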

Evaluate a single response

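# task is a Task and response is an AgentResponse, as constructed in the quick start.
result = bench.evaluate_single(task, response)
print(result.overall_score)       # combined score for this task
print(result.score_by_dimension)  # per-dimension breakdown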
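
The quick start prints `report.pass_rate` and `report.weakest_dimension`. How agent-bench derives them is not documented here, so the following is only one plausible reading: the pass rate is the share of tasks whose overall score meets the threshold, and the weakest dimension is the one with the lowest mean score across tasks. The `results` structure, its scores, and both helper functions are hypothetical.

```python
from statistics import mean

PASS_THRESHOLD = 0.7  # same value as pass_threshold in the quick start

# Hypothetical per-task results; this is not the library's report format.
results = {
    "research_task": {
        "overall": 0.80,
        "by_dimension": {"task_completion": 1.0, "tool_use": 0.5,
                         "goal_adherence": 1.0, "safety": 1.0, "efficiency": 0.0},
    },
    "summary_task": {
        "overall": 0.625,
        "by_dimension": {"task_completion": 0.5, "tool_use": 0.0,
                         "goal_adherence": 1.0, "safety": 1.0, "efficiency": 1.0},
    },
}


def pass_rate(results: dict, threshold: float) -> float:
    """Fraction of tasks whose overall score meets the threshold."""
    if not results:
        raise ValueError("no results to aggregate")
    passed = sum(r["overall"] >= threshold for r in results.values())
    return passed / len(results)


def weakest_dimension(results: dict) -> str:
    """Name of the dimension with the lowest mean score across tasks."""
    names = next(iter(results.values()))["by_dimension"].keys()
    averages = {
        name: mean(r["by_dimension"][name] for r in results.values())
        for name in names
    }
    return min(averages, key=averages.get)


print(f"Pass rate: {pass_rate(results, PASS_THRESHOLD):.0%}")
print(f"Weakest dimension: {weakest_dimension(results)}")
```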

Linda Oraegbunam | LinkedIn | GitHub

Download files

llm_agent_bench-1.0.0.tar.gz (source distribution, 9.0 kB)

  • Uploaded using Trusted Publishing: Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12
  • Publisher: publish.yml on obielin/agent-bench
  • SHA256: d65f062a528db2938c18626f8b643baec9d239f3fa79ff1869944c6b5e6c909f
  • MD5: 44c498b28aa885832f596f08012d6aac
  • BLAKE2b-256: 9b8fb1e6e6895dfeca55e865098e29cd532948d5b161cd0fd0d5e38f1d6bf541

llm_agent_bench-1.0.0-py3-none-any.whl (built distribution, Python 3, 7.9 kB)

  • Publisher: publish.yml on obielin/agent-bench
  • SHA256: 0ad6751a922783c647d798318f20e542abfdf595ca0ae1ddd63046214a62982e
  • MD5: b59f473b93d2c8a0e3e7dd4a52064884
  • BLAKE2b-256: 7e7f52e1af415a20f585f47cccddee4d83a5eebd00a250bae806783e99d1d03c