HumanEval for tool use — a standardized benchmark for evaluating LLM tool-use capabilities

BenchAgent — HumanEval for Tool Use

A standardized benchmark for evaluating LLM tool-use capabilities across multiple categories: bash commands, code editing, code reading, code writing, multi-tool orchestration, and error recovery.

Installation

pip install fableforge-bench-agent

For development:

pip install -e ".[dev]"

Quick Start

# List available tasks
bench-agent list-tasks

# List tasks by category
bench-agent list-tasks --category bash

# Run benchmark against a model
bench-agent run --model gpt-4 --category bash

# Run all categories
bench-agent run --model fableforge-14b --all

# View leaderboard
bench-agent leaderboard

# Export leaderboard as markdown
bench-agent export --format markdown

Task Categories

BASH (21 tasks)

Shell command execution: finding files, processing text, managing processes, network operations, log parsing, and system administration tasks.

EDIT (22 tasks)

Code modification: fixing bugs, refactoring code, adding features, changing APIs, adding type hints, converting sync to async, error handling, and API evolution.

READ (16 tasks)

Code comprehension: understanding structure, finding patterns, tracing execution, identifying vulnerabilities, and explaining code behavior.

WRITE (16 tasks)

Code creation: generating new files, configuration, tests, Dockerfiles, project scaffolding, and CI/CD pipelines.

MULTI-TOOL (16 tasks)

Complex tasks requiring 3+ tools in sequence: read → analyze → modify → verify, full project setup, and multi-file refactoring.

ERROR RECOVERY (16 tasks)

Failure handling: fixing broken code, recovering from errors, and handling edge cases such as syntax errors, runtime errors, race conditions, security vulnerabilities, and infinite loops.

Scoring Methodology

Each task produces a TaskResult, scored on three weighted metrics:

Metric	Weight	Description
Functional correctness	60%	Does the solution work as expected?
Efficiency	25%	Fewer turns and tokens yield a higher score
Error recovery	15%	How well does the model recover from errors?

For failed tasks, partial credit applies:

Component	Weight	Description
Partial completion	50%	How close to a correct solution?
Error recovery rate	30%	Were errors identified and addressed?
Efficiency	20%	Resource usage despite failure

Score Calculation

Overall Score = 0.6 * functional_score + 0.15 * recovery_score + 0.25 * efficiency_score

For failed tasks:

Score = 0.5 * partial_credit + 0.3 * recovery_score + 0.2 * efficiency_score

Final scores are scaled to 0–100.
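
The two formulas above can be sketched in Python as follows. This is a minimal illustration rather than the package's scorer: the function name `overall_score`, the `passed` flag, and the assumption that every component score lies in [0, 1] before scaling are all hypothetical.

```python
def overall_score(
    passed: bool,
    functional_score: float = 0.0,
    partial_credit: float = 0.0,
    recovery_score: float = 0.0,
    efficiency_score: float = 0.0,
) -> float:
    """Combine component scores (assumed to lie in [0, 1]) into a 0-100 score."""
    if passed:
        # Weights for tasks that pass: 60% / 15% / 25%.
        raw = (
            0.6 * functional_score
            + 0.15 * recovery_score
            + 0.25 * efficiency_score
        )
    else:
        # Weights for failed tasks: 50% / 30% / 20%.
        raw = (
            0.5 * partial_credit
            + 0.3 * recovery_score
            + 0.2 * efficiency_score
        )
    # Scale to 0-100 and round to hide floating-point noise.
    return round(100 * raw, 2)
```

Under these assumptions, a failed task with half of the solution in place, full error recovery, and middling efficiency would be scored with `overall_score(False, partial_credit=0.5, recovery_score=1.0, efficiency_score=0.5)`.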

Task Structure

Each task defines:

task_id: Unique identifier (e.g., bash-001, edit-015)
category: One of the six categories
difficulty: easy, medium, or hard
description: What the model needs to accomplish
initial_state: Files to create before task execution
expected_outcome: What constitutes success
tools_required: Which tools the model should use
max_turns: Maximum tool-use turns allowed
verification_script: Python script to verify correctness
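
The fields above could be modelled roughly as shown here. The package describes its models as Pydantic models; this sketch uses a standard-library dataclass so that it runs without extra dependencies. The class name `TaskSketch`, the field types, the validation rules, and the example task are assumptions made for illustration, not the package's actual definitions.

```python
from dataclasses import dataclass, field


@dataclass
class TaskSketch:
    """Illustrative stand-in for a task definition; not the package's model."""

    task_id: str
    category: str
    difficulty: str
    description: str
    expected_outcome: str
    max_turns: int
    verification_script: str
    # Assumed shape: relative file path -> file contents.
    initial_state: dict[str, str] = field(default_factory=dict)
    tools_required: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Only the three difficulty levels named in the README are accepted.
        if self.difficulty not in {"easy", "medium", "hard"}:
            raise ValueError(f"unknown difficulty: {self.difficulty!r}")
        if self.max_turns < 1:
            raise ValueError("max_turns must be at least 1")


# Hypothetical task, invented for this example.
example = TaskSketch(
    task_id="bash-001",
    category="bash",
    difficulty="easy",
    description="Count the .log files under logs/.",
    expected_outcome="The count is written to count.txt.",
    max_turns=5,
    verification_script="assert open('count.txt').read().strip() == '2'",
    initial_state={"logs/a.log": "", "logs/b.log": ""},
    tools_required=["bash"],
)
```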

Task Counts

Category	Count
BASH	21
EDIT	22
READ	16
WRITE	16
MULTI-TOOL	16
ERROR RECOVERY	16
Total	107

Python API

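from bench_agent.evaluator import evaluate_model
from bench_agent.models import TaskCategory  # assumed location; TaskCategory is used below but was not imported
from bench_agent.runner import TaskRunner
from bench_agent.tasks import BASH_TASKS, EDIT_TASKS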

# Run evaluation
report = evaluate_model(
    model_name="gpt-4",
    provider="openai",
    categories=[TaskCategory.BASH, TaskCategory.EDIT],
    num_tasks=10,
)

print(f"Total Score: {report.total_score}")
print(f"Category Scores: {report.category_scores}")
print(f"Error Recovery Rate: {report.error_recovery_rate}")

Leaderboard

from bench_agent.leaderboard import load_leaderboard, update_leaderboard, export_markdown

# `results` is assumed to hold the results of a completed benchmark run
lb = load_leaderboard("leaderboard.json")
lb = update_leaderboard(lb, "gpt-4", results)
print(export_markdown(lb))

Architecture

src/bench_agent/
├── __init__.py          # Package init
├── models.py            # Pydantic data models
├── tasks.py             # 107 task definitions
├── runner.py            # Task execution runner
├── scorer.py            # Scoring system
├── leaderboard.py       # Leaderboard management
├── evaluator.py         # Model evaluation
└── cli.py               # Click CLI interface

Development

# Run tests
pytest tests/ -v

# Run with coverage
pytest tests/ -v --cov=bench_agent

# Lint
ruff check src/

License

MIT

Ecosystem

Part of the FableForge ecosystem — 21 open-source projects built from 210K real agent traces:

Project	Description
Anvil	Self-verified coding agent
VerifyLoop	Plan→Execute→Verify→Recover framework
ErrorRecovery	Self-healing middleware (3,725 error patterns)
FableForge-14B	The fine-tuned 14B model (4-stage training)
ShellWhisperer	1.5B edge agent (phone/RPi, 50ms)
ReasonCritic	Verification model (130 benchmark tasks)
TraceCompiler	Compile traces → LoRA skills
AgentRuntime	Persistent agent daemon (systemd for AI)
AgentSwarm	Multi-agent from real trace transitions
AgentTelemetry	Datadog for agents (token tracking, costs)
BenchAgent	HumanEval for tool-use (107 tasks)
AgentDev	VSCode extension with verification
TraceViz	Trace replay visualizer (Next.js)
AgentSkills	npm for agent behaviors
AgentCurriculum	5-stage progressive training
AgentFuzzer	Adversarial testing for agents
AgentConstitution	Safety guardrails from traces
CostOptimizer	Token cost reduction (50-80%)
AgentProfiler	Behavioral fingerprinting
TrajectoryDistiller	Trace→training data pipeline
Fable5-Dataset	HuggingFace dataset release

Project details

Release history

This version

0.1.0

Jun 14, 2026

Download files

Source Distribution

fableforge_bench_agent-0.1.0.tar.gz (27.9 kB)

Uploaded Jun 14, 2026 Source

Built Distribution

fableforge_bench_agent-0.1.0-py3-none-any.whl (29.6 kB)

Uploaded Jun 14, 2026 Python 3

File details

Details for the file fableforge_bench_agent-0.1.0.tar.gz.

File metadata

Download URL: fableforge_bench_agent-0.1.0.tar.gz
Upload date: Jun 14, 2026
Size: 27.9 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for fableforge_bench_agent-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`55bb54e7d44266a51f1c713dc7dd3f423a847762eb69277bd4c527a12c1a247a`
MD5	`526eaccf35e04bc1c327212e36c897c4`
BLAKE2b-256	`b63a6c189c8945fe775deaf834194a7673fdb500920ad29fd2a729b0c7ecbd35`
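
A downloaded file can be checked against the SHA256 digest listed above with Python's standard `hashlib` module. The sketch below is one way to do it; the helper name `sha256_matches` and the chunk size are illustrative choices, and the file path is wherever the archive was saved locally.

```python
import hashlib
from pathlib import Path


def sha256_matches(path: str, expected_hex: str, chunk_size: int = 65536) -> bool:
    """Return True if the file's SHA256 hex digest equals expected_hex."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex.strip().lower()
```

For the source distribution, the call would be `sha256_matches("fableforge_bench_agent-0.1.0.tar.gz", "55bb54e7d44266a51f1c713dc7dd3f423a847762eb69277bd4c527a12c1a247a")`, assuming the file sits in the current directory.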

Provenance

The following attestation bundles were made for fableforge_bench_agent-0.1.0.tar.gz:

Publisher: release.yml on KingLabsA/bench-agent

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: fableforge_bench_agent-0.1.0.tar.gz
- Subject digest: 55bb54e7d44266a51f1c713dc7dd3f423a847762eb69277bd4c527a12c1a247a
- Sigstore transparency entry: 1819977872
- Sigstore integration time: Jun 14, 2026
Source repository:
- Permalink: KingLabsA/bench-agent@12c6aea01153ab3035bb366e7287b49d9ad7e1c4
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/KingLabsA
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@12c6aea01153ab3035bb366e7287b49d9ad7e1c4
- Trigger Event: push

File details

Details for the file fableforge_bench_agent-0.1.0-py3-none-any.whl.

File metadata

Download URL: fableforge_bench_agent-0.1.0-py3-none-any.whl
Upload date: Jun 14, 2026
Size: 29.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for fableforge_bench_agent-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`7240b8643080c089f9d55821a963a040e538b2fe61a7253194ba0c754264ef0a`
MD5	`0c142572fc2551ed5d1e8b45e173cf4d`
BLAKE2b-256	`6027f29774cb11f9778ca1c0058569c7bd10491051df844713381e844c9b551b`

Provenance

The following attestation bundles were made for fableforge_bench_agent-0.1.0-py3-none-any.whl:

Publisher: release.yml on KingLabsA/bench-agent

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: fableforge_bench_agent-0.1.0-py3-none-any.whl
- Subject digest: 7240b8643080c089f9d55821a963a040e538b2fe61a7253194ba0c754264ef0a
- Sigstore transparency entry: 1819977903
- Sigstore integration time: Jun 14, 2026
Source repository:
- Permalink: KingLabsA/bench-agent@12c6aea01153ab3035bb366e7287b49d9ad7e1c4
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/KingLabsA
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@12c6aea01153ab3035bb366e7287b49d9ad7e1c4
- Trigger Event: push
