BenchAgent — HumanEval for Tool Use
A standardized benchmark for evaluating LLM tool-use capabilities across multiple categories: bash commands, code editing, code reading, code writing, multi-tool orchestration, and error recovery.
Installation
pip install fableforge-bench-agent
For development:
pip install -e ".[dev]"
Quick Start
# List available tasks
bench-agent list-tasks
# List tasks by category
bench-agent list-tasks --category bash
# Run benchmark against a model
bench-agent run --model gpt-4 --category bash
# Run all categories
bench-agent run --model fableforge-14b --all
# View leaderboard
bench-agent leaderboard
# Export leaderboard as markdown
bench-agent export --format markdown
Task Categories
BASH (21 tasks)
Shell command execution: finding files, processing text, managing processes, network operations, log parsing, and system administration tasks.
EDIT (22 tasks)
Code modification: fixing bugs, refactoring code, adding features, changing APIs, adding type hints, converting sync to async, error handling, and API evolution.
READ (16 tasks)
Code comprehension: understanding structure, finding patterns, tracing execution, identifying vulnerabilities, and explaining code behavior.
WRITE (16 tasks)
Code creation: generating new files, configuration, tests, Dockerfiles, project scaffolding, and CI/CD pipelines.
MULTI-TOOL (16 tasks)
Complex tasks requiring 3+ tools in sequence: read → analyze → modify → verify, full project setup, and multi-file refactoring.
ERROR RECOVERY (16 tasks)
Fixing broken code, recovering from errors, handling edge cases: syntax errors, runtime errors, race conditions, security vulnerabilities, and infinite loops.
Scoring Methodology
Each task produces a TaskResult, scored on three weighted metrics:
| Metric | Weight | Description |
|---|---|---|
| Functional correctness | 60% | Does the solution work as expected? |
| Efficiency | 25% | Fewer turns and tokens = higher score |
| Error recovery | 15% | How well does the model recover from errors? |
For failed tasks, partial credit applies:
| Component | Weight | Description |
|---|---|---|
| Partial completion | 50% | How close to a correct solution? |
| Error recovery rate | 30% | Were errors identified and addressed? |
| Efficiency | 20% | Resource usage despite failure |
Score Calculation
Overall Score = 0.6 * functional_score + 0.15 * recovery_score + 0.25 * efficiency_score
For failed tasks:
Score = 0.5 * partial_credit + 0.3 * recovery_score + 0.2 * efficiency_score
Final scores are scaled to 0–100.
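The two formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the code in scorer.py: the function name `overall_score`, its parameters, and the assumption that every component score lies in the range 0 to 1 are hypothetical.

```python
def overall_score(
    passed: bool,
    recovery_score: float,
    efficiency_score: float,
    functional_score: float = 0.0,
    partial_credit: float = 0.0,
) -> float:
    """Combine component scores (each assumed to be in [0, 1]) into a 0-100 score."""
    if passed:
        # Weights for passing tasks: 60% functional, 15% recovery, 25% efficiency.
        raw = 0.6 * functional_score + 0.15 * recovery_score + 0.25 * efficiency_score
    else:
        # Weights for failed tasks: 50% partial credit, 30% recovery, 20% efficiency.
        raw = 0.5 * partial_credit + 0.3 * recovery_score + 0.2 * efficiency_score
    return round(100 * raw, 2)


# Illustrative inputs only; these are not measured results.
passed_example = overall_score(True, 0.5, 0.8, functional_score=1.0)
failed_example = overall_score(False, 0.5, 0.8, partial_credit=0.4)
print(passed_example, failed_example)
```

With the same recovery and efficiency inputs, the failed task scores lower because partial credit replaces functional correctness and the weights shift toward error recovery.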
Task Structure
Each task defines:
- task_id: Unique identifier (e.g., bash-001, edit-015)
- category: One of the six categories
- difficulty: easy, medium, or hard
- description: What the model needs to accomplish
- initial_state: Files to create before task execution
- expected_outcome: What constitutes success
- tools_required: Which tools the model should use
- max_turns: Maximum tool-use turns allowed
- verification_script: Python script to verify correctness
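To make the fields above concrete, here is a rough sketch of how a task might be set up and verified. It is not the package's implementation: the package uses Pydantic models, while this sketch uses a standard-library dataclass to stay self-contained. The `Task` class, the `setup_workdir` and `verify` helpers, the defaults, the exit-code convention, and the example task contents are all assumptions made for illustration.

```python
import subprocess
import sys
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Task:
    """Hypothetical stand-in for a BenchAgent task definition."""
    task_id: str
    category: str
    difficulty: str
    description: str
    initial_state: dict[str, str]  # relative path -> file contents
    expected_outcome: str
    tools_required: list[str] = field(default_factory=list)
    max_turns: int = 10
    verification_script: str = ""


def setup_workdir(task: Task, workdir: Path) -> None:
    """Create the files listed in initial_state before the task runs."""
    for rel_path, contents in task.initial_state.items():
        target = workdir / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(contents)


def verify(task: Task, workdir: Path) -> bool:
    """Run the verification script in workdir; exit code 0 is treated as success."""
    result = subprocess.run(
        [sys.executable, "-c", task.verification_script],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


# Made-up task contents; only the ID format comes from the list above.
example = Task(
    task_id="bash-001",
    category="bash",
    difficulty="easy",
    description="Count the lines in app.log and write the number to count.txt.",
    initial_state={"app.log": "start\nerror\nstop\n"},
    expected_outcome="count.txt contains 3",
    tools_required=["bash"],
    max_turns=5,
    verification_script=(
        "from pathlib import Path\n"
        "assert Path('count.txt').read_text().strip() == '3'\n"
    ),
)

with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    setup_workdir(example, workdir)
    before = verify(example, workdir)  # count.txt does not exist yet
    (workdir / "count.txt").write_text("3\n")  # stands in for the model's tool calls
    after = verify(example, workdir)

print(before, after)
```

Running the verification script in a subprocess keeps a failing assertion from crashing the caller; the result is read from the exit code alone.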
Task Counts
| Category | Count |
|---|---|
| BASH | 21 |
| EDIT | 22 |
| READ | 16 |
| WRITE | 16 |
| MULTI-TOOL | 16 |
| ERROR RECOVERY | 16 |
| Total | 107 |
Python API
from bench_agent.evaluator import evaluate_model
from bench_agent.models import TaskCategory  # assumed location; the original snippet omits this import
from bench_agent.runner import TaskRunner
from bench_agent.tasks import BASH_TASKS, EDIT_TASKS

# Run evaluation on up to 10 tasks from the BASH and EDIT categories
report = evaluate_model(
    model_name="gpt-4",
    provider="openai",
    categories=[TaskCategory.BASH, TaskCategory.EDIT],
    num_tasks=10,
)

print(f"Total Score: {report.total_score}")
print(f"Category Scores: {report.category_scores}")
print(f"Error Recovery Rate: {report.error_recovery_rate}")
Leaderboard
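from bench_agent.leaderboard import load_leaderboard, update_leaderboard, export_markdown

# `results` is not defined in this snippet; it is assumed to hold the results of a previous benchmark run
lb = load_leaderboard("leaderboard.json")
lb = update_leaderboard(lb, "gpt-4", results)
print(export_markdown(lb))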
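For a sense of what a markdown export could look like without installing the package, here is a standalone sketch. It does not reproduce bench_agent.leaderboard: the function `render_markdown`, the dictionary layout, the column names, and the placeholder models and scores are all hypothetical.

```python
def render_markdown(scores: dict[str, float]) -> str:
    """Render model -> score (0-100) as a markdown table, best score first."""
    rows = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    lines = ["| Rank | Model | Score |", "|---|---|---|"]
    for rank, (model, score) in enumerate(rows, start=1):
        lines.append(f"| {rank} | {model} | {score:.1f} |")
    return "\n".join(lines)


# Placeholder names and scores, not benchmark results.
table = render_markdown({"model-a": 51.0, "model-b": 87.5})
print(table)
```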
Architecture
src/bench_agent/
├── __init__.py # Package init
├── models.py # Pydantic data models
├── tasks.py # 107 task definitions
├── runner.py # Task execution runner
├── scorer.py # Scoring system
├── leaderboard.py # Leaderboard management
├── evaluator.py # Model evaluation
└── cli.py # Click CLI interface
Development
# Run tests
pytest tests/ -v
# Run with coverage
pytest tests/ -v --cov=bench_agent
# Lint
ruff check src/
License
MIT
Ecosystem
Part of the FableForge ecosystem — 21 open-source projects built from 210K real agent traces:
| Project | Description |
|---|---|
| Anvil | Self-verified coding agent |
| VerifyLoop | Plan→Execute→Verify→Recover framework |
| ErrorRecovery | Self-healing middleware (3,725 error patterns) |
| FableForge-14B | The fine-tuned 14B model (4-stage training) |
| ShellWhisperer | 1.5B edge agent (phone/RPi, 50ms) |
| ReasonCritic | Verification model (130 benchmark tasks) |
| TraceCompiler | Compile traces → LoRA skills |
| AgentRuntime | Persistent agent daemon (systemd for AI) |
| AgentSwarm | Multi-agent from real trace transitions |
| AgentTelemetry | Datadog for agents (token tracking, costs) |
| BenchAgent | HumanEval for tool-use (107 tasks) |
| AgentDev | VSCode extension with verification |
| TraceViz | Trace replay visualizer (Next.js) |
| AgentSkills | npm for agent behaviors |
| AgentCurriculum | 5-stage progressive training |
| AgentFuzzer | Adversarial testing for agents |
| AgentConstitution | Safety guardrails from traces |
| CostOptimizer | Token cost reduction (50-80%) |
| AgentProfiler | Behavioral fingerprinting |
| TrajectoryDistiller | Trace→training data pipeline |
| Fable5-Dataset | HuggingFace dataset release |