BenchAgent — HumanEval for Tool Use

License: MIT | Python 3.10+

A standardized benchmark for evaluating LLM tool-use capabilities across multiple categories: bash commands, code editing, code reading, code writing, multi-tool orchestration, and error recovery.

Installation

pip install fableforge-bench-agent

For development:

pip install -e ".[dev]"

Quick Start

# List available tasks
bench-agent list-tasks

# List tasks by category
bench-agent list-tasks --category bash

# Run benchmark against a model
bench-agent run --model gpt-4 --category bash

# Run all categories
bench-agent run --model fableforge-14b --all

# View leaderboard
bench-agent leaderboard

# Export leaderboard as markdown
bench-agent export --format markdown

Task Categories

BASH (21 tasks)

Shell command execution: finding files, processing text, managing processes, network operations, log parsing, and system administration tasks.

EDIT (22 tasks)

Code modification: fixing bugs, refactoring code, adding features, changing APIs, adding type hints, converting sync to async, error handling, and API evolution.

READ (16 tasks)

Code comprehension: understanding structure, finding patterns, tracing execution, identifying vulnerabilities, and explaining code behavior.

WRITE (16 tasks)

Code creation: generating new files, configuration, tests, Dockerfiles, project scaffolding, and CI/CD pipelines.

MULTI-TOOL (16 tasks)

Complex tasks requiring 3+ tools in sequence: read → analyze → modify → verify, full project setup, and multi-file refactoring.
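The read → analyze → modify → verify sequence can be illustrated with a toy pipeline. This is only a sketch: `read_tool`, `analyze`, `edit_tool`, `verify_tool`, and the `calc.py` file are hypothetical stand-ins for whatever tools a model is given, not part of the BenchAgent API.

```python
import tempfile
from pathlib import Path

# Hypothetical buggy file that the pipeline is asked to repair.
BUGGY_SOURCE = "def add(a, b):\n    return a - b\n"


def read_tool(path: Path) -> str:
    """Step 1 (read): return the file contents."""
    return path.read_text()


def analyze(source: str) -> bool:
    """Step 2 (analyze): toy check that flags the known-bad expression."""
    return "a - b" in source


def edit_tool(path: Path, old: str, new: str) -> None:
    """Step 3 (modify): replace one snippet in the file."""
    path.write_text(path.read_text().replace(old, new))


def verify_tool(path: Path) -> bool:
    """Step 4 (verify): load the file and check the function's behavior."""
    namespace: dict = {}
    exec(path.read_text(), namespace)
    return namespace["add"](2, 3) == 5


def run_pipeline(workdir: Path) -> bool:
    """Chain the four steps and report whether verification passed."""
    target = workdir / "calc.py"
    target.write_text(BUGGY_SOURCE)
    source = read_tool(target)
    if analyze(source):
        edit_tool(target, "a - b", "a + b")
    return verify_tool(target)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(run_pipeline(Path(tmp)))
```

Each step consumes the output of the previous one, which is what distinguishes this category from single-tool tasks.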

ERROR RECOVERY (16 tasks)

Fixing broken code, recovering from errors, handling edge cases: syntax errors, runtime errors, race conditions, security vulnerabilities, and infinite loops.

Scoring Methodology

Each task produces a TaskResult, scored on three weighted metrics:

| Metric | Weight | Description |
| --- | --- | --- |
| Functional correctness | 60% | Does the solution work as expected? |
| Efficiency | 25% | Fewer turns and tokens = higher score |
| Error recovery | 15% | How well does the model recover from errors? |

For failed tasks, partial credit applies:

| Component | Weight | Description |
| --- | --- | --- |
| Partial completion | 50% | How close to a correct solution? |
| Error recovery rate | 30% | Were errors identified and addressed? |
| Efficiency | 20% | Resource usage despite failure |

Score Calculation

Overall Score = 0.6 * functional_score + 0.15 * recovery_score + 0.25 * efficiency_score

For failed tasks:

Score = 0.5 * partial_credit + 0.3 * recovery_score + 0.2 * efficiency_score

Final scores are scaled to 0–100.
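The two formulas can be sketched in a few lines of Python. This is an illustration, not the code in scorer.py: the function name `task_score` is hypothetical, and it assumes every component score lies in the range 0 to 1 and that scaling to 0–100 is a plain multiplication by 100.

```python
def task_score(
    passed: bool,
    functional: float = 0.0,
    recovery: float = 0.0,
    efficiency: float = 0.0,
    partial_credit: float = 0.0,
) -> float:
    """Combine component scores (each assumed to be in [0, 1]) into a 0-100 score."""
    if passed:
        # Weights from the first formula: 60% / 15% / 25%.
        raw = 0.6 * functional + 0.15 * recovery + 0.25 * efficiency
    else:
        # Failed tasks use the partial-credit weights: 50% / 30% / 20%.
        raw = 0.5 * partial_credit + 0.3 * recovery + 0.2 * efficiency
    # Assumed scaling step: multiply by 100 and round for display.
    return round(100 * raw, 2)


if __name__ == "__main__":
    # A passed task with perfect component scores.
    print(task_score(True, functional=1.0, recovery=1.0, efficiency=1.0))
    # A failed task that got halfway and recovered from every error it hit.
    print(task_score(False, partial_credit=0.5, recovery=1.0, efficiency=0.5))
```

Under these assumptions a failed task can still reach 100 if every component is 1.0; whether the real scorer caps failed tasks below passed ones is not stated here.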

Task Structure

Each task defines:

  • task_id: Unique identifier (e.g., bash-001, edit-015)
  • category: One of the six categories
  • difficulty: easy, medium, or hard
  • description: What the model needs to accomplish
  • initial_state: Files to create before task execution
  • expected_outcome: What constitutes success
  • tools_required: Which tools the model should use
  • max_turns: Maximum tool-use turns allowed
  • verification_script: Python script to verify correctness
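As a rough illustration of how these fields fit together, the sketch below models a task as a dataclass, writes `initial_state` into a working directory, and runs `verification_script` there. The real package uses Pydantic models in models.py; the field types, the helper names `materialize` and `verify`, the exit-code convention, and the `demo-001` task are all assumptions made for this example.

```python
import subprocess
import sys
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Task:
    task_id: str
    category: str
    difficulty: str
    description: str
    initial_state: dict[str, str]  # assumed shape: relative path -> file contents
    expected_outcome: str
    tools_required: list[str]
    max_turns: int
    verification_script: str  # assumed: Python source that exits 0 on success


def materialize(task: Task, workdir: Path) -> None:
    """Create the task's initial files inside workdir."""
    for rel_path, contents in task.initial_state.items():
        target = workdir / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(contents)


def verify(task: Task, workdir: Path) -> bool:
    """Run the verification script in workdir; exit code 0 counts as success."""
    result = subprocess.run(
        [sys.executable, "-c", task.verification_script],
        cwd=workdir,
        capture_output=True,
        text=True,
        timeout=30,
    )
    return result.returncode == 0


# Hypothetical task, not one of the 107 shipped definitions.
demo = Task(
    task_id="demo-001",
    category="bash",
    difficulty="easy",
    description="Write the number of ERROR lines in logs/app.log to errors.txt",
    initial_state={"logs/app.log": "INFO start\nERROR disk full\nINFO done\n"},
    expected_outcome="errors.txt contains 1",
    tools_required=["bash"],
    max_turns=5,
    verification_script=(
        "import sys\n"
        "from pathlib import Path\n"
        "sys.exit(0 if Path('errors.txt').read_text().strip() == '1' else 1)\n"
    ),
)

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        materialize(demo, workdir)
        print(verify(demo, workdir))  # before the model has done anything
        (workdir / "errors.txt").write_text("1\n")  # stand-in for the model's work
        print(verify(demo, workdir))
```

Running the verifier in a separate process keeps it isolated from the harness, so a crashing script simply counts as a failed check.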

Task Counts

| Category | Count |
| --- | --- |
| BASH | 21 |
| EDIT | 22 |
| READ | 16 |
| WRITE | 16 |
| MULTI-TOOL | 16 |
| ERROR RECOVERY | 16 |
| Total | 107 |
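The counts are not uniform, so a per-task average weights BASH and EDIT slightly more than the other categories. The short check below confirms the total and computes each category's share; whether the benchmark averages per task or per category is not stated here, so treat the shares as context only.

```python
# Task counts copied from the table above.
TASK_COUNTS = {
    "BASH": 21,
    "EDIT": 22,
    "READ": 16,
    "WRITE": 16,
    "MULTI-TOOL": 16,
    "ERROR RECOVERY": 16,
}

total = sum(TASK_COUNTS.values())

# Percentage of all tasks contributed by each category, to one decimal place.
shares = {name: round(100 * count / total, 1) for name, count in TASK_COUNTS.items()}

if __name__ == "__main__":
    print(total)
    for name, share in shares.items():
        print(f"{name}: {share}%")
```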

Python API

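from bench_agent.evaluator import evaluate_model
# Assumption: TaskCategory lives with the other data models in models.py
from bench_agent.models import TaskCategory
from bench_agent.runner import TaskRunner
from bench_agent.tasks import BASH_TASKS, EDIT_TASKS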

# Run evaluation
report = evaluate_model(
    model_name="gpt-4",
    provider="openai",
    categories=[TaskCategory.BASH, TaskCategory.EDIT],
    num_tasks=10,
)

print(f"Total Score: {report.total_score}")
print(f"Category Scores: {report.category_scores}")
print(f"Error Recovery Rate: {report.error_recovery_rate}")

Leaderboard

from bench_agent.leaderboard import load_leaderboard, update_leaderboard, export_markdown

lb = load_leaderboard("leaderboard.json")
# `results` is assumed to hold the output of a completed evaluation run
lb = update_leaderboard(lb, "gpt-4", results)
print(export_markdown(lb))
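To give a feel for what a markdown export involves, here is a self-contained sketch that ranks models by score and renders a table. It does not reproduce the actual output of `export_markdown`: the function `render_markdown`, the flat model-to-score mapping, and the column layout are assumptions, and the model names and scores are placeholders rather than real results.

```python
def render_markdown(scores: dict[str, float]) -> str:
    """Render a model -> score mapping as a ranked markdown table."""
    # Highest score first.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    lines = ["| Rank | Model | Score |", "| --- | --- | --- |"]
    for rank, (model, score) in enumerate(ranked, start=1):
        lines.append(f"| {rank} | {model} | {score:.1f} |")
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder entries for illustration only.
    print(render_markdown({"model-a": 61.25, "model-b": 74.5}))
```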

Architecture

src/bench_agent/
├── __init__.py          # Package init
├── models.py            # Pydantic data models
├── tasks.py             # 107 task definitions
├── runner.py            # Task execution runner
├── scorer.py            # Scoring system
├── leaderboard.py       # Leaderboard management
├── evaluator.py         # Model evaluation
└── cli.py               # Click CLI interface

Development

# Run tests
pytest tests/ -v

# Run with coverage
pytest tests/ -v --cov=bench_agent

# Lint
ruff check src/

License

MIT

Ecosystem

Part of the FableForge ecosystem — 21 open-source projects built from 210K real agent traces:

| Project | Description |
| --- | --- |
| Anvil | Self-verified coding agent |
| VerifyLoop | Plan→Execute→Verify→Recover framework |
| ErrorRecovery | Self-healing middleware (3,725 error patterns) |
| FableForge-14B | The fine-tuned 14B model (4-stage training) |
| ShellWhisperer | 1.5B edge agent (phone/RPi, 50ms) |
| ReasonCritic | Verification model (130 benchmark tasks) |
| TraceCompiler | Compile traces → LoRA skills |
| AgentRuntime | Persistent agent daemon (systemd for AI) |
| AgentSwarm | Multi-agent from real trace transitions |
| AgentTelemetry | Datadog for agents (token tracking, costs) |
| BenchAgent | HumanEval for tool-use (107 tasks) |
| AgentDev | VSCode extension with verification |
| TraceViz | Trace replay visualizer (Next.js) |
| AgentSkills | npm for agent behaviors |
| AgentCurriculum | 5-stage progressive training |
| AgentFuzzer | Adversarial testing for agents |
| AgentConstitution | Safety guardrails from traces |
| CostOptimizer | Token cost reduction (50-80%) |
| AgentProfiler | Behavioral fingerprinting |
| TrajectoryDistiller | Trace→training data pipeline |
| Fable5-Dataset | HuggingFace dataset release |
