Multi-turn agent benchmarking with ACP — run any agent, any model, any provider.
What
BenchFlow runs AI coding agents against benchmark tasks and captures their full trajectory. It combines Harbor (environments, verifier, orchestration) with ACP (multi-turn agent communication).
The agent runs inside a sandboxed environment (Docker or Daytona). BenchFlow connects to it via ACP over a live stdio pipe. You can send one prompt or many — the agent stays alive between prompts, maintaining full context.
Install
pip install benchflow
Requires Python 3.12+ and Docker (or a Daytona API key for cloud sandboxes).
Quick Start
source .env # ANTHROPIC_API_KEY (auto-inherited by SDK)
# Run a single task
benchflow run -t path/to/task -a claude-agent-acp -e daytona
# Run a full benchmark (89 tasks, 64 concurrent)
benchflow job -t .ref/terminal-bench-2 -e daytona -c 64
# List available agents
benchflow agents
# View results
benchflow metrics jobs/
benchflow view jobs/my-job/my-trial/
SDK
import asyncio

from benchflow import SDK, Job, JobConfig, collect_metrics


async def main():
    sdk = SDK()

    # Single task — API keys auto-inherited from os.environ
    result = await sdk.run(
        task_path="path/to/task",
        agent="claude-agent-acp",
        model="claude-haiku-4-5-20251001",
        environment="daytona",  # or "docker"
    )
    print(result.rewards)       # {"reward": 1.0}
    print(result.n_tool_calls)  # 17

    # Multi-turn — None = use task's instruction.md
    result = await sdk.run(
        task_path="path/to/task",
        agent="claude-agent-acp",
        prompts=[
            None,
            "Review your solution. Check for errors, test it, and fix any issues.",
        ],
        environment="daytona",
    )

    # Job — run a full benchmark with concurrency and retries
    job = Job(
        tasks_dir="path/to/tasks",
        jobs_dir="jobs/tb2",
        config=JobConfig(
            agent="claude-agent-acp",
            model="claude-haiku-4-5-20251001",
            environment="daytona",
            concurrency=64,
        ),
    )
    result = await job.run()
    print(f"{result.passed}/{result.total} ({result.score:.1%})")

    # Metrics — aggregate results from a jobs directory
    metrics = collect_metrics("jobs/tb2", benchmark="TB2")
    print(metrics.summary())


asyncio.run(main())
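The `None` entry in `prompts` stands for the task's own `instruction.md`. A minimal sketch of that substitution follows. `resolve_prompts` is a hypothetical helper written for illustration, not part of the BenchFlow API, and the default for an omitted `prompts` argument is an assumption.

```python
from pathlib import Path
from typing import Optional, Sequence


def resolve_prompts(
    task_path: str,
    prompts: Optional[Sequence[Optional[str]]] = None,
) -> list[str]:
    """Replace each None entry with the text of <task_path>/instruction.md.

    Hypothetical helper for illustration; not BenchFlow's implementation.
    """
    instruction = (Path(task_path) / "instruction.md").read_text(encoding="utf-8").strip()
    # Assumption: omitting prompts means one turn with the task instruction.
    if prompts is None:
        return [instruction]
    return [instruction if p is None else p for p in prompts]
```

With this shape, `[None, "Review your solution."]` becomes a two-turn conversation whose first turn is the task instruction.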
CLI
# Run a single task
benchflow run -t task/ -a claude-agent-acp -m claude-haiku-4-5-20251001 -e daytona
# Run a benchmark job
benchflow job -t tasks/ -a claude-agent-acp -c 64 -e daytona --retries 1
# List agents
benchflow agents
# View metrics
benchflow metrics jobs/tb2/ --json
benchflow metrics jobs/tb2/
# Evaluate a skill against tasks
benchflow eval -t tasks/ --skills-dir skills/ -a claude-agent-acp -e daytona
# List/install skills
benchflow skills
benchflow skills --install owner/repo@skill-name
# View trajectory
benchflow view jobs/tb2/my-trial/
# Create/validate tasks
benchflow tasks init my-task # scaffold a new task directory
benchflow tasks check tasks/my-task/ # validate task structure
Agents
Any ACP-compatible agent works. Registered agents are auto-installed in sandboxes.
benchflow agents # list registered agents
benchflow run -t task/ -a pi-acp -e daytona
See docs/tested-agents.md for the full list of tested agent × model/provider combinations.
Environments
| Environment | Concurrency | Notes |
|---|---|---|
| docker | ~4 | Local Docker. Limited by network exhaustion. |
| daytona | 64+ | Cloud sandboxes. Requires DAYTONA_API_KEY. |
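Since `daytona` needs `DAYTONA_API_KEY`, a quick preflight check before launching a large job can save a failed run. The sketch below is an assumption about how such a check might look. `check_environment` is a hypothetical name, and BenchFlow may validate this differently or not at all.

```python
import os
from typing import Mapping, Optional


def check_environment(
    environment: str,
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Return an error message, or None if the choice looks usable.

    Hypothetical preflight check; not part of BenchFlow.
    """
    env = os.environ if env is None else env
    if environment not in ("docker", "daytona"):
        return f"unknown environment: {environment!r}"
    # The table above states that daytona requires DAYTONA_API_KEY.
    if environment == "daytona" and not env.get("DAYTONA_API_KEY"):
        return "daytona requires DAYTONA_API_KEY to be set"
    return None
```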
How it Works
benchflow (host)                              Sandbox (Docker/Daytona)
    |                                         |
    |  1. Start environment (Harbor)          |
    |  2. Install ACP agent (npm)             |
    |  3. stdio pipe (exec/SSH) ------------> claude-agent-acp
    |                                         |
    |  ACP: initialize                        |
    |  ACP: session/new(cwd) ---------------> agent sees workspace, skills
    |  ACP: session/set_model(haiku) -------> model configured
    |  ACP: session/prompt("solve this") ---> agent uses Bash, Read, Write
    |  ACP: session/update <----------------- tool calls, messages, thoughts
    |  ACP: session/prompt("test it") ------> same session, full context
    |  ACP: session/update <----------------- more tool calls
    |                                         |
    |  4. Run verifier (Harbor) ------------> tests/test.sh → reward.txt
    |  5. Stop environment                    |
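The host-to-agent half of the exchange above can be sketched as newline-delimited JSON-RPC 2.0 requests written to the agent's stdin. The method names come from the diagram. The parameter field names (`protocolVersion`, `mcpServers`, `sessionId`, `modelId`, `prompt`) are assumptions for illustration, and in a real exchange the session ID would come from the `session/new` response rather than being passed in.

```python
import json


def frame(msg_id: int, method: str, params: dict) -> str:
    """Encode one JSON-RPC 2.0 request as a single newline-terminated line."""
    msg = {"jsonrpc": "2.0", "id": msg_id, "method": method, "params": params}
    return json.dumps(msg) + "\n"


def multi_turn_frames(session_id: str, cwd: str, model: str, prompts: list[str]) -> list[str]:
    """Requests the host would send, in order, for one multi-turn session.

    Illustrative only: parameter shapes are assumed, not taken from BenchFlow.
    """
    calls = [
        ("initialize", {"protocolVersion": 1}),
        ("session/new", {"cwd": cwd, "mcpServers": []}),
        ("session/set_model", {"sessionId": session_id, "modelId": model}),
    ]
    # Every prompt reuses the same session, so the agent keeps its context.
    for text in prompts:
        calls.append((
            "session/prompt",
            {"sessionId": session_id, "prompt": [{"type": "text", "text": text}]},
        ))
    return [frame(i, method, params) for i, (method, params) in enumerate(calls, start=1)]
```

The `session/update` messages flow the other way, from agent to host, and are what the trajectory capture records.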
Task Format
Tasks follow the Harbor task format:
my-task/
├── task.toml # timeouts, resources, metadata
├── instruction.md # what the agent should do
├── environment/
│ └── Dockerfile # sandbox setup
├── tests/
│ └── test.sh # verifier → reward.txt
└── solution/ # optional reference solution
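A quick structural check against this layout might look like the sketch below. It assumes every entry except `solution/` is required. `missing_task_files` is a hypothetical helper and is not a description of what `benchflow tasks check` does internally.

```python
from pathlib import Path

# Assumption: everything in the layout above except solution/ is required.
REQUIRED = ("task.toml", "instruction.md", "environment/Dockerfile", "tests/test.sh")


def missing_task_files(task_dir: str) -> list[str]:
    """Return the required files that are absent from task_dir, in layout order."""
    root = Path(task_dir)
    return [rel for rel in REQUIRED if not (root / rel).is_file()]
```

An empty list means the directory has the expected skeleton. It says nothing about whether `task.toml` or `test.sh` are themselves valid.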
Results
Every run produces structured output:
jobs/{job_name}/{trial_name}/
├── config.json # SDK.run() parameters (agent, model, environment)
├── result.json # rewards, agent, timing breakdown
├── timing.json # {environment_setup, agent_setup, agent_execution, verifier, total}
├── prompts.json # prompts sent
├── agent/
│ ├── install-stdout.txt # agent install output
│ └── {agent_name}.txt # agent stderr/debug output (hyphens → underscores)
├── trajectory/
│ └── acp_trajectory.jsonl # tool calls + agent thoughts
└── verifier/
    └── reward.txt # reward value
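Because the layout is plain files, a trial can be inspected without BenchFlow installed. The sketch below reads the reward, the `total` timing entry, and the number of trajectory events. It assumes `reward.txt` holds a single number and that each non-empty line of `acp_trajectory.jsonl` is one JSON object. `load_trial` and `agent_log_name` are hypothetical helpers.

```python
import json
from pathlib import Path


def load_trial(trial_dir: str) -> dict:
    """Summarize one trial directory using only the files listed above."""
    root = Path(trial_dir)
    # Assumption: reward.txt contains a single numeric value.
    reward = float((root / "verifier" / "reward.txt").read_text().strip())
    timing = json.loads((root / "timing.json").read_text())
    trajectory = root / "trajectory" / "acp_trajectory.jsonl"
    # JSONL: one JSON object per non-empty line.
    events = [json.loads(line) for line in trajectory.read_text().splitlines() if line.strip()]
    return {"reward": reward, "timing_total": timing.get("total"), "n_events": len(events)}


def agent_log_name(agent: str) -> str:
    """Map an agent name to its log file name (hyphens become underscores)."""
    return agent.replace("-", "_") + ".txt"
```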
Benchmark Results
| Benchmark | Agent | Model | Score |
|---|---|---|---|
| TB2 single-turn | codex-acp | GPT-5.4* | 69.7% (62/89) |
| TB2 single-turn | claude-agent-acp | Sonnet 4.6 | 58.4% (52/89) |
| TB2 multi-turn | codex-acp | GPT-5.4* | 62.9% (56/89) |
| TB2 multi-turn | claude-agent-acp | Haiku 4.5 | 37.1% (33/89) |
| SkillsBench | codex-acp | GPT-5.4* | 37.2% (32/86) |
*GPT-5.4 runs used effort=medium.
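Each score in the table is passed tasks divided by total tasks, rounded to one decimal place. The arithmetic can be reproduced as follows; `format_score` is a name chosen here for illustration.

```python
def format_score(passed: int, total: int) -> str:
    """Format a pass count the way the table above does."""
    if total <= 0:
        raise ValueError("total must be positive")
    return f"{passed / total:.1%} ({passed}/{total})"


print(format_score(62, 89))  # 69.7% (62/89)
print(format_score(32, 86))  # 37.2% (32/86)
```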
Skills
BenchFlow ships a Claude Code skill in .claude/skills/benchflow/ that teaches agents how to use the framework. Place skills in ~/.claude/skills/ (or bake them into task Dockerfiles) for auto-discovery.
Validation tasks in .claude/skills/benchflow/tasks/ confirm agents can use the skill correctly.
Architecture
BenchFlow provides:
- ACP client — multi-turn agent communication via live stdio pipe
- Job orchestration — concurrency, retries, resume, metrics
- Multi-agent registry — auto-install agents in sandboxes
- Trajectory capture — from ACP protocol
- Skills — teach agents to use BenchFlow itself
- Viewer — HTML trajectory visualization
- CLI — run, job, agents, metrics, view, eval, skills, tasks, cleanup
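The concurrency and retry behaviour of job orchestration can be sketched with a semaphore that caps how many trials are in flight. This is a generic asyncio pattern under stated assumptions, not BenchFlow's implementation: `run_all` and `run_one` are hypothetical names, and here a retry is triggered only when a trial raises.

```python
import asyncio
from typing import Awaitable, Callable


async def run_all(
    tasks: list[str],
    run_one: Callable[[str], Awaitable[bool]],
    concurrency: int = 4,
    retries: int = 0,
) -> dict[str, bool]:
    """Run every task with at most `concurrency` in flight.

    Assumption: a trial is retried only when run_one raises.
    """
    sem = asyncio.Semaphore(concurrency)

    async def guarded(task: str) -> bool:
        async with sem:  # blocks while `concurrency` trials are already running
            for attempt in range(retries + 1):
                try:
                    return await run_one(task)
                except Exception:
                    if attempt == retries:
                        return False  # out of attempts: count as failed
            return False

    results = await asyncio.gather(*(guarded(t) for t in tasks))
    return dict(zip(tasks, results))
```

A caller would pass a coroutine that runs one trial and returns whether it passed, for example `asyncio.run(run_all(task_names, run_one, concurrency=64, retries=1))`.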
Citation
If you use BenchFlow in academic work, please cite:
@software{BenchFlow_Team_BenchFlow_2026,
author = {{BenchFlow Team}},
month = mar,
title = {{BenchFlow: Multi-turn agent benchmarking with ACP}},
url = {https://github.com/benchflow-ai/benchflow},
year = {2026}
}
License
Apache License 2.0 — see LICENSE.