Multi-turn agent benchmarking with ACP — run any agent, any model, any provider.

BenchFlow

What

BenchFlow runs AI coding agents against benchmark tasks and captures their full trajectory. It combines Harbor (environments, verifier, orchestration) with ACP (multi-turn agent communication).

The agent runs inside a sandboxed environment (Docker or Daytona). BenchFlow connects to it via ACP over a live stdio pipe. You can send one prompt or many — the agent stays alive between prompts, maintaining full context.

Install

pip install benchflow

Requires Python 3.12+ and Docker (or a Daytona API key for cloud sandboxes).

Quick Start

source .env  # ANTHROPIC_API_KEY (auto-inherited by SDK)

# Run a single task
benchflow run -t path/to/task -a claude-agent-acp -e daytona

# Run a full benchmark (89 tasks, 64 concurrent)
benchflow job -t .ref/terminal-bench-2 -e daytona -c 64

# List available agents
benchflow agents

# View results
benchflow metrics jobs/
benchflow view jobs/my-job/my-trial/

SDK

import asyncio
from benchflow import SDK, Job, JobConfig, collect_metrics

async def main():
    sdk = SDK()

    # Single task — API keys auto-inherited from os.environ
    result = await sdk.run(
        task_path="path/to/task",
        agent="claude-agent-acp",
        model="claude-haiku-4-5-20251001",
        environment="daytona",  # or "docker"
    )
    print(result.rewards)       # {"reward": 1.0}
    print(result.n_tool_calls)  # 17

    # Multi-turn — None = use task's instruction.md
    result = await sdk.run(
        task_path="path/to/task",
        agent="claude-agent-acp",
        prompts=[
            None,
            "Review your solution. Check for errors, test it, and fix any issues.",
        ],
        environment="daytona",
    )

    # Job — run a full benchmark with concurrency and retries
    job = Job(
        tasks_dir="path/to/tasks",
        jobs_dir="jobs/tb2",
        config=JobConfig(
            agent="claude-agent-acp",
            model="claude-haiku-4-5-20251001",
            environment="daytona",
            concurrency=64,
        ),
    )
    result = await job.run()
    print(f"{result.passed}/{result.total} ({result.score:.1%})")

    # Metrics — aggregate results from a jobs directory
    metrics = collect_metrics("jobs/tb2", benchmark="TB2")
    print(metrics.summary())

asyncio.run(main())

CLI

# Run a single task
benchflow run -t task/ -a claude-agent-acp -m claude-haiku-4-5-20251001 -e daytona

# Run a benchmark job
benchflow job -t tasks/ -a claude-agent-acp -c 64 -e daytona --retries 1

# List agents
benchflow agents

# View metrics
benchflow metrics jobs/tb2/ --json
benchflow metrics jobs/tb2/

# Evaluate a skill against tasks
benchflow eval -t tasks/ --skills-dir skills/ -a claude-agent-acp -e daytona

# List/install skills
benchflow skills
benchflow skills --install owner/repo@skill-name

# View trajectory
benchflow view jobs/tb2/my-trial/

# Create/validate tasks
benchflow tasks init my-task     # scaffold a new task directory
benchflow tasks check tasks/my-task/  # validate task structure

Agents

Any ACP-compatible agent works. Registered agents are auto-installed in sandboxes.

benchflow agents              # list registered agents
benchflow run -t task/ -a pi-acp -e daytona

See docs/tested-agents.md for the full list of tested agent × model/provider combinations.

Environments

Environment   Concurrency   Notes
docker        ~4            Local Docker. Limited by network exhaustion.
daytona       64+           Cloud sandboxes. Requires DAYTONA_API_KEY.

How it Works

benchflow (host)                          Sandbox (Docker/Daytona)
     |                                         |
     |  1. Start environment (Harbor)          |
     |  2. Install ACP agent (npm)             |
     |  3. stdio pipe (exec/SSH) --------> claude-agent-acp
     |                                         |
     |  ACP: initialize                        |
     |  ACP: session/new(cwd) --------------> agent sees workspace, skills
     |  ACP: session/set_model(haiku) ------> model configured
     |  ACP: session/prompt("solve this") --> agent uses Bash, Read, Write
     |  ACP: session/update <---------------- tool calls, messages, thoughts
     |  ACP: session/prompt("test it") -----> same session, full context
     |  ACP: session/update <---------------- more tool calls
     |                                         |
     |  4. Run verifier (Harbor) -----------> tests/test.sh → reward.txt
     |  5. Stop environment                    |
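
The ACP exchange in the diagram can be sketched as newline-delimited JSON-RPC 2.0 requests written to the agent's stdin. The method names come from the diagram. The parameter shapes (`protocolVersion`, `cwd`, `sessionId`, `modelId`, `prompt`), the fixed session id, and the helpers `frame` and `build_session` are assumptions made for illustration; this is not BenchFlow's client code or the authoritative ACP schema.

```python
import itertools
import json

_ids = itertools.count(1)  # monotonically increasing JSON-RPC request ids


def frame(method: str, params: dict) -> str:
    """Serialize one JSON-RPC 2.0 request as a single newline-terminated line."""
    request = {"jsonrpc": "2.0", "id": next(_ids), "method": method, "params": params}
    return json.dumps(request) + "\n"


def build_session(cwd: str, model: str, prompts: list[str], session_id: str = "sess-1") -> list[str]:
    """Return the ordered request frames for one multi-turn session.

    Assumption: a real client would read the session id from the
    session/new response; it is fixed here so the sketch stays offline.
    """
    frames = [
        frame("initialize", {"protocolVersion": 1}),
        frame("session/new", {"cwd": cwd}),
        frame("session/set_model", {"sessionId": session_id, "modelId": model}),
    ]
    # Every prompt reuses the same session id, so the agent keeps its context.
    for text in prompts:
        frames.append(
            frame("session/prompt", {"sessionId": session_id, "prompt": [{"type": "text", "text": text}]})
        )
    return frames


if __name__ == "__main__":
    for line in build_session("/app", "haiku", ["solve this", "test it"]):
        print(line, end="")
```

Both `session/prompt` frames carry the same session id, which is what lets the second prompt ("test it") build on the work done for the first.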

Task Format

Tasks follow the Harbor task format:

my-task/
├── task.toml              # timeouts, resources, metadata
├── instruction.md         # what the agent should do
├── environment/
│   └── Dockerfile         # sandbox setup
├── tests/
│   └── test.sh            # verifier → reward.txt
└── solution/              # optional reference solution

Results

Every run produces structured output:

jobs/{job_name}/{trial_name}/
├── config.json              # SDK.run() parameters (agent, model, environment)
├── result.json              # rewards, agent, timing breakdown
├── timing.json              # {environment_setup, agent_setup, agent_execution, verifier, total}
├── prompts.json             # prompts sent
├── agent/
│   ├── install-stdout.txt   # agent install output
│   └── {agent_name}.txt     # agent stderr/debug output (hyphens → underscores)
├── trajectory/
│   └── acp_trajectory.jsonl # tool calls + agent thoughts
└── verifier/
    └── reward.txt           # reward value

Benchmark Results

Benchmark         Agent              Model        Score
TB2 single-turn   codex-acp          GPT-5.4*     69.7% (62/89)
TB2 single-turn   claude-agent-acp   Sonnet 4.6   58.4% (52/89)
TB2 multi-turn    codex-acp          GPT-5.4*     62.9% (56/89)
TB2 multi-turn    claude-agent-acp   Haiku 4.5    37.1% (33/89)
SkillsBench       codex-acp          GPT-5.4*     37.2% (32/86)

*GPT-5.4 runs used effort=medium.
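
Each score is the number of passed tasks divided by the total, rounded to one decimal place. The short sketch below recomputes the figures in the table from the passed/total counts; `format_score` is an illustrative helper, not a BenchFlow function.

```python
def format_score(passed: int, total: int) -> str:
    """Format a pass rate as a percentage with the raw counts in parentheses."""
    return f"{passed / total:.1%} ({passed}/{total})"


if __name__ == "__main__":
    # (passed, total) pairs copied from the results table.
    for passed, total in [(62, 89), (52, 89), (56, 89), (33, 89), (32, 86)]:
        print(format_score(passed, total))
```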

Skills

BenchFlow ships a Claude Code skill in .claude/skills/benchflow/ that teaches agents how to use the framework. Place skills in ~/.claude/skills/ (or bake them into task Dockerfiles) so agents discover them automatically.

Validation tasks in .claude/skills/benchflow/tasks/ confirm agents can use the skill correctly.

Architecture

BenchFlow provides:

  • ACP client — multi-turn agent communication via live stdio pipe
  • Job orchestration — concurrency, retries, resume, metrics
  • Multi-agent registry — auto-install agents in sandboxes
  • Trajectory capture — from ACP protocol
  • Skills — teach agents to use BenchFlow itself
  • Viewer — HTML trajectory visualization
  • CLI: run, job, agents, metrics, view, eval, skills, tasks, cleanup

Citation

If you use BenchFlow in academic work, please cite:

@software{BenchFlow_Team_BenchFlow_2026,
author = {{BenchFlow Team}},
month = mar,
title = {{BenchFlow: Multi-turn agent benchmarking with ACP}},
url = {https://github.com/benchflow-ai/benchflow},
year = {2026}
}

License

Apache License 2.0 — see LICENSE.
