Multi-turn agent benchmarking with ACP — run any agent, any model, any provider.

BenchFlow

Multi-turn agent benchmarking — Scene-based lifecycle for any ACP agent

What

BenchFlow runs AI agents against benchmark tasks in sandboxed environments. It supports single-agent, multi-agent, and multi-turn evaluation patterns through a Scene-based lifecycle.

Any ACP agent — Gemini CLI, Claude, Codex, OpenClaw, Pi, or your own
Multi-scene trials — skill generation → solve, coder → reviewer → revision
Cloud sandboxes — Daytona backend for parallel execution at scale
YAML-driven — same task folder, different trial configs for ablation

Install

pip install benchflow==0.3.0a3

Requires Python 3.12+. For cloud sandboxes, set DAYTONA_API_KEY.
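
Both requirements can be checked before a run. The following is a minimal sketch, not part of BenchFlow: `check_prerequisites` is a hypothetical helper, and only the Python 3.12 floor and the DAYTONA_API_KEY variable come from the text above.

```python
import os
import sys


def check_prerequisites(env=None, version=None, use_daytona=True):
    """Return a list of human-readable problems; an empty list means ready."""
    env = os.environ if env is None else env
    version = sys.version_info[:2] if version is None else tuple(version)[:2]
    problems = []
    if version < (3, 12):
        problems.append(f"Python 3.12+ required, found {version[0]}.{version[1]}")
    # The API key only matters when the Daytona cloud backend is used.
    if use_daytona and not env.get("DAYTONA_API_KEY"):
        problems.append("DAYTONA_API_KEY is not set (needed for cloud sandboxes)")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print(problem)
```

Passing `env` and `version` explicitly keeps the check easy to test without touching the real environment.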

Quick Start

CLI

# Run a single task with Gemini
bench eval create -t tasks/my-task -a gemini -m gemini-3.1-flash-lite-preview -e daytona

# Run from YAML config (batch, concurrent)
bench eval create -f benchmarks/tb2-gemini-baseline.yaml

# List agents
bench agent list

# Check task validity
bench tasks check tasks/my-task

Python

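from pathlib import Path

import benchflow as bf
from benchflow.trial import TrialConfig, Scene, Role, Turn

# The snippets below use top-level `await`: run them inside an async function
# (for example via asyncio.run) or in an async-aware REPL or notebook.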

# Simplest: one agent, one task
result = await bf.run("gemini", task_path="tasks/my-task", model="gemini-3.1-flash-lite-preview")
print(result.rewards)  # {"reward": 1.0}

# Scene-based: skill-gen → solve (BYOS pattern)
config = TrialConfig(
    task_path=Path("tasks/my-task"),
    scenes=[
        Scene(name="skill-gen",
              roles=[Role("gen", "gemini", "gemini-3.1-flash-lite-preview")],
              turns=[Turn("gen", "Analyze the task and write a skill to /app/generated-skill.md")]),
        Scene(name="solve",
              roles=[Role("solver", "gemini", "gemini-3.1-flash-lite-preview")],
              turns=[Turn("solver")]),  # no prompt given: the role uses the task's instruction.md
    ],
    environment="daytona",
)
result = await bf.run(config)

# Multi-agent: coder + reviewer
config = TrialConfig(
    task_path=Path("tasks/my-task"),
    scenes=[
        Scene(name="review-loop",
              roles=[
                  Role("coder", "gemini", "gemini-3.1-flash-lite-preview"),
                  Role("reviewer", "gemini", "gemini-3.1-flash-lite-preview"),
              ],
              turns=[
                  Turn("coder", "Solve the task. Write to /app/.outbox/reviewer.json when done."),
                  Turn("reviewer", "Review the coder's work. Write feedback to /app/.outbox/coder.json."),
                  Turn("coder", "Read the reviewer's feedback and revise your solution."),
              ]),
    ],
    environment="daytona",
)
result = await bf.run(config)
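
In a plain script, `await` must live inside a coroutine that is started with `asyncio.run`. The sketch below shows that wrapper. `fake_run` is a stand-in for `bf.run` so the example runs without BenchFlow installed; its return shape mirrors only the `{"reward": 1.0}` shown above and is otherwise an assumption.

```python
import asyncio
from types import SimpleNamespace


async def fake_run(agent, task_path, model):
    """Stand-in for bf.run (assumption): returns an object with .rewards."""
    await asyncio.sleep(0)  # yield to the event loop, as a real async call would
    return SimpleNamespace(rewards={"reward": 1.0})


async def main():
    # Swap fake_run for bf.run once BenchFlow is installed.
    result = await fake_run(
        "gemini",
        task_path="tasks/my-task",
        model="gemini-3.1-flash-lite-preview",
    )
    return result.rewards


if __name__ == "__main__":
    print(asyncio.run(main()))
```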

YAML Trial Config

# trial-baseline.yaml
task_dir: .ref/terminal-bench-2
agent: gemini
model: gemini-3.1-flash-lite-preview
environment: daytona
concurrency: 89

# trial-byos.yaml (same tasks, different config)
task_dir: .ref/terminal-bench-2
scenes:
  - name: skill-gen
    roles: [{name: gen, agent: gemini, model: gemini-3.1-flash-lite-preview}]
    turns: [{role: gen, prompt: "Generate a skill for this task..."}]
  - name: solve
    roles: [{name: solver, agent: gemini, model: gemini-3.1-flash-lite-preview}]

CLI Reference

bench agent list              List registered agents
bench agent show <name>       Agent details + conformance status

bench eval create             Create + run evaluation (returns job-id)
bench eval list               List completed evaluations

bench skills eval             Evaluate skill via evals.json

bench tasks init <name>       Scaffold new task
bench tasks check <dir>       Validate task (--rubric for custom)

bench train create            Reward-based training sweep

bench environment create      Spin up sandbox from task dir
bench environment list        List active sandboxes

Terminology

Term	Definition	Example
Turn	One prompt in one ACP session — one role acts	Coder writes a regex
Multi-turn	Same role, multiple turns	Self-review: agent → agent
Round	One A→B exchange between different roles	Coder → Reviewer
Multi-round	Different roles exchanging turns	Coder → Reviewer → Coder
Scene	Interaction region with roles + turns	A code-review scene
Trial	Sequence of scenes in a shared sandbox	Skill-gen → Solve
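
The distinction between multi-turn and multi-round comes down to whether consecutive turns change role. The sketch below encodes one reading of the table: `count_rounds` and `describe` are hypothetical helpers for illustration, not BenchFlow functions.

```python
def count_rounds(turn_roles):
    """Count hand-offs between different roles in an ordered list of turns."""
    return sum(1 for prev, cur in zip(turn_roles, turn_roles[1:]) if prev != cur)


def describe(turn_roles):
    """Label a sequence of turns using the terms from the table."""
    if len(turn_roles) <= 1:
        return "single turn"
    rounds = count_rounds(turn_roles)
    if rounds == 0:
        return "multi-turn"   # same role acts repeatedly
    if rounds == 1:
        return "round"        # one exchange between two roles
    return "multi-round"      # roles keep exchanging turns


if __name__ == "__main__":
    print(describe(["agent", "agent"]))
    print(describe(["coder", "reviewer"]))
    print(describe(["coder", "reviewer", "coder"]))
```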

Inter-role messaging: In multi-role scenes, agents communicate via outbox files. To send a message, an agent writes /app/.outbox/{recipient}.json containing {"to": "role", "content": "..."}. The scheduler reads these files after each turn and injects the message into the next role's prompt.
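
The outbox exchange can be sketched with plain files. Only the file name pattern and the two JSON keys come from the description above. The `send` and `inject` helpers, the injected prompt wording, and deleting the file after delivery are assumptions made for illustration, not BenchFlow's scheduler.

```python
import json
import tempfile
from pathlib import Path


def send(outbox: Path, recipient: str, content: str) -> Path:
    """Write a message file named after the recipient role."""
    outbox.mkdir(parents=True, exist_ok=True)
    path = outbox / f"{recipient}.json"
    path.write_text(json.dumps({"to": recipient, "content": content}))
    return path


def inject(outbox: Path, role: str, prompt: str) -> str:
    """Prepend a pending message for `role` to its prompt, then consume the file."""
    path = outbox / f"{role}.json"
    if not path.exists():
        return prompt
    message = json.loads(path.read_text())
    path.unlink()  # assumption: each message is delivered once
    return f"Message for {message['to']}: {message['content']}\n\n{prompt}"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        outbox = Path(tmp) / ".outbox"
        send(outbox, "reviewer", "Solution is ready for review.")
        print(inject(outbox, "reviewer", "Review the coder's work."))
```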

Architecture

Trial = sequence of Scenes in a shared sandbox
Scene = Roles + Turns (one interaction region)
Role  = agent + model
Turn  = one prompt for one role

bf.run(config)
  → Trial.create(config)
    → trial.setup()      # resolve config, create env object
    → trial.start()      # spin up sandbox, upload task files
    → for scene in config.scenes:
        → trial._run_scene(scene)  # connect/execute/disconnect per role
          → setup /app/.outbox/    # (multi-role scenes only)
          → for turn in scene.turns:
              → read outbox → inject messages into prompt
              → connect as role → execute → disconnect
    → trial.verify()     # run verifier, score
    → trial.cleanup()    # stop sandbox
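
The ordering above can be traced with a toy model that runs nothing. `ToyTurn`, `ToyScene`, and `trace` are hypothetical stand-ins, not BenchFlow classes. They mirror only the step order in the outline and the rule that the outbox is set up for multi-role scenes.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ToyTurn:
    role: str
    prompt: Optional[str] = None  # None means "use the task's instruction.md"


@dataclass
class ToyScene:
    name: str
    roles: List[str]
    turns: List[ToyTurn] = field(default_factory=list)


def trace(scenes: List[ToyScene]) -> List[str]:
    """Return the ordered lifecycle steps for a trial without executing anything."""
    steps = ["setup", "start"]
    for scene in scenes:
        if len(scene.roles) > 1:
            steps.append(f"{scene.name}: create outbox")  # multi-role scenes only
        for turn in scene.turns:
            source = "instruction.md" if turn.prompt is None else "custom prompt"
            steps.append(f"{scene.name}: {turn.role} ({source})")
    steps += ["verify", "cleanup"]
    return steps


if __name__ == "__main__":
    trial = [
        ToyScene("skill-gen", ["gen"], [ToyTurn("gen", "Write a skill.")]),
        ToyScene("solve", ["solver"], [ToyTurn("solver")]),
    ]
    for step in trace(trial):
        print(step)
```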

Registered Agents

Agent	Command	Auth
`gemini`	`gemini --acp --yolo`	GOOGLE_API_KEY
`claude-agent-acp`	`claude-agent-acp`	ANTHROPIC_API_KEY
`codex-acp`	`codex-acp`	OPENAI_API_KEY
`openclaw`	`openclaw-acp-shim`	inferred from model
`pi-acp`	`pi-acp`	ANTHROPIC_API_KEY

Adding a Custom Agent

Any ACP-native agent works. Create agent.toml:

name = "my-agent"
launch_cmd = "my-agent --acp"
install_cmd = "npm install -g my-agent"
requires_env = ["MY_API_KEY"]

Development

uv venv -p 3.12 .venv && uv pip install -e ".[dev]"
.venv/bin/python -m pytest tests/       # 580+ unit tests
.venv/bin/ty check src/                 # type check

Release history

Version	Released
0.3.2 (this version)	Apr 23, 2026
0.3.1	Apr 22, 2026
0.3.0	Apr 21, 2026
0.3.0a10 (pre-release)	Apr 20, 2026
0.3.0a9 (pre-release)	Apr 20, 2026
0.3.0a8 (pre-release)	Apr 20, 2026
0.3.0a7 (pre-release)	Apr 20, 2026
0.3.0a6 (pre-release)	Apr 20, 2026
0.3.0a5 (pre-release)	Apr 20, 2026
0.3.0a4 (pre-release)	Apr 20, 2026
0.3.0a3 (pre-release)	Apr 20, 2026
0.3.0a2 (pre-release)	Apr 20, 2026
0.3.0a1 (pre-release)	Apr 20, 2026
0.2.3	Apr 16, 2026
0.2.2	Apr 14, 2026
0.2.1	Apr 13, 2026
0.2.0	Apr 9, 2026
0.1.13	Mar 10, 2025
0.1.12	Mar 6, 2025
0.1.11	Mar 6, 2025
0.1.10	Mar 6, 2025
0.1.9	Feb 28, 2025
0.1.8	Feb 27, 2025
0.1.7	Feb 19, 2025
0.1.6	Feb 17, 2025
0.1.5	Feb 7, 2025
0.1.4	Feb 4, 2025
0.1.3	Jan 31, 2025
0.1.2	Jan 25, 2025
0.1.1	Jan 24, 2025
0.1.0	Jan 22, 2025

Download files

Source Distribution

benchflow-0.3.2.tar.gz (205.9 kB)

Upload date: Apr 23, 2026
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.13

Hashes for benchflow-0.3.2.tar.gz
Algorithm	Hash digest
SHA256	`b4dade6401c2292080f59fab1fa65bd2f8b3a26fda86bde4616971587092b871`
MD5	`6947492afbb72fbce42524177fdf1ec5`
BLAKE2b-256	`48363a9d44dc0e3347417da23dc4dfa01ded76ffa1496452080f85a59237c963`

Built Distribution

benchflow-0.3.2-py3-none-any.whl (154.9 kB)

Upload date: Apr 23, 2026
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.13

Hashes for benchflow-0.3.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e6089222278fe26651bea50a7c9305866362e60166ebc4bcc9d512ee1df3cb0d`
MD5	`6e5e6312d40ce95d4e2d6b080b6bf51a`
BLAKE2b-256	`4d689a27ef22da5148dba9345016a03831c1186b85a7304a939560fcafb3b661`