Skip to main content

Multi-turn agent benchmarking with ACP — run any agent, any model, any provider.

Project description

BenchFlow

Multi-turn agent benchmarking — Scene-based lifecycle for any ACP agent

PyPI Discord

What

BenchFlow runs AI agents against benchmark tasks in sandboxed environments. It supports single-agent, multi-agent, and multi-turn evaluation patterns through a Scene-based lifecycle.

  • Any ACP agent — Gemini CLI, Claude, Codex, OpenClaw, Pi, or your own
  • Multi-scene trials — skill generation → solve, coder → reviewer → revision
  • Cloud sandboxes — Daytona backend for parallel execution at scale
  • YAML-driven — same task folder, different trial configs for ablation

Install

pip install benchflow==0.3.0a3

Requires Python 3.12+. For cloud sandboxes, set DAYTONA_API_KEY.

Quick Start

CLI

# Run a single task with Gemini
bench eval create -t tasks/my-task -a gemini -m gemini-3.1-flash-lite-preview -e daytona

# Run from YAML config (batch, concurrent)
bench eval create -f benchmarks/tb2-gemini-baseline.yaml

# List agents
bench agent list

# Check task validity
bench tasks check tasks/my-task

Python

import benchflow as bf
from benchflow.trial import TrialConfig, Scene, Role, Turn

# Simplest: one agent, one task
result = await bf.run("gemini", task_path="tasks/my-task", model="gemini-3.1-flash-lite-preview")
print(result.rewards)  # {"reward": 1.0}

# Scene-based: skill-gen → solve (BYOS pattern)
config = TrialConfig(
    task_path=Path("tasks/my-task"),
    scenes=[
        Scene(name="skill-gen",
              roles=[Role("gen", "gemini", "gemini-3.1-flash-lite-preview")],
              turns=[Turn("gen", "Analyze the task and write a skill to /app/generated-skill.md")]),
        Scene(name="solve",
              roles=[Role("solver", "gemini", "gemini-3.1-flash-lite-preview")],
              turns=[Turn("solver")]),  # None prompt = use instruction.md
    ],
    environment="daytona",
)
result = await bf.run(config)

# Multi-agent: coder + reviewer
config = TrialConfig(
    task_path=Path("tasks/my-task"),
    scenes=[
        Scene(name="review-loop",
              roles=[
                  Role("coder", "gemini", "gemini-3.1-flash-lite-preview"),
                  Role("reviewer", "gemini", "gemini-3.1-flash-lite-preview"),
              ],
              turns=[
                  Turn("coder", "Solve the task. Write to /app/.outbox/reviewer.json when done."),
                  Turn("reviewer", "Review the coder's work. Write feedback to /app/.outbox/coder.json."),
                  Turn("coder", "Read the reviewer's feedback and revise your solution."),
              ]),
    ],
    environment="daytona",
)
result = await bf.run(config)

YAML Trial Config

# trial-baseline.yaml
task_dir: .ref/terminal-bench-2
agent: gemini
model: gemini-3.1-flash-lite-preview
environment: daytona
concurrency: 89

# trial-byos.yaml (same tasks, different config)
task_dir: .ref/terminal-bench-2
scenes:
  - name: skill-gen
    roles: [{name: gen, agent: gemini, model: gemini-3.1-flash-lite-preview}]
    turns: [{role: gen, prompt: "Generate a skill for this task..."}]
  - name: solve
    roles: [{name: solver, agent: gemini, model: gemini-3.1-flash-lite-preview}]

CLI Reference

bench agent list              List registered agents
bench agent show <name>       Agent details + conformance status

bench eval create             Create + run evaluation (returns job-id)
bench eval list               List completed evaluations

bench skills eval             Evaluate skill via evals.json

bench tasks init <name>       Scaffold new task
bench tasks check <dir>       Validate task (--rubric for custom)

bench train create            Reward-based training sweep

bench environment create      Spin up sandbox from task dir
bench environment list        List active sandboxes

Architecture

Trial = sequence of Scenes in a shared sandbox
Scene = Roles + Turns (one interaction region)
Role  = agent + model
Turn  = one prompt for one role

bf.run(config)
  → Trial.create(config)
    → trial.setup()      # resolve config, create env object
    → trial.start()      # spin up sandbox, upload task files
    → for scene in config.scenes:
        → trial._run_scene(scene)  # connect/execute/disconnect per role
    → trial.verify()     # run verifier, score
    → trial.cleanup()    # stop sandbox

Registered Agents

Agent Command Auth
gemini gemini --acp --yolo GOOGLE_API_KEY
claude-agent-acp claude-agent-acp ANTHROPIC_API_KEY
codex-acp codex-acp OPENAI_API_KEY
openclaw openclaw-acp-shim inferred from model
pi-acp pi-acp ANTHROPIC_API_KEY

Adding a Custom Agent

Any ACP-native agent works. Create agent.toml:

name = "my-agent"
launch_cmd = "my-agent --acp"
install_cmd = "npm install -g my-agent"
requires_env = ["MY_API_KEY"]

Development

uv venv -p 3.12 .venv && uv pip install -e ".[dev]"
.venv/bin/python -m pytest tests/       # 580+ unit tests
.venv/bin/ty check src/                 # type check

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

benchflow-0.3.0a10.tar.gz (201.5 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

benchflow-0.3.0a10-py3-none-any.whl (152.7 kB view details)

Uploaded Python 3

File details

Details for the file benchflow-0.3.0a10.tar.gz.

File metadata

  • Download URL: benchflow-0.3.0a10.tar.gz
  • Upload date:
  • Size: 201.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.2

File hashes

Hashes for benchflow-0.3.0a10.tar.gz
Algorithm Hash digest
SHA256 785c018bfa4ecd80f5d2608c0e659ec4e93fe6cdcdea0efb89c0b472fca68273
MD5 27d8dc1b5b448cc9a9ac39907c1abaae
BLAKE2b-256 c90110bd5ca79c99f478f82e4556edf2c6b0ade96680b1e3b6a8e79eb5eb0abf

See more details on using hashes here.

File details

Details for the file benchflow-0.3.0a10-py3-none-any.whl.

File metadata

  • Download URL: benchflow-0.3.0a10-py3-none-any.whl
  • Upload date:
  • Size: 152.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.2

File hashes

Hashes for benchflow-0.3.0a10-py3-none-any.whl
Algorithm Hash digest
SHA256 a2454e02a44e738b3274a18bdb61c4d3da00a7accb40bd8e0d1e3853c8a0e74f
MD5 21cb473fbb8ecb81a771718c9620d549
BLAKE2b-256 80dff0a80337403ba60d22024ac9fee6c26e3c325c774518b7eb935ea84a834f

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page