CooperBench

Multi-agent coordination benchmark for collaborative code generation

Can AI agents work together as teammates? CooperBench is the first benchmark designed to measure how well AI agents cooperate when each handles a separate task that may conflict with the other's.

We find that coordinating agents perform much worse than a single agent given the same total workload. This coordination deficit presents a fundamental barrier to deploying AI systems that can work alongside humans or other agents.

Installation

Quick Install (Planning Only)

pip install cooperbench[llm]
# or
uv pip install cooperbench[llm]

Full Install (Planning + Execution + Evaluation)

Execution requires custom OpenHands Docker images. Install from source:

# Clone with submodules
git clone --recurse-submodules https://github.com/cooperbench/cooperbench.git
cd cooperbench
pip install -e ".[all]"

# Build custom OpenHands images (requires Docker)
cd src/cooperbench/execution/openhands_colab
./build

This builds two Docker images:

  • colab/openhands_colab:latest - OpenHands core with MCP support
  • colab/openhands_runtime_colab:latest - Runtime environment

Experiment Settings

CooperBench supports four experiment modes:

| Setting         | Agents | Features | Communication | Description                          |
|-----------------|--------|----------|---------------|--------------------------------------|
| `single`        | 1      | 1        | N/A           | Baseline single-task performance     |
| `solo`          | 1      | 2        | N/A           | Single agent handling multiple tasks |
| `coop`          | 2      | 2        | Yes           | Full multi-agent coordination        |
| `coop_ablation` | 2      | 2        | Planning only | Ablation study                       |
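
The four settings in the table above can be summarized as a small enum. This is an illustrative sketch only; the actual `BenchSetting` enum in `cooperbench` may be structured differently:

```python
from dataclasses import dataclass
from enum import Enum


@dataclass(frozen=True)
class SettingSpec:
    """Shape of one experiment setting: how many agents, features, and whether they talk."""
    agents: int
    features: int
    communication: str


class Setting(Enum):
    SINGLE = SettingSpec(agents=1, features=1, communication="n/a")
    SOLO = SettingSpec(agents=1, features=2, communication="n/a")
    COOP = SettingSpec(agents=2, features=2, communication="yes")
    COOP_ABLATION = SettingSpec(agents=2, features=2, communication="planning only")
```

For example, `Setting.COOP.value.agents` is `2`, while the `solo` baseline gives the same two features to a single agent.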

CLI Usage

Quick Start

Run the full pipeline (plan → execute → evaluate) in one command:

cooperbench run \
    --setting coop \
    --repo-name pallets_jinja_task \
    --task-id 1621 \
    --feature1-id 1 \
    --feature2-id 2 \
    --model1 anthropic/claude-sonnet-4-5-20250929 \
    --model2 anthropic/claude-sonnet-4-5-20250929

Or run individual phases:

cooperbench plan --setting coop --repo-name pallets_jinja_task --task-id 1621 \
    --feature1-id 1 --feature2-id 2 --model1 gpt-5 --model2 gpt-5

cooperbench execute --setting coop --repo-name pallets_jinja_task --task-id 1621 \
    --feature1-id 1 --feature2-id 2 --model1 gpt-5 --model2 gpt-5

cooperbench evaluate --setting coop --repo-name pallets_jinja_task --task-id 1621 \
    --feature1-id 1 --feature2-id 2 --model1 gpt-5 --model2 gpt-5 --eval-type merge

CLI Options

| Option               | Description                                                |
|----------------------|------------------------------------------------------------|
| `--setting`, `-s`    | Experiment mode: `single`, `solo`, `coop`, `coop_ablation` |
| `--repo-name`        | Repository/task name                                       |
| `--task-id`          | Task number                                                |
| `--model1`, `-m1`    | Model for first agent                                      |
| `--model2`, `-m2`    | Model for second agent (coop modes)                        |
| `--feature1-id`, `-i`| First feature ID                                           |
| `--feature2-id`, `-j`| Second feature ID (non-`single` modes)                     |
| `--k`                | Experiment run identifier (default: 1)                     |
| `--save-to-hf`       | Save results to HuggingFace                                |
| `--create-pr`        | Create a PR when saving to HuggingFace                     |

Python API

Basic Usage

import asyncio
from cooperbench import BenchSetting, FileInterface
from cooperbench.planning import create_plan
from cooperbench.execution import create_execution
from cooperbench.evaluation import evaluate

async def run_experiment():
    interface = FileInterface(
        setting=BenchSetting.COOP,
        repo_name="pallets_jinja_task",
        task_id=1621,
        k=1,
        feature1_id=1,
        feature2_id=2,
        model1="anthropic/claude-sonnet-4-5-20250929",
        model2="anthropic/claude-sonnet-4-5-20250929",
    )
    
    await create_plan(interface, max_iterations=25)
    await create_execution(interface, plan_location="logs")
    await evaluate(interface, eval_type="merge", patch_location="logs")

asyncio.run(run_experiment())

Dataset Structure

CooperBench expects tasks organized as:

dataset/
  <repo_name>/
    task<id>/
      setup.sh          # Repository setup script
      run_tests.sh      # Test runner script
      feature1/
        feature.md      # Feature description
        feature.patch   # Golden implementation
        tests.patch     # Test cases
      feature2/
        feature.md
        feature.patch
        tests.patch
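
Given this layout, a task directory can be sanity-checked before running experiments. The sketch below is a stdlib-only helper based on the tree above; `validate_task` is a hypothetical function, not part of the cooperbench API:

```python
from pathlib import Path

# File names taken from the dataset layout above.
REQUIRED_TASK_FILES = ["setup.sh", "run_tests.sh"]
REQUIRED_FEATURE_FILES = ["feature.md", "feature.patch", "tests.patch"]


def validate_task(task_dir: Path) -> list[str]:
    """Return the relative paths missing from a single task<id> directory."""
    missing = [f for f in REQUIRED_TASK_FILES if not (task_dir / f).is_file()]
    for feature in ("feature1", "feature2"):
        for f in REQUIRED_FEATURE_FILES:
            if not (task_dir / feature / f).is_file():
                missing.append(f"{feature}/{f}")
    return missing
```

An empty return value means the task directory matches the expected structure.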

Output Structure

Results are saved to:

logs/<setting>/<repo_name>/task<task_id>/
  plan_<model>_k<k>_feature<id>.md      # Implementation plan
  patch_<model>_k<k>_feature<id>.patch  # Generated code
  planning_traj_<model>_k<k>.json       # Full trajectory

Environment Setup

Create a .env file:

# Required for LLM calls
ANTHROPIC_API_KEY=your_key_here
OPENAI_API_KEY=your_key_here

# Optional: HuggingFace for result storage
HF_TOKEN=your_token_here
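
If you want to load this file without an extra dependency, a minimal loader looks like the sketch below. This is an illustrative stdlib-only version (simple `KEY=VALUE` lines, `#` comments, no quoting or interpolation); cooperbench itself may load `.env` differently, e.g. via python-dotenv:

```python
import os
from pathlib import Path


def load_dotenv(path: str = ".env") -> None:
    """Load KEY=VALUE lines into os.environ without overriding existing values."""
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip())
```

Existing environment variables take precedence, so values exported in your shell are not clobbered by the file.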

Requirements

  • Python 3.12+
  • Docker (for execution phase)
  • Git

Benchmark Statistics

| Metric       | Value                         |
|--------------|-------------------------------|
| Tasks        | 652                           |
| Repositories | 12                            |
| Languages    | Python, TypeScript, Go, Rust  |
| Annotators   | 8                             |

Each task assigns two agents different features that can be implemented independently but may conflict without proper coordination.

Key Findings

  1. Agents perform worse together than alone — GPT-5 and Claude Sonnet 4.5 achieve only 25% success with two-agent cooperation, roughly 50% lower than when a single agent handles both tasks.

  2. Communication reduces conflicts but not failures — Agents spend up to 20% of their budget on communication, reducing merge conflicts but not improving overall success.

  3. Three capability gaps underlie coordination failures:

    • Expectation failures (42%) — agents fail to integrate partner state information
    • Communication failures (26%) — questions go unanswered, breaking decision loops
    • Commitment failures (32%) — agents break promises or make unverifiable claims

Development

pip install -e ".[dev]"
pytest tests/
ruff check .
mypy src/

Citation

@article{cooperbench2026,
  title={CooperBench: Why Coding Agents Cannot be Your Teammates Yet},
  author={Khatua*, Arpandeep and Zhu*, Hao and Tran†, Peter and Prabhudesai†, Arya
          and Sadrieh†, Frederic and Lieberwirth†, Johann K. and Yu, Xinkai
          and Fu, Yicheng and Ryan, Michael J. and Pei, Jiaxin and Yang, Diyi},
  journal={arXiv preprint},
  year={2026},
  url={https://arxiv.org/abs/2601.13295},
  note={*Equal contribution (Stanford) · †Equal contribution (SAP Labs)}
}

License

MIT
