CooperBench

Multi-agent coordination benchmark for collaborative code generation

Can AI agents work together as teammates? CooperBench is the first benchmark designed to measure how well AI agents cooperate when each handles a separate task that may conflict with the other's.

We find that coordinating agents perform much worse than a single agent given the same total workload. This coordination deficit presents a fundamental barrier to deploying AI systems that can work alongside humans or other agents.

Installation

Quick Install (Planning Only)

pip install cooperbench[llm]
# or
uv pip install cooperbench[llm]

Full Install (Planning + Execution + Evaluation)

Execution requires custom OpenHands Docker images. Install from source:

# Clone with submodules
git clone --recurse-submodules https://github.com/cooperbench/cooperbench.git
cd cooperbench
pip install -e ".[all]"

# Build custom OpenHands images (requires Docker)
cd src/cooperbench/execution/openhands_colab
./build

This builds two Docker images:

  • colab/openhands_colab:latest - OpenHands core with MCP support
  • colab/openhands_runtime_colab:latest - Runtime environment

Experiment Settings

CooperBench supports four experiment modes:

| Setting         | Agents | Features | Communication | Description                          |
|-----------------|--------|----------|---------------|--------------------------------------|
| `single`        | 1      | 1        | N/A           | Baseline single-task performance     |
| `solo`          | 1      | 2        | N/A           | Single agent handling multiple tasks |
| `coop`          | 2      | 2        | Yes           | Full multi-agent coordination        |
| `coop_ablation` | 2      | 2        | Planning only | Ablation study                       |
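
The four settings in the table above can be summarized as a small enum. This is an illustrative sketch only; the actual `BenchSetting` enum in `cooperbench` may be structured differently:

```python
from dataclasses import dataclass
from enum import Enum


@dataclass(frozen=True)
class SettingSpec:
    """Shape of one experiment setting: how many agents, features, and whether they talk."""
    agents: int
    features: int
    communication: str


class Setting(Enum):
    SINGLE = SettingSpec(agents=1, features=1, communication="n/a")
    SOLO = SettingSpec(agents=1, features=2, communication="n/a")
    COOP = SettingSpec(agents=2, features=2, communication="yes")
    COOP_ABLATION = SettingSpec(agents=2, features=2, communication="planning only")
```

For example, `Setting.COOP.value.agents` is `2`, while the `solo` baseline gives the same two features to a single agent.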

CLI Usage

Quick Start

Run the full pipeline (plan → execute → evaluate) in one command:

cooperbench run \
    --setting coop \
    --repo-name pallets_jinja_task \
    --task-id 1621 \
    --feature1-id 1 \
    --feature2-id 2 \
    --model1 anthropic/claude-sonnet-4-5-20250929 \
    --model2 anthropic/claude-sonnet-4-5-20250929

Or run individual phases:

cooperbench plan --setting coop --repo-name pallets_jinja_task --task-id 1621 \
    --feature1-id 1 --feature2-id 2 --model1 gpt-5 --model2 gpt-5

cooperbench execute --setting coop --repo-name pallets_jinja_task --task-id 1621 \
    --feature1-id 1 --feature2-id 2 --model1 gpt-5 --model2 gpt-5

cooperbench evaluate --setting coop --repo-name pallets_jinja_task --task-id 1621 \
    --feature1-id 1 --feature2-id 2 --model1 gpt-5 --model2 gpt-5 --eval-type merge

CLI Options

| Option               | Description                                                |
|----------------------|------------------------------------------------------------|
| `--setting`, `-s`    | Experiment mode: `single`, `solo`, `coop`, `coop_ablation` |
| `--repo-name`        | Repository/task name                                       |
| `--task-id`          | Task number                                                |
| `--model1`, `-m1`    | Model for first agent                                      |
| `--model2`, `-m2`    | Model for second agent (coop modes)                        |
| `--feature1-id`, `-i`| First feature ID                                           |
| `--feature2-id`, `-j`| Second feature ID (non-`single` modes)                     |
| `--k`                | Experiment run identifier (default: 1)                     |
| `--save-to-hf`       | Save results to HuggingFace                                |
| `--create-pr`        | Create a PR when saving to HuggingFace                     |

Python API

Basic Usage

import asyncio
from cooperbench import BenchSetting, FileInterface
from cooperbench.planning import create_plan
from cooperbench.execution import create_execution
from cooperbench.evaluation import evaluate

async def run_experiment():
    interface = FileInterface(
        setting=BenchSetting.COOP,
        repo_name="pallets_jinja_task",
        task_id=1621,
        k=1,
        feature1_id=1,
        feature2_id=2,
        model1="anthropic/claude-sonnet-4-5-20250929",
        model2="anthropic/claude-sonnet-4-5-20250929",
    )
    
    await create_plan(interface, max_iterations=25)
    await create_execution(interface, plan_location="logs")
    await evaluate(interface, eval_type="merge", patch_location="logs")

asyncio.run(run_experiment())

Dataset Structure

CooperBench expects tasks organized as:

dataset/
  <repo_name>/
    task<id>/
      setup.sh          # Repository setup script
      run_tests.sh      # Test runner script
      feature1/
        feature.md      # Feature description
        feature.patch   # Golden implementation
        tests.patch     # Test cases
      feature2/
        feature.md
        feature.patch
        tests.patch
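
Given this layout, a task directory can be sanity-checked before running experiments. The sketch below is a stdlib-only helper based on the tree above; `validate_task` is a hypothetical function, not part of the cooperbench API:

```python
from pathlib import Path

# File names taken from the dataset layout above.
REQUIRED_TASK_FILES = ["setup.sh", "run_tests.sh"]
REQUIRED_FEATURE_FILES = ["feature.md", "feature.patch", "tests.patch"]


def validate_task(task_dir: Path) -> list[str]:
    """Return the relative paths missing from a single task<id> directory."""
    missing = [f for f in REQUIRED_TASK_FILES if not (task_dir / f).is_file()]
    for feature in ("feature1", "feature2"):
        for f in REQUIRED_FEATURE_FILES:
            if not (task_dir / feature / f).is_file():
                missing.append(f"{feature}/{f}")
    return missing
```

An empty return value means the task directory matches the expected structure.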

Output Structure

Results are saved to:

logs/<setting>/<repo_name>/task<task_id>/
  plan_<model>_k<k>_feature<id>.md      # Implementation plan
  patch_<model>_k<k>_feature<id>.patch  # Generated code
  planning_traj_<model>_k<k>.json       # Full trajectory

Environment Setup

Create a .env file:

# Required for LLM calls
ANTHROPIC_API_KEY=your_key_here
OPENAI_API_KEY=your_key_here

# Optional: HuggingFace for result storage
HF_TOKEN=your_token_here
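
If you want to load this file without an extra dependency, a minimal loader looks like the sketch below. This is an illustrative stdlib-only version (simple `KEY=VALUE` lines, `#` comments, no quoting or interpolation); cooperbench itself may load `.env` differently, e.g. via python-dotenv:

```python
import os
from pathlib import Path


def load_dotenv(path: str = ".env") -> None:
    """Load KEY=VALUE lines into os.environ without overriding existing values."""
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip())
```

Existing environment variables take precedence, so values exported in your shell are not clobbered by the file.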

Requirements

  • Python 3.12+
  • Docker (for execution phase)
  • Git

Benchmark Statistics

| Metric       | Value                         |
|--------------|-------------------------------|
| Tasks        | 652                           |
| Repositories | 12                            |
| Languages    | Python, TypeScript, Go, Rust  |
| Annotators   | 8                             |

Each task assigns two agents different features that can be implemented independently but may conflict without proper coordination.

Key Findings

  1. Agents perform worse together than alone — GPT-5 and Claude Sonnet 4.5 achieve only 25% success with two-agent cooperation, roughly 50% lower than when a single agent handles both tasks.

  2. Communication reduces conflicts but not failures — Agents spend up to 20% of their budget on communication, reducing merge conflicts but not improving overall success.

  3. Three capability gaps underlie coordination failures:

    • Expectation failures (42%) — agents fail to integrate partner state information
    • Communication failures (26%) — questions go unanswered, breaking decision loops
    • Commitment failures (32%) — agents break promises or make unverifiable claims

Development

pip install -e ".[dev]"
pytest tests/
ruff check .
mypy src/

Citation

@article{cooperbench2026,
  title={CooperBench: Why Coding Agents Cannot be Your Teammates Yet},
  author={Khatua*, Arpandeep and Zhu*, Hao and Tran†, Peter and Prabhudesai†, Arya
          and Sadrieh†, Frederic and Lieberwirth†, Johann K. and Yu, Xinkai
          and Fu, Yicheng and Ryan, Michael J. and Pei, Jiaxin and Yang, Diyi},
  journal={arXiv preprint},
  year={2026},
  url={https://arxiv.org/abs/2601.13295},
  note={*Equal contribution (Stanford) · †Equal contribution (SAP Labs)}
}

License

MIT
