# CooperBench

Multi-agent coordination benchmark for collaborative code generation
Can AI agents work together as teammates? CooperBench is the first benchmark designed to measure how well AI agents can cooperate when handling individual tasks with potential conflicts.
We find that coordinating agents perform much worse than a single agent given the same total workload. This coordination deficit presents a fundamental barrier to deploying AI systems that can work alongside humans or other agents.
## Installation

```shell
pip install cooperbench
```

For development:

```shell
git clone https://github.com/cooperbench/CooperBench.git
cd CooperBench
pip install -e ".[dev]"
```
## Requirements

- Python 3.12+
- An execution backend (choose one): Modal, Docker, or GCP
- Redis (for inter-agent communication in coop mode)
## Setup

### Option 1: Modal (Default)

- Modal: sign up at modal.com and run `modal setup`
- Redis: run locally with `docker run -p 6379:6379 redis:7` or use a cloud provider
- LLM API keys: set them in a `.env` file:

```shell
ANTHROPIC_API_KEY=your_key
OPENAI_API_KEY=your_key
GEMINI_API_KEY=your_key
```
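Before launching runs, it can help to confirm the keys are actually visible to the process. The stdlib-only check below is our own addition, not part of CooperBench; only the keys for providers you actually use need to be set:

```python
import os

# Key names from the .env example above.
REQUIRED_KEYS = ["ANTHROPIC_API_KEY", "OPENAI_API_KEY", "GEMINI_API_KEY"]

def missing_keys(env=os.environ):
    """Return the API-key names that are unset or empty."""
    return [k for k in REQUIRED_KEYS if not env.get(k)]
```

Running `missing_keys()` in the environment you launch experiments from will list any provider keys that did not make it out of `.env`.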
### Option 2: GCP (Recommended for Scale)

Prerequisites: install the gcloud CLI first.

- macOS: `brew install google-cloud-sdk`
- Linux: `curl https://sdk.cloud.google.com | bash`

Setup:

```shell
# 1. Install GCP dependencies
pip install 'cooperbench[gcp]'

# 2. Run the configuration wizard (handles authentication, project setup, validation)
cooperbench config gcp

# 3. You're ready to run experiments!
cooperbench run --backend gcp -s lite
```

Also needed: Redis and LLM API keys (same as Option 1).

See the GCP Setup Guide for detailed instructions.
## Dataset

Download the benchmark dataset:

```shell
git clone https://huggingface.co/datasets/cooperbench/cooperbench dataset/
```
## Quick Start

### CLI

Run agents on a task:

```shell
# Run cooperative agents (2 agents, shared communication)
cooperbench run -n my-experiment -r llama_index_task -m gpt-4o

# Run a solo agent (1 agent handling both features)
cooperbench run -n my-experiment -r llama_index_task -m gpt-4o --setting solo

# Evaluate results
cooperbench eval -n my-experiment
```
### Python API

```python
from cooperbench import run, evaluate

# Run agents
run(
    run_name="my-experiment",
    repo="llama_index_task",
    model_name="gpt-4o",
    setting="coop",  # or "solo"
)

# Evaluate patches
evaluate(run_name="my-experiment")
```
## CLI Reference

### cooperbench config

Configure execution backends (GCP, Modal, etc.).

```shell
# Configure the GCP backend
cooperbench config gcp

# Skip validation tests for faster setup
cooperbench config gcp --skip-tests
```

See the GCP Setup Guide for details.
### cooperbench run

Run agents on benchmark tasks.

```shell
cooperbench run -n NAME [OPTIONS]
```

| Option | Description | Default |
|---|---|---|
| `-n, --name` | Experiment name (required) | - |
| `-r, --repo` | Filter by repository | all |
| `-t, --task` | Filter by task ID | all |
| `-f, --features` | Feature pair (e.g., `1,2`) | all pairs |
| `-m, --model` | LLM model | `gemini/gemini-3-flash-preview` |
| `-a, --agent` | Agent framework | `mini_swe_agent` |
| `-c, --concurrency` | Parallel tasks | 20 |
| `--setting` | `coop` or `solo` | `coop` |
| `--backend` | `modal`, `docker`, or `gcp` | `modal` |
| `--redis` | Redis URL | `redis://localhost:6379` |
| `--git` | Enable git collaboration | disabled |
| `--no-messaging` | Disable agent messaging | messaging enabled |
| `--force` | Rerun existing results | skip existing |
| `--agent-config` | Path to agent config file | none |
Agent Configuration: Pass agent-specific parameters via a config file. CooperBench forwards the file path to your agent without parsing it.
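Since CooperBench forwards the path unparsed, the file format is whatever your agent expects. As a hedged illustration only (the field names below are hypothetical, not a CooperBench schema), an agent might load a JSON config like this:

```python
import json
from pathlib import Path

# Hypothetical agent-side defaults; CooperBench itself never inspects
# the file passed via --agent-config.
DEFAULTS = {"temperature": 0.0, "max_steps": 50}

def load_agent_config(path):
    """Merge a JSON config file over the agent's built-in defaults."""
    config = dict(DEFAULTS)
    config.update(json.loads(Path(path).read_text()))
    return config
```

An agent invoked with `--agent-config my_agent.json` would then call `load_agent_config("my_agent.json")` at startup, with any keys absent from the file falling back to its defaults.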
### cooperbench eval

Evaluate completed runs.

```shell
cooperbench eval -n NAME [OPTIONS]
```

| Option | Description | Default |
|---|---|---|
| `-n, --name` | Experiment name (required) | - |
| `-r, --repo` | Filter by repository | all |
| `-t, --task` | Filter by task ID | all |
| `-f, --features` | Feature pair (e.g., `1,2`) | all pairs |
| `-c, --concurrency` | Parallel evaluations | 10 |
| `--backend` | `modal`, `docker`, or `gcp` | `modal` |
| `--force` | Re-evaluate existing results | skip existing |
## Experiment Settings

| Setting | Agents | Description |
|---|---|---|
| `coop` | 2 | Two agents with Redis messaging; each handles one feature |
| `solo` | 1 | A single agent handles both features sequentially |
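To make the coop setting concrete, here is a toy, stdlib-only sketch of the message-passing pattern two agents might follow. This is purely illustrative: CooperBench's actual Redis channel names and message schema are not documented here, so an in-memory queue stands in for each agent's inbox.

```python
import queue

# One inbox per agent stands in for a per-agent Redis channel.
inboxes = {"agent1": queue.Queue(), "agent2": queue.Queue()}

def send(sender, recipient, text):
    """Drop a message into the recipient's inbox."""
    inboxes[recipient].put({"from": sender, "text": text})

def drain(agent):
    """Read everything currently waiting in an agent's inbox."""
    messages = []
    while not inboxes[agent].empty():
        messages.append(inboxes[agent].get())
    return messages

# agent1 announces an interface decision before touching shared code,
# and agent2 acknowledges it.
send("agent1", "agent2", "I renamed Config.load to Config.from_file")
send("agent2", "agent1", "Ack, updating my call sites")
```

The point of the pattern is that each agent checks its inbox before acting on shared files, which is exactly the coordination step the Key Findings below show agents often skip.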
## Dataset Structure

```
dataset/
  <repo_name>/
    task<id>/
      setup.sh          # Repository setup script
      run_tests.sh      # Test runner script
      feature1/
        feature.md      # Feature description
        feature.patch   # Golden implementation
        tests.patch     # Test cases
      feature2/
        ...
```
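The layout above can be traversed mechanically. This small helper is our own sketch, assuming only the directory names shown, and yields one entry per feature description in the tree:

```python
from pathlib import Path

def list_features(dataset_root):
    """Yield (repo, task, feature) name triples for every feature.md."""
    root = Path(dataset_root)
    for feature_md in sorted(root.glob("*/task*/feature*/feature.md")):
        feature_dir = feature_md.parent     # e.g. feature1/
        task_dir = feature_dir.parent       # e.g. task3/
        repo_dir = task_dir.parent          # e.g. llama_index/
        yield repo_dir.name, task_dir.name, feature_dir.name
```

For example, `list(list_features("dataset"))` enumerates every repo/task/feature combination available for filtering with `-r`, `-t`, and `-f`.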
## Output Structure

Results are saved to `logs/`:

```
logs/<run_name>/<repo>/task<id>/features_<i>_<j>/
  agent1/
    trajectory.json   # Full agent trajectory
    patch.diff        # Generated patch
  agent2/
    ...
  eval.json           # Evaluation results
```
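Beyond `cooperbench eval`, the per-task `eval.json` files can be tallied directly. The schema of `eval.json` is not specified above, so the `success_key` field name in this sketch is an assumption; check a real file from your own runs before relying on it:

```python
import json
from pathlib import Path

def summarize_evals(logs_root, success_key="success"):
    """Tally eval.json files under a run directory.

    The success_key field name is assumed, not documented; inspect an
    actual eval.json to confirm what CooperBench writes.
    """
    total = passed = 0
    for eval_file in Path(logs_root).glob("**/eval.json"):
        result = json.loads(eval_file.read_text())
        total += 1
        passed += bool(result.get(success_key))
    return {"total": total, "passed": passed}
```

Calling `summarize_evals("logs/my-experiment")` after an evaluated run gives a quick pass count across all task/feature pairs.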
## Benchmark Statistics
| Metric | Value |
|---|---|
| Tasks | 652 |
| Repositories | 12 |
| Languages | Python, TypeScript, Go, Rust |
Key Findings
-
Agents perform worse together than alone — GPT-5 and Claude Sonnet 4.5 achieve only 25% success with two-agent cooperation, roughly 50% lower than when a single agent handles both tasks.
-
Communication reduces conflicts but not failures — Agents spend up to 20% of their budget on communication, reducing merge conflicts but not improving overall success.
-
Three capability gaps underlie coordination failures:
- Expectation failures (42%) — agents fail to integrate partner state information
- Communication failures (26%) — questions go unanswered, breaking decision loops
- Commitment failures (32%) — agents break promises or make unverifiable claims
## Development

```shell
# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

# Run integration tests (requires Modal)
pytest tests/ -v --run-modal

# Lint
ruff check src/
ruff format src/

# Type check
mypy src/cooperbench/
```
## Citation

```bibtex
@article{cooperbench2026,
  title={CooperBench: Why Coding Agents Cannot be Your Teammates Yet},
  author={Khatua*, Arpandeep and Zhu*, Hao and Tran†, Peter and Prabhudesai†, Arya
          and Sadrieh†, Frederic and Lieberwirth†, Johann K. and Yu, Xinkai
          and Fu, Yicheng and Ryan, Michael J. and Pei, Jiaxin and Yang, Diyi},
  journal={arXiv preprint},
  year={2026},
  url={https://arxiv.org/abs/2601.13295},
  note={*Equal contribution (Stanford) · †Equal contribution (SAP Labs)}
}
```
## License

MIT