Multi-agent coordination benchmark for collaborative code generation

CooperBench

Can AI agents work together as teammates? CooperBench is the first benchmark designed to measure how well AI agents can cooperate when handling individual tasks with potential conflicts.

We find that coordinating agents perform much worse than a single agent given the same total workload. This coordination deficit presents a fundamental barrier to deploying AI systems that can work alongside humans or other agents.

Installation

pip install cooperbench

For development:

git clone https://github.com/cooperbench/CooperBench.git
cd CooperBench
pip install -e ".[dev]"

Requirements

Python 3.12+
Modal account (for sandbox execution)
Redis (for inter-agent communication in coop mode)

Setup

Modal: Sign up at modal.com and run `modal setup`
Redis: Run locally with `docker run -p 6379:6379 redis:7` or use a cloud provider
LLM API keys: Set in a `.env` file:

ANTHROPIC_API_KEY=your_key
OPENAI_API_KEY=your_key
GEMINI_API_KEY=your_key
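
Before launching a run, it can help to confirm that the keys and the Redis endpoint from the setup steps are in place. The following is a minimal pre-flight sketch using only the standard library. The helper names (`parse_env`, `missing_keys`, `redis_endpoint`, `can_connect`) are hypothetical and not part of CooperBench, and the sketch assumes a plain `KEY=value` layout for `.env`. Whether all three keys are needed depends on which models you run, so treat the expected-key list as an assumption as well.

```python
import socket
from urllib.parse import urlparse

# Key names as listed in the setup section; which ones you actually need
# depends on the model provider you use (assumption).
EXPECTED_KEYS = ("ANTHROPIC_API_KEY", "OPENAI_API_KEY", "GEMINI_API_KEY")


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines; blank lines and # comments are skipped."""
    values: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_keys(env: dict[str, str], expected=EXPECTED_KEYS) -> list[str]:
    """Return the expected keys that are absent or empty."""
    return [key for key in expected if not env.get(key)]


def redis_endpoint(url: str = "redis://localhost:6379") -> tuple[str, int]:
    """Extract (host, port) from a redis:// URL, defaulting to port 6379."""
    parsed = urlparse(url)
    return parsed.hostname or "localhost", parsed.port or 6379


def can_connect(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A caller could read `.env` with `Path(".env").read_text()`, pass the text to `parse_env`, and then call `can_connect(*redis_endpoint())` before starting a `coop` run. A successful TCP connection only shows that something is listening on the port, not that it is a working Redis server.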

Dataset

Download the benchmark dataset:

git clone https://huggingface.co/datasets/cooperbench/cooperbench dataset/

Quick Start

CLI

Run agents on a task:

# Run cooperative agents (2 agents, shared communication)
cooperbench run -n my-experiment -r llama_index_task -m gpt-4o

# Run solo agent (1 agent handling both features)
cooperbench run -n my-experiment -r llama_index_task -m gpt-4o --setting solo

# Evaluate results
cooperbench eval -n my-experiment

Python API

from cooperbench import run, evaluate

# Run agents
run(
    run_name="my-experiment",
    repo="llama_index_task",
    model_name="gpt-4o",
    setting="coop",  # or "solo"
)

# Evaluate patches
evaluate(run_name="my-experiment")
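
Since the benchmark compares cooperative and solo performance, one natural pattern is to run the same repository under both settings and evaluate each. The sketch below assumes only the `run` and `evaluate` keyword arguments shown above. The helper `run_both_settings` and its `<base>-<setting>` naming scheme are hypothetical, and the functions are passed in as parameters so the helper can be exercised without Modal or Redis.

```python
from typing import Callable


def run_both_settings(
    run_fn: Callable[..., object],
    eval_fn: Callable[..., object],
    base_name: str,
    repo: str,
    model_name: str,
) -> list[str]:
    """Run and evaluate one repo under both settings; return the run names used."""
    run_names = []
    for setting in ("coop", "solo"):
        run_name = f"{base_name}-{setting}"  # hypothetical naming scheme
        run_fn(run_name=run_name, repo=repo, model_name=model_name, setting=setting)
        eval_fn(run_name=run_name)
        run_names.append(run_name)
    return run_names


# Intended usage (requires cooperbench, Modal, and Redis to be set up):
#   from cooperbench import run, evaluate
#   run_both_settings(run, evaluate, "my-experiment", "llama_index_task", "gpt-4o")
```

Using a separate run name per setting keeps the two sets of results in different directories under `logs/`.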

CLI Reference

`cooperbench run`

Run agents on benchmark tasks.

cooperbench run -n NAME [OPTIONS]

Option	Description	Default
`-n, --name`	Experiment name (required)	-
`-r, --repo`	Filter by repository	all
`-t, --task`	Filter by task ID	all
`-f, --features`	Feature pair (e.g., `1,2`)	all pairs
`-m, --model`	LLM model	`gemini/gemini-3-flash-preview`
`-a, --agent`	Agent framework	`mini_swe_agent`
`-c, --concurrency`	Parallel tasks	`20`
`--setting`	`coop` or `solo`	`coop`
`--redis`	Redis URL	`redis://localhost:6379`
`--git`	Enable git collaboration	disabled
`--no-messaging`	Disable agent messaging	messaging enabled
`--force`	Rerun tasks that already have results	skip existing

`cooperbench eval`

Evaluate completed runs.

cooperbench eval -n NAME [OPTIONS]

Option	Description	Default
`-n, --name`	Experiment name (required)	-
`-r, --repo`	Filter by repository	all
`-t, --task`	Filter by task ID	all
`-f, --features`	Feature pair (e.g., `1,2`)	all pairs
`-c, --concurrency`	Parallel evaluations	`10`
`--force`	Re-evaluate runs that already have results	skip existing

Experiment Settings

Setting	Agents	Description
`coop`	2	Two agents with Redis messaging, each handles one feature
`solo`	1	Single agent handles both features sequentially

Dataset Structure

dataset/
  <repo_name>/
    task<id>/
      setup.sh          # Repository setup script
      run_tests.sh      # Test runner script
      feature1/
        feature.md      # Feature description
        feature.patch   # Golden implementation
        tests.patch     # Test cases
      feature2/
        ...
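
The layout above can be walked with the standard library to list which feature pairs a local copy of the dataset contains. This is a sketch under stated assumptions: feature directories are named `feature<N>`, task directories start with `task`, and "all pairs" means every unordered combination of two features within a task. The helper names `feature_ids` and `iter_feature_pairs` are hypothetical and not part of CooperBench.

```python
from itertools import combinations
from pathlib import Path
from typing import Iterator


def feature_ids(task_dir: Path) -> list[int]:
    """Return the sorted numeric IDs of feature<N> directories in a task."""
    prefix = "feature"
    ids = []
    for child in task_dir.iterdir():
        suffix = child.name[len(prefix):]
        if child.is_dir() and child.name.startswith(prefix) and suffix.isdigit():
            ids.append(int(suffix))
    return sorted(ids)


def iter_feature_pairs(dataset_root: Path) -> Iterator[tuple[str, str, tuple[int, int]]]:
    """Yield (repo_name, task_dir_name, (i, j)) for every feature pair found."""
    repo_dirs = sorted(
        p for p in dataset_root.iterdir()
        if p.is_dir() and not p.name.startswith(".")  # skip .git and similar
    )
    for repo_dir in repo_dirs:
        task_dirs = sorted(
            p for p in repo_dir.iterdir()
            if p.is_dir() and p.name.startswith("task")
        )
        for task_dir in task_dirs:
            for pair in combinations(feature_ids(task_dir), 2):
                yield repo_dir.name, task_dir.name, pair
```

Calling `list(iter_feature_pairs(Path("dataset")))` gives a quick inventory that can be compared against the `-r`, `-t`, and `-f` filters before starting a run.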

Output Structure

Results are saved to logs/:

logs/<run_name>/<repo>/task<id>/features_<i>_<j>/
  agent1/
    trajectory.json     # Full agent trajectory
    patch.diff          # Generated patch
  agent2/
    ...
  eval.json             # Evaluation results
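
To inspect results across a run, the `eval.json` files can be gathered by following the directory pattern above. The sketch below makes no assumption about the contents of `eval.json` beyond it being valid JSON; the helper name `collect_eval_results` is hypothetical and not part of CooperBench.

```python
import json
from pathlib import Path


def collect_eval_results(logs_root: Path, run_name: str) -> dict[str, object]:
    """Map each <repo>/task<id>/features_<i>_<j> path to its parsed eval.json."""
    run_dir = logs_root / run_name
    results: dict[str, object] = {}
    # Pattern mirrors logs/<run_name>/<repo>/task<id>/features_<i>_<j>/eval.json
    for eval_path in sorted(run_dir.glob("*/task*/features_*/eval.json")):
        key = eval_path.parent.relative_to(run_dir).as_posix()
        with eval_path.open(encoding="utf-8") as handle:
            results[key] = json.load(handle)
    return results
```

For example, `collect_eval_results(Path("logs"), "my-experiment")` returns a dictionary keyed by relative path, which makes it easy to see which feature pairs have been evaluated so far.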

Benchmark Statistics

Metric	Value
Tasks	652
Repositories	12
Languages	Python, TypeScript, Go, Rust

Key Findings

- Agents perform worse together than alone: GPT-5 and Claude Sonnet 4.5 achieve only 25% success with two-agent cooperation, roughly 50% lower than when a single agent handles both tasks.
- Communication reduces conflicts but not failures: agents spend up to 20% of their budget on communication, reducing merge conflicts but not improving overall success.
- Three capability gaps underlie coordination failures:
  - Expectation failures (42%): agents fail to integrate partner state information
  - Communication failures (26%): questions go unanswered, breaking decision loops
  - Commitment failures (32%): agents break promises or make unverifiable claims

Development

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

# Run integration tests (requires Modal)
pytest tests/ -v --run-modal

# Lint
ruff check src/
ruff format src/

# Type check
mypy src/cooperbench/

Citation

@article{cooperbench2026,
  title={CooperBench: Why Coding Agents Cannot be Your Teammates Yet},
  author={Khatua*, Arpandeep and Zhu*, Hao and Tran†, Peter and Prabhudesai†, Arya
          and Sadrieh†, Frederic and Lieberwirth†, Johann K. and Yu, Xinkai
          and Fu, Yicheng and Ryan, Michael J. and Pei, Jiaxin and Yang, Diyi},
  journal={arXiv preprint},
  year={2026},
  url={https://arxiv.org/abs/2601.13295},
  note={*Equal contribution (Stanford) · †Equal contribution (SAP Labs)}
}

License

MIT

Release history

Version	Release date
0.0.14	Apr 30, 2026
0.0.13	Apr 30, 2026
0.0.12	Apr 30, 2026
0.0.11	Apr 18, 2026
0.0.10	Apr 18, 2026
0.0.9	Apr 17, 2026
0.0.8	Apr 17, 2026
0.0.7	Apr 17, 2026
0.0.6	Apr 17, 2026
0.0.5	Feb 15, 2026
0.0.4	Feb 15, 2026
0.0.3	Feb 4, 2026
0.0.2 (this version)	Feb 1, 2026
0.0.1	Jan 27, 2026

Download files

Source Distribution

cooperbench-0.0.2.tar.gz (142.6 kB)

Uploaded Feb 1, 2026 Source

Built Distribution

cooperbench-0.0.2-py3-none-any.whl (162.0 kB)

Uploaded Feb 1, 2026 Python 3

File details

Details for the file cooperbench-0.0.2.tar.gz.

File metadata

Download URL: cooperbench-0.0.2.tar.gz
Upload date: Feb 1, 2026
Size: 142.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for cooperbench-0.0.2.tar.gz
Algorithm	Hash digest
SHA256	`d2b321a7d26f689bdbd8d70457b226a56bfd312ca21c86c40ffcea74e9bc316e`
MD5	`a4a38bebd24516109b569c49c7c3cb3e`
BLAKE2b-256	`cd38d59005d6b9d1968f7ef849e51d86e567fff085a79dce2577f0973109a88f`

File details

Details for the file cooperbench-0.0.2-py3-none-any.whl.

File metadata

Download URL: cooperbench-0.0.2-py3-none-any.whl
Upload date: Feb 1, 2026
Size: 162.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for cooperbench-0.0.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`9bd84ce2eb32ec94b0ec7c46ee140114b277b21cda08d311484fc0fb52ae990d`
MD5	`9de83363592be31aa494245dbd6fe4f6`
BLAKE2b-256	`fafbd743b0e2dac3950049357128d9af761dbfa250fb6d60cf23d7dd34279670`
