Multi-agent coordination benchmark for collaborative code generation


CooperBench


Can AI agents work together as teammates? CooperBench is the first benchmark designed to measure how well AI agents cooperate when each works on its own task and those tasks can conflict.

We find that coordinating agents perform much worse than a single agent given the same total workload. This coordination deficit presents a fundamental barrier to deploying AI systems that can work alongside humans or other agents.

Installation

pip install cooperbench

For development:

git clone https://github.com/cooperbench/CooperBench.git
cd CooperBench
pip install -e ".[dev]"

Requirements

  • Python 3.12+
  • Execution Backend (choose one):
    • Modal (default, cloud-based)
    • GCP (Google Cloud Platform)
    • Docker (local execution)
  • Redis (for inter-agent communication in coop mode; see the connectivity check below)
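
To confirm Redis is reachable before a coop run, a minimal Python sketch (assuming the redis-py package and the default local URL used elsewhere in this README):

import redis

# Connect to the default local Redis used for coop-mode messaging.
client = redis.Redis.from_url("redis://localhost:6379")
client.ping()  # raises redis.exceptions.ConnectionError if Redis is unreachable
print("Redis is reachable")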

Setup

Option 1: Modal (Default)

  1. Modal: Sign up at modal.com and run modal setup
  2. Redis: Run locally with docker run -p 6379:6379 redis:7 or use a cloud provider
  3. LLM API keys: set them in a .env file:
ANTHROPIC_API_KEY=your_key
OPENAI_API_KEY=your_key
GEMINI_API_KEY=your_key
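
To sanity-check the keys before a run, a minimal sketch (assuming the python-dotenv package; the key names are the ones listed above):

import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current directory
for key in ("ANTHROPIC_API_KEY", "OPENAI_API_KEY", "GEMINI_API_KEY"):
    print(key, "set" if os.getenv(key) else "MISSING")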

Option 2: GCP (Recommended for Scale)

Prerequisites: install the gcloud CLI first:

  • macOS: brew install google-cloud-sdk
  • Linux: curl https://sdk.cloud.google.com | bash

Setup:

# 1. Install GCP dependencies
pip install 'cooperbench[gcp]'

# 2. Run configuration wizard (handles authentication, project setup, validation)
cooperbench config gcp

# 3. You're ready to run experiments!
cooperbench run --backend gcp -s lite

Also needed: Redis and LLM API keys (same as Option 1)

See GCP Setup Guide for detailed instructions.

Dataset

Download the benchmark dataset:

git clone https://huggingface.co/datasets/cooperbench/cooperbench dataset/

Quick Start

CLI

Run agents on a task:

# Run cooperative agents (2 agents, shared communication)
cooperbench run -n my-experiment -r llama_index_task -m gpt-4o

# Run solo agent (1 agent handling both features)
cooperbench run -n my-experiment -r llama_index_task -m gpt-4o --setting solo

# Evaluate results
cooperbench eval -n my-experiment

Python API

from cooperbench import run, evaluate

# Run agents
run(
    run_name="my-experiment",
    repo="llama_index_task",
    model_name="gpt-4o",
    setting="coop",  # or "solo"
)

# Evaluate patches
evaluate(run_name="my-experiment")
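
For example, to reproduce the coop-versus-solo comparison from the Key Findings section, a short sketch using only the parameters documented above:

from cooperbench import run, evaluate

# Run the same task under both settings, then evaluate each run.
for setting in ("coop", "solo"):
    name = f"compare-{setting}"
    run(run_name=name, repo="llama_index_task", model_name="gpt-4o", setting=setting)
    evaluate(run_name=name)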

CLI Reference

cooperbench config

Configure execution backends (GCP, Modal, etc.).

# Configure GCP backend
cooperbench config gcp

# Skip validation tests for faster setup
cooperbench config gcp --skip-tests

See GCP Setup Guide for details.

cooperbench run

Run agents on benchmark tasks.

cooperbench run -n NAME [OPTIONS]

Option               Description                   Default
-n, --name           Experiment name (required)    -
-r, --repo           Filter by repository          all
-t, --task           Filter by task ID             all
-f, --features       Feature pair (e.g., 1,2)      all pairs
-m, --model          LLM model                     gemini/gemini-3-flash-preview
-a, --agent          Agent framework               mini_swe_agent
-c, --concurrency    Parallel tasks                20
--setting            coop or solo                  coop
--backend            modal, docker, or gcp         modal
--redis              Redis URL                     redis://localhost:6379
--git                Enable git collaboration      disabled
--no-messaging       Disable agent messaging       enabled
--force              Rerun existing results        skip
--agent-config       Path to agent config file     none

Agent Configuration: Pass agent-specific parameters via a config file. CooperBench forwards the file path to your agent without parsing it.
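
Because the file is forwarded verbatim, its schema belongs to the agent framework, not CooperBench. A hypothetical sketch (the JSON keys are illustrative only; the CLI flags are the documented ones):

import json
import subprocess

# Hypothetical agent parameters; consult your agent framework for the real schema.
with open("agent_config.json", "w") as f:
    json.dump({"temperature": 0.2, "max_steps": 100}, f)

subprocess.run(
    ["cooperbench", "run", "-n", "my-experiment", "-r", "llama_index_task",
     "--agent-config", "agent_config.json"],
    check=True,
)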

cooperbench eval

Evaluate completed runs.

cooperbench eval -n NAME [OPTIONS]

Option               Description                    Default
-n, --name           Experiment name (required)     -
-r, --repo           Filter by repository           all
-t, --task           Filter by task ID              all
-f, --features       Feature pair (e.g., 1,2)       all pairs
-c, --concurrency    Parallel evaluations           10
--backend            modal, docker, or gcp          modal
--force              Re-evaluate existing results   skip

Experiment Settings

Setting   Agents   Description
coop      2        Two agents with Redis messaging, each handles one feature
solo      1        Single agent handles both features sequentially

Dataset Structure

dataset/
  <repo_name>/
    task<id>/
      setup.sh          # Repository setup script
      run_tests.sh      # Test runner script
      feature1/
        feature.md      # Feature description
        feature.patch   # Golden implementation
        tests.patch     # Test cases
      feature2/
        ...
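
Given this layout, a short sketch can enumerate the available tasks and their features (assuming the dataset was cloned into dataset/ as shown above):

from pathlib import Path

# List every repo/task and the feature directories it contains.
for task_dir in sorted(Path("dataset").glob("*/task*")):
    features = sorted(p.name for p in task_dir.glob("feature*") if p.is_dir())
    print(task_dir.parent.name, task_dir.name, features)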

Output Structure

Results are saved to logs/:

logs/<run_name>/<repo>/task<id>/features_<i>_<j>/
  agent1/
    trajectory.json     # Full agent trajectory
    patch.diff          # Generated patch
  agent2/
    ...
  eval.json             # Evaluation results
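
To gather results across a run, the eval.json files can be globbed directly. This sketch only loads and prints them, since their internal schema is not documented here:

import json
from pathlib import Path

run_name = "my-experiment"
for eval_path in sorted(Path("logs", run_name).glob("*/task*/features_*_*/eval.json")):
    with eval_path.open() as f:
        print(eval_path.relative_to("logs"), json.load(f))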

Benchmark Statistics

Metric         Value
Tasks          652
Repositories   12
Languages      Python, TypeScript, Go, Rust

Key Findings

  1. Agents perform worse together than alone — GPT-5 and Claude Sonnet 4.5 achieve only 25% success with two-agent cooperation, roughly 50% lower than when a single agent handles both tasks.

  2. Communication reduces conflicts but not failures — Agents spend up to 20% of their budget on communication, reducing merge conflicts but not improving overall success.

  3. Three capability gaps underlie coordination failures:

    • Expectation failures (42%) — agents fail to integrate partner state information
    • Communication failures (26%) — questions go unanswered, breaking decision loops
    • Commitment failures (32%) — agents break promises or make unverifiable claims

Development

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

# Run integration tests (requires Modal)
pytest tests/ -v --run-modal

# Lint
ruff check src/
ruff format src/

# Type check
mypy src/cooperbench/

Citation

@article{cooperbench2026,
  title={CooperBench: Why Coding Agents Cannot be Your Teammates Yet},
  author={Khatua*, Arpandeep and Zhu*, Hao and Tran†, Peter and Prabhudesai†, Arya
          and Sadrieh†, Frederic and Lieberwirth†, Johann K. and Yu, Xinkai
          and Fu, Yicheng and Ryan, Michael J. and Pei, Jiaxin and Yang, Diyi},
  journal={arXiv preprint},
  year={2026},
  url={https://arxiv.org/abs/2601.13295},
  note={*Equal contribution (Stanford) · †Equal contribution (SAP Labs)}
}

License

MIT
