Skip to main content

DPBench: A Benchmark for LLM Multi-Agent Coordination

Project description

DPBench

DPBench Architecture

A benchmark for evaluating LLM coordination under simultaneous resource contention.

License: MIT


Why DPBench?

Existing LLM benchmarks evaluate individual capabilities like reasoning (GSM8K), knowledge (MMLU), or coding (HumanEval). Multi-agent benchmarks typically use turn-based interaction where agents respond sequentially, but do not test simultaneous coordination under resource contention.

This capability matters for real deployments. Autonomous vehicles at intersections, collaborative robotics, and distributed systems all require agents to coordinate concurrent decisions without observing what others are doing. DPBench provides a standardized test for this capability.

What is DPBench?

DPBench is a framework built on the Dining Philosophers problem - a classic coordination challenge from distributed systems. The framework provides a standardized environment with automatic deadlock detection, two orchestration modes (simultaneous vs sequential), six reproducible metrics (deadlock rate, throughput, fairness, time to deadlock, starvation count, and message-action consistency), and eight experimental conditions that systematically vary decision timing, group size, and communication.

Our experiments show LLMs achieve near-zero deadlock in sequential mode but 25-95% deadlock rates in simultaneous mode, revealing a fundamental gap in coordination capabilities.

Installation

pip install dpbench

Quick Start

from dpbench import Benchmark

# Define your model (works with any LLM: API-based or local)
def my_model(system_prompt: str, user_prompt: str) -> str:
    # Your LLM call here
    return response

# Run benchmark
results = Benchmark.run(
    model_fn=my_model,
    system_prompt="System prompt here",
    decision_prompt="Decision prompt template",
    mode="simultaneous"
)

# Results
print(f"Deadlock Rate: {results['deadlock_rate']:.1%}")
print(f"Throughput: {results['avg_throughput']:.3f}")
print(f"Fairness: {results['avg_fairness']:.3f}")

See experiments/prompts/ for prompt templates used in our experiments.

How It Works

The Dining Philosophers Problem

N philosophers sit around a table with N forks between them. Each philosopher needs two adjacent forks to eat, but each fork can only be held by one philosopher. When all philosophers simultaneously grab one fork, they deadlock - each holding one fork and waiting for their neighbor's fork, creating a circular dependency.

This problem isolates the core challenge of resource coordination: agents must make compatible decisions without directly observing others' current actions.

Framework Architecture

Environment: Circular table with configurable number of philosophers (N) and N forks. Four actions per agent: GRAB_LEFT, GRAB_RIGHT, RELEASE, WAIT. Automatic deadlock detection when all agents are hungry and each holds exactly one fork. Partial observability enforces realistic constraints.

Orchestration: Simultaneous mode executes all agent decisions in parallel without state updates between decisions, testing true concurrent coordination. Sequential mode processes decisions one at a time with state updates after each action, providing an easier baseline.

Metrics: Six standardized metrics ensure reproducible evaluation. Deadlock rate captures coordination failure. Throughput measures efficiency as meals per timestep. Fairness uses Gini-normalized distribution. Time to deadlock, starvation count, and message-action consistency provide diagnostic information.

Standard Conditions

Eight conditions systematically vary three factors:

Code Decision Mode Philosophers Communication
sim5nc Simultaneous 5 No
sim5c Simultaneous 5 Yes
seq5nc Sequential 5 No
seq5c Sequential 5 Yes
sim3nc Simultaneous 3 No
sim3c Simultaneous 3 Yes
seq3nc Sequential 3 No
seq3c Sequential 3 Yes

Benchmark Results

We evaluated frontier LLMs to validate the framework and establish baselines. Results demonstrate that DPBench successfully distinguishes coordination capabilities across models and conditions.

Key Finding: Models show asymmetric performance. Sequential coordination succeeds (near 0% deadlock) while simultaneous coordination fails (25-95% deadlock), revealing that current LLMs struggle with concurrent resource decisions.

Sequential vs Simultaneous Performance

Model Comparison

Models coordinate effectively in sequential mode but exhibit high deadlock rates when decisions must be simultaneous.

Communication Does Not Solve Coordination

Communication Effect

Enabling inter-agent messaging does not reduce deadlock. Message latency (arriving one timestep late) and low intention-action consistency prevent effective coordination through communication alone.

Full Condition Breakdown

Performance by Condition

Deadlock patterns persist across group sizes and communication settings, demonstrating systematic coordination failures in simultaneous modes.

Reproducing Our Experiments

git clone https://github.com/najmulhasan-code/dpbench.git
cd dpbench
pip install -e .

# Configure API keys (only needed to reproduce our specific experiments)
cp .env.example .env
# Edit .env with your API keys for OpenAI, Anthropic, Google, and xAI

# Run experiments
python experiments/scripts/run_full.py

The experiments in this repository use API-based models (GPT, Claude, Gemini, Grok), but the dpbench framework itself works with any model including local models. Configurations are in experiments/configs/. Modify experiments/configs/models.yaml to test your own models.

Citation

# Citation will be added upon publication

License

MIT License - see LICENSE for details.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dpbench-0.1.0.tar.gz (27.5 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

dpbench-0.1.0-py3-none-any.whl (28.8 kB view details)

Uploaded Python 3

File details

Details for the file dpbench-0.1.0.tar.gz.

File metadata

  • Download URL: dpbench-0.1.0.tar.gz
  • Upload date:
  • Size: 27.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.4

File hashes

Hashes for dpbench-0.1.0.tar.gz
Algorithm Hash digest
SHA256 eb94bb58b1855224c39949561c078d81deb71bb6aca3ff9b40dc0c1b64df2e4e
MD5 b42f3eda4ad8226d807e94545114dac7
BLAKE2b-256 09defc1e6ccbca2f5290ad39b02b0133ed018cef94e71baec3dd785b746bba1e

See more details on using hashes here.

File details

Details for the file dpbench-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: dpbench-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 28.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.4

File hashes

Hashes for dpbench-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 45bdc48b9dca4a87272048996fe27b0537242c0102d3fb1b6dcef71684f5adc9
MD5 61119502dd7abf58e473f3cef4baa08f
BLAKE2b-256 b2267ccb4f779ce6ee669d3ad9c53a3b68bc6812ec413d9c971b66951d4dac83

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page