DPBench: A Benchmark for LLM Multi-Agent Coordination
DPBench
A benchmark for evaluating LLM coordination under simultaneous resource contention.
Why DPBench?
Existing LLM benchmarks evaluate individual capabilities like reasoning (GSM8K), knowledge (MMLU), or coding (HumanEval). Multi-agent benchmarks typically use turn-based interaction where agents respond sequentially, but do not test simultaneous coordination under resource contention.
This capability matters for real deployments. Autonomous vehicles at intersections, collaborative robotics, and distributed systems all require agents to coordinate concurrent decisions without observing what others are doing. DPBench provides a standardized test for this capability.
What is DPBench?
DPBench is a framework built on the Dining Philosophers problem, a classic coordination challenge from distributed systems. The framework provides a standardized environment with automatic deadlock detection, two orchestration modes (simultaneous and sequential), six reproducible metrics (deadlock rate, throughput, fairness, time to deadlock, starvation count, and message-action consistency), and eight experimental conditions that systematically vary decision timing, group size, and communication.
Our experiments show LLMs achieve near-zero deadlock in sequential mode but 25-95% deadlock rates in simultaneous mode, revealing a fundamental gap in coordination capabilities.
Installation
pip install dpbench
Quick Start
from dpbench import Benchmark

# Define your model (works with any LLM: API-based or local)
def my_model(system_prompt: str, user_prompt: str) -> str:
    # Replace this line with your LLM call; it must return the model's reply as a string
    raise NotImplementedError("call your LLM here and return its text response")

# Run benchmark
results = Benchmark.run(
    model_fn=my_model,
    system_prompt="System prompt here",
    decision_prompt="Decision prompt template",
    mode="simultaneous",
)

# Results
print(f"Deadlock Rate: {results['deadlock_rate']:.1%}")
print(f"Throughput: {results['avg_throughput']:.3f}")
print(f"Fairness: {results['avg_fairness']:.3f}")
See experiments/prompts/ for prompt templates used in our experiments.
How It Works
The Dining Philosophers Problem
N philosophers sit around a table with N forks, one between each pair of neighbors. Each philosopher needs both adjacent forks to eat, but a fork can be held by only one philosopher at a time. When all philosophers simultaneously grab the fork on the same side (say, the left), they deadlock: each holds one fork and waits for the fork a neighbor is holding, creating a circular dependency.
This problem isolates the core challenge of resource coordination: agents must make compatible decisions without directly observing others' current actions.
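The circular dependency described above can be reproduced in a few lines. This is a minimal sketch, not DPBench code: the fork numbering (philosopher `p` has fork `p` on the left and fork `(p + 1) % n` on the right) and the function names are assumptions made for illustration.

```python
from typing import List, Optional

# owners[i] is the philosopher currently holding fork i, or None if it is free.
# Assumed convention for this sketch: philosopher p's left fork is p,
# and their right fork is (p + 1) % n.


def everyone_grabs_left(n: int) -> List[Optional[int]]:
    """Return fork ownership after all n philosophers grab their left fork at once."""
    owners: List[Optional[int]] = [None] * n
    for p in range(n):
        if owners[p] is None:  # each left fork is distinct, so every grab succeeds
            owners[p] = p
    return owners


def can_take_right(owners: List[Optional[int]], p: int) -> bool:
    """True if philosopher p's right fork is still free."""
    return owners[(p + 1) % len(owners)] is None


owners = everyone_grabs_left(5)
blocked = [p for p in range(5) if not can_take_right(owners, p)]
print(blocked)  # every philosopher is waiting on a neighbor's fork
```

Every philosopher holds exactly one fork and none can acquire the second, which is the state the benchmark counts as a deadlock.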
Framework Architecture
Environment: Circular table with configurable number of philosophers (N) and N forks. Four actions per agent: GRAB_LEFT, GRAB_RIGHT, RELEASE, WAIT. Automatic deadlock detection when all agents are hungry and each holds exactly one fork. Partial observability enforces realistic constraints.
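The deadlock rule stated above (all agents hungry, each holding exactly one fork) can be written as a single predicate. The sketch below is an illustration of that rule only; `PhilosopherState` and `is_deadlocked` are hypothetical names, not part of the dpbench API.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class PhilosopherState:
    """Hypothetical per-agent state used only for this sketch."""
    hungry: bool
    holds_left: bool
    holds_right: bool


def is_deadlocked(states: Sequence[PhilosopherState]) -> bool:
    """True when every philosopher is hungry and holds exactly one fork."""
    if not states:
        return False
    # holds_left != holds_right is True only when exactly one fork is held.
    return all(s.hungry and (s.holds_left != s.holds_right) for s in states)


stuck = [PhilosopherState(hungry=True, holds_left=True, holds_right=False) for _ in range(5)]
print(is_deadlocked(stuck))
```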
Orchestration: Simultaneous mode executes all agent decisions in parallel without state updates between decisions, testing true concurrent coordination. Sequential mode processes decisions one at a time with state updates after each action, providing an easier baseline.
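The difference between the two modes comes down to which state each agent sees when it decides. The following sketch illustrates that idea under stated assumptions: the fork indexing, the `cautious_policy` rule, and all function names are invented for this example, and only the `GRAB_LEFT` and `WAIT` actions are modeled. It is not how dpbench implements orchestration.

```python
from typing import Callable, List, Optional

Forks = List[Optional[int]]  # forks[i] is the holder of fork i, or None if free
Policy = Callable[[Forks, int], str]


def cautious_policy(forks: Forks, p: int) -> str:
    """Hypothetical policy: grab the left fork only if both adjacent forks look free."""
    right = (p + 1) % len(forks)
    if forks[p] is None and forks[right] is None:
        return "GRAB_LEFT"
    return "WAIT"


def apply_action(forks: Forks, p: int, action: str) -> None:
    """Apply one action; philosopher p's left fork is fork p in this sketch."""
    if action == "GRAB_LEFT" and forks[p] is None:
        forks[p] = p


def step_simultaneous(forks: Forks, policy: Policy) -> List[str]:
    """All agents decide from the same snapshot; state changes only afterwards."""
    snapshot = list(forks)
    actions = [policy(snapshot, p) for p in range(len(forks))]
    for p, action in enumerate(actions):
        apply_action(forks, p, action)
    return actions


def step_sequential(forks: Forks, policy: Policy) -> List[str]:
    """Agents decide one at a time, each seeing the effects of earlier actions."""
    actions = []
    for p in range(len(forks)):
        action = policy(forks, p)
        apply_action(forks, p, action)
        actions.append(action)
    return actions


sim_forks: Forks = [None] * 5
seq_forks: Forks = [None] * 5
sim_actions = step_simultaneous(sim_forks, cautious_policy)
seq_actions = step_sequential(seq_forks, cautious_policy)
print(sim_forks)  # every fork taken: each philosopher holds one
print(seq_forks)  # the last philosopher saw fork 0 taken and waited
```

With the same policy, the simultaneous step ends in the one-fork-each state, while the sequential step leaves a fork free because the last agent observed its neighbor's earlier action.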
Metrics: Six standardized metrics ensure reproducible evaluation. Deadlock rate captures coordination failure. Throughput measures efficiency as meals per timestep. Fairness uses Gini-normalized distribution. Time to deadlock, starvation count, and message-action consistency provide diagnostic information.
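As a rough illustration of two of these metrics, the sketch below computes throughput as meals per timestep and a fairness score derived from the Gini coefficient. The text does not give the exact fairness formula, so `fairness = 1 - Gini` is an assumption here, as are the function names and the choice to treat an all-zero distribution as perfectly equal.

```python
from typing import Sequence


def throughput(meals: Sequence[int], timesteps: int) -> float:
    """Total meals eaten across all philosophers, divided by elapsed timesteps."""
    if timesteps <= 0:
        raise ValueError("timesteps must be positive")
    return sum(meals) / timesteps


def gini(values: Sequence[float]) -> float:
    """Standard Gini coefficient: 0 for perfect equality, approaching 1 for maximal inequality."""
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0  # assumption: no meals at all is treated as equal
    abs_diffs = sum(abs(a - b) for a in values for b in values)
    return abs_diffs / (2 * n * total)


def fairness(meals: Sequence[int]) -> float:
    """Assumed definition for this sketch: 1 minus the Gini coefficient of meal counts."""
    return 1.0 - gini(meals)


print(throughput([3, 2, 1], timesteps=20))
print(fairness([2, 2, 2]))
print(fairness([6, 0, 0]))
```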
Standard Conditions
Eight conditions systematically vary three factors:
| Code | Decision Mode | Philosophers | Communication |
|---|---|---|---|
| sim5nc | Simultaneous | 5 | No |
| sim5c | Simultaneous | 5 | Yes |
| seq5nc | Sequential | 5 | No |
| seq5c | Sequential | 5 | Yes |
| sim3nc | Simultaneous | 3 | No |
| sim3c | Simultaneous | 3 | Yes |
| seq3nc | Sequential | 3 | No |
| seq3c | Sequential | 3 | Yes |
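The condition codes follow a regular pattern: mode prefix (`sim` or `seq`), number of philosophers, then `c` or `nc` for communication. A small parser makes the encoding explicit. This is a sketch for illustration; the `Condition` class and `parse_condition` function are hypothetical and not part of the dpbench API.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Condition:
    """Hypothetical container for one experimental condition."""
    mode: str            # "simultaneous" or "sequential"
    philosophers: int
    communication: bool


_CODE_PATTERN = re.compile(r"^(sim|seq)(\d+)(nc|c)$")


def parse_condition(code: str) -> Condition:
    """Decode a condition code such as 'sim5nc' into its three factors."""
    match = _CODE_PATTERN.match(code)
    if match is None:
        raise ValueError(f"unrecognized condition code: {code!r}")
    prefix, count, comm = match.groups()
    mode = "simultaneous" if prefix == "sim" else "sequential"
    return Condition(mode=mode, philosophers=int(count), communication=(comm == "c"))


# The eight standard codes: 2 modes x 2 group sizes x 2 communication settings.
STANDARD_CODES = [f"{m}{n}{c}" for n in (5, 3) for m in ("sim", "seq") for c in ("nc", "c")]
print(STANDARD_CODES)
print(parse_condition("sim5nc"))
```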
Benchmark Results
We evaluated frontier LLMs to validate the framework and establish baselines. Results demonstrate that DPBench successfully distinguishes coordination capabilities across models and conditions.
Key Finding: Models show asymmetric performance. Sequential coordination succeeds (near 0% deadlock) while simultaneous coordination fails (25-95% deadlock), revealing that current LLMs struggle with concurrent resource decisions.
Sequential vs Simultaneous Performance
Models coordinate effectively in sequential mode but exhibit high deadlock rates when decisions must be simultaneous.
Communication Does Not Solve Coordination
Enabling inter-agent messaging does not reduce deadlock. Message latency (arriving one timestep late) and low intention-action consistency prevent effective coordination through communication alone.
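To see why one-timestep latency limits what messages can achieve, consider a mailbox in which anything sent during step t becomes readable only at step t + 1. The class below is a hypothetical sketch of that delay, assuming nothing about how dpbench stores or routes messages.

```python
from typing import Dict, List


class DelayedMailbox:
    """Hypothetical mailbox: messages sent this timestep are readable next timestep."""

    def __init__(self) -> None:
        self._pending: Dict[int, List[str]] = {}      # sent during the current step
        self._deliverable: Dict[int, List[str]] = {}  # sent during the previous step

    def send(self, recipient: int, message: str) -> None:
        """Queue a message; it is not visible until advance() is called."""
        self._pending.setdefault(recipient, []).append(message)

    def read(self, recipient: int) -> List[str]:
        """Return messages that were sent to this recipient in the previous step."""
        return list(self._deliverable.get(recipient, []))

    def advance(self) -> None:
        """Move to the next timestep: pending messages become deliverable."""
        self._deliverable = self._pending
        self._pending = {}


box = DelayedMailbox()
box.send(1, "I will grab my left fork")
same_step = box.read(1)   # nothing yet: the neighbor decides without this message
box.advance()
next_step = box.read(1)   # the message arrives after the decision it described
print(same_step, next_step)
```

An agent announcing its intention cannot influence the decisions its neighbors make in that same step, so the announcement only helps if the sender's later action still matches it.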
Full Condition Breakdown
Deadlock patterns persist across group sizes and communication settings, demonstrating systematic coordination failures in simultaneous modes.
Reproducing Our Experiments
git clone https://github.com/najmulhasan-code/dpbench.git
cd dpbench
pip install -e .
# Configure API keys (only needed to reproduce our specific experiments)
cp .env.example .env
# Edit .env with your API keys for OpenAI, Anthropic, Google, and xAI
# Run experiments
python experiments/scripts/run_full.py
The experiments in this repository use API-based models (GPT, Claude, Gemini, Grok), but the dpbench framework itself works with any model including local models. Configurations are in experiments/configs/. Modify experiments/configs/models.yaml to test your own models.
Citation
# Citation will be added upon publication
License
MIT License - see LICENSE for details.