DPBench: A Benchmark for LLM Multi-Agent Coordination

DPBench

[Figure: DPBench Architecture]

A benchmark for evaluating LLM coordination under simultaneous resource contention.

Why DPBench?

Existing LLM benchmarks evaluate individual capabilities such as reasoning (GSM8K), knowledge (MMLU), or coding (HumanEval). Multi-agent benchmarks typically use turn-based interaction in which agents respond sequentially, so they do not test simultaneous coordination under resource contention.

This capability matters for real deployments. Autonomous vehicles at intersections, collaborative robotics, and distributed systems all require agents to coordinate concurrent decisions without observing what others are doing. DPBench provides a standardized test for this capability.

What is DPBench?

DPBench is a framework built on the Dining Philosophers problem, a classic coordination challenge from distributed systems. It provides a standardized environment with automatic deadlock detection, two orchestration modes (simultaneous and sequential), six reproducible metrics (deadlock rate, throughput, fairness, time to deadlock, starvation count, and message-action consistency), and eight experimental conditions that systematically vary decision timing, group size, and communication.

Our experiments show LLMs achieve near-zero deadlock in sequential mode but 25-95% deadlock rates in simultaneous mode, revealing a fundamental gap in coordination capabilities.

Installation

pip install dpbench

Quick Start

from dpbench import Benchmark

# Define your model (works with any LLM: API-based or local)
def my_model(system_prompt: str, user_prompt: str) -> str:
    # Replace this placeholder with your LLM call and return the model's text reply
    response = ""
    return response

# Run benchmark
results = Benchmark.run(
    model_fn=my_model,
    system_prompt="System prompt here",
    decision_prompt="Decision prompt template",
    mode="simultaneous"
)

# Results
print(f"Deadlock Rate: {results['deadlock_rate']:.1%}")
print(f"Throughput: {results['avg_throughput']:.3f}")
print(f"Fairness: {results['avg_fairness']:.3f}")

See experiments/prompts/ for prompt templates used in our experiments.

How It Works

The Dining Philosophers Problem

N philosophers sit around a table with N forks, one between each pair of neighbors. Each philosopher needs both adjacent forks to eat, but a fork can be held by only one philosopher at a time. When all philosophers simultaneously grab one fork, they deadlock: each holds one fork and waits for a neighbor's fork, creating a circular dependency.

This problem isolates the core challenge of resource coordination: agents must make compatible decisions without directly observing others' current actions.
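The circular dependency described above can be illustrated with a small standalone sketch. It does not use the dpbench API; the fork indexing (philosopher i has left fork i and right fork (i + 1) % N) and all function names are assumptions made for illustration.

```python
from typing import Dict, List, Optional


def grab_left_all(n: int) -> List[Optional[int]]:
    """Return holder[f] after every philosopher grabs their left fork at once.

    Assumed indexing: philosopher p's left fork is fork p.
    """
    holder: List[Optional[int]] = [None] * n
    for p in range(n):
        holder[p] = p
    return holder


def wait_for_graph(holder: List[Optional[int]]) -> Dict[int, int]:
    """Map each philosopher to the neighbor holding the right fork they need."""
    n = len(holder)
    graph: Dict[int, int] = {}
    for p in range(n):
        owner = holder[(p + 1) % n]  # assumed: right fork is (p + 1) % n
        if owner is not None and owner != p:
            graph[p] = owner
    return graph


def has_circular_wait(graph: Dict[int, int]) -> bool:
    """Follow the single outgoing edge of each node and look for a cycle."""
    for start in graph:
        seen = set()
        node = start
        while node in graph and node not in seen:
            seen.add(node)
            node = graph[node]
        if node == start:
            return True
    return False


forks = grab_left_all(5)
print(has_circular_wait(wait_for_graph(forks)))
```

If any one philosopher refrains from grabbing, the chain of waits is broken and no cycle remains, which is why compatible (not identical) decisions are what the problem rewards.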

Framework Architecture

Environment: Circular table with configurable number of philosophers (N) and N forks. Four actions per agent: GRAB_LEFT, GRAB_RIGHT, RELEASE, WAIT. Automatic deadlock detection when all agents are hungry and each holds exactly one fork. Partial observability enforces realistic constraints.
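As a rough illustration of the deadlock rule stated above (all agents hungry, each holding exactly one fork), here is a minimal sketch. The `Action` enum mirrors the four action names from the text, but the `Philosopher` dataclass and `is_deadlocked` function are hypothetical and are not taken from the dpbench source.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Sequence


class Action(Enum):
    """The four actions named in the README."""
    GRAB_LEFT = "GRAB_LEFT"
    GRAB_RIGHT = "GRAB_RIGHT"
    RELEASE = "RELEASE"
    WAIT = "WAIT"


@dataclass
class Philosopher:
    """Hypothetical per-agent state used only for this sketch."""
    hungry: bool = True
    forks_held: int = 0


def is_deadlocked(philosophers: Sequence[Philosopher]) -> bool:
    """True when every agent is hungry and holds exactly one fork."""
    if not philosophers:
        return False  # an empty table cannot deadlock
    return all(p.hungry and p.forks_held == 1 for p in philosophers)


table = [Philosopher(hungry=True, forks_held=1) for _ in range(5)]
print(is_deadlocked(table))
```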

Orchestration: Simultaneous mode executes all agent decisions in parallel without state updates between decisions, testing true concurrent coordination. Sequential mode processes decisions one at a time with state updates after each action, providing an easier baseline.
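The difference between the two modes can be sketched without the framework. In the sketch below, simultaneous mode collects every decision against one frozen snapshot, while sequential mode lets each agent see the forks taken earlier in the same round. The `Policy` signature, the `cautious` policy, and the tie-breaking rule (the lower-numbered philosopher wins a contested fork) are assumptions for illustration, not dpbench behavior.

```python
from typing import Callable, List, Optional

Holder = List[Optional[int]]           # holder[f] = philosopher holding fork f, or None
Policy = Callable[[int, Holder], str]  # hypothetical: (philosopher id, observed forks) -> action


def apply_action(holder: Holder, p: int, action: str) -> None:
    """Apply a grab if the requested fork is free; other actions are no-ops here."""
    n = len(holder)
    if action == "GRAB_LEFT":
        fork = p
    elif action == "GRAB_RIGHT":
        fork = (p + 1) % n
    else:
        return
    if holder[fork] is None:
        holder[fork] = p


def step_sequential(holder: Holder, policy: Policy) -> None:
    """Each agent decides after seeing the effects of earlier agents."""
    for p in range(len(holder)):
        apply_action(holder, p, policy(p, list(holder)))


def step_simultaneous(holder: Holder, policy: Policy) -> None:
    """All agents decide against the same snapshot; actions are applied afterwards."""
    snapshot = list(holder)
    actions = [policy(p, list(snapshot)) for p in range(len(holder))]
    for p, action in enumerate(actions):
        apply_action(holder, p, action)


def cautious(p: int, forks: Holder) -> str:
    """Grab left only when both adjacent forks look free."""
    left, right = p, (p + 1) % len(forks)
    if forks[left] == p:
        return "GRAB_RIGHT"
    if forks[left] is None and forks[right] is None:
        return "GRAB_LEFT"
    return "WAIT"


sim: Holder = [None] * 5
seq: Holder = [None] * 5
step_simultaneous(sim, cautious)
step_sequential(seq, cautious)
print(sim)  # every philosopher holds one fork
print(seq)  # the last philosopher saw fork 0 taken and waited
```

The same policy produces the all-hold-one-fork state in simultaneous mode but leaves a free fork in sequential mode, because later agents observe earlier grabs.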

Metrics: Six standardized metrics ensure reproducible evaluation. Deadlock rate captures coordination failure. Throughput measures efficiency as meals per timestep. Fairness is a Gini-normalized measure of how meals are distributed across philosophers. Time to deadlock, starvation count, and message-action consistency provide diagnostic information.
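To make the throughput and fairness definitions concrete, here is one plausible reading as a sketch. The exact formulas dpbench uses are not stated in this README, so the choices below (total meals divided by timesteps, and one minus the Gini coefficient rescaled by its maximum value (N - 1) / N) are assumptions, as are the function names.

```python
from typing import Sequence


def throughput(total_meals: int, timesteps: int) -> float:
    """Meals per timestep (assumed: total meals across all philosophers)."""
    return total_meals / timesteps if timesteps > 0 else 0.0


def gini(values: Sequence[float]) -> float:
    """Standard Gini coefficient of a non-negative sample."""
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    ordered = sorted(values)
    weighted = sum(rank * v for rank, v in enumerate(ordered, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


def fairness(meals: Sequence[int]) -> float:
    """Assumed normalization: 1.0 is perfectly even, 0.0 is one agent eating everything."""
    n = len(meals)
    if n < 2:
        return 1.0
    max_gini = (n - 1) / n  # Gini when a single agent holds the whole total
    return 1.0 - gini(meals) / max_gini


print(throughput(12, 30))
print(fairness([3, 3, 3, 3, 3]))
print(fairness([0, 0, 0, 0, 10]))
```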

Standard Conditions

Eight conditions systematically vary three factors:

Code	Decision Mode	Philosophers	Communication
`sim5nc`	Simultaneous	5	No
`sim5c`	Simultaneous	5	Yes
`seq5nc`	Sequential	5	No
`seq5c`	Sequential	5	Yes
`sim3nc`	Simultaneous	3	No
`sim3c`	Simultaneous	3	Yes
`seq3nc`	Sequential	3	No
`seq3c`	Sequential	3	Yes
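The condition codes in the table follow a regular pattern: decision mode, number of philosophers, then `c` or `nc` for communication. A small helper can decode them. This parser is a hypothetical convenience written for this README, not part of the dpbench package.

```python
import re
from typing import NamedTuple


class Condition(NamedTuple):
    mode: str            # "simultaneous" or "sequential"
    philosophers: int
    communication: bool


# Pattern inferred from the table: sim|seq, a number, then nc|c
_CODE = re.compile(r"^(sim|seq)(\d+)(nc|c)$")


def parse_condition(code: str) -> Condition:
    """Decode a code such as 'sim5nc' into its three factors."""
    match = _CODE.match(code)
    if match is None:
        raise ValueError(f"unrecognized condition code: {code!r}")
    mode, count, comm = match.groups()
    return Condition(
        mode="simultaneous" if mode == "sim" else "sequential",
        philosophers=int(count),
        communication=(comm == "c"),
    )


for code in ["sim5nc", "sim5c", "seq5nc", "seq5c", "sim3nc", "sim3c", "seq3nc", "seq3c"]:
    print(code, parse_condition(code))
```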

Benchmark Results

We evaluated frontier LLMs to validate the framework and establish baselines. Results demonstrate that DPBench successfully distinguishes coordination capabilities across models and conditions.

Key Finding: Models show asymmetric performance. Sequential coordination succeeds (near 0% deadlock) while simultaneous coordination fails (25-95% deadlock), revealing that current LLMs struggle with concurrent resource decisions.

Sequential vs Simultaneous Performance

[Figure: Model Comparison]

Models coordinate effectively in sequential mode but exhibit high deadlock rates when decisions must be simultaneous.

Communication Does Not Solve Coordination

[Figure: Communication Effect]

Enabling inter-agent messaging does not reduce deadlock. Messages arrive one timestep late, and agents' actions often do not match the intentions they announce (low message-action consistency), so communication alone does not produce effective coordination.
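The one-timestep message delay can be pictured with a small mailbox sketch: a message sent during timestep t only becomes readable at timestep t + 1, so it always describes an intention formed against an older state. The `DelayedMailbox` class and its method names are hypothetical and are not the dpbench messaging API.

```python
from typing import Dict, List


class DelayedMailbox:
    """Delivers messages one timestep after they are sent."""

    def __init__(self) -> None:
        self._in_flight: Dict[int, List[str]] = {}
        self._delivered: Dict[int, List[str]] = {}

    def send(self, recipient: int, text: str) -> None:
        """Queue a message; it is not readable during the current timestep."""
        self._in_flight.setdefault(recipient, []).append(text)

    def advance(self) -> None:
        """Move to the next timestep: queued messages become readable, old ones expire."""
        self._delivered = self._in_flight
        self._in_flight = {}

    def read(self, recipient: int) -> List[str]:
        """Return the messages visible to this recipient in the current timestep."""
        return list(self._delivered.get(recipient, []))


mailbox = DelayedMailbox()
mailbox.send(1, "I will grab my left fork")
print(mailbox.read(1))  # nothing yet: the message is still in flight
mailbox.advance()
print(mailbox.read(1))  # now visible, one timestep after it was sent
```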

Full Condition Breakdown

[Figure: Performance by Condition]

Deadlock patterns persist across group sizes and communication settings, demonstrating systematic coordination failures in simultaneous mode.

Reproducing Our Experiments

git clone https://github.com/najmulhasan-code/dpbench.git
cd dpbench
pip install -e .

# Configure API keys (only needed to reproduce our specific experiments)
cp .env.example .env
# Edit .env with your API keys for OpenAI, Anthropic, Google, and xAI

# Run experiments
python experiments/scripts/run_full.py

The experiments in this repository use API-based models (GPT, Claude, Gemini, Grok), but the dpbench framework itself works with any model including local models. Configurations are in experiments/configs/. Modify experiments/configs/models.yaml to test your own models.

Citation

# Citation will be added upon publication

License

MIT License - see LICENSE for details.

Version 0.1.0, released Jan 31, 2026

