agent-colosseum
Make AI agents fight, research, and review each other — then prove the result is better than a single agent.
agent-colosseum is a Python framework for running multi-agent interactions (debate, collaborative research, red-teaming, peer review) and benchmarking them against single-agent baselines. It is provider-agnostic (Claude, GPT, Gemini, local models) and ships with built-in datasets and evaluation metrics.
Why?
A single LLM has blind spots. It commits to an answer too early, hallucinates confidently, and can't challenge its own reasoning. Multiple agents attacking, questioning, and building on each other's work produce measurably better results.
This isn't theory. The papers prove it:
| Paper | Key Result |
|---|---|
| Du et al. (2023) | GSM8K: 77% → 85%, MMLU: 63.9% → 71.1% |
| Liang et al. (EMNLP 2024) | GPT-3.5 + debate > GPT-4 alone; 2.6× increase in thought diversity |
| Chan et al. (2023) | Same-role agents = zero multi-agent benefit. Diverse roles required |
| CAMEL (NeurIPS 2023) | Multi-agent wins 76.3% vs single-agent 10.4% |
| Park et al. (2023) | Full architecture (memory+reflection) +41% believability |
agent-colosseum lets you reproduce these results, run your own experiments, and use multi-agent patterns in production.
Installation
```bash
pip install agent-colosseum              # core only
pip install agent-colosseum[anthropic]   # + Claude support
pip install agent-colosseum[openai]      # + GPT support
pip install agent-colosseum[all]         # everything
```
Quick Start
```python
import asyncio
from agent_colosseum import DebateExecutor, DebateConfig
from agent_colosseum.providers.anthropic import AnthropicProvider

async def main():
    provider = AnthropicProvider(model="claude-sonnet-4-6")
    executor = DebateExecutor(provider)

    # Run a debate
    result = await executor.debate("Should AI models be open-sourced without restrictions?")
    print(result.final_answer)
    print(f"Confidence: {result.confidence}")
    print(f"Key arguments: {result.key_arguments}")

    # Compare: single agent vs debate
    single, debate = await executor.compare("Is Rust better than C++?")
    print(f"Single: {single.final_answer[:200]}")
    print(f"Debate: {debate.final_answer[:200]}")
    print(f"Token cost: {debate.total_tokens / single.total_tokens:.1f}x")

asyncio.run(main())
```
Four Arena Modes
1. Debate — Agents Fight, Then Reach Consensus
Agents take opposing sides, attack each other's arguments, and are forced to synthesize a conclusion. Based on MAD (Liang et al.) and Du et al.
```python
result = await executor.debate("Is remote work more productive than office work?")
```
Three debate protocols:
| Protocol | How it works | Best for |
|---|---|---|
| `mad` (default) | Pro vs Con + Judge, tit-for-tat forced opposition | Factual questions, reasoning |
| `du` | N agents answer independently → share → update → converge | Math, well-defined problems |
| `freestyle` | Free-form argument, no forced sides | Open-ended topics |
```python
# Du-style consensus (no judge, natural convergence)
result = await executor.debate(topic, protocol="du")

# Freestyle with custom agents
result = await executor.debate(topic, protocol="freestyle", debaters=my_agents)
```
How it works:
```
Rounds 1-3: Fight (argument → rebuttal → concession → challenge)
    ↓
Synthesis Round: Each debater states what they concede and what they won't
    ↓
Final Integration: Merge synthesis statements into a unified conclusion
    ↓
Result: final_answer + agreed_points + unresolved issues + insight
```
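The phases above can be sketched as a plain loop. This is an illustrative sketch with stub "agents" (plain strings), not agent-colosseum's actual control flow; the function and variable names here are made up for the sketch.

```python
# Illustrative control flow for debate mode: N fight rounds, a synthesis
# round, then final integration. Stub agents; not the library's API.

def run_debate(topic: str, debaters: list[str], max_rounds: int = 3) -> dict:
    transcript: list[tuple[str, str]] = []

    # Rounds 1..N: each debater speaks in turn (sequential, not simultaneous)
    for round_num in range(1, max_rounds + 1):
        for name in debaters:
            statement = f"{name} argues about '{topic}' (round {round_num})"
            transcript.append((name, statement))

    # Synthesis round: each debater states concessions and remaining disagreements
    syntheses = [f"{name} concedes some points, holds others" for name in debaters]

    # Final integration: merge synthesis statements into one conclusion
    final_answer = " | ".join(syntheses)
    return {"final_answer": final_answer, "transcript": transcript}

result = run_debate("remote work", ["Pro", "Con"])
print(len(result["transcript"]))  # 6 statements: 2 debaters × 3 rounds
```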
2. Research — Independent Investigation, Cross-Examination, Synthesis
Each agent investigates from their specialty (empiricist, theorist, methodologist), then they cross-examine each other's findings before synthesizing.
```python
result = await executor.research("Are LLM agents production-ready in 2026?")
```
How it works:
```
Phase 1 (Independent): Each researcher analyzes from their specialty
Phase 2 (Cross-exam):  Read each other's work, challenge methodology
Phase 3 (Synthesis):   Confirmed findings + probable hypotheses + unresolved
```
Output includes:
- `confirmed_findings` — all researchers agree
- `probable_hypotheses` — most agree, needs verification
- `unresolved` — researchers disagree
- `further_research` — suggested next steps
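The synthesis buckets amount to counting agreement across researchers. Here is a toy sketch of that classification rule; the thresholds (unanimous → confirmed, majority → probable) are an assumption for illustration, not the library's exact logic.

```python
def classify_findings(findings_by_researcher: dict[str, set[str]]) -> dict[str, list[str]]:
    """Bucket each finding by how many researchers reported it:
    all → confirmed, majority → probable, otherwise → unresolved."""
    n = len(findings_by_researcher)
    counts: dict[str, int] = {}
    for findings in findings_by_researcher.values():
        for f in findings:
            counts[f] = counts.get(f, 0) + 1

    buckets: dict[str, list[str]] = {
        "confirmed_findings": [], "probable_hypotheses": [], "unresolved": []
    }
    for finding, count in sorted(counts.items()):
        if count == n:
            buckets["confirmed_findings"].append(finding)
        elif count > n / 2:
            buckets["probable_hypotheses"].append(finding)
        else:
            buckets["unresolved"].append(finding)
    return buckets

buckets = classify_findings({
    "empiricist":    {"A", "B"},
    "theorist":      {"A", "B", "C"},
    "methodologist": {"A", "C"},
})
print(buckets)  # {'confirmed_findings': ['A'], 'probable_hypotheses': ['B', 'C'], 'unresolved': []}
```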
3. Red Team — One Defends, Others Attack
One agent presents a claim. The others attack it from different angles (logic, evidence, edge cases, practicality, assumptions). The defender absorbs valid attacks and strengthens the claim.
```python
result = await executor.redteam("Microservices are better than monoliths")
```
How it works:
```
Phase 1: Defender presents initial claim
Phase 2: Attackers hit from different angles → Defender revises (repeat)
Phase 3: Compare original vs hardened claim
```
Output includes:
- `original_weaknesses` — weaknesses found in the initial claim
- `improvements` — how the claim got stronger
- `surviving_weaknesses` — weaknesses that couldn't be fixed
4. Peer Review — Present, Review, Revise
One agent presents analysis. Others review it (strengths, weaknesses, questions, suggestions). The author revises based on feedback.
```python
result = await executor.peer_review("What caused the 2008 financial crisis?")
```
How it works:
```
Phase 1: Author presents structured analysis
Phase 2: Reviewers critique from different angles → Author revises (repeat)
Phase 3: Final reviewed output
```
Custom Agents
Every mode accepts custom agents. Control the number, roles, personalities, and even which LLM model each agent uses.
```python
from agent_colosseum import Debater, DebaterRole

debaters = [
    Debater(
        name="DataScientist",
        role=DebaterRole.PROPONENT,
        persona="Relies on statistical evidence and empirical data.",
        model="claude-sonnet-4-6",  # optional per-agent model
    ),
    Debater(
        name="Philosopher",
        role=DebaterRole.OPPONENT,
        persona="Questions fundamental assumptions and ethical implications.",
    ),
    Debater(
        name="Engineer",
        role=DebaterRole.SKEPTIC,
        persona="Demands practical feasibility. 'Will it actually work?'",
    ),
    Debater(
        name="Contrarian",
        role=DebaterRole.DEVILS_ADVOCATE,
        persona="Opposes whatever the majority thinks. Finds hidden problems.",
    ),
]

# Use in any mode
result = await executor.debate(topic, debaters=debaters)
result = await executor.research(topic, debaters=debaters)
result = await executor.redteam(topic, debaters=debaters)      # first = defender
result = await executor.peer_review(topic, debaters=debaters)  # first = author
```
Available roles:
| Role | Behavior |
|---|---|
| `PROPONENT` | Argues in favor |
| `OPPONENT` | Argues against |
| `SKEPTIC` | Demands evidence for everything |
| `DEVILS_ADVOCATE` | Opposes the majority, reveals hidden problems |
| `WILDCARD` | Unpredictable, attacks from unexpected angles |
Cross-Model Debate
Du et al. showed that ChatGPT (14/20) + Bard (11/20) debating together scored 17/20 — better than either alone. agent-colosseum supports this natively:
```python
from agent_colosseum.providers.anthropic import AnthropicProvider
from agent_colosseum.providers.openai import OpenAIProvider

debaters = [
    Debater(name="Claude", role=DebaterRole.PROPONENT, model="claude-sonnet-4-6"),
    Debater(name="GPT", role=DebaterRole.OPPONENT, model="gpt-4o"),
]

# Use MultiProvider to route by model name
# (see examples/cross_model_debate.py for full implementation)
```
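The essence of model-name routing is small enough to sketch standalone. The sketch below uses stand-in classes (`LLMResponse`, `StubProvider`, `ModelRouter` are defined here, not imported from agent-colosseum), and the prefix-matching rule is an assumption about how such a router might work; see the bundled example for the real implementation.

```python
# Sketch of a provider router: dispatch each generate() call to the backend
# whose prefix matches the requested model name. All names are illustrative.
import asyncio
from dataclasses import dataclass

@dataclass
class LLMResponse:  # stand-in for agent_colosseum's LLMResponse
    content: str
    input_tokens: int = 0
    output_tokens: int = 0

class StubProvider:
    """Minimal provider that records which backend handled the call."""
    def __init__(self, label: str):
        self.label = label

    async def generate(self, system, messages, *, model=None, **kwargs) -> LLMResponse:
        return LLMResponse(content=f"{self.label}:{model}")

class ModelRouter:
    """Route by model-name prefix, e.g. 'claude-*' → Anthropic, 'gpt-*' → OpenAI."""
    def __init__(self, routes: dict[str, StubProvider], default: StubProvider):
        self.routes = routes
        self.default = default

    async def generate(self, system, messages, *, model=None, **kwargs) -> LLMResponse:
        for prefix, provider in self.routes.items():
            if model and model.startswith(prefix):
                return await provider.generate(system, messages, model=model, **kwargs)
        return await self.default.generate(system, messages, model=model, **kwargs)

router = ModelRouter(
    routes={"claude": StubProvider("anthropic"), "gpt": StubProvider("openai")},
    default=StubProvider("anthropic"),
)
resp = asyncio.run(router.generate("", [], model="gpt-4o"))
print(resp.content)  # openai:gpt-4o
```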
Configuration
```python
from agent_colosseum import DebateConfig

config = DebateConfig(
    max_rounds=3,               # Du: 2-3 optimal, plateau at 4+
    disagreement_level=2,       # 1=mild, 2=strong (optimal), 3=extreme
    adaptive_break=True,        # Liang: judge can end early on consensus
    scoring_criteria=["logic", "evidence", "persuasion"],
    temperature=0.7,            # agent generation temperature
    judge_temperature=0.0,      # SOTOPIA: deterministic judge correlates r=0.71 with humans
    max_tokens_per_statement=800,
    max_tokens_final=2000,
)
executor = DebateExecutor(provider, config=config)
```
Why these defaults? Every default is backed by paper evidence:
| Setting | Default | Evidence |
|---|---|---|
| `max_rounds` | 3 | Du et al.: 2-3 rounds optimal, plateau at 4+ |
| `disagreement_level` | 2 | Liang: level 2 optimal; level 3 devolves into "trying to win" |
| `adaptive_break` | True | Liang: judge-based early termination is effective |
| `judge_temperature` | 0.0 | SOTOPIA: deterministic scoring correlates r=0.71 with human judgment |
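These settings also bound a debate's token cost. A back-of-envelope model, assuming every debater emits one statement per round plus one synthesis statement, followed by a single final integration (a simplification of the executor's actual accounting, not its real implementation):

```python
def max_debate_tokens(debaters: int, max_rounds: int,
                      max_tokens_per_statement: int, max_tokens_final: int) -> int:
    """Worst-case output-token budget: one statement per debater per round,
    one synthesis statement per debater, plus the final integration."""
    statements = debaters * (max_rounds + 1)  # fight rounds + synthesis round
    return statements * max_tokens_per_statement + max_tokens_final

# With the defaults (2 debaters, 3 rounds, 800 tokens/statement, 2000 final):
print(max_debate_tokens(2, 3, 800, 2000))  # 8400
```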
Benchmarking
Built-in benchmarking to prove multi-agent > single agent with your own data:
```python
from agent_colosseum.benchmark import BenchmarkRunner

runner = BenchmarkRunner(executor)

# Run on built-in datasets
result = await runner.run("reasoning", protocol="mad", max_items=5)
print(result.summary())

# Output:
# ═══ Benchmark: reasoning (protocol: mad) ═══
#
# ## Accuracy
# Single Agent: 3/5 (60.0%)
# Debate:       4/5 (80.0%)
# Difference:   +20.0%
#
# ## Quality (0-10)
# Single Agent: 6.2
# Debate:       7.8
# Difference:   +1.6
#
# ## Cost
# Single tokens: 2,100
# Debate tokens: 8,400
# Ratio:         4.0x
```
Built-in datasets:
| Dataset | Items | Type | Based on |
|---|---|---|---|
| `arithmetic` | 10 | Exact-answer math | Du et al. |
| `reasoning` | 8 | Counter-intuitive logic | Liang CIAR |
| `factual` | 8 | Fact verification | Du MMLU/TruthfulQA |
| `opinion` | 5 | Open-ended (no ground truth) | Quality eval only |
Quality metrics (SOTOPIA-inspired 7 dimensions):
| Dimension | Measures | Weight |
|---|---|---|
| depth | Argument depth (surface vs root cause) | 1.0x |
| diversity | Perspective diversity (Liang: debate = 2.6x) | 1.5x |
| evidence | Evidence specificity | 1.0x |
| logic | Logical consistency | 1.0x |
| novelty | Insights impossible for single agent | 2.0x |
| concession | Quality of concessions (intellectual honesty) | 1.0x |
| synthesis | Final conclusion better than individual positions | 2.0x |
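An overall quality score can then be a weighted average across the seven dimensions. A minimal sketch of that aggregation, assuming each dimension is scored 0-10 (the weights come from the table above; `quality_score` is an illustrative name, not the library's API):

```python
# Weights from the table above; a higher weight means the dimension counts more.
WEIGHTS = {
    "depth": 1.0, "diversity": 1.5, "evidence": 1.0, "logic": 1.0,
    "novelty": 2.0, "concession": 1.0, "synthesis": 2.0,
}

def quality_score(scores: dict[str, float]) -> float:
    """Weighted average of per-dimension scores (each 0-10)."""
    total_weight = sum(WEIGHTS.values())  # 9.5
    weighted = sum(WEIGHTS[d] * scores[d] for d in WEIGHTS)
    return weighted / total_weight

uniform = {d: 7.0 for d in WEIGHTS}
print(round(quality_score(uniform), 1))  # 7.0 — uniform scores are unchanged by weighting
```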
Real-Time Streaming
All modes support a callback for real-time statement streaming:
```python
async def on_statement(stmt, round_num):
    label = "Synthesis" if round_num == 0 else f"R{round_num}"
    print(f"[{label}] {stmt.debater_name}: {stmt.content[:200]}")

result = await executor.debate(topic, on_statement=on_statement)
```
Custom LLM Providers
Implement the LLMProvider interface to use any LLM:
```python
from agent_colosseum.providers.base import LLMProvider, LLMResponse

class OllamaProvider(LLMProvider):
    async def generate(
        self,
        system: str,
        messages: list[dict[str, str]],
        *,
        max_tokens: int = 800,
        temperature: float = 0.7,
        model: str | None = None,
    ) -> LLMResponse:
        # Your implementation here
        return LLMResponse(content="...", input_tokens=0, output_tokens=0)
```
Built-in providers: AnthropicProvider, OpenAIProvider (also works with any OpenAI-compatible API via base_url).
Architecture
```
agent_colosseum/
├── models.py            # Debater, Statement, Round, DebateResult
├── prompts.py           # All prompts (English, provider-agnostic)
├── protocols.py         # Debate protocols (MAD, Du consensus, freestyle)
├── executor.py          # Main engine — debate/research/redteam/peer_review/compare
├── modes/
│   ├── research.py      # Independent investigation → cross-exam → synthesis
│   ├── redteam.py       # 1 defender + N attackers → hardened claim
│   └── peer_review.py   # 1 author → N reviewers → improved output
├── providers/
│   ├── base.py          # LLMProvider ABC
│   ├── anthropic.py     # Claude
│   └── openai.py        # GPT / OpenAI-compatible
└── benchmark/
    ├── datasets.py      # Built-in datasets (arithmetic, reasoning, factual, opinion)
    ├── metrics.py       # Accuracy + 7-dimension quality evaluation
    └── runner.py        # Single vs multi-agent comparison runner
```
Design Decisions
Every design choice in agent-colosseum is grounded in published research:
| Decision | Reasoning | Source |
|---|---|---|
| Default 2 agents | 2 debaters optimal; 3+ degrades due to context length | Liang et al. |
| Default 3 rounds | 2-3 optimal; 4+ shows plateau or decline | Du et al. |
| Diverse roles mandatory | Same role = zero multi-agent benefit | Chan et al. |
| Disagreement level 2 | Level 3 devolves into "arguing to win" | Liang et al. |
| One-by-one communication | Sequential > simultaneous (60% vs 55% accuracy) | Chan et al. |
| Adaptive early termination | Judge-based consensus detection saves cost | Liang et al. |
| Deterministic judge (temp=0) | Correlates r=0.71 with human judgment | SOTOPIA |
| Synthesis round | Debaters themselves reach conclusion, not external judge | Novel |
References
Core Multi-Agent Debate
- Du et al. (2023) — Improving Factuality and Reasoning in Language Models through Multiagent Debate
- Liang et al. (EMNLP 2024) — Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate
- Chan et al. (2023) — ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate
- CAMEL (NeurIPS 2023) — Communicative Agents for "Mind" Exploration of Large Language Model Society
Agent Architecture & Evaluation
- Park et al. (UIST 2023) — Generative Agents: Interactive Simulacra of Human Behavior
- SOTOPIA (ICLR 2024) — Interactive Evaluation for Social Intelligence in Language Agents
- MetaGPT (2023) — Meta Programming for A Multi-Agent Collaborative Framework
- Tree of Thoughts (2023) — Deliberate Problem Solving with Large Language Models
Personality & Behavior
- BIG5-CHAT (ACL 2025) — Shaping LLM Personalities via Training on Human-Grounded Data
- Agentic LLMs Survey (2025) — Theory of Mind, multi-agent cooperation, social norms
- Dynamic Personality (ACL 2025) — Personality evolution during interaction
License
MIT