
agent-colosseum

Make AI agents fight, research, and review each other — then prove the result is better than a single agent.

agent-colosseum is a Python framework for running multi-agent interactions (debate, collaborative research, red-teaming, peer review) and benchmarking them against single-agent baselines. It is provider-agnostic (Claude, GPT, Gemini, local models) and ships with built-in datasets and evaluation metrics.

Why?

A single LLM has blind spots. It commits to an answer too early, hallucinates confidently, and can't challenge its own reasoning. Multiple agents attacking, questioning, and building on each other's work produce measurably better results.

This isn't theory. The papers prove it:

  • Du et al. (2023) — GSM8K: 77% → 85%; MMLU: 63.9% → 71.1%
  • Liang et al. (EMNLP 2024) — GPT-3.5 with debate beats GPT-4 alone; 2.6x increase in thought diversity
  • Chan et al. (2023) — same-role agents yield zero multi-agent benefit; diverse roles required
  • CAMEL (NeurIPS 2023) — multi-agent wins 76.3% vs single-agent 10.4%
  • Park et al. (2023) — full architecture (memory + reflection): +41% believability

agent-colosseum lets you reproduce these results, run your own experiments, and use multi-agent patterns in production.

Installation

pip install agent-colosseum              # core only
pip install agent-colosseum[anthropic]   # + Claude support
pip install agent-colosseum[openai]      # + GPT support
pip install agent-colosseum[all]         # everything

Quick Start

import asyncio
from agent_colosseum import DebateExecutor
from agent_colosseum.providers.anthropic import AnthropicProvider

async def main():
    provider = AnthropicProvider(model="claude-sonnet-4-6")
    executor = DebateExecutor(provider)

    # Run a debate
    result = await executor.debate("Should AI models be open-sourced without restrictions?")
    print(result.final_answer)
    print(f"Confidence: {result.confidence}")
    print(f"Key arguments: {result.key_arguments}")

    # Compare: single agent vs debate
    single, debate = await executor.compare("Is Rust better than C++?")
    print(f"Single: {single.final_answer[:200]}")
    print(f"Debate: {debate.final_answer[:200]}")
    print(f"Token cost: {debate.total_tokens / single.total_tokens:.1f}x")

asyncio.run(main())

Four Arena Modes

1. Debate — Agents Fight, Then Reach Consensus

Agents take opposing sides, attack each other's arguments, and are forced to synthesize a conclusion. Based on MAD (Liang et al.) and Du et al.

result = await executor.debate("Is remote work more productive than office work?")

Three debate protocols:

  • mad (default) — Pro vs Con plus a judge; tit-for-tat forced opposition. Best for factual questions and reasoning.
  • du — N agents answer independently → share → update → converge. Best for math and well-defined problems.
  • freestyle — free-form argument, no forced sides. Best for open-ended topics.

# Du-style consensus (no judge, natural convergence)
result = await executor.debate(topic, protocol="du")

# Freestyle with custom agents
result = await executor.debate(topic, protocol="freestyle", debaters=my_agents)

How it works:

Rounds 1-3: Fight (argument → rebuttal → concession → challenge)
    ↓
Synthesis Round: Each debater states what they concede and what they won't
    ↓  
Final Integration: Merge synthesis statements into unified conclusion
    ↓
Result: final_answer + agreed_points + unresolved issues + insight
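
The result object mirrors this flow. A minimal inspection sketch (final_answer and confidence appear in the Quick Start; agreed_points and unresolved are field names read off the diagram above, so treat them as assumptions):

result = await executor.debate("Is remote work more productive than office work?")
print(result.final_answer)              # unified conclusion from the final integration
print(f"Confidence: {result.confidence}")
for point in result.agreed_points:      # concessions both sides accepted
    print("agreed:", point)
for issue in result.unresolved:         # disagreements that survived synthesis
    print("open:  ", issue)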

2. Research — Independent Investigation, Cross-Examination, Synthesis

Each agent investigates from their specialty (empiricist, theorist, methodologist), then they cross-examine each other's findings before synthesizing.

result = await executor.research("Are LLM agents production-ready in 2026?")

How it works:

Phase 1 (Independent): Each researcher analyzes from their specialty
Phase 2 (Cross-exam):  Read each other's work, challenge methodology
Phase 3 (Synthesis):   Confirmed findings + probable hypotheses + unresolved

Output includes:

  • confirmed_findings — all researchers agree
  • probable_hypotheses — most agree, needs verification
  • unresolved — researchers disagree
  • further_research — suggested next steps
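
A sketch of consuming these fields, assuming each is exposed as a plain list of strings on the result object:

result = await executor.research("Are LLM agents production-ready in 2026?")
for finding in result.confirmed_findings:      # all researchers agree
    print("CONFIRMED:", finding)
for hypothesis in result.probable_hypotheses:  # majority view, needs verification
    print("PROBABLE: ", hypothesis)
for step in result.further_research:           # suggested next steps
    print("NEXT:     ", step)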

3. Red Team — One Defends, Others Attack

One agent presents a claim. The others attack it from different angles (logic, evidence, edge cases, practicality, assumptions). The defender absorbs valid attacks and strengthens the claim.

result = await executor.redteam("Microservices are better than monoliths")

How it works:

Phase 1: Defender presents initial claim
Phase 2: Attackers hit from different angles → Defender revises (repeat)
Phase 3: Compare original vs hardened claim

Output includes:

  • original_weaknesses — weaknesses found in the initial claim
  • improvements — how the claim got stronger
  • surviving_weaknesses — weaknesses that couldn't be fixed
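
Reading the outcome might look like this (a sketch; it assumes the fields above are lists of strings and that final_answer carries the hardened claim):

result = await executor.redteam("Microservices are better than monoliths")
print(result.final_answer[:300])              # the hardened claim
print(f"{len(result.improvements)} improvements made under attack")
for weakness in result.surviving_weaknesses:  # caveats the defender could not fix
    print("STILL OPEN:", weakness)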

4. Peer Review — Present, Review, Revise

One agent presents an analysis. The others review it (strengths, weaknesses, questions, suggestions), and the author revises based on the feedback.

result = await executor.peer_review("What caused the 2008 financial crisis?")

How it works:

Phase 1: Author presents structured analysis
Phase 2: Reviewers critique from different angles → Author revises (repeat)
Phase 3: Final reviewed output
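
One way to see the revision effect is to capture statements as they stream and compare the first draft with the final output (a sketch using the on_statement callback described under Real-Time Streaming below; treating the first captured statement as the author's initial draft is an assumption):

drafts = []

async def capture(stmt, round_num):
    drafts.append(stmt)

result = await executor.peer_review(
    "What caused the 2008 financial crisis?", on_statement=capture
)
print("Initial draft:", drafts[0].content[:200])    # Phase 1: author's first analysis
print("Final version:", result.final_answer[:200])  # Phase 3: after review rounds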

Custom Agents

Every mode accepts custom agents. You control how many agents run, their roles and personas, and even which LLM each agent uses.

from agent_colosseum import Debater, DebaterRole

debaters = [
    Debater(
        name="DataScientist",
        role=DebaterRole.PROPONENT,
        persona="Relies on statistical evidence and empirical data.",
        model="claude-sonnet-4-6",  # optional per-agent model
    ),
    Debater(
        name="Philosopher",
        role=DebaterRole.OPPONENT,
        persona="Questions fundamental assumptions and ethical implications.",
    ),
    Debater(
        name="Engineer",
        role=DebaterRole.SKEPTIC,
        persona="Demands practical feasibility. 'Will it actually work?'",
    ),
    Debater(
        name="Contrarian",
        role=DebaterRole.DEVILS_ADVOCATE,
        persona="Opposes whatever the majority thinks. Finds hidden problems.",
    ),
]

# Use in any mode
result = await executor.debate(topic, debaters=debaters)
result = await executor.research(topic, debaters=debaters)
result = await executor.redteam(topic, debaters=debaters)   # first = defender
result = await executor.peer_review(topic, debaters=debaters)  # first = author

Available roles:

  • PROPONENT — argues in favor
  • OPPONENT — argues against
  • SKEPTIC — demands evidence for everything
  • DEVILS_ADVOCATE — opposes the majority, reveals hidden problems
  • WILDCARD — unpredictable; attacks from unexpected angles

Cross-Model Debate

Du et al. showed that ChatGPT (14/20) + Bard (11/20) debating together scored 17/20 — better than either alone. agent-colosseum supports this natively:

from agent_colosseum import Debater, DebaterRole
from agent_colosseum.providers.anthropic import AnthropicProvider
from agent_colosseum.providers.openai import OpenAIProvider

debaters = [
    Debater(name="Claude", role=DebaterRole.PROPONENT, model="claude-sonnet-4-6"),
    Debater(name="GPT", role=DebaterRole.OPPONENT, model="gpt-4o"),
]

# Use MultiProvider to route by model name
# (see examples/cross_model_debate.py for full implementation)
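
A minimal router in that spirit (a sketch: the ModelRouter name, its prefix-based routing rule, and the symmetric provider constructors are illustrative assumptions, not the shipped MultiProvider):

from agent_colosseum.providers.base import LLMProvider, LLMResponse

class ModelRouter(LLMProvider):
    """Dispatch each generate() call to a backend chosen by model name."""

    def __init__(self) -> None:
        self.anthropic = AnthropicProvider(model="claude-sonnet-4-6")
        self.openai = OpenAIProvider(model="gpt-4o")

    async def generate(self, system, messages, *, max_tokens=800,
                       temperature=0.7, model=None) -> LLMResponse:
        backend = self.anthropic if (model or "").startswith("claude") else self.openai
        return await backend.generate(
            system, messages,
            max_tokens=max_tokens, temperature=temperature, model=model,
        )

executor = DebateExecutor(ModelRouter())
result = await executor.debate("Is Rust better than C++?", debaters=debaters)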

Configuration

from agent_colosseum import DebateConfig

config = DebateConfig(
    max_rounds=3,              # Du: 2-3 optimal, plateau at 4+
    disagreement_level=2,      # 1=mild, 2=strong (optimal), 3=extreme
    adaptive_break=True,       # Liang: judge can end early on consensus
    scoring_criteria=["logic", "evidence", "persuasion"],
    temperature=0.7,           # agent generation temperature
    judge_temperature=0.0,     # SOTOPIA: deterministic judge correlates r=0.71 with humans
    max_tokens_per_statement=800,
    max_tokens_final=2000,
)

executor = DebateExecutor(provider, config=config)

Why these defaults? Every default is backed by paper evidence:

  • max_rounds=3 — Du et al.: 2-3 rounds optimal, plateau at 4+
  • disagreement_level=2 — Liang et al.: level 2 optimal; level 3 devolves into "trying to win"
  • adaptive_break=True — Liang et al.: judge-based early termination is effective
  • judge_temperature=0.0 — SOTOPIA: deterministic scoring correlates r=0.71 with human judgment

Benchmarking

Built-in benchmarking to prove multi-agent > single agent with your own data:

from agent_colosseum.benchmark import BenchmarkRunner

runner = BenchmarkRunner(executor)

# Run on built-in datasets
result = await runner.run("reasoning", protocol="mad", max_items=5)
print(result.summary())

# Output:
# ═══ Benchmark: reasoning (protocol: mad) ═══
#
# ## Accuracy
#   Single Agent:  3/5 (60.0%)
#   Debate:        4/5 (80.0%)
#   Difference:    +20.0%
#
# ## Quality (0-10)
#   Single Agent:  6.2
#   Debate:        7.8
#   Difference:    +1.6
#
# ## Cost
#   Single tokens: 2,100
#   Debate tokens: 8,400
#   Ratio:         4.0x

Built-in datasets:

  • arithmetic — 10 items, exact-answer math (based on Du et al.)
  • reasoning — 8 items, counter-intuitive logic (based on Liang et al.'s CIAR)
  • factual — 8 items, fact verification (based on Du et al.'s MMLU/TruthfulQA)
  • opinion — 5 items, open-ended with no ground truth (quality evaluation only)
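
Sweeping datasets and protocols is a short loop away (a sketch reusing the runner from above; the names are the ones listed here):

for dataset in ("arithmetic", "reasoning", "factual"):
    for protocol in ("mad", "du"):
        result = await runner.run(dataset, protocol=protocol, max_items=5)
        print(result.summary())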

Quality metrics (SOTOPIA-inspired 7 dimensions):

  • depth (weight 1.0x) — argument depth: surface vs root cause
  • diversity (1.5x) — perspective diversity (Liang et al.: debate = 2.6x)
  • evidence (1.0x) — evidence specificity
  • logic (1.0x) — logical consistency
  • novelty (2.0x) — insights impossible for a single agent
  • concession (1.0x) — quality of concessions (intellectual honesty)
  • synthesis (2.0x) — whether the final conclusion beats the individual positions
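
To make the weighting concrete, here is how the seven dimensions would combine if the overall score is a weight-normalized mean (an assumption about metrics.py, shown only to illustrate the weights):

# Illustrative aggregation; assumes per-dimension scores on the 0-10 scale.
WEIGHTS = {
    "depth": 1.0, "diversity": 1.5, "evidence": 1.0, "logic": 1.0,
    "novelty": 2.0, "concession": 1.0, "synthesis": 2.0,
}

def overall_quality(scores: dict[str, float]) -> float:
    """Weighted mean of per-dimension scores."""
    return sum(scores[dim] * w for dim, w in WEIGHTS.items()) / sum(WEIGHTS.values())

Under these weights, novelty and synthesis carry 4.0 of the 9.5 total weight, roughly 42% of the score: the metric deliberately rewards outcomes a single agent could not reach.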

Real-Time Streaming

All modes support a callback for real-time statement streaming:

async def on_statement(stmt, round_num):
    label = "Synthesis" if round_num == 0 else f"R{round_num}"
    print(f"[{label}] {stmt.debater_name}: {stmt.content[:200]}")

result = await executor.debate(topic, on_statement=on_statement)

Custom LLM Providers

Implement the LLMProvider interface to use any LLM:

from agent_colosseum.providers.base import LLMProvider, LLMResponse

class OllamaProvider(LLMProvider):
    async def generate(
        self,
        system: str,
        messages: list[dict[str, str]],
        *,
        max_tokens: int = 800,
        temperature: float = 0.7,
        model: str | None = None,
    ) -> LLMResponse:
        # Your implementation here
        return LLMResponse(content="...", input_tokens=0, output_tokens=0)
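
Filled in for a local Ollama server, the stub might become the following (a sketch: it assumes Ollama's /api/chat endpoint with stream=False; the httpx dependency and default model name are choices made here, not library requirements):

import httpx

class OllamaProvider(LLMProvider):
    def __init__(self, base_url: str = "http://localhost:11434",
                 default_model: str = "llama3"):
        self.base_url = base_url
        self.default_model = default_model

    async def generate(self, system, messages, *, max_tokens=800,
                       temperature=0.7, model=None) -> LLMResponse:
        async with httpx.AsyncClient(timeout=120.0) as client:
            resp = await client.post(
                f"{self.base_url}/api/chat",
                json={
                    "model": model or self.default_model,
                    "messages": [{"role": "system", "content": system}, *messages],
                    "stream": False,
                    "options": {"temperature": temperature, "num_predict": max_tokens},
                },
            )
            resp.raise_for_status()
            data = resp.json()
        return LLMResponse(
            content=data["message"]["content"],
            input_tokens=data.get("prompt_eval_count", 0),  # tokens consumed by the prompt
            output_tokens=data.get("eval_count", 0),        # tokens generated
        )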

Built-in providers: AnthropicProvider, OpenAIProvider (also works with any OpenAI-compatible API via base_url).

Architecture

agent_colosseum/
├── models.py              # Debater, Statement, Round, DebateResult
├── prompts.py             # All prompts (English, provider-agnostic)
├── protocols.py           # Debate protocols (MAD, Du consensus, freestyle)
├── executor.py            # Main engine — debate/research/redteam/peer_review/compare
├── modes/
│   ├── research.py        # Independent investigation → cross-exam → synthesis
│   ├── redteam.py         # 1 defender + N attackers → hardened claim
│   └── peer_review.py     # 1 author → N reviewers → improved output
├── providers/
│   ├── base.py            # LLMProvider ABC
│   ├── anthropic.py       # Claude
│   └── openai.py          # GPT / OpenAI-compatible
└── benchmark/
    ├── datasets.py        # Built-in datasets (arithmetic, reasoning, factual, opinion)
    ├── metrics.py         # Accuracy + 7-dimension quality evaluation
    └── runner.py          # Single vs multi-agent comparison runner

Design Decisions

Every design choice in agent-colosseum is grounded in published research:

  • Default 2 agents — 2 debaters optimal; 3+ degrades due to context length (Liang et al.)
  • Default 3 rounds — 2-3 optimal; 4+ shows plateau or decline (Du et al.)
  • Diverse roles mandatory — same role = zero multi-agent benefit (Chan et al.)
  • Disagreement level 2 — level 3 devolves into "arguing to win" (Liang et al.)
  • One-by-one communication — sequential beats simultaneous: 60% vs 55% accuracy (Chan et al.)
  • Adaptive early termination — judge-based consensus detection saves cost (Liang et al.)
  • Deterministic judge (temperature=0) — correlates r=0.71 with human judgment (SOTOPIA)
  • Synthesis round — the debaters themselves reach the conclusion, not an external judge (novel to this framework)

References

Core Multi-Agent Debate

  • Du et al. (2023) — Improving Factuality and Reasoning in Language Models through Multiagent Debate
  • Liang et al. (EMNLP 2024) — Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate
  • Chan et al. (2023) — ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate
  • CAMEL (NeurIPS 2023) — Communicative Agents for "Mind" Exploration of Large Language Model Society

Agent Architecture & Evaluation

  • Zhou et al. (ICLR 2024) — SOTOPIA: Interactive Evaluation for Social Intelligence in Language Agents

Personality & Behavior

  • Park et al. (2023) — Generative Agents: Interactive Simulacra of Human Behavior

License

MIT
