
agent-colosseum

Make AI agents fight, research, and review each other — then prove the result is better than a single agent.

agent-colosseum is a Python framework for running multi-agent interactions (debate, collaborative research, red-teaming, peer review) and benchmarking them against single-agent baselines. It is provider-agnostic (Claude, GPT, Gemini, local models) and ships with built-in datasets and evaluation metrics.

Why?

A single LLM has blind spots. It commits to an answer too early, hallucinates confidently, and can't challenge its own reasoning. Multiple agents attacking, questioning, and building on each other's work produce measurably better results.

This isn't theory. The papers prove it:

  • Du et al. (2023) — GSM8K: 77% → 85%; MMLU: 63.9% → 71.1%
  • Liang et al. (EMNLP 2024) — GPT-3.5 with debate beats GPT-4 alone; 2.6x increase in thought diversity
  • Chan et al. (2023) — same-role agents yield zero multi-agent benefit; diverse roles required
  • CAMEL (NeurIPS 2023) — multi-agent wins 76.3% vs single-agent 10.4%
  • Park et al. (2023) — full architecture (memory + reflection): +41% believability

agent-colosseum lets you reproduce these results, run your own experiments, and use multi-agent patterns in production.

Installation

pip install agent-colosseum              # core only
pip install agent-colosseum[anthropic]   # + Claude support
pip install agent-colosseum[openai]      # + GPT support
pip install agent-colosseum[all]         # everything

Quick Start

import asyncio
from agent_colosseum import DebateExecutor
from agent_colosseum.providers.anthropic import AnthropicProvider

async def main():
    provider = AnthropicProvider(model="claude-sonnet-4-6")
    executor = DebateExecutor(provider)

    # Run a debate
    result = await executor.debate("Should AI models be open-sourced without restrictions?")
    print(result.final_answer)
    print(f"Confidence: {result.confidence}")
    print(f"Key arguments: {result.key_arguments}")

    # Compare: single agent vs debate
    single, debate = await executor.compare("Is Rust better than C++?")
    print(f"Single: {single.final_answer[:200]}")
    print(f"Debate: {debate.final_answer[:200]}")
    print(f"Token cost: {debate.total_tokens / single.total_tokens:.1f}x")

asyncio.run(main())

Four Arena Modes

1. Debate — Agents Fight, Then Reach Consensus

Agents take opposing sides, attack each other's arguments, and are forced to synthesize a conclusion. Based on MAD (Liang et al.) and Du et al.

result = await executor.debate("Is remote work more productive than office work?")

Three debate protocols:

  • mad (default) — Pro vs Con plus a judge; tit-for-tat forced opposition. Best for factual questions and reasoning.
  • du — N agents answer independently → share → update → converge. Best for math and well-defined problems.
  • freestyle — free-form argument, no forced sides. Best for open-ended topics.

# Du-style consensus (no judge, natural convergence)
result = await executor.debate(topic, protocol="du")

# Freestyle with custom agents
result = await executor.debate(topic, protocol="freestyle", debaters=my_agents)

How it works:

Rounds 1-3: Fight (argument → rebuttal → concession → challenge)
    ↓
Synthesis Round: Each debater states what they concede and what they won't
    ↓  
Final Integration: Merge synthesis statements into unified conclusion
    ↓
Result: final_answer + agreed_points + unresolved issues + insight
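
The result object mirrors this flow. A minimal inspection sketch (final_answer and confidence appear in the Quick Start; agreed_points and unresolved are field names read off the diagram above, so treat them as assumptions):

result = await executor.debate("Is remote work more productive than office work?")
print(result.final_answer)              # unified conclusion from the final integration
print(f"Confidence: {result.confidence}")
for point in result.agreed_points:      # concessions both sides accepted
    print("agreed:", point)
for issue in result.unresolved:         # disagreements that survived synthesis
    print("open:  ", issue)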

2. Research — Independent Investigation, Cross-Examination, Synthesis

Each agent investigates from their specialty (empiricist, theorist, methodologist), then they cross-examine each other's findings before synthesizing.

result = await executor.research("Are LLM agents production-ready in 2026?")

How it works:

Phase 1 (Independent): Each researcher analyzes from their specialty
Phase 2 (Cross-exam):  Read each other's work, challenge methodology
Phase 3 (Synthesis):   Confirmed findings + probable hypotheses + unresolved

Output includes:

  • confirmed_findings — all researchers agree
  • probable_hypotheses — most agree, needs verification
  • unresolved — researchers disagree
  • further_research — suggested next steps
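
A sketch of consuming these fields, assuming each is exposed as a plain list of strings on the result object:

result = await executor.research("Are LLM agents production-ready in 2026?")
for finding in result.confirmed_findings:      # all researchers agree
    print("CONFIRMED:", finding)
for hypothesis in result.probable_hypotheses:  # majority view, needs verification
    print("PROBABLE: ", hypothesis)
for step in result.further_research:           # suggested next steps
    print("NEXT:     ", step)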

3. Red Team — One Defends, Others Attack

One agent presents a claim. The others attack it from different angles (logic, evidence, edge cases, practicality, assumptions). The defender absorbs valid attacks and strengthens the claim.

result = await executor.redteam("Microservices are better than monoliths")

How it works:

Phase 1: Defender presents initial claim
Phase 2: Attackers hit from different angles → Defender revises (repeat)
Phase 3: Compare original vs hardened claim

Output includes:

  • original_weaknesses — weaknesses found in the initial claim
  • improvements — how the claim got stronger
  • surviving_weaknesses — weaknesses that couldn't be fixed
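
Reading the outcome might look like this (a sketch; it assumes the fields above are lists of strings and that final_answer carries the hardened claim):

result = await executor.redteam("Microservices are better than monoliths")
print(result.final_answer[:300])              # the hardened claim
print(f"{len(result.improvements)} improvements made under attack")
for weakness in result.surviving_weaknesses:  # caveats the defender could not fix
    print("STILL OPEN:", weakness)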

4. Peer Review — Present, Review, Revise

One agent presents an analysis. The others review it (strengths, weaknesses, questions, suggestions), and the author revises based on the feedback.

result = await executor.peer_review("What caused the 2008 financial crisis?")

How it works:

Phase 1: Author presents structured analysis
Phase 2: Reviewers critique from different angles → Author revises (repeat)
Phase 3: Final reviewed output
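
One way to see the revision effect is to capture statements as they stream and compare the first draft with the final output (a sketch using the on_statement callback described under Real-Time Streaming below; treating the first captured statement as the author's initial draft is an assumption):

drafts = []

async def capture(stmt, round_num):
    drafts.append(stmt)

result = await executor.peer_review(
    "What caused the 2008 financial crisis?", on_statement=capture
)
print("Initial draft:", drafts[0].content[:200])    # Phase 1: author's first analysis
print("Final version:", result.final_answer[:200])  # Phase 3: after review rounds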

Custom Agents

Every mode accepts custom agents. You control how many agents run, their roles and personas, and even which LLM each agent uses.

from agent_colosseum import Debater, DebaterRole

debaters = [
    Debater(
        name="DataScientist",
        role=DebaterRole.PROPONENT,
        persona="Relies on statistical evidence and empirical data.",
        model="claude-sonnet-4-6",  # optional per-agent model
    ),
    Debater(
        name="Philosopher",
        role=DebaterRole.OPPONENT,
        persona="Questions fundamental assumptions and ethical implications.",
    ),
    Debater(
        name="Engineer",
        role=DebaterRole.SKEPTIC,
        persona="Demands practical feasibility. 'Will it actually work?'",
    ),
    Debater(
        name="Contrarian",
        role=DebaterRole.DEVILS_ADVOCATE,
        persona="Opposes whatever the majority thinks. Finds hidden problems.",
    ),
]

# Use in any mode
result = await executor.debate(topic, debaters=debaters)
result = await executor.research(topic, debaters=debaters)
result = await executor.redteam(topic, debaters=debaters)   # first = defender
result = await executor.peer_review(topic, debaters=debaters)  # first = author

Available roles:

  • PROPONENT — argues in favor
  • OPPONENT — argues against
  • SKEPTIC — demands evidence for everything
  • DEVILS_ADVOCATE — opposes the majority, reveals hidden problems
  • WILDCARD — unpredictable; attacks from unexpected angles

Cross-Model Debate

Du et al. showed that ChatGPT (14/20) + Bard (11/20) debating together scored 17/20 — better than either alone. agent-colosseum supports this natively:

from agent_colosseum import Debater, DebaterRole
from agent_colosseum.providers.anthropic import AnthropicProvider
from agent_colosseum.providers.openai import OpenAIProvider

debaters = [
    Debater(name="Claude", role=DebaterRole.PROPONENT, model="claude-sonnet-4-6"),
    Debater(name="GPT", role=DebaterRole.OPPONENT, model="gpt-4o"),
]

# Use MultiProvider to route by model name
# (see examples/cross_model_debate.py for full implementation)
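
A minimal router in that spirit (a sketch: the ModelRouter name, its prefix-based routing rule, and the symmetric provider constructors are illustrative assumptions, not the shipped MultiProvider):

from agent_colosseum.providers.base import LLMProvider, LLMResponse

class ModelRouter(LLMProvider):
    """Dispatch each generate() call to a backend chosen by model name."""

    def __init__(self) -> None:
        self.anthropic = AnthropicProvider(model="claude-sonnet-4-6")
        self.openai = OpenAIProvider(model="gpt-4o")

    async def generate(self, system, messages, *, max_tokens=800,
                       temperature=0.7, model=None) -> LLMResponse:
        backend = self.anthropic if (model or "").startswith("claude") else self.openai
        return await backend.generate(
            system, messages,
            max_tokens=max_tokens, temperature=temperature, model=model,
        )

executor = DebateExecutor(ModelRouter())
result = await executor.debate("Is Rust better than C++?", debaters=debaters)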

Configuration

from agent_colosseum import DebateConfig

config = DebateConfig(
    max_rounds=3,              # Du: 2-3 optimal, plateau at 4+
    disagreement_level=2,      # 1=mild, 2=strong (optimal), 3=extreme
    adaptive_break=True,       # Liang: judge can end early on consensus
    scoring_criteria=["logic", "evidence", "persuasion"],
    temperature=0.7,           # agent generation temperature
    judge_temperature=0.0,     # SOTOPIA: deterministic judge correlates r=0.71 with humans
    max_tokens_per_statement=800,
    max_tokens_final=2000,
)

executor = DebateExecutor(provider, config=config)

Why these defaults? Every default is backed by paper evidence:

  • max_rounds=3 — Du et al.: 2-3 rounds optimal, plateau at 4+
  • disagreement_level=2 — Liang et al.: level 2 optimal; level 3 devolves into "trying to win"
  • adaptive_break=True — Liang et al.: judge-based early termination is effective
  • judge_temperature=0.0 — SOTOPIA: deterministic scoring correlates r=0.71 with human judgment

Benchmarking

Built-in benchmarking to prove multi-agent > single agent with your own data:

from agent_colosseum.benchmark import BenchmarkRunner

runner = BenchmarkRunner(executor)

# Run on built-in datasets
result = await runner.run("reasoning", protocol="mad", max_items=5)
print(result.summary())

# Output:
# ═══ Benchmark: reasoning (protocol: mad) ═══
#
# ## Accuracy
#   Single Agent:  3/5 (60.0%)
#   Debate:        4/5 (80.0%)
#   Difference:    +20.0%
#
# ## Quality (0-10)
#   Single Agent:  6.2
#   Debate:        7.8
#   Difference:    +1.6
#
# ## Cost
#   Single tokens: 2,100
#   Debate tokens: 8,400
#   Ratio:         4.0x

Built-in datasets:

  • arithmetic — 10 items, exact-answer math (based on Du et al.)
  • reasoning — 8 items, counter-intuitive logic (based on Liang et al.'s CIAR)
  • factual — 8 items, fact verification (based on Du et al.'s MMLU/TruthfulQA)
  • opinion — 5 items, open-ended with no ground truth (quality evaluation only)
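
Sweeping datasets and protocols is a short loop away (a sketch reusing the runner from above; the names are the ones listed here):

for dataset in ("arithmetic", "reasoning", "factual"):
    for protocol in ("mad", "du"):
        result = await runner.run(dataset, protocol=protocol, max_items=5)
        print(result.summary())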

Quality metrics (SOTOPIA-inspired 7 dimensions):

  • depth (weight 1.0x) — argument depth: surface vs root cause
  • diversity (1.5x) — perspective diversity (Liang et al.: debate = 2.6x)
  • evidence (1.0x) — evidence specificity
  • logic (1.0x) — logical consistency
  • novelty (2.0x) — insights impossible for a single agent
  • concession (1.0x) — quality of concessions (intellectual honesty)
  • synthesis (2.0x) — whether the final conclusion beats the individual positions
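
To make the weighting concrete, here is how the seven dimensions would combine if the overall score is a weight-normalized mean (an assumption about metrics.py, shown only to illustrate the weights):

# Illustrative aggregation; assumes per-dimension scores on the 0-10 scale.
WEIGHTS = {
    "depth": 1.0, "diversity": 1.5, "evidence": 1.0, "logic": 1.0,
    "novelty": 2.0, "concession": 1.0, "synthesis": 2.0,
}

def overall_quality(scores: dict[str, float]) -> float:
    """Weighted mean of per-dimension scores."""
    return sum(scores[dim] * w for dim, w in WEIGHTS.items()) / sum(WEIGHTS.values())

Under these weights, novelty and synthesis carry 4.0 of the 9.5 total weight, roughly 42% of the score: the metric deliberately rewards outcomes a single agent could not reach.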

Real-Time Streaming

All modes support a callback for real-time statement streaming:

async def on_statement(stmt, round_num):
    label = "Synthesis" if round_num == 0 else f"R{round_num}"
    print(f"[{label}] {stmt.debater_name}: {stmt.content[:200]}")

result = await executor.debate(topic, on_statement=on_statement)

Custom LLM Providers

Implement the LLMProvider interface to use any LLM:

from agent_colosseum.providers.base import LLMProvider, LLMResponse

class OllamaProvider(LLMProvider):
    async def generate(
        self,
        system: str,
        messages: list[dict[str, str]],
        *,
        max_tokens: int = 800,
        temperature: float = 0.7,
        model: str | None = None,
    ) -> LLMResponse:
        # Your implementation here
        return LLMResponse(content="...", input_tokens=0, output_tokens=0)
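
Filled in for a local Ollama server, the stub might become the following (a sketch: it assumes Ollama's /api/chat endpoint with stream=False; the httpx dependency and default model name are choices made here, not library requirements):

import httpx

class OllamaProvider(LLMProvider):
    def __init__(self, base_url: str = "http://localhost:11434",
                 default_model: str = "llama3"):
        self.base_url = base_url
        self.default_model = default_model

    async def generate(self, system, messages, *, max_tokens=800,
                       temperature=0.7, model=None) -> LLMResponse:
        async with httpx.AsyncClient(timeout=120.0) as client:
            resp = await client.post(
                f"{self.base_url}/api/chat",
                json={
                    "model": model or self.default_model,
                    "messages": [{"role": "system", "content": system}, *messages],
                    "stream": False,
                    "options": {"temperature": temperature, "num_predict": max_tokens},
                },
            )
            resp.raise_for_status()
            data = resp.json()
        return LLMResponse(
            content=data["message"]["content"],
            input_tokens=data.get("prompt_eval_count", 0),  # tokens consumed by the prompt
            output_tokens=data.get("eval_count", 0),        # tokens generated
        )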

Built-in providers: AnthropicProvider, OpenAIProvider (also works with any OpenAI-compatible API via base_url).

Architecture

agent_colosseum/
├── models.py              # Debater, Statement, Round, DebateResult
├── prompts.py             # All prompts (English, provider-agnostic)
├── protocols.py           # Debate protocols (MAD, Du consensus, freestyle)
├── executor.py            # Main engine — debate/research/redteam/peer_review/compare
├── modes/
│   ├── research.py        # Independent investigation → cross-exam → synthesis
│   ├── redteam.py         # 1 defender + N attackers → hardened claim
│   └── peer_review.py     # 1 author → N reviewers → improved output
├── providers/
│   ├── base.py            # LLMProvider ABC
│   ├── anthropic.py       # Claude
│   └── openai.py          # GPT / OpenAI-compatible
└── benchmark/
    ├── datasets.py        # Built-in datasets (arithmetic, reasoning, factual, opinion)
    ├── metrics.py         # Accuracy + 7-dimension quality evaluation
    └── runner.py          # Single vs multi-agent comparison runner

Design Decisions

Every design choice in agent-colosseum is grounded in published research:

  • Default 2 agents — 2 debaters optimal; 3+ degrades due to context length (Liang et al.)
  • Default 3 rounds — 2-3 optimal; 4+ shows plateau or decline (Du et al.)
  • Diverse roles mandatory — same role = zero multi-agent benefit (Chan et al.)
  • Disagreement level 2 — level 3 devolves into "arguing to win" (Liang et al.)
  • One-by-one communication — sequential beats simultaneous: 60% vs 55% accuracy (Chan et al.)
  • Adaptive early termination — judge-based consensus detection saves cost (Liang et al.)
  • Deterministic judge (temperature=0) — correlates r=0.71 with human judgment (SOTOPIA)
  • Synthesis round — the debaters themselves reach the conclusion, not an external judge (novel to this framework)

References

Core Multi-Agent Debate

  • Du et al. (2023) — Improving Factuality and Reasoning in Language Models through Multiagent Debate
  • Liang et al. (EMNLP 2024) — Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate
  • Chan et al. (2023) — ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate
  • CAMEL (NeurIPS 2023) — Communicative Agents for "Mind" Exploration of Large Language Model Society

Agent Architecture & Evaluation

  • Zhou et al. (ICLR 2024) — SOTOPIA: Interactive Evaluation for Social Intelligence in Language Agents

Personality & Behavior

  • Park et al. (2023) — Generative Agents: Interactive Simulacra of Human Behavior

License

MIT
