# Arbiter
Formally verified multi-agent debates on academic papers.
Point Arbiter at any PDF — it extracts claims, finds contradictions, generates verified proofs (Knuckledragger + SymPy + Z3), designs specialist debate agents, and produces a structured verdict with every argument tracked.
Paper: Arbiter: Formally Verified Multi-Agent Debate for Research Claim Evaluation
Benchmark: 244/244 (100%) on miniF2F with 89.3% machine-verified proof certificates using Claude Sonnet 4.5 — vs 55.7% for MiniF2F-Dafny with the same model. See benchmark results for details.
## What it does
- Reads the paper — extracts every claim, assumption, proposition, and equation
- Finds the cracks — identifies contradictions, tensions, and Z3-encodable formal claims
- Designs the debate — creates specialist agents (e.g., Macroeconomist, TaxScholar, IO Theorist), gate rules, and a custom rubric
- Runs the debate — multi-round argumentation with real-time validity enforcement
- Delivers a verdict — multi-lab judge panel with scores, landed hits, and a structured argument map
## Case study: "The AI Layoff Trap"
We ran Arbiter on "The AI Layoff Trap" (Falk & Tsoukalas, 2026), which claims to prove that AI over-automation is inevitable and only a Pigouvian tax can fix it.
Init — 280 claims extracted: 17 propositions, 10 assumptions, and 7 policy claims identified.
Debate — 9 specialist agents (Proponent, Skeptic, IO Theorist, Macroeconomist, TaxScholar, LaborEconomist, PublicFinance, CausalInference, Generalist) across 3 providers (OpenAI gpt-5.4, Anthropic Claude Opus 4.6, xAI Grok) debated for 6 rounds.
Verdict — Skeptic wins 2-1. The paper's core theorem holds, but its policy claims overreach:
| Judge | Proponent | Skeptic | Verdict |
|---|---|---|---|
| Grok | 39 | 51 | Skeptic |
| OpenAI | 42 | 50 | Skeptic |
| Anthropic | 37 | 35 | Proponent |
Key findings:
- Core theorem (α_NE > α_CO when η<1, N>1) is mathematically correct — all 3 judges agree
- "Only a Pigouvian tax works" — conceded by Proponent (η-raising policies also work)
- "Boundless productivity" rhetoric — conceded (not implied by the formal model)
- 17 total Proponent concessions, 22 conceded hits, 141 total argument hits
- 0 gate violations across 54 turns, 0 mid-debate judge failures
Full outputs: experiments/ai_layoff_trap_v3/
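The panel verdict above can be reproduced from the score table with a simple majority vote. This is an illustrative sketch of the aggregation, not Arbiter's actual judging code:

```python
# Hypothetical sketch of multi-lab judge aggregation: each judge votes for the
# higher-scoring side, and the panel verdict is the majority. Scores are taken
# from the case-study table above.
from collections import Counter

panel = {
    "Grok": {"Proponent": 39, "Skeptic": 51},
    "OpenAI": {"Proponent": 42, "Skeptic": 50},
    "Anthropic": {"Proponent": 37, "Skeptic": 35},
}

votes = Counter(max(scores, key=scores.get) for scores in panel.values())
verdict, count = votes.most_common(1)[0]
print(f"{verdict} wins {count}-{len(panel) - count}")  # → Skeptic wins 2-1
```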
## Quickstart

```bash
# Install
pip install -e ".[all]"

# Set API keys
cp .env.example .env
# Edit .env with your keys (at minimum OPENAI_API_KEY)

# Generate a debate config from any PDF
arbiter init --from-pdf paper.pdf --output-dir my-debate/

# Run the debate
arbiter run my-debate/config.yaml

# Judge it
arbiter judge my-debate/output/debate_001.json

# Export the argument map
arbiter export my-debate/output/debate_001.json -f argdown
```
### Use multiple frontier models

```bash
arbiter init --from-pdf paper.pdf \
  --providers "openai:gpt-5.4,anthropic:claude-opus-4-6,grok:grok-4.20-0309-reasoning" \
  --effort high \
  --output-dir my-debate/
```
### Watch it live in the browser

```bash
arbiter web --init --from-pdf paper.pdf
# Opens http://localhost:8741 with live dashboard
```
## How it works

```text
arbiter init --from-pdf paper.pdf
 │
 ├─ 1.  PDF → Markdown → Chunked text
 ├─ 2.  Claim extraction (280 claims, tagged formal/logical/empirical)
 ├─ 2b. Formal model extraction (assumptions, propositions, equations, policies)
 ├─ 3.  Contradiction detection + key terms + attack angles
 ├─ 4.  Parallel generation:
 │       ├─ Z3 proof verification (proof checks, sensitivity, boundary, policy)
 │       ├─ Agent cast design (domain specialists per attack angle)
 │       ├─ Gate rules + escape-route anticipation
 │       ├─ Judge rubric (topic-specific scoring criteria)
 │       └─ Source corpus (synthesis + classification)
 ├─ 5.  Gate self-calibration
 └─ 6.  Config assembly → ready for `arbiter run`
```
## Features
- Agentic init from PDF — one command generates a complete debate config with agents, gate, Z3, rubric
- Z3 verification suite — proof verification, counterexample search, assumption sensitivity, boundary analysis, policy verification
- 7 built-in providers — OpenAI, Anthropic, Google Gemini, Grok, DeepSeek, Ollama, custom plugins
- LLM validity gate — per-turn logical hygiene via LLM classifier, 0 violations across 54 frontier-model turns
- Structured argument ledger — every hit tracked as open/conceded/rebutted/dodged
- Multi-lab judge panel — N judges from different providers with spread-flagging
- Live web dashboard — watch init and debate in real-time with argdown syntax highlighting
- KaTeX math rendering — LaTeX equations render in the dashboard
- Knuckledragger integration — optional Python proof assistant for formal verification
- Adversarial red-team mode — test the gate against deliberately evasive agents
- Argdown export — machine-readable argument maps
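The structured argument ledger above can be pictured as a small data structure. The following is a minimal sketch; field and method names are illustrative, not Arbiter's actual internals:

```python
# Hypothetical sketch of a structured argument ledger: every landed hit is
# tracked with one of the four statuses named in the feature list.
from dataclasses import dataclass, field
from enum import Enum

class HitStatus(Enum):
    OPEN = "open"
    CONCEDED = "conceded"
    REBUTTED = "rebutted"
    DODGED = "dodged"

@dataclass
class Hit:
    attacker: str
    target: str
    claim: str
    status: HitStatus = HitStatus.OPEN

@dataclass
class Ledger:
    hits: list = field(default_factory=list)

    def land(self, attacker: str, target: str, claim: str) -> None:
        self.hits.append(Hit(attacker, target, claim))

    def resolve(self, index: int, status: HitStatus) -> None:
        self.hits[index].status = status

    def open_hits(self) -> list:
        return [h for h in self.hits if h.status is HitStatus.OPEN]

ledger = Ledger()
ledger.land("Skeptic", "Proponent", "eta-raising policies also close the wedge")
ledger.resolve(0, HitStatus.CONCEDED)
print(len(ledger.open_hits()))  # → 0
```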
## CLI Commands

| Command | Description |
|---|---|
| `arbiter init --from-pdf paper.pdf` | Generate debate config from PDF |
| `arbiter init --topic "..."` | Generate config from topic description |
| `arbiter run config.yaml` | Run the debate |
| `arbiter judge output.json` | Run multi-lab judge panel |
| `arbiter export output.json -f argdown` | Export argument map |
| `arbiter web config.yaml` | Live dashboard for debate |
| `arbiter web --init --from-pdf paper.pdf` | Live dashboard for init + debate |
| `arbiter calibrate config.yaml --test-cases tests.yaml` | Calibrate validity gate |
| `arbiter validate config.yaml` | Validate config |
| `arbiter redteam config.yaml --target Proponent` | Red-team mode |
## Architecture

```text
arbiter init ─── PDF ──→ Claims ──→ Formal Model ──→ Contradictions
                                           │
                           ┌───────────────┼───────────────┐
                           ▼               ▼               ▼
                       Z3 Proofs     Agent Design     Gate Rules ──→ config.yaml

arbiter run ─── config.yaml ──→ Round ──→ Gate ──→ Ledger ──→ Mid-Judge ──→ Round
                                  ▲         │                     │
                                  │      Rewrite              converged?
                                  └─────────┘                     │
                                                              Steelman
                                                                  │
arbiter judge ─────────────────────────────────────────────── Judge Panel
                                                                  │
                                                          Verdict + Scores
```
See ARCHITECTURE.md for the full module map and design decisions.
## Formal Verification

Arbiter uses three verification backends, with an LLM selecting the appropriate one for each claim:
| Backend | What it handles | Proof certificate |
|---|---|---|
| Knuckledragger | Integer constraints, polynomial inequalities, divisibility, induction | kd.Proof object (tamper-proof, Z3-verified) |
| SymPy | Trig identities, recurrences, series, symbolic algebra | minimal_polynomial == x (rigorous algebraic zero) |
| Z3 (raw) | Counterexample search, optimization, boundary analysis | SAT model extraction |
### Benchmark results
| Benchmark | Sonnet 4.5 | gpt-5.4-mini |
|---|---|---|
| miniF2F (244 problems) | 100% proved, 89.3% certified | 88.9% proved, 74.2% certified |
| ProofNet (50 problems) | 100% proved, 94% certified | 88% proved, 66% certified |
| MATH Level 5 (50 problems) | 100% proved, 90% certified | 84% proved, 60% certified |
Comparison: MiniF2F-Dafny achieves 55.7% with the same Sonnet 4.5 model using Dafny proofs.
### Verification check types
| Check Type | What it does | Example |
|---|---|---|
| Proof verification | Encode assumptions + negated proposition, check UNSAT | "α_NE > α_CO" → PROVEN |
| Counterexample | Extract concrete values when proof fails | "At N=1, η=0.92: wedge = 0" |
| Sensitivity | Drop each assumption, find load-bearing ones | "N > 1 is LOAD-BEARING" |
| Boundary | Find where results flip | "Wedge positive iff η < 0.83" |
| Policy verification | Check if proposed policies achieve goals | "Pigouvian tax implements α_CO" |
## Configuration
Everything is a single YAML file. Key sections:
```yaml
topology: gated   # standard | gated | adversarial

providers:
  openai:
    model: gpt-5.4
    reasoning: { effort: high }
  anthropic:
    model: claude-opus-4-6
    thinking: { type: adaptive, effort: medium }

agents:
  Proponent: { provider: openai, side: Proponent, system_prompt: "..." }
  Skeptic: { provider: anthropic, side: Skeptic, system_prompt: "..." }

convergence:
  max_rounds: 6
  min_hits_addressed: 3   # agents must engage with open arguments

judge:
  panel:
    - { provider: openai }
    - { provider: anthropic }
    - { provider: grok }
```
See experiments/ai_layoff_trap_v3/config.yaml for a complete example.
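The convergence keys in the config imply a stopping rule along these lines. This is a hypothetical sketch of that rule, not Arbiter's actual implementation:

```python
# Hypothetical sketch of the convergence rule implied by the config above:
# stop at max_rounds, or earlier once a round engages fewer than
# min_hits_addressed open arguments.
def should_stop(round_num: int, hits_addressed_this_round: int,
                max_rounds: int = 6, min_hits_addressed: int = 3) -> bool:
    if round_num >= max_rounds:
        return True
    # Too little engagement with open arguments: treat the debate as converged.
    return hits_addressed_this_round < min_hits_addressed

print(should_stop(2, 1))  # → True: round 2 addressed only 1 open argument
print(should_stop(2, 5))  # → False: debate continues
```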
## Contributing

See CONTRIBUTING.md for setup, testing, and PR guidelines.

```bash
git clone https://github.com/vishk23/arbiter.git
cd arbiter
pip install -e ".[all]"
cp .env.example .env
pytest tests/ -v
```
## License
MIT
## File details

Details for the file arbiter_debate-0.1.0.tar.gz.

- Download URL: arbiter_debate-0.1.0.tar.gz
- Upload date:
- Size: 13.0 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.7

| Algorithm | Hash digest |
|---|---|
| SHA256 | `928b52964b6ce079ee3d3f658ede67a2f92c7f832d4c45e1733b2d0c0d556f7f` |
| MD5 | `477705c4985cc260907c812bf62a8420` |
| BLAKE2b-256 | `1842b8a1db3eabadf556ef1142cf02ac21932d6077cb53abaaa7ed3826efb300` |
Details for the file arbiter_debate-0.1.0-py3-none-any.whl.

- Download URL: arbiter_debate-0.1.0-py3-none-any.whl
- Upload date:
- Size: 148.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.7

| Algorithm | Hash digest |
|---|---|
| SHA256 | `6d448e2a216a63d027ba84fbb94503e65dca61a3ec30ab9dea957b93ca25f0a6` |
| MD5 | `fbfc8a1ae0dbce9c12c8954eb8cc5a99` |
| BLAKE2b-256 | `7e38219eef5d429717704b6d962ed6b59b52ac4c4e0cfb689b7596bd24dbb894` |