
Arbiter

Formally verified multi-agent debates on academic papers.

Point Arbiter at any PDF — it extracts claims, finds contradictions, generates verified proofs (Knuckledragger + SymPy + Z3), designs specialist debate agents, and produces a structured verdict with every argument tracked.

Paper: Arbiter: Formally Verified Multi-Agent Debate for Research Claim Evaluation

Benchmark: 244/244 (100%) on miniF2F with 89.3% machine-verified proof certificates using Claude Sonnet 4.5 — vs 55.7% for MiniF2F-Dafny with the same model. See benchmark results for details.

Arbiter argdown output showing argument ledger with syntax highlighting

What it does

  1. Reads the paper — extracts every claim, assumption, proposition, and equation
  2. Finds the cracks — identifies contradictions, tensions, and Z3-encodable formal claims
  3. Designs the debate — creates specialist agents (e.g., Macroeconomist, TaxScholar, IO Theorist), gate rules, and a custom rubric
  4. Runs the debate — multi-round argumentation with real-time validity enforcement
  5. Delivers a verdict — multi-lab judge panel with scores, landed hits, and a structured argument map

Case study: "The AI Layoff Trap"

We ran Arbiter on "The AI Layoff Trap" (Falk & Tsoukalas, 2026), which claims to prove that AI over-automation is inevitable and only a Pigouvian tax can fix it.

Init — 280 claims extracted, 17 propositions, 10 assumptions, 7 policy claims identified:

Argdown showing concessions and rebuttals with formal math

Debate — 9 specialist agents (Proponent, Skeptic, IO Theorist, Macroeconomist, TaxScholar, LaborEconomist, PublicFinance, CausalInference, Generalist) across 3 providers (OpenAI gpt-5.4, Anthropic Claude Opus 4.6, xAI Grok) debated for 6 rounds:

Dashboard showing completed 6-round debate with agent cards

Verdict — Skeptic wins 2-1. The paper's core theorem holds, but its policy claims overreach:

Judge       Proponent   Skeptic   Verdict
Grok            39         51     Skeptic
OpenAI          42         50     Skeptic
Anthropic       37         35     Proponent

Key findings:

  • Core theorem (α_NE > α_CO when η<1, N>1) is mathematically correct — all 3 judges agree
  • "Only a Pigouvian tax works" — conceded by Proponent (η-raising policies also work)
  • "Boundless productivity" rhetoric — conceded (not implied by the formal model)
  • 17 total Proponent concessions, 22 conceded hits, 141 total argument hits
  • 0 gate violations across 54 turns, 0 mid-debate judge failures

Full outputs: experiments/ai_layoff_trap_v3/

Quickstart

# Install
pip install -e ".[all]"

# Set API keys
cp .env.example .env
# Edit .env with your keys (at minimum OPENAI_API_KEY)

# Generate a debate config from any PDF
arbiter init --from-pdf paper.pdf --output-dir my-debate/

# Run the debate
arbiter run my-debate/config.yaml

# Judge it
arbiter judge my-debate/output/debate_001.json

# Export the argument map
arbiter export my-debate/output/debate_001.json -f argdown

Use multiple frontier models

arbiter init --from-pdf paper.pdf \
  --providers "openai:gpt-5.4,anthropic:claude-opus-4-6,grok:grok-4.20-0309-reasoning" \
  --effort high \
  --output-dir my-debate/

Watch it live in the browser

arbiter web --init --from-pdf paper.pdf
# Opens http://localhost:8741 with live dashboard

How it works

arbiter init --from-pdf paper.pdf
  │
  ├─ 1. PDF → Markdown → Chunked text
  ├─ 2. Claim extraction (280 claims, tagged formal/logical/empirical)
  ├─ 2b. Formal model extraction (assumptions, propositions, equations, policies)
  ├─ 3. Contradiction detection + key terms + attack angles
  ├─ 4. Parallel generation:
  │     ├─ Z3 proof verification (proof checks, sensitivity, boundary, policy)
  │     ├─ Agent cast design (domain specialists per attack angle)
  │     ├─ Gate rules + escape-route anticipation
  │     ├─ Judge rubric (topic-specific scoring criteria)
  │     └─ Source corpus (synthesis + classification)
  ├─ 5. Gate self-calibration
  └─ 6. Config assembly → ready for `arbiter run`
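
The step-4 fan-out can be sketched as a parallel dispatch over independent generation stages. A minimal sketch only; the stage functions and their return shapes are illustrative placeholders, not Arbiter's actual internals:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the four independent generation stages
# that run after contradiction detection (names are illustrative).
def z3_proofs(claims):    return {"stage": "z3", "n": len(claims)}
def agent_cast(claims):   return {"stage": "agents", "n": len(claims)}
def gate_rules(claims):   return {"stage": "gate", "n": len(claims)}
def judge_rubric(claims): return {"stage": "rubric", "n": len(claims)}

claims = ["P1", "P2", "P3"]
stages = [z3_proofs, agent_cast, gate_rules, judge_rubric]

# Since the stages share inputs but not outputs, they can fan out freely.
with ThreadPoolExecutor() as pool:
    futures = [pool.submit(stage, claims) for stage in stages]
    results = [f.result() for f in futures]
```

The outputs are then merged into the single config that `arbiter run` consumes.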

Features

  • Agentic init from PDF — one command generates a complete debate config with agents, gate, Z3, rubric
  • Z3 verification suite — proof verification, counterexample search, assumption sensitivity, boundary analysis, policy verification
  • 7 built-in providers — OpenAI, Anthropic, Google Gemini, Grok, DeepSeek, Ollama, custom plugins
  • LLM validity gate — per-turn logical hygiene via LLM classifier, 0 violations across 54 frontier-model turns
  • Structured argument ledger — every hit tracked as open/conceded/rebutted/dodged
  • Multi-lab judge panel — N judges from different providers with spread-flagging
  • Live web dashboard — watch init and debate in real-time with argdown syntax highlighting
  • KaTeX math rendering — LaTeX equations render in the dashboard
  • Knuckledragger integration — optional Python proof assistant for formal verification
  • Adversarial red-team mode — test the gate against deliberately evasive agents
  • Argdown export — machine-readable argument maps
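
The ledger's open/conceded/rebutted/dodged lifecycle can be pictured as a small data model. The class and field names below are hypothetical, chosen for illustration; Arbiter's actual schema may differ:

```python
from dataclasses import dataclass
from enum import Enum

# Lifecycle states named in the feature list above.
class HitStatus(Enum):
    OPEN = "open"
    CONCEDED = "conceded"
    REBUTTED = "rebutted"
    DODGED = "dodged"

# Hypothetical ledger entry: one tracked argument hit.
@dataclass
class Hit:
    attacker: str        # agent that landed the argument
    target_claim: str    # claim under attack
    status: HitStatus    # current lifecycle state
    round_landed: int    # debate round in which the hit landed

hit = Hit("Skeptic", "Only a Pigouvian tax works", HitStatus.CONCEDED, 3)
```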

CLI Commands

Command                                                Description
arbiter init --from-pdf paper.pdf                      Generate debate config from PDF
arbiter init --topic "..."                             Generate config from topic description
arbiter run config.yaml                                Run the debate
arbiter judge output.json                              Run multi-lab judge panel
arbiter export output.json -f argdown                  Export argument map
arbiter web config.yaml                                Live dashboard for debate
arbiter web --init --from-pdf paper.pdf                Live dashboard for init + debate
arbiter calibrate config.yaml --test-cases tests.yaml  Calibrate validity gate
arbiter validate config.yaml                           Validate config
arbiter redteam config.yaml --target Proponent         Red-team mode

Architecture

arbiter init ─── PDF ──→ Claims ──→ Formal Model ──→ Contradictions
                                         │
                         ┌───────────────┼───────────────┐
                         ▼               ▼               ▼
                      Z3 Proofs    Agent Design     Gate Rules ──→ config.yaml
                                                        
arbiter run ─── config.yaml ──→ Round ──→ Gate ──→ Ledger ──→ Mid-Judge ──→ Round
                                   ▲       │                                  │
                                   │    Rewrite                          converged?
                                   └───────┘                                  │
                                                                         Steelman
                                                                              │
arbiter judge ────────────────────────────────────────────────────── Judge Panel
                                                                         │
                                                                    Verdict + Scores

See ARCHITECTURE.md for the full module map and design decisions.

Formal Verification

Arbiter uses three verification backends, with the LLM selecting the right one per claim:

Backend         What it handles                                                        Proof certificate
Knuckledragger  Integer constraints, polynomial inequalities, divisibility, induction  kd.Proof object (tamper-proof, Z3-verified)
SymPy           Trig identities, recurrences, series, symbolic algebra                 minimal_polynomial == x (rigorous algebraic zero)
Z3 (raw)        Counterexample search, optimization, boundary analysis                 SAT model extraction
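
The SymPy backend's certificate idea can be sketched in a few lines. These identities are illustrative toys, not Arbiter's internal code:

```python
import sympy as sp

x = sp.symbols("x")

# Trig identity: reduce the difference of both sides to a symbolic zero.
ident = sp.simplify(sp.sin(x)**2 + sp.cos(x)**2 - 1)

# Algebraic-zero certificate: a value is rigorously zero exactly when
# its minimal polynomial is x itself. Here sqrt(5 + 2*sqrt(6)) equals
# sqrt(2) + sqrt(3), so the difference is an algebraic zero.
val = sp.sqrt(2) + sp.sqrt(3) - sp.sqrt(5 + 2*sp.sqrt(6))
poly = sp.minimal_polynomial(val, x)
```

Checking `minimal_polynomial == x` avoids trusting floating-point cancellation: the certificate is exact.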

Benchmark results

Benchmark                   Sonnet 4.5                     gpt-5.4-mini
miniF2F (244 problems)      100% proved, 89.3% certified   88.9% proved, 74.2% certified
ProofNet (50 problems)      100% proved, 94% certified     88% proved, 66% certified
MATH Level 5 (50 problems)  100% proved, 90% certified     84% proved, 60% certified

Comparison: MiniF2F-Dafny achieves 55.7% with the same Sonnet 4.5 model using Dafny proofs.

Verification check types

Check type           What it does                                           Example
Proof verification   Encode assumptions + negated proposition, check UNSAT  "α_NE > α_CO" → PROVEN
Counterexample       Extract concrete values when proof fails               "At N=1, η=0.92: wedge = 0"
Sensitivity          Drop each assumption, find load-bearing ones           "N > 1 is LOAD-BEARING"
Boundary             Find where results flip                                "Wedge positive iff η < 0.83"
Policy verification  Check if proposed policies achieve goals               "Pigouvian tax implements α_CO"

Configuration

Everything is a single YAML file. Key sections:

topology: gated              # standard | gated | adversarial
providers:
  openai:
    model: gpt-5.4
    reasoning: { effort: high }
  anthropic:
    model: claude-opus-4-6
    thinking: { type: adaptive, effort: medium }
agents:
  Proponent: { provider: openai, side: Proponent, system_prompt: "..." }
  Skeptic: { provider: anthropic, side: Skeptic, system_prompt: "..." }
convergence:
  max_rounds: 6
  min_hits_addressed: 3      # agents must engage with open arguments
judge:
  panel:
    - { provider: openai }
    - { provider: anthropic }
    - { provider: grok }

See experiments/ai_layoff_trap_v3/config.yaml for a complete example.

Contributing

See CONTRIBUTING.md for setup, testing, and PR guidelines.

git clone https://github.com/vishk23/arbiter.git
cd arbiter
pip install -e ".[all]"
cp .env.example .env
pytest tests/ -v

License

MIT
