Multi-LLM debate engine — verdicts everywhere

verd

Five minds enter. They argue, challenge, cross-examine. Only the truth walks out.

verd spawns multiple AI models from different families, each with a specialized role, and has them debate your question over several rounds. A stronger judge model then delivers the final verdict with strengths, issues, and actionable fixes.

Use it everywhere: CLI for code reviews, MCP inside Claude Code and Cursor, Slack as @verd in any conversation, or pipe anything into it.

Install

pip install verd

Setup

verd talks to any OpenAI-compatible API. Set two env vars (or put them in a .env file):

export OPENAI_API_KEY=your-key
export OPENAI_BASE_URL=https://openrouter.ai/api/v1  # or any compatible endpoint

Works with OpenRouter (easiest — all models, one key), direct OpenAI, LiteLLM proxy, Azure, Together, Groq, etc.

Edit verd/models.py to customize which models debate and their roles.
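
Before starting a long debate, it can help to confirm that both variables from the setup step are present. The following is a minimal pre-flight sketch, not part of verd; the helper name missing_settings is hypothetical. It only inspects the process environment, so values kept solely in a .env file will be reported as missing.

```python
import os

# The two variables named in the setup instructions.
REQUIRED_VARS = ("OPENAI_API_KEY", "OPENAI_BASE_URL")


def missing_settings(env=None):
    """Return the required variable names that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_settings()
    if missing:
        print("Set these before running verd:", ", ".join(missing))
    else:
        print("Environment looks ready for verd.")
```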

Usage

# Auto-scan current directory
cd backend && verd "is this production-ready?"

# Single file
verd "is this JWT implementation secure?" -f auth.py

# Multiple files
verd "any issues?" -f auth.py middleware.py routes.py

# Directory with smart file selection
verd "is this codebase sound?" -d src/ --ext .py

# Full codebase review (no smart selection, scans everything)
verdh "full security audit" -d . -a

# Inline question
verdl "is O(n^2) acceptable for n=1000?"

# Git diffs
verd "are these changes safe?" -g              # unstaged
verd "ready to commit?" -gs                    # staged
verdh "should we merge this?" -gb main         # branch diff

# Pipe
cat auth.py | verd "is this secure?"

# Quiet mode (verdict only, no transcript)
verd "any bugs?" -f app.py -q

# JSON output
verd "any bugs?" -f app.py --json
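
The --json flag makes verd usable from scripts. The sketch below wraps the CLI from Python using only the flags shown above (-f and --json). The function names are hypothetical, and two things are assumptions: that stdout contains nothing but the JSON document, and that the caller wants to handle exit codes itself. The JSON schema is not documented in this README, so the parsed result is returned without indexing into assumed field names.

```python
import json
import subprocess


def build_verd_command(claim, files=(), mode="verd"):
    """Build an argv list for verdl, verd, or verdh with JSON output."""
    if mode not in ("verdl", "verd", "verdh"):
        raise ValueError(f"unknown mode: {mode}")
    cmd = [mode, claim]
    if files:
        # Without -f, verd auto-scans the current directory.
        cmd += ["-f", *files]
    cmd.append("--json")
    return cmd


def run_verd(claim, files=(), mode="verd"):
    """Run verd and return its parsed JSON output as a Python object.

    Assumption: stdout holds only the JSON document. The exit code is
    not checked here because its meaning is not documented.
    """
    result = subprocess.run(
        build_verd_command(claim, files, mode),
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

For example, build_verd_command("any bugs?", ["app.py"]) produces the same invocation as the JSON output example above.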

Modes

Command  Debaters         Rounds  Speed  Cost    Roles
verdl    2 + judge        1       ~10s   ~$0.01  analyst, devils_advocate
verd     4 + judge        2       ~30s   ~$0.05  analyst, devils_advocate, logic_checker, pragmatist
verdh    5 + judge + web  3       ~70s   ~$0.30  analyst, devils_advocate, logic_checker, fact_checker, pragmatist
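
One way to read the modes table is as a budget trade-off: pick the deepest mode that still fits the time and money available. The sketch below encodes the approximate figures from the table; pick_mode is a hypothetical helper, not something verd ships, and real runs will vary from these estimates.

```python
# Approximate figures copied from the modes table; real runs will vary.
MODES = [
    # (command, debaters, rounds, approx_seconds, approx_cost_usd)
    ("verdl", 2, 1, 10, 0.01),
    ("verd", 4, 2, 30, 0.05),
    ("verdh", 5, 3, 70, 0.30),
]


def pick_mode(max_seconds, max_cost_usd):
    """Return the deepest mode within both budgets, or None if none fits."""
    chosen = None
    for command, _debaters, _rounds, seconds, cost in MODES:
        if seconds <= max_seconds and cost <= max_cost_usd:
            chosen = command  # MODES is ordered from shallow to deep
    return chosen


print(pick_mode(max_seconds=45, max_cost_usd=0.10))  # verd
```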

Roles

Each model in the debate gets a specialized role:

Role             Job                                                                   Example catch
analyst          Balanced initial assessment, main arguments for and against           "The architecture is sound but the auth flow has a gap"
devils_advocate  Find what others miss: edge cases, hidden assumptions, failure modes  "What happens when the token expires mid-transaction?"
logic_checker    Verify reasoning quality: fallacies, off-by-one, race conditions      "The pagination math is wrong: total_pages needs ceil division"
fact_checker     Web-grounded verification: do these APIs/libraries actually work?     "That library was deprecated in v3, use the new API"
pragmatist       Real-world practicality: will this ship? What's the ops burden?       "This works but needs 3 new infra dependencies your team doesn't know"

The judge weighs each reviewer's input by role — a fact_checker citing sources carries more weight than a devils_advocate pushing back.

Output

verd shows what makes multi-model debate valuable:

FAIL  77%  In-memory rate limiter is unsafe for production

claude:FAIL  gpt:FAIL  gemini:FAIL  gpt:FAIL  (FULL)

+ Conceptually correct sliding-window logic
+ Old timestamps pruned on every call

- Global dict is unsynchronized — race conditions in multi-thread servers
- State resets on restart, multiplied across horizontally scaled instances
- Per-user lists grow without bounds — memory leak / DoS vector

! gpt-5-mini caught the risk of system clock jumps with time.time()
! gpt-4.1 highlighted the O(N) per-request performance cost

→ Move state to Redis with atomic operations
→ Use time.monotonic() for interval calculations
→ Add TTL/eviction for inactive user keys

completed in 69.3s • 22,449 tokens • ~$0.07

Each verdict includes:
  • Vote breakdown — who voted what, at a glance
  • Unique catches (!) — what each model uniquely spotted that others missed
  • Dissent — who disagreed, what they argued, and why it matters
  • Confidence — calculated from vote distribution weighted by role, not judge vibes
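
To make "vote distribution weighted by role" concrete, here is a minimal sketch of one possible weighting scheme. It is not verd's formula, which this README does not spell out and which evidently includes other factors (the sample above reports 77% on a unanimous vote). The weights and the function name are hypothetical, chosen only so that a fact_checker counts for more than a devils_advocate.

```python
# Hypothetical role weights: NOT verd's real values. They only illustrate
# a fact_checker counting for more than a devils_advocate.
ROLE_WEIGHTS = {
    "analyst": 1.0,
    "devils_advocate": 0.5,
    "logic_checker": 1.0,
    "fact_checker": 1.5,
    "pragmatist": 1.0,
}


def weighted_confidence(votes, verdict):
    """Share of role weight that agrees with the verdict, from 0.0 to 1.0.

    votes maps a role name to that reviewer's vote, e.g. "PASS" or "FAIL".
    Roles missing from ROLE_WEIGHTS raise KeyError.
    """
    total = sum(ROLE_WEIGHTS[role] for role in votes)
    if total == 0:
        return 0.0
    agreeing = sum(
        ROLE_WEIGHTS[role] for role, vote in votes.items() if vote == verdict
    )
    return agreeing / total


base = {role: "FAIL" for role in ROLE_WEIGHTS}
print(weighted_confidence({**base, "devils_advocate": "PASS"}, "FAIL"))
print(weighted_confidence({**base, "fact_checker": "PASS"}, "FAIL"))
```

Under these assumed weights, a lone devils_advocate dissent leaves confidence at 0.9, while a lone fact_checker dissent drops it to 0.7.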

Flags

claim                     the question to evaluate (required)

Content input (pick one, or auto-scans current dir):
  -c, --context TEXT      inline content string
  -f FILE [FILE ...]      one or more files
  -d [DIR]                directory (default: current dir)
  -g, --git               unstaged git diff
  -gs, --git-staged       staged git diff
  -gb, --git-branch REF   git diff REF...HEAD

Directory filters (use with -d):
  -a, --all               scan all files, skip smart selection
  --ext EXT [EXT ...]     filter by extension (.py .ts)
  --exclude PAT [PAT ...] glob patterns to exclude (test_*)

Output:
  -q, --quiet             hide debate transcript, show only verdict
  --json                  raw JSON output
  --timeout SECONDS       override timeout per model call
  --version               show version

MCP — Claude Code / Cursor

Add to ~/.claude.json or ~/.cursor/mcp.json:

{
  "mcpServers": {
    "verd": {
      "command": "verd-mcp",
      "env": {
        "OPENAI_API_KEY": "your-key",
        "OPENAI_BASE_URL": "https://openrouter.ai/api/v1"
      }
    }
  }
}

Then use verd, verdl, or verdh as tools directly in chat. Ask a question, paste code, then say "use verd to check this."
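
If the config file already lists other MCP servers, the verd entry can be merged in rather than pasted by hand. The sketch below builds the same entry shown in the JSON above and leaves existing servers untouched. Both function names are hypothetical, and the script assumes the file is plain JSON with a top-level mcpServers object.

```python
import json
from pathlib import Path


def add_verd_server(config, api_key, base_url="https://openrouter.ai/api/v1"):
    """Return a copy of an MCP config dict with the verd entry added."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers["verd"] = {
        "command": "verd-mcp",
        "env": {"OPENAI_API_KEY": api_key, "OPENAI_BASE_URL": base_url},
    }
    merged["mcpServers"] = servers
    return merged


def update_config_file(path, api_key):
    """Read, merge, and rewrite a config file such as ~/.cursor/mcp.json."""
    path = Path(path).expanduser()
    existing = json.loads(path.read_text()) if path.exists() else {}
    updated = add_verd_server(existing, api_key)
    path.write_text(json.dumps(updated, indent=2) + "\n")
```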

Slack

Install with Slack dependencies:

pip install "verd[slack]"

Create a Slack app with Socket Mode enabled, add bot scopes (app_mentions:read, channels:history, groups:history, chat:write, reactions:write, im:history, im:write, users:read), then:

export SLACK_BOT_TOKEN=xoxb-...
export SLACK_APP_TOKEN=xapp-...
export SLACK_SIGNING_SECRET=...
verd-slack

Usage in Slack:

  • @verd what do you think? — reads thread or last 20 channel messages, debates, replies in thread
  • @verd deep is this secure? — uses verdh (5 models + web search)
  • @verd quick is this right? — uses verdl (fast, 2 models)
  • @verd last 50 what's the consensus? — reads last 50 messages as context
  • /verd should we use Kafka? — slash command with live progress updates
  • /verdl is this correct? — quick slash command
  • /verdh any security issues? — deep slash command

Optional: restrict access via environment variables:

export VERD_ALLOWED_CHANNELS=C123,C456    # empty = all channels
export VERD_ALLOWED_USERS=U123,U456       # empty = all users
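
The "empty = all" rule can be sketched as follows. This is an illustration of the documented behavior, not verd's actual code: the function names are hypothetical, and requiring both the channel check and the user check to pass is an assumption.

```python
import os


def parse_allowlist(raw):
    """Split a comma-separated ID list; None or blank yields an empty set."""
    return {item.strip() for item in (raw or "").split(",") if item.strip()}


def is_allowed(channel_id, user_id, env=None):
    """Apply the documented rule: an empty list means no restriction.

    Assumption: a request must pass both the channel and the user check.
    """
    env = os.environ if env is None else env
    channels = parse_allowlist(env.get("VERD_ALLOWED_CHANNELS"))
    users = parse_allowlist(env.get("VERD_ALLOWED_USERS"))
    channel_ok = not channels or channel_id in channels
    user_ok = not users or user_id in users
    return channel_ok and user_ok
```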

How it works

  1. Your question and content are sent to multiple AI models in parallel
  2. Each model has a specialized role (analyst, devils_advocate, logic_checker, fact_checker, pragmatist)
  3. Models see each other's responses and cross-examine for 1-3 rounds
  4. Anti-groupthink prompts ensure models hold their ground when they have evidence — consensus without new evidence is rejected
  5. A stronger judge model synthesizes the debate, weighting each reviewer by their role
  6. Confidence is calculated from vote distribution — a fact_checker's dissent lowers confidence more than a devils_advocate's expected pushback
  7. You get: verdict, vote breakdown, strengths, issues, unique catches, dissent, and actionable fixes
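
Steps 1 to 3 and 5 can be sketched with stub models standing in for real LLM calls. Everything here is hypothetical scaffolding (make_stub_model, debate, judge) meant to show the shape of the loop: parallel calls within a round, with each reviewer seeing its peers' previous answers. It does not implement the anti-groupthink prompts or role weighting, and it is not verd's internal code.

```python
from concurrent.futures import ThreadPoolExecutor


def make_stub_model(name):
    """Return a fake model; a real one would call an LLM API instead."""
    def respond(role, question, peer_answers):
        return f"{name} as {role}: saw {len(peer_answers)} peer answers"
    return respond


def debate(question, panel, rounds):
    """Run `rounds` rounds; each model sees its peers' previous answers.

    panel maps a role name to a callable(role, question, peer_answers).
    """
    answers = {}  # role -> latest answer
    with ThreadPoolExecutor(max_workers=len(panel)) as pool:
        for _ in range(rounds):
            # Submit every reviewer for this round before collecting results,
            # so the calls within a round run in parallel.
            futures = {
                role: pool.submit(
                    model,
                    role,
                    question,
                    # Peers' answers from the previous round (empty in round 1).
                    {r: a for r, a in answers.items() if r != role},
                )
                for role, model in panel.items()
            }
            answers = {role: fut.result() for role, fut in futures.items()}
    return answers


def judge(answers):
    """Stand-in judge; a real one would be a stronger model reading the debate."""
    return {"reviewed_roles": sorted(answers)}


panel = {
    "analyst": make_stub_model("model-a"),
    "devils_advocate": make_stub_model("model-b"),
}
print(judge(debate("is this production-ready?", panel, rounds=2)))
```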

The key insight: different models have different blind spots. Claude may spot nuance that GPT misses; Gemini may catch logic errors that DeepSeek overlooks. The debate surfaces these differences and tells you exactly which model caught what.

When to use verd

The second opinion you run before you ship.

  • "Should we?" decisions: Ask one model "Kafka or RabbitMQ?" and you get one opinion at 50% confidence. Ask verd and you get 4-5 perspectives that challenge each other, a clear recommendation, and any dissent noted. A single model rarely tells you when it's wrong.

  • High-stakes code: Security reviews, auth flows, payment logic. The point is not that verd finds more bugs, but that it targets the 5% of cases where any single model would be confidently wrong. If Sonnet says "this JWT code looks fine" and the code sets verify_signature: False, the other reviewers in the debate get a chance to catch it.

  • Defensible decisions: "I ran this through 5 AI models and they debated for 3 rounds. 4 agreed, 1 dissented on X. Here's the full transcript." That's more defensible than "Claude said it's fine."

Like a code review from 5 senior engineers that costs $0.05-$0.30. You don't use it on every line — you use it on the 3 things that matter.

Don't use verd for simple factual questions, writing code, or anything where speed matters more than thoroughness.
