Multi-LLM debate engine — verdicts everywhere

verd

Five minds enter. They argue, challenge, cross-examine. Only the truth walks out.

verd spawns multiple AI models from different families — each with a specialized role — has them debate your question across rounds, then a stronger judge delivers the final verdict with strengths, issues, and actionable fixes.

Use it everywhere: CLI for code reviews, MCP inside Claude Code and Cursor, Slack as @verd in any conversation, or pipe anything into it.

Install

pip install verd

Setup

verd talks to any OpenAI-compatible API. Set two env vars (or put them in a .env file):

export OPENAI_API_KEY=your-key
export OPENAI_BASE_URL=https://openrouter.ai/api/v1  # or any compatible endpoint

Works with OpenRouter (easiest — all models, one key), direct OpenAI, LiteLLM proxy, Azure, Together, Groq, etc.
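
Before a first run, it can help to confirm that both variables are visible to the process. The sketch below is a minimal check in Python that assumes nothing about verd's internals: `missing_settings` is a hypothetical helper, not part of verd, and it only inspects the process environment (it does not read a .env file).

```python
import os
from urllib.parse import urlparse

# The two variables named in the Setup section.
REQUIRED = ("OPENAI_API_KEY", "OPENAI_BASE_URL")


def missing_settings(env=os.environ):
    """Return a list of problems with the two Setup variables (empty if fine)."""
    problems = [f"{name} is not set" for name in REQUIRED if not env.get(name)]
    base_url = env.get("OPENAI_BASE_URL", "")
    if base_url:
        parsed = urlparse(base_url)
        # An OpenAI-compatible endpoint should be an absolute http(s) URL.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append("OPENAI_BASE_URL is not an absolute http(s) URL")
    return problems


if __name__ == "__main__":
    for problem in missing_settings():
        print(problem)
```

An empty list means both variables are present and the base URL at least looks like an endpoint; it does not prove the key is valid.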

Usage

verd "Kafka or RabbitMQ for our event pipeline?" -f architecture.md
verd "can this auth middleware be bypassed?" -f auth.py middleware.py
verdh "should we merge this?" -gb main
verdl "is O(n^2) acceptable for n=1000?"
cat deploy.yaml | verd "any misconfigs that could expose prod?"

Output

FAIL  77%  In-memory rate limiter is unsafe for production

claude:FAIL  gpt:FAIL  gemini:FAIL  gpt:FAIL  (FULL)

+ Conceptually correct sliding-window logic
- Global dict is unsynchronized — race conditions in multi-thread servers
- Per-user lists grow without bounds — memory leak / DoS vector
! gpt-5-mini caught the risk of system clock jumps with time.time()
→ Move state to Redis with atomic operations

completed in 69.3s • 22,449 tokens • ~$0.07

Vote breakdown, unique catches (!), dissent, strengths, issues, and actionable fixes — all in one view.
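
For quick scripting against the text view, the headline line can be split into verdict, confidence, and summary. This is a rough Python sketch based only on the sample above: `parse_headline` is a hypothetical helper, the exact text layout is an assumption that may change between versions, and FAIL is the only verdict label shown here, so the pattern accepts any uppercase word. The --json flag is probably the more robust route for automation.

```python
import re

# Matches a headline shaped like the sample: "FAIL  77%  <summary>".
HEADLINE = re.compile(
    r"^(?P<verdict>[A-Z]+)\s+(?P<confidence>\d{1,3})%\s+(?P<summary>.+)$"
)


def parse_headline(line):
    """Split a headline into verdict, confidence, and summary, or return None."""
    match = HEADLINE.match(line.strip())
    if match is None:
        return None
    return {
        "verdict": match["verdict"],
        "confidence": int(match["confidence"]),
        "summary": match["summary"],
    }


print(parse_headline("FAIL  77%  In-memory rate limiter is unsafe for production"))
```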

Modes

| Command | Debaters | Roles | Rounds | Speed | Cost |
| --- | --- | --- | --- | --- | --- |
| verdl | 2 + judge | analyst, devils_advocate | 1 | ~15-30s | ~$0.02 |
| verd | 4 + judge | analyst, devils_advocate, logic_checker, pragmatist | 2 | ~30-60s | ~$0.15+ |
| verdh | 5 + judge + web | analyst, devils_advocate, logic_checker, fact_checker, pragmatist | 3 | ~60-120s | ~$0.40+ |

Benchmark

Tested on the Martian Code Review Benchmark — 50 real PRs from Cal.com, Discourse, Grafana, Keycloak, and Sentry with expert-labeled golden comments. No code-review-specific tuning.

| Mode | Precision | Recall | F1 Score | Avg Issues |
| --- | --- | --- | --- | --- |
| GPT-5.4 (alone) | 13.0% | 70.6% | 21.9% | 14.6 |
| Claude Opus 4.6 (alone) | 18.5% | 69.9% | 29.2% | 10.1 |
| verdh (5-model debate) | 29.1% | 64.0% | 40.0% | 5.9 |

+37% F1 over Claude alone (40.0% vs 29.2%, relative). 57% more precise (29.1% vs 18.5%, relative). 42% fewer issues reported per PR (5.9 vs 10.1). Fewer issues, more of them real.
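
The headline percentages are relative changes against the Claude Opus 4.6 row. The short Python sketch below recomputes them from the table values; the helper names are illustrative, and the only inputs are the numbers printed above.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def relative_change(new, old):
    """Change of `new` relative to `old`, as a fraction."""
    return (new - old) / old


# Values copied from the table rows for Claude Opus 4.6 (alone) and verdh.
claude = {"precision": 0.185, "recall": 0.699, "f1": 0.292, "avg_issues": 10.1}
verdh = {"precision": 0.291, "recall": 0.640, "f1": 0.400, "avg_issues": 5.9}

f1_gain = relative_change(verdh["f1"], claude["f1"])
precision_gain = relative_change(verdh["precision"], claude["precision"])
issue_drop = -relative_change(verdh["avg_issues"], claude["avg_issues"])

print(f"F1 {f1_gain:+.0%}, precision {precision_gain:+.0%}, "
      f"issues per PR -{issue_drop:.0%}")
```

Note that the 42% figure falls out of the average issue count per PR (5.9 vs 10.1), and that `f1(0.291, 0.640)` reproduces the reported 40.0% to within rounding.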

On the Martian offline leaderboard this places verdh around #8 — ahead of Claude Code Reviewer, GitHub Copilot, and Greptile — with zero domain-specific optimization.

Full results and methodology →

How it works

  1. Your question + content gets sent to multiple AI models in parallel
  2. Each model has a specialized role (analyst, devils_advocate, logic_checker, fact_checker, pragmatist)
  3. Models see each other's responses and cross-examine for 1-3 rounds
  4. Anti-groupthink prompts ensure models hold their ground when they have evidence — consensus without new evidence is rejected
  5. A stronger judge model synthesizes the debate, weighting each reviewer by their role
  6. Confidence is calculated from vote distribution — a fact_checker's dissent lowers confidence more than a devils_advocate's expected pushback
  7. You get: verdict, vote breakdown, strengths, issues, unique catches, dissent, and actionable fixes
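
Steps 1 to 3 above can be pictured as a loop that fans out to every role in parallel and feeds earlier responses into the next round. The Python sketch below only illustrates that control flow and is not verd's implementation: `stub_debater` is a hypothetical stand-in for a real model call, its voting rule is made up for the demo, and the judge and the anti-groupthink prompts are left out.

```python
import asyncio

ROLES = ["analyst", "devils_advocate", "logic_checker", "pragmatist"]


async def stub_debater(role, question, transcript):
    """Hypothetical stand-in for one model call; returns (role, vote, note)."""
    await asyncio.sleep(0)  # a real call would await a network request here
    # Made-up rule: the analyst passes until it has seen other responses.
    vote = "PASS" if role == "analyst" and not transcript else "FAIL"
    return role, vote, f"{role} saw {len(transcript)} earlier responses"


async def debate(question, roles=ROLES, rounds=2, debater=stub_debater):
    """Run every role in parallel each round, sharing prior responses."""
    transcript = []
    latest = []
    for _ in range(rounds):
        # Each debater sees a snapshot of everything said in earlier rounds.
        snapshot = list(transcript)
        latest = await asyncio.gather(
            *(debater(role, question, snapshot) for role in roles)
        )
        transcript.extend(latest)
    return latest, transcript


final_votes, transcript = asyncio.run(debate("is this rate limiter safe?"))
print(final_votes)
```

In this toy run the analyst changes its vote in round two because the stub reacts to a non-empty transcript; a real debater would be prompted to change position only when given new evidence.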

The key insight: different model families have different blind spots and training biases. For example, Claude may spot nuance GPT misses, and Gemini may catch logic errors DeepSeek overlooks. More importantly, if the same model writes the review and judges its quality, it is likely to agree with itself. Cross-model diversity makes the judge a genuine quality gate rather than a model grading its own homework. The debate surfaces what each model uniquely caught and tells you exactly which model caught what.

Roles

| Role | Job | Example catch |
| --- | --- | --- |
| analyst | Balanced initial assessment, main arguments for and against | "The architecture is sound but the auth flow has a gap" |
| devils_advocate | Find what others miss: edge cases, hidden assumptions, failure modes | "What happens when the token expires mid-transaction?" |
| logic_checker | Verify reasoning quality: fallacies, off-by-one, race conditions | "The pagination math is wrong: total_pages needs ceil division" |
| fact_checker | Web-grounded verification: do these APIs/libraries actually work? | "That library was deprecated in v3, use the new API" |
| pragmatist | Real-world practicality: will this ship? What's the ops burden? | "This works but needs 3 new infra dependencies your team doesn't know" |

The judge weighs each reviewer's input by role — a fact_checker citing sources carries more weight than a devils_advocate pushing back.
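
One way to picture role-weighted confidence is a weighted share of votes that agree with the final verdict. The Python sketch below is an illustration under stated assumptions, not verd's actual formula: the weights in `ROLE_WEIGHTS` are invented, and the only property taken from the text is that a fact_checker's dissent costs more than a devils_advocate's.

```python
# Hypothetical role weights: the text only says a fact_checker's dissent
# counts for more than a devils_advocate's; the numbers are illustrative.
ROLE_WEIGHTS = {
    "analyst": 1.0,
    "devils_advocate": 0.5,
    "logic_checker": 1.0,
    "fact_checker": 1.5,
    "pragmatist": 1.0,
}


def confidence(verdict, votes, weights=ROLE_WEIGHTS):
    """Weighted share of debaters whose vote matches the final verdict."""
    total = sum(weights[role] for role in votes)
    agreeing = sum(weights[role] for role, vote in votes.items() if vote == verdict)
    return agreeing / total if total else 0.0


unanimous = dict.fromkeys(ROLE_WEIGHTS, "FAIL")
advocate_dissents = {**unanimous, "devils_advocate": "PASS"}
fact_checker_dissents = {**unanimous, "fact_checker": "PASS"}

print(confidence("FAIL", advocate_dissents))      # 4.5 / 5.0
print(confidence("FAIL", fact_checker_dissents))  # 3.5 / 5.0
```

With these made-up weights, a lone devils_advocate dissent leaves confidence higher than a lone fact_checker dissent, which matches the ordering described above.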

Config

Customize models via ~/.verd.yaml, env vars (VERD_JUDGE, VERD_DEBATERS, VERD_BUDGET, VERD_TIMEOUT), or CLI flags. Precedence: CLI > env > file > defaults.

# ~/.verd.yaml
judge: gpt-5.4
debaters: claude-sonnet-4-6, gpt-4.1, gemini-2.5-flash
budget: 1.00
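
The precedence rule (CLI > env > file > defaults) can be modeled as a layered lookup. The Python sketch below is a hypothetical illustration rather than verd's loader: the `DEFAULTS` values are placeholders (the built-in defaults are not listed here), and values are kept as strings without parsing.

```python
from collections import ChainMap

# Placeholder defaults; the real built-in values are not listed in this README.
DEFAULTS = {"judge": "default-judge", "budget": "1.00", "timeout": "60"}

# Environment variable names as given in the Config section.
ENV_NAMES = {
    "judge": "VERD_JUDGE",
    "debaters": "VERD_DEBATERS",
    "budget": "VERD_BUDGET",
    "timeout": "VERD_TIMEOUT",
}


def resolve(cli, env, file_config, defaults=DEFAULTS):
    """Merge settings so that CLI > env > file > defaults."""
    from_env = {key: env[name] for key, name in ENV_NAMES.items() if name in env}
    # Flags the user did not pass arrive as None and must not shadow lower layers.
    from_cli = {key: value for key, value in cli.items() if value is not None}
    # ChainMap returns the value from the first mapping that has the key.
    return dict(ChainMap(from_cli, from_env, file_config, defaults))


settings = resolve(
    cli={"judge": None, "budget": "0.25"},
    env={"VERD_JUDGE": "gpt-5.4", "VERD_BUDGET": "0.50"},
    file_config={"judge": "gpt-4.1", "debaters": "claude-sonnet-4-6, gpt-4.1"},
)
print(settings)
```

Here the judge comes from the environment (it beats the file), the budget from the CLI, the debaters from the file, and the timeout from the placeholder defaults.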

Flags

-f FILE [FILE ...]    files to review         -g / -gs / -gb REF   git diffs
-d [DIR]              scan directory           -a / --ext / --exclude   filters
-q                    verdict only             --json                raw JSON
--judge MODEL         override judge           --debaters MODEL ...  override debaters
--budget USD          cost limit               --timeout SECONDS     per-call timeout
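
When calling verd from a script, the flags above can be assembled into an argument list. The Python sketch below uses only the commands and flags listed in this README; `build_command` is a hypothetical helper, and the placement of the question before -f simply mirrors the usage examples.

```python
import shlex

MODES = ("verd", "verdl", "verdh")


def build_command(question, files=(), mode="verd", as_json=False, quiet=False):
    """Assemble an argv list from the commands and flags listed above."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    argv = [mode, question]
    if files:
        argv += ["-f", *files]
    if as_json:
        argv.append("--json")
    if quiet:
        argv.append("-q")
    return argv


argv = build_command(
    "can this auth middleware be bypassed?",
    files=["auth.py", "middleware.py"],
    as_json=True,
)
print(shlex.join(argv))
```

The resulting list can be passed to `subprocess.run(argv, capture_output=True, text=True)`. The shape of the --json output is not documented here, so inspect it before relying on specific fields.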

MCP — Claude Code / Cursor

Add to ~/.claude.json or ~/.cursor/mcp.json:

{
  "mcpServers": {
    "verd": {
      "command": "verd-mcp",
      "env": {
        "OPENAI_API_KEY": "your-key",
        "OPENAI_BASE_URL": "https://openrouter.ai/api/v1"
      }
    }
  }
}

Then use verd, verdl, or verdh as tools directly in chat. Ask a question, paste code, then say "use verd to check this."
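
If the config file already lists other MCP servers, the verd entry should be merged in rather than pasted over them. The Python sketch below shows one way to do that on an already-loaded config; `add_verd_server` is a hypothetical helper, and reading and writing the actual file is left to the caller.

```python
import json


def add_verd_server(config, api_key, base_url="https://openrouter.ai/api/v1"):
    """Return a copy of an MCP config with the verd server entry added."""
    servers = dict(config.get("mcpServers", {}))
    servers["verd"] = {
        "command": "verd-mcp",
        "env": {"OPENAI_API_KEY": api_key, "OPENAI_BASE_URL": base_url},
    }
    # Other top-level keys and other servers are carried over unchanged.
    return {**config, "mcpServers": servers}


# "other-tool" is a made-up existing entry used only for the demo.
existing = {"mcpServers": {"other-tool": {"command": "other-tool-mcp"}}}
print(json.dumps(add_verd_server(existing, "your-key"), indent=2))
```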

Slack

Install with Slack dependencies:

pip install "verd[slack]"

Create a Slack app with Socket Mode enabled, add bot scopes (app_mentions:read, channels:history, groups:history, chat:write, reactions:write, im:history, im:write, users:read), then:

export SLACK_BOT_TOKEN=xoxb-...
export SLACK_APP_TOKEN=xapp-...
export SLACK_SIGNING_SECRET=...
verd-slack

Usage in Slack:

  • @verd what do you think? — reads thread or last 20 channel messages, debates, replies in thread
  • @verd deep is this secure? — uses verdh (5 models + web search)
  • @verd quick is this right? — uses verdl (fast, 2 models)
  • @verd last 50 what's the consensus? — reads last 50 messages as context
  • /verd should we use Kafka? — slash command with live progress updates
  • /verdl is this correct? — quick slash command
  • /verdh any security issues? — deep slash command

Optional: restrict access via environment variables:

export VERD_ALLOWED_CHANNELS=C123,C456    # empty = all channels
export VERD_ALLOWED_USERS=U123,U456       # empty = all users
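
The "empty = all" rule is easy to get backwards, so here is a small Python sketch of the intended semantics. It is an assumption about behavior based on the comments above, not verd's code; `parse_allowlist` and `is_allowed` are hypothetical names.

```python
def parse_allowlist(raw):
    """Split a comma-separated ID list; an empty value means no restriction."""
    return {item.strip() for item in (raw or "").split(",") if item.strip()}


def is_allowed(identifier, raw_allowlist):
    """True when the list is empty (allow all) or contains the identifier."""
    allowed = parse_allowlist(raw_allowlist)
    return not allowed or identifier in allowed


print(is_allowed("C123", "C123,C456"))  # listed channel
print(is_allowed("C999", "C123,C456"))  # unlisted channel
print(is_allowed("C999", ""))           # empty list allows everyone
```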
