Multi-LLM debate engine — verdicts everywhere

These details have not been verified by PyPI

Project description

verd

Five minds enter. They argue, challenge, cross-examine. Only the truth walks out.

verd spawns multiple AI models from different families, each with a specialized role, and has them debate your question across rounds. A stronger judge then delivers the final verdict with strengths, issues, and actionable fixes.

Use it everywhere: CLI for code reviews, MCP inside Claude Code and Cursor, and Slack as @verd in any conversation.

Getting Started

Requires Python 3.11+.

pip install verd
verd setup

The setup wizard walks you through provider selection (OpenRouter, LiteLLM, or other) and outputs the exact config you need — for both CLI (.env) and MCP (JSON to paste into your editor config).

verd runs multiple models in parallel (Claude, Gemini, GPT, DeepSeek) so it needs a multi-provider router. OpenRouter is the easiest — one key, all models. LiteLLM proxy works too.

Usage

CLI

verd "can this auth middleware be bypassed?" -f auth.py middleware.py
verdh "should we merge this?" -gb main         # deep mode — 5 models + web search

MCP (Claude Code / Cursor) — use verd, verdl, verdh as tools directly in chat:

verdh based on the context above and this file, do you think we can proceed?
is this approach correct given what we discussed, use verdl

Slack — mention @verd in any channel or thread:

@verd what do you think?          — reads thread context, debates, replies
@verd deep is this secure?        — uses verdh (5 models + web search)
/verd Kafka, SQS, or RabbitMQ for our event pipeline?  — slash command with live progress

verd is for critical decisions and deep analysis — not simple lookups. If a single model can answer it, verd is overkill.

Output

FAIL  77%  In-memory rate limiter is unsafe for production

claude:FAIL  gpt:FAIL  gemini:FAIL  gpt:FAIL  (FULL)

+ Conceptually correct sliding-window logic
- Global dict is unsynchronized — race conditions in multi-thread servers
- Per-user lists grow without bounds — memory leak / DoS vector
! gpt-5-mini caught the risk of system clock jumps with time.time()
→ Move state to Redis with atomic operations

completed in 69.3s • 22,449 tokens • ~$0.07

Vote breakdown, strengths (+), issues (-), unique catches (!), dissent, and actionable fixes (→), all in one view.

Modes

Command	Debaters	Roles	Rounds	Speed	Cost
`verdl`	2 + judge	analyst, devils_advocate	1	~15s+	~$0.01
`verd`	4 + judge	analyst, devils_advocate, logic_checker, pragmatist	2	~30s+	~$0.05+
`verdh`	5 + judge + web	analyst, devils_advocate, logic_checker, fact_checker, pragmatist	3	~60s+	~$0.25+

Benchmark

Tested on the Martian Code Review Benchmark — 50 real PRs from Cal.com, Discourse, Grafana, Keycloak, and Sentry with expert-labeled golden comments. No code-review-specific tuning.

Mode	Precision	Recall	F1 Score	Avg Issues
GPT-5.4 (alone)	13.0%	70.6%	21.9%	14.6
Claude Opus 4.6 (alone)	18.5%	69.9%	29.2%	10.1
verdh (5-model debate)	29.1%	64.0%	40.0%	5.9

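Relative to Claude Opus 4.6 alone, verdh scores 37% higher F1 (40.0% vs 29.2%) and 57% higher precision (29.1% vs 18.5%), and reports 42% fewer issues on average (5.9 vs 10.1), at the cost of lower recall (64.0% vs 69.9%).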
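
The relative figures can be re-derived from the table. The sketch below assumes the F1 column is the usual harmonic mean of precision and recall; the helper names are illustrative and not part of verd. Because the table values are rounded, a recomputed F1 can differ from the published one by about 0.1 points.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given as fractions)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def relative_change(new: float, old: float) -> float:
    """Change of `new` relative to `old`, as a fraction of `old`."""
    return (new - old) / old


# Values copied from the benchmark table above.
verdh_f1 = f1_score(0.291, 0.640)
f1_gain = relative_change(40.0, 29.2)         # verdh F1 vs Claude F1
precision_gain = relative_change(29.1, 18.5)  # verdh precision vs Claude precision
issue_change = relative_change(5.9, 10.1)     # verdh avg issues vs Claude avg issues

print(f"verdh F1 recomputed: {verdh_f1:.1%}")
print(f"F1 gain: {f1_gain:+.0%}, precision gain: {precision_gain:+.0%}, "
      f"issues: {issue_change:+.0%}")
```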

How it works

1. Your question and content are sent to multiple AI models in parallel.
2. Each model has a specialized role (analyst, devils_advocate, logic_checker, fact_checker, pragmatist).
3. Models see each other's responses and cross-examine for 1-3 rounds.
4. Anti-groupthink prompts ensure models hold their ground when they have evidence; consensus without new evidence is rejected.
5. A stronger judge model synthesizes the debate, weighting each reviewer by their role.
6. Confidence is calculated from vote distribution: a fact_checker's dissent lowers confidence more than a devils_advocate's expected pushback.
7. You get: verdict, vote breakdown, strengths, issues, unique catches, dissent, and actionable fixes.
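
The round structure described above can be sketched with asyncio. This is a minimal illustration under stated assumptions, not verd's actual implementation: `ask_model` is a hypothetical stand-in for a provider call, and the judge step is left out. It only shows debaters answering in parallel and seeing the replies from earlier rounds.

```python
import asyncio

# Roles used by the default `verd` mode, per the Modes table.
ROLES = ["analyst", "devils_advocate", "logic_checker", "pragmatist"]


async def ask_model(role: str, question: str, transcript: list[str]) -> str:
    """Hypothetical stub; a real version would call an LLM through a router."""
    await asyncio.sleep(0)  # yield control, as a network call would
    return f"{role} (saw {len(transcript)} prior replies): view on {question!r}"


async def debate(question: str, roles: list[str], rounds: int) -> list[str]:
    """Run `rounds` rounds; within a round all debaters answer in parallel."""
    transcript: list[str] = []
    for _ in range(rounds):
        # Freeze the transcript so every debater in this round sees the same history.
        snapshot = list(transcript)
        replies = await asyncio.gather(
            *(ask_model(role, question, snapshot) for role in roles)
        )
        transcript.extend(replies)
    return transcript


transcript = asyncio.run(debate("is this rate limiter safe?", ROLES, rounds=2))
for line in transcript:
    print(line)
```

Taking a snapshot before each round keeps debaters within a round from seeing each other's in-flight answers, which is one simple way to get "parallel, then cross-examine" behaviour.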

The key insight: different model families have different blind spots and training biases. Claude spots nuance GPT misses. Gemini catches logic errors DeepSeek overlooks. More importantly — if the same model writes the review and judges its quality, it's likely to agree with itself. Cross-model diversity means the judge is a genuine quality gate, not a model grading its own homework. The debate surfaces what each model uniquely caught and tells you exactly which model caught what.

Roles

Role	Job	Example catch
analyst	Balanced initial assessment, main arguments for and against	"The architecture is sound but the auth flow has a gap"
devils_advocate	Find what others miss — edge cases, hidden assumptions, failure modes	"What happens when the token expires mid-transaction?"
logic_checker	Verify reasoning quality — fallacies, off-by-one, race conditions	"The pagination math is wrong: total_pages needs ceil division"
fact_checker	Web-grounded verification — do these APIs/libraries actually work?	"That library was deprecated in v3, use the new API"
pragmatist	Real-world practicality — will this ship? What's the ops burden?	"This works but needs 3 new infra dependencies your team doesn't know"

The judge weighs each reviewer's input by role — a fact_checker citing sources carries more weight than a devils_advocate pushing back.
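
One way to picture role-weighted confidence is a weighted agreement ratio. The weights and the formula below are assumptions chosen for illustration; the source does not state how verd actually computes confidence.

```python
# Hypothetical weights: a fact_checker counts for more, a devils_advocate for less.
ROLE_WEIGHTS = {
    "analyst": 1.0,
    "devils_advocate": 0.5,
    "logic_checker": 1.0,
    "fact_checker": 1.5,
    "pragmatist": 1.0,
}


def confidence(votes: dict[str, str], verdict: str) -> float:
    """Share of total role weight that agrees with the final verdict."""
    total = sum(ROLE_WEIGHTS[role] for role in votes)
    if total == 0:
        return 0.0
    agreeing = sum(
        ROLE_WEIGHTS[role] for role, vote in votes.items() if vote == verdict
    )
    return agreeing / total


unanimous = {role: "FAIL" for role in ROLE_WEIGHTS}
advocate_dissents = {**unanimous, "devils_advocate": "PASS"}
fact_checker_dissents = {**unanimous, "fact_checker": "PASS"}

print(confidence(advocate_dissents, "FAIL"))
print(confidence(fact_checker_dissents, "FAIL"))
```

With these weights, a lone fact_checker dissent lowers the score more than a lone devils_advocate dissent, matching the behaviour described above.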

Config

Override models via env vars or CLI flags. Per-tier env vars let you set different models for each mode:

VERDL_JUDGE=o4-mini            VERDL_DEBATERS=gpt-4.1-mini,gemini-3.1-flash-lite-preview
VERD_JUDGE=o3                  VERD_DEBATERS=claude-sonnet-4-6,gpt-4.1,gemini-3.1-pro-preview,gpt-4.1-mini
VERDH_JUDGE=o3                 VERDH_DEBATERS=claude-opus-4-6,deepseek-r1,gemini-3.1-pro-preview,sonar-pro,gpt-4.1

Or use VERD_JUDGE / VERD_DEBATERS as a global override for all tiers. verd setup generates the right config for your provider.

Flags

-c TEXT               inline content string
-f FILE [FILE ...]    one or more files to evaluate
-d [DIR]              read all files in a directory (default: current dir)
-g                    use unstaged git diff as content
-gs                   use staged git diff as content
-gb REF               use git diff REF...HEAD as content (e.g. main)
-a / --all            scan all files, skip smart selection (use with -d)
--ext EXT [EXT ...]   filter by extension (use with -d)
--exclude PATTERN     glob patterns to exclude (use with -d)
-q / --quiet          hide debate transcript, show only verdict
--json                output raw JSON
--judge MODEL         override judge model
--debaters MODEL ...  override debater models
--budget USD          max cost in USD — abort if estimate exceeds budget
--timeout SECONDS     override timeout per model call
--version             show version and exit

MCP — Claude Code / Cursor

verd setup    # select "MCP" and your provider

This prints the exact JSON to paste into ~/.claude/settings.json (Claude Code) or ~/.cursor/mcp.json (Cursor), with the correct absolute path to verd-mcp and model overrides for your provider. Then use verd, verdl, or verdh as tools directly in chat.
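
For orientation only, the snippet below builds an entry in the `mcpServers` shape that Cursor's mcp.json uses. The server name "verd", the env contents, and the placeholder path are assumptions; the JSON printed by verd setup is authoritative and may differ.

```python
import json
import shutil


def mcp_entry(command_path: str, env: dict[str, str]) -> dict:
    """Build an `mcpServers`-style config entry (assumed shape, see lead-in)."""
    return {"mcpServers": {"verd": {"command": command_path, "env": dict(env)}}}


# Use the installed verd-mcp if it is on PATH, otherwise a visible placeholder.
path = shutil.which("verd-mcp") or "/absolute/path/to/verd-mcp"
entry = mcp_entry(path, {"VERD_JUDGE": "o3"})
print(json.dumps(entry, indent=2))
```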

Slack

Install with Slack dependencies:

pip install "verd[slack]"

Create a Slack app with Socket Mode enabled, add bot scopes (app_mentions:read, channels:history, groups:history, chat:write, reactions:write, im:history, im:write, users:read), then:

export SLACK_BOT_TOKEN=xoxb-...
export SLACK_APP_TOKEN=xapp-...
export SLACK_SIGNING_SECRET=...
verd-slack

Optional: restrict access via environment variables:

export VERD_ALLOWED_CHANNELS=C123,C456    # empty = all channels
export VERD_ALLOWED_USERS=U123,U456       # empty = all users

Release history

Version	Release date
0.4.0	Apr 1, 2026
0.3.6	Mar 26, 2026
0.3.5 (this version)	Mar 26, 2026
0.3.4	Mar 26, 2026
0.3.3	Mar 26, 2026
0.3.2	Mar 26, 2026
0.3.1	Mar 26, 2026
0.3.0	Mar 26, 2026
0.2.8	Mar 26, 2026
0.2.7	Mar 25, 2026
0.2.6	Mar 25, 2026
0.2.5	Mar 25, 2026
0.2.4	Mar 25, 2026
0.2.3	Mar 25, 2026
0.2.2	Mar 24, 2026
0.2.1	Mar 24, 2026
0.2.0	Mar 23, 2026
0.1.3	Mar 23, 2026
0.1.2	Mar 23, 2026
0.1.1	Mar 23, 2026
0.1.0	Mar 23, 2026

Download files

Source Distribution

verd-0.3.5.tar.gz (46.8 kB)

Uploaded Mar 26, 2026 Source

Built Distribution

verd-0.3.5-py3-none-any.whl (42.1 kB)

Uploaded Mar 26, 2026 Python 3

File details

Details for the file verd-0.3.5.tar.gz.

File metadata

Download URL: verd-0.3.5.tar.gz
Upload date: Mar 26, 2026
Size: 46.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for verd-0.3.5.tar.gz
Algorithm	Hash digest
SHA256	`82ee524650e160bd9240ed19c15a76d6584cf3ccbfaa216f41102c19d6989550`
MD5	`2cecdd7e46e77d26ea0f3c75ccfb088a`
BLAKE2b-256	`83162fdf44fb3f11c76a073e85ca68656ea2a100ab49e15433eec7343caafee8`

File details

Details for the file verd-0.3.5-py3-none-any.whl.

File metadata

Download URL: verd-0.3.5-py3-none-any.whl
Upload date: Mar 26, 2026
Size: 42.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for verd-0.3.5-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b51c3fdaa556e2b91a1d817f0ffc372b897167b5aca28de9e25a62530492a862`
MD5	`7b1fa69ed1b274baa43f903b2e1cf739`
BLAKE2b-256	`fa43be0440d9a3e899657791d52a31aace9c8856814902b69a98ab62f150cca5`
