MCP as a Judge: a behavioral MCP that strengthens AI coding assistants via explicit LLM evaluations
MCP as a Judge ⚖️
mcp-name: io.github.OtherVibes/mcp-as-a-judge
MCP as a Judge acts as a validation layer between AI coding assistants and LLMs, helping ensure safer and higher-quality code.
MCP as a Judge is a behavioral MCP that strengthens AI coding assistants by requiring explicit LLM evaluations for:
- Research, system design, and planning
- Code changes, testing, and task-completion verification
It enforces evidence-based research, reuse over reinvention, and human-in-the-loop decisions.
If your IDE has rules/agents (Copilot, Cursor, Claude Code), keep using them—this Judge adds enforceable approval gates on plan, code diffs, and tests.
Key problems with AI coding assistants and LLMs
- Treat LLM output as ground truth; skip research and use outdated information
- Reinvent the wheel instead of reusing libraries and existing code
- Cut corners: code below engineering standards and weak tests
- Make unilateral decisions when requirements are ambiguous or plans change
- Security blind spots: missing input validation, injection risks/attack vectors, least‑privilege violations, and weak defensive programming
Vibe coding doesn’t have to be frustrating
What it enforces
- Evidence‑based research and reuse (best practices, libraries, existing code)
- Plan‑first delivery aligned to user requirements
- Human‑in‑the‑loop decisions for ambiguity and blockers
- Quality gates on code and tests (security, performance, maintainability)
Key capabilities
- Intelligent code evaluation via MCP sampling; enforces software‑engineering standards and flags security/performance/maintainability risks
- Comprehensive plan/design review: validates architecture, research depth, requirements fit, and implementation approach
- User‑driven decisions via MCP elicitation: clarifies requirements, resolves obstacles, and keeps choices transparent
- Security validation in system design and code changes
Tools and how they help
| Tool | What it solves |
|---|---|
| set_coding_task | Creates/updates task metadata; classifies task_size; returns next-step workflow guidance |
| get_current_coding_task | Recovers the latest task_id and metadata to resume work safely |
| judge_coding_plan | Validates plan/design; requires library selection and internal reuse maps; flags risks |
| judge_code_change | Reviews unified Git diffs for correctness, reuse, security, and code quality |
| judge_testing_implementation | Validates tests using real runner output and optional coverage |
| judge_coding_task_completion | Final gate ensuring plan, code, and test approvals before completion |
| raise_missing_requirements | Elicits missing details and decisions to unblock progress |
| raise_obstacle | Engages the user on trade-offs, constraints, and enforced changes |
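For orientation, here is a minimal sketch of how an MCP client could invoke one of these tools programmatically, using the official MCP Python SDK over stdio (the same transport IDE clients use). The argument passed to judge_code_change is an illustrative assumption, not the server's documented schema; and since a bare script provides no MCP sampling, the server would need the LLM_API_KEY fallback described below.

import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Launch the server the same way the uv-based setup below does.
server = StdioServerParameters(
    command="uv",
    args=["tool", "run", "mcp-as-a-judge"],
    env={"LLM_API_KEY": "your-api-key-here"},  # fallback; see LLM API Configuration
)

async def main() -> None:
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Discover the judge tools listed in the table above.
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])

            # Hypothetical call: the field name is illustrative only;
            # consult the server's tool schema for the real one.
            result = await session.call_tool(
                "judge_code_change",
                {"code_change": "diff --git a/app.py b/app.py\n..."},
            )
            print(result.content)

asyncio.run(main())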
🚀 Quick Start
Requirements & Recommendations
MCP Client Prerequisites
MCP as a Judge depends heavily on the MCP Sampling and MCP Elicitation features for its core functionality (a server-side sketch of both follows this list):
- MCP Sampling - Required for AI-powered code evaluation and judgment
- MCP Elicitation - Required for interactive user decision prompts
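To make these two features concrete, here is a sketch of what sampling and elicitation look like from a server's perspective, written with the MCP Python SDK's FastMCP API (assuming a recent SDK release that includes elicitation). It illustrates the protocol features only and is not code from this project; the tool name and schema are invented.

from mcp.server.fastmcp import Context, FastMCP
from mcp.types import SamplingMessage, TextContent
from pydantic import BaseModel

mcp = FastMCP("sampling-elicitation-demo")

class Clarification(BaseModel):
    preferred_library: str  # hypothetical field for this demo

@mcp.tool()
async def demo_judge(plan: str, ctx: Context) -> str:
    """Illustrative tool exercising both protocol features."""
    # MCP Sampling: ask the *client's* model to evaluate the plan,
    # so no separate LLM API key is needed.
    verdict = await ctx.session.create_message(
        messages=[
            SamplingMessage(
                role="user",
                content=TextContent(type="text", text=f"Review this plan:\n{plan}"),
            )
        ],
        max_tokens=500,
    )
    # MCP Elicitation: route a missing decision back to the human.
    answer = await ctx.elicit(
        message="Which library should the plan standardize on?",
        schema=Clarification,
    )
    library = answer.data.preferred_library if answer.action == "accept" else "undecided"
    return f"verdict={verdict.content}, library={library}"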
System Prerequisites
- Docker Desktop / Python 3.13+ - Required for running the MCP server
Supported AI Assistants
| AI Assistant | Platform | MCP Support | Status | Notes |
|---|---|---|---|---|
| GitHub Copilot | Visual Studio Code | ✅ Full | Recommended | Complete MCP integration with sampling and elicitation |
| Claude Code | - | ⚠️ Partial | Requires LLM API key | Sampling and elicitation support are open feature requests |
| Cursor | - | ⚠️ Partial | Requires LLM API key | MCP support available, but sampling/elicitation limited |
| Augment | - | ⚠️ Partial | Requires LLM API key | MCP support available, but sampling/elicitation limited |
| Qodo | - | ⚠️ Partial | Requires LLM API key | MCP support available, but sampling/elicitation limited |
✅ Recommended setup: GitHub Copilot + VS Code — full MCP sampling; no API key needed.
⚠️ Critical: For assistants without full MCP sampling (Cursor, Claude Code, Augment, Qodo), you MUST set LLM_API_KEY. Without it, the server cannot evaluate plans or code. See LLM API Configuration.
💡 Tip: Prefer large context models (≥ 1M tokens) for better analysis and judgments.
If the MCP server isn’t used automatically
See the FAQ section for troubleshooting, including prompts that force the Judge to run.
🔧 MCP Configuration
Configure MCP as a Judge in your MCP-enabled client:
Method 1: Using Docker (Recommended)
One‑click install for VS Code (MCP)
Notes:
- VS Code controls the sampling model; select it via “MCP: List Servers → mcp-as-a-judge → Configure Model Access”.

Configure MCP Settings:
Add this to your MCP client configuration file:

{
  "command": "docker",
  "args": ["run", "--rm", "-i", "--pull=always", "ghcr.io/othervibes/mcp-as-a-judge:latest"],
  "env": {
    "LLM_API_KEY": "your-openai-api-key-here",
    "LLM_MODEL_NAME": "gpt-4o-mini"
  }
}
📝 Configuration Options (All Optional):
- LLM_API_KEY: Optional for GitHub Copilot + VS Code (has built-in MCP sampling)
- LLM_MODEL_NAME: Optional custom model (see Supported LLM Providers for defaults)
- The --pull=always flag ensures you always get the latest version automatically

To update manually when needed:

# Pull the latest version
docker pull ghcr.io/othervibes/mcp-as-a-judge:latest
Method 2: Using uv
1. Install the package:

   uv tool install mcp-as-a-judge

2. Configure MCP Settings:

   The MCP server may be automatically detected by your MCP‑enabled client.

   📝 Notes:
   - No additional configuration needed for GitHub Copilot + VS Code (has built-in MCP sampling)
   - LLM_API_KEY is optional and can be set via environment variable if needed

3. To update to the latest version:

   # Update MCP as a Judge to the latest version
   uv tool upgrade mcp-as-a-judge
Select a sampling model in VS Code
- Open Command Palette (Cmd/Ctrl+Shift+P) → “MCP: List Servers”
- Select the configured server “mcp-as-a-judge”
- Choose “Configure Model Access”
- Check your preferred model(s) to enable sampling
🔑 LLM API Configuration (Optional)
For AI assistants without full MCP sampling support you can configure an LLM API key as a fallback. This ensures MCP as a Judge works even when the client doesn't support MCP sampling.
- Set LLM_API_KEY (unified key). The vendor is auto-detected from the key format; optionally set LLM_MODEL_NAME to override the default model.
Supported LLM Providers
| Rank | Provider | API Key Format | Default Model | Notes |
|---|---|---|---|---|
| 1 | OpenAI | sk-... | gpt-4.1 | Fast and reliable model optimized for speed |
| 2 | Anthropic | sk-ant-... | claude-sonnet-4-20250514 | High-performance with exceptional reasoning |
| 3 | Google Gemini | AIza... | gemini-2.5-pro | Most advanced model with built-in thinking |
| 4 | Azure OpenAI | [a-f0-9]{32} | gpt-4.1 | Same as OpenAI but via Azure |
| 5 | AWS Bedrock | AWS credentials | anthropic.claude-sonnet-4-20250514-v1:0 | Aligned with Anthropic |
| 6 | Vertex AI | Service Account JSON | gemini-2.5-pro | Enterprise Gemini via Google Cloud |
| 7 | Groq | gsk_... | deepseek-r1 | Best reasoning model with speed advantage |
| 8 | OpenRouter | sk-or-... | deepseek/deepseek-r1 | Best reasoning model available |
| 9 | xAI | xai-... | grok-code-fast-1 | Latest coding-focused model (Aug 2025) |
| 10 | Mistral | [a-f0-9]{64} | pixtral-large | Most advanced model (124B params) |
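To illustrate the automatic vendor detection the table describes, here is a toy sketch that maps key prefixes to a provider and its default model. It is illustrative only, not the server's actual logic; notably, the hex-format keys (Azure OpenAI, Mistral) carry no prefix and would need length or context checks to disambiguate.

# Toy prefix -> (provider, default model) mapping, taken from the table above.
KEY_PREFIXES = {
    "sk-ant-": ("anthropic", "claude-sonnet-4-20250514"),
    "sk-or-": ("openrouter", "deepseek/deepseek-r1"),
    "sk-": ("openai", "gpt-4.1"),  # checked after the more specific sk- prefixes
    "AIza": ("gemini", "gemini-2.5-pro"),
    "gsk_": ("groq", "deepseek-r1"),
    "xai-": ("xai", "grok-code-fast-1"),
}

def detect_vendor(api_key: str) -> tuple[str, str]:
    """Return (provider, default_model) for a key, per the table above."""
    for prefix, vendor in KEY_PREFIXES.items():  # dict order is preserved
        if api_key.startswith(prefix):
            return vendor
    raise ValueError("Unrecognized key format; set LLM_MODEL_NAME explicitly")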
Client-Specific Setup
Cursor
1. Open Cursor Settings:
   - Go to File → Preferences → Cursor Settings
   - Navigate to the MCP tab
   - Click + Add to add a new MCP server

2. Add MCP Server Configuration:

   {
     "command": "uv",
     "args": ["tool", "run", "mcp-as-a-judge"],
     "env": {
       "LLM_API_KEY": "your-openai-api-key-here",
       "LLM_MODEL_NAME": "gpt-4.1"
     }
   }
📝 Configuration Options:
- LLM_API_KEY: Required for Cursor (limited MCP sampling)
- LLM_MODEL_NAME: Optional custom model (see Supported LLM Providers for defaults)
Claude Code
1. Add MCP Server via CLI:

   # Set environment variables first (optional model override)
   export LLM_API_KEY="your_api_key_here"
   export LLM_MODEL_NAME="claude-3-5-haiku"  # Optional: faster/cheaper model

   # Add MCP server
   claude mcp add mcp-as-a-judge -- uv tool run mcp-as-a-judge

2. Alternative: Manual Configuration:

   Create or edit ~/.config/claude-code/mcp_servers.json:

   {
     "command": "uv",
     "args": ["tool", "run", "mcp-as-a-judge"],
     "env": {
       "LLM_API_KEY": "your-anthropic-api-key-here",
       "LLM_MODEL_NAME": "claude-3-5-haiku"
     }
   }
📝 Configuration Options:
- LLM_API_KEY: Required for Claude Code (limited MCP sampling)
- LLM_MODEL_NAME: Optional custom model (see Supported LLM Providers for defaults)
Other MCP Clients
For other MCP-compatible clients, use the standard MCP server configuration:
{
"command": "uv",
"args": ["tool", "run", "mcp-as-a-judge"],
"env": {
"LLM_API_KEY": "your-openai-api-key-here",
"LLM_MODEL_NAME": "gpt-5"
}
}
📝 Configuration Options:
- LLM_API_KEY: Required for most MCP clients (except GitHub Copilot + VS Code)
- LLM_MODEL_NAME: Optional custom model (see Supported LLM Providers for defaults)
🔒 Privacy & Flexible AI Integration
🔑 MCP Sampling (Preferred) + LLM API Key Fallback
Primary Mode: MCP Sampling
- All judgments are performed using MCP Sampling capability
- No need to configure or pay for external LLM API services
- Works directly with your MCP-compatible client's existing AI model
- Currently supported by: GitHub Copilot + VS Code
Fallback Mode: LLM API Key
- When MCP sampling is not available, the server can use LLM API keys
- Supports multiple providers via LiteLLM: OpenAI, Anthropic, Google, Azure, Groq, Mistral, xAI (a call sketch follows this list)
- Automatic vendor detection from API key patterns
- Default model selection per vendor when no model is specified
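Since fallback calls go through LiteLLM, the shape of such a judgment call is roughly as follows. This is a generic LiteLLM usage sketch using the table's OpenAI default, not the project's internal code.

import os

from litellm import completion  # unified, OpenAI-compatible client

# The provider is inferred from the model name; the unified key is passed through.
response = completion(
    model=os.environ.get("LLM_MODEL_NAME", "gpt-4.1"),  # table default for OpenAI
    api_key=os.environ.get("LLM_API_KEY"),
    messages=[
        {"role": "system", "content": "You are a strict code reviewer."},
        {"role": "user", "content": "Judge this unified diff:\n..."},
    ],
)
print(response.choices[0].message.content)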
🛡️ Your Privacy Matters
- The server runs locally on your machine
- No data collection - your code and conversations stay private
- No external API calls when using MCP Sampling. If you set LLM_API_KEY for fallback, the server calls your chosen LLM provider only to perform judgments (plan/code/test) with the evaluation content you provide.
- Complete control over your development workflow and sensitive information
🤝 Contributing
We welcome contributions! Please see CONTRIBUTING.md for guidelines.
Development Setup
# Clone the repository
git clone https://github.com/OtherVibes/mcp-as-a-judge.git
cd mcp-as-a-judge
# Install dependencies with uv
uv sync --all-extras --dev
# Install pre-commit hooks
uv run pre-commit install
# Run tests
uv run pytest
# Run all checks
uv run pytest && uv run ruff check && uv run ruff format --check && uv run mypy src
© Concepts and Methodology
© 2025 OtherVibes and Zvi Fried. The "MCP as a Judge" concept, the "behavioral MCP" approach, the staged workflow (plan → code → test → completion), tool taxonomy/descriptions, and prompt templates are original work developed in this repository.
Prior Art and Attribution
While “LLM‑as‑a‑judge” is a broadly known idea, this repository defines the original “MCP as a Judge” behavioral MCP pattern by OtherVibes and Zvi Fried. It combines task‑centric workflow enforcement (plan → code → test → completion), explicit LLM‑based validations, and human‑in‑the‑loop elicitation, along with the prompt templates and tool taxonomy provided here. Please attribute as: “OtherVibes – MCP as a Judge (Zvi Fried)”.
❓ FAQ
How is “MCP as a Judge” different from rules/subagents in IDE assistants (GitHub Copilot, Cursor, Claude Code)?
| Feature | IDE Rules | Subagents | MCP as a Judge |
|---|---|---|---|
| Static behavior guidance | ✓ | ✓ | ✗ |
| Custom system prompts | ✓ | ✓ | ✓ |
| Project context integration | ✓ | ✓ | ✓ |
| Specialized task handling | ✗ | ✓ | ✓ |
| Active quality gates | ✗ | ✗ | ✓ |
| Evidence-based validation | ✗ | ✗ | ✓ |
| Approve/reject with feedback | ✗ | ✗ | ✓ |
| Workflow enforcement | ✗ | ✗ | ✓ |
| Cross-assistant compatibility | ✗ | ✗ | ✓ |
How does the Judge workflow relate to the tasklist? Why do we need both?
- Tasklist = planning/organization: tracks tasks, priorities, and status. It doesn’t guarantee engineering quality or readiness.
- Judge workflow = quality gates: enforces approvals for plan/design, code diffs, tests, and final completion. It demands real evidence (e.g., unified Git diffs and raw test output) and returns structured approvals and required improvements.
- Together: Use the tasklist to organize work; use the Judge to decide when each stage is actually ready to proceed. The server also emits next_tool guidance to keep progress moving through the gates (an illustrative verdict shape follows this list).
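For a sense of what a structured verdict can look like, here is a hypothetical payload; the exact field names are assumptions for illustration and are not documented here.

# Hypothetical verdict shape; every field name here is illustrative.
verdict = {
    "approved": False,
    "required_improvements": [
        "Add input validation for user-supplied paths",
        "Reuse the existing retry helper instead of adding a new loop",
    ],
    "next_tool": "judge_code_change",  # the server's guidance for the next gate
}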
If the Judge isn’t used automatically, how do I force it?
- In your prompt: "use mcp-as-a-judge" or "Evaluate plan/code/test using the MCP server mcp-as-a-judge".
- VS Code: Command Palette → "MCP: List Servers" → ensure "mcp-as-a-judge" is listed and enabled.
- Ensure the MCP server is running and, in your client, the judge tools are enabled/approved.
How do I select models for sampling in VS Code?
- Open Command Palette (Cmd/Ctrl+Shift+P) → "MCP: List Servers"
- Select "mcp-as-a-judge" → "Configure Model Access"
- Check your preferred model(s) to enable sampling
📄 License
This project is licensed under the MIT License (see LICENSE).
🙏 Acknowledgments
- Model Context Protocol by Anthropic
- LiteLLM for unified LLM API integration
Download files
Source Distribution
Built Distribution
File details
Details for the file mcp_as_a_judge-0.3.19.tar.gz.
File metadata
- Download URL: mcp_as_a_judge-0.3.19.tar.gz
- Upload date:
- Size: 330.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9711c5e377a623a32489e75761ddaef4b5ecdc07b0e4752785aa5171fa983372 |
| MD5 | fa698cd1f7c02bbdd91a4becfb842614 |
| BLAKE2b-256 | ce95dc74b32233d47b9bbc9e2a39eee96fc696af45d9cbdd8f0d7467daa00ed7 |
File details
Details for the file mcp_as_a_judge-0.3.19-py3-none-any.whl.
File metadata
- Download URL: mcp_as_a_judge-0.3.19-py3-none-any.whl
- Upload date:
- Size: 155.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 74b9d66daa02c4386de42a028c0ac04aefd602061204b8e84f209091744558c6 |
| MD5 | d16420272b9a990502648fa310a6929a |
| BLAKE2b-256 | 78a4df48187f039910b63ef1c7db2ca3bcefd6facbec693fe9b15e206994bf8f |