MCP as a Judge: a behavioral MCP that strengthens AI coding assistants via explicit LLM evaluations
MCP as a Judge ⚖️
mcp-name: io.github.OtherVibes/mcp-as-a-judge
MCP as a Judge acts as a validation layer between AI coding assistants and LLMs, helping ensure safer and higher-quality code.
MCP as a Judge is a behavioral MCP that strengthens AI coding assistants by requiring explicit LLM evaluations for:
- Research, system design, and planning
- Code changes, testing, and task-completion verification
It enforces evidence-based research, reuse over reinvention, and human-in-the-loop decisions.
If your IDE has rules/agents (Copilot, Cursor, Claude Code), keep using them—this Judge adds enforceable approval gates on plan, code diffs, and tests.
Key problems with AI coding assistants and LLMs
- Treat LLM output as ground truth; skip research and use outdated information
- Reinvent the wheel instead of reusing libraries and existing code
- Cut corners: code below engineering standards and weak tests
- Make unilateral decisions when requirements are ambiguous or plans change
- Security blind spots: missing input validation, injection risks/attack vectors, least‑privilege violations, and weak defensive programming
Vibe coding doesn’t have to be frustrating
What it enforces
- Evidence‑based research and reuse (best practices, libraries, existing code)
- Plan‑first delivery aligned to user requirements
- Human‑in‑the‑loop decisions for ambiguity and blockers
- Quality gates on code and tests (security, performance, maintainability)
Key capabilities
- Intelligent code evaluation via MCP sampling; enforces software‑engineering standards and flags security/performance/maintainability risks
- Comprehensive plan/design review: validates architecture, research depth, requirements fit, and implementation approach
- User‑driven decisions via MCP elicitation: clarifies requirements, resolves obstacles, and keeps choices transparent
- Security validation in system design and code changes
Tools and how they help
| Tool | What it solves |
|---|---|
| set_coding_task | Creates/updates task metadata; classifies task_size; returns next-step workflow guidance |
| get_current_coding_task | Recovers the latest task_id and metadata to resume work safely |
| judge_coding_plan | Validates plan/design; requires library selection and internal reuse maps; flags risks |
| judge_code_change | Reviews unified Git diffs for correctness, reuse, security, and code quality |
| judge_testing_implementation | Validates tests using real runner output and optional coverage |
| judge_coding_task_completion | Final gate ensuring plan, code, and test approvals before completion |
| raise_missing_requirements | Elicits missing details and decisions to unblock progress |
| raise_obstacle | Engages the user on trade-offs, constraints, and enforced changes |
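For orientation, here is a minimal sketch of how an MCP client could invoke one of these tools programmatically, using the official MCP Python SDK over stdio (the same transport IDE clients use). The argument passed to judge_code_change is an illustrative assumption, not the server's documented schema; and since a bare script provides no MCP sampling, the server would need the LLM_API_KEY fallback described below.

import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Launch the server the same way the uv-based setup below does.
server = StdioServerParameters(
    command="uv",
    args=["tool", "run", "mcp-as-a-judge"],
    env={"LLM_API_KEY": "your-api-key-here"},  # fallback; see LLM API Configuration
)

async def main() -> None:
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Discover the judge tools listed in the table above.
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])

            # Hypothetical call: the field name is illustrative only;
            # consult the server's tool schema for the real one.
            result = await session.call_tool(
                "judge_code_change",
                {"code_change": "diff --git a/app.py b/app.py\n..."},
            )
            print(result.content)

asyncio.run(main())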
🚀 Quick Start
Requirements & Recommendations
MCP Client Prerequisites
MCP as a Judge depends heavily on the MCP Sampling and MCP Elicitation features for its core functionality (a server-side sketch of both follows this list):
- MCP Sampling - Required for AI-powered code evaluation and judgment
- MCP Elicitation - Required for interactive user decision prompts
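To make these two features concrete, here is a sketch of what sampling and elicitation look like from a server's perspective, written with the MCP Python SDK's FastMCP API (assuming a recent SDK release that includes elicitation). It illustrates the protocol features only and is not code from this project; the tool name and schema are invented.

from mcp.server.fastmcp import Context, FastMCP
from mcp.types import SamplingMessage, TextContent
from pydantic import BaseModel

mcp = FastMCP("sampling-elicitation-demo")

class Clarification(BaseModel):
    preferred_library: str  # hypothetical field for this demo

@mcp.tool()
async def demo_judge(plan: str, ctx: Context) -> str:
    """Illustrative tool exercising both protocol features."""
    # MCP Sampling: ask the *client's* model to evaluate the plan,
    # so no separate LLM API key is needed.
    verdict = await ctx.session.create_message(
        messages=[
            SamplingMessage(
                role="user",
                content=TextContent(type="text", text=f"Review this plan:\n{plan}"),
            )
        ],
        max_tokens=500,
    )
    # MCP Elicitation: route a missing decision back to the human.
    answer = await ctx.elicit(
        message="Which library should the plan standardize on?",
        schema=Clarification,
    )
    library = answer.data.preferred_library if answer.action == "accept" else "undecided"
    return f"verdict={verdict.content}, library={library}"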
System Prerequisites
- Docker Desktop / Python 3.13+ - Required for running the MCP server
Supported AI Assistants
| AI Assistant | Platform | MCP Support | Status | Notes |
|---|---|---|---|---|
| GitHub Copilot | Visual Studio Code | ✅ Full | Recommended | Complete MCP integration with sampling and elicitation |
| Claude Code | - | ⚠️ Partial | Requires LLM API key | Sampling and elicitation support are open feature requests |
| Cursor | - | ⚠️ Partial | Requires LLM API key | MCP support available, but sampling/elicitation limited |
| Augment | - | ⚠️ Partial | Requires LLM API key | MCP support available, but sampling/elicitation limited |
| Qodo | - | ⚠️ Partial | Requires LLM API key | MCP support available, but sampling/elicitation limited |
✅ Recommended setup: GitHub Copilot + VS Code — full MCP sampling; no API key needed.
⚠️ Critical: For assistants without full MCP sampling (Cursor, Claude Code, Augment, Qodo), you MUST set LLM_API_KEY. Without it, the server cannot evaluate plans or code. See LLM API Configuration.
💡 Tip: Prefer large context models (≥ 1M tokens) for better analysis and judgments.
If the MCP server isn’t used automatically
See the FAQ section for troubleshooting, including prompts that force the Judge to run.
🔧 MCP Configuration
Configure MCP as a Judge in your MCP-enabled client:
Method 1: Using Docker (Recommended)
One‑click install for VS Code (MCP)
Notes:
- VS Code controls the sampling model; select it via “MCP: List Servers → mcp-as-a-judge → Configure Model Access”.

Configure MCP Settings:
Add this to your MCP client configuration file:

{
  "command": "docker",
  "args": ["run", "--rm", "-i", "--pull=always", "ghcr.io/othervibes/mcp-as-a-judge:latest"],
  "env": {
    "LLM_API_KEY": "your-openai-api-key-here",
    "LLM_MODEL_NAME": "gpt-4o-mini"
  }
}
📝 Configuration Options (All Optional):
- LLM_API_KEY: Optional for GitHub Copilot + VS Code (has built-in MCP sampling)
- LLM_MODEL_NAME: Optional custom model (see Supported LLM Providers for defaults)
- The --pull=always flag ensures you always get the latest version automatically

To update manually when needed:

# Pull the latest version
docker pull ghcr.io/othervibes/mcp-as-a-judge:latest
Method 2: Using uv
1. Install the package:

   uv tool install mcp-as-a-judge

2. Configure MCP Settings:

   The MCP server may be automatically detected by your MCP‑enabled client.

   📝 Notes:
   - No additional configuration needed for GitHub Copilot + VS Code (has built-in MCP sampling)
   - LLM_API_KEY is optional and can be set via environment variable if needed

3. To update to the latest version:

   # Update MCP as a Judge to the latest version
   uv tool upgrade mcp-as-a-judge
Select a sampling model in VS Code
- Open Command Palette (Cmd/Ctrl+Shift+P) → “MCP: List Servers”
- Select the configured server “mcp-as-a-judge”
- Choose “Configure Model Access”
- Check your preferred model(s) to enable sampling
🔑 LLM API Configuration (Optional)
For AI assistants without full MCP sampling support you can configure an LLM API key as a fallback. This ensures MCP as a Judge works even when the client doesn't support MCP sampling.
- Set LLM_API_KEY (unified key). The vendor is auto-detected from the key format; optionally set LLM_MODEL_NAME to override the default model.
Supported LLM Providers
| Rank | Provider | API Key Format | Default Model | Notes |
|---|---|---|---|---|
| 1 | OpenAI | sk-... | gpt-4.1 | Fast and reliable model optimized for speed |
| 2 | Anthropic | sk-ant-... | claude-sonnet-4-20250514 | High-performance with exceptional reasoning |
| 3 | Google Gemini | AIza... | gemini-2.5-pro | Most advanced model with built-in thinking |
| 4 | Azure OpenAI | [a-f0-9]{32} | gpt-4.1 | Same as OpenAI but via Azure |
| 5 | AWS Bedrock | AWS credentials | anthropic.claude-sonnet-4-20250514-v1:0 | Aligned with Anthropic |
| 6 | Vertex AI | Service Account JSON | gemini-2.5-pro | Enterprise Gemini via Google Cloud |
| 7 | Groq | gsk_... | deepseek-r1 | Best reasoning model with speed advantage |
| 8 | OpenRouter | sk-or-... | deepseek/deepseek-r1 | Best reasoning model available |
| 9 | xAI | xai-... | grok-code-fast-1 | Latest coding-focused model (Aug 2025) |
| 10 | Mistral | [a-f0-9]{64} | pixtral-large | Most advanced model (124B params) |
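To illustrate the automatic vendor detection the table describes, here is a toy sketch that maps key prefixes to a provider and its default model. It is illustrative only, not the server's actual logic; notably, the hex-format keys (Azure OpenAI, Mistral) carry no prefix and would need length or context checks to disambiguate.

# Toy prefix -> (provider, default model) mapping, taken from the table above.
KEY_PREFIXES = {
    "sk-ant-": ("anthropic", "claude-sonnet-4-20250514"),
    "sk-or-": ("openrouter", "deepseek/deepseek-r1"),
    "sk-": ("openai", "gpt-4.1"),  # checked after the more specific sk- prefixes
    "AIza": ("gemini", "gemini-2.5-pro"),
    "gsk_": ("groq", "deepseek-r1"),
    "xai-": ("xai", "grok-code-fast-1"),
}

def detect_vendor(api_key: str) -> tuple[str, str]:
    """Return (provider, default_model) for a key, per the table above."""
    for prefix, vendor in KEY_PREFIXES.items():  # dict order is preserved
        if api_key.startswith(prefix):
            return vendor
    raise ValueError("Unrecognized key format; set LLM_MODEL_NAME explicitly")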
Client-Specific Setup
Cursor
1. Open Cursor Settings:
   - Go to File → Preferences → Cursor Settings
   - Navigate to the MCP tab
   - Click + Add to add a new MCP server

2. Add MCP Server Configuration:

   {
     "command": "uv",
     "args": ["tool", "run", "mcp-as-a-judge"],
     "env": {
       "LLM_API_KEY": "your-openai-api-key-here",
       "LLM_MODEL_NAME": "gpt-4.1"
     }
   }
📝 Configuration Options:
- LLM_API_KEY: Required for Cursor (limited MCP sampling)
- LLM_MODEL_NAME: Optional custom model (see Supported LLM Providers for defaults)
Claude Code
1. Add MCP Server via CLI:

   # Set environment variables first (optional model override)
   export LLM_API_KEY="your_api_key_here"
   export LLM_MODEL_NAME="claude-3-5-haiku"  # Optional: faster/cheaper model

   # Add MCP server
   claude mcp add mcp-as-a-judge -- uv tool run mcp-as-a-judge

2. Alternative: Manual Configuration:

   Create or edit ~/.config/claude-code/mcp_servers.json:

   {
     "command": "uv",
     "args": ["tool", "run", "mcp-as-a-judge"],
     "env": {
       "LLM_API_KEY": "your-anthropic-api-key-here",
       "LLM_MODEL_NAME": "claude-3-5-haiku"
     }
   }
📝 Configuration Options:
- LLM_API_KEY: Required for Claude Code (limited MCP sampling)
- LLM_MODEL_NAME: Optional custom model (see Supported LLM Providers for defaults)
Other MCP Clients
For other MCP-compatible clients, use the standard MCP server configuration:
{
"command": "uv",
"args": ["tool", "run", "mcp-as-a-judge"],
"env": {
"LLM_API_KEY": "your-openai-api-key-here",
"LLM_MODEL_NAME": "gpt-5"
}
}
📝 Configuration Options:
- LLM_API_KEY: Required for most MCP clients (except GitHub Copilot + VS Code)
- LLM_MODEL_NAME: Optional custom model (see Supported LLM Providers for defaults)
🔒 Privacy & Flexible AI Integration
🔑 MCP Sampling (Preferred) + LLM API Key Fallback
Primary Mode: MCP Sampling
- All judgments are performed using MCP Sampling capability
- No need to configure or pay for external LLM API services
- Works directly with your MCP-compatible client's existing AI model
- Currently supported by: GitHub Copilot + VS Code
Fallback Mode: LLM API Key
- When MCP sampling is not available, the server can use LLM API keys
- Supports multiple providers via LiteLLM: OpenAI, Anthropic, Google, Azure, Groq, Mistral, xAI (a call sketch follows this list)
- Automatic vendor detection from API key patterns
- Default model selection per vendor when no model is specified
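Since fallback calls go through LiteLLM, the shape of such a judgment call is roughly as follows. This is a generic LiteLLM usage sketch using the table's OpenAI default, not the project's internal code.

import os

from litellm import completion  # unified, OpenAI-compatible client

# The provider is inferred from the model name; the unified key is passed through.
response = completion(
    model=os.environ.get("LLM_MODEL_NAME", "gpt-4.1"),  # table default for OpenAI
    api_key=os.environ.get("LLM_API_KEY"),
    messages=[
        {"role": "system", "content": "You are a strict code reviewer."},
        {"role": "user", "content": "Judge this unified diff:\n..."},
    ],
)
print(response.choices[0].message.content)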
🛡️ Your Privacy Matters
- The server runs locally on your machine
- No data collection - your code and conversations stay private
- No external API calls when using MCP Sampling. If you set LLM_API_KEY for fallback, the server calls your chosen LLM provider only to perform judgments (plan/code/test) with the evaluation content you provide.
- Complete control over your development workflow and sensitive information
🤝 Contributing
We welcome contributions! Please see CONTRIBUTING.md for guidelines.
Development Setup
# Clone the repository
git clone https://github.com/OtherVibes/mcp-as-a-judge.git
cd mcp-as-a-judge
# Install dependencies with uv
uv sync --all-extras --dev
# Install pre-commit hooks
uv run pre-commit install
# Run tests
uv run pytest
# Run all checks
uv run pytest && uv run ruff check && uv run ruff format --check && uv run mypy src
© Concepts and Methodology
© 2025 OtherVibes and Zvi Fried. The "MCP as a Judge" concept, the "behavioral MCP" approach, the staged workflow (plan → code → test → completion), tool taxonomy/descriptions, and prompt templates are original work developed in this repository.
Prior Art and Attribution
While “LLM‑as‑a‑judge” is a broadly known idea, this repository defines the original “MCP as a Judge” behavioral MCP pattern by OtherVibes and Zvi Fried. It combines task‑centric workflow enforcement (plan → code → test → completion), explicit LLM‑based validations, and human‑in‑the‑loop elicitation, along with the prompt templates and tool taxonomy provided here. Please attribute as: “OtherVibes – MCP as a Judge (Zvi Fried)”.
❓ FAQ
How is “MCP as a Judge” different from rules/subagents in IDE assistants (GitHub Copilot, Cursor, Claude Code)?
| Feature | IDE Rules | Subagents | MCP as a Judge |
|---|---|---|---|
| Static behavior guidance | ✓ | ✓ | ✗ |
| Custom system prompts | ✓ | ✓ | ✓ |
| Project context integration | ✓ | ✓ | ✓ |
| Specialized task handling | ✗ | ✓ | ✓ |
| Active quality gates | ✗ | ✗ | ✓ |
| Evidence-based validation | ✗ | ✗ | ✓ |
| Approve/reject with feedback | ✗ | ✗ | ✓ |
| Workflow enforcement | ✗ | ✗ | ✓ |
| Cross-assistant compatibility | ✗ | ✗ | ✓ |
How does the Judge workflow relate to the tasklist? Why do we need both?
- Tasklist = planning/organization: tracks tasks, priorities, and status. It doesn’t guarantee engineering quality or readiness.
- Judge workflow = quality gates: enforces approvals for plan/design, code diffs, tests, and final completion. It demands real evidence (e.g., unified Git diffs and raw test output) and returns structured approvals and required improvements.
- Together: Use the tasklist to organize work; use the Judge to decide when each stage is actually ready to proceed. The server also emits next_tool guidance to keep progress moving through the gates (an illustrative verdict shape follows this list).
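For a sense of what a structured verdict can look like, here is a hypothetical payload; the exact field names are assumptions for illustration and are not documented here.

# Hypothetical verdict shape; every field name here is illustrative.
verdict = {
    "approved": False,
    "required_improvements": [
        "Add input validation for user-supplied paths",
        "Reuse the existing retry helper instead of adding a new loop",
    ],
    "next_tool": "judge_code_change",  # the server's guidance for the next gate
}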
If the Judge isn’t used automatically, how do I force it?
- In your prompt: "use mcp-as-a-judge" or "Evaluate plan/code/test using the MCP server mcp-as-a-judge".
- VS Code: Command Palette → "MCP: List Servers" → ensure "mcp-as-a-judge" is listed and enabled.
- Ensure the MCP server is running and, in your client, the judge tools are enabled/approved.
How do I select models for sampling in VS Code?
- Open Command Palette (Cmd/Ctrl+Shift+P) → "MCP: List Servers"
- Select "mcp-as-a-judge" → "Configure Model Access"
- Check your preferred model(s) to enable sampling
📄 License
This project is licensed under the MIT License (see LICENSE).
🙏 Acknowledgments
- Model Context Protocol by Anthropic
- LiteLLM for unified LLM API integration
Download files
Source Distribution
Built Distribution
File details
Details for the file mcp_as_a_judge-0.3.19.tar.gz.
File metadata
- Download URL: mcp_as_a_judge-0.3.19.tar.gz
- Upload date:
- Size: 330.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9711c5e377a623a32489e75761ddaef4b5ecdc07b0e4752785aa5171fa983372 |
| MD5 | fa698cd1f7c02bbdd91a4becfb842614 |
| BLAKE2b-256 | ce95dc74b32233d47b9bbc9e2a39eee96fc696af45d9cbdd8f0d7467daa00ed7 |
File details
Details for the file mcp_as_a_judge-0.3.19-py3-none-any.whl.
File metadata
- Download URL: mcp_as_a_judge-0.3.19-py3-none-any.whl
- Upload date:
- Size: 155.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 74b9d66daa02c4386de42a028c0ac04aefd602061204b8e84f209091744558c6 |
| MD5 | d16420272b9a990502648fa310a6929a |
| BLAKE2b-256 | 78a4df48187f039910b63ef1c7db2ca3bcefd6facbec693fe9b15e206994bf8f |