# mcp-llm-eval
A local Model Context Protocol (MCP) server that packages LLM evaluation gates as reusable CI/CD primitives. Run datasets against multiple models, score responses with an LLM-as-judge, and enforce quality thresholds — all through MCP tools that AI agents can call.
## Why?
There's no unit test for LLM quality. Teams ship prompt changes, swap models, or update system prompts with no automated way to verify that output quality didn't regress. Manual spot-checking doesn't scale, and existing eval frameworks are heavy, opinionated, and hard to wire into CI/CD.
mcp-llm-eval gives AI agents structured access to a lightweight eval pipeline. Instead of building custom scripts for every project, you define a dataset, point the agent at it, and get scored results with pass/fail gates — the same workflow whether you're testing locally or gating a deployment.
## Features

| Tool | Description |
|---|---|
| `run_evaluation` | Load a dataset, query models via streaming, score with LLM-as-judge, return per-question scores and an aggregate summary |
| `check_thresholds` | Validate evaluation results against quality gates (faithfulness, relevance, TTFT, cost) |
| `list_evaluations` | List past evaluation runs with metadata (timestamp, models, cost, pass/fail) |
| `get_evaluation` | Retrieve full details of a specific run (per-question scores, responses, judge reasoning) |
| `compare_runs` | Compare two evaluation runs and detect regressions beyond a configurable tolerance |
| `format_pr_comment` | Generate a markdown PR comment from evaluation results with regression details and threshold status |
## What it measures
- Faithfulness (0-1) — Is the response grounded in the provided context?
- Relevance (0-1) — Does the response actually answer the question?
- Time to First Token — Streaming latency in milliseconds
- Cost per Query — Estimated cost based on token usage and provider pricing
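The first two metrics come from an LLM-as-judge pass; the cost figure is derived from token counts and per-million-token prices. Below is a minimal sketch of how such scoring and costing could look, using the OpenAI SDK as the judge. The prompt wording, score parsing, and cost formula are illustrative assumptions, not the package's internal implementation.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def judge_faithfulness(context: str, question: str, answer: str) -> float:
    """Ask a judge model for a 0-1 faithfulness score (illustrative prompt)."""
    prompt = (
        "Rate from 0 to 1 how well the ANSWER is grounded in the CONTEXT.\n\n"
        f"CONTEXT:\n{context}\n\nQUESTION:\n{question}\n\nANSWER:\n{answer}\n\n"
        "Reply with only the number."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    return float((resp.choices[0].message.content or "0").strip())


def cost_per_query(input_tokens: int, output_tokens: int,
                   input_cost_per_mtok: float, output_cost_per_mtok: float) -> float:
    """Estimated USD cost from token usage and per-million-token pricing."""
    return (input_tokens * input_cost_per_mtok
            + output_tokens * output_cost_per_mtok) / 1_000_000
```

A relevance score would follow the same pattern, comparing the answer against the question and the dataset's `expected_response`.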
## Quick Start

### 1. Install

```bash
pip install mcp-llm-eval
```

Then install the provider SDKs you need (they are not bundled):

```bash
# Pick what you use
pip install anthropic     # for Claude models
pip install openai        # for GPT models + judge
pip install google-genai  # for Gemini models
```
### 2. Configure Claude Desktop
Add this to your Claude Desktop MCP configuration file:
| OS | Path |
|---|---|
| macOS | ~/Library/Application Support/Claude/claude_desktop_config.json |
| Windows | %APPDATA%\Claude\claude_desktop_config.json |
Recommended — with `uvx` (no install required):

```json
{
  "mcpServers": {
    "llm-eval": {
      "command": "uvx",
      "args": ["mcp-llm-eval"],
      "env": {
        "ANTHROPIC_API_KEY": "sk-ant-...",
        "OPENAI_API_KEY": "sk-...",
        "GOOGLE_API_KEY": "AIza..."
      }
    }
  }
}
```
Note: Only include API keys for the providers you plan to evaluate. For example, if you only use Anthropic and OpenAI (for the judge), omit `GOOGLE_API_KEY`.

Note: Claude Desktop may not inherit your terminal's `$PATH`. If the server fails to connect, use the absolute path to `uvx` (find it with `which uvx`):

```json
{
  "mcpServers": {
    "llm-eval": {
      "command": "/full/path/to/uvx",
      "args": ["mcp-llm-eval"],
      "env": {
        "ANTHROPIC_API_KEY": "sk-ant-...",
        "OPENAI_API_KEY": "sk-..."
      }
    }
  }
}
```
Alternative — installed via pip:

```json
{
  "mcpServers": {
    "llm-eval": {
      "command": "mcp-llm-eval",
      "env": {
        "ANTHROPIC_API_KEY": "sk-ant-...",
        "OPENAI_API_KEY": "sk-...",
        "GOOGLE_API_KEY": "AIza..."
      }
    }
  }
}
```
Alternative — from source (virtualenv):

```json
{
  "mcpServers": {
    "llm-eval": {
      "command": "/absolute/path/to/mcp-llm-eval/.venv/bin/python",
      "args": ["-m", "mcp_llm_eval.server"],
      "env": {
        "ANTHROPIC_API_KEY": "sk-ant-...",
        "OPENAI_API_KEY": "sk-...",
        "GOOGLE_API_KEY": "AIza..."
      }
    }
  }
}
```
### 3. Restart Claude Desktop
Fully quit (Cmd+Q on macOS) and reopen. Look for the tools icon to confirm the server is connected.
### 4. Ask a question

> "Run the eval dataset at `/path/to/dataset.json` against Claude Sonnet and GPT-4o, then check if faithfulness is above 0.8."
## Example interaction

Claude autonomously chains the tools — running the evaluation, then checking thresholds:

```
Running evaluation...
- Dataset: 10 questions (4 factual, 3 reasoning, 3 summarization)
- Models: claude-sonnet-4-6, gpt-4o-mini
- Judge: gpt-4o-mini

Results:
  claude-sonnet-4-6: avg faithfulness=0.92, relevance=0.88, TTFT=340ms, cost=$0.0045/q
  gpt-4o-mini:       avg faithfulness=0.85, relevance=0.82, TTFT=180ms, cost=$0.0003/q

Threshold check:
  avg_faithfulness   >= 0.80: PASS (actual: 0.885)
  avg_relevance      >= 0.75: PASS (actual: 0.850)
  p95_ttft_ms        <= 500:  PASS (actual: 420ms)
  max_cost_per_query <= 0.01: PASS (actual: $0.0045)

Overall: PASS
```
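The averages and the p95 TTFT in that summary are aggregates over per-question results. A rough sketch of that kind of aggregation in plain Python, using a nearest-rank p95 (the package's exact method and field names may differ):

```python
import math


def aggregate(results: list[dict]) -> dict:
    """Roll per-question results up into summary-style metrics (illustrative field names)."""
    faithfulness = [r["faithfulness"] for r in results]
    relevance = [r["relevance"] for r in results]
    ttft = sorted(r["ttft_ms"] for r in results)
    # Nearest-rank p95: the smallest sample with at least 95% of values at or below it.
    p95_index = max(0, math.ceil(0.95 * len(ttft)) - 1)
    return {
        "avg_faithfulness": sum(faithfulness) / len(faithfulness),
        "avg_relevance": sum(relevance) / len(relevance),
        "p95_ttft_ms": ttft[p95_index],
        "max_cost_per_query": max(r["cost_usd"] for r in results),
    }
```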
## Configuration

Create an `.eval-gate.yml` in your project root for repeatable threshold configs:
```yaml
thresholds:
  avg_faithfulness: 0.80
  avg_relevance: 0.75
  p95_ttft_ms: 500
  max_cost_per_query: 0.01

models:
  - provider: anthropic
    model: claude-sonnet-4-6
    input_cost_per_mtok: 3.0
    output_cost_per_mtok: 15.0
  - provider: openai
    model: gpt-4o-mini
    input_cost_per_mtok: 0.15
    output_cost_per_mtok: 0.60

judge:
  provider: openai
  model: gpt-4o-mini
  temperature: 0
```
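Conceptually, the `thresholds` block is a set of comparisons against the run summary: score metrics must stay at or above a minimum, while latency and cost must stay at or below a maximum. A minimal sketch of such a gate, assuming the summary's field names mirror the threshold keys (this is not the package's internal code):

```python
import yaml


def passes_thresholds(summary: dict, config_path: str = ".eval-gate.yml") -> bool:
    """Return True only if every configured threshold passes."""
    with open(config_path) as f:
        thresholds = yaml.safe_load(f)["thresholds"]

    checks = {
        # Quality scores must meet their minimums...
        "avg_faithfulness": summary["avg_faithfulness"] >= thresholds["avg_faithfulness"],
        "avg_relevance": summary["avg_relevance"] >= thresholds["avg_relevance"],
        # ...while latency and cost must stay under their maximums.
        "p95_ttft_ms": summary["p95_ttft_ms"] <= thresholds["p95_ttft_ms"],
        "max_cost_per_query": summary["max_cost_per_query"] <= thresholds["max_cost_per_query"],
    }
    for name, ok in checks.items():
        print(f"{name}: {'PASS' if ok else 'FAIL'}")
    return all(checks.values())
```

A boolean result like this maps naturally onto a process exit code, which is how the CLI's `check` command (below) blocks a PR.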
## Dataset schema
The evaluation dataset is a JSON array of entries:
```json
[
  {
    "id": "unique-id",
    "category": "factual",
    "context": "The system prompt / context provided to the model",
    "question": "The question asked",
    "expected_response": "Reference answer for the judge to compare against",
    "tags": ["optional", "tags"]
  }
]
```
Required fields: `id`, `category`, `context`, `question`, `expected_response`. The `tags` field is optional.
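Before pointing an agent or a CI job at a dataset, it can be worth validating entries against this schema. A small standalone sketch (the package may perform its own, stricter validation):

```python
import json

REQUIRED_FIELDS = {"id", "category", "context", "question", "expected_response"}


def validate_dataset(path: str) -> None:
    """Raise ValueError if an entry is missing a required field or reuses an id."""
    with open(path) as f:
        entries = json.load(f)
    seen_ids = set()
    for i, entry in enumerate(entries):
        missing = REQUIRED_FIELDS - entry.keys()
        if missing:
            raise ValueError(f"entry {i} is missing fields: {sorted(missing)}")
        if entry["id"] in seen_ids:
            raise ValueError(f"duplicate id: {entry['id']}")
        seen_ids.add(entry["id"])


validate_dataset("/path/to/dataset.json")  # use an absolute path, as the server expects
```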
## Usage modes

### MCP agent
Connect to Claude Desktop or any MCP-compatible agent. The agent calls tools directly — run evals, check thresholds, browse past runs, compare runs, and generate PR comments.
### CLI

The same `mcp-llm-eval` binary doubles as a CLI for CI/CD pipelines:
```bash
# Run a full evaluation
mcp-llm-eval run --config .eval-gate.yml --dataset eval/dataset.json --output-dir eval/results

# Check thresholds (exit code 1 on failure — blocks PRs)
mcp-llm-eval check --results eval/results/latest_summary.json --config .eval-gate.yml

# Compare against baseline (exit code 1 on regression)
mcp-llm-eval compare --baseline eval/results/main_summary.json --current eval/results/pr_summary.json

# Generate PR comment markdown
mcp-llm-eval comment --summary eval/results/latest_summary.json --config .eval-gate.yml --output pr-comment.md
```
### GitHub Actions

```yaml
name: LLM Eval Gate
on:
  pull_request:

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install mcp-llm-eval anthropic openai
      - run: mcp-llm-eval run --config .eval-gate.yml --dataset eval/dataset.json --output-dir eval/results
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      - run: mcp-llm-eval check --results eval/results/latest_summary.json --config .eval-gate.yml
      - run: |
          mcp-llm-eval comment --summary eval/results/latest_summary.json --config .eval-gate.yml --output pr-comment.md
          gh pr comment ${{ github.event.number }} --body-file pr-comment.md
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```
## Troubleshooting

### Server not appearing in Claude Desktop

- Ensure Claude Desktop is fully restarted (quit with `Cmd+Q`, not just close the window).
- Check your config JSON is valid — a trailing comma or typo will silently break it.
- Use absolute paths if `uvx` or `mcp-llm-eval` aren't found.
"Provider SDK not installed" errors
Provider SDKs are optional. Install the ones you need:
pip install anthropic openai google-genai
"Dataset file not found" errors
Use the full absolute path to your dataset file, not a relative path.
### Judge scoring fails

The default judge uses OpenAI's `gpt-4o-mini`. Make sure the `openai` package is installed and `OPENAI_API_KEY` is set in your environment.
### This is Claude Desktop only
MCP servers work with the Claude Desktop app, not claude.ai in your browser.
Development
# Clone and set up
git clone https://github.com/berkayildi/mcp-llm-eval.git
cd mcp-llm-eval
make setup
# Run tests
make test
# Build distribution
make build
# Run the server locally (stdio)
make start
# Clean everything
make clean
## License
MIT © Berkay Yildirim
## Download files
### File details: mcp_llm_eval-0.3.0.tar.gz

File metadata:

- Size: 43.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes:

| Algorithm | Hash digest |
|---|---|
| SHA256 | `a77e203488aba768748af07e57d0e95fac65bfa5e2b153af8cb90e59cfc3e93f` |
| MD5 | `11ed09fcaed2f8054a424960070ce536` |
| BLAKE2b-256 | `8e387848fc617e7cbbd31030bec3b81d7339fdbd49bc9986f7d7f02b491d7202` |
Provenance:

The following attestation bundle was made for `mcp_llm_eval-0.3.0.tar.gz`:

Publisher: `release.yml` on `berkayildi/mcp-llm-eval`

- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: mcp_llm_eval-0.3.0.tar.gz
- Subject digest: a77e203488aba768748af07e57d0e95fac65bfa5e2b153af8cb90e59cfc3e93f
- Sigstore transparency entry: 1322777347
- Permalink: berkayildi/mcp-llm-eval@28e7f90e711e66e13921d7ae695de4203675b7a6
- Branch / Tag: refs/heads/main
- Owner: https://github.com/berkayildi
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@28e7f90e711e66e13921d7ae695de4203675b7a6
- Trigger Event: push
### File details: mcp_llm_eval-0.3.0-py3-none-any.whl

File metadata:

- Size: 26.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes:

| Algorithm | Hash digest |
|---|---|
| SHA256 | `4431429ec86d90dc1617446f6f5a95c40bd161a580b428b22d0745103d7f9c4f` |
| MD5 | `cce46df411651ffc00e0371c38fc12b1` |
| BLAKE2b-256 | `c2688e803ca7bf8201df94406ddb1a4a94cc3401668e66c66a00d45cae96e2ef` |
Provenance:

The following attestation bundle was made for `mcp_llm_eval-0.3.0-py3-none-any.whl`:

Publisher: `release.yml` on `berkayildi/mcp-llm-eval`

- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: mcp_llm_eval-0.3.0-py3-none-any.whl
- Subject digest: 4431429ec86d90dc1617446f6f5a95c40bd161a580b428b22d0745103d7f9c4f
- Sigstore transparency entry: 1322777464
- Permalink: berkayildi/mcp-llm-eval@28e7f90e711e66e13921d7ae695de4203675b7a6
- Branch / Tag: refs/heads/main
- Owner: https://github.com/berkayildi
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@28e7f90e711e66e13921d7ae695de4203675b7a6
- Trigger Event: push