mcp-llm-eval

Python 3.10+ · License: MIT

A local Model Context Protocol (MCP) server that packages LLM evaluation gates as reusable CI/CD primitives. Run datasets against multiple models, score responses with an LLM-as-judge, and enforce quality thresholds — all through MCP tools that AI agents can call.

flowchart LR
    A[PR opened] --> B[Run dataset<br/>through models]
    B --> C[Judge scores<br/>faithfulness + relevance]
    C --> D{Thresholds met?}
    D -->|Yes| E[PR passes]
    D -->|No| F[PR blocked<br/>with diff comment]

Why?

There's no unit test for LLM quality. Teams ship prompt changes, swap models, or update system prompts with no automated way to verify that output quality didn't regress. Manual spot-checking doesn't scale, and existing eval frameworks are heavy, opinionated, and hard to wire into CI/CD.

mcp-llm-eval gives AI agents structured access to a lightweight eval pipeline. Instead of building custom scripts for every project, you define a dataset, point the agent at it, and get scored results with pass/fail gates — the same workflow whether you're testing locally or gating a deployment.


Features

Tool               Description
run_evaluation     Load a dataset, query models via streaming, score with LLM-as-judge, return per-question scores and an aggregate summary
check_thresholds   Validate evaluation results against quality gates (faithfulness, relevance, TTFT, cost)
list_evaluations   List past evaluation runs with metadata (timestamp, models, cost, pass/fail)
get_evaluation     Retrieve full details of a specific run (per-question scores, responses, judge reasoning)
compare_runs       Compare two evaluation runs and detect regressions beyond a configurable tolerance
format_pr_comment  Generate a markdown PR comment from evaluation results with regression details and threshold status
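
These are ordinary MCP tools, so anything that speaks MCP can call them, not just Claude Desktop. As a minimal sketch, driving the server from code with the official mcp Python SDK might look like the following; the run_evaluation argument names are illustrative assumptions, so call list_tools() to discover the real schema:

import asyncio, os
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    # Launch the server over stdio, the same way Claude Desktop does.
    params = StdioServerParameters(
        command="uvx",
        args=["mcp-llm-eval"],
        env={"OPENAI_API_KEY": os.environ["OPENAI_API_KEY"]},
    )
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            print(await session.list_tools())  # the authoritative tool schemas
            # Argument names below are assumptions for illustration only.
            result = await session.call_tool("run_evaluation", {
                "dataset": "/path/to/dataset.json",
                "models": ["claude-sonnet-4-6", "gpt-4o-mini"],
            })
            print(result.content)

asyncio.run(main())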

What it measures

  • Faithfulness (0-1) — Is the response grounded in the provided context?
  • Relevance (0-1) — Does the response actually answer the question?
  • Time to First Token — Streaming latency in milliseconds
  • Cost per Query — Estimated cost based on token usage and provider pricing
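
All four metrics come straight out of a run's raw data; cost in particular is just token counts times the per-million-token prices you declare in config (see input_cost_per_mtok / output_cost_per_mtok under Configuration). A back-of-envelope sketch of that arithmetic, with hypothetical token counts:

# Hypothetical token counts for one query; prices are USD per million
# tokens, taken from the claude-sonnet-4-6 example config below.
input_tokens, output_tokens = 1000, 200
input_cost_per_mtok, output_cost_per_mtok = 3.0, 15.0

cost = (input_tokens / 1_000_000) * input_cost_per_mtok \
     + (output_tokens / 1_000_000) * output_cost_per_mtok
print(f"${cost:.4f}/query")  # $0.0060/query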

Quick Start

1. Install

pip install mcp-llm-eval

Then install the provider SDKs you need (they are not bundled):

# Pick what you use
pip install anthropic    # for Claude models
pip install openai       # for GPT models + judge
pip install google-genai # for Gemini models

2. Configure Claude Desktop

Add this to your Claude Desktop MCP configuration file:

OS       Path
macOS    ~/Library/Application Support/Claude/claude_desktop_config.json
Windows  %APPDATA%\Claude\claude_desktop_config.json

Recommended — with uvx (no install required):

{
  "mcpServers": {
    "llm-eval": {
      "command": "uvx",
      "args": ["mcp-llm-eval"],
      "env": {
        "ANTHROPIC_API_KEY": "sk-ant-...",
        "OPENAI_API_KEY": "sk-...",
        "GOOGLE_API_KEY": "AIza..."
      }
    }
  }
}

Note: Only include API keys for the providers you plan to evaluate. For example, if you only use Anthropic and OpenAI (for the judge), omit GOOGLE_API_KEY.

Note: Claude Desktop may not inherit your terminal's $PATH. If the server fails to connect, use the absolute path to uvx (find it with which uvx):

{
  "mcpServers": {
    "llm-eval": {
      "command": "/full/path/to/uvx",
      "args": ["mcp-llm-eval"],
      "env": {
        "ANTHROPIC_API_KEY": "sk-ant-...",
        "OPENAI_API_KEY": "sk-..."
      }
    }
  }
}

Alternative — installed via pip:

{
  "mcpServers": {
    "llm-eval": {
      "command": "mcp-llm-eval",
      "env": {
        "ANTHROPIC_API_KEY": "sk-ant-...",
        "OPENAI_API_KEY": "sk-...",
        "GOOGLE_API_KEY": "AIza..."
      }
    }
  }
}

Alternative — from source (virtualenv):

{
  "mcpServers": {
    "llm-eval": {
      "command": "/absolute/path/to/mcp-llm-eval/.venv/bin/python",
      "args": ["-m", "mcp_llm_eval.server"],
      "env": {
        "ANTHROPIC_API_KEY": "sk-ant-...",
        "OPENAI_API_KEY": "sk-...",
        "GOOGLE_API_KEY": "AIza..."
      }
    }
  }
}

3. Restart Claude Desktop

Fully quit (Cmd+Q on macOS) and reopen. Look for the tools icon to confirm the server is connected.

4. Ask a question

"Run the eval dataset at /path/to/dataset.json against Claude Sonnet and GPT-4o, then check if faithfulness is above 0.8."


Example interaction

Claude autonomously chains the tools — running the evaluation, then checking thresholds (numbers below are illustrative):

Running evaluation...
- Dataset: 9 questions (3 factual, 3 reasoning, 3 summarization)
- Models: claude-sonnet-4-6, gpt-4o-mini
- Judge: gpt-4o-mini

Results:
  claude-sonnet-4-6: avg faithfulness=0.92, relevance=0.88, TTFT=340ms, cost=$0.0045/q
  gpt-4o-mini:       avg faithfulness=0.85, relevance=0.82, TTFT=180ms, cost=$0.0003/q

Threshold check:
  avg_faithfulness >= 0.80: PASS (actual: 0.885)
  avg_relevance >= 0.75:    PASS (actual: 0.850)
  p95_ttft_ms <= 500:       PASS (actual: 420ms)
  max_cost_per_query <= 0.01: PASS (actual: $0.0045)

Overall: PASS
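
Assuming the aggregate thresholds are checked against plain means pooled across models (which is what the illustrative numbers above imply), the arithmetic is easy to verify by hand:

from statistics import mean

# Per-model averages from the illustrative run above.
faithfulness = {"claude-sonnet-4-6": 0.92, "gpt-4o-mini": 0.85}
relevance = {"claude-sonnet-4-6": 0.88, "gpt-4o-mini": 0.82}

print(mean(faithfulness.values()))  # 0.885, clears avg_faithfulness >= 0.80
print(mean(relevance.values()))     # 0.85, clears avg_relevance >= 0.75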

Configuration

Create a .eval-gate.yml in your project root for repeatable threshold configs:

thresholds:
  avg_faithfulness: 0.80
  avg_relevance: 0.75
  p95_ttft_ms: 500
  max_cost_per_query: 0.01

models:
  - provider: anthropic
    model: claude-sonnet-4-6
    input_cost_per_mtok: 3.0
    output_cost_per_mtok: 15.0
  - provider: openai
    model: gpt-4o-mini
    input_cost_per_mtok: 0.15
    output_cost_per_mtok: 0.60

judge:
  provider: openai
  model: gpt-4o-mini
  temperature: 0
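
Since this is plain YAML, it is easy to sanity-check in CI before spending API credits. A minimal sketch with PyYAML (a separate dependency, not part of this package's documented API):

import yaml  # pip install pyyaml

with open(".eval-gate.yml") as f:
    config = yaml.safe_load(f)

# Fail fast on a missing or mistyped threshold key.
expected = {"avg_faithfulness", "avg_relevance", "p95_ttft_ms", "max_cost_per_query"}
missing = expected - set(config.get("thresholds", {}))
if missing:
    raise SystemExit(f"missing thresholds: {sorted(missing)}")

print(config["judge"]["model"])  # gpt-4o-mini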

Dataset schema

The evaluation dataset is a JSON array of entries:

[
  {
    "id": "unique-id",
    "category": "factual",
    "context": "The system prompt / context provided to the model",
    "question": "The question asked",
    "expected_response": "Reference answer for the judge to compare against",
    "tags": ["optional", "tags"]
  }
]

Required fields: id, category, context, question, expected_response. The tags field is optional.
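
A quick way to catch malformed entries before a run is to check those fields yourself; a minimal sketch (this is not a bundled command):

import json

REQUIRED = {"id", "category", "context", "question", "expected_response"}

with open("/path/to/dataset.json") as f:
    dataset = json.load(f)

for i, entry in enumerate(dataset):
    missing = REQUIRED - entry.keys()
    if missing:
        raise SystemExit(f"entry {i} ({entry.get('id', '?')}): missing {sorted(missing)}")

print(f"{len(dataset)} entries OK")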


Usage modes

MCP agent

Connect to Claude Desktop or any MCP-compatible agent. The agent calls tools directly — run evals, check thresholds, browse past runs, compare runs, and generate PR comments.

CLI

The same mcp-llm-eval command doubles as a CLI for CI/CD pipelines:

# Run a full evaluation
mcp-llm-eval run --config .eval-gate.yml --dataset eval/dataset.json --output-dir eval/results

# Check thresholds (exit code 1 on failure — blocks PRs)
mcp-llm-eval check --results eval/results/latest_summary.json --config .eval-gate.yml

# Compare against baseline (exit code 1 on regression)
mcp-llm-eval compare --baseline eval/results/main_summary.json --current eval/results/pr_summary.json

# Generate PR comment markdown
mcp-llm-eval comment --summary eval/results/latest_summary.json --config .eval-gate.yml --output pr-comment.md
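
The compare step is the interesting one for CI: it exits non-zero when a metric drops by more than the tolerance. The summary file format isn't documented above, so purely as an illustration, assuming hypothetical top-level avg_faithfulness and avg_relevance fields, the core check reduces to something like:

import json, sys

TOLERANCE = 0.05  # hypothetical value; the real tolerance is configurable

with open("eval/results/main_summary.json") as f:
    baseline = json.load(f)
with open("eval/results/pr_summary.json") as f:
    current = json.load(f)

failed = False
for metric in ("avg_faithfulness", "avg_relevance"):  # assumed field names
    drop = baseline[metric] - current[metric]
    if drop > TOLERANCE:
        print(f"REGRESSION: {metric} fell by {drop:.3f} (tolerance {TOLERANCE})")
        failed = True

sys.exit(1 if failed else 0)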

GitHub Actions

name: LLM Eval Gate

on:
  pull_request:

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install mcp-llm-eval anthropic openai google-genai
      - run: mcp-llm-eval run --config .eval-gate.yml --dataset eval/dataset.json --output-dir eval/results
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      - run: mcp-llm-eval check --results eval/results/latest_summary.json --config .eval-gate.yml
      - run: |
          mcp-llm-eval comment --summary eval/results/latest_summary.json --config .eval-gate.yml --output pr-comment.md
          gh pr comment ${{ github.event.number }} --body-file pr-comment.md
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}

Running benchmarks locally

mcp-llm-eval dogfoods its own evaluation engine: the bundled dataset (eval/dataset.json) runs against 5 models with 9 questions across 3 categories (factual, reasoning, summarization). The results feed into LLMShot as the Eval Gates benchmark.

Create a .env file in the project root with API keys for all providers:

ANTHROPIC_API_KEY=sk-ant-...
OPENAI_API_KEY=sk-...
GOOGLE_API_KEY=AIza...

Then run:

make benchmark        # Run eval against all 5 models
make benchmark-copy   # Copy results to llm-benchmarks repo

Results are written to eval/results/ (gitignored). The benchmark output feeds into LLMShot via the llm-benchmarks repo at text-generation/eval-gates-summary.json and text-generation/eval-gates-benchmark.json.


Troubleshooting

Server not appearing in Claude Desktop

  1. Ensure Claude Desktop is fully restarted (quit with Cmd+Q, not just close the window).
  2. Check your config JSON is valid — a trailing comma or typo will silently break it.
  3. Use absolute paths if uvx or mcp-llm-eval aren't found.

"Provider SDK not installed" errors

Provider SDKs are optional. Install the ones you need:

pip install anthropic openai google-genai

"Dataset file not found" errors

Use the full absolute path to your dataset file, not a relative path.

Judge scoring fails

The default judge uses OpenAI's gpt-4o-mini. Make sure the openai package is installed and OPENAI_API_KEY is set in your environment.
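
Under the hood the judge is a chat completion at temperature 0, so a quick way to verify your key and package work is a stripped-down call of the same shape (this is the general pattern, not the package's actual prompt or rubric):

from openai import OpenAI  # pip install openai; reads OPENAI_API_KEY from the env

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    temperature=0,
    messages=[
        {"role": "system",
         "content": "Score the answer's faithfulness to the context from 0 to 1. "
                    "Reply with only the number."},
        {"role": "user",
         "content": "Context: ...\nQuestion: ...\nAnswer: ..."},
    ],
)
print(resp.choices[0].message.content)  # e.g. "0.9"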

This is Claude Desktop only

Local MCP servers like this one connect to the Claude Desktop app (or another MCP client), not to claude.ai in your browser.


Development

# Clone and set up
git clone https://github.com/berkayildi/mcp-llm-eval.git
cd mcp-llm-eval
make setup

# Run tests
make test

# Build distribution
make build

# Run the server locally (stdio)
make start

# Clean everything
make clean

License

MIT © Berkay Yildirim
