
Python package for evaluating MCP (Model Context Protocol) server implementations using LLM-based scoring


PyMCPEvals

⚠️ Still Under Development - APIs may change. Use with caution in production.

Server-focused evaluation framework for MCP (Model Context Protocol) servers.

🚀 Test your MCP server capabilities, not LLM conversation patterns.

"Are my MCP server's tools working correctly and being used as expected?"

PyMCPEvals separates what you can control (server) from what you cannot (LLM behavior):

✅ What You Control (We Test This)

  • Tool implementation correctness
  • Tool parameter validation
  • Error handling and recovery
  • Tool result formatting
  • Multi-turn state management

โŒ What You Cannot Control (We Ignore This)

  • LLM conversation patterns
  • How LLMs choose to use tools
  • LLM response formatting
  • Whether LLMs provide intermediate responses

Key Pain Points Solved

  • 🚫 Manual Tool Testing: Automated assertions verify exact tool calls
  • ❓ Multi-step Failures: Track tool chaining across conversation turns
  • 🐛 Silent Tool Errors: Instant feedback when expected tools aren't called
  • 📊 CI/CD Integration: JUnit XML output for automated testing pipelines

Quick Start

pip install pymcpevals
pymcpevals init                    # Create template config
pymcpevals run evals.yaml         # Run evaluations

Example Configuration

model:
  provider: openai
  name: gpt-4

server:
  command: ["python", "my_server.py"]

evaluations:
  - name: "weather_check"
    prompt: "What's the weather in Boston?"
    expected_tools: ["get_weather"]  # ✅ Validates tool usage
    expected_result: "Should call weather API and return conditions"
    threshold: 3.5
    
  - name: "multi_step"
    turns:
      - role: "user"
        content: "What's the weather in London?"
        expected_tools: ["get_weather"]
      - role: "user"  
        content: "And in Paris?"
        expected_tools: ["get_weather"]
    expected_result: "Should provide weather for both cities"
    threshold: 4.0

Output: Pass/fail status, tool validation, execution metrics, and server-focused scoring.

How It Works

  1. Connect to your MCP server via FastMCP
  2. Execute prompts and track tool calls
  3. Validate expected tools are called (instant feedback)
  4. Evaluate server performance (ignores LLM style)
  5. Report results with actionable insights
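The five steps above can be sketched in plain Python. This is an illustrative outline only, not the PyMCPEvals internal API: `call_llm` and `judge` are hypothetical callables standing in for the LLM run and the LLM-based scorer.

```python
# Illustrative sketch of the evaluation loop described above.
# call_llm and judge are placeholder callables, not PyMCPEvals APIs.

def run_eval(case, call_llm, judge):
    # Steps 1-2: run the prompt against the server and record tool calls.
    response, tools_called = call_llm(case["prompt"])
    # Step 3: validate expected tools before any (expensive) LLM judging.
    missing = [t for t in case["expected_tools"] if t not in tools_called]
    if missing:
        return {"passed": False, "reason": f"missing tools: {missing}"}
    # Step 4: score server performance against the expected result.
    score = judge(response, case["expected_result"])
    # Step 5: report pass/fail against the configured threshold.
    return {"passed": score >= case["threshold"], "score": score}

# Stub run with fake LLM and judge:
case = {"prompt": "What's the weather in Boston?",
        "expected_tools": ["get_weather"],
        "expected_result": "conditions", "threshold": 3.5}
result = run_eval(case,
                  call_llm=lambda p: ("Sunny, 22C", ["get_weather"]),
                  judge=lambda r, e: 4.2)
print(result["passed"])  # True
```

Note how step 3 short-circuits: a missing tool call fails instantly, with no LLM judging cost.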

What Makes This Different

Precise Tool Assertions: Unlike traditional evaluations that judge LLM responses, PyMCPEvals validates:

  • ✅ Exact tool calls: assert_tools_called(result, ["add", "multiply"])
  • ✅ Tool execution success: assert_no_tool_errors(result)
  • ✅ Multi-turn trajectories: Test tool chaining across conversation steps
  • ✅ Instant failure detection: No expensive LLM evaluation for obvious failures

Usage

CLI

# Basic usage
pymcpevals run evals.yaml

# Override server/model
pymcpevals run evals.yaml --server "node server.js" --model gpt-4

# Different outputs
pymcpevals run evals.yaml --output table    # Simple table
pymcpevals run evals.yaml --output json     # Full JSON
pymcpevals run evals.yaml --output junit    # CI/CD format
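The JUnit output can feed any JUnit-aware CI reporter. A minimal GitHub Actions step might look like the sketch below; the step name, file path, and the assumption that --output junit prints the XML to stdout are illustrative, not prescribed by PyMCPEvals:

```yaml
# Illustrative CI step; paths and redirection are assumptions.
- name: Run MCP evals
  run: pymcpevals run evals.yaml --output junit > results.xml
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```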

Pytest Integration

import pytest

from pymcpevals import (
    assert_tools_called,
    assert_evaluation_passed,
    assert_min_score,
    assert_no_tool_errors,
    ConversationTurn,
)

# Simple marker-based test
@pytest.mark.mcp_eval(
    prompt="What is 15 + 27?",
    expected_tools=["add"],
    min_score=4.0
)
async def test_basic_addition(mcp_result):
    assert_evaluation_passed(mcp_result)
    assert_tools_called(mcp_result, ["add"])
    assert "42" in mcp_result.server_response

# Multi-turn trajectory testing
async def test_math_sequence(mcp_evaluator):
    turns = [
        ConversationTurn(role="user", content="What is 10 + 5?", expected_tools=["add"]),
        ConversationTurn(role="user", content="Now multiply by 2", expected_tools=["multiply"])
    ]
    result = await mcp_evaluator.evaluate_trajectory(turns, min_score=4.0)
    
    # Rich assertions
    assert_evaluation_passed(result)
    assert_tools_called(result, ["add", "multiply"])
    assert_no_tool_errors(result)
    assert_min_score(result, 4.0, dimension="accuracy")
    assert "30" in str(result.conversation_history)

# Run with: pytest -m mcp_eval
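If pytest warns about an unknown mcp_eval marker (the PyMCPEvals plugin may already register it for you; this is only a fallback sketch), it can be registered in conftest.py using pytest's standard hook:

```python
# conftest.py -- fallback marker registration; the pymcpevals pytest
# plugin may already do this automatically.
def pytest_configure(config):
    config.addinivalue_line(
        "markers", "mcp_eval(**kwargs): mark a test as a PyMCPEvals evaluation"
    )
```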

Examples

Check out the examples/ directory for:

  • calculator_server.py - Simple MCP server for testing
  • local_server_basic.yaml - Basic evaluation configuration examples
  • trajectory_evaluation.yaml - Multi-turn conversation examples
  • test_simple_plugin_example.py - Pytest integration examples

Run the examples:

# Test with the example calculator server
pymcpevals run examples/local_server_basic.yaml

# Run pytest examples
cd examples && pytest test_simple_plugin_example.py

Installation

pip install pymcpevals

Environment Setup

export OPENAI_API_KEY="sk-..."        # or ANTHROPIC_API_KEY
export GEMINI_API_KEY="..."           # for Gemini models

Output Formats

Table View (default)

┌──────────────────────────────────────────┬────────┬─────┬──────┬─────┬──────┬──────┬──────┬───────┐
│ Name                                     │ Status │ Acc │ Comp │ Rel │ Clar │ Reas │ Avg  │ Tools │
├──────────────────────────────────────────┼────────┼─────┼──────┼─────┼──────┼──────┼──────┼───────┤
│ What is 15 + 27?                         │ PASS   │ 4.5 │ 4.2  │ 5.0 │ 4.8  │ 4.1  │ 4.52 │ ✓     │
│ What happens if I divide 10 by 0?        │ PASS   │ 4.0 │ 4.1  │ 4.5 │ 4.2  │ 3.8  │ 4.12 │ ✓     │
│ Multi-turn test                          │ PASS   │ 4.2 │ 4.5  │ 4.8 │ 4.1  │ 4.3  │ 4.38 │ ✓     │
└──────────────────────────────────────────┴────────┴─────┴──────┴─────┴──────┴──────┴──────┴───────┘

Summary: 3/3 passed (100.0%) - Average: 4.34/5.0

Detailed View (--output detailed)

┌─────────────────────────┬────────┬──────┬────────────────────┬────────────────────┬────────┬────────┬──────────────────────────────┐
│ Test                    │ Status │ Score│ Expected Tools     │ Tools Used         │ Time   │ Errors │ Notes                        │
├─────────────────────────┼────────┼──────┼────────────────────┼────────────────────┼────────┼────────┼──────────────────────────────┤
│ What is 15 + 27?        │ PASS   │ 4.5  │ add                │ add                │ 12ms   │ 0      │ OK                           │
│ What happens if I div...│ PASS   │ 4.1  │ divide             │ divide             │ 8ms    │ 1      │ Handled error correctly      │
│ Multi-turn test         │ PASS   │ 4.4  │ add, multiply      │ add, multiply      │ 23ms   │ 0      │ Tool chaining successful     │
└─────────────────────────┴────────┴──────┴────────────────────┴────────────────────┴────────┴────────┴──────────────────────────────┘

🔧 Tool Execution Details:
• add: Called 2 times, avg 10ms, 100% success rate
• divide: Called 1 time, 8ms, handled error gracefully
• multiply: Called 1 time, 13ms, 100% success rate

Summary: 3/3 passed (100.0%) - Average: 4.33/5.0

Key Benefits

For MCP Server Developers

  • 🎯 Server-Focused Testing: Test your server's capabilities, not LLM behavior
  • ✅ Instant Tool Validation: Get immediate feedback if the wrong tools are called (no LLM needed)
  • 🔧 Tool Execution Insights: See success rates, timing, and error handling
  • 🔄 Multi-turn Validation: Test tool chaining and state management
  • 📊 Capability Scoring: An LLM judges server tool performance, ignoring conversation style
  • 🛠️ Easy Integration: Works with any MCP server via FastMCP

For Development Teams

  • 🚀 CI/CD Integration: JUnit XML output for automated testing pipelines
  • 📈 Progress Tracking: Monitor improvement over time with consistent scoring
  • 🔄 Regression Testing: Ensure new changes don't break existing functionality
  • ⚖️ Model Comparison: Test across different LLM providers

Acknowledgments

๐Ÿ™ Huge kudos to mcp-evals - This Python package was heavily inspired by the excellent Node.js implementation by @mclenhard.

If you're working in a Node.js environment, definitely check out the original mcp-evals project, which also includes GitHub Action integration and monitoring capabilities.

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Ensure all tests pass
  5. Submit a pull request

License

MIT - see LICENSE file.

