
mcp-data-check

Evaluate MCP server accuracy against known questions and answers.

Installation

pip install mcp-data-check

Or install from source:

pip install -e .

Usage

Python API

Anthropic (default)

from mcp_data_check import run_evaluation

results = run_evaluation(
    questions_filepath="questions.csv",
    api_key="sk-ant-...",
    server_url="https://mcp.example.com/sse"
)

print(f"Pass rate: {results['summary']['pass_rate']:.1%}")
print(f"Passed: {results['summary']['passed']}/{results['summary']['total']}")

OpenAI

from mcp_data_check import run_evaluation

results = run_evaluation(
    questions_filepath="questions.csv",
    api_key="sk-...",
    server_url="https://mcp.example.com/sse",
    provider="openai",
    model="gpt-4o"
)

Command Line

Anthropic (default)

mcp-data-check https://mcp.example.com/sse -q questions.csv -k YOUR_API_KEY

OpenAI

mcp-data-check https://mcp.example.com/sse -q questions.csv -p openai -m gpt-4o -k YOUR_API_KEY

Options:

  • -q, --questions: Path to questions CSV file (required)
  • -p, --provider: LLM provider to use: anthropic (default) or openai
  • -k, --api-key: API key for the chosen provider (defaults to ANTHROPIC_API_KEY or OPENAI_API_KEY env var)
  • -o, --output: Output directory for results (default: ./results)
  • -m, --model: Model to use for evaluation (default: claude-sonnet-4-20250514; use e.g. gpt-4o for OpenAI)
  • -n, --server-name: Name for the MCP server (default: mcp-server)
  • -v, --verbose: Print detailed progress

Questions CSV Format

The questions CSV file must have three columns:

Column           Description
question         The question to ask the MCP server
expected_answer  The expected answer to compare against
eval_type        Evaluation method: numeric, string, or llm_judge

Example:

question,expected_answer,eval_type
How many grants were awarded in 2023?,1234,numeric
What organization received the most funding?,NIH,string
Explain the grant distribution,Most grants went to research institutions...,llm_judge
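If you prefer to build the questions file in code, here is a minimal sketch using Python's standard csv module (the column names are the required ones documented above; the rows are the example questions):

```python
import csv

rows = [
    {"question": "How many grants were awarded in 2023?",
     "expected_answer": "1234", "eval_type": "numeric"},
    {"question": "What organization received the most funding?",
     "expected_answer": "NIH", "eval_type": "string"},
]

# newline="" lets the csv module control line endings portably.
with open("questions.csv", "w", newline="") as f:
    writer = csv.DictWriter(
        f, fieldnames=["question", "expected_answer", "eval_type"]
    )
    writer.writeheader()
    writer.writerows(rows)
```

DictWriter raises a ValueError on rows with unexpected keys, which catches column-name typos before the evaluation run does.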

Evaluation Types

  • numeric: Extracts numbers from responses and compares with 5% tolerance
  • string: Checks if expected string appears in response (case-insensitive)
  • llm_judge: Uses the selected model to semantically evaluate if the response is correct
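To illustrate the numeric rule (a sketch of the idea only, not the package's actual implementation), a 5%-tolerance check could look like this:

```python
import re

def numeric_pass(response: str, expected: str, tolerance: float = 0.05) -> bool:
    """Return True if any number in the response is within `tolerance`
    (relative) of the expected value."""
    expected_val = float(expected.replace(",", ""))
    # Pull out candidate numbers, allowing thousands separators.
    for raw in re.findall(r"-?\d[\d,]*\.?\d*", response):
        try:
            val = float(raw.replace(",", ""))
        except ValueError:
            continue
        if abs(val - expected_val) <= abs(expected_val) * tolerance:
            return True
    return False
```

With an expected answer of 1234, a response mentioning "1,200" passes (within 5%) while "900" fails.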

Return Value

The run_evaluation function returns a dictionary:

{
    "summary": {
        "total": 10,
        "passed": 8,
        "failed": 2,
        "pass_rate": 0.8,
        "by_eval_type": {
            "numeric": {"total": 5, "passed": 4},
            "string": {"total": 3, "passed": 3},
            "llm_judge": {"total": 2, "passed": 1}
        }
    },
    "results": [
        {
            "question": "...",
            "expected_answer": "...",
            "eval_type": "numeric",
            "model_response": "...",
            "passed": True,
            "details": {...},
            "error": None,
            "time_to_answer": 2.35,
            "tools_called": [
                {
                    "tool_name": "get_grants",
                    "server_name": "mcp-server",
                    "input": {"year": 2023}
                }
            ]
        },
        ...
    ],
    "metadata": {
        "server_url": "https://mcp.example.com/sse",
        "model": "claude-sonnet-4-20250514",
        "provider": "anthropic",
        "timestamp": "20250127_143022"
    }
}
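The returned dictionary can be post-processed directly. For example, collecting the questions that did not pass (this helper is not part of the package; it relies only on the documented fields, and the sample report is made up for illustration):

```python
def failed_questions(report: dict) -> list[str]:
    """Return the questions whose evaluation did not pass."""
    return [r["question"] for r in report["results"] if not r["passed"]]

# Made-up report with the documented shape, for illustration only.
report = {
    "summary": {"total": 2, "passed": 1, "failed": 1, "pass_rate": 0.5},
    "results": [
        {"question": "How many grants were awarded in 2023?", "passed": True},
        {"question": "Explain the grant distribution", "passed": False},
    ],
}
```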

Result Fields

Each result in the results array contains:

Field            Description
question         The original question asked
expected_answer  The expected answer from the CSV
eval_type        Evaluation method used
model_response   The model's full response text
passed           Whether the evaluation passed
details          Additional evaluation details
error            Error message if the evaluation failed
time_to_answer   Response time in seconds for the MCP server call
tools_called     List of MCP tools invoked during the response

The tools_called array contains objects with:

  • tool_name: Name of the MCP tool called
  • server_name: Name of the MCP server that provided the tool
  • input: Parameters passed to the tool
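The tools_called entries are useful for seeing which server tools an evaluation actually exercised. A small helper (hypothetical, built only on the documented fields) that tallies calls per tool:

```python
from collections import Counter

def tool_usage(report: dict) -> Counter:
    """Count how often each MCP tool was invoked across all results."""
    counts: Counter = Counter()
    for r in report["results"]:
        for call in r.get("tools_called", []):
            counts[call["tool_name"]] += 1
    return counts

# Made-up report for illustration only.
report = {
    "results": [
        {"tools_called": [{"tool_name": "get_grants",
                           "server_name": "mcp-server",
                           "input": {"year": 2023}}]},
        {"tools_called": [{"tool_name": "get_grants",
                           "server_name": "mcp-server",
                           "input": {"year": 2022}}]},
    ],
}
```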

Requirements

  • Python 3.10+
  • API key for your chosen provider (Anthropic or OpenAI)

Download files

Download the file for your platform.

Source Distribution

mcp_data_check-0.5.0.tar.gz (70.2 kB)

Uploaded Source

Built Distribution


mcp_data_check-0.5.0-py3-none-any.whl (13.9 kB)

Uploaded Python 3

File details

Details for the file mcp_data_check-0.5.0.tar.gz.

File metadata

  • Download URL: mcp_data_check-0.5.0.tar.gz
  • Size: 70.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for mcp_data_check-0.5.0.tar.gz
Algorithm    Hash digest
SHA256       5e65ee2c5df2a67a72bccfe1f98e673863d23463122e939e5e0235b740889475
MD5          78643592ae7ab82686a8ce914d0944a1
BLAKE2b-256  a4568702dd052e8dc43de84cc529cfadf9bc306d062ee8a391e891588211629f


Provenance

The following attestation bundles were made for mcp_data_check-0.5.0.tar.gz:

Publisher: publish.yml on GSA-TTS/mcp-data-check

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file mcp_data_check-0.5.0-py3-none-any.whl.

File metadata

  • Download URL: mcp_data_check-0.5.0-py3-none-any.whl
  • Size: 13.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for mcp_data_check-0.5.0-py3-none-any.whl
Algorithm    Hash digest
SHA256       d0f550505a67d40318b30e3640aca7e4459407bfcfe54798fbefd03acdd4111c
MD5          2e31edff1a99ec1518a21b5bd816faf1
BLAKE2b-256  dcc416727c2f57e4d3722657f8d7a63069f395e5d2a1d04b800a79a01ecd13ff


Provenance

The following attestation bundles were made for mcp_data_check-0.5.0-py3-none-any.whl:

Publisher: publish.yml on GSA-TTS/mcp-data-check

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
