Evaluation suite for MCP AI agents — golden datasets, LLM-as-judge, security benchmarks

evalmcp

Part of the MCP AI Suite.

Features

  • Built-in benchmark suites -- 6 golden datasets: memory, security, reasoning, tool-use, HumanEval-style code, and MMLU-style knowledge
  • Pluggable judges -- exact match, contains, or LLM-as-judge for semantic evaluation
  • Library, CLI & MCP server -- use it from Python, the evalmcp CLI, the evalmcp-server MCP server (3 tools: list_suites, run_suite, evaluate), or the evalmcp-api FastAPI service
  • Regression detection -- compare successive runs to catch quality drops above a threshold
  • Standard metrics -- accuracy, precision, recall, F1, per-tag breakdowns
  • HTML dashboard export -- visual report of evaluation results
  • JSON and CSV export -- machine-readable result output
  • Model comparison -- side-by-side evaluation of different models or configurations
  • Persistent store -- track evaluation runs over time for trend analysis

Installation

pip install mcpaisuite-evalmcp          # runtime deps: click + mcp
pip install "mcpaisuite-evalmcp[api]"   # adds the FastAPI server (evalmcp-api)

Quick Start

import asyncio

from evalmcp import EvalPipeline, EvalSuite, EvalCase

suite = EvalSuite(name="my_tests", cases=[
    EvalCase(input="What is 2+2?", expected_output="4", tool="run_task", tags=["math"]),
    EvalCase(input="Capital of France?", expected_output="Paris", tool="run_task", tags=["geography"]),
])


async def main():
    # run_suite is awaited, so it has to be called from inside an async function.
    pipeline = EvalPipeline(judge="contains")
    results = await pipeline.run_suite(suite)
    summary = pipeline.summary(results)
    print(f"Pass rate: {summary['pass_rate']:.0%}")


asyncio.run(main())

CLI

# List available benchmark suites:
evalmcp list

# Run a benchmark suite:
evalmcp run memory_basic --judge contains

# CI mode with regression detection:
evalmcp run security --judge exact --ci --threshold 0.1

# Export HTML dashboard:
evalmcp run reasoning --html report.html

Configuration

EvalPipeline is configured programmatically via constructor parameters.

| Parameter | Description |
|---|---|
| kernel_pipeline | Optional MCP kernel pipeline for generating outputs |
| judge | Judge strategy: "exact", "contains", "llm", or a BaseJudge instance |
| llm_fn | Async callable (prompt) -> str, required when judge="llm" |
| store | Optional EvalStore for persistence and regression detection |
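The llm_fn contract from the table above (an async callable that takes a prompt and returns a string) can be illustrated with a stub. This is only a sketch: `stub_llm`, the EXPECTED/ACTUAL prompt layout, and the PASS/FAIL reply convention are assumptions made for illustration, since the prompt that evalmcp's LLM judge actually sends is not documented here.

```python
import asyncio


async def stub_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; has the shape (prompt) -> str."""
    # A real implementation would await an HTTP or SDK call at this point.
    await asyncio.sleep(0)
    # Assumed prompt layout: one "KEY: value" pair per line.
    fields = dict(line.split(": ", 1) for line in prompt.splitlines() if ": " in line)
    expected = fields.get("EXPECTED", "")
    actual = fields.get("ACTUAL", "")
    # Assumed reply convention: the literal strings PASS or FAIL.
    return "PASS" if expected and expected.lower() in actual.lower() else "FAIL"


async def demo() -> list[str]:
    prompts = [
        "EXPECTED: Paris\nACTUAL: The capital is Paris.",
        "EXPECTED: 4\nACTUAL: five",
    ]
    return [await stub_llm(p) for p in prompts]


verdicts = asyncio.run(demo())
print(verdicts)
```

Per the table, a callable of this shape would then be passed as EvalPipeline(judge="llm", llm_fn=stub_llm); in practice the stub body would be replaced by a call to an actual model.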

API Reference

EvalPipeline

Orchestrates evaluation of MCP agent outputs against expected results.

pipeline = EvalPipeline(judge="llm", llm_fn=my_llm, store=EvalStore())

# run_suite and run_case are awaited, so call them from inside an async function.
results = await pipeline.run_suite(suite)   # -> list[EvalResult]
result = await pipeline.run_case(case)      # -> EvalResult

summary = pipeline.summary(results)         # -> dict
metrics = pipeline.metrics(results)         # -> dict
regression = pipeline.detect_regression(suite_name, threshold=0.1)  # -> dict
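To make the threshold argument of detect_regression concrete, the following sketch shows one plausible reading of it: a regression is flagged when the pass rate drops by more than the threshold between the two most recent runs. The function `find_regression` and the keys of the dict it returns are hypothetical; the structure of the dict that evalmcp returns is not documented here.

```python
def find_regression(pass_rates: list[float], threshold: float = 0.1) -> dict:
    """Compare the two most recent pass rates (hypothetical helper, not evalmcp code)."""
    if len(pass_rates) < 2:
        # Not enough history to compare anything.
        return {"regression": False, "drop": 0.0, "previous": None, "current": None}
    previous, current = pass_rates[-2], pass_rates[-1]
    drop = previous - current
    return {
        "regression": drop > threshold,  # only drops larger than the threshold count
        "drop": round(drop, 4),
        "previous": previous,
        "current": current,
    }


# Pass rates of three successive runs, oldest first (illustrative numbers).
history = [0.90, 0.92, 0.75]
report = find_regression(history, threshold=0.1)
print(report)
```

With these numbers the pass rate falls from 0.92 to 0.75, a drop of 0.17, which exceeds the 0.1 threshold. A drop from 0.90 to 0.85 would not be flagged.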

Utility Functions

from evalmcp import export_json, export_csv, compute_metrics, generate_dashboard

export_json(results, summary, "results.json")
export_csv(results, "results.csv")
metrics = compute_metrics(results)
generate_dashboard(results, summary, metrics, output_path="report.html")
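As a reminder of what the standard metrics listed under Features mean, the sketch below computes accuracy, precision, recall, and F1 from (predicted, actual) boolean pairs, plus a per-tag pass rate. It is a generic illustration: `binary_metrics`, `per_tag_pass_rate`, and the dict-based result records are assumptions, and how compute_metrics defines positives and negatives internally is not stated in this document.

```python
from collections import defaultdict


def binary_metrics(pairs: list[tuple[bool, bool]]) -> dict[str, float]:
    """Accuracy, precision, recall and F1 from (predicted, actual) pairs."""
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    tn = sum(1 for p, a in pairs if not p and not a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def per_tag_pass_rate(records: list[dict]) -> dict[str, float]:
    """Fraction of passed cases for each tag; a case counts once per tag it carries."""
    totals: dict[str, int] = defaultdict(int)
    passed: dict[str, int] = defaultdict(int)
    for record in records:
        for tag in record["tags"]:
            totals[tag] += 1
            passed[tag] += int(record["passed"])
    return {tag: passed[tag] / totals[tag] for tag in totals}


pairs = [(True, True), (True, False), (False, True), (True, True)]
scores = binary_metrics(pairs)

records = [
    {"tags": ["math"], "passed": True},
    {"tags": ["math"], "passed": False},
    {"tags": ["geography"], "passed": True},
]
by_tag = per_tag_pass_rate(records)
print(scores, by_tag)
```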

Architecture

EvalPipeline iterates over EvalCases in a suite, optionally generating outputs via a kernel_pipeline, then scoring each result through a pluggable judge (ExactMatchJudge, ContainsJudge, or LLMJudge). Results are aggregated into summary statistics with per-tag breakdowns. EvalStore persists run history to enable regression detection by comparing the two most recent runs.
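The flow described above (iterate over cases, generate an output, score it with an interchangeable judge) can be sketched in a few lines. Everything here is a simplified stand-in rather than evalmcp's implementation: `Case`, `ExactJudge`, `SubstringJudge`, `run_cases`, and `fake_agent` are hypothetical names, and details such as whitespace stripping or case-insensitive matching are assumptions.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Awaitable, Callable, Protocol


@dataclass
class Case:
    input: str
    expected_output: str
    tags: list[str] = field(default_factory=list)


class Judge(Protocol):
    def score(self, expected: str, actual: str) -> bool: ...


class ExactJudge:
    def score(self, expected: str, actual: str) -> bool:
        # Passes only when the two strings are identical after trimming.
        return expected.strip() == actual.strip()


class SubstringJudge:
    def score(self, expected: str, actual: str) -> bool:
        # Passes when the expected text appears anywhere in the output.
        return expected.lower() in actual.lower()


async def run_cases(
    cases: list[Case],
    generate: Callable[[str], Awaitable[str]],
    judge: Judge,
) -> list[dict]:
    results = []
    for case in cases:
        actual = await generate(case.input)  # stands in for the kernel pipeline
        passed = judge.score(case.expected_output, actual)
        results.append({"input": case.input, "passed": passed, "tags": case.tags})
    return results


async def fake_agent(prompt: str) -> str:
    answers = {"What is 2+2?": "The answer is 4", "Capital of France?": "Paris"}
    return answers.get(prompt, "")


cases = [
    Case("What is 2+2?", "4", ["math"]),
    Case("Capital of France?", "Paris", ["geography"]),
]
exact = asyncio.run(run_cases(cases, fake_agent, ExactJudge()))
loose = asyncio.run(run_cases(cases, fake_agent, SubstringJudge()))
print([r["passed"] for r in exact])  # [False, True]
print([r["passed"] for r in loose])  # [True, True]
```

Swapping the judge changes the verdict for the first case without touching the loop, which is the point of keeping the judge pluggable.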

Testing

pip install -e ".[dev]"
pytest tests/ -v

License

AGPL-3.0 — see LICENSE.

Open source for individuals and open-source projects. For commercial use in closed-source products, a commercial license is available — contact gaeldev@gmail.com.
