Evaluation suite for MCP AI agents — golden datasets, LLM-as-judge, security benchmarks
evalmcp
Part of the MCP AI Suite.
Features
- Built-in benchmark suites -- 6 golden datasets: memory, security, reasoning, tool-use, HumanEval-style code, and MMLU-style knowledge
- Pluggable judges -- exact match, contains, or LLM-as-judge for semantic evaluation
- Library, CLI & MCP server -- use it from Python, the evalmcp CLI, the evalmcp-server MCP server (3 tools: list_suites, run_suite, evaluate), or the evalmcp-api FastAPI service
- Regression detection -- compare successive runs to catch quality drops above a threshold
- Standard metrics -- accuracy, precision, recall, F1, per-tag breakdowns
- HTML dashboard export -- visual report of evaluation results
- JSON and CSV export -- machine-readable result output
- Model comparison -- side-by-side evaluation of different models or configurations
- Persistent store -- track evaluation runs over time for trend analysis
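For readers less familiar with the standard metrics named in the list above, here is a minimal sketch of how accuracy, precision, recall, and F1 are conventionally derived from binary outcomes. The function name `binary_metrics` and the `(predicted, actual)` pair input are assumptions made for illustration. This is not evalmcp's API, and the library may define its positive class differently.

```python
def binary_metrics(pairs):
    """Compute standard metrics from (predicted, actual) boolean pairs.

    Hypothetical helper for illustration only; not part of evalmcp.
    """
    tp = sum(1 for p, a in pairs if p and a)          # true positives
    fp = sum(1 for p, a in pairs if p and not a)      # false positives
    fn = sum(1 for p, a in pairs if not p and a)      # false negatives
    tn = sum(1 for p, a in pairs if not p and not a)  # true negatives
    total = tp + fp + fn + tn

    # Guard every division so empty or one-sided inputs return 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / total if total else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


pairs = [(True, True), (True, False), (False, True), (True, True)]
metrics = binary_metrics(pairs)
print(metrics)
```

F1 is the harmonic mean of precision and recall, so it stays low unless both are high.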
Installation
pip install mcpaisuite-evalmcp # runtime deps: click + mcp
pip install "mcpaisuite-evalmcp[api]" # adds the FastAPI server (evalmcp-api)
Quick Start
import asyncio

from evalmcp import EvalPipeline, EvalSuite, EvalCase

suite = EvalSuite(name="my_tests", cases=[
    EvalCase(input="What is 2+2?", expected_output="4", tool="run_task", tags=["math"]),
    EvalCase(input="Capital of France?", expected_output="Paris", tool="run_task", tags=["geography"]),
])

async def main():
    pipeline = EvalPipeline(judge="contains")
    # run_suite is awaited, so it has to run inside an event loop
    results = await pipeline.run_suite(suite)
    summary = pipeline.summary(results)
    print(f"Pass rate: {summary['pass_rate']:.0%}")

asyncio.run(main())
CLI
# List available benchmark suites:
evalmcp list
# Run a benchmark suite:
evalmcp run memory_basic --judge contains
# CI mode with regression detection:
evalmcp run security --judge exact --ci --threshold 0.1
# Export HTML dashboard:
evalmcp run reasoning --html report.html
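The idea behind a regression threshold can be sketched in a few lines of plain Python. This sketch assumes the threshold is an absolute drop in pass rate between two successive runs; the README does not say whether evalmcp measures the drop in absolute or relative terms, and `detect_drop` is a hypothetical helper, not the library's implementation.

```python
def detect_drop(previous_pass_rate, current_pass_rate, threshold=0.1):
    """Flag a regression when the pass rate falls by more than `threshold`.

    Assumption: the drop is measured as an absolute difference.
    """
    drop = previous_pass_rate - current_pass_rate
    return {"drop": drop, "regressed": drop > threshold}


# Hypothetical pass rates from three successive runs, oldest first.
history = [0.92, 0.90, 0.75]

# Compare only the two most recent runs.
report = detect_drop(history[-2], history[-1], threshold=0.1)
print(report["regressed"])
```

In a CI job, a check like this would typically translate into a non-zero exit code when `regressed` is true.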
Configuration
EvalPipeline is configured programmatically via constructor parameters.
| Parameter | Description |
|---|---|
| kernel_pipeline | Optional MCP kernel pipeline for generating outputs |
| judge | Judge strategy: "exact", "contains", "llm", or a BaseJudge instance |
| llm_fn | Async callable (prompt) -> str, required when judge="llm" |
| store | Optional EvalStore for persistence and regression detection |
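To make the three judge strategies and the `(prompt) -> str` shape of `llm_fn` concrete, here is a self-contained sketch that does not import evalmcp. All function names below (`exact_judge`, `contains_judge`, `llm_judge`, `stub_llm`) are hypothetical, and details such as whitespace stripping, case-insensitive matching, and the judge prompt wording are assumptions rather than the library's documented behavior.

```python
import asyncio


def exact_judge(actual: str, expected: str) -> bool:
    """Pass only when the two strings match (ignoring outer whitespace)."""
    return actual.strip() == expected.strip()


def contains_judge(actual: str, expected: str) -> bool:
    """Pass when the expected text appears anywhere in the output."""
    return expected.lower() in actual.lower()


async def stub_llm(prompt: str) -> str:
    """Stand-in for a real model call; same shape as an llm_fn callable."""
    answer_part = prompt.split("Answer:")[-1]
    return "YES" if "Paris" in answer_part else "NO"


async def llm_judge(actual: str, expected: str, llm_fn) -> bool:
    """Ask the model whether the answer matches, then parse YES/NO."""
    prompt = (
        f"Expected: {expected}\n"
        f"Answer: {actual}\n"
        "Does the answer match? Reply YES or NO."
    )
    reply = await llm_fn(prompt)
    return reply.strip().upper().startswith("YES")


print(exact_judge("The answer is 4", "4"))
print(contains_judge("The answer is 4", "4"))
print(asyncio.run(llm_judge("The capital is Paris.", "Paris", stub_llm)))
```

Exact and contains matching are cheap and deterministic; an LLM judge trades that for tolerance of paraphrased answers, which is why it needs an async model callable.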
API Reference
EvalPipeline
Orchestrates evaluation of MCP agent outputs against expected results.
pipeline = EvalPipeline(judge="llm", llm_fn=my_llm, store=EvalStore())
results = await pipeline.run_suite(suite)  # -> list[EvalResult]
result = await pipeline.run_case(case)  # -> EvalResult
summary = pipeline.summary(results)  # -> dict
metrics = pipeline.metrics(results)  # -> dict
regression = pipeline.detect_regression(suite_name, threshold=0.1)  # -> dict
Utility Functions
from evalmcp import export_json, export_csv, compute_metrics, generate_dashboard
export_json(results, summary, "results.json")
export_csv(results, "results.csv")
metrics = compute_metrics(results)
generate_dashboard(results, summary, metrics, output_path="report.html")
Architecture
EvalPipeline iterates over EvalCases in a suite, optionally generating outputs via a kernel_pipeline, then scoring each result through a pluggable judge (ExactMatchJudge, ContainsJudge, or LLMJudge). Results are aggregated into summary statistics with per-tag breakdowns. EvalStore persists run history to enable regression detection by comparing the two most recent runs.
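The flow described above (iterate over cases, generate an output, score it with a judge, aggregate with per-tag breakdowns) can be sketched without evalmcp installed. Everything here is a hypothetical stand-in: `Case` mimics the fields used in the Quick Start, `fake_agent` plays the role of a kernel pipeline, and the summary keys are assumed for illustration.

```python
import asyncio
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Case:
    """Hypothetical stand-in for EvalCase."""
    input: str
    expected_output: str
    tags: list = field(default_factory=list)


async def fake_agent(prompt: str) -> str:
    """Hypothetical output generator with canned answers."""
    answers = {"What is 2+2?": "The answer is 4",
               "Capital of France?": "Lyon"}
    return answers.get(prompt, "")


async def run_cases(cases, generate, judge):
    """Generate an output per case and score it with the judge."""
    results = []
    for case in cases:
        actual = await generate(case.input)
        results.append((case, judge(actual, case.expected_output)))
    return results


def summarize(results):
    """Aggregate the overall pass rate and a per-tag pass rate."""
    by_tag = defaultdict(lambda: [0, 0])  # tag -> [passed, total]
    for case, passed in results:
        for tag in case.tags:
            by_tag[tag][0] += int(passed)
            by_tag[tag][1] += 1
    passed_total = sum(1 for _, passed in results if passed)
    return {
        "pass_rate": passed_total / len(results) if results else 0.0,
        "by_tag": {tag: p / t for tag, (p, t) in by_tag.items()},
    }


cases = [
    Case("What is 2+2?", "4", ["math"]),
    Case("Capital of France?", "Paris", ["geography"]),
]
# A simple "contains" judge: pass when the expected text is in the output.
results = asyncio.run(run_cases(cases, fake_agent, lambda a, e: e in a))
summary = summarize(results)
print(summary)
```

Keeping the judge as a plain callable is what makes it swappable: the loop does not need to know whether scoring is a string comparison or a model call.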
Testing
pip install -e ".[dev]"
pytest tests/ -v
License
AGPL-3.0 — see LICENSE.
Open source for individuals and open-source projects. For commercial use in closed-source products, a commercial license is available — contact gaeldev@gmail.com.