Evaluation suite for MCP AI agents — golden datasets, LLM-as-judge, security benchmarks

These details have been verified by PyPI

Maintainers

gashel01

Project description

evalmcp

Evaluation suite for MCP AI agents -- golden datasets, LLM-as-judge, security benchmarks

Part of the MCP AI Suite.

Features

Built-in benchmark suites -- 6 golden datasets: memory, security, reasoning, tool-use, HumanEval-style code, and MMLU-style knowledge
Pluggable judges -- exact match, contains, or LLM-as-judge for semantic evaluation
Library, CLI & MCP server -- use it from Python, the evalmcp CLI, the evalmcp-server MCP server (3 tools: list_suites, run_suite, evaluate), or the evalmcp-api FastAPI service
Regression detection -- compare successive runs to catch quality drops above a threshold
Standard metrics -- accuracy, precision, recall, F1, per-tag breakdowns
HTML dashboard export -- visual report of evaluation results
JSON and CSV export -- machine-readable result output
Model comparison -- side-by-side evaluation of different models or configurations
Persistent store -- track evaluation runs over time for trend analysis
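
The metrics named in the list above can be illustrated with a small standalone sketch. It does not use evalmcp's own code: `CaseOutcome`, `pass_rate_by_tag`, and `f1_score` are hypothetical names chosen for this example, and how evalmcp derives true/false positives from its results is not stated here, so the counts are passed in directly.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class CaseOutcome:
    """Hypothetical record of one evaluated case (not evalmcp's EvalResult)."""
    passed: bool
    tags: list[str] = field(default_factory=list)


def pass_rate_by_tag(outcomes: list[CaseOutcome]) -> dict[str, float]:
    """Return the fraction of passing cases for each tag."""
    totals: dict[str, int] = defaultdict(int)
    passes: dict[str, int] = defaultdict(int)
    for outcome in outcomes:
        for tag in outcome.tags:
            totals[tag] += 1
            if outcome.passed:
                passes[tag] += 1
    return {tag: passes[tag] / totals[tag] for tag in totals}


def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Harmonic mean of precision and recall; 0.0 when either is undefined."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    if predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

A case with several tags counts toward each of them, so per-tag totals may sum to more than the number of cases.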

Installation

pip install mcpaisuite-evalmcp          # runtime deps: click + mcp
pip install "mcpaisuite-evalmcp[api]"   # adds the FastAPI server (evalmcp-api)

Quick Start

import asyncio

from evalmcp import EvalPipeline, EvalSuite, EvalCase

suite = EvalSuite(name="my_tests", cases=[
    EvalCase(input="What is 2+2?", expected_output="4", tool="run_task", tags=["math"]),
    EvalCase(input="Capital of France?", expected_output="Paris", tool="run_task", tags=["geography"]),
])

async def main() -> None:
    # run_suite is awaited, so it has to be called from inside a coroutine
    pipeline = EvalPipeline(judge="contains")
    results = await pipeline.run_suite(suite)
    summary = pipeline.summary(results)
    print(f"Pass rate: {summary['pass_rate']:.0%}")

asyncio.run(main())

CLI

# List available benchmark suites:
evalmcp list

# Run a benchmark suite:
evalmcp run memory_basic --judge contains

# CI mode with regression detection:
evalmcp run security --judge exact --ci --threshold 0.1

# Export HTML dashboard:
evalmcp run reasoning --html report.html

Configuration

EvalPipeline is configured programmatically via constructor parameters.

Parameter	Description
`kernel_pipeline`	Optional MCP kernel pipeline for generating outputs
`judge`	Judge strategy: `"exact"`, `"contains"`, `"llm"`, or a `BaseJudge` instance
`llm_fn`	Async callable `(prompt) -> str`, required when `judge="llm"`
`store`	Optional `EvalStore` for persistence and regression detection
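
The table describes `llm_fn` only as an async callable `(prompt) -> str`. A minimal stand-in with that shape might look like the sketch below. The function name `stub_llm` and the `PASS`/`FAIL` reply format are assumptions for illustration; the verdict format the LLM judge actually expects is not documented here.

```python
import asyncio


async def stub_llm(prompt: str) -> str:
    """Stand-in for a real model call (hypothetical reply format)."""
    # Yield control once, as a real network request would.
    await asyncio.sleep(0)
    # Assumption: reply PASS when the prompt mentions the expected answer.
    return "PASS" if "Paris" in prompt else "FAIL"
```

A callable like this could presumably be passed as `EvalPipeline(judge="llm", llm_fn=stub_llm)` to exercise the pipeline without a live model.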

API Reference

EvalPipeline

Orchestrates evaluation of MCP agent outputs against expected results.

# The await calls below must run inside an async function.
pipeline = EvalPipeline(judge="llm", llm_fn=my_llm, store=EvalStore())
results = await pipeline.run_suite(suite)                            # -> list[EvalResult]
result = await pipeline.run_case(case)                               # -> EvalResult
summary = pipeline.summary(results)                                  # -> dict
metrics = pipeline.metrics(results)                                  # -> dict
regression = pipeline.detect_regression(suite_name, threshold=0.1)   # -> dict
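
The idea behind `detect_regression` can be sketched without the library. The function below is a simplified illustration, not evalmcp's implementation: the name `detect_regression_sketch`, the list-of-pass-rates input, and the keys of the returned dict are all assumptions.

```python
def detect_regression_sketch(pass_rates: list[float], threshold: float = 0.1) -> dict:
    """Compare the two most recent pass rates (list is ordered oldest first)."""
    if len(pass_rates) < 2:
        # Nothing to compare against yet.
        return {"regressed": False, "drop": 0.0}
    previous, latest = pass_rates[-2], pass_rates[-1]
    drop = previous - latest
    # Only a drop strictly above the threshold counts as a regression.
    return {"regressed": drop > threshold, "drop": drop}
```

With `threshold=0.1`, a fall from 0.90 to 0.75 would be flagged, while a fall from 0.90 to 0.85 would not.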

Utility Functions

from evalmcp import export_json, export_csv, compute_metrics, generate_dashboard

export_json(results, summary, "results.json")
export_csv(results, "results.csv")
metrics = compute_metrics(results)
generate_dashboard(results, summary, metrics, output_path="report.html")

Architecture

EvalPipeline iterates over EvalCases in a suite, optionally generating outputs via a kernel_pipeline, then scoring each result through a pluggable judge (ExactMatchJudge, ContainsJudge, or LLMJudge). Results are aggregated into summary statistics with per-tag breakdowns. EvalStore persists run history to enable regression detection by comparing the two most recent runs.
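
The pluggable-judge design described above can be sketched in a few lines. This is an illustration only: the `Judge` protocol, the `score` method name, and the whitespace and case handling are assumptions, and the real `BaseJudge` interface may differ.

```python
from typing import Protocol


class Judge(Protocol):
    """Hypothetical judge interface: decide whether an output is acceptable."""

    def score(self, output: str, expected: str) -> bool: ...


class ExactJudgeSketch:
    def score(self, output: str, expected: str) -> bool:
        # Assumption: surrounding whitespace is ignored.
        return output.strip() == expected.strip()


class ContainsJudgeSketch:
    def score(self, output: str, expected: str) -> bool:
        # Assumption: matching is case-insensitive.
        return expected.lower() in output.lower()


def pass_rate(cases: list[tuple[str, str]], judge: Judge) -> float:
    """Score (output, expected) pairs with the given judge and return the pass rate."""
    if not cases:
        return 0.0
    passed = sum(judge.score(output, expected) for output, expected in cases)
    return passed / len(cases)
```

Because the scoring loop depends only on the `score` method, swapping strategies changes the result without touching the loop: an output of "The answer is 4" fails an exact match against "4" but passes a contains check.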

Testing

pip install -e ".[dev]"
pytest tests/ -v

License

AGPL-3.0 — see LICENSE.

evalmcp is open source for individuals and open-source projects. For commercial use in closed-source products, a commercial license is available; contact gaeldev@gmail.com.

Project details


This version: 1.0.3 (Jun 16, 2026)

Download files


Source Distribution

mcpaisuite_evalmcp-1.0.3.tar.gz (40.1 kB)

Uploaded Jun 16, 2026 Source

Built Distribution


mcpaisuite_evalmcp-1.0.3-py3-none-any.whl (44.7 kB)

Uploaded Jun 16, 2026 Python 3

File details

Details for the file mcpaisuite_evalmcp-1.0.3.tar.gz.

File metadata

Download URL: mcpaisuite_evalmcp-1.0.3.tar.gz
Upload date: Jun 16, 2026
Size: 40.1 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for mcpaisuite_evalmcp-1.0.3.tar.gz
Algorithm	Hash digest
SHA256	`dca68483f50c57b23a3466fb978e2191d15f84e55316998ca13cd090159258e7`
MD5	`3cd7b4014bc3d17870b78c0a8c765b8b`
BLAKE2b-256	`d65f2b888898bfb05c236867db42eea264c77143b4815c9669ccdd6e496ffe0c`
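
A downloaded file can be checked against the SHA256 digest listed above using only the standard library. This is a generic sketch; the helper name `sha256_matches` and the local file path in the usage note are assumptions.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_matches(path: Path, expected_hex: str, chunk_size: int = 65536) -> bool:
    """Return True if the file's SHA256 digest equals the expected hex string."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return hmac.compare_digest(digest.hexdigest(), expected_hex.lower())
```

For example, `sha256_matches(Path("mcpaisuite_evalmcp-1.0.3.tar.gz"), "dca68483f50c57b23a3466fb978e2191d15f84e55316998ca13cd090159258e7")` should return True for an unmodified copy of the source distribution saved in the current directory.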


Provenance

The following attestation bundles were made for mcpaisuite_evalmcp-1.0.3.tar.gz:

Publisher: release.yml on gashel01/evalmcp

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: mcpaisuite_evalmcp-1.0.3.tar.gz
- Subject digest: dca68483f50c57b23a3466fb978e2191d15f84e55316998ca13cd090159258e7
- Sigstore transparency entry: 1841152497
- Sigstore integration time: Jun 16, 2026
Source repository:
- Permalink: gashel01/evalmcp@032d46a90135269ab6db2f40bae52c333536b9b4
- Branch / Tag: refs/tags/v1.0.3
- Owner: https://github.com/gashel01
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@032d46a90135269ab6db2f40bae52c333536b9b4
- Trigger Event: push

File details

Details for the file mcpaisuite_evalmcp-1.0.3-py3-none-any.whl.

File metadata

Download URL: mcpaisuite_evalmcp-1.0.3-py3-none-any.whl
Upload date: Jun 16, 2026
Size: 44.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for mcpaisuite_evalmcp-1.0.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`eda1ddc1992d09e97dcaeab03049f2098d125f0e9b2484e2cf6622a663899b7a`
MD5	`b38f0af0f6b4062e3ce0e23d4d0c1f1a`
BLAKE2b-256	`c41eeb4995781c3095bbec6016e8ad46a9eed0f72fd5cab44e75eb3971cf62e9`


Provenance

The following attestation bundles were made for mcpaisuite_evalmcp-1.0.3-py3-none-any.whl:

Publisher: release.yml on gashel01/evalmcp

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: mcpaisuite_evalmcp-1.0.3-py3-none-any.whl
- Subject digest: eda1ddc1992d09e97dcaeab03049f2098d125f0e9b2484e2cf6622a663899b7a
- Sigstore transparency entry: 1841152513
- Sigstore integration time: Jun 16, 2026
Source repository:
- Permalink: gashel01/evalmcp@032d46a90135269ab6db2f40bae52c333536b9b4
- Branch / Tag: refs/tags/v1.0.3
- Owner: https://github.com/gashel01
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@032d46a90135269ab6db2f40bae52c333536b9b4
- Trigger Event: push
