pytest for MCP agents — a testing framework for Model Context Protocol agents
mcptest
pytest for MCP agents. A testing framework for Model Context Protocol (MCP) agents that lets you mock MCP servers with YAML fixtures, run agents against them in isolation, and assert against the resulting tool-call trajectories.
pip install mcptest
mcptest init
mcptest run
Why
Building an agent that talks to real MCP servers means:
- Cost — every test run spends tokens and may hit paid APIs.
- Flakiness — external services go down, rate-limit, or return non-deterministic data.
- Slow feedback — end-to-end runs take minutes, not milliseconds.
- No regression safety — change a prompt or swap a model and you have no way to know if the agent's behavior changed until something breaks in production.
mcptest gives MCP agents what pytest gave Python code: fast, hermetic, asserted tests and a regression safety net.
Core features
- Mock MCP servers from YAML — declare tools, responses, and error scenarios in a fixture file. No code required for the common case.
- Full MCP protocol — mocks speak real MCP over stdio (and SSE), so your agent connects to them the same way it connects to production servers.
- Trajectory assertions — assert which tools were called, in what order, with what parameters, how many times, and how quickly.
- Error injection — trigger named error scenarios to test your agent's recovery paths.
- Metric-gated assertions & scorecards — use any quality metric as a YAML assertion gate (`metric_above`, `metric_below`), compose assertions with boolean combinators (`all_of`, `any_of`, `none_of`, `weighted_score`), and generate a weighted quality report card with `mcptest scorecard` for model comparison and prompt tuning.
- Regression diffing — snapshot an agent's trajectory and detect drift when prompts, models, or MCP servers change.
- Watch mode — `mcptest watch` monitors your test files and fixtures and re-runs only the affected tests when anything changes; smart dependency tracking means only the tests that reference a changed fixture are re-run.
- pytest integration — use YAML files or write Python tests with fixtures.
- CI/CD ready — GitHub Action + PR comment bot for regression gating.
- Inline docs — `mcptest explain <name>` shows Rich-formatted terminal docs for any assertion, metric, or check; `mcptest docs build` generates a full MkDocs site with auto-generated reference pages that stay in sync with the code.
Capture — tests write themselves
The fastest way to get started is `mcptest capture`. Point it at any MCP server
and it auto-discovers tools, samples responses, and writes both fixture YAML and
test-spec YAML — no hand-writing required.
# 1. Install
pip install mcptest
# 2. Capture a live server → auto-generate fixture + tests
mcptest capture "python my_server.py" --output fixtures/ --generate-tests
# Generated files:
# fixtures/my-server.yaml ← fixture with real responses
# fixtures/my-server-tests.yaml ← ready-to-run test suite
# 3. Run the generated tests
mcptest run fixtures/my-server-tests.yaml
# 4. Watch mode — auto-run on save
mcptest watch --watch-extra src/
Options:
| Flag | Default | Description |
|---|---|---|
| `--output` / `-o` | `.` | Directory where files are written |
| `--generate-tests` | off | Also write a test-spec YAML |
| `--samples-per-tool` | `3` | Argument variations tried per tool |
| `--dry-run` | off | Preview without writing files |
| `--agent` | `python agent.py` | Agent command embedded in test suites |
Quickstart (manual)
# 1. Install
pip install mcptest
# 2. Scaffold a project
mcptest init
# 3. Edit fixtures/example.yaml and tests/example.yaml
# 4. Run
mcptest run
# 5. Watch mode — auto-run on save
mcptest watch --watch-extra src/
Example fixture
# fixtures/github.yaml
server:
name: mock-github
tools:
- name: create_issue
description: Create a GitHub issue
input_schema:
type: object
properties:
repo: { type: string }
title: { type: string }
required: [repo, title]
responses:
- match: { repo: acme/api }
return:
issue_number: 42
url: https://github.com/acme/api/issues/42
- default: true
return:
issue_number: 1
errors:
- name: rate_limited
tool: create_issue
error_code: -32000
message: GitHub API rate limit exceeded
Example test
# tests/issue_triage.yaml
name: Issue triage agent
fixtures:
- fixtures/github.yaml
agent:
command: python examples/issue_agent.py
cases:
- name: Creates issue for bug report
input: "File a bug: login page 500 error on Safari"
assertions:
- tool_called: create_issue
- param_matches:
tool: create_issue
param: title
contains: "500"
- max_tool_calls: 3
Metric-gated assertions
Use any of the 7 built-in quality metrics directly as YAML assertion gates:
assertions:
# Agent must be efficient (≥80% unique tool usage)
- metric_above: {metric: tool_efficiency, threshold: 0.8}
# Agent must not be repetitive (non-redundancy score ≥0.9)
- metric_above: {metric: redundancy, threshold: 0.9}
# Gate on a weighted composite quality score
- weighted_score:
threshold: 0.75
weights:
tool_efficiency: 0.3
redundancy: 0.2
error_recovery_rate: 0.5
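The `weighted_score` gate compares a weighted average of the listed metric scores against the threshold. A minimal sketch of that arithmetic (illustrative only, not mcptest's internal code; the example scores are made up):

```python
def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of metric scores, normalised by the total weight."""
    total = sum(weights.values())
    return sum(scores[name] * w for name, w in weights.items()) / total

# Hypothetical per-metric scores, combined with the weights from the YAML above:
composite = weighted_score(
    {"tool_efficiency": 0.9, "redundancy": 1.0, "error_recovery_rate": 0.6},
    {"tool_efficiency": 0.3, "redundancy": 0.2, "error_recovery_rate": 0.5},
)
# 0.3*0.9 + 0.2*1.0 + 0.5*0.6 = 0.77 → passes a 0.75 threshold
```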
Boolean combinators for complex logic:
assertions:
- all_of:
- tool_called: create_issue
- max_tool_calls: 5
- any_of:
- tool_called: create_issue
- output_contains: created
- none_of:
- tool_called: delete_all
- output_contains: ERROR
Agent scorecard
Generate a weighted quality report card from any saved trace:
# Render a human-readable table (exit 1 if composite score < 0.75)
mcptest scorecard trace.json
# Override the threshold
mcptest scorecard trace.json --fail-under 0.8
# Custom weights from a YAML config
mcptest scorecard trace.json --config scorecard.yaml
# Machine-readable JSON output (for CI pipelines)
mcptest scorecard trace.json --json
Example scorecard.yaml:
composite_threshold: 0.75
default_threshold: 0.7
thresholds:
tool_efficiency: 0.8
error_recovery_rate: 0.9
weights:
tool_efficiency: 2.0
redundancy: 1.0
error_recovery_rate: 3.0
Conformance testing
Verify that any MCP server implementation correctly implements the protocol. 19 checks across 5 sections, each tagged with RFC 2119 severity (MUST / SHOULD / MAY).
Quick start
# Test a server subprocess over stdio
mcptest conformance "python my_server.py"
# Test in-process using a fixture YAML (fast, no subprocess)
mcptest conformance --fixture fixtures/my_server.yaml
# Filter to a specific section
mcptest conformance --fixture fixtures/my_server.yaml --section initialization
# Only run MUST checks (CI gate — fail only on hard violations)
mcptest conformance --fixture fixtures/my_server.yaml --severity must
# Also fail on SHOULD violations
mcptest conformance --fixture fixtures/my_server.yaml --fail-on-should
# Machine-readable output for CI pipelines
mcptest conformance --fixture fixtures/my_server.yaml --json
Check catalogue
| ID | Section | Severity | Description |
|---|---|---|---|
| INIT-001 | initialization | MUST | Server provides non-empty name |
| INIT-002 | initialization | MUST | Server info includes version string |
| INIT-003 | initialization | MUST | Server reports capabilities object |
| INIT-004 | initialization | SHOULD | Capabilities includes tools when server has tools |
| TOOL-001 | tool_listing | MUST | list_tools() returns a list |
| TOOL-002 | tool_listing | MUST | Each tool has name and inputSchema fields |
| TOOL-003 | tool_listing | MUST | All tool names are unique |
| TOOL-004 | tool_listing | SHOULD | Each inputSchema has type: "object" at root |
| CALL-001 | tool_calling | MUST | Calling a valid tool with matching arguments returns result |
| CALL-002 | tool_calling | MUST | Result contains content list |
| CALL-003 | tool_calling | MUST | Successful result has isError absent or False |
| CALL-004 | tool_calling | MUST | Calling unknown tool name returns error |
| CALL-005 | tool_calling | SHOULD | Error response sets isError to True |
| ERR-001 | error_handling | MUST | Error result contains text content with message |
| ERR-002 | error_handling | SHOULD | Server handles empty arguments dict without crashing |
| ERR-003 | error_handling | SHOULD | Server handles None arguments without crashing |
| RES-001 | resources | MUST | list_resources() returns a list |
| RES-002 | resources | MUST | Each resource has uri and name fields |
| RES-003 | resources | MUST | Resource URIs are unique |
Resource checks (RES-*) are automatically skipped when the server has no resources capability.
CI integration
# .github/workflows/conformance.yml
- name: MCP conformance
run: mcptest conformance --fixture fixtures/server.yaml --severity must --json > conformance.json
Exit code is 1 when any MUST check fails (or any SHOULD check fails with --fail-on-should).
Programmatic usage
import anyio
from mcptest.conformance import ConformanceRunner, InProcessServer, Severity
from mcptest.fixtures.loader import load_fixture
from mcptest.mock_server.server import MockMCPServer
fixture = load_fixture("fixtures/my_server.yaml")
mock = MockMCPServer(fixture)
server = InProcessServer(mock=mock, fixture=fixture)
runner = ConformanceRunner(server=server, severities=[Severity.MUST])
results = anyio.run(runner.run)
must_failures = [r for r in results if not r.passed and not r.skipped]
Documentation
Full reference documentation is auto-generated from live registries so it never goes stale.
# Look up any assertion, metric, or check inline
mcptest explain tool_called
mcptest explain tool_efficiency
mcptest explain INIT-001
# List all available assertions, metrics, and checks
mcptest docs list
# Generate a full MkDocs documentation site
mcptest docs build --output ./site
cd site && mkdocs serve
The generated site includes:
- Getting Started — capture-first 5-minute quickstart
- Assertions Reference — all 19 assertions with YAML examples
- Metrics Reference — 7 quality metrics with score interpretation
- Conformance Checks Reference — 19 protocol checks with severity
- CLI Reference — every command with full option tables
Configuration
Place a `mcptest.yaml` file in your project root (or any parent directory) to
set defaults for all CLI flags. mcptest walks up from the current directory
to find it — the same upward search git uses to locate a repository root.
# mcptest.yaml
test_paths: ["tests/"]
fixture_paths: ["fixtures/"]
baseline_dir: .mcptest/baselines
retry: 3 # default retry count for every case
tolerance: 0.8 # default pass-rate tolerance (0.0–1.0)
parallel: 4 # default worker count (-j flag)
fail_fast: false # stop at first failure
fail_under: 0.0 # coverage gate (mcptest coverage --threshold)
# Per-metric thresholds used by mcptest scorecard
thresholds:
tool_efficiency: 0.7
redundancy: 0.3
# Plugins to load at startup (dotted module name or file path)
plugins:
- my_company.mcptest_extensions
- ./custom_assertions.py
# Cloud settings
cloud:
url: https://mcptest.example.com
api_key_env: MCPTEST_API_KEY
CLI flags always override config-file values. To inspect the resolved configuration and loaded plugins:
mcptest config
Benchmarking
mcptest bench runs the same test suite against multiple agent profiles and
produces a side-by-side quality comparison. Use it to quantify which model or
prompt strategy performs best before migrating, or to add a regression gate to
CI.
Define agent profiles
Create an agents.yaml file listing the agents to compare:
# agents.yaml
agents:
- name: claude-sonnet
command: python agents/claude_agent.py
env:
MODEL: claude-3-5-sonnet-20241022
description: Anthropic Claude Sonnet 3.5
- name: gpt-4o
command: python agents/openai_agent.py
env:
MODEL: gpt-4o
description: OpenAI GPT-4o
Or embed profiles directly in mcptest.yaml so no extra file is needed:
# mcptest.yaml
agents:
- name: claude-sonnet
command: python agents/claude_agent.py
env: { MODEL: claude-3-5-sonnet-20241022 }
- name: gpt-4o
command: python agents/openai_agent.py
env: { MODEL: gpt-4o }
Run the benchmark
# Explicit profiles file
mcptest bench tests/ --agents agents.yaml
# Profiles from mcptest.yaml
mcptest bench tests/
# Machine-readable JSON (pipe to jq, store in CI artifacts, etc.)
mcptest bench tests/ --agents agents.yaml --json | jq .best_agent
The output includes three Rich tables:
- Leaderboard — agents ranked by composite score with pass rate, duration, and a `BEST` badge for the winner.
- Metric Comparison — pivot table of per-agent average scores for each quality metric, colour-coded green/yellow/red.
- Per-Test Breakdown — pass/fail grid across agents and test cases, highlighting where agents diverge.
CI integration
# Exit 1 if the best agent's composite score is below 0.75
mcptest bench tests/ --agents agents.yaml --ci --fail-under 0.75
Add to your pipeline:
# .github/workflows/bench.yml (example)
- name: Benchmark agents
run: mcptest bench tests/ --agents agents.yaml --ci --fail-under 0.75
Options
| Flag | Description |
|---|---|
| `--agents <file>` | Load profiles from this YAML file |
| `--json` | Emit JSON instead of Rich tables |
| `--ci` | Exit non-zero when best score < `--fail-under` |
| `--fail-under <float>` | CI composite-score threshold (default 0.0) |
| `--retry <n>` | Override retry count for every case |
| `--tolerance <float>` | Override pass-rate tolerance (0.0–1.0) |
| `-j`/`--parallel <n>` | Parallel workers per agent |
Semantic evaluation
mcptest eval scores agent text output against named criteria — no LLM API
calls required. Grading is deterministic: keyword coverage, regex patterns,
and text similarity (Levenshtein / Jaccard / cosine). It is ideal for CI
pipelines where speed and cost predictability matter.
Define a rubric
Create a rubric YAML that describes what a good answer looks like:
# rubrics/booking.yaml
rubric:
name: booking-quality
criteria:
- name: correctness
weight: 0.5
method: keywords
expected: [confirmed, booking_id, receipt]
threshold: 0.6 # ≥60 % of keywords must be present
- name: format
weight: 0.3
method: pattern
expected: "Booking \\w+ confirmed"
threshold: 1.0 # regex must match
- name: completeness
weight: 0.2
method: similarity
expected: "Your booking ABC123 is confirmed. You will receive a receipt."
threshold: 0.7 # text similarity must be ≥ 0.7
Grading methods
| Method | Description |
|---|---|
| `keywords` | Fraction of expected keywords found in the text (substring match) |
| `pattern` | Binary regex match anywhere in the text (1.0 or 0.0) |
| `similarity` | Best of Levenshtein / Jaccard / cosine similarity against a reference |
| `contains` | Binary substring check (1.0 or 0.0) |
| `custom` | Reserved for plug-in graders (returns 0.0 by default) |
Run evaluations
# Grade every test case against a rubric file
mcptest eval tests/ --rubric rubrics/booking.yaml
# Machine-readable JSON
mcptest eval tests/ --rubric rubrics/booking.yaml --json
# CI gate: exit 1 if mean composite score < 0.75 or any criterion fails
mcptest eval tests/ --rubric rubrics/booking.yaml --ci --fail-under 0.75
Inline rubric in test spec
Embed an `eval:` section directly in a test case to avoid a separate file:
# tests/booking.yaml
name: Booking agent
fixtures: [fixtures/booking.yaml]
agent:
command: python agent.py
cases:
- name: confirm booking
input: "Book a table for 2 at 7pm"
assertions:
- tool_called: create_booking
eval:
name: booking-quality
criteria:
- name: correctness
method: keywords
expected: [confirmed, booking_id]
weight: 1.0
threshold: 0.5
Output
The Rich table shows per-criterion average scores, pass rates, and verdicts:
Evaluation Report — rubric: booking-quality
Criterion Avg Score Pass Rate Verdict
──────────────────────────────────────────────
correctness 0.833 100.0% PASS
format 1.000 100.0% PASS
completeness 0.721 80.0% PARTIAL
Overall: 4/5 passed (80.0%) — composite score 0.851
Options
| Flag | Description |
|---|---|
| `--rubric <file>` | Load rubric from this YAML file (overrides inline `eval:`) |
| `--json` | Emit JSON instead of Rich tables |
| `--ci` | Exit non-zero when any criterion fails or score < `--fail-under` |
| `--fail-under <float>` | CI composite-score threshold (default 0.0) |
| `--retry <n>` | Override retry count for every case |
| `--tolerance <float>` | Override pass-rate tolerance (0.0–1.0) |
| `-j`/`--parallel <n>` | Parallel workers |
Programmatic usage
from pathlib import Path
from mcptest.eval import Grader, load_rubric, aggregate_results
rubric = load_rubric(Path("rubrics/booking.yaml"))
grader = Grader(rubric)
texts = [
"Your booking ABC123 is confirmed. Receipt sent.",
"Booking confirmed.",
]
results = [grader.grade(t) for t in texts]
summary = aggregate_results(results)
print(summary.pass_rate, summary.mean_composite)
Plugins
Plugins let you add custom assertions, metrics, and exporters without forking mcptest. Any module that calls the registration decorators at import time is a valid plugin.
Load via config file (dotted module name or file path):
# mcptest.yaml
plugins:
- my_company.mcptest_extensions # installed package
- ./custom_assertions.py # local file
Load via `confmcptest.py` (auto-discovered, like pytest's `conftest.py`):
# tests/confmcptest.py
from mcptest.assertions.base import AssertionResult, TraceAssertion, register_assertion
from mcptest.runner.trace import Trace
@register_assertion
class response_is_json(TraceAssertion):
yaml_key = "response_is_json"
def check(self, trace: Trace) -> AssertionResult:
ok = all(
"json" in (call.result or "").lower()
for call in trace.tool_calls
)
return AssertionResult(
passed=ok,
name=self.yaml_key,
message="all tool responses are JSON" if ok else "non-JSON response found",
)
Load via entry points (for distributable packages):
# pyproject.toml of your plugin package
[project.entry-points."mcptest.assertions"]
my_assertions = "my_package.assertions"
[project.entry-points."mcptest.metrics"]
my_metrics = "my_package.metrics"
[project.entry-points."mcptest.exporters"]
my_exporter = "my_package.exporters"
Once installed, your assertions are available by yaml_key in any test YAML:
assertions:
- response_is_json: true
Cloud dashboard
mcptest ships a lightweight web UI for the cloud backend that runs with zero build
tooling — Tailwind CSS, htmx, and Chart.js are all loaded from CDN.
Launch
# Install cloud extras
pip install 'mcptest[cloud]'
# Start the dashboard (opens browser automatically)
mcptest dashboard
# Custom host / port / database
mcptest dashboard --host 0.0.0.0 --port 8200 --db ./prod.db --no-browser
The server starts at http://127.0.0.1:8100/dashboard/ by default.
Pages
| Page | URL | Description |
|---|---|---|
| Overview | `/dashboard/` | Stats cards (total runs, pass rate, avg duration, tool calls, baselines), recent-runs table, per-suite pass/fail bars |
| Runs | `/dashboard/runs` | Filterable, paginated run list; dropdowns for suite, branch, status, and environment update the table live via htmx without a full page reload |
| Run detail | `/dashboard/runs/{id}` | Full run info: metric scores (horizontal bar chart), collapsible tool-call timeline with arguments and results, input/output panels, promote-as-baseline button |
| Trends | `/dashboard/trends` | Chart.js line chart of any metric over time; baseline runs are marked with stars, and controls for metric, suite, branch, and data limit update the chart instantly |
| Baselines | `/dashboard/baselines` | Active baseline table with one-click demote (htmx); compare any two runs by ID and see a metric delta table with regression indicators |
| Webhooks | `/dashboard/webhooks` | Register and manage HTTP webhook endpoints: create webhooks with event subscriptions, view delivery history, and send test pings |
Configuration
| Flag | Default | Description |
|---|---|---|
| `--host` | `127.0.0.1` | Bind address |
| `--port` | `8100` | Listen port |
| `--db` | `./mcptest_cloud.db` | SQLite database path (or set `MCPTEST_DATABASE_URL`) |
| `--no-browser` | off | Skip auto-opening the browser |
The dashboard is backed by the same FastAPI app as mcptest cloud-push; all API
endpoints remain available at their existing paths alongside the dashboard routes.
Webhook system
mcptest cloud can POST signed JSON notifications to external HTTP endpoints when
key events occur — completing the CI loop from cloud-push through auto-regression
check to team alert.
Events
| Event | Fires when |
|---|---|
| `run.created` | A new test run is pushed via `POST /runs` |
| `regression.detected` | `POST /runs/{id}/check` finds metric regressions vs. the baseline |
| `baseline.promoted` | A run is promoted as the suite baseline |
| `baseline.demoted` | A baseline is demoted/removed |
API
# Register a webhook
curl -X POST http://localhost:8100/webhooks \
-H "X-API-Key: $KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://hooks.slack.com/services/...",
"secret": "my-signing-secret",
"events": ["regression.detected", "baseline.promoted"],
"suite_filter": "smoke"
}'
# List webhooks
curl http://localhost:8100/webhooks -H "X-API-Key: $KEY"
# Send a test ping
curl -X POST http://localhost:8100/webhooks/{id}/test -H "X-API-Key: $KEY"
# View delivery history
curl http://localhost:8100/webhooks/{id}/deliveries -H "X-API-Key: $KEY"
Payload format
Every delivery POSTs the same envelope:
{
"event": "regression.detected",
"timestamp": "2026-01-15T12:00:00+00:00",
"data": {
"head_id": 42,
"base_id": 37,
"suite": "smoke",
"branch": "feat/new-prompt",
"regression_count": 2,
"deltas": [
{"name": "tool_efficiency", "base_score": 0.9, "head_score": 0.6, "delta": -0.3}
]
}
}
Signature verification
When a secret is set, each request includes an X-MCPTest-Signature: sha256=<hex>
header. Verify it in your receiver:
import hashlib, hmac
def verify(secret: str, body: bytes, header: str) -> bool:
expected = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
provided = header.removeprefix("sha256=")
return hmac.compare_digest(expected, provided)
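To sanity-check the verifier, you can sign a payload yourself with a known secret (hypothetical values; the `verify` function is the one above):

```python
import hashlib
import hmac

def verify(secret: str, body: bytes, header: str) -> bool:
    expected = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    provided = header.removeprefix("sha256=")
    return hmac.compare_digest(expected, provided)

# Sign a sample body the way the sender would, then verify it.
body = b'{"event": "regression.detected"}'
sig = "sha256=" + hmac.new(b"my-signing-secret", body, hashlib.sha256).hexdigest()
assert verify("my-signing-secret", body, sig)   # matching secret verifies
assert not verify("wrong-secret", body, sig)    # wrong secret is rejected
```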
Deliveries are retried up to 3 times with exponential back-off (1 s, 4 s, 16 s) on
connection errors or 5xx responses. Every attempt is logged in the
`webhook_deliveries` table and visible in the dashboard.
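The back-off schedule (1 s, 4 s, 16 s) is powers of four. A minimal sketch of such a delivery loop, assuming a `post` callable that returns an HTTP status code (illustrative, not mcptest's actual delivery code):

```python
import time

def deliver(post, payload, max_retries: int = 3) -> bool:
    """Attempt delivery, retrying connection errors and 5xx with 1 s/4 s/16 s back-off."""
    for attempt in range(max_retries + 1):   # one initial try plus the retries
        try:
            status = post(payload)
            if status < 500:
                return 200 <= status < 300   # 2xx delivered; 4xx is not retried
        except ConnectionError:
            pass                             # treated like a 5xx: retry
        if attempt < max_retries:
            time.sleep(4 ** attempt)         # 1 s, 4 s, 16 s
    return False
```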
Status
Alpha. The core loop (mock server → runner → assertions → CLI) is functional; SSE transport and test packs are under active development.
License
MIT — see LICENSE.