
pytest for MCP agents — a testing framework for Model Context Protocol agents


mcptest

pytest for MCP agents. A testing framework for Model Context Protocol (MCP) agents that lets you mock MCP servers with YAML fixtures, run agents against them in isolation, and assert against the resulting tool-call trajectories.

pip install mcptest
mcptest init
mcptest run

Why

Building an agent that talks to real MCP servers means:

  • Cost — every test run spends tokens and may hit paid APIs.
  • Flakiness — external services go down, rate-limit, or return non-deterministic data.
  • Slow feedback — end-to-end runs take minutes, not milliseconds.
  • No regression safety — change a prompt or swap a model and you have no way to know if the agent's behavior changed until something breaks in production.

mcptest gives MCP agents what pytest gave Python code: fast, hermetic, asserted tests and a regression safety net.

Core features

  • Mock MCP servers from YAML — declare tools, responses, and error scenarios in a fixture file. No code required for the common case.
  • Full MCP protocol — mocks speak real MCP over stdio (SSE support is in progress), so your agent connects to them the same way it connects to production servers.
  • Trajectory assertions — assert which tools were called, in what order, with what parameters, how many times, and how quickly.
  • Error injection — trigger named error scenarios to test your agent's recovery paths.
  • Metric-gated assertions & scorecards — use any quality metric as a YAML assertion gate (metric_above, metric_below), compose assertions with boolean combinators (all_of, any_of, none_of, weighted_score), and generate a weighted quality report card with mcptest scorecard for model comparison and prompt tuning.
  • Regression diffing — snapshot an agent's trajectory and detect drift when prompts, models, or MCP servers change.
  • Watch mode — mcptest watch monitors your test files and fixtures, re-running tests on every change. Smart dependency tracking means only the tests that reference a changed fixture are re-run.
  • pytest integration — use YAML files or write Python tests with fixtures.
  • CI/CD ready — GitHub Action + PR comment bot for regression gating.
  • Inline docs — mcptest explain <name> shows Rich-formatted terminal docs for any assertion, metric, or check. mcptest docs build generates a full MkDocs site with auto-generated reference pages that stay in sync with the code.

Capture — tests write themselves

The fastest way to get started is mcptest capture. Point it at any MCP server and it auto-discovers tools, samples responses, and writes both fixture YAML and test-spec YAML — no hand-writing required.

# 1. Install
pip install mcptest

# 2. Capture a live server → auto-generate fixture + tests
mcptest capture "python my_server.py" --output fixtures/ --generate-tests

# Generated files:
#   fixtures/my-server.yaml   ← fixture with real responses
#   fixtures/my-server-tests.yaml  ← ready-to-run test suite

# 3. Run the generated tests
mcptest run fixtures/my-server-tests.yaml

# 4. Watch mode — auto-run on save
mcptest watch --watch-extra src/

Options:

Flag                 Default          Description
--output / -o        .                Directory where files are written
--generate-tests     off              Also write a test-spec YAML
--samples-per-tool   3                Argument variations tried per tool
--dry-run            off              Preview without writing files
--agent              python agent.py  Agent command embedded in test suites

Quickstart (manual)

# 1. Install
pip install mcptest

# 2. Scaffold a project
mcptest init

# 3. Edit fixtures/example.yaml and tests/example.yaml

# 4. Run
mcptest run

# 5. Watch mode — auto-run on save
mcptest watch --watch-extra src/

Example fixture

# fixtures/github.yaml
server:
  name: mock-github

tools:
  - name: create_issue
    description: Create a GitHub issue
    input_schema:
      type: object
      properties:
        repo: { type: string }
        title: { type: string }
      required: [repo, title]
    responses:
      - match: { repo: acme/api }
        return:
          issue_number: 42
          url: https://github.com/acme/api/issues/42
      - default: true
        return:
          issue_number: 1

errors:
  - name: rate_limited
    tool: create_issue
    error_code: -32000
    message: GitHub API rate limit exceeded
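
Under the hood, the match / default scheme above resolves with a first-match-wins lookup. The sketch below is illustrative only, not mcptest's actual implementation; select_response is a hypothetical helper:

```python
# Hypothetical sketch of fixture response selection (not mcptest internals):
# return the first response whose `match` keys are a subset of the call
# arguments, falling back to the entry marked `default: true`.

def select_response(responses, arguments):
    default = None
    for entry in responses:
        if entry.get("default"):
            default = entry
            continue
        match = entry.get("match", {})
        if all(arguments.get(k) == v for k, v in match.items()):
            return entry["return"]
    return default["return"] if default else None

responses = [
    {"match": {"repo": "acme/api"},
     "return": {"issue_number": 42,
                "url": "https://github.com/acme/api/issues/42"}},
    {"default": True, "return": {"issue_number": 1}},
]

print(select_response(responses, {"repo": "acme/api", "title": "Bug"}))    # issue 42
print(select_response(responses, {"repo": "other/repo", "title": "Bug"}))  # default: issue 1
```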

Example test

# tests/issue_triage.yaml
name: Issue triage agent
fixtures:
  - fixtures/github.yaml
agent:
  command: python examples/issue_agent.py
cases:
  - name: Creates issue for bug report
    input: "File a bug: login page 500 error on Safari"
    assertions:
      - tool_called: create_issue
      - param_matches:
          tool: create_issue
          param: title
          contains: "500"
      - max_tool_calls: 3

Metric-gated assertions

Use any of the 7 built-in quality metrics directly as YAML assertion gates:

assertions:
  # Agent must be efficient (≥80% unique tool usage)
  - metric_above: {metric: tool_efficiency, threshold: 0.8}
  # Agent must not be repetitive (non-redundancy score ≥0.9)
  - metric_above: {metric: redundancy, threshold: 0.9}
  # Gate on a weighted composite quality score
  - weighted_score:
      threshold: 0.75
      weights:
        tool_efficiency: 0.3
        redundancy: 0.2
        error_recovery_rate: 0.5
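
A weighted_score gate presumably combines per-metric scores as a weight-normalised average. The sketch below shows that arithmetic under this assumption; the helper name is hypothetical:

```python
# Assumed model of the weighted composite: sum of score * weight, divided by
# the total weight. Not necessarily mcptest's exact formula.

def weighted_score(scores, weights):
    total = sum(weights.values())
    return sum(scores[metric] * w for metric, w in weights.items()) / total

scores = {"tool_efficiency": 0.9, "redundancy": 1.0, "error_recovery_rate": 0.6}
weights = {"tool_efficiency": 0.3, "redundancy": 0.2, "error_recovery_rate": 0.5}

composite = weighted_score(scores, weights)
print(round(composite, 3))  # 0.77, so a 0.75 threshold passes
```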

Boolean combinators for complex logic:

assertions:
  - all_of:
      - tool_called: create_issue
      - max_tool_calls: 5
  - any_of:
      - tool_called: create_issue
      - output_contains: created
  - none_of:
      - tool_called: delete_all
      - output_contains: ERROR
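
Because combinators nest, a runner can evaluate them recursively. A minimal sketch, assuming leaf assertions have already been resolved to booleans; every name here is hypothetical:

```python
# Hypothetical recursive evaluation of assertion combinators. `results` maps
# (assertion_key, value) pairs for leaf assertions to their pass/fail outcome.

def evaluate(node, results):
    if "all_of" in node:
        return all(evaluate(child, results) for child in node["all_of"])
    if "any_of" in node:
        return any(evaluate(child, results) for child in node["any_of"])
    if "none_of" in node:
        return not any(evaluate(child, results) for child in node["none_of"])
    key = next(iter(node))          # leaf assertion: look up its outcome
    return results[(key, node[key])]

results = {
    ("tool_called", "create_issue"): True,
    ("max_tool_calls", 5): True,
    ("tool_called", "delete_all"): False,
}
spec = {"all_of": [{"tool_called": "create_issue"}, {"max_tool_calls": 5}]}
print(evaluate(spec, results))   # True

guard = {"none_of": [{"tool_called": "delete_all"}]}
print(evaluate(guard, results))  # True: the forbidden tool was never called
```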

Agent scorecard

Generate a weighted quality report card from any saved trace:

# Render a human-readable table (exit 1 if composite score < 0.75)
mcptest scorecard trace.json

# Override the threshold
mcptest scorecard trace.json --fail-under 0.8

# Custom weights from a YAML config
mcptest scorecard trace.json --config scorecard.yaml

# Machine-readable JSON output (for CI pipelines)
mcptest scorecard trace.json --json

Example scorecard.yaml:

composite_threshold: 0.75
default_threshold: 0.7
thresholds:
  tool_efficiency: 0.8
  error_recovery_rate: 0.9
weights:
  tool_efficiency: 2.0
  redundancy: 1.0
  error_recovery_rate: 3.0

Conformance testing

Verify that any MCP server implementation correctly implements the protocol. 19 checks across 5 sections, each tagged with RFC 2119 severity (MUST / SHOULD / MAY).

Quick start

# Test a server subprocess over stdio
mcptest conformance "python my_server.py"

# Test in-process using a fixture YAML (fast, no subprocess)
mcptest conformance --fixture fixtures/my_server.yaml

# Filter to a specific section
mcptest conformance --fixture fixtures/my_server.yaml --section initialization

# Only run MUST checks (CI gate — fail only on hard violations)
mcptest conformance --fixture fixtures/my_server.yaml --severity must

# Also fail on SHOULD violations
mcptest conformance --fixture fixtures/my_server.yaml --fail-on-should

# Machine-readable output for CI pipelines
mcptest conformance --fixture fixtures/my_server.yaml --json

Check catalogue

ID        Section         Severity  Description
INIT-001  initialization  MUST      Server provides non-empty name
INIT-002  initialization  MUST      Server info includes version string
INIT-003  initialization  MUST      Server reports capabilities object
INIT-004  initialization  SHOULD    Capabilities includes tools when server has tools
TOOL-001  tool_listing    MUST      list_tools() returns a list
TOOL-002  tool_listing    MUST      Each tool has name and inputSchema fields
TOOL-003  tool_listing    MUST      All tool names are unique
TOOL-004  tool_listing    SHOULD    Each inputSchema has type: "object" at root
CALL-001  tool_calling    MUST      Calling a valid tool with matching arguments returns result
CALL-002  tool_calling    MUST      Result contains content list
CALL-003  tool_calling    MUST      Successful result has isError absent or False
CALL-004  tool_calling    MUST      Calling unknown tool name returns error
CALL-005  tool_calling    SHOULD    Error response sets isError to True
ERR-001   error_handling  MUST      Error result contains text content with message
ERR-002   error_handling  SHOULD    Server handles empty arguments dict without crashing
ERR-003   error_handling  SHOULD    Server handles None arguments without crashing
RES-001   resources       MUST      list_resources() returns a list
RES-002   resources       MUST      Each resource has uri and name fields
RES-003   resources       MUST      Resource URIs are unique

Resource checks (RES-*) are automatically skipped when the server has no resources capability.

CI integration

# .github/workflows/conformance.yml
- name: MCP conformance
  run: mcptest conformance --fixture fixtures/server.yaml --severity must --json > conformance.json

Exit code is 1 when any MUST check fails (or any SHOULD check fails with --fail-on-should).

Programmatic usage

import anyio
from mcptest.conformance import ConformanceRunner, InProcessServer, Severity
from mcptest.fixtures.loader import load_fixture
from mcptest.mock_server.server import MockMCPServer

fixture = load_fixture("fixtures/my_server.yaml")
mock = MockMCPServer(fixture)
server = InProcessServer(mock=mock, fixture=fixture)

runner = ConformanceRunner(server=server, severities=[Severity.MUST])
results = anyio.run(runner.run)
must_failures = [r for r in results if not r.passed and not r.skipped]

Documentation

Full reference documentation is auto-generated from live registries so it never goes stale.

# Look up any assertion, metric, or check inline
mcptest explain tool_called
mcptest explain tool_efficiency
mcptest explain INIT-001

# List all available assertions, metrics, and checks
mcptest docs list

# Generate a full MkDocs documentation site
mcptest docs build --output ./site
cd site && mkdocs serve

The generated site includes reference pages for every registered assertion, metric, and conformance check, built from the same registries the CLI uses.

Configuration

Place an mcptest.yaml file in your project root (or any parent directory) to set defaults for all CLI flags. mcptest walks up from the current directory until it finds one, the same discovery strategy git uses to locate the repository root.

# mcptest.yaml
test_paths: ["tests/"]
fixture_paths: ["fixtures/"]
baseline_dir: .mcptest/baselines

retry: 3          # default retry count for every case
tolerance: 0.8    # default pass-rate tolerance (0.0–1.0)
parallel: 4       # default worker count (-j flag)
fail_fast: false  # stop at first failure
fail_under: 0.0   # coverage gate (mcptest coverage --threshold)

# Per-metric thresholds used by mcptest scorecard
thresholds:
  tool_efficiency: 0.7
  redundancy: 0.3

# Plugins to load at startup (dotted module name or file path)
plugins:
  - my_company.mcptest_extensions
  - ./custom_assertions.py

# Cloud settings
cloud:
  url: https://mcptest.example.com
  api_key_env: MCPTEST_API_KEY

CLI flags always override config-file values. To inspect the resolved configuration and loaded plugins:

mcptest config

Benchmarking

mcptest bench runs the same test suite against multiple agent profiles and produces a side-by-side quality comparison. Use it to quantify which model or prompt strategy performs best before migrating, or to add a regression gate to CI.

Define agent profiles

Create an agents.yaml file listing the agents to compare:

# agents.yaml
agents:
  - name: claude-sonnet
    command: python agents/claude_agent.py
    env:
      MODEL: claude-3-5-sonnet-20241022
    description: Anthropic Claude Sonnet 3.5

  - name: gpt-4o
    command: python agents/openai_agent.py
    env:
      MODEL: gpt-4o
    description: OpenAI GPT-4o

Or embed profiles directly in mcptest.yaml so no extra file is needed:

# mcptest.yaml
agents:
  - name: claude-sonnet
    command: python agents/claude_agent.py
    env: { MODEL: claude-3-5-sonnet-20241022 }
  - name: gpt-4o
    command: python agents/openai_agent.py
    env: { MODEL: gpt-4o }

Run the benchmark

# Explicit profiles file
mcptest bench tests/ --agents agents.yaml

# Profiles from mcptest.yaml
mcptest bench tests/

# Machine-readable JSON (pipe to jq, store in CI artifacts, etc.)
mcptest bench tests/ --agents agents.yaml --json | jq .best_agent

The output includes three Rich tables:

  • Leaderboard — agents ranked by composite score with pass rate, duration, and a BEST badge for the winner.
  • Metric Comparison — pivot table of per-agent average scores for each quality metric, colour-coded green/yellow/red.
  • Per-Test Breakdown — pass/fail grid across agents and test cases, highlighting where agents diverge.

CI integration

# Exit 1 if the best agent's composite score is below 0.75
mcptest bench tests/ --agents agents.yaml --ci --fail-under 0.75

Add to your pipeline:

# .github/workflows/bench.yml (example)
- name: Benchmark agents
  run: mcptest bench tests/ --agents agents.yaml --ci --fail-under 0.75

Options

Flag                  Description
--agents <file>       Load profiles from this YAML file
--json                Emit JSON instead of Rich tables
--ci                  Exit non-zero when best score < --fail-under
--fail-under <float>  CI composite-score threshold (default 0.0)
--retry <n>           Override retry count for every case
--tolerance <float>   Override pass-rate tolerance (0.0–1.0)
-j/--parallel <n>     Parallel workers per agent

Semantic evaluation

mcptest eval scores agent text output against named criteria — no LLM API calls required. Grading is deterministic: keyword coverage, regex patterns, and text similarity (Levenshtein / Jaccard / cosine). It is ideal for CI pipelines where speed and cost predictability matter.

Define a rubric

Create a rubric YAML that describes what a good answer looks like:

# rubrics/booking.yaml
rubric:
  name: booking-quality
  criteria:
    - name: correctness
      weight: 0.5
      method: keywords
      expected: [confirmed, booking_id, receipt]
      threshold: 0.6        # ≥ 60% of keywords must be present
    - name: format
      weight: 0.3
      method: pattern
      expected: "Booking \\w+ confirmed"
      threshold: 1.0        # regex must match
    - name: completeness
      weight: 0.2
      method: similarity
      expected: "Your booking ABC123 is confirmed. You will receive a receipt."
      threshold: 0.7        # text similarity must be ≥ 0.7

Grading methods

Method      Description
keywords    Fraction of expected keywords found in the text (substring match)
pattern     Binary regex match anywhere in the text (1.0 or 0.0)
similarity  Best of Levenshtein / Jaccard / cosine similarity against a reference
contains    Binary substring check (1.0 or 0.0)
custom      Reserved for plug-in graders (returns 0.0 by default)
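
The keywords and similarity methods are easy to model. A sketch assuming case-insensitive substring matching and word-level Jaccard similarity (mcptest's Grader may tokenise or normalise differently):

```python
# Assumed behaviour of two of the deterministic graders; illustrative only.
import re

def keyword_score(text: str, expected: list[str]) -> float:
    """Fraction of expected keywords present as substrings (case-insensitive)."""
    lowered = text.lower()
    return sum(1 for kw in expected if kw.lower() in lowered) / len(expected)

def jaccard(a: str, b: str) -> float:
    """Word-set Jaccard similarity between two texts."""
    wa = set(re.findall(r"\w+", a.lower()))
    wb = set(re.findall(r"\w+", b.lower()))
    return len(wa & wb) / len(wa | wb) if wa | wb else 1.0

text = "Your booking ABC123 is confirmed. Receipt sent."
# "confirmed" and "receipt" hit; "booking_id" does not, so coverage is 2/3.
print(keyword_score(text, ["confirmed", "booking_id", "receipt"]))
```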

Run evaluations

# Grade every test case against a rubric file
mcptest eval tests/ --rubric rubrics/booking.yaml

# Machine-readable JSON
mcptest eval tests/ --rubric rubrics/booking.yaml --json

# CI gate: exit 1 if mean composite score < 0.75 or any criterion fails
mcptest eval tests/ --rubric rubrics/booking.yaml --ci --fail-under 0.75

Inline rubric in test spec

Embed an eval: section directly in a test case to avoid a separate file:

# tests/booking.yaml
name: Booking agent
fixtures: [fixtures/booking.yaml]
agent:
  command: python agent.py
cases:
  - name: confirm booking
    input: "Book a table for 2 at 7pm"
    assertions:
      - tool_called: create_booking
    eval:
      name: booking-quality
      criteria:
        - name: correctness
          method: keywords
          expected: [confirmed, booking_id]
          weight: 1.0
          threshold: 0.5

Output

The Rich table shows per-criterion average scores, pass rates, and verdicts:

Evaluation Report — rubric: booking-quality

 Criterion      Avg Score  Pass Rate  Verdict
 ──────────────────────────────────────────────
 correctness    0.833      100.0%     PASS
 format         1.000      100.0%     PASS
 completeness   0.721       80.0%     PARTIAL

Overall: 4/5 passed (80.0%) — composite score 0.851

Options

Flag                  Description
--rubric <file>       Load rubric from this YAML file (overrides inline eval:)
--json                Emit JSON instead of Rich tables
--ci                  Exit non-zero when any criterion fails or score < --fail-under
--fail-under <float>  CI composite-score threshold (default 0.0)
--retry <n>           Override retry count for every case
--tolerance <float>   Override pass-rate tolerance (0.0–1.0)
-j/--parallel <n>     Parallel workers

Programmatic usage

from pathlib import Path
from mcptest.eval import Grader, load_rubric, aggregate_results

rubric = load_rubric(Path("rubrics/booking.yaml"))
grader = Grader(rubric)

texts = [
    "Your booking ABC123 is confirmed. Receipt sent.",
    "Booking confirmed.",
]
results = [grader.grade(t) for t in texts]
summary = aggregate_results(results)
print(summary.pass_rate, summary.mean_composite)

Plugins

Plugins let you add custom assertions, metrics, and exporters without forking mcptest. Any module that calls the registration decorators at import time is a valid plugin.

Load via config file (dotted module name or file path):

# mcptest.yaml
plugins:
  - my_company.mcptest_extensions  # installed package
  - ./custom_assertions.py         # local file

Load via confmcptest.py (auto-discovered, like pytest's conftest.py):

# tests/confmcptest.py
from mcptest.assertions.base import AssertionResult, TraceAssertion, register_assertion
from mcptest.runner.trace import Trace

@register_assertion
class response_is_json(TraceAssertion):
    yaml_key = "response_is_json"

    def check(self, trace: Trace) -> AssertionResult:
        ok = all(
            "json" in (call.result or "").lower()
            for call in trace.tool_calls
        )
        return AssertionResult(
            passed=ok,
            name=self.yaml_key,
            message="all tool responses are JSON" if ok else "non-JSON response found",
        )

Load via entry points (for distributable packages):

# pyproject.toml of your plugin package
[project.entry-points."mcptest.assertions"]
my_assertions = "my_package.assertions"

[project.entry-points."mcptest.metrics"]
my_metrics = "my_package.metrics"

[project.entry-points."mcptest.exporters"]
my_exporter = "my_package.exporters"

Once installed, your assertions are available by yaml_key in any test YAML:

assertions:
  - response_is_json: true

Cloud dashboard

mcptest ships a lightweight web UI for the cloud backend that runs with zero build tooling — Tailwind CSS, htmx, and Chart.js are all loaded from CDN.

Launch

# Install cloud extras
pip install 'mcptest[cloud]'

# Start the dashboard (opens browser automatically)
mcptest dashboard

# Custom host / port / database
mcptest dashboard --host 0.0.0.0 --port 8200 --db ./prod.db --no-browser

The server starts at http://127.0.0.1:8100/dashboard/ by default.

Pages

Page        URL                   Description
Overview    /dashboard/           Stats cards (total runs, pass rate, avg duration, tool calls, baselines), recent-runs table, per-suite pass/fail bars.
Runs        /dashboard/runs       Filterable, paginated run list. Dropdowns for suite, branch, status, and environment update the table live via htmx without a full page reload.
Run detail  /dashboard/runs/{id}  Full run info: metric scores (horizontal bar chart), collapsible tool-call timeline with arguments and results, input/output panels, promote-as-baseline button.
Trends      /dashboard/trends     Chart.js line chart of any metric over time. Baseline runs are marked with stars. Controls for metric, suite, branch, and data limit update the chart instantly.
Baselines   /dashboard/baselines  Active baseline table with one-click demote (htmx). Compare any two runs by ID and see a metric delta table with regression indicators.
Webhooks    /dashboard/webhooks   Register and manage HTTP webhook endpoints. Create webhooks with event subscriptions, view delivery history, and send test pings.

Configuration

Flag          Default             Description
--host        127.0.0.1           Bind address
--port        8100                Listen port
--db          ./mcptest_cloud.db  SQLite database path (or set MCPTEST_DATABASE_URL)
--no-browser  off                 Skip auto-opening the browser

The dashboard is backed by the same FastAPI app as mcptest cloud-push; all API endpoints remain available at their existing paths alongside the dashboard routes.

Webhook system

mcptest cloud can POST signed JSON notifications to external HTTP endpoints when key events occur — completing the CI loop from cloud-push through auto-regression check to team alert.

Events

Event                Fires when
run.created          A new test run is pushed via POST /runs
regression.detected  POST /runs/{id}/check finds metric regressions vs baseline
baseline.promoted    A run is promoted as the suite baseline
baseline.demoted     A baseline is demoted/removed

API

# Register a webhook
curl -X POST http://localhost:8100/webhooks \
  -H "X-API-Key: $KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://hooks.slack.com/services/...",
    "secret": "my-signing-secret",
    "events": ["regression.detected", "baseline.promoted"],
    "suite_filter": "smoke"
  }'

# List webhooks
curl http://localhost:8100/webhooks -H "X-API-Key: $KEY"

# Send a test ping
curl -X POST http://localhost:8100/webhooks/{id}/test -H "X-API-Key: $KEY"

# View delivery history
curl http://localhost:8100/webhooks/{id}/deliveries -H "X-API-Key: $KEY"

Payload format

Every delivery POSTs the same envelope:

{
  "event": "regression.detected",
  "timestamp": "2026-01-15T12:00:00+00:00",
  "data": {
    "head_id": 42,
    "base_id": 37,
    "suite": "smoke",
    "branch": "feat/new-prompt",
    "regression_count": 2,
    "deltas": [
      {"name": "tool_efficiency", "base_score": 0.9, "head_score": 0.6, "delta": -0.3}
    ]
  }
}

Signature verification

When a secret is set, each request includes an X-MCPTest-Signature: sha256=<hex> header. Verify it in your receiver:

import hashlib, hmac

def verify(secret: str, body: bytes, header: str) -> bool:
    expected = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    provided = header.removeprefix("sha256=")
    return hmac.compare_digest(expected, provided)
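
The sender side presumably mirrors this: an HMAC-SHA256 of the raw request body, hex-encoded and prefixed with sha256=. A sketch under that assumption:

```python
# Assumed sender-side signing, matching the verification logic above.
import hashlib
import hmac
import json

def sign(secret: str, body: bytes) -> str:
    """Build an X-MCPTest-Signature header value for a payload."""
    digest = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    return f"sha256={digest}"

body = json.dumps({"event": "regression.detected"}).encode()
header = sign("my-signing-secret", body)  # passes the verify() check above
```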

Deliveries are retried up to 3 times, with exponential back-off (1 s, 4 s, 16 s), on connection errors or 5xx responses. Every attempt is logged in the webhook_deliveries table and visible in the dashboard.
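
The 1 s / 4 s / 16 s schedule is exponential back-off with a 1 s base delay and a factor of 4. For illustration only:

```python
# Reproduces the documented retry schedule; not mcptest's delivery code.
def backoff_delays(attempts: int = 3, base: float = 1.0, factor: float = 4.0):
    return [base * factor ** i for i in range(attempts)]

print(backoff_delays())  # [1.0, 4.0, 16.0]
```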

Status

Alpha. The core loop (mock server → runner → assertions → CLI) is functional; SSE transport and test packs are under active development.

License

MIT — see LICENSE.
