testmcpy

A comprehensive testing framework for validating LLM tool calling capabilities with MCP services

Test and benchmark LLMs with MCP tools in minutes.

A testing framework for validating how LLMs call tools via Model Context Protocol (MCP) — compare Claude, GPT-4, Llama, and other models' accuracy, cost, and performance.

MCP Explorer — tools, resources, and prompts from a connected MCP service

Why testmcpy?

Validate tool calling: Ensure LLMs call the right tools with correct parameters
Compare models: Find the best price/performance balance for your use case
Prevent regressions: Catch breaking changes in your MCP service with CI/CD
Optimize costs: Track token usage and identify the most cost-effective models

How it compares

Capability	testmcpy	MCP Inspector	MCPJam	promptfoo
Automated LLM-driven evals of MCP servers	✅ YAML suites, 40+ evaluators	❌ manual testing	✅	⚠️ generic LLM eval with an MCP provider
Multi-provider (Claude / GPT / Gemini / Ollama / Bedrock…)	✅ 11 providers incl. agent SDKs	n/a	✅	✅
CI gate with exit codes + JUnit	✅ `--gate`, `--junit-xml`	❌	✅	✅
Cost & token tracking per test/model	✅	❌	⚠️	⚠️
Multi-turn, mutation & metamorphic testing	✅	❌	❌	⚠️
Auth testing (JWT/OAuth/mTLS) + debugger	✅ 7 auth types	⚠️ OAuth only	✅ OAuth debugger	❌
Python-native (`pip`/`uvx`, pytest-friendly)	✅	❌ npm	❌ npm	❌ npm

Use MCP Inspector for quick manual poking; reach for testmcpy when you want repeatable, scored, CI-gated evaluation of how real models use your server.

Quick Start

# Install testmcpy
pip install testmcpy

# Run interactive setup
testmcpy setup

# Start testing
testmcpy chat                     # Interactive chat with MCP tools
testmcpy research                 # Test LLM tool-calling capabilities
testmcpy run tests/              # Run your test suite

That's it! No complex configuration needed to get started.

Key Features

Multi-Provider LLM Support

Test with Claude, GPT, Gemini, Llama, and other models. Works with both paid APIs and free local models via Ollama. Includes agent-SDK providers (Claude, Codex, Gemini) with native MCP support.

Provider	Config name	Models	Features
Anthropic	`anthropic`	claude-opus-4, claude-sonnet-4-5, claude-haiku-4-5	Native MCP, extended thinking, vision, token caching
OpenAI	`openai`	gpt-4, gpt-4-turbo, gpt-4o	Function calling, vision, cost tracking
Ollama	`ollama`	Llama, Mistral, etc. (local)	Free, local execution, no API costs
Claude SDK	`claude-sdk` (aliases: `claude-cli`, `claude-code`)	claude-sonnet-4-5, claude-opus-4	Claude Agent SDK, native MCP, CLI OAuth login
Codex SDK	`codex-sdk` (aliases: `codex-cli`, `codex`)	gpt-5-codex, o3, o4-mini	openai-agents SDK, native MCP, Codex CLI OAuth or API key
Gemini SDK	`gemini-sdk`	gemini-sdk-flash, gemini-sdk-pro	google-adk, native MCP
Google Gemini	`gemini` (alias: `google`)	gemini-2.5-flash, gemini-2.5-pro	Direct Gemini API, function calling
Gemini CLI	`gemini-cli`	gemini-2.5-flash, gemini-2.5-pro	Subprocess-based Gemini CLI
AWS Bedrock	`bedrock` (alias: `aws-bedrock`)	Claude models via AWS	IAM auth, no Anthropic key needed
xAI	`xai` (alias: `grok`)	grok models	Function calling
OpenRouter	`openrouter`	100+ models with one API key	Function calling, cost tracking

LLM Profiles — manage Anthropic, OpenAI, Ollama and Claude SDK provider configurations

Built-in Evaluators

Comprehensive validation out of the box. Each evaluator returns a score from 0.0 to 1.0 with pass/fail status and detailed reasoning.

Tool Calling:

was_mcp_tool_called — Verify specific tool was invoked (supports prefix/gateway matching)
tool_call_count — Validate number of tool calls
tool_called_with_parameter — Check specific parameter was passed (fuzzy matching)
tool_called_with_parameters — Validate multiple parameters at once
parameter_value_in_range — Ensure numeric parameters are within bounds

Execution & Performance:

execution_successful — Check for errors or failures in tool results
within_time_limit — Performance validation against max_seconds
final_answer_contains — Validate response content
token_usage_reasonable — Cost efficiency validation
response_time_acceptable — Latency threshold checking
auth_successful — Authentication flow validation

Extensible: Extend BaseEvaluator and implement evaluate(context) -> EvalResult to create custom evaluators for your domain.
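
To make the score, pass/fail, and reasoning contract concrete, here is a minimal sketch of a range check in the spirit of parameter_value_in_range. It is not testmcpy's implementation: the EvalResult dataclass is a local stand-in so the snippet runs on its own, and the class name, constructor arguments, and the shape of context["tool_calls"] are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    """Local stand-in for testmcpy's EvalResult so this sketch is runnable."""
    passed: bool
    score: float
    reason: str


class ParameterInRange:
    """Hypothetical evaluator: is a numeric tool parameter within bounds?"""

    def __init__(self, tool_name, parameter, min_value, max_value):
        self.tool_name = tool_name
        self.parameter = parameter
        self.min_value = min_value
        self.max_value = max_value

    def evaluate(self, context: dict) -> EvalResult:
        # Assumed context shape: {"tool_calls": [{"name": ..., "arguments": {...}}]}
        for call in context.get("tool_calls", []):
            if call.get("name") != self.tool_name:
                continue
            value = call.get("arguments", {}).get(self.parameter)
            is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
            if is_number and self.min_value <= value <= self.max_value:
                return EvalResult(True, 1.0, f"{self.parameter}={value} is in range")
            return EvalResult(False, 0.0, f"{self.parameter}={value!r} is out of range")
        return EvalResult(False, 0.0, f"{self.tool_name} was not called")


evaluator = ParameterInRange("create_chart", "limit", 1, 100)
result = evaluator.evaluate(
    {"tool_calls": [{"name": "create_chart", "arguments": {"limit": 10}}]}
)
print(result.passed, result.score, result.reason)
```

Only the first matching call is judged here; an evaluator that needs to check every call would loop to the end and aggregate.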

Reports — combined view of every test run, evaluator scores, and cost analysis

YAML Test Definitions

Define test suites as code for repeatable, version-controlled testing:

version: "1.0"
name: "Chart Operations Test Suite"

config:
  timeout: 30
  model: "claude-sonnet-4-5"
  provider: "anthropic"

tests:
  - name: "test_create_chart"
    prompt: "Create a bar chart showing sales by region"
    evaluators:
      - name: "was_mcp_tool_called"
        args:
          tool_name: "create_chart"
      - name: "execution_successful"

  # Multi-turn test
  - name: "test_multi_turn"
    steps:
      - prompt: "List all dashboards"
        evaluators:
          - name: "was_mcp_tool_called"
            args:
              tool_name: "list_dashboards"
      - prompt: "Show me the first one"
        evaluators:
          - name: "final_answer_contains"
            args:
              content: "dashboard"

  # Load testing
  - name: "test_load"
    prompt: "List dashboards"
    load_test:
      concurrent: 5
      duration: 60
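
Before committing a suite, it can help to sanity-check its shape. The sketch below works on a plain dict, as a YAML parser would produce, and checks only what the example above shows: top-level version, name, and tests keys, and tests that carry either a prompt or a list of steps. The function name and the rules themselves are assumptions inferred from that example, not testmcpy's actual schema.

```python
def validate_suite(suite: dict) -> list[str]:
    """Return a list of problems; an empty list means the shape looks right."""
    problems = []
    for key in ("version", "name", "tests"):
        if key not in suite:
            problems.append(f"missing top-level key: {key}")

    for index, test in enumerate(suite.get("tests", [])):
        label = test.get("name", f"tests[{index}]")
        # A test is either single-turn ("prompt") or multi-turn ("steps").
        if ("prompt" in test) == ("steps" in test):
            problems.append(f"{label}: needs exactly one of 'prompt' or 'steps'")
        # Evaluators may sit on the test itself or on each step.
        blocks = [test] + list(test.get("steps", []))
        for block in blocks:
            for evaluator in block.get("evaluators", []):
                if "name" not in evaluator:
                    problems.append(f"{label}: evaluator without a 'name'")
        for step in test.get("steps", []):
            if "prompt" not in step:
                problems.append(f"{label}: step without a 'prompt'")
    return problems


suite = {
    "version": "1.0",
    "name": "Chart Operations Test Suite",
    "tests": [
        {
            "name": "test_create_chart",
            "prompt": "Create a bar chart showing sales by region",
            "evaluators": [
                {"name": "was_mcp_tool_called", "args": {"tool_name": "create_chart"}},
                {"name": "execution_successful"},
            ],
        },
        {
            "name": "test_multi_turn",
            "steps": [
                {"prompt": "List all dashboards", "evaluators": []},
                {"prompt": "Show me the first one", "evaluators": []},
            ],
        },
    ],
}
print(validate_suite(suite))
```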

CLI & Web UI

Rich terminal UI: Progress bars, colored output, formatted tables
Optional web interface: Visual tool explorer, interactive chat, analytics dashboards
Real-time feedback: Watch tests execute with live updates via WebSocket

Chat Interface — interactive chat against your MCP service from the browser

Architecture

testmcpy connects your LLM provider to your MCP service and validates the interactions:

graph TB
    subgraph UI["User Interface Layer"]
        CLI["CLI Commands<br>(Typer)"]
        WebUI["Web UI<br>(React + Vite + Tailwind)"]
        TUI["Terminal Dashboard<br>(Textual)"]
    end

    subgraph Core["Core Framework"]
        Runner["Test Runner"]
        LLM["LLM Integration"]
        Evals["Evaluators"]
    end

    subgraph MCP_Layer["MCP Integration Layer"]
        Client["MCP Client<br>(FastMCP)"]
        Auth["Auth Manager"]
        Discovery["Tool Discovery"]
    end

    subgraph External["External Services"]
        LLM_APIs["LLM APIs<br>(Anthropic, OpenAI, Ollama)"]
        MCP_Services["MCP Services<br>(HTTP/SSE)"]
        Storage["Storage<br>(SQLite + JSON)"]
    end

    UI --> Core
    Core --> MCP_Layer
    MCP_Layer --> External
    Core --> External

How it works:

Define test cases in YAML with prompts and expected behavior
testmcpy sends prompts to your chosen LLM (Claude, GPT-4, Llama, etc.)
The LLM calls tools on your service via MCP
Evaluators validate tool selection, parameters, execution, and performance
Get detailed pass/fail results with metrics and cost analysis

Installation

# Install base package
pip install testmcpy

# With web UI support
pip install 'testmcpy[server]'

# All optional features
pip install 'testmcpy[all]'

Requirements: Python 3.10-3.12

Getting Started

1. Configuration

Run the interactive setup wizard:

testmcpy setup

This creates two config files:

.llm_providers.yaml — LLM configuration:

default: prod

profiles:
  prod:
    name: "Production"
    providers:
      - name: "Claude Sonnet"
        provider: "anthropic"
        model: "claude-sonnet-4-5"
        api_key: "your-anthropic-api-key"
        timeout: 60
        default: true

.mcp_services.yaml — MCP server profiles:

default: prod

profiles:
  prod:
    name: "Production"
    mcps:
      - name: "My MCP Service"
        mcp_url: "https://your-service.example.com/mcp"
        auth:
          auth_type: "jwt"  # or "bearer", "oauth", "none"
          api_url: "https://auth.example.com/v1/auth/"
          api_token: "your-api-token"
          api_secret: "your-api-secret"
        timeout: 30
        rate_limit_rpm: 60
        default: true

Configuration priority: CLI options > Profile files > .env > User config (~/.testmcpy) > Environment variables > Built-in defaults
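
The priority chain reads as "first source that sets a value wins". A minimal sketch of that lookup, using the standard library's collections.ChainMap, is shown below. The function name and the dict-per-source layout are assumptions for illustration; how testmcpy actually merges its sources is not shown in this document.

```python
from collections import ChainMap


def resolve_config(cli, profile, dotenv, user, environ, defaults):
    """Merge config layers so that earlier (higher-priority) layers win."""
    layers = [cli, profile, dotenv, user, environ, defaults]
    # Drop unset (None) values so a lower-priority layer can fill them in.
    cleaned = [{k: v for k, v in layer.items() if v is not None} for layer in layers]
    # ChainMap searches its mappings left to right and returns the first hit.
    return dict(ChainMap(*cleaned))


resolved = resolve_config(
    cli={"model": "claude-haiku-4-5", "timeout": None},  # --timeout not passed
    profile={"timeout": 60},
    dotenv={},
    user={},
    environ={"timeout": 10, "provider": "anthropic"},
    defaults={"timeout": 30, "verbose": False},
)
print(resolved["timeout"])   # 60: the profile file outranks env vars and defaults
```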

The setup command is idempotent — safe to run multiple times. Use --force to overwrite existing files.

TESTMCPY_CHAT_OAUTH_LOGIN (default true): when a chat message hits an OAuth (oauth_auto_discover) MCP profile with no cached token, the server opens the interactive browser OAuth flow and retries. This assumes a browser is available on the machine running the server — in headless deployments set TESTMCPY_CHAT_OAUTH_LOGIN=false so the request fails fast with a clear error instead of blocking on a login that can never complete.
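
The behavior described above amounts to a boolean environment flag that defaults to true. The sketch below shows one way such a flag could be read; the function names and the set of strings treated as false are assumptions, and testmcpy may parse the variable differently.

```python
import os

# Assumption: these spellings count as "false"; testmcpy may accept others.
FALSE_VALUES = {"0", "false", "no", "off"}


def chat_oauth_login_enabled(environ=None) -> bool:
    """Read TESTMCPY_CHAT_OAUTH_LOGIN, defaulting to true when unset."""
    environ = os.environ if environ is None else environ
    raw = environ.get("TESTMCPY_CHAT_OAUTH_LOGIN")
    if raw is None:
        return True
    return raw.strip().lower() not in FALSE_VALUES


def on_missing_token(environ=None) -> str:
    """Describe what happens when an OAuth profile has no cached token."""
    if chat_oauth_login_enabled(environ):
        return "open browser login, then retry"
    return "fail fast with a clear error"


print(on_missing_token({"TESTMCPY_CHAT_OAUTH_LOGIN": "false"}))
```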

2. Explore Your MCP Service

# List available MCP tools
testmcpy tools

# Interactive chat to explore your tools
testmcpy chat

# Run automated research on tool-calling capabilities
testmcpy research --model claude-haiku-4-5

3. Create and Run Test Suites

# tests/my_tests.yaml
version: "1.0"
name: "My MCP Service Tests"

tests:
  - name: "test_tool_selection"
    prompt: "Create a bar chart showing sales by region"
    evaluators:
      - name: "was_mcp_tool_called"
        args:
          tool_name: "create_chart"
      - name: "execution_successful"
      - name: "within_time_limit"
        args:
          max_seconds: 30

testmcpy run tests/ --model claude-haiku-4-5

Commands Reference

The highlights are below — the full reference for all 38 commands lives at preset-io.github.io/testmcpy/cli.

Command	Description
Setup
`testmcpy setup`	Interactive configuration wizard
`testmcpy doctor`	Diagnose installation issues
Discovery
`testmcpy tools`	List available MCP tools
`testmcpy profiles`	List MCP profiles (table)
`testmcpy status`	Show MCP connection status
`testmcpy explore-cli`	Browse tools (non-interactive)
Testing
`testmcpy run <path>`	Execute test suite
`testmcpy research`	Test LLM tool-calling capabilities
`testmcpy chat`	Interactive chat with MCP tools
`testmcpy compare`	Multi-model comparison
Quality & Benchmarking
`testmcpy bench`	Run a suite across models × profiles × repeats
`testmcpy conformance`	Run the official MCP spec conformance suite
`testmcpy score`	Grade tool surface for LLM usability (0-100, A-F)
`testmcpy scan`	Static security scan of tool metadata (SARIF output)
`testmcpy matrix` / `leaderboard` / `flaky`	Per-test × per-config analytics
Advanced
`testmcpy baseline-save`	Save current test results as a named baseline
`testmcpy baseline-compare`	Compare a run against a saved baseline
`testmcpy baseline-list`	List saved baselines
`testmcpy mutate`	Prompt mutation testing
`testmcpy metamorphic`	Metamorphic testing
`testmcpy generate`	AI-assisted test generation
`testmcpy smoke-test`	Quick smoke test against an MCP service
`testmcpy coverage`	Tool coverage report for a test suite
`testmcpy multi-env`	Run the same suite against multiple MCP profiles
`testmcpy export-db`	Export the SQLite results database
UI
`testmcpy serve`	Start web UI server (default port 8000)
`testmcpy config-cmd`	View current configuration
`testmcpy config-mcp`	Print MCP client snippets for Claude Desktop / Code

Common options: --profile, --llm-profile, --model, --provider, --timeout, --verbose, --output

Inline MCP Auth (No Config File Needed)

Pass MCP auth credentials directly on the command line, bypassing .mcp_services.yaml:

# JWT auth (e.g., Preset workspaces)
testmcpy run tests/ \
  --mcp-url https://workspace.example.com/mcp \
  --auth-type jwt \
  --jwt-url https://auth.example.com/v1/auth/ \
  --jwt-token $MCP_JWT_TOKEN \
  --jwt-secret $MCP_JWT_SECRET

# Bearer token auth
testmcpy run tests/ \
  --mcp-url https://workspace.example.com/mcp \
  --auth-type bearer \
  --auth-token $MCP_BEARER_TOKEN

# No auth (public MCP endpoint)
testmcpy run tests/ \
  --mcp-url https://workspace.example.com/mcp \
  --auth-type none

Environment variables are also supported: MCP_AUTH_TOKEN, MCP_JWT_URL, MCP_JWT_TOKEN, MCP_JWT_SECRET.

Web Interface

Optional React-based UI for visual testing and analytics — every page is documented at preset-io.github.io/testmcpy/web-ui:

Test Manager — browse YAML suites, kick off runs, watch results stream in

# Install with UI support
pip install 'testmcpy[server]'

# Start server
testmcpy serve

The UI accepts loopback Host headers by default. For LAN, container, or reverse-proxy access, bind on all interfaces and explicitly list every hostname or IP clients will use (the option is repeatable):

testmcpy serve --host 0.0.0.0 \
  --allowed-host testmcpy.example.com \
  --allowed-host 192.0.2.10 \
  --no-browser

TESTMCPY_ALLOWED_HOSTS=testmcpy.example.com,192.0.2.10 provides the same host policy for deployments configured through environment variables. Values are hostnames or IP addresses only, without a URL scheme or port. A global * is rejected by testmcpy serve because it would disable DNS-rebinding protection.
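
The host policy can be pictured as two steps: parse the allow-list strictly, then compare each request's Host header against it. The sketch below follows the rules stated above (bare hostnames or IPs only, no scheme, no port, no global *) and treats loopback names as always allowed. The function names, the loopback set, and the error messages are assumptions; this is not testmcpy's own validation code.

```python
# Assumption: these are the loopback names accepted by default.
LOOPBACK_HOSTS = {"localhost", "127.0.0.1", "::1"}


def parse_allowed_hosts(raw: str) -> set[str]:
    """Parse a comma-separated allow-list such as TESTMCPY_ALLOWED_HOSTS."""
    hosts = set()
    for item in raw.split(","):
        host = item.strip().lower()
        if not host:
            continue
        if host == "*":
            raise ValueError("a global '*' would disable DNS-rebinding protection")
        if "://" in host or "/" in host:
            raise ValueError(f"{host!r}: use a bare hostname or IP, not a URL")
        if host.count(":") == 1:  # one colon means host:port, not IPv6
            raise ValueError(f"{host!r}: do not include a port")
        hosts.add(host)
    return hosts


def host_header_allowed(host_header: str, allowed: set[str]) -> bool:
    """Check a request's Host header against loopback plus the allow-list."""
    host = host_header.strip().lower()
    if host.startswith("["):        # bracketed IPv6, e.g. [::1]:8000
        host = host[1:].split("]", 1)[0]
    elif host.count(":") == 1:      # hostname or IPv4 followed by a port
        host = host.rsplit(":", 1)[0]
    return host in LOOPBACK_HOSTS or host in allowed


allowed = parse_allowed_hosts("testmcpy.example.com,192.0.2.10")
print(host_header_allowed("testmcpy.example.com:8000", allowed))
print(host_header_allowed("evil.example.net", allowed))
```

Rejecting unknown Host headers is what blocks DNS rebinding: a hostile page can point its own domain at your server's IP, but the browser still sends that hostile domain in the Host header.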

Route	Page	Description
`/`	MCP Explorer	Tool discovery, smoke tests, schema viewing
`/tests`	Test Manager	YAML test browser, execution, results
`/reports`	Reports	All test results, evaluations, cost analysis
`/chat`	Chat Interface	Multi-turn conversation with MCP tools
`/performance`	Performance	Per-test matrix and config leaderboard (also serves `/metrics`, `/compare`)
`/servers`	Servers	Health monitoring + cross-server schema compatibility (also serves `/mcp-health`, `/compatibility`)
`/security`	Security Dashboard	Security evaluator results and risk summary
`/generation-history`	Generation History	AI test generation logs
`/auth-debugger`	Auth Debugger	Auth flow debugging
`/config`	Configuration	Settings and environment
`/mcp-profiles`	MCP Profiles	MCP server configuration
`/llm-profiles`	LLM Profiles	LLM provider configuration

Access at http://localhost:8000.

More screenshots

Generation History — AI-assisted test generation runs
Auth Debugger — step through OAuth / JWT / Bearer flows
Performance — per-test results across model and MCP configurations
Leaderboard — configs ranked by pass rate, cost-per-pass, latency
Security Dashboard — security evaluator results and risk summary
Schema Compat — cross-server tool schema compatibility matrix
Servers — MCP server health monitoring
MCP Profiles — manage MCP service connections
LLM Profiles — provider configurations with model pricing
Configuration — current settings and client snippets

LLM Providers

Anthropic (Recommended)

Best tool-calling accuracy, native MCP support:

# .llm_providers.yaml
profiles:
  prod:
    name: "Production"
    providers:
      - name: "Claude Sonnet"
        provider: "anthropic"
        model: "claude-sonnet-4-5"
        api_key_env: "ANTHROPIC_API_KEY"
        default: true

Ollama (Free, Local)

Perfect for development without API costs:

brew install ollama  # macOS
ollama serve
ollama pull llama3.1:8b

# Goes under profiles: in .llm_providers.yaml
local:
  name: "Local Only"
  providers:
    - name: "Ollama Llama"
      provider: "ollama"
      model: "llama3.1:8b"
      base_url: "http://localhost:11434"
      default: true

OpenAI

# Goes under profiles: in .llm_providers.yaml
openai:
  name: "OpenAI"
  providers:
    - name: "GPT-4"
      provider: "openai"
      model: "gpt-4-turbo"
      api_key_env: "OPENAI_API_KEY"
      default: true

CI in 60 Seconds

Gate your MCP service on eval results in any CI system — no wrapper required:

# .github/workflows/mcp-tests.yml
name: MCP tests
on: [push, pull_request]
jobs:
  mcp-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v5
      - name: Run MCP eval suite
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          MCP_URL: ${{ vars.MCP_URL }}  # repository variable holding your MCP endpoint
        run: |
          uvx testmcpy run tests/ \
            --mcp-url "$MCP_URL" \
            --gate --min-pass-rate 85 \
            --junit-xml junit.xml

--gate exits non-zero when the run fails your thresholds, so the build fails. Tune thresholds in .testmcpy-gate.yaml:

min_pass_rate: 85.0       # % of tests that must pass
max_failures: 3           # absolute failure budget
required_tests:           # these must always pass
  - critical_auth_flow
block_on_regression: true # fail on baseline regressions
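
To show how the four thresholds could combine into a single pass/fail decision, here is a minimal sketch. The GateConfig dataclass, the evaluate_gate function, and the per-test boolean result format are hypothetical; the defaults simply mirror the example file above and are not confirmed testmcpy defaults.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class GateConfig:
    """Hypothetical mirror of the keys in .testmcpy-gate.yaml."""
    min_pass_rate: float = 85.0
    max_failures: int = 3
    required_tests: list = field(default_factory=list)
    block_on_regression: bool = True


def evaluate_gate(results: dict, config: GateConfig,
                  baseline: Optional[dict] = None) -> list:
    """Return the reasons the gate fails; an empty list means it passes.

    `results` and `baseline` map test names to True (passed) or False.
    """
    reasons = []
    total = len(results)
    failures = sum(1 for passed in results.values() if not passed)
    pass_rate = 100.0 * (total - failures) / total if total else 0.0

    if pass_rate < config.min_pass_rate:
        reasons.append(f"pass rate {pass_rate:.1f}% is below {config.min_pass_rate}%")
    if failures > config.max_failures:
        reasons.append(f"{failures} failures exceed the budget of {config.max_failures}")
    for name in config.required_tests:
        if not results.get(name, False):
            reasons.append(f"required test failed or missing: {name}")
    if config.block_on_regression and baseline:
        # A regression is a test that passed in the baseline but not now.
        regressed = sorted(n for n, was_ok in baseline.items()
                           if was_ok and not results.get(n, False))
        if regressed:
            reasons.append("regressions: " + ", ".join(regressed))
    return reasons


config = GateConfig(required_tests=["critical_auth_flow"])
results = {f"test_{i}": True for i in range(9)}
results["critical_auth_flow"] = False
print(evaluate_gate(results, config))
```

A CI wrapper would turn this into an exit code, for example sys.exit(1 if reasons else 0), which is the effect the --gate flag is described as having.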

--junit-xml emits JUnit XML for CI systems that ingest it natively (Jenkins, GitLab, CircleCI, Buildkite). On GitHub Actions, pair it with an action like dorny/test-reporter — or just rely on the next bullet.
Inside GitHub Actions, the markdown eval report is automatically appended to the job summary — results render on the workflow run page with zero extra steps.

Or use the bundled reusable Action — adds a sticky PR comment, JUnit artifact upload, and structured outputs (pass-rate, gate_passed):

- uses: preset-io/testmcpy@v1
  with:
    test_path: tests/
    mcp_url: ${{ vars.MCP_URL }}
    pass_threshold: '85'
    pr_comment: 'true'
    anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}
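
If your CI system does not ingest JUnit XML natively, the report is still easy to post-process. The sketch below computes a pass rate from a JUnit-style file using only the standard library. It assumes the conventional layout of testcase elements with failure or error children; the exact attributes testmcpy writes are not documented here, and the sample XML is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the conventional JUnit layout.
SAMPLE = """<testsuites>
  <testsuite name="My MCP Service Tests" tests="4" failures="1">
    <testcase name="test_tool_selection"/>
    <testcase name="test_create_chart"/>
    <testcase name="test_multi_turn">
      <failure message="was_mcp_tool_called failed"/>
    </testcase>
    <testcase name="test_load"/>
  </testsuite>
</testsuites>"""


def junit_pass_rate(xml_text: str) -> float:
    """Percentage of testcase elements without a failure or error child."""
    root = ET.fromstring(xml_text)
    # iter() walks the whole tree, so <testsuites> and <testsuite> roots both work.
    cases = list(root.iter("testcase"))
    if not cases:
        return 0.0
    failed = sum(
        1 for case in cases
        if case.find("failure") is not None or case.find("error") is not None
    )
    return 100.0 * (len(cases) - failed) / len(cases)


print(junit_pass_rate(SAMPLE))  # prints 75.0
```

Skipped test cases count as passed in this sketch; adjust the filter if your pipeline treats them differently.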

Custom Evaluators

Extend testmcpy with domain-specific validation:

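from testmcpy.evals.base_evaluators import BaseEvaluator, EvalResult

class MyEvaluator(BaseEvaluator):
    """Pass when the model's response contains the word "expected"."""

    def evaluate(self, context: dict) -> EvalResult:
        # Default to an empty string so a missing response fails cleanly.
        response = context.get("response", "")
        passed = "expected" in response
        return EvalResult(
            passed=passed,
            score=1.0 if passed else 0.0,  # scores run from 0.0 to 1.0
            reason=f"Check passed: {passed}",
        )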

See the Evaluator Reference and the Custom Evaluators guide for complete documentation.

Examples

Check out the examples/ directory for:

Basic test suites — Simple examples to get started
CI/CD integration — GitHub Actions and GitLab CI workflows
Custom evaluators — Building domain-specific validation
Multi-model comparison — Benchmarking different LLMs

Contributing

We welcome contributions of all kinds: bug reports, feature requests, documentation improvements, and code.

Read the Contributing Guide to get started.

Community & Support

Issues: Report bugs or request features
Discussions: Ask questions and share ideas
Documentation: preset-io.github.io/testmcpy (agent-facing source docs live in context/)
Examples: Explore examples/ for sample code

License

Apache License 2.0 — See LICENSE for details.

Built by @aminghadersohi at Preset.

Release history

This version

0.11.9

Jul 15, 2026

0.11.8

Jul 10, 2026

0.11.7

Jul 6, 2026

0.11.6

Jun 28, 2026

0.11.4

Jun 28, 2026

0.11.3

Jun 27, 2026

0.11.1

Jun 27, 2026

0.11.0

Jun 27, 2026

0.10.3

Jun 12, 2026

0.10.1

Jun 12, 2026

0.9.2

Jun 11, 2026

0.9.0

Jun 11, 2026

0.8.0

Jun 11, 2026

0.7.26

Jun 10, 2026

0.7.24

Jun 9, 2026

0.7.23

Jun 8, 2026

0.7.22

Jun 8, 2026

0.7.21

Jun 8, 2026

0.7.20

Jun 8, 2026

0.7.18

Jun 8, 2026

0.7.17

Jun 8, 2026

0.7.16

Jun 7, 2026

0.7.15

Jun 6, 2026

0.7.12

Jun 6, 2026

0.7.10

Jun 6, 2026

0.7.7

Jun 6, 2026

0.7.6

Jun 6, 2026

0.7.5

May 28, 2026

0.7.4

May 21, 2026

0.7.3

May 8, 2026

0.7.2

May 6, 2026

0.7.1

May 6, 2026

0.7.0

May 5, 2026

0.6.1

May 5, 2026

0.5.1

May 5, 2026

0.5.0

May 4, 2026

0.4.0

May 2, 2026

0.3.2

Apr 23, 2026

0.3.1

Apr 22, 2026

0.3.0

Apr 17, 2026

0.2.17

Dec 19, 2025

0.2.16

Dec 19, 2025

0.2.15

Dec 19, 2025

0.2.14

Dec 19, 2025

0.2.13

Dec 19, 2025

0.2.12

Dec 19, 2025

0.2.11

Dec 18, 2025

0.2.10

Dec 18, 2025

0.2.9

Dec 18, 2025

0.2.8

Dec 18, 2025

0.2.7

Dec 18, 2025

0.2.6

Dec 18, 2025

0.2.4

Nov 4, 2025

0.2.3

Nov 1, 2025

0.2.2

Nov 1, 2025

0.2.1

Nov 1, 2025

0.2.0

Oct 18, 2025

0.1.15

Oct 17, 2025

0.1.13

Oct 17, 2025

0.1.12

Oct 17, 2025

0.1.11

Oct 17, 2025

0.1.10

Oct 17, 2025

0.1.9

Oct 17, 2025

0.1.8

Oct 17, 2025

0.1.7

Oct 17, 2025

0.1.6

Oct 17, 2025

0.1.5

Oct 17, 2025

0.1.4

Oct 17, 2025

0.1.3

Oct 16, 2025

0.1.2

Oct 16, 2025

0.1.1

Oct 16, 2025

0.1.0

Oct 16, 2025

Download files

Source Distribution

testmcpy-0.11.9.tar.gz (1.2 MB view details)

Uploaded Jul 15, 2026 Source

Built Distribution

testmcpy-0.11.9-py3-none-any.whl (1.3 MB view details)

Uploaded Jul 15, 2026 Python 3

File details

Details for the file testmcpy-0.11.9.tar.gz.

File metadata

Download URL: testmcpy-0.11.9.tar.gz
Upload date: Jul 15, 2026
Size: 1.2 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for testmcpy-0.11.9.tar.gz
Algorithm	Hash digest
SHA256	`fe21a69e14f24d4881b6cabc76a6eeef77dd5d18967e3bea21aa5cee68c88dfd`
MD5	`e2db8bc72c11af5a65119d4bdd5e9e62`
BLAKE2b-256	`c0603f0986aad765a4f3b8f6a40eb3f8fea8622638a17e0c8ec425a19a63b7c6`

File details

Details for the file testmcpy-0.11.9-py3-none-any.whl.

File metadata

Download URL: testmcpy-0.11.9-py3-none-any.whl
Upload date: Jul 15, 2026
Size: 1.3 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for testmcpy-0.11.9-py3-none-any.whl
Algorithm	Hash digest
SHA256	`378fe1fded7df995c15624c15ceddb8cfb6f4a487492d07a3bd33b646379a533`
MD5	`c9c44b07c609b7c0058d80232690dc0a`
BLAKE2b-256	`5f6638fc10283a13fbae4993c7be62f155cdf2f10275b9d9f2edaeda52385fd4`
