
A comprehensive testing framework for validating LLM tool calling capabilities with MCP services


testmcpy

Test and benchmark LLMs with MCP tools in minutes.

A testing framework for validating how LLMs call tools via the Model Context Protocol (MCP). Compare the accuracy, cost, and performance of Claude, GPT-4, Llama, and other models.

Python 3.9+ License PyPI

[Screenshot: CLI test runner with colorful progress bars and results]

[Screenshot: Web UI showing tool explorer and interactive chat]

[GIF: Running a test suite from command line with real-time progress]


Documentation | Examples | Contributing | Discussions


Why testmcpy?

  • Validate tool calling: Ensure LLMs call the right tools with correct parameters
  • Compare models: Find the best price/performance balance for your use case
  • Prevent regressions: Catch breaking changes in your MCP service with CI/CD
  • Optimize costs: Track token usage and identify the most cost-effective models

Quick Start

# Install testmcpy
pip install testmcpy

# Run interactive setup
testmcpy setup

# Start testing
testmcpy chat                     # Interactive chat with MCP tools
testmcpy research                 # Test LLM tool-calling capabilities
testmcpy run tests/               # Run your test suite

That's it! No complex configuration needed to get started.

Key Features

Multi-Provider Support

Test with Claude, GPT-4, Llama, and other models. Works with both paid APIs and free local models via Ollama.

[Screenshot: Model selector showing Claude, GPT-4, and Ollama options]

Built-in Evaluators

Comprehensive validation out of the box:

  • Tool Selection: Did the LLM call the right tool?
  • Parameter Validation: Were correct parameters passed?
  • Execution Success: Did the tool call complete without errors?
  • Performance: Response time and token usage tracking
  • Cost Analysis: Monitor API costs across test runs

[Screenshot: Test results showing pass/fail for different evaluators]

Beautiful CLI & Web UI

  • Rich terminal UI: Progress bars, colored output, formatted tables
  • Optional web interface: Visual tool explorer and interactive chat
  • Real-time feedback: Watch tests execute with live updates

[Screenshot: Split view of CLI and Web UI running the same test]

YAML Test Definitions

Define test suites as code for repeatable, version-controlled testing:

version: "1.0"
name: "Chart Operations Test Suite"

tests:
  - name: "test_create_chart"
    prompt: "Create a bar chart showing sales by region"
    evaluators:
      - name: "was_mcp_tool_called"
        args:
          tool_name: "create_chart"
      - name: "execution_successful"

Use Cases

Perfect for:

  • LLM Benchmarking: Compare tool-calling accuracy across Claude, GPT-4, and Llama
  • MCP Service Testing: Validate your MCP integrations work correctly
  • Regression Prevention: Catch breaking changes in CI/CD pipelines
  • Model Selection: Make data-driven decisions about which LLM to use
  • Cost Optimization: Find the best price/performance balance for your workload
  • Parameter Validation: Ensure LLMs pass correct parameters to your tools
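As a rough illustration of the benchmarking and model-selection use cases, the sketch below aggregates per-test outcomes into a pass rate per model. The record layout, the model names, and the sample outcomes are all invented for illustration; this is not testmcpy's report format.

```python
from collections import defaultdict


def pass_rates(results):
    """Return {model: fraction of tests passed} from (model, passed) records."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for model, passed in results:
        totals[model] += 1
        passes[model] += int(passed)
    return {model: passes[model] / totals[model] for model in totals}


# Hypothetical outcomes, made up for this example only.
sample = [
    ("model-a", True), ("model-a", True), ("model-a", False), ("model-a", True),
    ("model-b", True), ("model-b", False),
]

rates = pass_rates(sample)
best = max(rates, key=rates.get)  # model with the highest pass rate
```

In practice, accuracy would be weighed against cost and latency rather than used alone.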

Architecture

testmcpy connects your LLM provider to your MCP service and validates the interactions:

graph TB
    subgraph "CLI Interface"
        CLI[testmcpy CLI]
        WebUI[Web UI - Optional]
    end

    subgraph "Core Framework"
        TestRunner[Test Runner]
        Evaluators[Evaluators]
        Config[Configuration Manager]
    end

    subgraph "LLM Providers"
        Anthropic[Anthropic API]
        OpenAI[OpenAI API]
        Ollama[Ollama Local]
    end

    subgraph "MCP Integration"
        MCPClient[MCP Client]
        MCPService[MCP Service<br/>HTTP/SSE]
    end

    CLI --> TestRunner
    WebUI --> TestRunner
    TestRunner --> Config
    TestRunner --> Evaluators
    TestRunner --> Anthropic
    TestRunner --> OpenAI
    TestRunner --> Ollama
    Anthropic --> MCPClient
    OpenAI --> MCPClient
    Ollama --> MCPClient
    MCPClient --> MCPService

    style CLI fill:#4A90E2
    style WebUI fill:#4A90E2
    style TestRunner fill:#50E3C2
    style MCPClient fill:#F5A623
    style MCPService fill:#BD10E0

How it works:

  1. You define test cases in YAML with prompts and expected behavior
  2. testmcpy sends each prompt to your chosen LLM (Claude, GPT-4, Llama, etc.)
  3. The LLM calls tools on your service via MCP
  4. Evaluators validate tool selection, parameters, execution, and performance
  5. You get detailed pass/fail results with metrics and cost analysis
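The steps above can be sketched in a few lines of Python. This is a simplified model of the flow, not testmcpy's implementation: `ToolCall`, `fake_llm`, `CHECKS`, and `run_test` are hypothetical names, and the LLM is replaced by a stub that always returns one tool call.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToolCall:
    name: str
    arguments: dict
    error: Optional[str] = None  # set when the tool call failed


def fake_llm(prompt: str) -> List[ToolCall]:
    """Stub for steps 2-3; a real run would query an LLM that calls MCP tools."""
    return [ToolCall("create_chart", {"chart_type": "bar", "group_by": "region"})]


def was_tool_called(calls, tool_name):
    return any(call.name == tool_name for call in calls)


def execution_successful(calls):
    return all(call.error is None for call in calls)


# Maps evaluator names from a test definition to check functions (hypothetical).
CHECKS = {
    "was_mcp_tool_called": was_tool_called,
    "execution_successful": execution_successful,
}


def run_test(test, llm):
    """Step 4: run every evaluator listed in the test against the recorded calls."""
    calls = llm(test["prompt"])
    return {
        ev["name"]: CHECKS[ev["name"]](calls, **ev.get("args", {}))
        for ev in test["evaluators"]
    }


test = {
    "name": "test_create_chart",
    "prompt": "Create a bar chart showing sales by region",
    "evaluators": [
        {"name": "was_mcp_tool_called", "args": {"tool_name": "create_chart"}},
        {"name": "execution_successful"},
    ],
}
results = run_test(test, fake_llm)
```

Each evaluator sees the same recorded tool calls, so one LLM run can be judged from several angles.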

Installation

# Install base package
pip install testmcpy

# With web UI support
pip install 'testmcpy[server]'

# All optional features
pip install 'testmcpy[all]'

Requirements: Python 3.9-3.12 (3.13+ not yet supported)

Getting Started

1. Configuration

Run the interactive setup wizard:

testmcpy setup

Or manually create ~/.testmcpy:

# MCP Service
MCP_URL=http://localhost:5008/mcp/
MCP_AUTH_TOKEN=your_bearer_token

# LLM Provider (choose one)
DEFAULT_PROVIDER=anthropic
DEFAULT_MODEL=claude-haiku-4-5
ANTHROPIC_API_KEY=sk-ant-...

Configuration priority: CLI options > .env > ~/.testmcpy > Environment variables > Defaults
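One way to picture this priority order is as a stack of lookups in which the first layer that defines a key wins. The sketch below parses `KEY=VALUE` text and layers the sources with `collections.ChainMap`. It is an illustration, not how testmcpy resolves settings internally: `parse_env_text` and `resolve` are hypothetical helpers, and the `TIMEOUT` key and the default URL are invented.

```python
from collections import ChainMap


def parse_env_text(text):
    """Parse KEY=VALUE lines, skipping blank lines and '#' comment lines."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def resolve(cli, dotenv, user_file, environ, defaults):
    """Layer the sources; ChainMap searches left to right, so earlier layers win."""
    return ChainMap(cli, dotenv, user_file, environ, defaults)


user_file = parse_env_text("""
# MCP Service
MCP_URL=http://localhost:5008/mcp/
DEFAULT_PROVIDER=anthropic
DEFAULT_MODEL=claude-haiku-4-5
""")

config = resolve(
    cli={"DEFAULT_MODEL": "claude-sonnet-4-5"},   # e.g. from a --model option
    dotenv={},
    user_file=user_file,
    environ={"DEFAULT_PROVIDER": "ollama"},
    defaults={"MCP_URL": "http://localhost:8000/", "TIMEOUT": "30"},  # invented
)
```

Here the CLI value overrides the model from `~/.testmcpy`, while the provider from `~/.testmcpy` overrides the environment variable, matching the order stated above.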

2. Test Your MCP Service

# List available MCP tools
testmcpy tools

# Interactive chat to explore your tools
testmcpy chat

# Run automated research on tool-calling capabilities
testmcpy research --model claude-haiku-4-5

3. Create Test Suites

Define tests in YAML (tests/my_tests.yaml):

version: "1.0"
name: "My MCP Service Tests"

tests:
  - name: "test_tool_selection"
    prompt: "Create a bar chart showing sales by region"
    evaluators:
      - name: "was_mcp_tool_called"
        args:
          tool_name: "create_chart"
      - name: "execution_successful"
      - name: "within_time_limit"
        args:
          max_seconds: 30

Run your tests:

testmcpy run tests/ --model claude-haiku-4-5
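Before running a suite, it can help to sanity-check its shape. The sketch below works on the suite as a plain Python dict (the form a YAML parser would produce) and reports missing fields. Which fields are actually required is an assumption drawn from the examples in this README, and `validate_suite` is a hypothetical helper, not part of testmcpy.

```python
def validate_suite(suite):
    """Return a list of problems; an empty list means the shape looks right."""
    problems = []
    for key in ("version", "name", "tests"):
        if key not in suite:
            problems.append(f"missing top-level key: {key}")
    for index, test in enumerate(suite.get("tests", [])):
        label = test.get("name", f"tests[{index}]")
        if "prompt" not in test:
            problems.append(f"{label}: missing prompt")
        if not test.get("evaluators"):
            problems.append(f"{label}: no evaluators")
        for evaluator in test.get("evaluators", []):
            if "name" not in evaluator:
                problems.append(f"{label}: evaluator without a name")
    return problems


suite = {
    "version": "1.0",
    "name": "My MCP Service Tests",
    "tests": [
        {
            "name": "test_tool_selection",
            "prompt": "Create a bar chart showing sales by region",
            "evaluators": [
                {"name": "was_mcp_tool_called", "args": {"tool_name": "create_chart"}},
                {"name": "execution_successful"},
                {"name": "within_time_limit", "args": {"max_seconds": 30}},
            ],
        }
    ],
}
problems = validate_suite(suite)
```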

Documentation

Commands Reference

Command              Description
testmcpy setup       Interactive configuration wizard
testmcpy tools       List available MCP tools
testmcpy research    Test LLM tool-calling capabilities
testmcpy run <path>  Execute test suite
testmcpy chat        Interactive chat with MCP tools
testmcpy serve       Start web UI server
testmcpy report      Compare test results across models
testmcpy config-cmd  View current configuration
testmcpy doctor      Diagnose installation issues

LLM Providers

Anthropic (Recommended)

Best tool-calling accuracy, native MCP support:

ANTHROPIC_API_KEY=sk-ant-your-key
DEFAULT_MODEL=claude-haiku-4-5  # Fast & cost-effective

Available models: claude-haiku-4-5, claude-sonnet-4-5, claude-opus-4-1

Ollama (Free, Local)

Perfect for development without API costs:

# Install Ollama
brew install ollama  # macOS
# or: curl -fsSL https://ollama.com/install.sh | sh

# Start the Ollama server (it runs in the foreground, so leave it running)
ollama serve

# In a second terminal, pull a model
ollama pull llama3.1:8b

# Configure testmcpy (in ~/.testmcpy)
DEFAULT_PROVIDER=ollama
DEFAULT_MODEL=llama3.1:8b

OpenAI

OPENAI_API_KEY=sk-your-key
DEFAULT_MODEL=gpt-4-turbo

Built-in Evaluators

testmcpy includes comprehensive evaluators for validating LLM behavior:

Tool Calling

  • was_mcp_tool_called - Verify specific tool was invoked
  • tool_call_count - Validate number of tool calls
  • tool_called_with_parameter - Check specific parameter was passed
  • tool_called_with_parameters - Validate multiple parameters
  • parameter_value_in_range - Ensure numeric parameters are valid
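To make the parameter checks concrete, here is a minimal sketch of the kind of logic such evaluators might apply to recorded tool calls. The function names, their signatures, and the dict layout of a call are assumptions for illustration; they are not the signatures of testmcpy's built-in evaluators.

```python
def called_with_parameters(calls, tool_name, expected):
    """True if some call to tool_name includes every expected key/value pair."""
    return any(
        call["name"] == tool_name
        and all(call["arguments"].get(key) == value for key, value in expected.items())
        for call in calls
    )


def parameter_in_range(calls, tool_name, parameter, minimum, maximum):
    """True if tool_name was called and every call's parameter is a number in range."""
    values = [
        call["arguments"].get(parameter) for call in calls if call["name"] == tool_name
    ]
    return bool(values) and all(
        isinstance(value, (int, float))
        and not isinstance(value, bool)  # bool is a subclass of int; reject it
        and minimum <= value <= maximum
        for value in values
    )


# Hypothetical recorded call.
calls = [{"name": "create_chart", "arguments": {"chart_type": "bar", "limit": 10}}]

type_ok = called_with_parameters(calls, "create_chart", {"chart_type": "bar"})
limit_ok = parameter_in_range(calls, "create_chart", "limit", 1, 100)
```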

Execution

  • execution_successful - Check for errors or failures
  • within_time_limit - Performance validation
  • final_answer_contains - Validate response content

Cost & Performance

  • token_usage_reasonable - Cost efficiency validation
  • Performance metrics are tracked automatically
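The arithmetic behind a cost check is simple: multiply token counts by per-token prices and compare usage against a budget. The sketch below shows that calculation. The prices are placeholders rather than any provider's real rates, and `estimate_cost` and `token_usage_ok` are hypothetical helpers, not testmcpy functions.

```python
def estimate_cost(input_tokens, output_tokens, input_price, output_price):
    """Cost in dollars, with prices given in dollars per million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def token_usage_ok(input_tokens, output_tokens, max_tokens):
    """True if the combined token count stays within the budget."""
    return input_tokens + output_tokens <= max_tokens


# Placeholder prices: $1 per million input tokens, $5 per million output tokens.
cost = estimate_cost(12_000, 3_000, input_price=1.0, output_price=5.0)
within_budget = token_usage_ok(12_000, 3_000, max_tokens=20_000)
```

With these placeholder figures the run costs (12,000 × 1 + 3,000 × 5) / 1,000,000 = 0.027 dollars.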

Extensible: Easily add custom evaluators for your domain-specific needs.
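As a sketch of what a domain-specific evaluator could look like, the example below registers a check function under a name through a small decorator-based registry. The registry, the decorator, and the `chart_type_is_allowed` check are all hypothetical; the Evaluator Reference describes the actual extension mechanism.

```python
EVALUATORS = {}  # hypothetical registry: evaluator name -> check function


def evaluator(name):
    """Decorator that registers a check function under the given name."""
    def register(func):
        EVALUATORS[name] = func
        return func
    return register


@evaluator("chart_type_is_allowed")
def chart_type_is_allowed(calls, allowed):
    """Domain-specific check: every create_chart call uses an allowed chart type."""
    chart_calls = [call for call in calls if call["name"] == "create_chart"]
    return all(call["arguments"].get("chart_type") in allowed for call in chart_calls)


# Hypothetical recorded call.
calls = [{"name": "create_chart", "arguments": {"chart_type": "bar"}}]
passed = EVALUATORS["chart_type_is_allowed"](calls, allowed=["bar", "line"])
```

Looking checks up by name keeps test definitions declarative: a test only needs the evaluator's name and its arguments.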

See Evaluator Reference for complete documentation.

For MCP Service Developers

Integrate testmcpy into your MCP service for automated testing:

# Install testmcpy in your project
pip install 'testmcpy[all]'

# Create tests for your MCP tools
mkdir -p tests
cat > tests/my_service_tests.yaml <<EOF
version: "1.0"
name: "My MCP Service Tests"
tests:
  - name: "test_tool_selection"
    prompt: "List all items"
    evaluators:
      - name: "was_mcp_tool_called"
        args:
          tool_name: "list_items"
      - name: "execution_successful"
EOF

# Run tests in CI/CD
testmcpy run tests/ --model claude-haiku-4-5

Client Usage Guide - Complete integration guide for your MCP service

CI/CD Examples - GitHub Actions and GitLab CI configurations

Web Interface

Optional React-based UI for visual testing:

[Screenshot: Web UI dashboard with tool explorer]

# Install with UI support
pip install 'testmcpy[server]'

# Start server
testmcpy serve

Features:

  • Visual MCP tool explorer
  • Interactive chat interface
  • Test management and execution
  • Real-time results display

Access at http://localhost:8000

Examples

Check out the examples/ directory for:

  • Basic test suites - Simple examples to get started
  • CI/CD integration - GitHub Actions and GitLab CI workflows
  • Custom evaluators - Building domain-specific validation
  • Multi-model comparison - Benchmarking different LLMs

Contributing

We welcome contributions, whether bug reports, feature requests, documentation improvements, or code.

Read the Contributing Guide to get started.

Quick guidelines:

  • Follow Black code formatting (100 char line length)
  • Add tests for new features
  • Ensure multi-provider compatibility (test with Ollama, Claude, GPT)
  • Document your changes
  • Be respectful and collaborative

Contributors

Built with contributions from:

Want to see your name here? Check out our Contributing Guide!

Community & Support

License

Apache License 2.0 - See LICENSE for details.

By contributing, you agree that your contributions will be licensed under Apache 2.0.


Acknowledgments

Built by the team at Preset to enable better LLM testing and integration with Apache Superset and beyond.

Special thanks to the MCP community and all our contributors!
