testmcpy - MCP Testing Framework

testmcpy is a testing framework for validating LLM tool-calling capabilities against MCP (Model Context Protocol) services. It is designed specifically for testing Superset operations.

Quick Start

Installation

From source (development):

git clone https://github.com/preset-io/testmcpy.git
cd testmcpy
pip install -e .

From PyPI:

pip install testmcpy

Via Homebrew (once published to PyPI):

brew tap preset-io/testmcpy
brew install testmcpy

See INSTALLATION.md for detailed installation instructions and distribution options.

Quick Usage

# List MCP tools
testmcpy tools
testmcpy tools --detail --filter chart

# Research LLM capabilities
testmcpy research --model claude-sonnet-4.5-20250929 --provider anthropic

# Run test suites
testmcpy run tests/ --model claude-3-5-haiku-20241022 --provider anthropic

# Interactive chat
testmcpy chat --provider anthropic --model claude-sonnet-4.5-20250929

# Compare test results
testmcpy report reports/model1.yaml reports/model2.yaml

# Initialize new project
testmcpy init my_project

Framework Structure

mcp_testing/
├── research/               # Research scripts for testing LLM capabilities
│   └── test_ollama_tools.py
├── src/                    # Core framework modules
│   ├── mcp_client.py      # MCP protocol client
│   ├── llm_integration.py # LLM provider abstraction
│   └── test_runner.py     # Test execution engine
├── evals/                  # Evaluation functions
│   └── base_evaluators.py # Standard evaluators
├── tests/                  # Test cases (YAML/JSON)
│   ├── basic_test.yaml
│   └── example_mcp_tests.yaml
├── reports/                # Test reports and comparisons
└── cli.py                  # CLI interface

Writing Test Cases

Test cases are defined in YAML files:

version: "1.0"
name: "My Test Suite"

tests:
  - name: "test_chart_creation"
    prompt: "Create a bar chart showing sales by region"
    expected_tools:
      - "create_chart"
    evaluators:
      - name: "was_mcp_tool_called"
        args:
          tool_name: "create_chart"
      - name: "execution_successful"
      - name: "final_answer_contains"
        args:
          expected_content: ["chart", "created"]
      - name: "within_time_limit"
        args:
          max_seconds: 30
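
The shape of such a suite can be checked before a run. The following is a minimal sketch, not testmcpy's own loader: it assumes the YAML has already been parsed into a Python dict, and it assumes that `name`, `prompt`, and `evaluators` are required for each test, which this README does not state. `validate_suite` is a hypothetical helper.

```python
from typing import Any

# Assumption: these keys are treated as required; the README does not say so.
REQUIRED_TEST_KEYS = ("name", "prompt", "evaluators")


def validate_suite(suite: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the suite looks well-formed."""
    problems: list[str] = []
    tests = suite.get("tests")
    if not isinstance(tests, list) or not tests:
        return ["suite has no 'tests' list"]
    for index, test in enumerate(tests):
        label = test.get("name", f"tests[{index}]")
        for key in REQUIRED_TEST_KEYS:
            if key not in test:
                problems.append(f"{label}: missing '{key}'")
        for evaluator in test.get("evaluators", []):
            if "name" not in evaluator:
                problems.append(f"{label}: evaluator without a 'name'")
    return problems


# The same structure as the YAML example, written as a dict.
suite = {
    "version": "1.0",
    "name": "My Test Suite",
    "tests": [
        {
            "name": "test_chart_creation",
            "prompt": "Create a bar chart showing sales by region",
            "expected_tools": ["create_chart"],
            "evaluators": [
                {"name": "was_mcp_tool_called", "args": {"tool_name": "create_chart"}},
                {"name": "execution_successful"},
                {"name": "within_time_limit", "args": {"max_seconds": 30}},
            ],
        }
    ],
}

print(validate_suite(suite))
```

Returning a list of problems rather than raising on the first one lets every issue in a suite be reported in a single pass.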

Available Evaluators

Generic Evaluators

  • was_mcp_tool_called - Verify MCP tool was called
  • execution_successful - Check for successful execution
  • final_answer_contains - Validate response content
  • answer_contains_link - Check for links in response
  • within_time_limit - Verify performance
  • token_usage_reasonable - Check token/cost efficiency

Superset-Specific Evaluators

  • was_superset_chart_created - Verify chart creation
  • sql_query_valid - Validate SQL syntax
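
To make the evaluator names more concrete, here is a rough sketch of what three of the generic evaluators might check. It is an illustration only: `RunResult` is a hypothetical record, not testmcpy's actual result type, and the matching rules (all expected strings must appear, case-insensitively) are assumptions. The argument names follow the YAML example.

```python
from dataclasses import dataclass, field


@dataclass
class RunResult:
    """Hypothetical record of one test run; not testmcpy's actual result type."""
    tool_calls: list[str] = field(default_factory=list)
    final_answer: str = ""
    duration_seconds: float = 0.0


def was_mcp_tool_called(result: RunResult, tool_name: str) -> bool:
    # Passes if the named tool appears among the calls the model made.
    return tool_name in result.tool_calls


def final_answer_contains(result: RunResult, expected_content: list[str]) -> bool:
    # Assumption: every expected string must appear, compared case-insensitively.
    answer = result.final_answer.lower()
    return all(item.lower() in answer for item in expected_content)


def within_time_limit(result: RunResult, max_seconds: float) -> bool:
    # Passes if the run finished within the allowed wall-clock time.
    return result.duration_seconds <= max_seconds


result = RunResult(
    tool_calls=["create_chart"],
    final_answer="Chart created: sales by region.",
    duration_seconds=4.2,
)
checks = {
    "was_mcp_tool_called": was_mcp_tool_called(result, "create_chart"),
    "final_answer_contains": final_answer_contains(result, ["chart", "created"]),
    "within_time_limit": within_time_limit(result, 30),
}
print(checks)
```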

Supported LLM Providers

  • Claude Agent SDK (claude-sdk) - Official Anthropic SDK ⚠️ Limited MCP Support
    • claude-sonnet-4.5-20250929 (newest, most capable)
    • claude-sonnet-4-20250514
    • claude-3-5-sonnet-20241022
    • claude-3-5-haiku-20241022
    • All Claude models
    • Requires: ANTHROPIC_API_KEY environment variable
    • Features: Native tool calling, streaming, hooks
    • Note: Designed for stdio-based MCP servers, not HTTP-based services
    • For HTTP MCP (like Superset): Use anthropic provider instead
  • Anthropic API (anthropic) - Direct API integration ✅ Recommended for HTTP MCP
    • claude-sonnet-4.5-20250929 (newest, recommended)
    • claude-sonnet-4-20250514
    • claude-3-5-sonnet-20241022
    • claude-3-5-haiku-20241022 (fast, cost-effective)
    • claude-3-opus-20240229
    • All Claude models via API
    • Requires: ANTHROPIC_API_KEY environment variable
    • Full support for HTTP-based MCP services (like Superset MCP)
    • Best choice for production testing with MCP tools
  • Ollama (ollama) - Local models with tool calling support
    • llama3.1:8b (recommended)
    • mistral-nemo
    • qwen2.5:7b
  • OpenAI (openai) - GPT models via API
    • Requires: OPENAI_API_KEY environment variable
  • Local (local) - Transformers-based local models
  • Claude CLI (claude-cli) - Claude Code CLI interface
    • Uses Claude Code binary

Configuration

Environment Variables

# For Claude providers (claude-sdk, anthropic)
export ANTHROPIC_API_KEY="sk-ant-..."

# For OpenAI provider
export OPENAI_API_KEY="sk-..."

# MCP service URL (optional, defaults to http://localhost:5008/mcp/)
export MCP_URL="http://localhost:5008/mcp/"

# Default model and provider (optional)
export DEFAULT_MODEL="claude-sonnet-4.5-20250929"
export DEFAULT_PROVIDER="anthropic"
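
A script that wraps testmcpy could read these variables as follows. This is a minimal sketch, not the framework's own configuration code: `resolve_settings` is a hypothetical helper, the fallback model and provider are assumptions taken from the example values, and only the MCP_URL default is documented here.

```python
import os
from collections.abc import Mapping

# Only the MCP_URL default is documented; the model and provider fallbacks
# are assumptions borrowed from the example values.
DEFAULTS = {
    "MCP_URL": "http://localhost:5008/mcp/",
    "DEFAULT_MODEL": "claude-sonnet-4.5-20250929",
    "DEFAULT_PROVIDER": "anthropic",
}

# API key variable each provider requires, per the provider list.
REQUIRED_KEY = {
    "anthropic": "ANTHROPIC_API_KEY",
    "claude-sdk": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
}


def resolve_settings(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Fill in defaults and fail early if the provider's API key is missing."""
    settings = {name: env.get(name, default) for name, default in DEFAULTS.items()}
    provider = settings["DEFAULT_PROVIDER"]
    key_name = REQUIRED_KEY.get(provider)
    if key_name and not env.get(key_name):
        raise RuntimeError(f"{key_name} must be set for provider {provider!r}")
    return settings


# Pass an explicit mapping instead of os.environ to keep the example reproducible.
print(resolve_settings({"ANTHROPIC_API_KEY": "sk-ant-example"}))
```

Checking for the API key up front surfaces a missing credential before any test is run, rather than partway through a suite.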

Configuration File

Create mcp_test_config.yaml:

mcp_url: "http://localhost:5008/mcp/"
default_model: "claude-sonnet-4.5-20250929"
default_provider: "anthropic"
evaluators:
  timeout: 30
  max_tokens: 2000
  max_cost: 0.10

Development Status

Phase 0: Research & Prototype ✅

  • Research local LLM options with tool calling
  • Build minimal Python script for LLM+MCP integration
  • Validate tool calling with selected LLM
  • Create basic framework structure

Phase 1: Foundation (In Progress)

  • CLI framework with typer + rich
  • Basic test execution engine
  • MCP protocol client
  • LLM provider abstraction
  • Core evaluation functions
  • Integration with existing Superset tests

Phase 2: Core Features (Planned)

  • Multi-model comparison support
  • Advanced reporting with charts
  • Test suite versioning
  • Parallel test execution

Phase 3: Advanced Capabilities (Future)

  • CI/CD integration
  • Interactive test development mode
  • Performance profiling
  • Cost optimization insights

Known Limitations

  • Claude SDK provider: Only supports stdio-based MCP servers (command-line tools)
    • Not compatible with HTTP-based MCP services (like Superset MCP)
    • For HTTP MCP services, use the anthropic provider, which supports them fully
  • Ollama models: Require specific formatting for reliable tool calling
  • CPU-only execution: May be slow for larger local models
  • Tool calling accuracy: Varies by model (Claude models generally most reliable)
  • Cost: Claude API providers (anthropic) incur API costs; consider using Ollama for development
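
The transport limitation above can be turned into a simple rule of thumb. The sketch below is an illustration, not part of testmcpy: `suggest_provider` is a hypothetical helper that covers only the two Claude providers and assumes any target that is not an HTTP(S) URL is a command that launches a stdio server.

```python
from urllib.parse import urlparse


def suggest_provider(mcp_target: str) -> str:
    """Suggest a provider name for an MCP target (a URL or a command line)."""
    scheme = urlparse(mcp_target).scheme.lower()
    if scheme in ("http", "https"):
        # HTTP-based MCP services, such as Superset MCP, need the anthropic provider.
        return "anthropic"
    # Assumption: anything else is a command that starts a stdio-based server.
    return "claude-sdk"


print(suggest_provider("http://localhost:5008/mcp/"))
print(suggest_provider("python my_server.py"))
```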

Contributing

This framework follows the patterns established by promptimize and superset-sup. When contributing:

  1. Use modern Python practices (type hints, async/await)
  2. Follow the existing code style
  3. Add tests for new evaluators
  4. Document new features in this README

License

Same as the parent promptimize project.
