
MCP Tool Selection Benchmark

Benchmark how accurately different LLMs select the correct MCP tool given natural-language instructions, powered by the GitHub Copilot SDK.

How It Works

  1. Load a tool registry (tools.json) and ground-truth test suite (test_suite.json).
  2. Generate 5 query variations per instruction at different ambiguity levels (explicit → misleading) using a Copilot SDK model call.
  3. Evaluate each variation against every selected model — the model is presented with the tools and must pick one or more.
  4. Score selections via exact-match and partial-credit against ground truth.
  5. Report per-model accuracy, confusion matrices, and optional description suggestions in JSON and HTML.
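
In outline, the loop above looks like this (a sketch, not the package's code; `generate_variations` and `select_tools` stand in for the Copilot SDK calls and are injected so the loop stays self-contained):

```python
def run_benchmark(tools, test_suite, models,
                  generate_variations, select_tools, n_variations=5):
    """Sketch of the benchmark loop: for every test case, generate query
    variations, ask each model to pick tools, and record exact matches."""
    results = {model: [] for model in models}
    for case in test_suite:
        # Step 2: ambiguity-graded rewrites of the instruction
        for query in generate_variations(case["instruction"], n_variations):
            for model in models:
                # Step 3: the model sees the registry and selects tool(s)
                selected = select_tools(model, tools, query)
                # Step 4: exact-match against ground truth
                results[model].append(selected == case["expected_tools"])
    return results
```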

Prerequisites

  • Python ≥ 3.10
  • GitHub Copilot CLI installed and in PATH
  • An active Copilot subscription (each query counts as a premium request)
  • Authenticated via copilot login, or GH_TOKEN / GITHUB_TOKEN env var

Installation

From PyPI:

pip install mcp-tool-selection-bench

Or for development:

cd mcp-tool-selection-bench
pip install -e ".[dev]"

Usage

Auto-generate a test suite (optional)

Don't have a test_suite.json? Generate one from your tool definitions:

mcp-bench generate --tools tools.json --output test_suite.json

This uses an LLM to create 3–5 test cases per tool (mix of single-tool and multi-tool, varying ambiguity). Review and edit the output before benchmarking.

  • --tools (required) — Path to the tool registry JSON
  • --output (default: test_suite.json) — Output path for the generated test suite
  • --per-tool (default: 4) — Number of test cases to generate per tool
  • --model — Model to use for generation

Run a benchmark

mcp-bench run \
  --tools samples/tools.json \
  --test-suite samples/test_suite.json \
  --models gpt-5 claude-sonnet-4 \
  --output results.json

run is the default subcommand, so it can be omitted; the bare invocation is kept for backward compatibility.

CLI Arguments (run)

  • --tools (required) — Path to the tool registry JSON
  • --test-suite (required) — Path to the ground-truth test suite JSON
  • --models (required) — Space-separated model names to benchmark
  • --output (default: results.json) — Output path for the report
  • --variations (default: 5) — Number of query variations per instruction
  • --generator-model (default: first --models entry) — Model used to generate query variations
  • -v / --verbose (default: off) — Enable DEBUG logging
  • --html — Output path for an HTML report with confusion-matrix heatmaps
  • --suggest (default: off) — Generate tool-description improvement suggestions (extra API calls)
  • --suggest-threshold (default: 0.7) — Accuracy threshold below which to suggest description improvements
  • --fail-under — Exit with code 1 if any model's exact-match accuracy is below this value

Input Schemas

tools.json

[
  {
    "name": "search_issues",
    "description": "Search for issues in GitHub repositories",
    "parameters": {
      "type": "object",
      "properties": {
        "query": { "type": "string", "description": "Search query" }
      },
      "required": ["query"]
    },
    "metadata": { "category": "github", "tags": ["issues", "search"] }
  }
]

test_suite.json

Supports both single-tool and multi-tool test cases:

[
  {
    "instruction": "Find all open bugs in the react repo",
    "expected_tools": ["search_issues"]
  },
  {
    "instruction": "Find open bugs in react and check the CI logs",
    "expected_tools": ["search_issues", "get_job_logs"]
  }
]

For multi-tool cases, use expected_tools (ordered list). Single-tool cases using expected_tool (string) remain supported for backward compatibility.
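
A small normalization helper makes both shapes uniform before scoring (a sketch; the field names come from the schemas above, the function itself is hypothetical):

```python
def expected_tools(case: dict) -> list[str]:
    """Return the ordered list of expected tool names for a test case,
    accepting both the "expected_tools" list and the legacy
    single-tool "expected_tool" string."""
    if "expected_tools" in case:
        return list(case["expected_tools"])
    return [case["expected_tool"]]
```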

Output

The report (results.json) contains:

  • run_metadata — timestamp, models tested, query counts
  • per_model_results — per-model accuracy (exact-match and partial-credit), per-instruction breakdown with each variation's selected tools and exact_match result
  • summary — best model, accuracy ranking, best exact-match and partial-credit scores
  • confusion_matrices — per-model confusion matrix (expected vs. selected tool counts)
  • suggestions — tool-description improvement suggestions (when --suggest is used)
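
Because the report is plain JSON, it is easy to post-process. A hedged sketch that ranks models by exact-match accuracy (the top-level keys follow the list above, but the nested exact_match_accuracy field name is an assumption):

```python
def model_ranking(report: dict) -> list[tuple[str, float]]:
    """Rank models best-first by exact-match accuracy, given a parsed
    results.json dict (e.g. from json.load)."""
    per_model = report.get("per_model_results", {})
    return sorted(
        ((name, res.get("exact_match_accuracy", 0.0))
         for name, res in per_model.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
```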

Scoring Metrics

  • Exact-match accuracy: Full ordered sequence of selected tools must match expected tools exactly
  • Partial-credit accuracy: Ordered prefix matching — counts how many tools in sequence match from the start (e.g., expected [A, B, C], selected [A, B, X] → 2/3 credit)
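
Both metrics can be reimplemented in a few lines (a minimal sketch matching the definitions above, not the package's code):

```python
def exact_match(expected: list[str], selected: list[str]) -> bool:
    """True only if the full ordered sequence matches exactly."""
    return expected == selected

def partial_credit(expected: list[str], selected: list[str]) -> float:
    """Fraction of expected tools matched in order from the start."""
    matched = 0
    for want, got in zip(expected, selected):
        if want != got:
            break
        matched += 1
    return matched / len(expected) if expected else 1.0

# The example from the text: expected [A, B, C], selected [A, B, X] -> 2/3
```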

HTML Report

Use --html report.html to generate a visual report with:

  • Model ranking table
  • Confusion-matrix heatmaps (green = correct, red = misselected)
  • Description improvement suggestions (if --suggest was used)

Regression Tracking (Diff)

Compare two benchmark runs to see what improved or regressed:

mcp-bench diff baseline.json current.json
mcp-bench diff baseline.json current.json --html diff.html
mcp-bench diff baseline.json current.json --fail-under 0.05  # fail if any model regressed >5%

The diff output shows per-model and per-instruction accuracy changes with ↑/↓/= indicators. Changes beyond ±5% are flagged as improved (green) or regressed (red).
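
The flagging rule can be sketched as follows (a hypothetical helper; the real diff subcommand compares full report files):

```python
def classify_change(baseline_acc: float, current_acc: float,
                    threshold: float = 0.05) -> str:
    """Flag a per-model accuracy change: beyond +threshold is "improved",
    beyond -threshold is "regressed", anything in between is "unchanged"."""
    delta = current_acc - baseline_acc
    if delta > threshold:
        return "improved"
    if delta < -threshold:
        return "regressed"
    return "unchanged"
```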

For MCP Server Maintainers

Want to benchmark tool selection accuracy in your MCP server's CI? Add a workflow that runs on every tool change:

  1. Add tools.json and test_suite.json to your repo (see Input Schemas)
  2. Copy the template workflow from examples/mcp-server-workflow.yml into .github/workflows/
  3. Create a GH_TOKEN repository secret with a Copilot-licensed PAT
  4. Customize the model list, paths, and thresholds in the workflow

The workflow will:

  • Run the benchmark whenever tool files change
  • Automatically diff against the previous run's results
  • Fail the build if accuracy drops below the configured threshold
  • Upload JSON + HTML reports as GitHub Actions artifacts

Running Tests

pytest tests/

Project Structure

mcp-tool-selection-bench/
├── pyproject.toml
├── README.md
├── .github/workflows/
│   ├── ci.yml              # CI pipeline: test & build on every push/PR
│   └── publish.yml         # Publish to PyPI on GitHub Release
├── examples/
│   └── mcp-server-workflow.yml  # Template workflow for MCP server repos
├── src/mcp_bench/
│   ├── cli.py              # CLI entry point (run + diff + generate subcommands)
│   ├── models.py           # Pydantic data models
│   ├── prompts.py          # All prompt templates (centralised)
│   ├── generator.py        # Auto-generate test suites from tool definitions
│   ├── query_generator.py  # Generate query variations via Copilot SDK
│   ├── evaluator.py        # Run queries against models
│   ├── scorer.py           # Score, rank, and build confusion matrices
│   ├── advisor.py          # Description improvement suggestions
│   ├── diff.py             # Regression tracking (diff two runs)
│   ├── visualize.py        # HTML report with heatmaps & diff views
│   └── report.py           # Write JSON report
├── samples/
│   ├── tools.json          # Example tool registry (8 tools)
│   └── test_suite.json     # Example test suite (19 instructions)
└── tests/

CI

Every push to master and every pull request triggers the CI pipeline (.github/workflows/ci.yml):

  1. Test — runs pytest across Python 3.10, 3.12, and 3.13
  2. Build — builds sdist + wheel and uploads as a GitHub Actions artifact

Publishing

Creating a GitHub Release triggers .github/workflows/publish.yml, which builds and publishes to PyPI via Trusted Publishers (OIDC — no API tokens needed).

# Create a release via CLI
gh release create v0.2.0 --generate-notes

License

MIT
