

MCP Tool Selection Benchmark

Benchmark how accurately different LLM models select the correct MCP tool given natural language instructions, powered by the GitHub Copilot SDK.

How It Works

  1. Load a tool registry (tools.json) and ground-truth test suite (test_suite.json).
  2. Generate 5 query variations per instruction at different ambiguity levels (explicit → misleading) using a Copilot SDK model call.
  3. Evaluate each variation against every selected model — the model is presented with the tools and must pick one or more.
  4. Score selections via exact-match and partial-credit against ground truth.
  5. Report per-model accuracy, confusion matrices, and optional description suggestions in JSON and HTML.
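
For orientation, here is a minimal, self-contained sketch of that loop. The tool registry, test case, and pick_tools_stub function are placeholders invented for illustration; the real benchmark drives steps 2-3 through Copilot SDK calls rather than a stub.

models = ["gpt-5", "claude-sonnet-4"]
tools = [{"name": "search_issues"}, {"name": "get_job_logs"}]
cases = [{"instruction": "Find all open bugs in the react repo",
          "expected_tool": "search_issues"}]

def pick_tools_stub(model: str, tools: list[dict], query: str) -> list[str]:
    """Stand-in for the Copilot SDK call that asks `model` to choose tools (step 3)."""
    return ["search_issues"]

results: dict[str, list[bool]] = {}
for model in models:
    for case in cases:
        expected = case.get("expected_tools") or [case["expected_tool"]]
        # Step 2 would produce ~5 query variations per instruction;
        # the raw instruction stands in for them here.
        for query in [case["instruction"]]:
            selected = pick_tools_stub(model, tools, query)
            results.setdefault(model, []).append(selected == expected)  # step 4: exact match

for model, hits in results.items():  # step 5: per-model accuracy
    print(model, sum(hits) / len(hits))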

Prerequisites

  • Python ≥ 3.10
  • GitHub Copilot CLI installed and in PATH
  • An active Copilot subscription (each query counts as a premium request)
  • Authenticated via copilot login, or GH_TOKEN / GITHUB_TOKEN env var

Installation

From PyPI:

pip install mcp-tool-selection-bench

Or for development, from a clone of the repository:

cd mcp-tool-selection-bench
pip install -e ".[dev]"

Usage

Run a benchmark

mcp-bench run \
  --tools samples/tools.json \
  --test-suite samples/test_suite.json \
  --models gpt-5 claude-sonnet-4 \
  --output results.json

The run subcommand is the default — you can omit it for backward compatibility.

CLI Arguments (run)

Argument             Required  Default               Description
--tools              yes                             Path to the tool registry JSON
--test-suite         yes                             Path to the ground-truth test suite JSON
--models             yes                             Space-separated model names to benchmark
--output             no        results.json          Output path for the report
--variations         no        5                     Number of query variations per instruction
--generator-model    no        first --models entry  Model used to generate query variations
-v / --verbose       no        off                   Enable DEBUG logging
--html               no                              Output path for an HTML report with confusion-matrix heatmaps
--suggest            no        off                   Generate tool-description improvement suggestions (extra API calls)
--suggest-threshold  no        0.7                   Accuracy threshold below which to suggest description improvements
--fail-under         no                              Exit with code 1 if any model's exact-match accuracy is below this value

Input Schemas

tools.json

[
  {
    "name": "search_issues",
    "description": "Search for issues in GitHub repositories",
    "parameters": {
      "type": "object",
      "properties": {
        "query": { "type": "string", "description": "Search query" }
      },
      "required": ["query"]
    },
    "metadata": { "category": "github", "tags": ["issues", "search"] }
  }
]
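
The registry is plain JSON, so it can be loaded and sanity-checked with a few lines of Python. The check below is illustrative rather than the package's own validation; it assumes name, description, and parameters are required (as in the example above) and treats metadata as optional.

import json

REQUIRED_FIELDS = {"name", "description", "parameters"}  # assumed required per the example above

with open("samples/tools.json") as f:
    tools = json.load(f)

for tool in tools:
    missing = REQUIRED_FIELDS - tool.keys()
    if missing:
        raise ValueError(f"tool {tool.get('name', '<unnamed>')!r} is missing {missing}")
    # "metadata" (category, tags) is optional context and is not checked here.

print(f"loaded {len(tools)} tools:", [t["name"] for t in tools])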

test_suite.json

Supports both single-tool and multi-tool test cases:

[
  {
    "instruction": "Find all open bugs in the react repo",
    "expected_tool": "search_issues"
  },
  {
    "instruction": "Find open bugs in react and check the CI logs",
    "expected_tools": ["search_issues", "get_job_logs"]
  }
]

For multi-tool cases, use expected_tools (ordered list). Single-tool cases using expected_tool remain fully supported.
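
When post-processing a suite it is handy to normalise both forms into one ordered list; the helper below is an illustrative sketch, not the package's own loader.

def expected_sequence(case: dict) -> list[str]:
    """Return the ordered list of expected tool names for a test case,
    accepting both the single-tool and multi-tool forms shown above."""
    if "expected_tools" in case:
        return list(case["expected_tools"])
    return [case["expected_tool"]]

print(expected_sequence({"instruction": "...", "expected_tool": "search_issues"}))
print(expected_sequence({"instruction": "...", "expected_tools": ["search_issues", "get_job_logs"]}))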

Output

The report (results.json) contains:

  • run_metadata — timestamp, models tested, query counts
  • per_model_results — per-model accuracy (exact-match and partial-credit), per-instruction breakdown with each variation's selected tool(s) and correctness
  • summary — best model, accuracy ranking, best exact-match and partial-credit scores
  • confusion_matrices — per-model confusion matrix (expected vs. selected tool counts)
  • suggestions — tool-description improvement suggestions (when --suggest is used)
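
Because the report is plain JSON it can be inspected directly. Only the top-level keys listed above are documented, so the snippet below sticks to those and makes no assumptions about the nested layout.

import json

with open("results.json") as f:
    report = json.load(f)

print("sections:", sorted(report))          # the top-level keys listed above
print("run metadata:", report.get("run_metadata"))
print("summary:", report.get("summary"))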

Scoring Metrics

  • Exact-match accuracy: Full ordered sequence of selected tools must match expected tools exactly
  • Partial-credit accuracy: Ordered prefix matching — counts how many tools in sequence match from the start (e.g., expected [A, B, C], selected [A, B, X] → 2/3 credit)
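
As a concrete reading of those two rules, here is a small sketch (not the package's own scorer) that reproduces the worked example:

def exact_match(selected: list[str], expected: list[str]) -> float:
    """1.0 only if the full ordered sequence matches exactly, else 0.0."""
    return 1.0 if selected == expected else 0.0

def partial_credit(selected: list[str], expected: list[str]) -> float:
    """Ordered-prefix matching: count how many tools match in sequence from the start."""
    matched = 0
    for got, want in zip(selected, expected):
        if got != want:
            break
        matched += 1
    return matched / len(expected) if expected else 0.0

print(exact_match(["A", "B", "X"], ["A", "B", "C"]))     # 0.0
print(partial_credit(["A", "B", "X"], ["A", "B", "C"]))  # 0.666... (2/3 credit)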

HTML Report

Use --html report.html to generate a visual report with:

  • Model ranking table
  • Confusion-matrix heatmaps (green = correct, red = misselected)
  • Description improvement suggestions (if --suggest was used)

Regression Tracking (Diff)

Compare two benchmark runs to see what improved or regressed:

mcp-bench diff baseline.json current.json
mcp-bench diff baseline.json current.json --html diff.html
mcp-bench diff baseline.json current.json --fail-under 0.05  # fail if any model regressed >5%

The diff output shows per-model and per-instruction accuracy changes with ↑/↓/= indicators. Changes beyond ±5% are flagged as improved (green) or regressed (red).
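
For reference, the ±5% flagging rule can be expressed in a few lines. The snippet below operates on hypothetical per-model exact-match accuracies, whereas mcp-bench diff works on the full results.json files.

def classify_change(baseline: float, current: float, threshold: float = 0.05) -> str:
    """Flag a per-model accuracy change: beyond +threshold -> improved,
    beyond -threshold -> regressed, otherwise unchanged."""
    delta = current - baseline
    if delta > threshold:
        return f"improved (+{delta:.0%})"
    if delta < -threshold:
        return f"regressed ({delta:.0%})"
    return f"unchanged ({delta:+.0%})"

baseline = {"gpt-5": 0.82, "claude-sonnet-4": 0.78}   # hypothetical numbers
current = {"gpt-5": 0.90, "claude-sonnet-4": 0.71}
for model in baseline:
    print(model, classify_change(baseline[model], current[model]))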

For MCP Server Maintainers

Want to benchmark tool selection accuracy in your MCP server's CI? Add a workflow that runs on every tool change:

  1. Add tools.json and test_suite.json to your repo (see Input Schemas)
  2. Copy the template workflow from examples/mcp-server-workflow.yml into .github/workflows/
  3. Create a GH_TOKEN repository secret with a Copilot-licensed PAT
  4. Customize the model list, paths, and thresholds in the workflow

The workflow will:

  • Run the benchmark whenever tool files change
  • Automatically diff against the previous run's results
  • Fail the build if accuracy drops below the configured threshold
  • Upload JSON + HTML reports as GitHub Actions artifacts

Running Tests

pytest tests/

Project Structure

mcp-tool-selection-bench/
├── pyproject.toml
├── README.md
├── .github/workflows/
│   ├── ci.yml              # CI pipeline: test & build on every push/PR
│   └── publish.yml         # Publish to PyPI on GitHub Release
├── examples/
│   └── mcp-server-workflow.yml  # Template workflow for MCP server repos
├── src/mcp_bench/
│   ├── cli.py              # CLI entry point (run + diff subcommands)
│   ├── models.py           # Pydantic data models
│   ├── prompts.py          # All prompt templates (centralised)
│   ├── query_generator.py  # Generate query variations via Copilot SDK
│   ├── evaluator.py        # Run queries against models
│   ├── scorer.py           # Score, rank, and build confusion matrices
│   ├── advisor.py          # Description improvement suggestions
│   ├── diff.py             # Regression tracking (diff two runs)
│   ├── visualize.py        # HTML report with heatmaps & diff views
│   └── report.py           # Write JSON report
├── samples/
│   ├── tools.json          # Example tool registry (8 tools)
│   └── test_suite.json     # Example test suite (19 instructions)
└── tests/

CI

Every push to main and every pull request triggers the CI pipeline (.github/workflows/ci.yml):

  1. Test — runs pytest across Python 3.10, 3.12, and 3.13
  2. Build — builds sdist + wheel and uploads as a GitHub Actions artifact

Publishing

Creating a GitHub Release triggers .github/workflows/publish.yml, which builds and publishes to PyPI via Trusted Publishers (OIDC — no API tokens needed).

# Create a release via CLI
gh release create v0.1.0 --generate-notes

License

MIT
