MCP Tool Selection Benchmark
Benchmark how accurately different LLM models select the correct MCP tool given natural language instructions, powered by the GitHub Copilot SDK.
How It Works
- Load a tool registry (tools.json) and a ground-truth test suite (test_suite.json).
- Generate 5 query variations per instruction at different ambiguity levels (explicit → misleading) using a Copilot SDK model call.
- Evaluate each variation against every selected model — the model is presented with the tools and must pick one or more.
- Score selections via exact-match and partial-credit against ground truth.
- Report per-model accuracy, confusion matrices, and optional description suggestions in JSON and HTML.
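In rough Python terms, the pipeline looks like the sketch below. The helper names (generate_variations, ask_model_to_select) are illustrative stand-ins for the Copilot SDK calls, not the package's actual API.

# Conceptual sketch of the benchmark loop; helper names and signatures are illustrative.
import json

def generate_variations(instruction, n):
    # Stand-in: the real step asks a generator model for n rewrites of the
    # instruction at increasing ambiguity (explicit -> misleading).
    return [instruction] * n

def ask_model_to_select(model, query, tools):
    # Stand-in: the real step shows the model the tool registry and asks it
    # to pick one or more tools for the query.
    return [tools[0]["name"]] if tools else []

def run_benchmark(tools_path, suite_path, models, n_variations=5):
    with open(tools_path) as f:
        tools = json.load(f)
    with open(suite_path) as f:
        suite = json.load(f)
    scores = {m: [] for m in models}
    for case in suite:
        expected = case.get("expected_tools") or [case["expected_tool"]]
        for query in generate_variations(case["instruction"], n_variations):
            for model in models:
                selected = ask_model_to_select(model, query, tools)
                scores[model].append(selected == expected)  # exact-match scoring
    return scores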
Prerequisites
- Python ≥ 3.10
- GitHub Copilot CLI installed and in PATH
- An active Copilot subscription (each query counts as a premium request)
- Authenticated via copilot login, or a GH_TOKEN / GITHUB_TOKEN env var
Installation
From PyPI:
pip install mcp-tool-selection-bench
Or for development:
cd mcp-tool-selection-bench
pip install -e ".[dev]"
Usage
Run a benchmark
mcp-bench run \
--tools samples/tools.json \
--test-suite samples/test_suite.json \
--models gpt-5 claude-sonnet-4 \
--output results.json
The run subcommand is the default — you can omit it for backward compatibility.
CLI Arguments (run)
| Argument | Required | Default | Description |
|---|---|---|---|
| --tools | ✅ | — | Path to the tool registry JSON |
| --test-suite | ✅ | — | Path to the ground-truth test suite JSON |
| --models | ✅ | — | Space-separated model names to benchmark |
| --output | — | results.json | Output path for the report |
| --variations | — | 5 | Number of query variations per instruction |
| --generator-model | — | first --models entry | Model used to generate query variations |
| -v / --verbose | — | off | Enable DEBUG logging |
| --html | — | — | Output path for an HTML report with confusion-matrix heatmaps |
| --suggest | — | off | Generate tool-description improvement suggestions (extra API calls) |
| --suggest-threshold | — | 0.7 | Accuracy threshold below which to suggest description improvements |
| --fail-under | — | — | Exit with code 1 if any model's exact-match accuracy is below this value |
Input Schemas
tools.json
[
{
"name": "search_issues",
"description": "Search for issues in GitHub repositories",
"parameters": {
"type": "object",
"properties": {
"query": { "type": "string", "description": "Search query" }
},
"required": ["query"]
},
"metadata": { "category": "github", "tags": ["issues", "search"] }
}
]
test_suite.json
Supports both single-tool and multi-tool test cases:
[
{
"instruction": "Find all open bugs in the react repo",
"expected_tool": "search_issues"
},
{
"instruction": "Find open bugs in react and check the CI logs",
"expected_tools": ["search_issues", "get_job_logs"]
}
]
For multi-tool cases, use expected_tools (ordered list). Single-tool cases using expected_tool remain fully supported.
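Before a run, it can be worth sanity-checking that every expected tool in the test suite actually exists in the registry. A minimal standalone check using only the standard library (not part of the package; the paths assume the shipped samples):

import json

with open("samples/tools.json") as f:
    registry = {tool["name"] for tool in json.load(f)}

with open("samples/test_suite.json") as f:
    suite = json.load(f)

for case in suite:
    # Accept both the single-tool and multi-tool forms described above.
    expected = case.get("expected_tools") or [case["expected_tool"]]
    missing = [name for name in expected if name not in registry]
    if missing:
        print(f"{case['instruction']!r} expects unknown tool(s): {missing}")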
Output
The report (results.json) contains:
- run_metadata — timestamp, models tested, query counts
- per_model_results — per-model accuracy (exact-match and partial-credit), per-instruction breakdown with each variation's selected tool(s) and correctness
- summary — best model, accuracy ranking, best exact-match and partial-credit scores
- confusion_matrices — per-model confusion matrix (expected vs. selected tool counts)
- suggestions — tool-description improvement suggestions (when --suggest is used)
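A quick way to inspect a report is to load it with the standard library. The top-level keys are the ones listed above; the inner field names used here (models, exact_match_accuracy, partial_credit_accuracy, best_model) and the assumption that per_model_results is keyed by model name are guesses and may differ from the actual layout.

import json

with open("results.json") as f:
    report = json.load(f)

# Top-level keys come from the list above; inner field names are assumptions.
print("models tested:", report["run_metadata"].get("models"))
for model, data in report["per_model_results"].items():
    print(model,
          "exact:", data.get("exact_match_accuracy"),
          "partial:", data.get("partial_credit_accuracy"))
print("best model:", report["summary"].get("best_model"))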
Scoring Metrics
- Exact-match accuracy: Full ordered sequence of selected tools must match expected tools exactly
- Partial-credit accuracy: Ordered prefix matching — counts how many tools in the sequence match from the start (e.g., expected [A, B, C], selected [A, B, X] → 2/3 credit)
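A minimal sketch of the two metrics as defined above (not the package's own scorer):

def exact_match(selected, expected):
    # Full ordered sequence of selected tools must match expected exactly.
    return selected == expected

def partial_credit(selected, expected):
    # Ordered prefix matching: count tools that match in sequence from the start.
    matched = 0
    for sel, exp in zip(selected, expected):
        if sel != exp:
            break
        matched += 1
    return matched / len(expected) if expected else 0.0

# Matches the example above: expected [A, B, C], selected [A, B, X] -> 2/3 credit.
assert partial_credit(["A", "B", "X"], ["A", "B", "C"]) == 2 / 3
assert exact_match(["A", "B", "X"], ["A", "B", "C"]) is False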
HTML Report
Use --html report.html to generate a visual report with:
- Model ranking table
- Confusion-matrix heatmaps (green = correct, red = misselected)
- Description improvement suggestions (if --suggest was used)
Regression Tracking (Diff)
Compare two benchmark runs to see what improved or regressed:
mcp-bench diff baseline.json current.json
mcp-bench diff baseline.json current.json --html diff.html
mcp-bench diff baseline.json current.json --fail-under 0.05 # fail if any model regressed >5%
The diff output shows per-model and per-instruction accuracy changes with ↑/↓/= indicators. Changes beyond ±5% are flagged as improved (green) or regressed (red).
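Conceptually, the comparison boils down to something like the following standalone sketch; the report field names are assumptions, as noted in the Output section.

import json

def load_accuracies(path):
    # Assumes per_model_results maps model name to a dict with an
    # "exact_match_accuracy" field; adjust to the actual report layout.
    with open(path) as f:
        report = json.load(f)
    return {m: r.get("exact_match_accuracy", 0.0)
            for m, r in report["per_model_results"].items()}

baseline = load_accuracies("baseline.json")
current = load_accuracies("current.json")
for model in sorted(baseline.keys() & current.keys()):
    delta = current[model] - baseline[model]
    # Flag only changes beyond the +/- 5% threshold described above.
    marker = "↑" if delta > 0.05 else "↓" if delta < -0.05 else "="
    print(f"{model}: {baseline[model]:.2f} -> {current[model]:.2f} {marker}")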
For MCP Server Maintainers
Want to benchmark tool selection accuracy in your MCP server's CI? Add a workflow that runs on every tool change:
- Add tools.json and test_suite.json to your repo (see Input Schemas)
- Copy the template workflow from examples/mcp-server-workflow.yml into .github/workflows/
- Create a GH_TOKEN repository secret with a Copilot-licensed PAT
- Customize the model list, paths, and thresholds in the workflow
The workflow will:
- Run the benchmark whenever tool files change
- Automatically diff against the previous run's results
- Fail the build if accuracy drops below the configured threshold
- Upload JSON + HTML reports as GitHub Actions artifacts
Running Tests
pytest tests/
Project Structure
mcp-tool-selection-bench/
├── pyproject.toml
├── README.md
├── .github/workflows/
│ ├── ci.yml # CI pipeline: test & build on every push/PR
│ └── publish.yml # Publish to PyPI on GitHub Release
├── examples/
│ └── mcp-server-workflow.yml # Template workflow for MCP server repos
├── src/mcp_bench/
│ ├── cli.py # CLI entry point (run + diff subcommands)
│ ├── models.py # Pydantic data models
│ ├── prompts.py # All prompt templates (centralised)
│ ├── query_generator.py # Generate query variations via Copilot SDK
│ ├── evaluator.py # Run queries against models
│ ├── scorer.py # Score, rank, and build confusion matrices
│ ├── advisor.py # Description improvement suggestions
│ ├── diff.py # Regression tracking (diff two runs)
│ ├── visualize.py # HTML report with heatmaps & diff views
│ └── report.py # Write JSON report
├── samples/
│ ├── tools.json # Example tool registry (8 tools)
│ └── test_suite.json # Example test suite (19 instructions)
└── tests/
CI
Every push to main and every pull request triggers the CI pipeline (.github/workflows/ci.yml):
- Test — runs pytest across Python 3.10, 3.12, and 3.13
- Build — builds sdist + wheel and uploads as a GitHub Actions artifact
Publishing
Creating a GitHub Release triggers .github/workflows/publish.yml, which builds and publishes to PyPI via Trusted Publishers (OIDC — no API tokens needed).
# Create a release via CLI
gh release create v0.1.0 --generate-notes
License
MIT