

MCP Tool Selection Benchmark

Benchmark how accurately different LLM models select the correct MCP tool given natural language instructions, powered by the GitHub Copilot SDK.

How It Works

  1. Load a tool registry (tools.json) and ground-truth test suite (test_suite.json).
  2. Generate 5 query variations per instruction at different ambiguity levels (explicit → misleading) using a Copilot SDK model call.
  3. Evaluate each variation against every selected model — the model is presented with the tools and must pick one or more.
  4. Score selections via exact-match and partial-credit against ground truth.
  5. Report per-model accuracy, confusion matrices, and optional description suggestions in JSON and HTML.
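
For orientation, here is a minimal, self-contained sketch of that loop. The tool registry, test case, and pick_tools_stub function are placeholders invented for illustration; the real benchmark drives steps 2-3 through Copilot SDK calls rather than a stub.

models = ["gpt-5", "claude-sonnet-4"]
tools = [{"name": "search_issues"}, {"name": "get_job_logs"}]
cases = [{"instruction": "Find all open bugs in the react repo",
          "expected_tool": "search_issues"}]

def pick_tools_stub(model: str, tools: list[dict], query: str) -> list[str]:
    """Stand-in for the Copilot SDK call that asks `model` to choose tools (step 3)."""
    return ["search_issues"]

results: dict[str, list[bool]] = {}
for model in models:
    for case in cases:
        expected = case.get("expected_tools") or [case["expected_tool"]]
        # Step 2 would produce ~5 query variations per instruction;
        # the raw instruction stands in for them here.
        for query in [case["instruction"]]:
            selected = pick_tools_stub(model, tools, query)
            results.setdefault(model, []).append(selected == expected)  # step 4: exact match

for model, hits in results.items():  # step 5: per-model accuracy
    print(model, sum(hits) / len(hits))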

Prerequisites

  • Python ≥ 3.10
  • GitHub Copilot CLI installed and in PATH
  • An active Copilot subscription (each query counts as a premium request)
  • Authenticated via copilot login, or GH_TOKEN / GITHUB_TOKEN env var

Installation

From PyPI:

pip install mcp-tool-selection-bench

Or for development, from a clone of the repository:

cd mcp-tool-selection-bench
pip install -e ".[dev]"

Usage

Run a benchmark

mcp-bench run \
  --tools samples/tools.json \
  --test-suite samples/test_suite.json \
  --models gpt-5 claude-sonnet-4 \
  --output results.json

The run subcommand is the default — you can omit it for backward compatibility.

CLI Arguments (run)

Argument             Required  Default               Description
--tools              yes                             Path to the tool registry JSON
--test-suite         yes                             Path to the ground-truth test suite JSON
--models             yes                             Space-separated model names to benchmark
--output             no        results.json          Output path for the report
--variations         no        5                     Number of query variations per instruction
--generator-model    no        first --models entry  Model used to generate query variations
-v / --verbose       no        off                   Enable DEBUG logging
--html               no                              Output path for an HTML report with confusion-matrix heatmaps
--suggest            no        off                   Generate tool-description improvement suggestions (extra API calls)
--suggest-threshold  no        0.7                   Accuracy threshold below which to suggest description improvements
--fail-under         no                              Exit with code 1 if any model's exact-match accuracy is below this value

Input Schemas

tools.json

[
  {
    "name": "search_issues",
    "description": "Search for issues in GitHub repositories",
    "parameters": {
      "type": "object",
      "properties": {
        "query": { "type": "string", "description": "Search query" }
      },
      "required": ["query"]
    },
    "metadata": { "category": "github", "tags": ["issues", "search"] }
  }
]
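
The registry is plain JSON, so it can be loaded and sanity-checked with a few lines of Python. The check below is illustrative rather than the package's own validation; it assumes name, description, and parameters are required (as in the example above) and treats metadata as optional.

import json

REQUIRED_FIELDS = {"name", "description", "parameters"}  # assumed required per the example above

with open("samples/tools.json") as f:
    tools = json.load(f)

for tool in tools:
    missing = REQUIRED_FIELDS - tool.keys()
    if missing:
        raise ValueError(f"tool {tool.get('name', '<unnamed>')!r} is missing {missing}")
    # "metadata" (category, tags) is optional context and is not checked here.

print(f"loaded {len(tools)} tools:", [t["name"] for t in tools])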

test_suite.json

Supports both single-tool and multi-tool test cases:

[
  {
    "instruction": "Find all open bugs in the react repo",
    "expected_tool": "search_issues"
  },
  {
    "instruction": "Find open bugs in react and check the CI logs",
    "expected_tools": ["search_issues", "get_job_logs"]
  }
]

For multi-tool cases, use expected_tools (ordered list). Single-tool cases using expected_tool remain fully supported.
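
When post-processing a suite it is handy to normalise both forms into one ordered list; the helper below is an illustrative sketch, not the package's own loader.

def expected_sequence(case: dict) -> list[str]:
    """Return the ordered list of expected tool names for a test case,
    accepting both the single-tool and multi-tool forms shown above."""
    if "expected_tools" in case:
        return list(case["expected_tools"])
    return [case["expected_tool"]]

print(expected_sequence({"instruction": "...", "expected_tool": "search_issues"}))
print(expected_sequence({"instruction": "...", "expected_tools": ["search_issues", "get_job_logs"]}))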

Output

The report (results.json) contains:

  • run_metadata — timestamp, models tested, query counts
  • per_model_results — per-model accuracy (exact-match and partial-credit), per-instruction breakdown with each variation's selected tool(s) and correctness
  • summary — best model, accuracy ranking, best exact-match and partial-credit scores
  • confusion_matrices — per-model confusion matrix (expected vs. selected tool counts)
  • suggestions — tool-description improvement suggestions (when --suggest is used)
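
Because the report is plain JSON it can be inspected directly. Only the top-level keys listed above are documented, so the snippet below sticks to those and makes no assumptions about the nested layout.

import json

with open("results.json") as f:
    report = json.load(f)

print("sections:", sorted(report))          # the top-level keys listed above
print("run metadata:", report.get("run_metadata"))
print("summary:", report.get("summary"))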

Scoring Metrics

  • Exact-match accuracy: Full ordered sequence of selected tools must match expected tools exactly
  • Partial-credit accuracy: Ordered prefix matching — counts how many tools in sequence match from the start (e.g., expected [A, B, C], selected [A, B, X] → 2/3 credit)
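
As a concrete reading of those two rules, here is a small sketch (not the package's own scorer) that reproduces the worked example:

def exact_match(selected: list[str], expected: list[str]) -> float:
    """1.0 only if the full ordered sequence matches exactly, else 0.0."""
    return 1.0 if selected == expected else 0.0

def partial_credit(selected: list[str], expected: list[str]) -> float:
    """Ordered-prefix matching: count how many tools match in sequence from the start."""
    matched = 0
    for got, want in zip(selected, expected):
        if got != want:
            break
        matched += 1
    return matched / len(expected) if expected else 0.0

print(exact_match(["A", "B", "X"], ["A", "B", "C"]))     # 0.0
print(partial_credit(["A", "B", "X"], ["A", "B", "C"]))  # 0.666... (2/3 credit)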

HTML Report

Use --html report.html to generate a visual report with:

  • Model ranking table
  • Confusion-matrix heatmaps (green = correct, red = misselected)
  • Description improvement suggestions (if --suggest was used)

Regression Tracking (Diff)

Compare two benchmark runs to see what improved or regressed:

mcp-bench diff baseline.json current.json
mcp-bench diff baseline.json current.json --html diff.html
mcp-bench diff baseline.json current.json --fail-under 0.05  # fail if any model regressed >5%

The diff output shows per-model and per-instruction accuracy changes with ↑/↓/= indicators. Changes beyond ±5% are flagged as improved (green) or regressed (red).
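
For reference, the ±5% flagging rule can be expressed in a few lines. The snippet below operates on hypothetical per-model exact-match accuracies, whereas mcp-bench diff works on the full results.json files.

def classify_change(baseline: float, current: float, threshold: float = 0.05) -> str:
    """Flag a per-model accuracy change: beyond +threshold -> improved,
    beyond -threshold -> regressed, otherwise unchanged."""
    delta = current - baseline
    if delta > threshold:
        return f"improved (+{delta:.0%})"
    if delta < -threshold:
        return f"regressed ({delta:.0%})"
    return f"unchanged ({delta:+.0%})"

baseline = {"gpt-5": 0.82, "claude-sonnet-4": 0.78}   # hypothetical numbers
current = {"gpt-5": 0.90, "claude-sonnet-4": 0.71}
for model in baseline:
    print(model, classify_change(baseline[model], current[model]))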

For MCP Server Maintainers

Want to benchmark tool selection accuracy in your MCP server's CI? Add a workflow that runs on every tool change:

  1. Add tools.json and test_suite.json to your repo (see Input Schemas)
  2. Copy the template workflow from examples/mcp-server-workflow.yml into .github/workflows/
  3. Create a GH_TOKEN repository secret with a Copilot-licensed PAT
  4. Customize the model list, paths, and thresholds in the workflow

The workflow will:

  • Run the benchmark whenever tool files change
  • Automatically diff against the previous run's results
  • Fail the build if accuracy drops below the configured threshold
  • Upload JSON + HTML reports as GitHub Actions artifacts

Running Tests

pytest tests/

Project Structure

mcp-tool-selection-bench/
├── pyproject.toml
├── README.md
├── .github/workflows/
│   ├── ci.yml              # CI pipeline: test & build on every push/PR
│   └── publish.yml         # Publish to PyPI on GitHub Release
├── examples/
│   └── mcp-server-workflow.yml  # Template workflow for MCP server repos
├── src/mcp_bench/
│   ├── cli.py              # CLI entry point (run + diff subcommands)
│   ├── models.py           # Pydantic data models
│   ├── prompts.py          # All prompt templates (centralised)
│   ├── query_generator.py  # Generate query variations via Copilot SDK
│   ├── evaluator.py        # Run queries against models
│   ├── scorer.py           # Score, rank, and build confusion matrices
│   ├── advisor.py          # Description improvement suggestions
│   ├── diff.py             # Regression tracking (diff two runs)
│   ├── visualize.py        # HTML report with heatmaps & diff views
│   └── report.py           # Write JSON report
├── samples/
│   ├── tools.json          # Example tool registry (8 tools)
│   └── test_suite.json     # Example test suite (19 instructions)
└── tests/

CI

Every push to main and every pull request triggers the CI pipeline (.github/workflows/ci.yml):

  1. Test — runs pytest across Python 3.10, 3.12, and 3.13
  2. Build — builds sdist + wheel and uploads as a GitHub Actions artifact

Publishing

Creating a GitHub Release triggers .github/workflows/publish.yml, which builds and publishes to PyPI via Trusted Publishers (OIDC — no API tokens needed).

# Create a release via CLI
gh release create v0.1.0 --generate-notes

License

MIT
