
Scoring engine for MCP server quality assessment


mcp-scoring-engine

A standalone scoring engine for evaluating the quality of Model Context Protocol (MCP) servers. Pure Python, no framework dependencies — just dataclasses, scoring logic, and network probes.

Used in production by MCP Scoreboard to grade thousands of MCP servers.

Installation

pip install mcp-scoring-engine

Quick Start

Score a server from its GitHub repo (static analysis)

from mcp_scoring_engine import ServerInfo, analyze_repo, compute_score

server = ServerInfo(
    name="my-mcp-server",
    description="A tool server for doing useful things",
    repo_url="https://github.com/owner/my-mcp-server",
)

static = analyze_repo(server.repo_url)
result = compute_score(server, static_result=static)

print(result.composite_score)  # 0–100
print(result.grade)            # "A+", "B", "D", etc.
print(result.score_type)       # "partial" (1 tier) or "full" (2+ tiers)

Probe a running server

from mcp_scoring_engine import (
    ServerInfo, probe_server, deep_probe_server, compute_score
)

# Fast health check (~10s) — connection, initialize, ping
fast = probe_server("https://my-server.example.com/mcp")
print(fast.is_reachable, fast.connection_ms)

# Deep protocol probe (~30s) — schema validation, error handling, fuzz testing
deep = deep_probe_server("https://my-server.example.com/mcp")
print(deep.tools_count, deep.schema_valid, deep.fuzz_score)

# Score with the probe results
server = ServerInfo(name="my-server", description="...", repo_url="...")
static = analyze_repo(server.repo_url)
result = compute_score(server, static_result=static, deep_probe=deep)
print(result.grade)

Probe a stdio server

from mcp_scoring_engine import probe_server_stdio, deep_probe_server_stdio

fast = probe_server_stdio(["npx", "-y", "@modelcontextprotocol/server-memory"])
deep = deep_probe_server_stdio(["python", "-m", "my_mcp_server"])

Classify a server

from mcp_scoring_engine import classify_server, ServerInfo

server = ServerInfo(
    name="stripe-mcp",
    description="MCP server for Stripe payment processing",
    repo_url="https://github.com/stripe/stripe-mcp",
)

category, targets = classify_server(server)
print(category)  # "finance"
print(targets)   # ["Stripe"]
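The library's classification rules aren't documented here, but a keyword-matching sketch conveys the general shape. The category list and keywords below are illustrative assumptions, not the library's actual taxonomy:

```python
# Hypothetical keyword-based classifier -- illustrative only, not the
# logic inside classify_server().
CATEGORY_KEYWORDS = {
    "finance": ["payment", "stripe", "invoice", "billing"],
    "devtools": ["github", "ci", "deploy"],
}

def sketch_classify(name: str, description: str) -> str:
    """Return the first category whose keywords appear in the server text."""
    text = f"{name} {description}".lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(kw in text for kw in keywords):
            return category
    return "other"

category = sketch_classify(
    "stripe-mcp", "MCP server for Stripe payment processing"
)
```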

Detect entry points for stdio servers

from mcp_scoring_engine import detect_entry_point, make_github_file_reader

# With a GitHubPublicClient (from your own GitHub API code)
file_reader = make_github_file_reader(client)
tree = client.get_tree()

result = detect_entry_point(tree, file_reader)
# {"language": "python", "run_cmd": ["python", "-m", "my_server"],
#  "install_cmd": "uv pip install -e .",
#  "source": "pyproject.toml [project.scripts]", "confidence": "high"}

Entry point detection parses build metadata to infer how to run an MCP server:

  • Python: pyproject.toml scripts, setup.cfg/setup.py console_scripts, __main__.py
  • Node: package.json bin field, scripts.start, main field

When called via analyze_repo(), detection piggybacks on the already-fetched file tree at zero extra API cost. The result is stored in StaticAnalysis.details["entry_point"].

Detect red flags

from mcp_scoring_engine import detect_flags, ServerInfo

server = ServerInfo(
    name="sketchy-server",
    description="A MCP server",
    repo_url="",
    remote_endpoint_url="http://localhost:3000/mcp",
)

flags = detect_flags(server)
for flag in flags:
    print(f"[{flag.severity}] {flag.label}: {flag.description}")
    # [critical] No Source Code: No repository URL or source link provided
    # [warning] Staging Artifact: Endpoint URL contains localhost or staging reference
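For intuition, the two flags above can be reproduced in a few lines of plain Python. The `Flag` stand-in and rule set here are illustrative only, not the library's implementation, and the real `detect_flags()` applies a much broader rule set:

```python
# Illustrative sketch of the two checks shown above.
from dataclasses import dataclass

@dataclass
class Flag:
    severity: str
    label: str
    description: str

def sketch_detect_flags(repo_url: str, endpoint_url: str) -> list[Flag]:
    """Apply two sample rules: missing source, staging/localhost endpoint."""
    flags = []
    if not repo_url:
        flags.append(Flag("critical", "No Source Code",
                          "No repository URL or source link provided"))
    if any(m in endpoint_url for m in ("localhost", "127.0.0.1", "staging")):
        flags.append(Flag("warning", "Staging Artifact",
                          "Endpoint URL contains localhost or staging reference"))
    return flags

flags = sketch_detect_flags("", "http://localhost:3000/mcp")
```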

Architecture

The engine evaluates servers across three data tiers:

Tier | Source | What it measures
Tier 1 — Static Analysis | GitHub repo | Schema completeness, description quality, documentation, maintenance pulse, dependency health, license clarity, version hygiene
Tier 2 — Protocol Probe | Live server | Connection health, tool schema validation, error handling, fuzz resilience, auth discovery
Tier 3 — Reliability | Rolling window | Uptime percentage, p50/p95 latency

The composite score is a weighted blend of five categories:

Category | Weight
Schema & Docs | 25%
Protocol Compliance | 20%
Reliability | 20%
Maintenance | 15%
Security | 20%

Score types:

  • partial — Only 1 data tier available. Numeric score but no letter grade.
  • full — 2+ data tiers. Graded A+ through F.
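The blend itself is simple arithmetic. Here is a sketch using the documented category weights; the letter-grade cutoffs below are assumptions for illustration, and the real `score_to_grade()` thresholds may differ:

```python
# Weighted blend of the five documented categories.
WEIGHTS = {
    "schema_docs": 0.25,   # Schema & Docs
    "protocol": 0.20,      # Protocol Compliance
    "reliability": 0.20,   # Reliability
    "maintenance": 0.15,   # Maintenance
    "security": 0.20,      # Security
}

def composite(scores: dict[str, float]) -> float:
    """Weighted blend of the five category scores (each 0-100)."""
    return sum(scores[k] * w for k, w in WEIGHTS.items())

def to_grade(score: float) -> str:
    """Map 0-100 to a letter grade (cutoffs assumed, not the library's)."""
    for cutoff, letter in [(97, "A+"), (90, "A"), (80, "B"),
                           (70, "C"), (60, "D")]:
        if score >= cutoff:
            return letter
    return "F"

scores = {"schema_docs": 90, "protocol": 80, "reliability": 100,
          "maintenance": 70, "security": 85}
total = composite(scores)  # 22.5 + 16 + 20 + 10.5 + 17 = 86.0
```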

API Reference

Core

Function | Description
compute_score(server, static_result?, deep_probe?, reliability?) | Compute weighted composite score → ScoreResult
score_to_grade(score) | Convert 0–100 → letter grade (A+, A, B, C, D, F)
classify_server(server) | Categorize a server → (category, target_platforms)
detect_flags(server, context?) | Detect red flags → list[Flag]
generate_badges(server, static_result?, deep_probe?, reliability?, flags?) | Generate display badges → dict

Probes

Function | Description
probe_server(url) | Fast health check over HTTP → FastProbeResult
probe_server_stdio(command) | Fast health check over stdio → FastProbeResult
deep_probe_server(url) | Full protocol probe over HTTP → DeepProbeResult
deep_probe_server_stdio(command) | Full protocol probe over stdio → DeepProbeResult
analyze_repo(repo_url) | Static analysis of GitHub repo → StaticAnalysis
detect_entry_point(file_tree, file_reader) | Detect how to run a server from repo metadata → dict | None
make_github_file_reader(client) | Create a file_reader callable from a GitHub API client → Callable
compute_reliability_score(data) | Score from uptime + latency → int
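The formula inside compute_reliability_score isn't documented here; purely for illustration, a plausible sketch that blends uptime with a p95-latency penalty might look like this:

```python
# Hypothetical reliability formula -- an assumption, not the library's math.
def sketch_reliability_score(uptime_pct: float, p95_ms: float) -> int:
    """Blend uptime (dominant) with a latency penalty, clamped to 0-100."""
    latency_factor = max(0.0, 1.0 - p95_ms / 5000.0)  # hits 0 at a 5s p95
    score = 0.8 * uptime_pct + 20.0 * latency_factor
    return max(0, min(100, round(score)))

score = sketch_reliability_score(uptime_pct=99.9, p95_ms=250.0)
```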

Types

All inputs and outputs are plain dataclasses:

  • ServerInfo — Input server metadata (name, description, repo_url, etc.)
  • ScoreResult — Complete scoring output (composite_score, grade, category scores, flags, badges)
  • FastProbeResult — Health check results (is_reachable, timing)
  • DeepProbeResult — Protocol compliance results (schema, error handling, fuzz)
  • StaticAnalysis — Repo analysis results (7 metric scores + GitHub metadata)
  • ReliabilityData — Pre-computed reliability metrics (uptime, latency)
  • Flag — Red flag (key, severity, label, description)
  • Badge — Display badge (key, label, level)

License

MIT
