Model Context Protocol (MCP) server for Apache Spark History Server with job comparison and analytics
Kubeflow Spark AI Toolkit
Connect AI agents and engineers to Apache Spark History Server for intelligent job analysis, performance monitoring, and terminal-based investigation.
This project provides two interfaces to your Apache Spark History Server data: an MCP server for AI agents doing natural-language investigation, and a CLI (shs) for engineers and scripts that need direct terminal access.
Important: the Spark History Server CLI is now available.
It is a standalone Go binary that queries Spark History Server directly from your terminal: no MCP, no AI framework, no daemon process. Inspect jobs, compare runs, investigate failures, and script against the Spark REST API.
The two interfaces at a glance:
| | MCP Server | SHS CLI (shs) |
|---|---|---|
| For | AI agents and MCP-compatible clients | Humans, shell scripts, CI/CD, coding agents |
| How | AI calls tools via Model Context Protocol | Direct terminal commands, no protocol overhead |
| Example | "Why is my ETL job slow?" → agent investigates | shs stages -a APP --sort duration |
| Install | uv run -m spark_history_mcp.core.main | cd skills/cli && go build -o bin/shs . |
What is This?
Kubeflow Spark AI Toolkit is a diagnostics toolkit for Apache Spark applications. It provides two interfaces to your Spark History Server data:
- MCP Server: AI agents query Spark data via the Model Context Protocol using natural language
- CLI (shs): Engineers and scripts query Spark data directly from the terminal
Both interfaces enable:
- Query job details: application metadata, stages, executors, SQL queries
- Analyze performance: identify slow stages, bottlenecks, and resource usage patterns
- Compare runs: diff configurations and metrics across applications to catch regressions
- Investigate failures: drill into failed tasks with detailed error analysis
- Generate insights: surface optimization recommendations from historical execution data
Architecture
graph TB
    subgraph Clients
        A[AI Agent / LLM]
        B[Engineer / Script / CI]
    end
    subgraph Toolkit
        C[MCP Server]
        D[CLI - shs]
    end
    subgraph Spark History Servers
        E[Production]
        F[Dev]
    end
    A -->|MCP Protocol| C
    B -->|Terminal| D
    C -->|REST API| E
    C -->|REST API| F
    D -->|REST API| E
    D -->|REST API| F
Quick Start
CLI (shs)
Download the latest binary from GitHub Releases:
# Linux (amd64)
curl -sSL https://github.com/kubeflow/mcp-apache-spark-history-server/releases/latest/download/shs-linux-amd64.tar.gz | tar xz
sudo mv shs /usr/local/bin/
# macOS (Apple Silicon)
curl -sSL https://github.com/kubeflow/mcp-apache-spark-history-server/releases/latest/download/shs-darwin-arm64.tar.gz | tar xz
sudo mv shs /usr/local/bin/
Point it at your Spark History Server and start querying:
shs apps --server http://your-spark-history-server:18080
shs stages -a <app-id> --sort duration
# Generate a config file to avoid passing --server every time
shs setup config > config.yaml
# Generate a skill file for coding agents (e.g. Claude Code)
shs setup skill > ~/.claude/skills/spark-history.md
See the CLI documentation for full usage, or check out a real-world example of Claude Code comparing two TPC-DS 3TB benchmark runs.
MCP Server
# Run directly with uvx (no install needed)
uvx --from mcp-apache-spark-history-server spark-mcp
# Or install with pip
pip install mcp-apache-spark-history-server
python3 -m spark_history_mcp.core.main
The package is published to PyPI.
Prerequisites
- Existing Spark History Server (running and accessible)
- CLI: No dependencies, single static binary
- MCP Server: Python 3.12+, uv
Server Configuration
Edit config.yaml for your Spark History Server:
Config File Options:
- Command line: --config /path/to/config.yaml or -c /path/to/config.yaml
- Environment variable: SHS_MCP_CONFIG=/path/to/config.yaml
- Default: ./config.yaml
servers:
  local:
    default: true
    url: "http://your-spark-history-server:18080"
    auth:  # optional
      username: "user"
      password: "pass"
    include_plan_description: false  # optional, whether to include SQL execution plans by default (default: false)
mcp:
  transports:
    - streamable-http  # streamable-http or stdio
  port: "18888"
  debug: true
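The three config file options listed above can be combined into a single lookup. The following is a minimal sketch, not the project's actual loader: it assumes the command-line flag takes priority over SHS_MCP_CONFIG, which in turn takes priority over ./config.yaml. The function name `resolve_config_path` is hypothetical.

```python
import os
from typing import Mapping, Optional, Sequence


def resolve_config_path(
    argv: Sequence[str],
    env: Optional[Mapping[str, str]] = None,
    default: str = "./config.yaml",
) -> str:
    """Pick a config path. Assumed order: CLI flag, then env var, then default."""
    env = os.environ if env is None else env
    args = list(argv)
    for i, arg in enumerate(args):
        # "--config PATH" or "-c PATH"
        if arg in ("--config", "-c") and i + 1 < len(args):
            return args[i + 1]
        # "--config=PATH"
        if arg.startswith("--config="):
            return arg.split("=", 1)[1]
    # Fall back to the environment variable, then to the documented default.
    return env.get("SHS_MCP_CONFIG") or default
```

Passing `env` explicitly keeps the function easy to test without touching the real process environment.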
Screenshots
Get Spark Application
Job Performance Comparison
MCP Tools
Note: These tools are subject to change as we scale and improve the performance of the MCP server.
The MCP server provides 18 specialized tools organized by analysis patterns. LLMs can intelligently select and combine these tools based on user queries:
Application Information
Basic application metadata and overview
| Tool | Description |
|---|---|
| list_applications | Get a list of all applications available on the Spark History Server with optional filtering by status, date ranges, and limits |
| get_application | Get detailed information about a specific Spark application including status, resource usage, duration, and attempt details |
Job Analysis
Job-level performance analysis and identification
| Tool | Description |
|---|---|
| list_jobs | Get a list of all jobs for a Spark application with optional status filtering |
| list_slowest_jobs | Get the N slowest jobs for a Spark application (excludes running jobs by default) |
Stage Analysis
Stage-level performance deep dive and task metrics
| Tool | Description |
|---|---|
| list_stages | Get a list of all stages for a Spark application with optional status filtering and summaries |
| list_slowest_stages | Get the N slowest stages for a Spark application (excludes running stages by default) |
| get_stage | Get information about a specific stage with optional attempt ID and summary metrics |
| get_stage_task_summary | Get statistical distributions of task metrics for a specific stage (execution times, memory usage, I/O metrics) |
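To illustrate what "the N slowest stages, excluding running stages by default" means, here is a small sketch over simplified stage records. It is not the server's implementation: the record keys (`stage_id`, `status`, `duration_ms`) are hypothetical, and treating `ACTIVE` as the running status is an assumption.

```python
import heapq
from typing import Any, Dict, List


def slowest_stages(
    stages: List[Dict[str, Any]],
    n: int = 5,
    include_running: bool = False,
    running_status: str = "ACTIVE",  # assumed label for a still-running stage
) -> List[Dict[str, Any]]:
    """Return up to n stages with the largest duration, slowest first."""
    candidates = [
        s for s in stages
        if include_running or s["status"] != running_status
    ]
    # nlargest avoids sorting the full list when n is small.
    return heapq.nlargest(n, candidates, key=lambda s: s["duration_ms"])


if __name__ == "__main__":
    sample = [
        {"stage_id": 1, "status": "COMPLETE", "duration_ms": 1200},
        {"stage_id": 2, "status": "ACTIVE", "duration_ms": 9000},
        {"stage_id": 3, "status": "COMPLETE", "duration_ms": 4300},
    ]
    print([s["stage_id"] for s in slowest_stages(sample, n=2)])
```

Running stages are filtered out first because their duration is still growing and would distort the ranking.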
Executor & Resource Analysis
Resource utilization, executor performance, and allocation tracking
| Tool | Description |
|---|---|
| list_executors | Get executor information with optional inactive executor inclusion |
| get_executor | Get information about a specific executor including resource allocation, task statistics, and performance metrics |
| get_executor_summary | Aggregates metrics across all executors (memory usage, disk usage, task counts, performance metrics) |
| get_resource_usage_timeline | Get chronological view of resource allocation and usage patterns including executor additions/removals |
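The kind of aggregation that get_executor_summary describes can be sketched as a simple fold over executor records. This is an illustration only: the field names (`memoryUsed`, `diskUsed`, `totalTasks`, `failedTasks`) follow the executor objects returned by the Spark REST API, but which fields the MCP tool actually aggregates is an assumption, and `summarize_executors` is a hypothetical helper.

```python
from typing import Any, Dict, Iterable

# Numeric fields summed across executors (an assumed subset).
_SUMMED_FIELDS = ("memoryUsed", "diskUsed", "totalTasks", "failedTasks")


def summarize_executors(executors: Iterable[Dict[str, Any]]) -> Dict[str, int]:
    """Sum selected numeric fields over all executors and count them."""
    totals = {"executors": 0, **{field: 0 for field in _SUMMED_FIELDS}}
    for executor in executors:
        totals["executors"] += 1
        for field in _SUMMED_FIELDS:
            # Missing fields count as zero so partial records do not fail.
            totals[field] += executor.get(field, 0)
    return totals
```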
Configuration & Environment
Spark configuration, environment variables, and runtime settings
| Tool | Description |
|---|---|
| get_environment | Get comprehensive Spark runtime configuration including JVM info, Spark properties, system properties, and classpath |
SQL & Query Analysis
SQL performance analysis and execution plan comparison
| Tool | Description |
|---|---|
| list_slowest_sql_queries | Get the top N slowest SQL queries for an application with detailed execution metrics and optional plan descriptions |
| compare_sql_execution_plans | Compare SQL execution plans between two Spark jobs, analyzing logical/physical plans and execution metrics |
Performance & Bottleneck Analysis
Intelligent bottleneck identification and performance recommendations
| Tool | Description |
|---|---|
| get_job_bottlenecks | Identify performance bottlenecks by analyzing stages, tasks, and executors with actionable recommendations |
Comparative Analysis
Cross-application comparison for regression detection and optimization
| Tool | Description |
|---|---|
| compare_job_environments | Compare Spark environment configurations between two jobs to identify differences in properties and settings |
| compare_job_performance | Compare performance metrics between two Spark jobs including execution times, resource usage, and task distribution |
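Comparing two environments comes down to a dictionary diff over Spark properties. The sketch below shows the idea under the assumption that each job's properties are available as a flat key/value mapping; `diff_properties` is a hypothetical helper, not the compare_job_environments implementation.

```python
from typing import Any, Dict, Mapping


def diff_properties(
    left: Mapping[str, str], right: Mapping[str, str]
) -> Dict[str, Dict[str, Any]]:
    """Split two property maps into left-only, right-only, and changed keys."""
    only_left = {k: left[k] for k in left.keys() - right.keys()}
    only_right = {k: right[k] for k in right.keys() - left.keys()}
    changed = {
        k: (left[k], right[k])
        for k in left.keys() & right.keys()
        if left[k] != right[k]
    }
    return {"only_left": only_left, "only_right": only_right, "changed": changed}


if __name__ == "__main__":
    yesterday = {"spark.executor.memory": "4g", "spark.sql.shuffle.partitions": "200"}
    today = {"spark.executor.memory": "8g", "spark.executor.cores": "4"}
    print(diff_properties(yesterday, today))
```

Keeping the three groups separate makes it easy to tell a removed setting from one whose value changed between runs.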
How LLMs Use These Tools
Query Pattern Examples:
- "Show me all applications between 12 AM and 1 AM on 2025-06-27" → list_applications
- "Why is my job slow?" → get_job_bottlenecks + list_slowest_stages + get_executor_summary
- "Compare today vs yesterday" → compare_job_performance + compare_job_environments
- "What's wrong with stage 5?" → get_stage + get_stage_task_summary
- "Show me resource usage over time" → get_resource_usage_timeline + get_executor_summary
- "Find my slowest SQL queries" → list_slowest_sql_queries + compare_sql_execution_plans
AWS Integration Guides
If you are an existing AWS user looking to analyze your Spark Applications, we provide detailed setup guides for:
- AWS Glue Users - Connect to Glue Spark History Server
- Amazon EMR Users - Use EMR Persistent UI for Spark analysis
These guides provide step-by-step instructions for setting up the Spark History Server MCP with your AWS services.
Kubernetes Deployment
Deploy using Kubernetes with Helm:
Work in Progress: We are still testing and will soon publish the container image and Helm registry to GitHub for easy deployment.
# Deploy with Helm
helm install spark-history-mcp ./deploy/kubernetes/helm/spark-history-mcp/
# Production configuration
helm install spark-history-mcp ./deploy/kubernetes/helm/spark-history-mcp/ \
  --set replicaCount=3 \
  --set autoscaling.enabled=true \
  --set monitoring.enabled=true
See deploy/kubernetes/helm/ for complete deployment manifests and configuration options.
Note: When using Secrets Store CSI Driver authentication, you must create a SecretProviderClass externally before deploying the chart.
Multi-Spark History Server Setup
Set up multiple Spark History Servers in config.yaml and choose which server the LLM should query for each request.
servers:
  production:
    default: true
    url: "http://prod-spark-history:18080"
    auth:
      username: "user"
      password: "pass"
  staging:
    url: "http://staging-spark-history:18080"
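How a request picks one of several configured servers can be sketched as a name lookup with a fallback to the entry marked default. This is an illustration under assumptions, not the project's code: `select_server` is a hypothetical helper, and the dictionary simply mirrors the YAML structure shown above.

```python
from typing import Any, Dict, Optional, Tuple

ServerConfig = Dict[str, Any]


def select_server(
    servers: Dict[str, ServerConfig], name: Optional[str] = None
) -> Tuple[str, ServerConfig]:
    """Return (name, config) for the requested server, or the default one."""
    if name is not None:
        if name not in servers:
            raise KeyError(f"unknown server: {name!r}")
        return name, servers[name]
    # No explicit name: use the first entry flagged with default: true.
    for key, config in servers.items():
        if config.get("default"):
            return key, config
    raise ValueError("no server requested and no default configured")


if __name__ == "__main__":
    servers = {
        "production": {"default": True, "url": "http://prod-spark-history:18080"},
        "staging": {"url": "http://staging-spark-history:18080"},
    }
    print(select_server(servers, "staging")[1]["url"])
    print(select_server(servers)[0])
```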
User Query: "Can you get application <app_id> using production server?"
AI Tool Request:
{
  "app_id": "<app_id>",
  "server": "production"
}
AI Tool Response:
{
  "id": "<app_id>",
  "name": "app_name",
  "coresGranted": null,
  "maxCores": null,
  "coresPerExecutor": null,
  "memoryPerExecutorMB": null,
  "attempts": [
    {
      "attemptId": null,
      "startTime": "2023-09-06T04:44:37.006000Z",
      "endTime": "2023-09-06T04:45:40.431000Z",
      "lastUpdated": "2023-09-06T04:45:42Z",
      "duration": 63425,
      "sparkUser": "spark",
      "appSparkVersion": "3.3.0",
      "completed": true
    }
  ]
}
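The duration field in the response above is consistent with the difference between endTime and startTime in milliseconds. A short sketch to check that arithmetic, assuming timestamps always carry fractional seconds and a trailing Z as in this example (`duration_ms` is a hypothetical helper):

```python
from datetime import datetime, timedelta

# Matches timestamps such as "2023-09-06T04:44:37.006000Z".
_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def duration_ms(start: str, end: str) -> int:
    """Return the elapsed time between two timestamps in whole milliseconds."""
    delta = datetime.strptime(end, _FORMAT) - datetime.strptime(start, _FORMAT)
    # Integer division by a 1 ms timedelta avoids floating-point rounding.
    return delta // timedelta(milliseconds=1)


if __name__ == "__main__":
    print(duration_ms("2023-09-06T04:44:37.006000Z", "2023-09-06T04:45:40.431000Z"))
    # prints 63425
```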
Environment Variables
SHS_MCP_PORT - Port for MCP server (default: 18888)
SHS_MCP_DEBUG - Enable debug mode (default: false)
SHS_MCP_ADDRESS - Address for MCP server (default: localhost)
SHS_MCP_TRANSPORT - MCP transport mode (default: streamable-http)
SHS_SERVERS_*_URL - URL for a specific server
SHS_SERVERS_*_AUTH_USERNAME - Username for a specific server
SHS_SERVERS_*_AUTH_PASSWORD - Password for a specific server
SHS_SERVERS_*_AUTH_TOKEN - Token for a specific server
SHS_SERVERS_*_VERIFY_SSL - Whether to verify SSL for a specific server (true/false)
SHS_SERVERS_*_TIMEOUT - HTTP request timeout in seconds for a specific server (default: 30)
SHS_SERVERS_*_EMR_CLUSTER_ARN - EMR cluster ARN for a specific server
SHS_SERVERS_*_INCLUDE_PLAN_DESCRIPTION - Whether to include SQL execution plans by default for a specific server (true/false, default: false)
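The SHS_SERVERS_*_ variables encode a server name between the prefix and a known suffix. The sketch below shows one way such names could be grouped per server. It is an assumption-laden illustration, not the project's parser: the lower-casing of names, the flat field keys, and the helper name `servers_from_env` are all hypothetical.

```python
import os
import re
from typing import Dict, Mapping, Optional

# Suffixes taken from the variable list above.
_FIELDS = (
    "AUTH_USERNAME", "AUTH_PASSWORD", "AUTH_TOKEN", "VERIFY_SSL",
    "TIMEOUT", "EMR_CLUSTER_ARN", "INCLUDE_PLAN_DESCRIPTION", "URL",
)
_PATTERN = re.compile(
    r"^SHS_SERVERS_(?P<name>.+)_(?P<field>" + "|".join(_FIELDS) + r")$"
)


def servers_from_env(
    env: Optional[Mapping[str, str]] = None,
) -> Dict[str, Dict[str, str]]:
    """Group SHS_SERVERS_<NAME>_<FIELD> variables by server name."""
    env = os.environ if env is None else env
    servers: Dict[str, Dict[str, str]] = {}
    for key, value in env.items():
        match = _PATTERN.match(key)
        if match is None:
            continue  # Not a per-server variable (e.g. SHS_MCP_PORT).
        name = match.group("name").lower()
        servers.setdefault(name, {})[match.group("field").lower()] = value
    return servers
```

Matching against the known suffixes is what allows server names that themselves contain underscores.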
AI Agent Integration
Quick Start Options
| Integration | Transport | Best For |
|---|---|---|
| Local Testing | HTTP | Development, testing tools |
| Claude Desktop | STDIO | Interactive analysis |
| Amazon Q CLI | STDIO | Command-line automation |
| Kiro | HTTP | IDE integration, code-centric analysis |
| LangGraph | HTTP | Multi-agent workflows |
| Strands Agents | HTTP | Multi-agent workflows |
Tip: The shs CLI can also generate a skill file for coding agents that support tool use:
shs setup skill > ~/.claude/skills/spark-history.md
This gives agents like Claude Code direct access to Spark History Server queries without the MCP server. See a real-world example of Claude Code using shs to compare two TPC-DS 3TB benchmark runs, dispatching subagents in parallel for per-query root cause analysis.
Example Use Cases
Performance Investigation
AI Query: "Why is my ETL job running slower than usual?"
MCP Actions:
- Analyze application metrics
- Compare with historical performance
- Identify bottleneck stages
- Generate optimization recommendations
Failure Analysis
AI Query: "What caused job 42 to fail?"
MCP Actions:
- Examine failed tasks and error messages
- Review executor logs and resource usage
- Identify root cause and suggest fixes
Comparative Analysis
AI Query: "Compare today's batch job with yesterday's run"
MCP Actions:
- Compare execution times and resource usage
- Identify performance deltas
- Highlight configuration differences
Development Setup
git clone https://github.com/kubeflow/mcp-apache-spark-history-server.git
cd mcp-apache-spark-history-server
# Install Task (if not already installed)
brew install go-task # macOS, see https://taskfile.dev/installation/ for others
# Start Spark History Server with sample data and MCP server
task start-spark-bg # Default Spark 3.5.5
task start-mcp-bg
# Optional: MCP Inspector on http://localhost:6274
task start-inspector-bg
# When done
task stop-all
Adopters
Are you using MCP Apache Spark History Server? We'd love to know! Add your organization or name to our ADOPTERS.md and help grow the community.
Contributing
See CONTRIBUTING.md for full contribution guidelines.
License
Apache License 2.0 - see LICENSE file for details.
Trademark Notice
This project is built for use with Apache Spark™ History Server. Not affiliated with or endorsed by the Apache Software Foundation.
Connect your Spark infrastructure to AI agents
Built by the community, for the community
File details
Details for the file mcp_apache_spark_history_server-0.2.0.tar.gz.
File metadata
- Download URL: mcp_apache_spark_history_server-0.2.0.tar.gz
- Size: 87.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: uv/0.11.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | cf8b24b3fad62312c2998ed86af7c909cc255270fc2bcc7c2056e52983744863 |
| MD5 | ef87fc0dc606d3d0146223b107ef237e |
| BLAKE2b-256 | b1cb74920a0e65bc976a1e090a7ed71cff155e520365af4451314bf502168c3e |
File details
Details for the file mcp_apache_spark_history_server-0.2.0-py3-none-any.whl.
File metadata
- Download URL: mcp_apache_spark_history_server-0.2.0-py3-none-any.whl
- Size: 161.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: uv/0.11.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 84cf7f01520ced54c7501778e37273828c6bb6cec58f926fef55438506986b9f |
| MD5 | cf466ef9e15358c94d783e3b6402b345 |
| BLAKE2b-256 | 5cda57fc063797df3001d95bc046452b7f8fa5767a229235c5acb40bb4a5f689 |