
Model Context Protocol (MCP) server for Apache Spark History Server with job comparison and analytics


Kubeflow Spark AI Toolkit


🤖 Connect AI agents and engineers to Apache Spark History Server for intelligent job analysis, performance monitoring, and terminal-based investigation

This project provides two interfaces to your Apache Spark History Server data: an MCP server for AI agents doing natural-language investigation, and a CLI (shs) for engineers and scripts that need direct terminal access.


Important

✨ NEW: Spark History Server CLI is now available

A standalone Go binary that queries Spark History Server directly from your terminal: no MCP, no AI framework, no daemon process. Inspect jobs, compare runs, investigate failures, and script against the Spark REST API.

Get started with the SHS CLI →


The two interfaces at a glance:

| | ⚡ MCP Server | 🛠️ SHS CLI (shs) |
|---|---|---|
| For | AI agents and MCP-compatible clients | Humans, shell scripts, CI/CD, coding agents |
| How | AI calls tools via Model Context Protocol | Direct terminal commands, no protocol overhead |
| Example | "Why is my ETL job slow?" → agent investigates | shs stages -a APP --sort duration |
| Install | uv run -m spark_history_mcp.core.main | cd skills/cli && go build -o bin/shs . |

🎯 What is This?

Kubeflow Spark AI Toolkit is a diagnostics toolkit for Apache Spark applications. It provides two interfaces to your Spark History Server data:

  • ⚡ MCP Server: AI agents query Spark data via the Model Context Protocol using natural language
  • 🛠️ CLI (shs): Engineers and scripts query Spark data directly from the terminal

Both interfaces enable:

  • 🔍 Query job details: application metadata, stages, executors, SQL queries
  • 📊 Analyze performance: identify slow stages, bottlenecks, and resource usage patterns
  • 🔄 Compare runs: diff configurations and metrics across applications to catch regressions
  • 🚨 Investigate failures: drill into failed tasks with detailed error analysis
  • 📈 Generate insights: surface optimization recommendations from historical execution data

📺 See it in action:

Watch the demo video

🏗️ Architecture

graph TB
    subgraph Clients
        A[🤖 AI Agent / LLM]
        B[👩‍💻 Engineer / Script / CI]
    end

    subgraph Toolkit
        C[⚡ MCP Server]
        D[🛠️ CLI - shs]
    end

    subgraph Spark History Servers
        E[🔥 Production]
        F[🔥 Dev]
    end

    A -->|MCP Protocol| C
    B -->|Terminal| D

    C -->|REST API| E
    C -->|REST API| F
    D -->|REST API| E
    D -->|REST API| F

Quick Start

CLI (shs)

Download the latest binary from GitHub Releases:

# Linux (amd64)
curl -sSL https://github.com/kubeflow/mcp-apache-spark-history-server/releases/latest/download/shs-linux-amd64.tar.gz | tar xz
sudo mv shs /usr/local/bin/

# macOS (Apple Silicon)
curl -sSL https://github.com/kubeflow/mcp-apache-spark-history-server/releases/latest/download/shs-darwin-arm64.tar.gz | tar xz
sudo mv shs /usr/local/bin/

Point it at your Spark History Server and start querying:

shs apps --server http://your-spark-history-server:18080
shs stages -a <app-id> --sort duration

# Generate a config file to avoid passing --server every time
shs setup config > config.yaml

# Generate a skill file for coding agents (e.g. Claude Code)
shs setup skill > ~/.claude/skills/spark-history.md

See the CLI documentation for full usage, or check out a real-world example of Claude Code comparing two TPC-DS 3TB benchmark runs.
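
Both interfaces read from the Spark History Server REST API. As a rough illustration of the kind of request that sits behind a command such as shs apps, the Python sketch below builds a URL for the /api/v1/applications endpoint and ranks the returned applications by the duration (in milliseconds) of their attempts. The helper names are hypothetical and this is not how shs itself is implemented.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def applications_url(base_url, status=None, limit=None):
    """Build the REST URL that lists applications on a Spark History Server."""
    params = {}
    if status is not None:
        params["status"] = status  # "completed" or "running"
    if limit is not None:
        params["limit"] = limit
    url = base_url.rstrip("/") + "/api/v1/applications"
    return url + ("?" + urlencode(params) if params else "")


def longest_attempt_ms(app):
    """Longest attempt duration in milliseconds, or 0 if there are no attempts."""
    return max((a.get("duration", 0) for a in app.get("attempts", [])), default=0)


def slowest_applications(apps, n=5):
    """Return the n applications with the longest attempts, slowest first."""
    return sorted(apps, key=longest_attempt_ms, reverse=True)[:n]


def fetch_applications(base_url, **filters):
    """Fetch and decode the application list (needs a reachable server)."""
    with urlopen(applications_url(base_url, **filters), timeout=30) as resp:
        return json.load(resp)
```

With a reachable server, slowest_applications(fetch_applications("http://your-spark-history-server:18080", status="completed")) would return the five longest-running completed applications.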

MCP Server

# Run directly with uvx (no install needed)
uvx --from mcp-apache-spark-history-server spark-mcp

# Or install with pip
pip install mcp-apache-spark-history-server
python3 -m spark_history_mcp.core.main

The package is published to PyPI.

Prerequisites

  • Existing Spark History Server (running and accessible)
  • CLI: No dependencies, single static binary
  • MCP Server: Python 3.12+, uv

⚙️ Server Configuration

Edit config.yaml for your Spark History Server:

Config File Options:

  • Command line: --config /path/to/config.yaml or -c /path/to/config.yaml
  • Environment variable: SHS_MCP_CONFIG=/path/to/config.yaml
  • Default: ./config.yaml

servers:
  local:
    default: true
    url: "http://your-spark-history-server:18080"
    auth:  # optional
      username: "user"
      password: "pass"
    include_plan_description: false  # optional, whether to include SQL execution plans by default (default: false)
mcp:
  transports:
    - streamable-http # streamable-http or stdio.
  port: "18888"
  debug: true
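
The three ways of locating the config file can be read as a lookup chain. The sketch below assumes the command-line flag wins over SHS_MCP_CONFIG, which in turn wins over ./config.yaml; that precedence is an assumption of this example, not something stated above. It also shows how the server flagged default: true might be picked from the parsed file. Both helper names are hypothetical, and the config is written as the dict a YAML parser would produce.

```python
import os


def resolve_config_path(cli_value=None, environ=None):
    """Pick the config path: CLI flag, then SHS_MCP_CONFIG, then ./config.yaml.

    The precedence order is an assumption made for this sketch.
    """
    environ = os.environ if environ is None else environ
    return cli_value or environ.get("SHS_MCP_CONFIG") or "./config.yaml"


def default_server(config):
    """Return (name, settings) of the first server flagged `default: true`."""
    for name, settings in config.get("servers", {}).items():
        if settings.get("default"):
            return name, settings
    raise LookupError("no server is marked default")


# The YAML above, as an already-parsed dict (auth omitted for brevity).
config = {
    "servers": {
        "local": {"default": True, "url": "http://your-spark-history-server:18080"}
    },
    "mcp": {"transports": ["streamable-http"], "port": "18888", "debug": True},
}

name, settings = default_server(config)
print(resolve_config_path(), name, settings["url"])
```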

📸 Screenshots

🔍 Get Spark Application

[Screenshot: Get Application]

⚡ Job Performance Comparison

[Screenshot: Job Comparison]

🛠️ MCP Tools

Note: These tools are subject to change as we scale and improve the performance of the MCP server.

The MCP server provides 18 specialized tools organized by analysis patterns. LLMs can intelligently select and combine these tools based on user queries:

📊 Application Information

Basic application metadata and overview

| 🔧 Tool | 📝 Description |
|---|---|
| list_applications | 📋 Get a list of all applications available on the Spark History Server with optional filtering by status, date ranges, and limits |
| get_application | 📊 Get detailed information about a specific Spark application including status, resource usage, duration, and attempt details |

🔗 Job Analysis

Job-level performance analysis and identification

| 🔧 Tool | 📝 Description |
|---|---|
| list_jobs | 🔗 Get a list of all jobs for a Spark application with optional status filtering |
| list_slowest_jobs | ⏱️ Get the N slowest jobs for a Spark application (excludes running jobs by default) |

⚡ Stage Analysis

Stage-level performance deep dive and task metrics

| 🔧 Tool | 📝 Description |
|---|---|
| list_stages | ⚡ Get a list of all stages for a Spark application with optional status filtering and summaries |
| list_slowest_stages | 🐌 Get the N slowest stages for a Spark application (excludes running stages by default) |
| get_stage | 🎯 Get information about a specific stage with optional attempt ID and summary metrics |
| get_stage_task_summary | 📊 Get statistical distributions of task metrics for a specific stage (execution times, memory usage, I/O metrics) |
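
To make "statistical distributions of task metrics" concrete, the sketch below summarizes a list of task durations with nearest-rank quantiles and a simple skew ratio (slowest task divided by the median). The quantile levels, function names, and sample values are illustrative assumptions; the actual output format of get_stage_task_summary is not described in this document.

```python
import math


def quantile(sorted_values, q):
    """Nearest-rank quantile of an ascending list, for 0 <= q <= 1."""
    if not sorted_values:
        raise ValueError("no values")
    rank = max(1, math.ceil(q * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize(durations_ms, levels=(0.05, 0.25, 0.5, 0.75, 0.95)):
    """Return quantiles plus max/median, a rough indicator of task skew."""
    values = sorted(durations_ms)
    summary = {q: quantile(values, q) for q in levels}
    median = quantile(values, 0.5)
    summary["skew"] = values[-1] / median if median else float("inf")
    return summary


# Illustrative data: nine similar tasks and one straggler.
tasks = [100, 110, 105, 98, 102, 101, 99, 103, 104, 900]
print(summarize(tasks))
```

A skew ratio far above 1, as in this sample, is the kind of signal that usually points to a straggler task or uneven partitioning.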

🖥️ Executor & Resource Analysis

Resource utilization, executor performance, and allocation tracking

| 🔧 Tool | 📝 Description |
|---|---|
| list_executors | 🖥️ Get executor information with optional inactive executor inclusion |
| get_executor | 🔍 Get information about a specific executor including resource allocation, task statistics, and performance metrics |
| get_executor_summary | 📈 Aggregates metrics across all executors (memory usage, disk usage, task counts, performance metrics) |
| get_resource_usage_timeline | 📅 Get chronological view of resource allocation and usage patterns including executor additions/removals |

⚙️ Configuration & Environment

Spark configuration, environment variables, and runtime settings

| 🔧 Tool | 📝 Description |
|---|---|
| get_environment | ⚙️ Get comprehensive Spark runtime configuration including JVM info, Spark properties, system properties, and classpath |

🔎 SQL & Query Analysis

SQL performance analysis and execution plan comparison

| 🔧 Tool | 📝 Description |
|---|---|
| list_slowest_sql_queries | 🐌 Get the top N slowest SQL queries for an application with detailed execution metrics and optional plan descriptions |
| compare_sql_execution_plans | 🔍 Compare SQL execution plans between two Spark jobs, analyzing logical/physical plans and execution metrics |

🚨 Performance & Bottleneck Analysis

Intelligent bottleneck identification and performance recommendations

| 🔧 Tool | 📝 Description |
|---|---|
| get_job_bottlenecks | 🚨 Identify performance bottlenecks by analyzing stages, tasks, and executors with actionable recommendations |

🔄 Comparative Analysis

Cross-application comparison for regression detection and optimization

| 🔧 Tool | 📝 Description |
|---|---|
| compare_job_environments | ⚙️ Compare Spark environment configurations between two jobs to identify differences in properties and settings |
| compare_job_performance | 📈 Compare performance metrics between two Spark jobs including execution times, resource usage, and task distribution |
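
Comparing two environments comes down to a keyed diff of their properties. The sketch below shows one way such a diff could be structured; it is not the implementation of compare_job_environments, and the function name and sample Spark properties are hypothetical.

```python
def diff_properties(left, right):
    """Classify keys as only-left, only-right, or changed between two dicts."""
    only_left = {k: left[k] for k in left.keys() - right.keys()}
    only_right = {k: right[k] for k in right.keys() - left.keys()}
    changed = {
        k: (left[k], right[k])
        for k in left.keys() & right.keys()
        if left[k] != right[k]
    }
    return {"only_left": only_left, "only_right": only_right, "changed": changed}


# Hypothetical Spark properties for two runs of the same job.
yesterday = {"spark.executor.memory": "8g", "spark.sql.shuffle.partitions": "200"}
today = {
    "spark.executor.memory": "4g",
    "spark.sql.shuffle.partitions": "200",
    "spark.dynamicAllocation.enabled": "true",
}

print(diff_properties(yesterday, today))
```

In this sample the diff reports the halved executor memory as changed and the dynamic allocation flag as present only in the second run, which is the sort of difference worth checking first when a job regresses.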

🤖 How LLMs Use These Tools

Query Pattern Examples:

  • "Show me all applications between 12 AM and 1 AM on 2025-06-27" → list_applications
  • "Why is my job slow?" → get_job_bottlenecks + list_slowest_stages + get_executor_summary
  • "Compare today vs yesterday" → compare_job_performance + compare_job_environments
  • "What's wrong with stage 5?" → get_stage + get_stage_task_summary
  • "Show me resource usage over time" → get_resource_usage_timeline + get_executor_summary
  • "Find my slowest SQL queries" → list_slowest_sql_queries + compare_sql_execution_plans

📔 AWS Integration Guides

If you are an existing AWS user looking to analyze your Spark applications, the project provides detailed setup guides with step-by-step instructions for setting up the Spark History Server MCP with your AWS services.

🚀 Kubernetes Deployment

Deploy using Kubernetes with Helm:

⚠️ Work in Progress: We are still testing and will soon publish the container image and Helm registry to GitHub for easy deployment.

# 📦 Deploy with Helm
helm install spark-history-mcp ./deploy/kubernetes/helm/spark-history-mcp/

# 🎯 Production configuration
helm install spark-history-mcp ./deploy/kubernetes/helm/spark-history-mcp/ \
  --set replicaCount=3 \
  --set autoscaling.enabled=true \
  --set monitoring.enabled=true

📚 See deploy/kubernetes/helm/ for complete deployment manifests and configuration options.

Note: When using Secret Store CSI Driver authentication, you must create a SecretProviderClass externally before deploying the chart.

🌐 Multi-Spark History Server Setup

Set up multiple Spark History Servers in config.yaml and choose which server the LLM should interact with for each query.

servers:
  production:
    default: true
    url: "http://prod-spark-history:18080"
    auth:
      username: "user"
      password: "pass"
  staging:
    url: "http://staging-spark-history:18080"

💁 User Query: "Can you get application <app_id> using production server?"

🤖 AI Tool Request:

{
  "app_id": "<app_id>",
  "server": "production"
}

🤖 AI Tool Response:

{
  "id": "<app_id>",
  "name": "app_name",
  "coresGranted": null,
  "maxCores": null,
  "coresPerExecutor": null,
  "memoryPerExecutorMB": null,
  "attempts": [
    {
      "attemptId": null,
      "startTime": "2023-09-06T04:44:37.006000Z",
      "endTime": "2023-09-06T04:45:40.431000Z",
      "lastUpdated": "2023-09-06T04:45:42Z",
      "duration": 63425,
      "sparkUser": "spark",
      "appSparkVersion": "3.3.0",
      "completed": true
    }
  ]
}
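
Under the Model Context Protocol, a tool invocation like the request above travels as a JSON-RPC 2.0 tools/call message. The sketch below wraps the arguments in that envelope and then cross-checks the sample response by recomputing duration from startTime and endTime. The tool name get_application comes from the tool list in this document; the placeholder application ID app-123, the request ID, and the helper names are assumptions made for illustration.

```python
import json
from datetime import datetime, timedelta


def tool_call(name, arguments, request_id=1):
    """Wrap tool arguments in an MCP `tools/call` JSON-RPC 2.0 request."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


def parse_utc(timestamp):
    """Parse an ISO 8601 timestamp with a trailing 'Z' as UTC."""
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))


def attempt_duration_ms(attempt):
    """Recompute an attempt's duration in whole milliseconds."""
    delta = parse_utc(attempt["endTime"]) - parse_utc(attempt["startTime"])
    return delta // timedelta(milliseconds=1)


# "app-123" is a placeholder application ID.
request = tool_call("get_application", {"app_id": "app-123", "server": "production"})
print(json.dumps(request, indent=2))

# Timestamps and duration copied from the sample response above.
attempt = {
    "startTime": "2023-09-06T04:44:37.006000Z",
    "endTime": "2023-09-06T04:45:40.431000Z",
    "duration": 63425,
}
print(attempt_duration_ms(attempt) == attempt["duration"])
```

The recomputed value agrees with the reported duration of 63425, which confirms that the field is expressed in milliseconds.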

🔐 Environment Variables

SHS_MCP_PORT - Port for MCP server (default: 18888)
SHS_MCP_DEBUG - Enable debug mode (default: false)
SHS_MCP_ADDRESS - Address for MCP server (default: localhost)
SHS_MCP_TRANSPORT - MCP transport mode (default: streamable-http)
SHS_SERVERS_*_URL - URL for a specific server
SHS_SERVERS_*_AUTH_USERNAME - Username for a specific server
SHS_SERVERS_*_AUTH_PASSWORD - Password for a specific server
SHS_SERVERS_*_AUTH_TOKEN - Token for a specific server
SHS_SERVERS_*_VERIFY_SSL - Whether to verify SSL for a specific server (true/false)
SHS_SERVERS_*_TIMEOUT - HTTP request timeout in seconds for a specific server (default: 30)
SHS_SERVERS_*_EMR_CLUSTER_ARN - EMR cluster ARN for a specific server
SHS_SERVERS_*_INCLUDE_PLAN_DESCRIPTION - Whether to include SQL execution plans by default for a specific server (true/false, default: false)
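
The * in the SHS_SERVERS_* variables stands for a server name. The sketch below shows one way such variables could be grouped into per-server settings. Treating the wildcard as an upper-case server name that is lower-cased on parsing is an assumption of this example, and servers_from_env is a hypothetical helper, not the project's loader.

```python
import re

# Field suffixes taken from the variable list above.
FIELDS = (
    "URL", "AUTH_USERNAME", "AUTH_PASSWORD", "AUTH_TOKEN", "VERIFY_SSL",
    "TIMEOUT", "EMR_CLUSTER_ARN", "INCLUDE_PLAN_DESCRIPTION",
)
PATTERN = re.compile(r"^SHS_SERVERS_(?P<name>.+)_(?P<field>%s)$" % "|".join(FIELDS))


def servers_from_env(environ):
    """Group SHS_SERVERS_<NAME>_<FIELD> variables by server name.

    Lower-casing the server name is an assumption of this sketch.
    """
    servers = {}
    for key, value in environ.items():
        match = PATTERN.match(key)
        if match:
            name = match.group("name").lower()
            servers.setdefault(name, {})[match.group("field").lower()] = value
    return servers


env = {
    "SHS_SERVERS_PRODUCTION_URL": "http://prod-spark-history:18080",
    "SHS_SERVERS_PRODUCTION_AUTH_USERNAME": "user",
    "SHS_SERVERS_US_EAST_TIMEOUT": "30",
    "SHS_MCP_PORT": "18888",  # not a per-server variable, so it is ignored
}
print(servers_from_env(env))
```

Matching against the known field suffixes keeps server names that contain underscores, such as US_EAST here, in one piece.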

🤖 AI Agent Integration

Quick Start Options

| Integration | Transport | Best For |
|---|---|---|
| Local Testing | HTTP | Development, testing tools |
| Claude Desktop | STDIO | Interactive analysis |
| Amazon Q CLI | STDIO | Command-line automation |
| Kiro | HTTP | IDE integration, code-centric analysis |
| LangGraph | HTTP | Multi-agent workflows |
| Strands Agents | HTTP | Multi-agent workflows |

Tip: The shs CLI can also generate a skill file for coding agents that support tool use:

shs setup skill > ~/.claude/skills/spark-history.md

This gives agents like Claude Code direct access to Spark History Server queries without the MCP server. See a real-world example of Claude Code using shs to compare two TPC-DS 3TB benchmark runs, dispatching subagents in parallel for per-query root cause analysis.

🎯 Example Use Cases

🔍 Performance Investigation

🤖 AI Query: "Why is my ETL job running slower than usual?"

📊 MCP Actions:
✅ Analyze application metrics
✅ Compare with historical performance
✅ Identify bottleneck stages
✅ Generate optimization recommendations

🚨 Failure Analysis

🤖 AI Query: "What caused job 42 to fail?"

🔍 MCP Actions:
✅ Examine failed tasks and error messages
✅ Review executor logs and resource usage
✅ Identify root cause and suggest fixes

📈 Comparative Analysis

🤖 AI Query: "Compare today's batch job with yesterday's run"

📊 MCP Actions:
✅ Compare execution times and resource usage
✅ Identify performance deltas
✅ Highlight configuration differences

Development Setup

git clone https://github.com/kubeflow/mcp-apache-spark-history-server.git
cd mcp-apache-spark-history-server

# Install Task (if not already installed)
brew install go-task  # macOS, see https://taskfile.dev/installation/ for others

# Start Spark History Server with sample data and MCP server
task start-spark-bg            # Default Spark 3.5.5
task start-mcp-bg

# Optional: MCP Inspector on http://localhost:6274
task start-inspector-bg

# When done
task stop-all

🌍 Adopters

Are you using MCP Apache Spark History Server? We'd love to know! Add your organization or name to our ADOPTERS.md and help grow the community.

🤝 Contributing

Check CONTRIBUTING.md for full guidelines on contributions.

📄 License

Apache License 2.0 - see LICENSE file for details.

📝 Trademark Notice

This project is built for use with Apache Spark™ History Server. Not affiliated with or endorsed by the Apache Software Foundation.


🔥 Connect your Spark infrastructure to AI agents

Built by the community, for the community 💙
