
spark-connect-mcp

MCP server exposing Apache Spark Connect (and Databricks Connect) via DataFrame and SQL tools for AI agents.

Install

Choose one backend — do not install both.

# OSS Spark Connect
pip install "spark-connect-mcp[spark]"

# Databricks Connect
pip install "spark-connect-mcp[databricks]"

Quick Start

Add to your Claude Code MCP config:

{
  "mcpServers": {
    "spark": {
      "command": "uvx",
      "args": ["--from", "spark-connect-mcp[databricks]", "spark-connect-mcp"]
    }
  }
}

For OSS Spark Connect, replace [databricks] with [spark].

Configuration

All connection config is set via environment variables — the MCP tools require no parameters to start a session.

OSS Spark Connect

Set SPARK_REMOTE to your Spark Connect server URL (PySpark's native env var):

export SPARK_REMOTE=sc://localhost:15002
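For reference, a PySpark client picks this variable up automatically — no explicit `.remote(...)` call is needed when `SPARK_REMOTE` is set (sketch assuming `pyspark` with the Connect extras is installed and a Spark Connect server is listening on that address):

```python
from pyspark.sql import SparkSession

# With SPARK_REMOTE set (e.g. sc://localhost:15002), the builder
# connects to the remote Spark Connect server instead of starting
# a local JVM.
spark = SparkSession.builder.getOrCreate()
spark.range(5).show()
```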

Databricks Connect

Optionally set DATABRICKS_CONFIG_PROFILE to select a profile from ~/.databrickscfg (defaults to DEFAULT):

export DATABRICKS_CONFIG_PROFILE=my-workspace

Serverless compute is used by default inside Databricks Apps, Jobs, and notebooks — no env var needed.

Preflight Size Checks

Before executing an action tool (show, collect, count, describe, save, save_as_table), spark-connect-mcp runs a lightweight preflight check that inspects the statistics on Spark's Catalyst optimized plan to estimate the result size without triggering a Spark job. If the estimate exceeds the configurable thresholds, the tool returns a warning instead of executing.

Prerequisites — making statistics available

Preflight relies on Catalyst Cost-Based Optimization (CBO) statistics. How you populate them depends on your environment:

| Environment | How to get statistics |
| --- | --- |
| Databricks UC managed tables | Enable Predictive Optimization — stats are computed automatically. |
| External / unmanaged tables | Run `ANALYZE TABLE <table> COMPUTE STATISTICS FOR ALL COLUMNS`. |
| OSS Spark | Set `spark.sql.cbo.enabled=true` and `spark.sql.cbo.planStats.enabled=true`, then run `ANALYZE TABLE`. |

Confidence tiers

The quality of the estimate depends on what statistics are present in the plan:

| Tier | Condition | Behaviour |
| --- | --- | --- |
| High | Root node has sizeInBytes + rowCount, and every join node has rowCount | Blocks if thresholds exceeded |
| Medium | Root has rowCount but some join nodes are missing rowCount | Uses 10× the configured thresholds before blocking |
| Low | Root has sizeInBytes only, no rowCount | Fail-open — warns but does not block |
| Cross-join | Plan contains CartesianProduct | Always warns regardless of thresholds |
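The tier rules above can be sketched as follows (a simplified, hypothetical model — `PlanStats`, `preflight`, and `Verdict` are illustrative names, not the server's actual API):

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    PROCEED = "proceed"
    WARN = "warn"
    BLOCK = "block"


@dataclass
class PlanStats:
    # Statistics read off the Catalyst optimized plan (illustrative shape)
    size_in_bytes: int | None
    row_count: int | None
    joins_have_row_counts: bool
    has_cartesian_product: bool


def preflight(stats: PlanStats, max_bytes: int, max_rows: int) -> Verdict:
    # Cross-join tier: always warn, regardless of thresholds
    if stats.has_cartesian_product:
        return Verdict.WARN

    over = (stats.size_in_bytes or 0) > max_bytes or (stats.row_count or 0) > max_rows

    # High tier: full statistics present -> block when thresholds are exceeded
    if (
        stats.size_in_bytes is not None
        and stats.row_count is not None
        and stats.joins_have_row_counts
    ):
        return Verdict.BLOCK if over else Verdict.PROCEED

    # Medium tier: root rowCount known but some joins lack stats -> only
    # block once the estimate exceeds 10x the configured thresholds
    if stats.row_count is not None:
        over_10x = (
            (stats.size_in_bytes or 0) > 10 * max_bytes
            or stats.row_count > 10 * max_rows
        )
        if over_10x:
            return Verdict.BLOCK
        return Verdict.WARN if over else Verdict.PROCEED

    # Low tier: sizeInBytes only -> fail open, warn but never block
    return Verdict.WARN if over else Verdict.PROCEED
```

Note the deliberate asymmetry: the sparser the statistics, the less willing the check is to block outright, but an over-threshold estimate always at least surfaces a warning.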

Threshold configuration

Set via environment variables (defaults shown):

# Maximum estimated bytes before warning (default 1 GB)
export SPARK_CONNECT_MCP_PREFLIGHT_MAX_BYTES=1073741824

# Maximum estimated rows before warning (default 10 million)
export SPARK_CONNECT_MCP_PREFLIGHT_MAX_ROWS=10000000

# Disable preflight entirely
export SPARK_CONNECT_MCP_PREFLIGHT_ENABLED=false
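A server reading these variables might resolve them roughly like this (a hedged sketch — `_env_int` and the constant names are illustrative, not the package's internals; only the env var names and defaults come from the docs above):

```python
import os


def _env_int(name: str, default: int) -> int:
    # Thresholds come from the environment; fall back to the documented defaults
    return int(os.environ.get(name, default))


PREFLIGHT_MAX_BYTES = _env_int("SPARK_CONNECT_MCP_PREFLIGHT_MAX_BYTES", 1 << 30)  # 1073741824
PREFLIGHT_MAX_ROWS = _env_int("SPARK_CONNECT_MCP_PREFLIGHT_MAX_ROWS", 10_000_000)
# Anything other than an explicit "false" leaves preflight enabled
PREFLIGHT_ENABLED = (
    os.environ.get("SPARK_CONNECT_MCP_PREFLIGHT_ENABLED", "true").lower() != "false"
)
```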

Per-session overrides

Use the set_preflight_threshold tool to adjust thresholds for a single session without changing env vars:

{
  "tool": "set_preflight_threshold",
  "arguments": {
    "session_id": "abc123",
    "max_bytes": 5368709120,
    "max_rows": 50000000
  }
}

Pass "enabled": false to disable preflight for that session.

The force escape hatch

Every action tool accepts a force parameter. Pass "force": true to skip the preflight check entirely and execute immediately:

{
  "tool": "collect",
  "arguments": { "df_id": "df-001", "limit": 100, "force": true }
}

SQL Tool

The sql tool executes a SQL query against an active Spark session. By default it enforces read-only SQL — only SELECT, WITH...SELECT, SHOW, DESCRIBE, and EXPLAIN statements are permitted. Write operations (INSERT, UPDATE, DELETE, DROP, CREATE, ALTER, MERGE, TRUNCATE, COPY INTO, OPTIMIZE, VACUUM, etc.) are rejected before reaching Spark.

Multi-statement SQL (e.g. SELECT 1; DROP TABLE foo) is also blocked — submit one statement at a time.

Malformed SQL that cannot be parsed is rejected fail-closed — the query is never executed.
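The shape of this kind of fail-closed allowlist can be illustrated as follows (a deliberately simplified sketch — `is_read_only` is a hypothetical helper, and the real server parses the SQL rather than scanning the first keyword, so this toy version would miss cases like a `;` inside a string literal or a `WITH` that leads into an `INSERT`):

```python
READ_ONLY_PREFIXES = ("select", "with", "show", "describe", "desc", "explain")


def is_read_only(sql: str) -> bool:
    """Return True only for apparently read-only, single-statement SQL."""
    stripped = sql.strip().rstrip(";")
    if not stripped:
        # Fail closed: empty / unparseable input is never executed
        return False
    if ";" in stripped:
        # Multi-statement SQL is always rejected
        return False
    first_word = stripped.split(None, 1)[0].lower()
    return first_word in READ_ONLY_PREFIXES
```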

Allowing write SQL

To permit write operations, set the escape-hatch environment variable:

export SPARK_CONNECT_MCP_ALLOW_WRITE_SQL=true

This bypasses all read-only enforcement. Intended for trusted environments where the agent needs DDL or DML access.

Status

Under active development. See issues for the roadmap.

License

Apache-2.0
