# spark-connect-mcp

MCP server exposing Apache Spark Connect (and Databricks Connect) via DataFrame and SQL tools for AI agents.
## Install

Choose one backend — do not install both.

```shell
# OSS Spark Connect
pip install "spark-connect-mcp[spark]"

# Databricks Connect
pip install "spark-connect-mcp[databricks]"
```
## Quick Start

Add to your Claude Code MCP config:

```json
{
  "mcpServers": {
    "spark": {
      "command": "uvx",
      "args": ["--from", "spark-connect-mcp[databricks]", "spark-connect-mcp"]
    }
  }
}
```

For OSS Spark Connect, replace `[databricks]` with `[spark]`.
## Configuration

All connection config is set via environment variables — the MCP tools require no parameters to start a session.

### OSS Spark Connect

Set `SPARK_REMOTE` to your Spark Connect server URL (PySpark's native env var):

```shell
export SPARK_REMOTE=sc://localhost:15002
```

### Databricks Connect

Optionally set `DATABRICKS_CONFIG_PROFILE` to select a profile from `~/.databrickscfg` (defaults to `DEFAULT`):

```shell
export DATABRICKS_CONFIG_PROFILE=my-workspace
```

Serverless compute is used by default inside Databricks Apps, Jobs, and notebooks — no env var needed.
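Since only one backend extra is installed at a time, backend selection follows the environment. The sketch below is purely illustrative — the helper name `resolve_backend`, its return shape, and the runtime precedence of `SPARK_REMOTE` over the Databricks profile are assumptions, not the server's actual code:

```python
import os

def resolve_backend():
    """Hypothetical helper: pick a backend from the env vars described above.

    SPARK_REMOTE (PySpark's native env var) selects OSS Spark Connect;
    otherwise Databricks Connect is assumed, with DATABRICKS_CONFIG_PROFILE
    choosing a ~/.databrickscfg profile (falling back to DEFAULT).
    """
    remote = os.environ.get("SPARK_REMOTE")
    if remote:
        return ("spark-connect", remote)
    profile = os.environ.get("DATABRICKS_CONFIG_PROFILE", "DEFAULT")
    return ("databricks-connect", profile)

# Example: with SPARK_REMOTE set, the OSS Spark Connect backend is chosen.
os.environ["SPARK_REMOTE"] = "sc://localhost:15002"
print(resolve_backend())  # ('spark-connect', 'sc://localhost:15002')
```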
## Preflight Size Checks

Before executing an action tool (`show`, `collect`, `count`, `describe`, `save`, `save_as_table`), spark-connect-mcp runs a lightweight preflight check that inspects the Spark Catalyst optimized-plan statistics to estimate result size without triggering a Spark job. If the estimate exceeds configurable thresholds, the tool returns a warning instead of executing.
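In outline, the threshold comparison itself is simple. The sketch below is illustrative only — the function name `preflight` and its argument/return shapes are assumptions, and the real server derives its estimates from the Catalyst optimized plan rather than taking them as arguments:

```python
def preflight(est_bytes, est_rows,
              max_bytes=1_073_741_824,  # default 1 GB
              max_rows=10_000_000):     # default 10 million
    """Compare plan-derived size estimates against configured thresholds.

    Returns a list of warning strings if any threshold is exceeded,
    or None if the action may proceed. Missing statistics (None) are
    simply skipped here.
    """
    warnings = []
    if est_bytes is not None and est_bytes > max_bytes:
        warnings.append(f"estimated {est_bytes} bytes exceeds limit of {max_bytes}")
    if est_rows is not None and est_rows > max_rows:
        warnings.append(f"estimated {est_rows} rows exceeds limit of {max_rows}")
    return warnings or None

print(preflight(2 * 1024**3, 500))  # byte threshold exceeded -> one warning
print(preflight(1024, 500))         # within limits -> None
```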
### Prerequisites — making statistics available

Preflight relies on Catalyst cost-based optimization (CBO) statistics. How you populate them depends on your environment:

| Environment | How to get statistics |
|---|---|
| Databricks UC managed tables | Enable Predictive Optimization — stats are computed automatically. |
| External / unmanaged tables | Run `ANALYZE TABLE <table> COMPUTE STATISTICS FOR ALL COLUMNS`. |
| OSS Spark | Set `spark.sql.cbo.enabled=true` and `spark.sql.cbo.planStats.enabled=true`, then run `ANALYZE TABLE`. |
### Confidence tiers

The quality of the estimate depends on what statistics are present in the plan:

| Tier | Condition | Behaviour |
|---|---|---|
| High | Root node has `sizeInBytes` + `rowCount`, and every join node has `rowCount` | Blocks if thresholds exceeded |
| Medium | Root has `rowCount` but some join nodes are missing `rowCount` | Uses 10× the configured thresholds before blocking |
| Low | Root has `sizeInBytes` only, no `rowCount` | Fail-open — warns but does not block |
| Cross-join | Plan contains `CartesianProduct` | Always warns regardless of thresholds |
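The tier decision above can be expressed as a small function. This is a hedged sketch — the names (`Tier`, `classify_tier`) and the exact inputs are assumptions about how the plan statistics might be summarized, not the server's real API:

```python
from enum import Enum

class Tier(Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"
    CROSS_JOIN = "cross-join"

def classify_tier(root_size_bytes, root_row_count, join_row_counts,
                  has_cartesian_product):
    """Map the statistics present in an optimized plan to a confidence tier,
    following the table above. join_row_counts has one entry per join node
    (None when that node lacks a rowCount)."""
    if has_cartesian_product:
        return Tier.CROSS_JOIN  # always warn, regardless of thresholds
    if root_row_count is None:
        return Tier.LOW         # root has sizeInBytes only: fail-open
    if root_size_bytes is not None and all(
            rc is not None for rc in join_row_counts):
        return Tier.HIGH        # full stats: block on thresholds
    return Tier.MEDIUM          # partial stats: 10x thresholds apply

# One join node is missing its rowCount, so confidence drops to Medium.
print(classify_tier(1_000_000, 5_000, [100, None], False))  # Tier.MEDIUM
```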
### Threshold configuration

Set via environment variables (defaults shown):

```shell
# Maximum estimated bytes before warning (default 1 GB)
export SPARK_CONNECT_MCP_PREFLIGHT_MAX_BYTES=1073741824

# Maximum estimated rows before warning (default 10 million)
export SPARK_CONNECT_MCP_PREFLIGHT_MAX_ROWS=10000000

# Disable preflight entirely
export SPARK_CONNECT_MCP_PREFLIGHT_ENABLED=false
```
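Loading these variables amounts to reading the environment with the documented defaults. A minimal sketch, assuming a helper named `load_preflight_config` (the name and the exact boolean parsing are illustrative assumptions):

```python
import os

def load_preflight_config():
    """Read the preflight env vars, falling back to the documented defaults.

    In this sketch, any value other than the literal string "false"
    (case-insensitive) leaves preflight enabled.
    """
    env = os.environ
    return {
        "enabled": env.get("SPARK_CONNECT_MCP_PREFLIGHT_ENABLED",
                           "true").lower() != "false",
        "max_bytes": int(env.get("SPARK_CONNECT_MCP_PREFLIGHT_MAX_BYTES",
                                 "1073741824")),   # 1 GB
        "max_rows": int(env.get("SPARK_CONNECT_MCP_PREFLIGHT_MAX_ROWS",
                                "10000000")),      # 10 million
    }

print(load_preflight_config())  # defaults, unless the env vars above are set
```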
### Per-session overrides

Use the `set_preflight_threshold` tool to adjust thresholds for a single session without changing env vars:

```json
{
  "tool": "set_preflight_threshold",
  "arguments": {
    "session_id": "abc123",
    "max_bytes": 5368709120,
    "max_rows": 50000000
  }
}
```

Pass `"enabled": false` to disable preflight for that session.
### The `force` escape hatch

Every action tool accepts a `force` parameter. Pass `force=true` to skip the preflight check entirely and execute immediately:

```json
{
  "tool": "collect",
  "arguments": { "df_id": "df-001", "limit": 100, "force": true }
}
```
## SQL Tool

The `sql` tool executes a SQL query against an active Spark session. By default it enforces read-only SQL — only `SELECT`, `WITH ... SELECT`, `SHOW`, `DESCRIBE`, and `EXPLAIN` statements are permitted. Write operations (`INSERT`, `UPDATE`, `DELETE`, `DROP`, `CREATE`, `ALTER`, `MERGE`, `TRUNCATE`, `COPY INTO`, `OPTIMIZE`, `VACUUM`, etc.) are rejected before reaching Spark.

Multi-statement SQL (e.g. `SELECT 1; DROP TABLE foo`) is also blocked — submit one statement at a time.

Malformed SQL that cannot be parsed is rejected fail-closed — the query is never executed.
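A much-simplified version of this gate can be sketched in Python. The real server parses the statement properly; this illustration only checks the leading keyword and rejects anything containing multiple statements (a bare prefix check cannot catch, say, a `WITH ... INSERT`, which is exactly why real parsing matters):

```python
import re

# Statement types permitted in read-only mode (per the list above).
READ_ONLY_KEYWORDS = {"SELECT", "WITH", "SHOW", "DESCRIBE", "EXPLAIN"}

def is_read_only(sql: str) -> bool:
    """Illustrative, simplified read-only gate.

    Fails closed: empty, multi-statement, or unparseable input is rejected.
    """
    stripped = sql.strip().rstrip(";").strip()
    if not stripped or ";" in stripped:  # multi-statement input: blocked
        return False
    first = re.match(r"[A-Za-z]+", stripped)
    if first is None:                    # no leading keyword: fail closed
        return False
    return first.group(0).upper() in READ_ONLY_KEYWORDS

print(is_read_only("SELECT * FROM t"))           # True
print(is_read_only("SELECT 1; DROP TABLE foo"))  # False (multi-statement)
print(is_read_only("DROP TABLE foo"))            # False (write operation)
```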
### Allowing write SQL

To permit write operations, set the escape-hatch environment variable:

```shell
export SPARK_CONNECT_MCP_ALLOW_WRITE_SQL=true
```

This bypasses all read-only enforcement. It is intended for trusted environments where the agent needs DDL or DML access.
## Status

Under active development. See the issue tracker for the roadmap.

## License

Apache-2.0