Minimal PySpark MCP server inspired by LakeSail

PySpark MCP Server

Description

PySpark MCP Server is a lightweight server implementation of the Model Context Protocol (MCP) for Apache Spark.

The primary purpose of this MCP server is to facilitate query optimization using AI systems. It exposes Spark's logical and physical query plans, along with additional query plan information, to AI systems for analysis. The server also exposes catalog and table information, which enables data discovery in data lakes powered by Spark.

Quick Start

Installation

pip install pyspark-mcp

Running the Server

After installation, use the pyspark-mcp command to start the server:

pyspark-mcp --master "local[*]" --host 127.0.0.1 --port 8090

The CLI automatically handles spark-submit configuration. All standard spark-submit options are supported:

# With additional Spark configuration
pyspark-mcp --master "local[*]" --conf spark.driver.memory=4g

# YARN cluster mode
pyspark-mcp --master yarn --deploy-mode client --num-executors 4

# With additional JARs
pyspark-mcp --master "local[*]" --jars /path/to/connector.jar

# Preview the spark-submit command without running
pyspark-mcp --master "local[*]" --dry-run

# With GraphFrames package
pyspark-mcp --master "local[*]" --packages io.graphframes:graphframes-spark3_2.12:0.10.1

CLI Options

| Option | Default | Description |
|---|---|---|
| --master | local[*] | Spark master URL |
| --host | 127.0.0.1 | MCP server host address |
| --port | 8090 | MCP server port number |
| --spark-submit | spark-submit | Path to spark-submit executable |
| --dry-run | - | Print command without executing |

All spark-submit options (--conf, --jars, --packages, --executor-memory, etc.) are passed through automatically.
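
The pass-through behaviour described above can be pictured with a small sketch. It is not the project's actual implementation: `build_spark_submit_command` and the `server.py` entry point are hypothetical names, and the sketch only assumes that the wrapper parses the options from the table itself and forwards everything else to spark-submit unchanged.

```python
import argparse


def build_spark_submit_command(argv, entry_point="server.py"):
    """Split wrapper options from spark-submit options and build the command.

    `entry_point` is a placeholder for the script that starts the MCP server.
    """
    # allow_abbrev=False keeps argparse from guessing that an unknown
    # spark-submit flag is an abbreviation of one of the wrapper's options.
    parser = argparse.ArgumentParser(
        prog="pyspark-mcp", add_help=False, allow_abbrev=False
    )
    parser.add_argument("--master", default="local[*]")
    parser.add_argument("--host", default="127.0.0.1")
    parser.add_argument("--port", type=int, default=8090)
    parser.add_argument("--spark-submit", dest="spark_submit", default="spark-submit")
    parser.add_argument("--dry-run", dest="dry_run", action="store_true")

    # parse_known_args returns unrecognized arguments untouched, so options
    # such as --conf or --jars can be handed to spark-submit as they are.
    known, passthrough = parser.parse_known_args(argv)

    command = [
        known.spark_submit,
        "--master", known.master,
        *passthrough,
        entry_point,
        "--host", known.host,
        "--port", str(known.port),
    ]
    return command, known.dry_run
```

For a dry run, the returned list could be rendered with `shlex.join(command)` and printed instead of being executed.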

Adding the running MCP server to Claude Code

# Each Claude instance needs its own server, running on a different port
claude mcp add --transport http pyspark-mcp http://127.0.0.1:8090/mcp
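
When several Claude instances are in use, it can help to generate the matching server and registration commands up front so that no two share a port. The sketch below is illustrative only: `commands_for_instances` and the `pyspark-mcp-<n>` naming scheme are assumptions, while the command shapes are copied from the examples in this README.

```python
def commands_for_instances(count, base_port=8090, host="127.0.0.1"):
    """Return (server_command, claude_command) pairs, one per Claude instance."""
    pairs = []
    for offset in range(count):
        port = base_port + offset  # each instance gets its own port
        name = f"pyspark-mcp-{offset + 1}"  # hypothetical naming scheme
        server = f'pyspark-mcp --master "local[*]" --host {host} --port {port}'
        claude = f"claude mcp add --transport http {name} http://{host}:{port}/mcp"
        pairs.append((server, claude))
    return pairs


if __name__ == "__main__":
    for server_cmd, claude_cmd in commands_for_instances(2):
        print(server_cmd)
        print(claude_cmd)
```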

Dependencies

  • Python >=3.11,<4.0
  • fastmcp >= 2.10.6
  • loguru
  • pyspark >= 3.5
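
A quick way to see whether an environment matches this list is to inspect it from Python. The helper names below are hypothetical, and the sketch only reports which versions are installed; it does not compare them against the minimum versions listed above.

```python
import sys
from importlib import metadata


def python_is_supported(version_info=sys.version_info):
    """True when the interpreter satisfies Python >=3.11,<4.0."""
    return (3, 11) <= tuple(version_info[:2]) < (4, 0)


def installed_versions(packages=("fastmcp", "loguru", "pyspark")):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    print("Python supported:", python_is_supported())
    for package, version in installed_versions().items():
        print(f"{package}: {version or 'not installed'}")
```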

Bundled MCP tools

The following tools are included in the PySpark MCP Server:

| MCP Tool | Description |
|---|---|
| Get the version of PySpark | Get the version number from the current PySpark session |
| Get Analyzed Plan of the query | Extracts an analyzed logical plan from the provided SQL query |
| Get Optimized Plan of the query | Extracts an optimized logical plan from the provided SQL query |
| Get size estimation for the query results | Extracts the size and its unit from the query plan's explain output |
| Get tables from the query plan | Extracts all the tables (relations) from the query plan's explain output |
| Get the current Spark Catalog | Get the default catalog of the current SparkSession |
| Check does database exist | Check whether a database with the given name exists in the current catalog |
| Get the current default database | Get the current default database from the default catalog |
| List all the databases in the current catalog | List all the available databases from the current catalog |
| List available catalogs | List all the catalogs available in the current SparkSession |
| List tables in the current catalog | List all the available tables in the current Spark catalog |
| Get a comment of the table | Extracts the table's comment, or returns an empty string if there is none |
| Get table schema | Get the Spark schema of the table in the catalog |
| Returns a schema of the result of the SQL query | Runs the query, takes the schema of the result, and returns it as a JSON value |
| Read first N lines of the text file | Reads the first N lines of the file as plain text. Useful for determining the format |
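
To make the size estimation tool more concrete, here is a minimal sketch of how a size and unit could be pulled out of explain text. It assumes the statistics appear as `Statistics(sizeInBytes=<number> <unit>)`, which is how Spark's cost-mode explain typically prints them; `parse_size_estimate` is a hypothetical helper, and the server's own parsing may differ.

```python
import re

# Matches e.g. "sizeInBytes=8.0 KiB" or "sizeInBytes=1.23E+22 B".
_SIZE_PATTERN = re.compile(
    r"sizeInBytes=(?P<size>[0-9]+(?:\.[0-9]+)?(?:E[+-]?[0-9]+)?)\s*(?P<unit>[A-Za-z]+)"
)


def parse_size_estimate(explain_text):
    """Return (size, unit) from the first sizeInBytes entry, or None.

    The first entry belongs to the top node of the plan, so under the
    assumption above it describes the estimated size of the query result.
    """
    match = _SIZE_PATTERN.search(explain_text)
    if match is None:
        return None
    return float(match.group("size")), match.group("unit")
```

Returning `None` instead of raising keeps the helper usable on plans that carry no statistics at all.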

Download files

Source distribution: pyspark_mcp-0.0.6.tar.gz (13.2 kB)

  • Tags: Source
  • Uploaded using Trusted Publishing: Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7
  • Publisher: publish.yml on SemyonSinchenko/pyspark-mcp-server
  • SHA256: 8c282f8af325a4a993284bb2ddfcee81d910a0ca32adfde98abe503bfc0f8a09
  • MD5: 79fa8d659686b2b4df8cf90d3a3063bb
  • BLAKE2b-256: fc2db0fe0479d0a55cdd26b50bae4c334cae15d7c6acea6d720363f87eed436e

Built distribution: pyspark_mcp-0.0.6-py3-none-any.whl (14.9 kB)

  • Tags: Python 3
  • Uploaded using Trusted Publishing: Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7
  • Publisher: publish.yml on SemyonSinchenko/pyspark-mcp-server
  • SHA256: 2a97867d81374b4997a010e5b4ad2cb4dc7c9f503d86be514b7b1f3808bbcd79
  • MD5: 68f5f992c44c829b5c15a7565f987e16
  • BLAKE2b-256: 5bf14c96104aa288886e924e1fb7d1781dfd3a6b37b6b11604f9b26fc535281e

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
