MCP server for SQL migration, AWS Glue job generation, and PySpark optimization
PySpark MCP Server
SQL migration assistance, AWS Glue job generation, and Spark code optimization — as an MCP server.
What It Does
- SQL Dialect Transpilation — Convert between PostgreSQL, Oracle, Redshift, MySQL, Snowflake, and Spark SQL using SQLGlot
- PySpark DataFrame API Generation — Generate DataFrame API code from SQL with optimization hints
- AWS Glue Integration — Job templates, DynamicFrame conversions, Data Catalog definitions, S3 optimization strategies
- Batch Processing — Process hundreds of SQL files concurrently
- Code Review & Optimization — Analyze existing PySpark code for performance improvements
- Pattern Detection — Find code duplication and suggest refactoring
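The dialect transpilation in the first bullet is built on SQLGlot. Below is a minimal sketch of the underlying library call, assuming the `sqlglot` package is installed. The helper name `to_spark_sql` is hypothetical and is not part of this project's API; the server's own tools may wrap the call differently.

```python
try:
    import sqlglot  # third-party dependency: pip install sqlglot
except ImportError:  # keep the sketch importable when sqlglot is absent
    sqlglot = None


def to_spark_sql(sql: str, source_dialect: str) -> str:
    """Transpile a single SQL statement from source_dialect to Spark SQL.

    Hypothetical helper for illustration only.
    """
    if sqlglot is None:
        raise RuntimeError("sqlglot is not installed")
    # transpile() returns one output string per input statement.
    return sqlglot.transpile(sql, read=source_dialect, write="spark")[0]


if sqlglot is not None:
    print(to_spark_sql("SELECT CAST(created_at AS DATE) FROM orders LIMIT 5", "postgres"))
```

SQLGlot dialect names such as `postgres`, `oracle`, `redshift`, `mysql`, and `snowflake` can be passed as `source_dialect`.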
What It Doesn't Do
- Recursive CTEs: not converted to the DataFrame API; you get a Spark SQL equivalent plus guidance (PySpark has no native recursive CTE support)
- MERGE / PIVOT / CONNECT BY: transpiled to Spark SQL, with guidance for the DataFrame API
- Perfect 1:1 DataFrame API transpilation for all SQL: complex queries get Spark SQL plus optimization recommendations instead
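Guidance for recursive CTEs commonly comes down to replacing the recursion with an explicit loop that repeats a join until no new rows appear. The sketch below shows that loop shape on plain Python tuples so it runs without a Spark session. In a DataFrame version, each iteration would be a join followed by an anti-join against the rows already found. The function name and the sample data are illustrative assumptions, not output of this tool.

```python
def descendants(edges, root):
    """Return every node reachable from root via (parent, child) edges."""
    reached = set()
    frontier = {root}
    while frontier:
        # "Join" step: children whose parent is in the current frontier.
        children = {child for parent, child in edges if parent in frontier}
        # "Anti-join" step: keep only rows not seen before, so cycles terminate.
        frontier = children - reached
        reached |= frontier
    return reached


org_chart = [("ceo", "vp"), ("vp", "manager"), ("manager", "engineer")]
print(sorted(descendants(org_chart, "vp")))
```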
Quick Start
pip install -e .
pyspark-mcp # starts the MCP server
MCP Configuration
Claude Desktop
Add to ~/Library/Application Support/Claude/claude_desktop_config.json:
{
  "mcpServers": {
    "pyspark": {
      "command": "pyspark-mcp",
      "args": []
    }
  }
}
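If the config file already lists other MCP servers, the `pyspark` entry has to be merged in rather than pasted over them. Here is a minimal sketch of that merge using only the standard library. The function names are hypothetical, and the entry written is exactly the one shown above.

```python
import json
from pathlib import Path


def add_pyspark_server(config: dict) -> dict:
    """Return a copy of config with the pyspark entry added under mcpServers."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers["pyspark"] = {"command": "pyspark-mcp", "args": []}
    merged["mcpServers"] = servers
    return merged


def update_config_file(path: Path) -> None:
    """Read the JSON config at path (if present), merge, and write it back."""
    existing = json.loads(path.read_text()) if path.exists() else {}
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(add_pyspark_server(existing), indent=2) + "\n")
```

To apply it, call `update_config_file(Path.home() / "Library/Application Support/Claude/claude_desktop_config.json")`, then restart Claude Desktop.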
Hermes Agent
Add to ~/.hermes/config.yaml:
mcp:
  servers:
    pyspark:
      command: pyspark-mcp
      enabled_tools: all
Docker
docker compose up -d
Tools
SQL Conversion
- convert_sql_to_pyspark: Convert SQL to PySpark with dialect detection
- analyze_sql_context: Analyze SQL complexity and suggest an approach
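MCP clients invoke tools like these with a JSON-RPC `tools/call` request. The sketch below builds such a request with the standard library to show its shape. The argument name `sql` is an assumption, since the tool's actual input schema is not documented here. In practice an MCP client or SDK constructs and sends this message for you.

```python
import json
from itertools import count

_request_ids = count(1)  # JSON-RPC requests need a unique id per session


def build_tool_call(tool_name: str, arguments: dict) -> str:
    """Serialize an MCP tools/call request for the given tool."""
    request = {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(request)


message = build_tool_call(
    "convert_sql_to_pyspark",
    {"sql": "SELECT id FROM users"},  # "sql" is an assumed parameter name
)
print(message)
```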
AWS Glue
- generate_aws_glue_job_template: Generate complete Glue job scripts
- convert_dataframe_to_dynamic_frame: DataFrame ↔ DynamicFrame conversion
- generate_data_catalog_table_definition: Data Catalog table definitions
- generate_incremental_processing_job: Incremental/CDC job generation
- analyze_s3_optimization_opportunities: S3 layout and partitioning analysis
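For orientation, a generated Glue job typically follows the standard AWS Glue script skeleton: resolve job arguments, create a `GlueContext`, read a DynamicFrame, convert to a DataFrame and back, then write and commit. The sketch below renders that skeleton from a template. It only approximates what a job template can look like and is not this server's actual output. The names `render_glue_job`, `database`, `table`, and `output_path` are hypothetical.

```python
from string import Template

GLUE_JOB_TEMPLATE = Template('''\
import sys

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the source table from the Data Catalog as a DynamicFrame.
source = glue_context.create_dynamic_frame.from_catalog(
    database="$database", table_name="$table"
)

# DynamicFrame -> DataFrame for Spark transformations, then back again.
df = source.toDF()
result = DynamicFrame.fromDF(df, glue_context, "result")

glue_context.write_dynamic_frame.from_options(
    frame=result,
    connection_type="s3",
    connection_options={"path": "$output_path"},
    format="parquet",
)
job.commit()
''')


def render_glue_job(database: str, table: str, output_path: str) -> str:
    """Fill the skeleton with a catalog table and an S3 output path."""
    return GLUE_JOB_TEMPLATE.substitute(
        database=database, table=table, output_path=output_path
    )


script = render_glue_job("sales_db", "orders", "s3://example-bucket/out/")
compile(script, "glue_job.py", "exec")  # syntax check only; running it needs a Glue environment
```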
Optimization
- review_pyspark_code: Code review with performance recommendations
- optimize_pyspark_code: Suggest optimizations for existing code
- recommend_join_strategy: Broadcast vs shuffle join recommendations
- suggest_partitioning_strategy: Partitioning recommendations
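To give a sense of the broadcast-versus-shuffle decision, here is a deliberately simple size-based heuristic. It assumes the only input is the estimated size of each side, and it uses Spark's default `spark.sql.autoBroadcastJoinThreshold` of 10 MB as the cutoff. It is a sketch of the general idea, not the logic inside `recommend_join_strategy`; the function name is hypothetical.

```python
# Spark's default value for spark.sql.autoBroadcastJoinThreshold (10 MB).
DEFAULT_BROADCAST_THRESHOLD = 10 * 1024 * 1024


def sketch_join_choice(left_bytes: int, right_bytes: int,
                       threshold: int = DEFAULT_BROADCAST_THRESHOLD) -> str:
    """Pick a join strategy from the estimated size of each input."""
    side, size = min((("left", left_bytes), ("right", right_bytes)),
                     key=lambda item: item[1])
    if size <= threshold:
        # The smaller side fits under the threshold: ship it to every executor.
        return f"broadcast {side}"
    # Both sides are large: fall back to a shuffle-based join.
    return "shuffle (sort-merge)"


print(sketch_join_choice(5 * 1024 * 1024, 50 * 1024**3))
```

A real recommendation would also weigh factors such as key skew, join type, and available executor memory.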
Batch Processing
- batch_process_files: Process multiple SQL files concurrently
- batch_process_directory: Convert entire directories
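Concurrent batch conversion can be pictured as fanning a per-file function out over a worker pool. The sketch below does that with `ThreadPoolExecutor` from the standard library. The function `convert_directory` and its `convert` callback are hypothetical stand-ins, and the project's `batch_processor.py` may use a different concurrency model.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable


def convert_directory(directory: Path, convert: Callable[[str], str],
                      max_workers: int = 8) -> dict[str, str]:
    """Apply convert to every .sql file in directory, keyed by file name."""
    sql_files = sorted(directory.glob("*.sql"))

    def convert_file(path: Path) -> tuple[str, str]:
        return path.name, convert(path.read_text())

    # pool.map preserves input order and re-raises the first worker exception.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(convert_file, sql_files))
```

Pass any string-to-string function as `convert`, for example a wrapper around a SQL transpiler.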
Development
python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
# Test
pytest tests/ -v --cov=pyspark_tools
# Format
black pyspark_tools tests
isort pyspark_tools tests
# Lint
flake8 pyspark_tools tests
Architecture
pyspark_tools/
├── server.py # FastMCP server + tool definitions
├── sql_converter.py # SQLGlot-based transpilation + DataFrame API generation
├── aws_glue_integration.py # Glue job templates, DynamicFrame, Data Catalog
├── advanced_optimizer.py # Performance analysis + optimization suggestions
├── batch_processor.py # Concurrent file processing
├── code_reviewer.py # PySpark code review patterns
├── duplicate_detector.py # Code deduplication
├── data_source_analyzer.py # Data source analysis
└── file_utils.py # File I/O utilities
CI/CD
- ✅ 256 tests passing
- ✅ 71% code coverage
- ✅ Code quality checks (black, isort, flake8)
- ✅ Python 3.11 tested
License
MIT — see LICENSE.
mcp-name: io.github.AnnasMazhar/pyspark-mcp