PySpark MCP Server

SQL migration assistance, AWS Glue job generation, and Spark code optimization — as an MCP server.

What It Does

  • SQL Dialect Transpilation — Convert between PostgreSQL, Oracle, Redshift, MySQL, Snowflake, and Spark SQL using SQLGlot (see the sketch after this list)
  • PySpark DataFrame API Generation — Generate DataFrame API code from SQL with optimization hints
  • AWS Glue Integration — Job templates, DynamicFrame conversions, Data Catalog definitions, S3 optimization strategies
  • Batch Processing — Process hundreds of SQL files concurrently
  • Code Review & Optimization — Analyze existing PySpark code for performance improvements
  • Pattern Detection — Find code duplication and suggest refactoring
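
The dialect transpilation rides on SQLGlot's transpile call. A minimal sketch of that call in isolation (the query, table, and column names are illustrative; the server adds dialect detection and error handling around it):

import sqlglot

# Convert an Oracle query to Spark SQL; any supported read/write dialect
# pair works the same way. transpile() returns a list of statements.
spark_sql = sqlglot.transpile(
    "SELECT NVL(name, 'unknown') AS name FROM employees",
    read="oracle",
    write="spark",
)[0]
print(spark_sql)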

What It Doesn't Do

  • Recursive CTEs → provides a Spark SQL equivalent + guidance, since PySpark has no native recursive CTE support (see the iterative sketch below)
  • MERGE/PIVOT/CONNECT BY → transpiles to Spark SQL, provides DataFrame API guidance
  • Perfect 1:1 DataFrame API transpilation for all SQL — complex queries get Spark SQL + optimization recommendations
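
For the recursive-CTE case, the usual PySpark workaround is an iterative loop that unions results until a fixpoint. A sketch of that pattern, not the server's exact output (the edge data and column names are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
edges = spark.createDataFrame([(1, 2), (2, 3), (3, 4)], ["src", "dst"])

# Seed: nodes reachable from node 1 in one hop
frontier = edges.filter(edges.src == 1).select(edges.dst.alias("node"))
reachable = frontier

# Expand one hop at a time, dropping already-visited nodes, until no new
# nodes appear (the fixpoint a recursive CTE would compute)
while frontier.count() > 0:
    frontier = (
        frontier.join(edges, frontier.node == edges.src)
        .select(edges.dst.alias("node"))
        .distinct()
        .subtract(reachable)
    )
    reachable = reachable.union(frontier)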

Quick Start

pip install -e .
pyspark-mcp  # starts the MCP server

MCP Configuration

Claude Desktop

Add to ~/Library/Application Support/Claude/claude_desktop_config.json (macOS; on Windows the file lives at %APPDATA%\Claude\claude_desktop_config.json):

{
  "mcpServers": {
    "pyspark": {
      "command": "pyspark-mcp",
      "args": []
    }
  }
}

Hermes Agent

Add to ~/.hermes/config.yaml:

mcp:
  servers:
    pyspark:
      command: pyspark-mcp
      enabled_tools: all

Docker

docker compose up -d

Tools

SQL Conversion

  • convert_sql_to_pyspark — Convert SQL to PySpark with dialect detection (see the example below)
  • analyze_sql_context — Analyze SQL complexity and suggest approach
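
An illustrative before/after of the kind of translation convert_sql_to_pyspark performs (the exact generated code may differ; table and column names are made up):

# Input SQL:
#   SELECT dept, AVG(salary) AS avg_salary
#   FROM employees WHERE active GROUP BY dept
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
result = (
    spark.table("employees")
    .filter(F.col("active"))
    .groupBy("dept")
    .agg(F.avg("salary").alias("avg_salary"))
)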

AWS Glue

  • generate_aws_glue_job_template — Generate complete Glue job scripts
  • convert_dataframe_to_dynamic_frame — DataFrame ↔ DynamicFrame conversion (see the sketch below)
  • generate_data_catalog_table_definition — Data Catalog table definitions
  • generate_incremental_processing_job — Incremental/CDC job generation
  • analyze_s3_optimization_opportunities — S3 layout and partitioning analysis
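
The DataFrame ↔ DynamicFrame conversion rests on the standard Glue APIs. A minimal sketch (this only runs inside an AWS Glue environment; the frame name and data are illustrative):

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

df = spark.range(10).withColumnRenamed("id", "user_id")

# DataFrame -> DynamicFrame; the string argument names the frame for Glue
dyf = DynamicFrame.fromDF(df, glue_context, "users")

# DynamicFrame -> DataFrame
df_again = dyf.toDF()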

Optimization

  • review_pyspark_code — Code review with performance recommendations
  • optimize_pyspark_code — Suggest optimizations for existing code
  • recommend_join_strategy — Broadcast vs shuffle join recommendations (see the sketch below)
  • suggest_partitioning_strategy — Partitioning recommendations
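
The broadcast-versus-shuffle decision that recommend_join_strategy automates looks like this when written by hand (table sizes are illustrative):

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()
facts = spark.range(1_000_000).withColumnRenamed("id", "key")  # large fact table
dims = spark.range(100).withColumnRenamed("id", "key")         # small dimension table

# Broadcasting the small side ships it to every executor, so the large side
# is never shuffled. Spark also auto-broadcasts tables smaller than
# spark.sql.autoBroadcastJoinThreshold (10 MB by default).
joined = facts.join(broadcast(dims), "key")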

Batch Processing

  • batch_process_files — Process multiple SQL files concurrently
  • batch_process_directory — Convert entire directories
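
Under the hood, batch processing amounts to fanning per-file conversions out over a worker pool. A minimal sketch, where convert_file is a hypothetical stand-in for the server's actual conversion logic:

from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def convert_file(path: Path) -> str:
    sql = path.read_text()
    # ... transpile `sql` and return the generated PySpark code ...
    return sql

sql_files = sorted(Path("queries").glob("*.sql"))
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(convert_file, sql_files))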

Development

python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"

# Test
pytest tests/ -v --cov=pyspark_tools

# Format
black pyspark_tools tests
isort pyspark_tools tests

# Lint
flake8 pyspark_tools tests

Architecture

pyspark_tools/
├── server.py                # FastMCP server + tool definitions
├── sql_converter.py         # SQLGlot-based transpilation + DataFrame API generation
├── aws_glue_integration.py  # Glue job templates, DynamicFrame, Data Catalog
├── advanced_optimizer.py    # Performance analysis + optimization suggestions
├── batch_processor.py       # Concurrent file processing
├── code_reviewer.py         # PySpark code review patterns
├── duplicate_detector.py    # Duplicate code detection
├── data_source_analyzer.py  # Data source analysis
└── file_utils.py            # File I/O utilities

CI/CD

  • ✅ 256 tests passing
  • ✅ 71% code coverage
  • ✅ Code quality checks (black, isort, flake8)
  • ✅ Python 3.11 tested

License

MIT — see LICENSE.


mcp-name: io.github.AnnasMazhar/pyspark-mcp
