PySpark MCP Server

SQL migration assistance, AWS Glue job generation, and Spark code optimization — as an MCP server.

What It Does

  • SQL Dialect Transpilation — Convert between PostgreSQL, Oracle, Redshift, MySQL, Snowflake, and Spark SQL using SQLGlot (see the sketch after this list)
  • PySpark DataFrame API Generation — Generate DataFrame API code from SQL with optimization hints
  • AWS Glue Integration — Job templates, DynamicFrame conversions, Data Catalog definitions, S3 optimization strategies
  • Batch Processing — Process hundreds of SQL files concurrently
  • Code Review & Optimization — Analyze existing PySpark code for performance improvements
  • Pattern Detection — Find code duplication and suggest refactoring
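
The dialect transpilation rides on SQLGlot's transpile call. A minimal sketch of that call in isolation (the query, table, and column names are illustrative; the server adds dialect detection and error handling around it):

import sqlglot

# Convert an Oracle query to Spark SQL; any supported read/write dialect
# pair works the same way. transpile() returns a list of statements.
spark_sql = sqlglot.transpile(
    "SELECT NVL(name, 'unknown') AS name FROM employees",
    read="oracle",
    write="spark",
)[0]
print(spark_sql)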

What It Doesn't Do

  • Recursive CTEs → provides a Spark SQL equivalent + guidance, since PySpark has no native recursive CTE support (see the iterative sketch below)
  • MERGE/PIVOT/CONNECT BY → transpiles to Spark SQL, provides DataFrame API guidance
  • Perfect 1:1 DataFrame API transpilation for all SQL — complex queries get Spark SQL + optimization recommendations
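
For the recursive-CTE case, the usual PySpark workaround is an iterative loop that unions results until a fixpoint. A sketch of that pattern, not the server's exact output (the edge data and column names are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
edges = spark.createDataFrame([(1, 2), (2, 3), (3, 4)], ["src", "dst"])

# Seed: nodes reachable from node 1 in one hop
frontier = edges.filter(edges.src == 1).select(edges.dst.alias("node"))
reachable = frontier

# Expand one hop at a time, dropping already-visited nodes, until no new
# nodes appear (the fixpoint a recursive CTE would compute)
while frontier.count() > 0:
    frontier = (
        frontier.join(edges, frontier.node == edges.src)
        .select(edges.dst.alias("node"))
        .distinct()
        .subtract(reachable)
    )
    reachable = reachable.union(frontier)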

Quick Start

pip install -e .
pyspark-mcp  # starts the MCP server

MCP Configuration

Claude Desktop

Add to ~/Library/Application Support/Claude/claude_desktop_config.json (macOS; on Windows the file lives at %APPDATA%\Claude\claude_desktop_config.json):

{
  "mcpServers": {
    "pyspark": {
      "command": "pyspark-mcp",
      "args": []
    }
  }
}

Hermes Agent

Add to ~/.hermes/config.yaml:

mcp:
  servers:
    pyspark:
      command: pyspark-mcp
      enabled_tools: all

Docker

docker compose up -d

Tools

SQL Conversion

  • convert_sql_to_pyspark — Convert SQL to PySpark with dialect detection (see the example below)
  • analyze_sql_context — Analyze SQL complexity and suggest approach
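
An illustrative before/after of the kind of translation convert_sql_to_pyspark performs (the exact generated code may differ; table and column names are made up):

# Input SQL:
#   SELECT dept, AVG(salary) AS avg_salary
#   FROM employees WHERE active GROUP BY dept
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
result = (
    spark.table("employees")
    .filter(F.col("active"))
    .groupBy("dept")
    .agg(F.avg("salary").alias("avg_salary"))
)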

AWS Glue

  • generate_aws_glue_job_template — Generate complete Glue job scripts
  • convert_dataframe_to_dynamic_frame — DataFrame ↔ DynamicFrame conversion (see the sketch below)
  • generate_data_catalog_table_definition — Data Catalog table definitions
  • generate_incremental_processing_job — Incremental/CDC job generation
  • analyze_s3_optimization_opportunities — S3 layout and partitioning analysis
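
The DataFrame ↔ DynamicFrame conversion rests on the standard Glue APIs. A minimal sketch (this only runs inside an AWS Glue environment; the frame name and data are illustrative):

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

df = spark.range(10).withColumnRenamed("id", "user_id")

# DataFrame -> DynamicFrame; the string argument names the frame for Glue
dyf = DynamicFrame.fromDF(df, glue_context, "users")

# DynamicFrame -> DataFrame
df_again = dyf.toDF()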

Optimization

  • review_pyspark_code — Code review with performance recommendations
  • optimize_pyspark_code — Suggest optimizations for existing code
  • recommend_join_strategy — Broadcast vs shuffle join recommendations (see the sketch below)
  • suggest_partitioning_strategy — Partitioning recommendations
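
The broadcast-versus-shuffle decision that recommend_join_strategy automates looks like this when written by hand (table sizes are illustrative):

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()
facts = spark.range(1_000_000).withColumnRenamed("id", "key")  # large fact table
dims = spark.range(100).withColumnRenamed("id", "key")         # small dimension table

# Broadcasting the small side ships it to every executor, so the large side
# is never shuffled. Spark also auto-broadcasts tables smaller than
# spark.sql.autoBroadcastJoinThreshold (10 MB by default).
joined = facts.join(broadcast(dims), "key")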

Batch Processing

  • batch_process_files — Process multiple SQL files concurrently
  • batch_process_directory — Convert entire directories
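
Under the hood, batch processing amounts to fanning per-file conversions out over a worker pool. A minimal sketch, where convert_file is a hypothetical stand-in for the server's actual conversion logic:

from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def convert_file(path: Path) -> str:
    sql = path.read_text()
    # ... transpile `sql` and return the generated PySpark code ...
    return sql

sql_files = sorted(Path("queries").glob("*.sql"))
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(convert_file, sql_files))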

Development

python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"

# Test
pytest tests/ -v --cov=pyspark_tools

# Format
black pyspark_tools tests
isort pyspark_tools tests

# Lint
flake8 pyspark_tools tests

Architecture

pyspark_tools/
├── server.py                # FastMCP server + tool definitions
├── sql_converter.py         # SQLGlot-based transpilation + DataFrame API generation
├── aws_glue_integration.py  # Glue job templates, DynamicFrame, Data Catalog
├── advanced_optimizer.py    # Performance analysis + optimization suggestions
├── batch_processor.py       # Concurrent file processing
├── code_reviewer.py         # PySpark code review patterns
├── duplicate_detector.py    # Duplicate code detection
├── data_source_analyzer.py  # Data source analysis
└── file_utils.py            # File I/O utilities

CI/CD

  • ✅ 256 tests passing
  • ✅ 71% code coverage
  • ✅ Code quality checks (black, isort, flake8)
  • ✅ Python 3.11 tested

License

MIT — see LICENSE.


mcp-name: io.github.AnnasMazhar/pyspark-mcp
