The automated technical fellow for your Fabric Spark workloads - intelligent configuration analysis and optimization recommendations

🔥 sparkwise

The automated technical fellow for your Fabric Spark workloads

sparkwise is a configuration advisor for Apache Spark on Microsoft Fabric. It analyzes your Spark workloads, detects performance issues, and provides actionable optimization recommendations, so you don't have to comb through thousands of configuration options yourself.

🎯 Why sparkwise?

As a Spark developer on Microsoft Fabric, you face:

Millions of configuration combinations - impossible to know which ones matter for your workload
Runtime mysteries - jobs fail or run slowly with cryptic error messages
Hidden optimizations - missing out on Native Execution Engine, V-Order, or proper pooling strategies
Immutable config traps - accidentally setting session-immutable configs that force a Custom Pool and its 3-5 minute cold start
Documentation overload - OSS Spark, Delta Lake, and Fabric-specific configs scattered everywhere

sparkwise solves this by acting as your personal Spark expert that:

✅ Analyzes your current session and detects misconfigurations
✅ Warns when you accidentally force Custom Pool usage (avoiding it saves 3-5 minutes per run)
✅ Explains errors in plain English with remediation steps
✅ Recommends configuration tweaks based on your workload characteristics
✅ Provides an interactive Q&A interface for 100+ Spark/Delta/Fabric configurations

🚀 Quick Start

Installation

pip install sparkwise

Usage in Fabric Notebook

from sparkwise import diagnose

# Run comprehensive analysis after your Spark job
diagnose.analyze_last_run()

Output:

🚀 Running sparkwise Analysis...

🔎 --- Native Execution Engine ---
✅ Native Engine ACTIVE: Your query is fully vectorized (Velox detected)

🏊 --- Pooling Strategy ---
🔴 CRITICAL: Session-Immutable Configs Detected
======================================================================
The following configs FORCE Custom Pool usage (3-5min cold-start):

   • spark.executor.memory = 8g
   • spark.dynamicAllocation.maxExecutors = 20

💡 Impact:
   ❌ Cannot use Starter Pool (instant startup)
   ❌ Forced to Custom Pool (3-5 minute cold-start)
   ❌ Additional capacity consumption

✅ Solution:
   1. Remove these spark.conf.set() calls from your notebook
   2. Use Starter Pool defaults (auto-configured by Fabric)
   3. Only set these if you truly need Custom Pool
======================================================================

💾 --- Storage & Delta Optimizations ---
⚠️ Performance Miss: V-Order is DISABLED
   💡 Set 'spark.sql.parquet.vorder.enabled=true' for 3x faster Power BI reads

⚙️ --- Runtime Tuning ---
✅ Adaptive Query Execution (AQE) is Active
✅ Optimal partition sizing for your workload

📊 --- Data Skew Detection ---
⚠️ Data Skew Detected: One task took 145s while median was 32s
   💡 Consider salting your join keys or repartitioning

Done. 🎄 Happy Optimizing!

Interactive Configuration Assistant

from sparkwise import ask

# Ask about any configuration
ask.config("spark.sql.shuffle.partitions")

# Search across 100+ documented configs
ask.search("partition")

Knowledge Base: 100+ Configurations

55+ Core Spark configurations (shuffle, memory, AQE, serialization, etc.)
17 Delta Lake configurations (V-Order, deletion vectors, OPTIMIZE, VACUUM, etc.)
12 Fabric-specific configurations (Native Engine, Starter Pools, OneLake, etc.)
Session-immutable configurations that force Custom Pool usage are flagged as critical

Output of ask.config("spark.sql.shuffle.partitions"):

📚 spark.sql.shuffle.partitions

Default: 200
Scope: Session-level, can be changed at runtime

What it does:
Controls the number of partitions created during shuffle operations 
(joins, aggregations, etc.). The default 200 is optimized for small 
clusters but may be suboptimal for large-scale workloads.

Recommendations for your workload:
- Small data (<10GB): 50-100 partitions
- Medium data (10-100GB): 200-500 partitions  
- Large data (>100GB): 1000-2000 partitions
- Formula: num_executors * executor_cores * 2-3

Fabric-specific notes:
On Starter Pools with Native Execution, start with 100-200 and let
AQE (Adaptive Query Execution) handle dynamic coalescing.

Related configs:
- spark.sql.adaptive.coalescePartitions.enabled
- spark.sql.files.maxPartitionBytes
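
The formula in the output above (num_executors * executor_cores * 2-3) can be worked through by hand. The sketch below is only an illustration of that arithmetic; the function name `suggest_shuffle_partitions` is hypothetical and not part of sparkwise.

```python
def suggest_shuffle_partitions(num_executors: int, executor_cores: int) -> tuple:
    """Return the (low, high) partition range from the 2-3x total-cores rule."""
    if num_executors < 1 or executor_cores < 1:
        raise ValueError("executor and core counts must be positive")
    total_cores = num_executors * executor_cores
    # Two to three tasks per core keeps every core busy without tiny partitions.
    return total_cores * 2, total_cores * 3


# Example cluster shape (made up for illustration): 4 executors with 8 cores each.
low, high = suggest_shuffle_partitions(num_executors=4, executor_cores=8)
print(f"Suggested spark.sql.shuffle.partitions: {low}-{high}")  # 64-96
```

Because the setting is session-level and changeable at runtime, a value from this range could then be applied with spark.conf.set("spark.sql.shuffle.partitions", ...).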

Error Diagnosis

from sparkwise import diagnose

# When you hit an error
diagnose.explain_error("org.apache.spark.shuffle.FetchFailedException")
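
To give a feel for what explaining an error in plain English can involve, here is a minimal lookup sketch. It is not how sparkwise works internally: the `ERROR_HINTS` table, the `lookup_hint` function, and the hint wording are all hypothetical and written for illustration.

```python
# Hypothetical hint table; markers and wording are illustrative only.
ERROR_HINTS = {
    "FetchFailedException": (
        "A task could not fetch shuffle data from another executor, often "
        "because that executor was lost (for example after running out of memory)."
    ),
    "OutOfMemoryError": (
        "A JVM ran out of heap space; look at partition sizes and at data "
        "collected to the driver."
    ),
}


def lookup_hint(error_text: str) -> str:
    """Return the first hint whose marker appears in the error text."""
    for marker, hint in ERROR_HINTS.items():
        if marker in error_text:
            return hint
    return "No hint available for this error."


print(lookup_hint("org.apache.spark.shuffle.FetchFailedException"))
```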

🎯 Key Features

1. Native Execution Engine Verification

Checks whether you're actually using Fabric's Velox-based Native Execution Engine or falling back to slower row-based processing, for example because of UDFs.
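
As a rough illustration of plan-based verification, the sketch below scans physical plan text for operator names. It is not sparkwise's implementation. The function name `summarize_native_usage`, the sample plan, and the marker suffixes (`Transformer`, `NativeFileScan`, `VeloxColumnarToRowExec`) are assumptions about how natively executed operators are labelled, and should be checked against your Fabric runtime.

```python
import re

# Assumed suffixes for operators that ran on the native engine.
NATIVE_SUFFIXES = ("Transformer", "NativeFileScan", "VeloxColumnarToRowExec")


def summarize_native_usage(plan_text: str) -> dict:
    """Split the operators named in a plan string into native and fallback lists."""
    native, fallback = [], []
    for line in plan_text.splitlines():
        if line.strip().startswith("=="):
            continue  # skip section headers such as "== Physical Plan =="
        match = re.search(r"[A-Z][A-Za-z]+", line)
        if match is None:
            continue
        name = match.group(0)  # first capitalized token is taken as the operator
        (native if name.endswith(NATIVE_SUFFIXES) else fallback).append(name)
    return {"native": native, "fallback": fallback}


# Hypothetical plan text, written by hand for this example.
sample_plan = """== Physical Plan ==
VeloxColumnarToRowExec
+- ProjectExecTransformer [category, total]
   +- BatchEvalPython [my_udf(sales)]
      +- NativeFileScan parquet [category, sales]"""

summary = summarize_native_usage(sample_plan)
print("Fallback operators:", summary["fallback"])
```

In a notebook, the plan text would come from the DataFrame's explain output rather than a hand-written string.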

2. Intelligent Pooling Advisor

Detects if you're wasting 3-5 minutes spinning up Custom Pools for jobs that could run on Starter Pools.
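
A minimal sketch of this kind of check, assuming the explicitly set configs are available as a plain dictionary. The `POOL_FORCING_KEYS` set and the `find_pool_forcing_configs` function are hypothetical; only the two keys shown in the sample output earlier are included, and a real list would likely be longer.

```python
# Hypothetical set of session-immutable keys, taken from the sample output.
POOL_FORCING_KEYS = {
    "spark.executor.memory",
    "spark.dynamicAllocation.maxExecutors",
}


def find_pool_forcing_configs(explicit_conf: dict) -> dict:
    """Return the explicitly set configs that would force a Custom Pool."""
    return {k: v for k, v in explicit_conf.items() if k in POOL_FORCING_KEYS}


# Example of configs set in a notebook (values made up for illustration).
notebook_conf = {
    "spark.executor.memory": "8g",
    "spark.dynamicAllocation.maxExecutors": "20",
    "spark.sql.shuffle.partitions": "200",
}

flagged = find_pool_forcing_configs(notebook_conf)
for key, value in flagged.items():
    print(f"{key} = {value}")
```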

3. Data Skew Detection

Identifies when one task takes 2x or more the median task time (max vs. median), indicating skewed data distribution.
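
The max-versus-median comparison can be sketched in a few lines. This is an illustration under stated assumptions, not sparkwise's code: the `detect_skew` name is hypothetical, the 2.0 threshold mirrors the "2x" figure above, and the durations are made up to match the 145s and 32s values in the sample output.

```python
from statistics import median


def detect_skew(task_durations_s, threshold=2.0):
    """Compare the slowest task with the median task duration."""
    if not task_durations_s:
        return None
    slowest = max(task_durations_s)
    typical = median(task_durations_s)
    if typical <= 0:
        return None  # a ratio is meaningless without a positive median
    ratio = slowest / typical
    return {
        "max_s": slowest,
        "median_s": typical,
        "ratio": ratio,
        "skewed": ratio >= threshold,
    }


# Made-up task durations in seconds: four similar tasks and one straggler.
durations = [30, 31, 32, 33, 145]
report = detect_skew(durations)
print(f"max={report['max_s']}s median={report['median_s']}s skewed={report['skewed']}")
```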

4. Delta & Storage Optimizations

V-Order enablement for Power BI/Direct Lake performance
Deletion Vectors for efficient MERGE operations
Optimize Write for small file prevention

5. Runtime Tuning Recommendations

AQE configuration validation
Partition sizing analysis
Scheduler mode recommendations
Driver vs Executor balance checks
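
AQE validation, for instance, comes down to reading a few settings and reporting the ones that are switched off. The sketch below works on a plain dictionary snapshot of effective values; the `disabled_aqe_settings` function and the snapshot are hypothetical, while the three keys are standard Spark SQL settings.

```python
# Standard Spark SQL settings that control Adaptive Query Execution.
AQE_KEYS = (
    "spark.sql.adaptive.enabled",
    "spark.sql.adaptive.coalescePartitions.enabled",
    "spark.sql.adaptive.skewJoin.enabled",
)


def disabled_aqe_settings(effective_conf: dict) -> list:
    """Return the AQE keys whose value in the snapshot is not 'true'."""
    disabled = []
    for key in AQE_KEYS:
        value = effective_conf.get(key)
        # Keys missing from the snapshot are skipped rather than guessed at.
        if value is not None and str(value).strip().lower() != "true":
            disabled.append(key)
    return disabled


# Made-up snapshot; in a notebook these values could be read with spark.conf.get.
snapshot = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "false",
}

print(disabled_aqe_settings(snapshot))
```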

6. Interactive Documentation

Ask questions about any Spark, Delta, or Fabric configuration and get clear, context-aware explanations.

📋 Core Analysis Modules

Module	What It Checks	Key Metrics
Native Compliance	Velox engine usage	Physical plan analysis, fallback detection
Pooling Efficiency	Starter vs Custom Pool	Node count, startup overhead
Skew Detection	Task duration variance	Max vs Median task time
Delta Hygiene	V-Order, Deletion Vectors	Storage format, merge performance
Runtime Tuning	AQE, partitioning, scheduler	Partition sizes, parallelism
Resource Profile	Driver/Executor balance	Memory allocation, OOM risks

🛠️ Advanced Usage

Analyze with DataFrame Context

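from pyspark.sql import functions as F
from sparkwise import diagnose

# Provide a DataFrame for deep plan analysis
df = spark.read.parquet("/lakehouse/data/large_table")
# Use Spark's sum, not Python's built-in sum, to aggregate the column
result = df.groupBy("category").agg(F.sum("sales"))

diagnose.analyze(result)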

Get Configuration Report

from sparkwise import config_report

# Get detailed report of current vs recommended configurations
report = config_report.generate()
print(report.to_markdown())

Export Recommendations

# Save recommendations to file
diagnose.analyze_last_run(export_path="/lakehouse/reports/optimization_report.json")

🏗️ Architecture

sparkwise/
├── core/
│   ├── advisor.py          # Main diagnostic engine
│   ├── native_check.py     # Velox/Native execution verification
│   ├── pool_check.py       # Pooling strategy analysis
│   ├── skew_check.py       # Data skew detection
│   ├── delta_check.py      # Delta/Storage optimizations
│   └── runtime_check.py    # Runtime configuration tuning
├── knowledge_base/
│   ├── spark_configs.yaml  # OSS Spark configurations
│   ├── delta_configs.yaml  # Delta Lake configurations
│   └── fabric_configs.yaml # Fabric-specific configurations
├── error_diagnosis/
│   └── error_parser.py     # Error explanation engine
├── cli/
│   └── main.py             # Command-line interface
└── utils/
    └── session_utils.py    # SparkSession utilities

🎓 Examples

Check out the examples directory for:

Basic analysis workflow
Error diagnosis patterns
Configuration Q&A usage
Integration with existing notebooks

🤝 Contributing

Contributions are welcome! Please read our Contributing Guide for details.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

Built with ❤️ for the Microsoft Fabric Spark community. Special thanks to the Fabric Data Engineering team for their work on the Native Execution Engine.

📬 Contact

Author: Santhosh Ravindran
GitHub: @santhoshravindran7
Issues: GitHub Issues

Tagline: "Before you head off for the holidays, make sure your Fabric jobs aren't burning budget while you sleep. 🎄"

Release history

1.4.2 - Jan 5, 2026
1.4.1 - Jan 4, 2026
1.4.0 - Jan 4, 2026
1.3.4 - Jan 4, 2026
1.3.3 - Dec 26, 2025
1.3.2 - Dec 25, 2025
0.1.1 - Dec 25, 2025
0.1.0 - Dec 25, 2025 (this version)

Download files

Source Distribution

sparkwise-0.1.0.tar.gz (61.2 kB)

Uploaded Dec 25, 2025 Source

Built Distribution

sparkwise-0.1.0-py3-none-any.whl (68.0 kB)

Uploaded Dec 25, 2025 Python 3

File details

Details for the file sparkwise-0.1.0.tar.gz.

File metadata

Download URL: sparkwise-0.1.0.tar.gz
Upload date: Dec 25, 2025
Size: 61.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.11

File hashes

Hashes for sparkwise-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`ddbab2e25c932c37d6260c76cb2d98fc871eaaefa0ba105d82a5c6d6b03cf758`
MD5	`cb3feb7efb93097a99cb7f4644b3124e`
BLAKE2b-256	`78127eeadb07b3c5e389e77e92c139052e7f7a510aa8865ac7575c35f3a08e3d`

File details

Details for the file sparkwise-0.1.0-py3-none-any.whl.

File metadata

Download URL: sparkwise-0.1.0-py3-none-any.whl
Upload date: Dec 25, 2025
Size: 68.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.11

File hashes

Hashes for sparkwise-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`09b8bf55b7f23353d5a3b885f0d4035700b5336432f62b656f2afab84323d106`
MD5	`d04046a1728c4236af9b96f428a006bd`
BLAKE2b-256	`ecc6925c2e6750c8c3aee836a38b1fa06fd8752dda62d8b59cada9fea026d887`
