The automated technical fellow for your Fabric Spark workloads - intelligent configuration analysis and optimization recommendations
sparkwise
The automated technical fellow for your Fabric Spark workloads
sparkwise is an intelligent configuration advisor for Apache Spark on Microsoft Fabric. It automatically analyzes your Spark workloads, detects performance issues, and provides actionable optimization recommendations - all without you having to scan through thousands of configuration options.
Why sparkwise?
As a Spark developer on Microsoft Fabric, you face:
- Millions of configuration combinations - impossible to know which ones matter for your workload
- Runtime mysteries - jobs fail or run slowly with cryptic error messages
- Hidden optimizations - missing out on Native Execution Engine, V-Order, or proper pooling strategies
- Immutable config traps - accidentally setting configs that force 3-5 minute cold starts
- Documentation overload - OSS Spark, Delta Lake, and Fabric-specific configs scattered everywhere
sparkwise solves this by acting as your personal Spark expert that:
- Analyzes your current session and detects misconfigurations
- Warns when you accidentally force Custom Pool usage (save 3-5 minutes per run!)
- Explains errors in plain English with remediation steps
- Recommends configuration tweaks based on your workload characteristics
- Provides an interactive Q&A interface for 100+ Spark/Delta/Fabric configurations
Quick Start
Installation
pip install sparkwise
Usage in Fabric Notebook
from sparkwise import diagnose
# Run comprehensive analysis after your Spark job
diagnose.analyze_last_run()
Output:
Running sparkwise Analysis...
--- Native Execution Engine ---
✅ Native Engine ACTIVE: Your query is fully vectorized (Velox detected)
--- Pooling Strategy ---
🔴 CRITICAL: Session-Immutable Configs Detected
======================================================================
The following configs FORCE Custom Pool usage (3-5min cold-start):
• spark.executor.memory = 8g
• spark.dynamicAllocation.maxExecutors = 20
💡 Impact:
- Cannot use Starter Pool (instant startup)
- Forced to Custom Pool (3-5 minute cold-start)
- Additional capacity consumption
✅ Solution:
1. Remove these spark.conf.set() calls from your notebook
2. Use Starter Pool defaults (auto-configured by Fabric)
3. Only set these if you truly need Custom Pool
======================================================================
--- Storage & Delta Optimizations ---
⚠️ Performance Miss: V-Order is DISABLED
💡 Set 'spark.sql.parquet.vorder.enabled=true' for 3x faster Power BI reads
--- Runtime Tuning ---
✅ Adaptive Query Execution (AQE) is Active
✅ Optimal partition sizing for your workload
--- Data Skew Detection ---
⚠️ Data Skew Detected: One task took 145s while median was 32s
💡 Consider salting your join keys or repartitioning
Done. Happy Optimizing!
Interactive Configuration Assistant
from sparkwise import ask
# Ask about any configuration
ask.config("spark.sql.shuffle.partitions")
# Search across 100+ documented configs
ask.search("partition")
Knowledge Base: 100+ Configurations
- 55+ Core Spark configurations (shuffle, memory, AQE, serialization, etc.)
- 17 Delta Lake configurations (V-Order, deletion vectors, OPTIMIZE, VACUUM, etc.)
- 12 Fabric-specific configurations (Native Engine, Starter Pools, OneLake, etc.)
- Critical: Session-immutable configs that force Custom Pool usage
Output of ask.config("spark.sql.shuffle.partitions"):
spark.sql.shuffle.partitions
Default: 200
Scope: Session-level, can be changed at runtime
What it does:
Controls the number of partitions created during shuffle operations
(joins, aggregations, etc.). The default 200 is optimized for small
clusters but may be suboptimal for large-scale workloads.
Recommendations for your workload:
- Small data (<10GB): 50-100 partitions
- Medium data (10-100GB): 200-500 partitions
- Large data (>100GB): 1000-2000 partitions
- Formula: num_executors * executor_cores * 2-3
Fabric-specific notes:
On Starter Pools with Native Execution, start with 100-200 and let
AQE (Adaptive Query Execution) handle dynamic coalescing.
Related configs:
- spark.sql.adaptive.coalescePartitions.enabled
- spark.sql.files.maxPartitionBytes
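The sizing guidance in that output can be turned into a small helper. The following is a minimal sketch, not part of sparkwise: the function names `suggest_shuffle_partitions` and `size_based_range` are hypothetical, and the formula is read as total cores multiplied by a factor between 2 and 3.

```python
import math


def suggest_shuffle_partitions(num_executors: int, executor_cores: int,
                               factor: float = 2.0) -> int:
    """Rule of thumb from the output above: total cores times a factor of 2 to 3."""
    if num_executors < 1 or executor_cores < 1:
        raise ValueError("num_executors and executor_cores must be positive")
    if not 2.0 <= factor <= 3.0:
        raise ValueError("factor is expected to be between 2 and 3")
    return math.ceil(num_executors * executor_cores * factor)


def size_based_range(data_gb: float) -> tuple:
    """Return the (low, high) partition range from the size tiers listed above."""
    if data_gb < 10:
        return (50, 100)
    if data_gb <= 100:
        return (200, 500)
    return (1000, 2000)


print(suggest_shuffle_partitions(num_executors=8, executor_cores=4))  # 64
print(size_based_range(250))  # (1000, 2000)
```

Either number is only a starting point; with AQE coalescing enabled, Spark can still merge small shuffle partitions at runtime.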
Error Diagnosis
from sparkwise import diagnose
# When you hit an error
diagnose.explain_error("org.apache.spark.shuffle.FetchFailedException")
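To illustrate the idea behind plain-English error explanations, here is a minimal sketch of a lookup keyed on exception class names. It does not reflect how sparkwise implements `explain_error`; the table `KNOWN_ERRORS`, the function `explain`, and the wording of the explanations are assumptions made for this example.

```python
from typing import Optional

# Hypothetical lookup table: illustrative entries, not sparkwise's own data.
KNOWN_ERRORS = {
    "FetchFailedException": (
        "A task could not fetch shuffle blocks from another executor, "
        "often because that executor was lost or ran out of memory."
    ),
    "OutOfMemoryError": (
        "The JVM ran out of heap space on the driver or an executor."
    ),
}


def explain(error_text: str) -> Optional[str]:
    """Return the first explanation whose key appears in the error text."""
    for name, explanation in KNOWN_ERRORS.items():
        if name in error_text:
            return explanation
    return None


print(explain("org.apache.spark.shuffle.FetchFailedException"))
```

Matching on a substring lets the same table handle both a bare class name and a full stack trace line.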
Key Features
1. Native Execution Engine Verification
Checks if you're actually using Fabric's Velox-based Native Execution Engine or accidentally falling back to slower row-based processing due to UDFs.
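One way to picture fallback detection is to scan the text of a physical plan for operator names. The sketch below is an assumption-heavy illustration rather than sparkwise's actual check: the marker substrings in `ASSUMED_NATIVE_MARKERS`, the function `classify_plan`, and the sample plan are all hypothetical, so inspect a real plan from `df.explain()` before relying on them.

```python
# Assumed marker substrings for natively executed operators; adjust these
# after inspecting a real plan in your own environment.
ASSUMED_NATIVE_MARKERS = ("Transformer", "Velox")


def classify_plan(plan_text: str, markers=ASSUMED_NATIVE_MARKERS) -> dict:
    """Split plan lines into native and fallback based on marker substrings."""
    native, fallback = [], []
    for line in plan_text.splitlines():
        # Strip the tree-drawing prefix printed before operator names.
        operator = line.strip().lstrip("+-:* ").strip()
        if not operator:
            continue
        bucket = native if any(m in operator for m in markers) else fallback
        bucket.append(operator)
    return {"native": native, "fallback": fallback}


# Hypothetical plan text, written by hand for this example.
sample_plan = """
VeloxColumnarToRowExec
+- ProjectExecTransformer [category, total]
   +- BatchEvalPython [my_udf(sales)]
"""
result = classify_plan(sample_plan)
print(result["fallback"])  # ['BatchEvalPython [my_udf(sales)]']
```

In this made-up plan the Python UDF operator is the only line without a native marker, which is the kind of fallback the check is meant to surface.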
2. Intelligent Pooling Advisor
Detects if you're wasting 3-5 minutes spinning up Custom Pools for jobs that could run on Starter Pools.
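The core of such a check can be sketched as a set lookup over the configs a notebook sets explicitly. This is an illustration only: `SESSION_IMMUTABLE_KEYS` holds just the two keys named in the sample output above, and `find_pool_forcing_configs` is a hypothetical name, not a sparkwise function.

```python
# Keys taken from the sample output above; a real list would likely be longer.
SESSION_IMMUTABLE_KEYS = {
    "spark.executor.memory",
    "spark.dynamicAllocation.maxExecutors",
}


def find_pool_forcing_configs(user_set_conf: dict) -> dict:
    """Return the explicitly set configs that are session-immutable."""
    return {
        key: value
        for key, value in user_set_conf.items()
        if key in SESSION_IMMUTABLE_KEYS
    }


flagged = find_pool_forcing_configs({
    "spark.executor.memory": "8g",
    "spark.sql.shuffle.partitions": "200",
})
print(flagged)  # {'spark.executor.memory': '8g'}
```

An empty result means none of the listed keys were set, so nothing in this short list would push the session off the Starter Pool.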
3. Data Skew Detection
Identifies when one task is taking 2x+ longer than others, indicating skewed data distribution.
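The max-versus-median comparison can be sketched in a few lines. This is a minimal illustration using the numbers from the sample output (145s slowest task, 32s median); the names `skew_ratio` and `is_skewed` and the default threshold of 2.0 are assumptions for this example, not sparkwise's API.

```python
from statistics import median


def skew_ratio(task_durations_s: list) -> float:
    """Ratio of the slowest task to the median task duration."""
    if not task_durations_s:
        raise ValueError("at least one task duration is required")
    med = median(task_durations_s)
    if med <= 0:
        raise ValueError("median duration must be positive")
    return max(task_durations_s) / med


def is_skewed(task_durations_s: list, threshold: float = 2.0) -> bool:
    """Flag a stage when the slowest task is at least `threshold` times the median."""
    return skew_ratio(task_durations_s) >= threshold


durations = [30, 31, 32, 33, 145]
print(round(skew_ratio(durations), 2))  # 4.53
print(is_skewed(durations))  # True
```

The median is used rather than the mean because a single straggler inflates the mean and would partly hide itself.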
4. Delta & Storage Optimizations
- V-Order enablement for Power BI/Direct Lake performance
- Deletion Vectors for efficient MERGE operations
- Optimize Write for small file prevention
5. Runtime Tuning Recommendations
- AQE configuration validation
- Partition sizing analysis
- Scheduler mode recommendations
- Driver vs Executor balance checks
6. Interactive Documentation
Ask questions about any Spark, Delta, or Fabric configuration and get clear, context-aware explanations.
Core Analysis Modules
| Module | What It Checks | Key Metrics |
|---|---|---|
| Native Compliance | Velox engine usage | Physical plan analysis, fallback detection |
| Pooling Efficiency | Starter vs Custom Pool | Node count, startup overhead |
| Skew Detection | Task duration variance | Max vs Median task time |
| Delta Hygiene | V-Order, Deletion Vectors | Storage format, merge performance |
| Runtime Tuning | AQE, partitioning, scheduler | Partition sizes, parallelism |
| Resource Profile | Driver/Executor balance | Memory allocation, OOM risks |
Advanced Usage
Analyze with DataFrame Context
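from pyspark.sql import functions as F
from sparkwise import diagnose

# Provide a DataFrame for deep plan analysis
df = spark.read.parquet("/lakehouse/data/large_table")
# Use Spark's sum, not Python's built-in sum, inside agg()
result = df.groupBy("category").agg(F.sum("sales"))
diagnose.analyze(result)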
Get Configuration Report
from sparkwise import config_report
# Get detailed report of current vs recommended configurations
report = config_report.generate()
print(report.to_markdown())
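For a sense of what a current-versus-recommended report might contain, here is a minimal sketch that diffs two plain dictionaries and renders the differences as a Markdown table. It is not the implementation behind `config_report.generate()`; the function `config_diff_markdown` and the recommended values are assumptions for this example.

```python
def config_diff_markdown(current: dict, recommended: dict) -> str:
    """Render configs whose current value differs from the recommendation."""
    rows = ["| Config | Current | Recommended |", "|---|---|---|"]
    for key in sorted(recommended):
        have = current.get(key, "(unset)")
        want = recommended[key]
        if str(have) != str(want):
            rows.append(f"| {key} | {have} | {want} |")
    return "\n".join(rows)


# Illustrative values only; the recommendations here are assumptions.
current_conf = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.parquet.vorder.enabled": "false",
}
recommended_conf = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.parquet.vorder.enabled": "true",
}
report_text = config_diff_markdown(current_conf, recommended_conf)
print(report_text)
```

Only rows that differ are emitted, so a report with just the two header lines means the session already matches the recommendations.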
Export Recommendations
# Save recommendations to file
diagnose.analyze_last_run(export_path="/lakehouse/reports/optimization_report.json")
Architecture
sparkwise/
├── core/
│   ├── advisor.py          # Main diagnostic engine
│   ├── native_check.py     # Velox/Native execution verification
│   ├── pool_check.py       # Pooling strategy analysis
│   ├── skew_check.py       # Data skew detection
│   ├── delta_check.py      # Delta/Storage optimizations
│   └── runtime_check.py    # Runtime configuration tuning
├── knowledge_base/
│   ├── spark_configs.yaml  # OSS Spark configurations
│   ├── delta_configs.yaml  # Delta Lake configurations
│   └── fabric_configs.yaml # Fabric-specific configurations
├── error_diagnosis/
│   └── error_parser.py     # Error explanation engine
├── cli/
│   └── main.py             # Command-line interface
└── utils/
    └── session_utils.py    # SparkSession utilities
Examples
๐ Examples
Check out the examples directory for:
- Basic analysis workflow
- Error diagnosis patterns
- Configuration Q&A usage
- Integration with existing notebooks
Contributing
Contributions are welcome! Please read our Contributing Guide for details.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
Built with ❤️ for the Microsoft Fabric Spark community. Special thanks to the Fabric Data Engineering team for their work on the Native Execution Engine.
Contact
- Author: Santhosh Ravindran
- GitHub: @santhoshravindran7
- Issues: GitHub Issues
Tagline: "Before you head off for the holidays, make sure your Fabric jobs aren't burning budget while you sleep."