The automated technical fellow for your Fabric Spark workloads - intelligent configuration analysis and optimization recommendations

🔥 sparkwise

The automated technical fellow for your Fabric Spark workloads

sparkwise is an intelligent configuration advisor for Apache Spark on Microsoft Fabric. It automatically analyzes your Spark workloads, detects performance issues, and provides actionable optimization recommendations - all without you having to scan through thousands of configuration options.

🎯 Why sparkwise?

As a Spark developer on Microsoft Fabric, you face:

  • Millions of configuration combinations - impossible to know which ones matter for your workload
  • Runtime mysteries - jobs fail or run slowly with cryptic error messages
  • Hidden optimizations - missing out on Native Execution Engine, V-Order, or proper pooling strategies
  • Immutable config traps - accidentally setting configs that force 3-5 minute cold starts
  • Documentation overload - OSS Spark, Delta Lake, and Fabric-specific configs scattered everywhere

sparkwise solves this by acting as your personal Spark expert that:

  • ✅ Analyzes your current session and detects misconfigurations
  • ✅ Warns when you accidentally force Custom Pool usage (save 3-5min per run!)
  • ✅ Explains errors in plain English with remediation steps
  • ✅ Recommends configuration tweaks based on your workload characteristics
  • ✅ Provides an interactive Q&A interface for 100+ Spark/Delta/Fabric configurations

🚀 Quick Start

Installation

pip install sparkwise

Usage in Fabric Notebook

from sparkwise import diagnose

# Run comprehensive analysis after your Spark job
diagnose.analyze_last_run()

Output:

🚀 Running sparkwise Analysis...

🔎 --- Native Execution Engine ---
✅ Native Engine ACTIVE: Your query is fully vectorized (Velox detected)

🏊 --- Pooling Strategy ---
🔴 CRITICAL: Session-Immutable Configs Detected
======================================================================
The following configs FORCE Custom Pool usage (3-5min cold-start):

   • spark.executor.memory = 8g
   • spark.dynamicAllocation.maxExecutors = 20

💡 Impact:
   ❌ Cannot use Starter Pool (instant startup)
   ❌ Forced to Custom Pool (3-5 minute cold-start)
   ❌ Additional capacity consumption

✅ Solution:
   1. Remove these spark.conf.set() calls from your notebook
   2. Use Starter Pool defaults (auto-configured by Fabric)
   3. Only set these if you truly need Custom Pool
======================================================================

💾 --- Storage & Delta Optimizations ---
⚠️ Performance Miss: V-Order is DISABLED
   💡 Set 'spark.sql.parquet.vorder.enabled=true' for 3x faster Power BI reads

⚙️ --- Runtime Tuning ---
✅ Adaptive Query Execution (AQE) is Active
✅ Optimal partition sizing for your workload

📊 --- Data Skew Detection ---
⚠️ Data Skew Detected: One task took 145s while median was 32s
   💡 Consider salting your join keys or repartitioning

Done. 🎄 Happy Optimizing!

Interactive Configuration Assistant

from sparkwise import ask

# Ask about any configuration
ask.config("spark.sql.shuffle.partitions")

# Search across 100+ documented configs
ask.search("partition")

Knowledge Base: 100+ Configurations

  • 55+ Core Spark configurations (shuffle, memory, AQE, serialization, etc.)
  • 17 Delta Lake configurations (V-Order, deletion vectors, OPTIMIZE, VACUUM, etc.)
  • 12 Fabric-specific configurations (Native Engine, Starter Pools, OneLake, etc.)
  • Critical: Session-immutable configs that force Custom Pool usage

Output of ask.config("spark.sql.shuffle.partitions"):

📚 spark.sql.shuffle.partitions

Default: 200
Scope: Session-level, can be changed at runtime

What it does:
Controls the number of partitions created during shuffle operations 
(joins, aggregations, etc.). The default 200 is optimized for small 
clusters but may be suboptimal for large-scale workloads.

Recommendations for your workload:
- Small data (<10GB): 50-100 partitions
- Medium data (10-100GB): 200-500 partitions  
- Large data (>100GB): 1000-2000 partitions
- Formula: num_executors * executor_cores * 2-3

Fabric-specific notes:
On Starter Pools with Native Execution, start with 100-200 and let
AQE (Adaptive Query Execution) handle dynamic coalescing.

Related configs:
- spark.sql.adaptive.coalescePartitions.enabled
- spark.sql.files.maxPartitionBytes
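
The sizing formula in the output above (num_executors * executor_cores * 2-3) reads most naturally as a range of 2 to 3 partitions per core rather than a subtraction. A minimal sketch under that assumption; the function name suggest_shuffle_partitions is hypothetical and not part of sparkwise:

```python
def suggest_shuffle_partitions(num_executors: int, executor_cores: int) -> tuple:
    """Return a (low, high) range for spark.sql.shuffle.partitions.

    Assumption: "2-3" in the formula means 2 to 3 partitions per core.
    """
    if num_executors < 1 or executor_cores < 1:
        raise ValueError("num_executors and executor_cores must be positive")
    total_cores = num_executors * executor_cores
    return total_cores * 2, total_cores * 3


low, high = suggest_shuffle_partitions(num_executors=8, executor_cores=4)
print(low, high)  # 64 96
```

Because the setting can be changed at runtime, a value from this range could then be applied in a notebook with spark.conf.set("spark.sql.shuffle.partitions", str(low)).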

Error Diagnosis

from sparkwise import diagnose

# When you hit an error
diagnose.explain_error("org.apache.spark.shuffle.FetchFailedException")
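
One way an error explainer of this kind could work is a lookup keyed on the exception class name. The sketch below is an illustration only: the ERROR_HINTS table, the explain function, and its return shape are hypothetical and say nothing about how sparkwise implements explain_error.

```python
# Hypothetical knowledge table: exception class name -> (cause, remediation).
ERROR_HINTS = {
    "FetchFailedException": (
        "A reducer could not fetch shuffle blocks, often because the executor "
        "that wrote them was lost (for example, killed for exceeding memory).",
        "Check executor logs for memory-related kills, then consider more "
        "shuffle partitions or addressing data skew.",
    ),
    "OutOfMemoryError": (
        "The JVM ran out of heap on the driver or an executor.",
        "Avoid collecting large results to the driver and review partition sizes.",
    ),
}


def explain(error_text: str) -> dict:
    """Match the error text against known class names and return a hint."""
    for class_name, (cause, remedy) in ERROR_HINTS.items():
        if class_name in error_text:
            return {"error": class_name, "cause": cause, "remedy": remedy}
    return {
        "error": None,
        "cause": "Unrecognized error",
        "remedy": "Inspect the full stack trace.",
    }


hint = explain("org.apache.spark.shuffle.FetchFailedException")
print(hint["error"])
print(hint["remedy"])
```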

🎯 Key Features

1. Native Execution Engine Verification

Checks if you're actually using Fabric's Velox-based Native Execution Engine or accidentally falling back to slower row-based processing due to UDFs.
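
A check like this can be approximated by scanning the physical plan text for operator names. The sketch below assumes that native operators are rendered with a Transformer suffix or as VeloxColumnarToRowExec, and it looks for Spark's Python UDF operators (BatchEvalPython, ArrowEvalPython) as a fallback signal. The native marker strings and the summarize_plan helper are assumptions for illustration, not sparkwise internals, and real plan text may differ between runtime versions.

```python
# Assumed markers for natively executed operators.
NATIVE_MARKERS = ("Transformer", "VeloxColumnarToRowExec")
# Spark's physical operators for Python UDFs and pandas UDFs.
UDF_MARKERS = ("BatchEvalPython", "ArrowEvalPython")


def summarize_plan(plan_text: str) -> dict:
    """Report whether native operators and UDF operators appear in a plan."""
    lines = [line.strip() for line in plan_text.splitlines() if line.strip()]
    native = [l for l in lines if any(m in l for m in NATIVE_MARKERS)]
    udfs = [l for l in lines if any(m in l for m in UDF_MARKERS)]
    return {
        "native_operators": len(native),
        "udf_operators": len(udfs),
        # Fallback is likely if a UDF operator appears or nothing ran natively.
        "likely_fallback": bool(udfs) or not native,
    }


# Hand-written plan text for illustration, not captured from a real run.
sample_plan = """
VeloxColumnarToRowExec
+- ProjectExecTransformer [category, total]
   +- BatchEvalPython [my_udf(sales)]
"""
print(summarize_plan(sample_plan))
```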

2. Intelligent Pooling Advisor

Detects if you're wasting 3-5 minutes spinning up Custom Pools for jobs that could run on Starter Pools.
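
The core of such a check is a membership test against a list of session-immutable keys. A minimal sketch, assuming the list contains only the two keys shown in the sample output earlier; the real list is likely longer, and find_pool_forcing_configs is a hypothetical name:

```python
# Keys taken from the sample output; the full list is an assumption.
POOL_FORCING_KEYS = {
    "spark.executor.memory",
    "spark.dynamicAllocation.maxExecutors",
}


def find_pool_forcing_configs(explicit_conf: dict) -> dict:
    """Return the explicitly set configs that would force a Custom Pool."""
    return {k: v for k, v in explicit_conf.items() if k in POOL_FORCING_KEYS}


# Example input: configs a notebook set explicitly (values are illustrative).
explicit = {
    "spark.executor.memory": "8g",
    "spark.dynamicAllocation.maxExecutors": "20",
    "spark.sql.shuffle.partitions": "200",
}
flagged = find_pool_forcing_configs(explicit)
for key, value in sorted(flagged.items()):
    print(f"{key} = {value}")
```

Runtime-changeable settings such as spark.sql.shuffle.partitions pass through unflagged, which matches the distinction the advisor draws between session-level and session-immutable configs.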

3. Data Skew Detection

Identifies when one task is taking 2x+ longer than others, indicating skewed data distribution.
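
The max-versus-median comparison can be sketched in a few lines. This is an illustration under stated assumptions: the detect_skew name, the input list of task durations in seconds, and the default 2.0 threshold (taken from the "2x+" wording) are not confirmed sparkwise behavior.

```python
from statistics import median


def detect_skew(task_durations_s: list, threshold: float = 2.0) -> dict:
    """Flag skew when the slowest task takes threshold x the median or more."""
    if not task_durations_s:
        raise ValueError("need at least one task duration")
    med = median(task_durations_s)
    slowest = max(task_durations_s)
    if med > 0:
        ratio = slowest / med
    else:
        # All-zero durations are treated as balanced.
        ratio = float("inf") if slowest > 0 else 1.0
    return {
        "max_s": slowest,
        "median_s": med,
        "ratio": ratio,
        "skewed": ratio >= threshold,
    }


# Durations chosen to mirror the sample output: max 145s, median 32s.
result = detect_skew([30, 31, 32, 33, 145])
print(result["skewed"], round(result["ratio"], 2))  # True 4.53
```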

4. Delta & Storage Optimizations

  • V-Order enablement for Power BI/Direct Lake performance
  • Deletion Vectors for efficient MERGE operations
  • Optimize Write for small file prevention

5. Runtime Tuning Recommendations

  • AQE configuration validation
  • Partition sizing analysis
  • Scheduler mode recommendations
  • Driver vs Executor balance checks
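
As an illustration of the first item, AQE validation can be reduced to checking a couple of boolean settings. The sketch below works on a plain dict of explicitly known values, so it reports an unset key as a warning even though a runtime may default it to true; check_aqe is a hypothetical helper, not a sparkwise function.

```python
# spark.sql.adaptive.enabled is the standard Spark switch for AQE;
# the coalescing key controls AQE's merging of small shuffle partitions.
AQE_KEYS = (
    "spark.sql.adaptive.enabled",
    "spark.sql.adaptive.coalescePartitions.enabled",
)


def check_aqe(conf: dict) -> list:
    """Return warning strings for AQE settings that are not 'true'."""
    warnings = []
    for key in AQE_KEYS:
        value = str(conf.get(key, "unset")).lower()
        if value != "true":
            warnings.append(f"{key} is {value}")
    return warnings


print(check_aqe({"spark.sql.adaptive.enabled": "true"}))
```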

6. Interactive Documentation

Ask questions about any Spark, Delta, or Fabric configuration and get clear, context-aware explanations.

📋 Core Analysis Modules

Module             | What It Checks               | Key Metrics
Native Compliance  | Velox engine usage           | Physical plan analysis, fallback detection
Pooling Efficiency | Starter vs Custom Pool       | Node count, startup overhead
Skew Detection     | Task duration variance       | Max vs Median task time
Delta Hygiene      | V-Order, Deletion Vectors    | Storage format, merge performance
Runtime Tuning     | AQE, partitioning, scheduler | Partition sizes, parallelism
Resource Profile   | Driver/Executor balance      | Memory allocation, OOM risks

๐Ÿ› ๏ธ Advanced Usage

Analyze with DataFrame Context

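from pyspark.sql import functions as F
from sparkwise import diagnose

# Provide a DataFrame for deep plan analysis
df = spark.read.parquet("/lakehouse/data/large_table")
# Use Spark's F.sum aggregate; Python's built-in sum() fails on a column name
result = df.groupBy("category").agg(F.sum("sales"))

diagnose.analyze(result)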

Get Configuration Report

from sparkwise import config_report

# Get detailed report of current vs recommended configurations
report = config_report.generate()
print(report.to_markdown())

Export Recommendations

# Save recommendations to file
diagnose.analyze_last_run(export_path="/lakehouse/reports/optimization_report.json")

๐Ÿ—๏ธ Architecture

sparkwise/
โ”œโ”€โ”€ core/
โ”‚   โ”œโ”€โ”€ advisor.py          # Main diagnostic engine
โ”‚   โ”œโ”€โ”€ native_check.py     # Velox/Native execution verification
โ”‚   โ”œโ”€โ”€ pool_check.py       # Pooling strategy analysis
โ”‚   โ”œโ”€โ”€ skew_check.py       # Data skew detection
โ”‚   โ”œโ”€โ”€ delta_check.py      # Delta/Storage optimizations
โ”‚   โ””โ”€โ”€ runtime_check.py    # Runtime configuration tuning
โ”œโ”€โ”€ knowledge_base/
โ”‚   โ”œโ”€โ”€ spark_configs.yaml  # OSS Spark configurations
โ”‚   โ”œโ”€โ”€ delta_configs.yaml  # Delta Lake configurations
โ”‚   โ””โ”€โ”€ fabric_configs.yaml # Fabric-specific configurations
โ”œโ”€โ”€ error_diagnosis/
โ”‚   โ””โ”€โ”€ error_parser.py     # Error explanation engine
โ”œโ”€โ”€ cli/
โ”‚   โ””โ”€โ”€ main.py            # Command-line interface
โ””โ”€โ”€ utils/
    โ””โ”€โ”€ session_utils.py   # SparkSession utilities

🎓 Examples

Check out the examples directory for:

  • Basic analysis workflow
  • Error diagnosis patterns
  • Configuration Q&A usage
  • Integration with existing notebooks

๐Ÿค Contributing

Contributions are welcome! Please read our Contributing Guide for details.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

๐Ÿ™ Acknowledgments

Built with โค๏ธ for the Microsoft Fabric Spark community. Special thanks to the Fabric Data Engineering team for their work on the Native Execution Engine.

📬 Contact


Tagline: "Before you head off for the holidays, make sure your Fabric jobs aren't burning budget while you sleep. 🎄"
