Automated Data Engineering specialist for Fabric Spark workloads - intelligent configuration analysis and optimization recommendations
sparkwise
Achieve optimal Fabric Spark price-performance with automated insights - simplifies tuning, makes optimization fun
sparkwise is an automated Data Engineering specialist for Apache Spark on Microsoft Fabric. It provides intelligent diagnostics, configuration recommendations, and comprehensive session profiling to help you achieve the best price-performance for your workloads - all while making Spark tuning simple and enjoyable.
Why sparkwise?
Spark tuning on Microsoft Fabric doesn't have to be complex or expensive. sparkwise helps you:
- Optimize costs - Detect configurations that waste capacity and increase runtime
- Maximize performance - Enable Fabric-specific optimizations (Native Engine, V-Order, resource profiles)
- Simplify learning - Interactive Q&A for 133 Spark/Delta/Fabric configurations
- Understand workloads - Comprehensive profiling of sessions, executors, jobs, and resources
- Save time - Avoid 3-5 min cold starts by detecting Starter Pool blockers
- Make data-driven decisions - Priority-ranked recommendations with impact analysis
Key Features
Automated Diagnostics
- Native Execution Engine - Verifies Velox usage, detects fallbacks to row-based processing
- Spark Compute - Analyzes Starter vs Custom Pool usage, warns about immutable configs
- Data Skew Detection - Identifies imbalanced task distributions
- Delta Optimizations - Checks V-Order, Deletion Vectors, Optimize Write, Auto Compaction
- Runtime Tuning - Validates AQE, partition sizing, scheduler mode
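To give a feel for what a runtime-tuning check involves, here is a minimal sketch that compares a plain dictionary of session settings against expected values. The `EXPECTED` table and the `check_runtime` helper are hypothetical names chosen for illustration and are not sparkwise's own rule set; only the two Spark configuration keys are real.

```python
# Hypothetical sketch of a runtime-tuning check; not sparkwise's implementation.
# The expected values below are illustrative assumptions.
EXPECTED = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
}


def check_runtime(conf: dict) -> list:
    """Return (key, actual, expected) for every setting that differs."""
    findings = []
    for key, expected in EXPECTED.items():
        # Missing keys are reported as "unset" rather than raising.
        actual = str(conf.get(key, "unset")).lower()
        if actual != expected:
            findings.append((key, actual, expected))
    return findings


session_conf = {"spark.sql.adaptive.enabled": "false"}
for key, actual, expected in check_runtime(session_conf):
    print(f"{key}: found {actual}, expected {expected}")
```

In a notebook, the dictionary could be filled from the live session, for example with `spark.conf.get(key, "unset")` for each key of interest.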
Comprehensive Profiling
- Session Profiling - Application metadata, resource allocation, memory breakdown
- Executor Profiling - Executor status, memory utilization, task distribution
- Job Profiling - Job/stage/task metrics, bottleneck detection
- Resource Profiling - Efficiency scoring, utilization analysis, optimization recommendations
Advanced Performance Analysis (NEW!)
- Real Metrics Collection - Uses actual Spark stage/task data instead of estimates
- Scalability Prediction - Compare Starter vs Custom Pool with real VCore-hour calculations
- Stage Timeline - Visualize execution patterns with parallel/sequential analysis
- Efficiency Analysis - Quantify wasted compute in VCore-hours with actionable recommendations
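One plausible way to express wasted compute in VCore-hours is allocated capacity minus the time tasks actually spent running. The sketch below assumes each task occupies exactly one VCore while it runs; the function names and the formula are illustrative assumptions, not the calculation sparkwise performs.

```python
# Hypothetical sketch: wasted VCore-hours = allocated - used.
# Assumes one VCore per running task (spark.task.cpus = 1).

def vcore_hours(vcores: int, seconds: float) -> float:
    """Convert a VCore count held for `seconds` into VCore-hours."""
    return vcores * seconds / 3600.0


def wasted_vcore_hours(allocated_vcores: int, wall_clock_s: float,
                       total_task_s: float) -> float:
    """Allocated VCore-hours minus the VCore-hours tasks actually consumed."""
    allocated = vcore_hours(allocated_vcores, wall_clock_s)
    used = total_task_s / 3600.0
    # Clamp at zero in case task time is over-reported.
    return max(allocated - used, 0.0)


# 16 VCores held for 30 minutes = 8 VCore-hours allocated;
# tasks ran for a combined 4 hours, so 4 VCore-hours were idle.
waste = wasted_vcore_hours(16, 1800, 4 * 3600)
print(f"Wasted: {waste:.2f} VCore-hours")  # Wasted: 4.00 VCore-hours
```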
Advanced Skew Detection (NEW!)
- Task Duration Analysis - Detect stragglers and long-running tasks with variance detection
- Partition-Level Analysis - Identify data distribution imbalances with statistical metrics
- Skewed Join Detection - Analyze join patterns and recommend broadcast vs salting strategies
- Automatic Mitigation - Get code examples for salting, AQE, and broadcast optimizations
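The idea behind task duration analysis can be sketched with the standard library alone: compare each task against the median and flag the outliers. The helper names and the 3x threshold are assumptions made for this example and do not reflect sparkwise's internal thresholds.

```python
# Hypothetical sketch of straggler detection from task durations (seconds).
from statistics import median


def skew_ratio(durations_s: list) -> float:
    """Slowest task divided by the median task; 1.0 means perfectly even."""
    med = median(durations_s)
    return max(durations_s) / med if med > 0 else float("inf")


def find_stragglers(durations_s: list, threshold: float = 3.0) -> list:
    """Indices of tasks that ran longer than `threshold` times the median."""
    med = median(durations_s)
    return [i for i, d in enumerate(durations_s) if d > threshold * med]


durations = [10, 12, 11, 9, 95]  # one task is far slower than the rest
print(f"skew ratio: {skew_ratio(durations):.1f}")
print(find_stragglers(durations))  # [4]
```

The median is used instead of the mean so that a single very slow task does not raise the baseline it is compared against.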
SQL Query Plan Analysis (NEW!)
- Anti-Pattern Detection - Identify cartesian products, full scans, and excessive shuffles
- Native Engine Compatibility - Check if queries use Fabric Native Engine (3-8x faster)
- Z-Order Recommendations - Suggest best columns for Delta optimization based on cardinality
- Caching Opportunities - Detect repeated table scans that benefit from caching
- Fabric Best Practices - V-Order, broadcast joins, AQE, and partition recommendations
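Anti-pattern detection of this kind can be approximated by scanning physical plan text for known operator names. The sketch below works on a plain string, such as the text printed by `df.explain()`; the `scan_plan` helper, the shuffle limit, and the hand-written plan are hypothetical and only illustrate the approach.

```python
# Hypothetical sketch: scan physical-plan text for costly operators.
import re

# Operator names as they appear in Spark physical plans.
ANTI_PATTERNS = {
    "CartesianProduct": "cartesian product join",
    "BroadcastNestedLoopJoin": "nested loop join (often a missing join key)",
}


def scan_plan(plan_text: str, max_shuffles: int = 3) -> list:
    """Return a human-readable finding for each anti-pattern in the plan."""
    findings = [msg for op, msg in ANTI_PATTERNS.items() if op in plan_text]
    # \bExchange\b counts shuffle exchanges only; it does not match
    # BroadcastExchange or ReusedExchange.
    shuffles = len(re.findall(r"\bExchange\b", plan_text))
    if shuffles > max_shuffles:
        findings.append(f"shuffle count {shuffles} exceeds {max_shuffles}")
    return findings


# Hand-written plan fragment for illustration only.
plan = """
CartesianProduct
:- Exchange hashpartitioning(id#1, 200)
:  +- FileScan parquet [id#1]
+- BroadcastExchange IdentityBroadcastMode
   +- FileScan parquet [k#2]
"""
print(scan_plan(plan, max_shuffles=0))
```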
Interactive Configuration Assistant
- 133 documented configurations - Spark, Delta Lake, Fabric-specific, and Runtime 1.2 configs
- Context-aware guidance - Workload-specific recommendations with impact analysis
- Resource profile support - Understand writeHeavy, readHeavyForSpark, readHeavyForPBI profiles
- Search capabilities - Find configs by keyword or partial name
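Keyword and partial-name search over a configuration catalog can be as simple as a case-insensitive substring match. The three-entry `KNOWLEDGE_BASE` and the `search` function below are stand-ins invented for this sketch; the real knowledge base and `ask.search` may work differently.

```python
# Hypothetical sketch of keyword search over a small config catalog.
KNOWLEDGE_BASE = {
    "spark.sql.adaptive.enabled": "Turns Adaptive Query Execution on or off.",
    "spark.sql.shuffle.partitions": "Default number of partitions used for shuffles.",
    "spark.scheduler.mode": "Job scheduling mode within an application (FIFO or FAIR).",
}


def search(term: str) -> list:
    """Return config names whose name or description contains `term`."""
    needle = term.lower()
    return sorted(
        name
        for name, description in KNOWLEDGE_BASE.items()
        if needle in name.lower() or needle in description.lower()
    )


print(search("adaptive"))  # ['spark.sql.adaptive.enabled']
print(search("sql"))       # both spark.sql.* entries
```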
Priority-Based Recommendations
- Color-coded priorities - Critical (red), High (yellow), Medium (blue), Low (dim)
- Formatted tables - Clean, readable output with impact explanations
- Actionable guidance - Specific commands and configuration values
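Priority ranking amounts to sorting recommendations by a fixed severity order. The sketch below uses the four levels named above; the record layout and the `rank` helper are assumptions for illustration.

```python
# Hypothetical sketch: order recommendations from most to least urgent.
PRIORITY_ORDER = {"CRITICAL": 0, "HIGH": 1, "MEDIUM": 2, "LOW": 3}


def rank(recommendations: list) -> list:
    """Sort by priority; ties keep their original order (sorted is stable)."""
    return sorted(recommendations, key=lambda r: PRIORITY_ORDER[r["priority"]])


recs = [
    {"priority": "MEDIUM", "config": "spark.sql.parquet.vorder.enabled"},
    {"priority": "CRITICAL", "config": "spark.sql.adaptive.enabled"},
]
for rec in rank(recs):
    print(f"{rec['priority']:<8} {rec['config']}")
```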
Quick Start
Installation
pip install sparkwise
Or install the wheel file directly in Fabric:
%pip install sparkwise-0.1.0-py3-none-any.whl
Basic Usage
from sparkwise import diagnose, ask
# Run comprehensive analysis on current session
diagnose.analyze()
# Ask about any configuration
ask.config('spark.native.enabled')
# Search for configurations
ask.search('optimize')
Session Profiling
from sparkwise import (profile, profile_executors, profile_jobs, profile_resources,
predict_scalability, show_timeline, analyze_efficiency)
# Profile complete session
profile()
# Profile executor metrics
profile_executors()
# Analyze job performance
profile_jobs()
# Check resource efficiency
profile_resources()
# Advanced profiling features
predict_scalability() # Compare pool configurations
show_timeline() # Visualize stage execution
analyze_efficiency() # Quantify compute waste
Advanced Analysis
from sparkwise import detect_skew, analyze_query
# Detect data skew
skew_results = detect_skew() # Analyze task-level skew
# Analyze specific DataFrame for partition skew
from sparkwise.core.advanced_skew_detector import AdvancedSkewDetector
detector = AdvancedSkewDetector()
detector.analyze_partition_skew(your_df, ["key_column"])
# Detect skewed joins
detector.detect_skewed_joins(large_df, small_df, "join_key")
# Analyze SQL query plans
query_results = analyze_query(your_df)
# Get Z-Order recommendations
from sparkwise.core.query_plan_analyzer import QueryPlanAnalyzer
analyzer = QueryPlanAnalyzer()
zorder_cols = analyzer.suggest_zorder_columns(delta_df, ["filtered_col"])
# Detect caching opportunities
analyzer.detect_repeated_subqueries(your_df)
Sample Output
Diagnostic Analysis
sparkwise Analysis
Native Execution Engine
──────────────────────────────────────────────
⚠️ Warning: Native keywords not found in physical plan
💡 Check for unsupported operators or complex UDFs
Spark Compute
──────────────────────────────────────────────
✅ Your job uses 1 executors - fits in Starter Pool
💡 Ensure 'Starter Pool' is selected in workspace settings
Storage & Delta Optimizations
──────────────────────────────────────────────
ℹ️ V-Order is DISABLED (optimal for write-heavy workloads)
Benefit: 2x faster writes vs V-Order enabled
💡 Enable only for read-heavy workloads (Power BI/analytics)
Trade-off: 3-10x faster reads, but 15-20% slower writes
ℹ️ Optimize Write is DISABLED (optimal for writeHeavy profile - default)
Benefit: Maximum write throughput for ETL and data ingestion
💡 Enable only for read-heavy or streaming workloads
- readHeavyForSpark: spark.fabric.resourceProfile=readHeavyForSpark
- readHeavyForPBI: spark.fabric.resourceProfile=readHeavyForPBI
Runtime Tuning
──────────────────────────────────────────────
CRITICAL: Adaptive Query Execution (AQE) is DISABLED
💡 Enable immediately: spark.sql.adaptive.enabled=true
Benefits: Dynamic coalescing, skew joins, better parallelism
Summary of Findings
| Category | Status | Critical Issues | Recommendations |
|---|---|---|---|
| Native Execution | ⚠️ | 1 | 1 |
| Spark Compute | ✅ | 0 | 1 |
| Data Skew | ✅ | 0 | 0 |
| Delta | ✅ | 0 | 3 |
| Runtime | ⚠️ | 1 | 2 |
Configuration Recommendations
Total recommendations: 7
| Priority | Configuration | Action | Impact |
|---|---|---|---|
| CRITICAL | spark.sql.adaptive.enabled | Set to 'true' | Enable dynamic partition coalescing |
| MEDIUM | spark.sql.parquet.vorder.enabled | Enable for read-heavy workloads only | 3-10x faster reads for Power BI |
Analysis complete!
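Acting on the critical finding in this sample means enabling AQE and, typically, its coalescing and skew-join sub-features. The sketch below applies a set of settings through any object that exposes `get` and `set`, which is the shape of `spark.conf` in a notebook. The `apply_settings` helper and the `DictConf` stand-in are hypothetical; the three configuration keys are standard Spark settings.

```python
# Hypothetical sketch: apply AQE settings and report which ones changed.
AQE_SETTINGS = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    "spark.sql.adaptive.skewJoin.enabled": "true",
}


def apply_settings(conf, settings: dict) -> list:
    """Set each key on `conf`; return the keys whose value had to change."""
    changed = []
    for key, value in settings.items():
        if conf.get(key, "") != value:
            conf.set(key, value)
            changed.append(key)
    return changed


class DictConf:
    """Stand-in for spark.conf so the sketch runs without a cluster."""

    def __init__(self):
        self._values = {}

    def get(self, key, default=None):
        return self._values.get(key, default)

    def set(self, key, value):
        self._values[key] = value


demo_conf = DictConf()
print(apply_settings(demo_conf, AQE_SETTINGS))  # all three keys on first run
print(apply_settings(demo_conf, AQE_SETTINGS))  # [] on the second run
```

In a Fabric notebook the call would be `apply_settings(spark.conf, AQE_SETTINGS)`, assuming a live `spark` session.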
Interactive Q&A
ask.config('spark.fabric.resourceProfile')
Output:
spark.fabric.resourceProfile
──────────────────────────────────────────────────────────────────────
Default: writeHeavy
Scope: session
What it does:
FABRIC CRITICAL: Selects predefined Spark resource profiles optimized
for specific workload patterns. Simplifies configuration tuning.
Recommendations for your workload:
• etl_ingestion: writeHeavy - optimized for ETL and data ingestion
• analytics_spark: readHeavyForSpark - optimized for analytical queries
• power_bi: readHeavyForPBI - optimized for Power BI Direct Lake
• custom_needs: custom - user-defined configuration
Fabric-specific notes:
Microsoft Fabric resource profiles provide workload-optimized settings:
**writeHeavy (DEFAULT):**
- V-Order: DISABLED for faster writes
- Optimize Write: NULL/DISABLED for maximum throughput
- Use Case: ETL pipelines, data ingestion, batch transformations
**readHeavyForSpark:**
- Optimize Write: ENABLED with 128MB bins
- Use Case: Interactive Spark queries, analytical workloads
**readHeavyForPBI:**
- V-Order: ENABLED for Power BI optimization
- Optimize Write: ENABLED with 1GB bins
- Use Case: Power BI dashboards, Direct Lake scenarios
Related configurations:
• spark.sql.parquet.vorder.enabled
• spark.databricks.delta.optimizeWrite.enabled
• spark.microsoft.delta.optimizeWrite.enabled
Examples:
spark.conf.set('spark.fabric.resourceProfile', 'readHeavyForSpark')
spark.conf.set('spark.fabric.resourceProfile', 'writeHeavy')
──────────────────────────────────────────────────────────────────────
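The workload-to-profile mapping in this output can be captured in a small lookup. The `profile_for` helper is a hypothetical convenience, not part of sparkwise; the workload names and profile values are taken from the output above, and the fallback to `writeHeavy` mirrors the stated default.

```python
# Hypothetical sketch: pick a resource profile for a workload type.
PROFILE_BY_WORKLOAD = {
    "etl_ingestion": "writeHeavy",
    "analytics_spark": "readHeavyForSpark",
    "power_bi": "readHeavyForPBI",
}


def profile_for(workload: str, default: str = "writeHeavy") -> str:
    """Return the profile for `workload`, falling back to the default."""
    return PROFILE_BY_WORKLOAD.get(workload, default)


print(profile_for("power_bi"))      # readHeavyForPBI
print(profile_for("ad_hoc_tests"))  # writeHeavy

# In a notebook with a live session, the result could then be applied:
# spark.conf.set('spark.fabric.resourceProfile', profile_for('power_bi'))
```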
What's Included
Core Modules
- diagnose - Main diagnostic engine with 5 check categories
- ask - Interactive configuration Q&A system
- profile - Session profiling
- profile_executors - Executor-level metrics
- profile_jobs - Job/stage/task analysis
- profile_resources - Resource efficiency scoring
Knowledge Base (133 Configurations)
- 33 Spark configs - Core settings for shuffle, memory, AQE, serialization
- 45 Delta configs - Delta Lake optimizations, V-Order, Deletion Vectors
- 10 Fabric configs - Native Engine, resource profiles, OneLake storage
- 45 Runtime 1.2 configs - Latest Fabric Runtime 1.2 features
Latest Features
- Fabric resource profiles (writeHeavy, readHeavyForSpark, readHeavyForPBI)
- Advanced Delta optimizations (Fast Optimize, Adaptive File Size, File Level Target)
- Driver Mode Snapshot for faster metadata operations
- Comprehensive session profiling tools
- Priority-based recommendation tables
- Color-coded terminal output with Rich library
Use Cases
Data Engineers
- Optimize ETL pipelines - Detect bottlenecks, tune parallelism, reduce costs
- Validate configurations - Ensure proper resource profiles and pool usage
- Debug job failures - Understand errors with plain English explanations
Data Scientists
- Improve notebook performance - Enable Native Engine, optimize memory usage
- Understand Spark behavior - Learn configurations through interactive Q&A
- Profile experiments - Track resource usage and efficiency
Platform Admins
- Standardize best practices - Share optimal configurations across teams
- Monitor capacity usage - Identify jobs forcing Custom Pool usage
- Cost optimization - Detect over-provisioned or misconfigured workloads
CLI Usage
# Run diagnostics
sparkwise analyze
# Profile session
sparkwise profile session
# Profile executors
sparkwise profile executors
# Profile jobs
sparkwise profile jobs --max-jobs 5
# Profile resources
sparkwise profile resources
# Analyze bottlenecks
sparkwise profile bottlenecks
# Ask about configuration
sparkwise ask spark.sql.shuffle.partitions
# Search configurations
sparkwise search "adaptive"
Architecture
sparkwise/
├── core/
│   ├── advisor.py              # Main diagnostic orchestrator
│   ├── native_check.py         # Velox/Native execution verification
│   ├── pool_check.py           # Starter vs Custom Pool analysis
│   ├── skew_check.py           # Data skew detection
│   ├── delta_check.py          # Delta Lake optimizations
│   └── runtime_check.py        # Runtime configuration tuning
├── profiling/
│   ├── session_profiler.py     # Complete session analysis
│   ├── executor_profiler.py    # Executor metrics
│   ├── job_profiler.py         # Job/stage/task profiling
│   └── resource_profiler.py    # Resource efficiency analysis
├── knowledge_base/
│   ├── spark_configs.yaml      # Core Spark configurations
│   ├── delta_configs.yaml      # Delta Lake configurations
│   ├── fabric_configs.yaml     # Fabric-specific configs
│   └── fabric_runtime_1.2_configs.yaml  # Runtime 1.2 features
├── cli.py                      # Command-line interface
└── config_qa.py                # Interactive Q&A assistant
Examples
Check out the examples directory:
- basic_analysis.py - Basic diagnostic workflow
- config_qa_demo.py - Configuration Q&A usage
- profiling_demo.py - Comprehensive profiling examples
- knowledge_base_demo.py - Knowledge base exploration
- immutable_configs_demo.py - Starter Pool optimization
Running Tests
# Install test dependencies
pip install pytest pytest-cov
# Run all tests
pytest
# Run with coverage
pytest --cov=sparkwise --cov-report=html
# Run specific test file
pytest tests/test_advisor.py
Contributing
Contributions are welcome! Please read our Contributing Guide for details.
Development Setup
# Clone the repository
git clone https://github.com/santhoshravindran7/sparkwise.git
cd sparkwise
# Create virtual environment
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install in development mode
pip install -e ".[dev]"
# Run tests
pytest
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
Built with ❤️ for the Microsoft Fabric Data Engineering and Data Science community.
Contact & Support
- Author: Santhosh Ravindran
- GitHub: @santhoshravindran7
- Issues: Report bugs or request features
What's New in v0.1.0
- Complete profiling suite (session, executor, job, resource profilers)
- Rich terminal output with color-coded priorities
- Priority-based recommendation tables
- Fabric resource profile support (writeHeavy, readHeavy profiles)
- 4 new advanced Delta optimizations
- 133 documented configurations (up from 100)
- Context-aware Optimize Write recommendations
- CLI support for all profiling operations
Make Spark tuning fun again!
File details
Details for the file sparkwise-1.3.4.tar.gz.
File metadata
- Download URL: sparkwise-1.3.4.tar.gz
- Upload date:
- Size: 90.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3499fad31b5a34ea886f527801ec6f29c8e307caba41cebed39095b66503c51e |
| MD5 | 156a324f42009188ff27ae8ef674538b |
| BLAKE2b-256 | f7f865850ec805e5eebde456a72e2be18a0e01191d77658b4bf1150e10fec9cb |
File details
Details for the file sparkwise-1.3.4-py3-none-any.whl.
File metadata
- Download URL: sparkwise-1.3.4-py3-none-any.whl
- Upload date:
- Size: 98.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e48168076f8de7c6eee3ccedfbf0a73b6a485f082eb1dea195c52e987ba15bdc |
| MD5 | 161b32be72455c95d7fcb34770351028 |
| BLAKE2b-256 | 142b3e0575685a3a513bd3544589ca04bf5bafdca2ba36955d414ac938dba679 |