Automated Data Engineering specialist for Fabric Spark workloads - intelligent configuration analysis and optimization recommendations
🔥 Sparkwise
Achieve optimal Fabric Spark price-performance with automated insights - simplifies tuning, makes optimization fun
sparkwise is an automated Data Engineering specialist for Apache Spark on Microsoft Fabric. It provides intelligent diagnostics, configuration recommendations, and comprehensive session profiling to help you achieve the best price-performance for your workloads - all while making Spark tuning simple and enjoyable.
🎯 Why sparkwise?
Spark tuning on Microsoft Fabric doesn't have to be complex or expensive. sparkwise helps you:
- 💰 Optimize costs - Detect configurations that waste capacity and increase runtime
- ⚡ Maximize performance - Enable Fabric-specific optimizations (Native Engine, V-Order, resource profiles)
- Simplify learning - Interactive Q&A for 133 Spark/Delta/Fabric configurations
- Understand workloads - Comprehensive profiling of sessions, executors, jobs, and resources
- ⏱️ Save time - Avoid 3-5 minute cold starts by detecting Starter Pool blockers
- Make data-driven decisions - Priority-ranked recommendations with impact analysis
✨ Key Features
🔬 Automated Diagnostics
- Native Execution Engine - Verifies Velox usage, detects fallbacks to row-based processing
- Spark Compute - Analyzes Starter vs Custom Pool usage, warns about immutable configs
- Data Skew Detection - Identifies imbalanced task distributions
- Delta Optimizations - Checks V-Order, Deletion Vectors, Optimize Write, Auto Compaction
- Runtime Tuning - Validates AQE, partition sizing, scheduler mode
Comprehensive Profiling
- Session Profiling - Application metadata, resource allocation, memory breakdown
- Executor Profiling - Executor status, memory utilization, task distribution
- Job Profiling - Job/stage/task metrics, bottleneck detection
- Resource Profiling - Efficiency scoring, utilization analysis, optimization recommendations
Advanced Performance Analysis (NEW!)
- Real Metrics Collection - Uses actual Spark stage/task data instead of estimates
- Scalability Prediction - Compare Starter vs Custom Pool with real VCore-hour calculations
- Stage Timeline - Visualize execution patterns with parallel/sequential analysis
- Efficiency Analysis - Quantify wasted compute in VCore-hours with actionable recommendations
Advanced Skew Detection (NEW!)
- Task Duration Analysis - Detect stragglers and long-running tasks with variance detection
- Partition-Level Analysis - Identify data distribution imbalances with statistical metrics
- Skewed Join Detection - Analyze join patterns and recommend broadcast vs salting strategies
- Automatic Mitigation - Get code examples for salting, AQE, and broadcast optimizations
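To make the task duration idea concrete, here is a minimal sketch of flagging straggler tasks by comparing each task's duration to the stage median. It is not sparkwise's actual algorithm: the `find_stragglers` helper, the `factor` threshold, and the sample durations are all hypothetical and only illustrate the general technique.

```python
from statistics import median

def find_stragglers(durations_s, factor=2.0):
    """Return (task_index, duration) pairs slower than factor x the median.

    Hypothetical helper: the 2x-median cutoff is an illustrative choice,
    not a threshold documented by sparkwise.
    """
    if not durations_s:
        return []
    cutoff = factor * median(durations_s)
    return [(i, d) for i, d in enumerate(durations_s) if d > cutoff]

# Made-up task durations (seconds) for a single stage
durations = [4.1, 3.9, 4.3, 4.0, 19.5, 4.2]
stragglers = find_stragglers(durations)
print(stragglers)
```

The median is used rather than the mean so that a single very slow task does not raise the cutoff it is measured against.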
🎯 SQL Query Plan Analysis (NEW!)
- Anti-Pattern Detection - Identify cartesian products, full scans, and excessive shuffles
- Native Engine Compatibility - Check if queries use Fabric Native Engine (3-8x faster)
- Z-Order Recommendations - Suggest best columns for Delta optimization based on cardinality
- Caching Opportunities - Detect repeated table scans that benefit from caching
- Fabric Best Practices - V-Order, broadcast joins, AQE, and partition recommendations
Storage Optimization (NEW in v1.4.0!)
- Small File Detection - Identify Delta tables with excessive small files (<10MB configurable threshold)
- VACUUM ROI Calculator - Estimate storage savings vs compute cost using OneLake pricing ($0.023/GB/month)
- Partition Effectiveness - Analyze partition count, skew ratios, and detect over/under-partitioning
- Comprehensive Analysis - Run all storage checks in one command with actionable recommendations
- Storage Cost Tracking - Calculate monthly OneLake storage costs and optimization opportunities
💡 Interactive Configuration Assistant
- 133 documented configurations - Spark, Delta Lake, Fabric-specific, and Runtime 1.2 configs
- Context-aware guidance - Workload-specific recommendations with impact analysis
- Resource profile support - Understand writeHeavy, readHeavyForSpark, readHeavyForPBI profiles
- Search capabilities - Find configs by keyword or partial name
Priority-Based Recommendations
- Color-coded priorities - Critical (red) → High (yellow) → Medium (blue) → Low (dim)
- Formatted tables - Clean, readable output with impact explanations
- Actionable guidance - Specific commands and configuration values
Quick Start
Installation
pip install sparkwise
Or install the wheel file directly in Fabric:
%pip install sparkwise-0.1.0-py3-none-any.whl
Basic Usage
from sparkwise import diagnose, ask
# Run comprehensive analysis on current session
diagnose.analyze()
# Ask about any configuration
ask.config('spark.native.enabled')
# Search for configurations
ask.search('optimize')
Session Profiling
from sparkwise import (profile, profile_executors, profile_jobs, profile_resources,
predict_scalability, show_timeline, analyze_efficiency)
# Profile complete session
profile()
# Profile executor metrics
profile_executors()
# Analyze job performance
profile_jobs()
# Check resource efficiency
profile_resources()
# Advanced profiling features
predict_scalability() # Compare pool configurations
show_timeline() # Visualize stage execution
analyze_efficiency() # Quantify compute waste
Advanced Analysis
from sparkwise import detect_skew, analyze_query
# Detect data skew
skew_results = detect_skew() # Analyze task-level skew
# Analyze specific DataFrame for partition skew
from sparkwise.core.advanced_skew_detector import AdvancedSkewDetector
detector = AdvancedSkewDetector()
detector.analyze_partition_skew(your_df, ["key_column"])
# Detect skewed joins
detector.detect_skewed_joins(large_df, small_df, "join_key")
# Analyze SQL query plans
query_results = analyze_query(your_df)
# Get Z-Order recommendations
from sparkwise.core.query_plan_analyzer import QueryPlanAnalyzer
analyzer = QueryPlanAnalyzer()
zorder_cols = analyzer.suggest_zorder_columns(delta_df, ["filtered_col"])
# Detect caching opportunities
analyzer.detect_repeated_subqueries(your_df)
Storage Optimization
import sparkwise
# Comprehensive storage analysis
sparkwise.analyze_storage("Tables/mytable")
# Individual analyses
sparkwise.check_small_files("Tables/mytable", threshold_mb=10)
sparkwise.vacuum_roi("Tables/mytable", retention_hours=168)
sparkwise.check_partitions("Tables/mytable")
CLI Usage:
# Comprehensive storage analysis
sparkwise storage analyze Tables/mytable
# Check for small files
sparkwise storage small-files Tables/mytable --threshold 10
# Calculate VACUUM ROI
sparkwise storage vacuum-roi Tables/mytable --retention-hours 168
# Analyze partition effectiveness
sparkwise storage partitions Tables/mytable
Sample Output
Diagnostic Analysis
🔥 sparkwise Analysis 🔥
Native Execution Engine
──────────────────────────────────────────────
⚠️ Warning: Native keywords not found in physical plan
💡 Check for unsupported operators or complex UDFs
⚡ Spark Compute
──────────────────────────────────────────────
✅ Your job uses 1 executors - fits in Starter Pool
💡 Ensure 'Starter Pool' is selected in workspace settings
💾 Storage & Delta Optimizations
──────────────────────────────────────────────
ℹ️ V-Order is DISABLED (optimal for write-heavy workloads)
Benefit: 2x faster writes vs V-Order enabled
💡 Enable only for read-heavy workloads (Power BI/analytics)
Trade-off: 3-10x faster reads, but 15-20% slower writes
ℹ️ Optimize Write is DISABLED (optimal for writeHeavy profile - default)
Benefit: Maximum write throughput for ETL and data ingestion
💡 Enable only for read-heavy or streaming workloads
- readHeavyForSpark: spark.fabric.resourceProfile=readHeavyForSpark
- readHeavyForPBI: spark.fabric.resourceProfile=readHeavyForPBI
Runtime Tuning
──────────────────────────────────────────────
CRITICAL: Adaptive Query Execution (AQE) is DISABLED
💡 Enable immediately: spark.sql.adaptive.enabled=true
Benefits: Dynamic coalescing, skew joins, better parallelism
Summary of Findings
┌─────────────────────┬────────┬─────────────────┬─────────────────┐
│ Category            │ Status │ Critical Issues │ Recommendations │
├─────────────────────┼────────┼─────────────────┼─────────────────┤
│ Native Execution    │ ⚠️     │ 1               │ 1               │
│ Spark Compute       │ ✅     │ 0               │ 1               │
│ Data Skew           │ ✅     │ 0               │ 0               │
│ Delta               │ ✅     │ 0               │ 3               │
│ Runtime             │ ⚠️     │ 1               │ 2               │
└─────────────────────┴────────┴─────────────────┴─────────────────┘
🔧 Configuration Recommendations
Total recommendations: 7
┌──────────┬──────────────────────────────────┬────────────────┬──────────────┐
│ Priority │ Configuration                    │ Action         │ Impact       │
├──────────┼──────────────────────────────────┼────────────────┼──────────────┤
│ CRITICAL │ spark.sql.adaptive.enabled       │ Set to 'true'  │ Enable       │
│          │                                  │                │ dynamic      │
│          │                                  │                │ partition    │
│          │                                  │                │ coalescing   │
├──────────┼──────────────────────────────────┼────────────────┼──────────────┤
│ MEDIUM   │ spark.sql.parquet.vorder.enabled │ Enable for     │ 3-10x faster │
│          │                                  │ read-heavy     │ reads for    │
│          │                                  │ workloads only │ Power BI     │
└──────────┴──────────────────────────────────┴────────────────┴──────────────┘
✨ Analysis complete!
Interactive Q&A
ask.config('spark.fabric.resourceProfile')
Output:
spark.fabric.resourceProfile
────────────────────────────────────────────────────────────
Default: writeHeavy
Scope: session
What it does:
FABRIC CRITICAL: Selects predefined Spark resource profiles optimized
for specific workload patterns. Simplifies configuration tuning.
Recommendations for your workload:
• etl_ingestion: writeHeavy - optimized for ETL and data ingestion
• analytics_spark: readHeavyForSpark - optimized for analytical queries
• power_bi: readHeavyForPBI - optimized for Power BI Direct Lake
• custom_needs: custom - user-defined configuration
Fabric-specific notes:
Microsoft Fabric resource profiles provide workload-optimized settings:
**writeHeavy (DEFAULT):**
- V-Order: DISABLED for faster writes
- Optimize Write: NULL/DISABLED for maximum throughput
- Use Case: ETL pipelines, data ingestion, batch transformations
**readHeavyForSpark:**
- Optimize Write: ENABLED with 128MB bins
- Use Case: Interactive Spark queries, analytical workloads
**readHeavyForPBI:**
- V-Order: ENABLED for Power BI optimization
- Optimize Write: ENABLED with 1GB bins
- Use Case: Power BI dashboards, Direct Lake scenarios
Related configurations:
• spark.sql.parquet.vorder.enabled
• spark.databricks.delta.optimizeWrite.enabled
• spark.microsoft.delta.optimizeWrite.enabled
Examples:
spark.conf.set('spark.fabric.resourceProfile', 'readHeavyForSpark')
spark.conf.set('spark.fabric.resourceProfile', 'writeHeavy')
────────────────────────────────────────────────────────────
Scalability Prediction
from sparkwise import predict_scalability
# Run after executing your workload
predict_scalability(runs_per_month=100)
Output:
════════════════════════════════════════════════════════════
SCALABILITY ANALYSIS
════════════════════════════════════════════════════════════
Workload Profile
────────────────────────────────────────────────────────────
Current Runtime: 45.2 seconds
Monthly Runs: 100
Total Monthly Runtime: 75.3 minutes
🎯 Starter Pool (Current Configuration)
────────────────────────────────────────────────────────────
Configuration: 2 vCores, 8GB memory
VCore-Hours/Month: 2.51 hours
Estimated Cost: $2.76/month
Startup Overhead: ~5-10 seconds
Status: ✅ OPTIMAL - Workload fits in Starter Pool
⚡ Custom Pool Comparison
────────────────────────────────────────────────────────────
Configuration: 8 vCores, 32GB memory
VCore-Hours/Month: 10.04 hours
Estimated Cost: $11.04/month
Startup Overhead: 3-5 minutes
Performance Gain: ~2-3x faster execution
💡 Recommendation: STAY ON STARTER POOL
• Your workload is well-suited for Starter Pool
• Custom Pool would cost 4x more with cold-start delays
• Consider Custom Pool only if runs exceed 500/month
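The VCore-hour figures in the sample output above can be reproduced with simple arithmetic. The sketch below is an assumption about how such numbers could be derived (vCores × runtime × runs, converted to hours), not a description of sparkwise internals; `monthly_vcore_hours` is a hypothetical helper. It also assumes the runtime stays the same on the larger pool, which is what the sample figures imply.

```python
def monthly_vcore_hours(vcores, runtime_s, runs_per_month):
    """VCore-hours consumed per month for a job of fixed runtime (hypothetical helper)."""
    return vcores * runtime_s * runs_per_month / 3600

# Inputs taken from the sample output: 45.2 s per run, 100 runs per month
starter = monthly_vcore_hours(vcores=2, runtime_s=45.2, runs_per_month=100)
custom = monthly_vcore_hours(vcores=8, runtime_s=45.2, runs_per_month=100)

print(f"Starter Pool: {starter:.2f} VCore-hours/month")
print(f"Custom Pool:  {custom:.2f} VCore-hours/month")
print(f"Ratio:        {custom / starter:.1f}x")
```

With these inputs the 4x difference comes entirely from the vCore count, which matches the "would cost 4x more" line in the recommendation.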
Efficiency Analysis
from sparkwise import analyze_efficiency
# Run after your Spark job completes
analyze_efficiency(runs_per_month=100)
Output:
════════════════════════════════════════════════════════════
⚡ JOB EFFICIENCY ANALYSIS
════════════════════════════════════════════════════════════
Execution Metrics
────────────────────────────────────────────────────────────
Total Runtime: 45.2 seconds
Active Compute: 38.6 seconds (85.4%)
Wasted Compute: 6.6 seconds (14.6%)
VCore-Hours Used: 0.025 hours
VCore-Hours Wasted: 0.004 hours
💰 Cost Impact (100 runs/month)
────────────────────────────────────────────────────────────
Monthly Compute: 2.51 VCore-hours
Monthly Waste: 0.37 VCore-hours (14.6%)
Wasted Cost: $0.41/month
🎯 Efficiency Score: 85.4% (GOOD)
✨ Top Optimization Opportunities
────────────────────────────────────────────────────────────
1. Enable AQE for dynamic partition coalescing
Impact: Reduce shuffle overhead by 20-30%
2. Optimize shuffle partitions
Current: 200 partitions
Recommended: 50 partitions (based on data size)
Impact: Reduce task overhead, improve parallelism
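As a rough check on the efficiency figures above, the score is consistent with active compute time divided by total runtime. The sketch below assumes that definition and a 2-vCore pool (taken from the Starter Pool sample earlier in this README); `efficiency` is a hypothetical helper, not part of the sparkwise API.

```python
def efficiency(total_s, active_s):
    """Return (efficiency fraction, wasted seconds) for one run (hypothetical helper)."""
    wasted_s = total_s - active_s
    return active_s / total_s, wasted_s

score, wasted_s = efficiency(total_s=45.2, active_s=38.6)

# Assumption: 2 vCores and 100 runs per month, as in the sample output
wasted_vcore_hours_month = 2 * wasted_s * 100 / 3600

print(f"Efficiency score: {score * 100:.1f}%")
print(f"Wasted per month: {wasted_vcore_hours_month:.2f} VCore-hours")
```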
Storage Optimization - Small Files
import sparkwise
sparkwise.check_small_files("Tables/green_tripdata_2017", threshold_mb=10)
Output:
════════════════════════════════════════════════════════════
SMALL FILE ANALYSIS: green_tripdata_2017
════════════════════════════════════════════════════════════
File Statistics
────────────────────────────────────────────────────────────
┌────────────────────────┬──────────────────┐
│ Metric                 │ Value            │
├────────────────────────┼──────────────────┤
│ Total Files            │ 1,247            │
│ Total Size             │ 15.3 GB          │
│ Average File Size      │ 12.6 MB          │
│ Smallest File          │ 1.2 MB           │
│ Largest File           │ 128.4 MB         │
└────────────────────────┴──────────────────┘
🔴 CRITICAL: Small File Problem Detected
────────────────────────────────────────────────────────────
Estimated Small Files (<10MB): 498 files (39.9%)
Performance Impact:
• 40% of files are too small
• Excessive metadata operations
• Poor query performance
• Increased storage costs
💡 Recommendations
────────────────────────────────────────────────────────────
1. Run OPTIMIZE to compact small files:
spark.sql("OPTIMIZE delta.`Tables/green_tripdata_2017`")
2. Enable Auto-Optimize for future writes:
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")
3. Consider repartitioning on write:
df.repartition(50).write.format("delta").save("Tables/green_tripdata_2017")
Expected Improvements:
• Reduce file count by 60-80%
• 3-5x faster query performance
• 20-30% reduction in metadata overhead
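The small-file percentage in the report above is a count of files under the threshold divided by the total file count. Here is a minimal sketch of that calculation over a plain list of file sizes; `small_file_report` and the sample sizes are hypothetical, and how sparkwise actually obtains file sizes from a Delta table is not shown here.

```python
def small_file_report(sizes_mb, threshold_mb=10):
    """Summarise how many files fall below the size threshold (hypothetical helper)."""
    small = [s for s in sizes_mb if s < threshold_mb]
    total = len(sizes_mb)
    pct = 100 * len(small) / total if total else 0.0
    return {"total": total, "small": len(small), "small_pct": pct}

# Made-up file sizes in MB; a file of exactly 10.0 MB is not counted as small
sizes = [1.2, 3.5, 8.0, 12.6, 64.0, 128.4, 9.9, 10.0, 45.0, 2.2]
report = small_file_report(sizes, threshold_mb=10)
print(report)
```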
Storage Optimization - VACUUM ROI
import sparkwise
sparkwise.vacuum_roi("Tables/green_tripdata_2017", retention_hours=168)
Output:
════════════════════════════════════════════════════════════
💰 VACUUM ROI ANALYSIS: green_tripdata_2017
════════════════════════════════════════════════════════════
Current Storage State
────────────────────────────────────────────────────────────
┌────────────────────────┬──────────────────┐
│ Metric                 │ Value            │
├────────────────────────┼──────────────────┤
│ Current Size           │ 15.3 GB          │
│ Retention Period       │ 168 hours (7d)   │
│ Removable Operations   │ 23 operations    │
│ Last VACUUM            │ 45 days ago      │
└────────────────────────┴──────────────────┘
💾 Storage Savings Estimate
────────────────────────────────────────────────────────────
Reclaimable Space: 4.59 GB (30.0%)
OneLake Storage Cost:
Current: $0.35/month ($0.023/GB)
After VACUUM: $0.25/month
Monthly Savings: $0.11/month
⚡ VACUUM Cost
────────────────────────────────────────────────────────────
Estimated Compute: $1.50
Break-even Period: 13.6 months
✅ RECOMMENDATION: RUN VACUUM
────────────────────────────────────────────────────────────
Although break-even is 14 months, VACUUM provides benefits:
• Improved query performance (fewer files to scan)
• Reduced metadata overhead
• Better data governance
• Simplified time travel queries
Command:
spark.sql("VACUUM delta.`Tables/green_tripdata_2017` RETAIN 168 HOURS")
Best Practice:
• Run VACUUM quarterly for large tables
• Run VACUUM monthly for frequently updated tables
• Adjust retention based on time travel needs
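The ROI numbers above follow from the quoted OneLake rate of $0.023/GB/month. The sketch below is one plausible way to derive them; `vacuum_roi_estimate` is a hypothetical helper (distinct from `sparkwise.vacuum_roi`), and the 30% reclaimable share and $1.50 compute cost are simply the values shown in the sample. Note that dividing by the unrounded savings gives about 14.2 months, while the 13.6 figure in the sample corresponds to dividing by the rounded $0.11.

```python
ONELAKE_USD_PER_GB_MONTH = 0.023  # rate quoted in this README

def vacuum_roi_estimate(table_gb, reclaimable_fraction, compute_cost_usd):
    """Return (reclaimable GB, monthly savings USD, break-even months). Hypothetical helper."""
    reclaimable_gb = table_gb * reclaimable_fraction
    monthly_savings = reclaimable_gb * ONELAKE_USD_PER_GB_MONTH
    if monthly_savings == 0:
        return reclaimable_gb, 0.0, float("inf")
    return reclaimable_gb, monthly_savings, compute_cost_usd / monthly_savings

reclaimable_gb, monthly_savings, break_even = vacuum_roi_estimate(
    table_gb=15.3, reclaimable_fraction=0.30, compute_cost_usd=1.50
)
print(f"Reclaimable: {reclaimable_gb:.2f} GB")
print(f"Savings:     ${monthly_savings:.2f}/month")
print(f"Break-even:  {break_even:.1f} months")
```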
Storage Optimization - Partition Analysis
import sparkwise
sparkwise.check_partitions("Tables/green_tripdata_2017")
Output:
════════════════════════════════════════════════════════════
PARTITION EFFECTIVENESS: green_tripdata_2017
════════════════════════════════════════════════════════════
Partition Statistics
────────────────────────────────────────────────────────────
┌────────────────────────┬──────────────────┐
│ Metric                 │ Value            │
├────────────────────────┼──────────────────┤
│ Partition Columns      │ year, month      │
│ Total Partitions       │ 12               │
│ Partitions Scanned     │ 12 (100%)        │
│ Average Rows/Partition │ 850,423          │
│ Max Rows (Jan)         │ 1,104,518        │
│ Min Rows (Nov)         │ 612,847          │
│ Skew Ratio             │ 1.8x             │
└────────────────────────┴──────────────────┘
🟢 GOOD: Well-Balanced Partitions
────────────────────────────────────────────────────────────
• Partition count is optimal (10-100 range)
• Skew ratio is acceptable (<3x)
• Each partition has sufficient data
💡 Optimization Opportunities
────────────────────────────────────────────────────────────
1. Enable Z-Order for frequently filtered columns:
spark.sql("OPTIMIZE delta.`Tables/green_tripdata_2017` ZORDER BY (vendor_id, payment_type)")
Benefits:
• 2-5x faster queries on vendor_id, payment_type
• No partition overhead
• Maintains good compression
2. Consider liquid clustering for high-cardinality columns:
ALTER TABLE green_tripdata_2017
CLUSTER BY (vendor_id, payment_type, pickup_location)
Benefits:
• Automatic optimization on writes
• Better for evolving query patterns
• Handles high-cardinality columns
🎯 Partition Health: ✅ OPTIMAL
Your partitioning strategy is working well!
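The 1.8x skew ratio in the table above matches the largest partition's row count divided by the smallest. That definition is inferred from the sample numbers rather than confirmed by the sparkwise source, so treat the sketch below as an illustration; `skew_ratio` is a hypothetical helper and only the two extreme partitions from the sample are included.

```python
def skew_ratio(rows_per_partition):
    """Largest partition row count divided by the smallest (hypothetical helper)."""
    counts = list(rows_per_partition.values())
    if not counts or min(counts) == 0:
        return float("inf")
    return max(counts) / min(counts)

# Max and min partitions from the sample output; the other months are omitted
rows = {"Jan": 1_104_518, "Nov": 612_847}
ratio = skew_ratio(rows)
print(f"Skew ratio: {ratio:.1f}x")
```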
Comprehensive Storage Analysis
import sparkwise
sparkwise.analyze_storage("Tables/green_tripdata_2017")
Output:
════════════════════════════════════════════════════════════
COMPREHENSIVE STORAGE ANALYSIS: green_tripdata_2017
════════════════════════════════════════════════════════════
[Shows combined output of all three analyses above:]
1. Small File Detection (with recommendations)
2. VACUUM ROI Calculation (with cost analysis)
3. Partition Effectiveness (with optimization suggestions)
════════════════════════════════════════════════════════════
PRIORITY ACTION ITEMS
════════════════════════════════════════════════════════════
🔴 CRITICAL (Do Now):
• Run OPTIMIZE to compact 498 small files
• Enable Auto-Optimize for future writes
🟡 HIGH (This Week):
• Add Z-Order on vendor_id, payment_type
• Run VACUUM to reclaim 4.59 GB
🟢 MEDIUM (This Month):
• Review partition strategy quarterly
• Monitor file growth patterns
• Set up automated OPTIMIZE jobs
💰 Total Potential Savings:
• Storage: $0.11/month (after VACUUM)
• Compute: 20-30% reduction (after OPTIMIZE)
• Query Performance: 3-5x faster
📦 What's Included
Core Modules
- diagnose - Main diagnostic engine with 5 check categories
- ask - Interactive configuration Q&A system
- profile - Session profiling
- profile_executors - Executor-level metrics
- profile_jobs - Job/stage/task analysis
- profile_resources - Resource efficiency scoring
- predict_scalability - Compare Starter vs Custom Pool configurations
- analyze_efficiency - Quantify wasted compute with VCore-hour metrics
- show_timeline - Visualize stage execution patterns
- detect_skew - Advanced skew detection with mitigation strategies
- analyze_query - SQL query plan analysis with anti-pattern detection
- analyze_storage - Comprehensive storage optimization (v1.4.0)
- check_small_files - Small file detection with thresholds (v1.4.0)
- vacuum_roi - VACUUM ROI calculator with OneLake pricing (v1.4.0)
- check_partitions - Partition effectiveness analysis (v1.4.0)
Knowledge Base (133 Configurations)
- 33 Spark configs - Core settings for shuffle, memory, AQE, serialization
- 45 Delta configs - Delta Lake optimizations, V-Order, Deletion Vectors
- 10 Fabric configs - Native Engine, resource profiles, OneLake storage
- 45 Runtime 1.2 configs - Latest Fabric Runtime 1.2 features
Latest Features
- ✅ Storage optimization suite - Small files, VACUUM ROI, partition analysis (v1.4.0)
- ✅ OneLake cost tracking - Real pricing ($0.023/GB/month) for storage decisions
- ✅ Advanced skew detection - Task duration, partition-level, and join analysis
- ✅ SQL query plan analyzer - Anti-patterns, Native Engine checks, Z-Order suggestions
- ✅ Real metrics profiling - VCore-hour calculations, efficiency scoring
- ✅ Scalability prediction - Starter vs Custom Pool cost comparison
- ✅ Fabric resource profiles (writeHeavy, readHeavyForSpark, readHeavyForPBI)
- ✅ Advanced Delta optimizations (Fast Optimize, Adaptive File Size, File Level Target)
- ✅ Driver Mode Snapshot for faster metadata operations
- ✅ Priority-based recommendation tables
- ✅ Color-coded terminal output with Rich library
🎯 Use Cases
Data Engineers
- Optimize ETL pipelines - Detect bottlenecks, tune parallelism, reduce costs
- Validate configurations - Ensure proper resource profiles and pool usage
- Debug job failures - Understand errors with plain English explanations
- Manage storage costs - Track OneLake usage, optimize file layouts, VACUUM ROI
- Monitor table health - Detect small files, partition skew, storage bloat
Data Scientists
- Improve notebook performance - Enable Native Engine, optimize memory usage
- Understand Spark behavior - Learn configurations through interactive Q&A
- Profile experiments - Track resource usage and efficiency
- Optimize data access - Identify caching opportunities, partition pruning
Platform Admins
- Standardize best practices - Share optimal configurations across teams
- Monitor capacity usage - Identify jobs forcing Custom Pool usage
- Cost optimization - Detect over-provisioned or misconfigured workloads
- Storage governance - Track OneLake costs, enforce OPTIMIZE/VACUUM policies
- Performance tracking - Monitor VCore-hour usage, identify waste
Examples
Check out the examples directory:
- basic_analysis.py - Basic diagnostic workflow
- config_qa_demo.py - Configuration Q&A usage
- profiling_demo.py - Comprehensive profiling examples
- scalability_demo.py - Scalability prediction and efficiency analysis
- skew_detection_demo.py - Advanced skew detection
- query_analysis_demo.py - SQL query plan analysis
- storage_optimization_demo.py - Storage optimization (v1.4.0)
- knowledge_base_demo.py - Knowledge base exploration
- immutable_configs_demo.py - Starter Pool optimization
🧪 Running Tests
# Install test dependencies
pip install pytest pytest-cov
# Run all tests
pytest
# Run with coverage
pytest --cov=sparkwise --cov-report=html
# Run specific test file
pytest tests/test_advisor.py
Contributing
Contributions are welcome! Please read our Contributing Guide for details.
Development Setup
# Clone the repository
git clone https://github.com/santhoshravindran7/sparkwise.git
cd sparkwise
# Create virtual environment
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install in development mode
pip install -e ".[dev]"
# Run tests
pytest
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
Built with ❤️ for the Microsoft Fabric Data Engineering and Data Science community.
💬 Contact & Support
- Author: Santhosh Ravindran
- GitHub: @santhoshravindran7
- Feedback: Share your feedback, report bugs, or request features
What's New in v1.4.0
💾 Storage Optimization Suite
- ✅ Small file detection - Identify tables with excessive files <10MB (configurable threshold)
- ✅ VACUUM ROI calculator - Estimate storage savings vs compute cost with OneLake pricing ($0.023/GB/month)
- ✅ Partition effectiveness - Analyze partition count, skew ratios, detect over/under-partitioning
- ✅ Comprehensive analysis - Run all storage checks with one command
- ✅ CLI integration - sparkwise storage analyze|small-files|vacuum-roi|partitions
- ✅ Actionable recommendations - Get SQL commands for OPTIMIZE, VACUUM, Z-Order, partitioning
Use Cases
- Cost optimization - Track OneLake storage costs, identify VACUUM opportunities
- Performance tuning - Detect small file problems impacting query speed
- Data governance - Monitor table health, enforce optimization policies
- Capacity planning - Understand storage growth patterns, predict costs
Example
import sparkwise
# Run comprehensive storage analysis
sparkwise.analyze_storage("Tables/mytable")
# Get small file recommendations
sparkwise.check_small_files("Tables/mytable", threshold_mb=10)
# Calculate VACUUM ROI
sparkwise.vacuum_roi("Tables/mytable", retention_hours=168)
# Analyze partition effectiveness
sparkwise.check_partitions("Tables/mytable")
Previous Releases:
v0.1.0 - Initial Release
- ✨ Complete profiling suite (session, executor, job, resource profilers)
- 🎨 Rich terminal output with color-coded priorities
- Priority-based recommendation tables
- 🔧 Fabric resource profile support (writeHeavy, readHeavy profiles)
- ⚡ 4 new advanced Delta optimizations
- 133 documented configurations (up from 100)
- 🎯 Context-aware Optimize Write recommendations
- CLI support for all profiling operations
Make Spark tuning fun again! ✨