Skip to main content

Automated Data Engineering specialist for Fabric Spark workloads - intelligent configuration analysis and optimization recommendations

Project description

๐Ÿ”ฅ Sparkwise

Achieve optimal Fabric Spark price-performance with automated insights - simplifies tuning, makes optimization fun

Python Version License: MIT

sparkwise is an automated Data Engineering specialist for Apache Spark on Microsoft Fabric. It provides intelligent diagnostics, configuration recommendations, and comprehensive session profiling to help you achieve the best price-performance for your workloads - all while making Spark tuning simple and enjoyable.

๐ŸŽฏ Why sparkwise?

Spark tuning on Microsoft Fabric doesn't have to be complex or expensive. sparkwise helps you:

  • ๐Ÿ’ฐ Optimize costs - Detect configurations that waste capacity and increase runtime
  • โšก Maximize performance - Enable Fabric-specific optimizations (Native Engine, V-Order, resource profiles)
  • ๐ŸŽ“ Simplify learning - Interactive Q&A for 133 Spark/Delta/Fabric configurations
  • ๐Ÿ” Understand workloads - Comprehensive profiling of sessions, executors, jobs, and resources
  • โฑ๏ธ Save time - Avoid 3-5min cold-starts by detecting Starter Pool blockers
  • ๐Ÿ“Š Make data-driven decisions - Priority-ranked recommendations with impact analysis

โœจ Key Features

๐Ÿ”ฌ Automated Diagnostics

  • Native Execution Engine - Verifies Velox usage, detects fallbacks to row-based processing
  • Spark Compute - Analyzes Starter vs Custom Pool usage, warns about immutable configs
  • Data Skew Detection - Identifies imbalanced task distributions
  • Delta Optimizations - Checks V-Order, Deletion Vectors, Optimize Write, Auto Compaction
  • Runtime Tuning - Validates AQE, partition sizing, scheduler mode

๐Ÿ“Š Comprehensive Profiling

  • Session Profiling - Application metadata, resource allocation, memory breakdown
  • Executor Profiling - Executor status, memory utilization, task distribution
  • Job Profiling - Job/stage/task metrics, bottleneck detection
  • Resource Profiling - Efficiency scoring, utilization analysis, optimization recommendations

๐Ÿš€ Advanced Performance Analysis (NEW!)

  • Real Metrics Collection - Uses actual Spark stage/task data instead of estimates
  • Scalability Prediction - Compare Starter vs Custom Pool with real VCore-hour calculations
  • Stage Timeline - Visualize execution patterns with parallel/sequential analysis
  • Efficiency Analysis - Quantify wasted compute in VCore-hours with actionable recommendations

๐Ÿ” Advanced Skew Detection (NEW!)

  • Task Duration Analysis - Detect stragglers and long-running tasks with variance detection
  • Partition-Level Analysis - Identify data distribution imbalances with statistical metrics
  • Skewed Join Detection - Analyze join patterns and recommend broadcast vs salting strategies
  • Automatic Mitigation - Get code examples for salting, AQE, and broadcast optimizations

๐ŸŽฏ SQL Query Plan Analysis (NEW!)

  • Anti-Pattern Detection - Identify cartesian products, full scans, and excessive shuffles
  • Native Engine Compatibility - Check if queries use Fabric Native Engine (3-8x faster)
  • Z-Order Recommendations - Suggest best columns for Delta optimization based on cardinality
  • Caching Opportunities - Detect repeated table scans that benefit from caching
  • Fabric Best Practices - V-Order, broadcast joins, AQE, and partition recommendations

๏ฟฝ Storage Optimization (NEW in v1.4.0!)

  • Small File Detection - Identify Delta tables with excessive small files (<10MB configurable threshold)
  • VACUUM ROI Calculator - Estimate storage savings vs compute cost using OneLake pricing ($0.023/GB/month)
  • Partition Effectiveness - Analyze partition count, skew ratios, and detect over/under-partitioning
  • Comprehensive Analysis - Run all storage checks in one command with actionable recommendations
  • Storage Cost Tracking - Calculate monthly OneLake storage costs and optimization opportunities

๏ฟฝ๐Ÿ’ก Interactive Configuration Assistant

  • 133 documented configurations - Spark, Delta Lake, Fabric-specific, and Runtime 1.2 configs
  • Context-aware guidance - Workload-specific recommendations with impact analysis
  • Resource profile support - Understand writeHeavy, readHeavyForSpark, readHeavyForPBI profiles
  • Search capabilities - Find configs by keyword or partial name

๐Ÿ“ˆ Priority-Based Recommendations

  • Color-coded priorities - Critical (red) โ†’ High (yellow) โ†’ Medium (blue) โ†’ Low (dim)
  • Formatted tables - Clean, readable output with impact explanations
  • Actionable guidance - Specific commands and configuration values

๐Ÿš€ Quick Start

Installation

pip install sparkwise

Or install the wheel file directly in Fabric:

%pip install sparkwise-0.1.0-py3-none-any.whl

Basic Usage

from sparkwise import diagnose, ask

# Run comprehensive analysis on current session
diagnose.analyze()

# Ask about any configuration
ask.config('spark.native.enabled')

# Search for configurations
ask.search('optimize')

Session Profiling

from sparkwise import (profile, profile_executors, profile_jobs, profile_resources,
                       predict_scalability, show_timeline, analyze_efficiency)

# Profile complete session
profile()

# Profile executor metrics
profile_executors()

# Analyze job performance
profile_jobs()

# Check resource efficiency
profile_resources()

# Advanced profiling features
predict_scalability()  # Compare pool configurations
show_timeline()        # Visualize stage execution
analyze_efficiency()   # Quantify compute waste

Advanced Analysis

from sparkwise import detect_skew, analyze_query

# Detect data skew
skew_results = detect_skew()  # Analyze task-level skew

# Analyze specific DataFrame for partition skew
from sparkwise.core.advanced_skew_detector import AdvancedSkewDetector
detector = AdvancedSkewDetector()
detector.analyze_partition_skew(your_df, ["key_column"])

# Detect skewed joins
detector.detect_skewed_joins(large_df, small_df, "join_key")

# Analyze SQL query plans
query_results = analyze_query(your_df)

# Get Z-Order recommendations
from sparkwise.core.query_plan_analyzer import QueryPlanAnalyzer
analyzer = QueryPlanAnalyzer()
zorder_cols = analyzer.suggest_zorder_columns(delta_df, ["filtered_col"])

# Detect caching opportunities
analyzer.detect_repeated_subqueries(your_df)

Storage Optimization

import sparkwise

# Comprehensive storage analysis
sparkwise.analyze_storage("Tables/mytable")

# Individual analyses
sparkwise.check_small_files("Tables/mytable", threshold_mb=10)
sparkwise.vacuum_roi("Tables/mytable", retention_hours=168)
sparkwise.check_partitions("Tables/mytable")

CLI Usage:

# Comprehensive storage analysis
sparkwise storage analyze Tables/mytable

# Check for small files
sparkwise storage small-files Tables/mytable --threshold 10

# Calculate VACUUM ROI
sparkwise storage vacuum-roi Tables/mytable --retention-hours 168

# Analyze partition effectiveness
sparkwise storage partitions Tables/mytable

๐Ÿ“Š Sample Output

Diagnostic Analysis

๐Ÿ”ฅ sparkwise Analysis ๐Ÿ”ฅ

๐Ÿ”Ž Native Execution Engine
โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
โš ๏ธ Warning: Native keywords not found in physical plan
   ๐Ÿ’ก Check for unsupported operators or complex UDFs

โšก Spark Compute
โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
โœ… Your job uses 1 executors - fits in Starter Pool
   ๐Ÿ’ก Ensure 'Starter Pool' is selected in workspace settings

๐Ÿ’พ Storage & Delta Optimizations
โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
โ„น๏ธ V-Order is DISABLED (optimal for write-heavy workloads)
   Benefit: 2x faster writes vs V-Order enabled
   ๐Ÿ’ก Enable only for read-heavy workloads (Power BI/analytics)
      Trade-off: 3-10x faster reads, but 15-20% slower writes

โ„น๏ธ Optimize Write is DISABLED (optimal for writeHeavy profile - default)
   Benefit: Maximum write throughput for ETL and data ingestion
   ๐Ÿ’ก Enable only for read-heavy or streaming workloads
      - readHeavyForSpark: spark.fabric.resourceProfile=readHeavyForSpark
      - readHeavyForPBI: spark.fabric.resourceProfile=readHeavyForPBI

โš™๏ธ Runtime Tuning
โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
โ›” CRITICAL: Adaptive Query Execution (AQE) is DISABLED
   ๐Ÿ’ก Enable immediately: spark.sql.adaptive.enabled=true
      Benefits: Dynamic coalescing, skew joins, better parallelism

๐Ÿ“‹ Summary of Findings
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ Category            โ”‚ Status โ”‚ Critical Issues โ”‚ Recommendations โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ Native Execution    โ”‚ โš ๏ธ     โ”‚ 1               โ”‚ 1               โ”‚
โ”‚ Spark Compute       โ”‚ โœ…     โ”‚ 0               โ”‚ 1               โ”‚
โ”‚ Data Skew           โ”‚ โœ…     โ”‚ 0               โ”‚ 0               โ”‚
โ”‚ Delta               โ”‚ โœ…     โ”‚ 0               โ”‚ 3               โ”‚
โ”‚ Runtime             โ”‚ โš ๏ธ     โ”‚ 1               โ”‚ 2               โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

๐Ÿ”ง Configuration Recommendations
Total recommendations: 7

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ Priority โ”‚ Configuration                   โ”‚ Action         โ”‚ Impact       โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ CRITICAL โ”‚ spark.sql.adaptive.enabled      โ”‚ Set to 'true'  โ”‚ Enable       โ”‚
โ”‚          โ”‚                                 โ”‚                โ”‚ dynamic      โ”‚
โ”‚          โ”‚                                 โ”‚                โ”‚ partition    โ”‚
โ”‚          โ”‚                                 โ”‚                โ”‚ coalescing   โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ MEDIUM   โ”‚ spark.sql.parquet.vorder.enabledโ”‚ Enable for     โ”‚ 3-10x faster โ”‚
โ”‚          โ”‚                                 โ”‚ read-heavy     โ”‚ reads for    โ”‚
โ”‚          โ”‚                                 โ”‚ workloads only โ”‚ Power BI     โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

โœจ Analysis complete!

Interactive Q&A

ask.config('spark.fabric.resourceProfile')

Output:

๐Ÿ“š spark.fabric.resourceProfile

โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€

Default: writeHeavy
Scope: session

What it does:
FABRIC CRITICAL: Selects predefined Spark resource profiles optimized 
for specific workload patterns. Simplifies configuration tuning.

Recommendations for your workload:
  โ€ข etl_ingestion: writeHeavy - optimized for ETL and data ingestion
  โ€ข analytics_spark: readHeavyForSpark - optimized for analytical queries
  โ€ข power_bi: readHeavyForPBI - optimized for Power BI Direct Lake
  โ€ข custom_needs: custom - user-defined configuration

Fabric-specific notes:
Microsoft Fabric resource profiles provide workload-optimized settings:

**writeHeavy (DEFAULT):**
- V-Order: DISABLED for faster writes
- Optimize Write: NULL/DISABLED for maximum throughput
- Use Case: ETL pipelines, data ingestion, batch transformations

**readHeavyForSpark:**
- Optimize Write: ENABLED with 128MB bins
- Use Case: Interactive Spark queries, analytical workloads

**readHeavyForPBI:**
- V-Order: ENABLED for Power BI optimization
- Optimize Write: ENABLED with 1GB bins
- Use Case: Power BI dashboards, Direct Lake scenarios

Related configurations:
  โ€ข spark.sql.parquet.vorder.enabled
  โ€ข spark.databricks.delta.optimizeWrite.enabled
  โ€ข spark.microsoft.delta.optimizeWrite.enabled

Examples:
  spark.conf.set('spark.fabric.resourceProfile', 'readHeavyForSpark')
  spark.conf.set('spark.fabric.resourceProfile', 'writeHeavy')

โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€

Scalability Prediction

from sparkwise import predict_scalability

# Run after executing your workload
predict_scalability(runs_per_month=100)

Output:

โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•
๐Ÿ“Š SCALABILITY ANALYSIS
โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•

๐Ÿ“ˆ Workload Profile
โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
  Current Runtime: 45.2 seconds
  Monthly Runs: 100
  Total Monthly Runtime: 75.3 minutes

๐ŸŽฏ Starter Pool (Current Configuration)
โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
  Configuration: 2 vCores, 8GB memory
  VCore-Hours/Month: 2.51 hours
  Estimated Cost: $2.76/month
  Startup Overhead: ~5-10 seconds
  Status: โœ… OPTIMAL - Workload fits in Starter Pool

โšก Custom Pool Comparison
โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
  Configuration: 8 vCores, 32GB memory
  VCore-Hours/Month: 10.04 hours
  Estimated Cost: $11.04/month
  Startup Overhead: 3-5 minutes
  Performance Gain: ~2-3x faster execution

๐Ÿ’ก Recommendation: STAY ON STARTER POOL
  โ€ข Your workload is well-suited for Starter Pool
  โ€ข Custom Pool would cost 4x more with cold-start delays
  โ€ข Consider Custom Pool only if runs exceed 500/month

Efficiency Analysis

from sparkwise import analyze_efficiency

# Run after your Spark job completes
analyze_efficiency(runs_per_month=100)

Output:

โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•
โšก JOB EFFICIENCY ANALYSIS
โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•

๐Ÿ“Š Execution Metrics
โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
  Total Runtime: 45.2 seconds
  Active Compute: 38.6 seconds (85.4%)
  Wasted Compute: 6.6 seconds (14.6%)
  
  VCore-Hours Used: 0.025 hours
  VCore-Hours Wasted: 0.004 hours

๐Ÿ’ฐ Cost Impact (100 runs/month)
โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
  Monthly Compute: 2.51 VCore-hours
  Monthly Waste: 0.37 VCore-hours (14.6%)
  Wasted Cost: $0.41/month

๐ŸŽฏ Efficiency Score: 85.4% (GOOD)

โœจ Top Optimization Opportunities
โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
  1. Enable AQE for dynamic partition coalescing
     Impact: Reduce shuffle overhead by 20-30%
  
  2. Optimize shuffle partitions
     Current: 200 partitions
     Recommended: 50 partitions (based on data size)
     Impact: Reduce task overhead, improve parallelism

Storage Optimization - Small Files

import sparkwise
sparkwise.check_small_files("Tables/green_tripdata_2017", threshold_mb=10)

Output:

โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•
๐Ÿ“ SMALL FILE ANALYSIS: green_tripdata_2017
โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•

๐Ÿ“Š File Statistics
โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ Metric                 โ”‚ Value            โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ Total Files            โ”‚ 1,247            โ”‚
โ”‚ Total Size             โ”‚ 15.3 GB          โ”‚
โ”‚ Average File Size      โ”‚ 12.6 MB          โ”‚
โ”‚ Smallest File          โ”‚ 1.2 MB           โ”‚
โ”‚ Largest File           โ”‚ 128.4 MB         โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

๐Ÿ”ด CRITICAL: Small File Problem Detected
โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
  Estimated Small Files (<10MB): 498 files (39.9%)
  
  Performance Impact:
    โ€ข 40% of files are too small
    โ€ข Excessive metadata operations
    โ€ข Poor query performance
    โ€ข Increased storage costs

๐Ÿ’ก Recommendations
โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
  1. Run OPTIMIZE to compact small files:
     spark.sql("OPTIMIZE delta.`Tables/green_tripdata_2017`")
  
  2. Enable Auto-Optimize for future writes:
     spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
     spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")
  
  3. Consider repartitioning on write:
     df.repartition(50).write.format("delta").save("Tables/green_tripdata_2017")
  
  Expected Improvements:
    โ€ข Reduce file count by 60-80%
    โ€ข 3-5x faster query performance
    โ€ข 20-30% reduction in metadata overhead

Storage Optimization - VACUUM ROI

import sparkwise
sparkwise.vacuum_roi("Tables/green_tripdata_2017", retention_hours=168)

Output:

โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•
๐Ÿ’ฐ VACUUM ROI ANALYSIS: green_tripdata_2017
โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•

๐Ÿ“Š Current Storage State
โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ Metric                 โ”‚ Value            โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ Current Size           โ”‚ 15.3 GB          โ”‚
โ”‚ Retention Period       โ”‚ 168 hours (7d)   โ”‚
โ”‚ Removable Operations   โ”‚ 23 operations    โ”‚
โ”‚ Last VACUUM            โ”‚ 45 days ago      โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

๐Ÿ’พ Storage Savings Estimate
โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
  Reclaimable Space: 4.59 GB (30.0%)
  
  OneLake Storage Cost:
    Current: $0.35/month ($0.023/GB)
    After VACUUM: $0.25/month
    Monthly Savings: $0.11/month

โšก VACUUM Cost
โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
  Estimated Compute: $1.50
  Break-even Period: 13.6 months

โœ… RECOMMENDATION: RUN VACUUM
โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
  Although break-even is 14 months, VACUUM provides benefits:
    โ€ข Improved query performance (fewer files to scan)
    โ€ข Reduced metadata overhead
    โ€ข Better data governance
    โ€ข Simplified time travel queries

  Command:
    spark.sql("VACUUM delta.`Tables/green_tripdata_2017` RETAIN 168 HOURS")
  
  Best Practice:
    โ€ข Run VACUUM quarterly for large tables
    โ€ข Run VACUUM monthly for frequently updated tables
    โ€ข Adjust retention based on time travel needs

Storage Optimization - Partition Analysis

import sparkwise
sparkwise.check_partitions("Tables/green_tripdata_2017")

Output:

โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•
๐Ÿ—‚๏ธ PARTITION EFFECTIVENESS: green_tripdata_2017
โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•

๐Ÿ“Š Partition Statistics
โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ Metric                 โ”‚ Value            โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ Partition Columns      โ”‚ year, month      โ”‚
โ”‚ Total Partitions       โ”‚ 12               โ”‚
โ”‚ Partitions Scanned     โ”‚ 12 (100%)        โ”‚
โ”‚ Average Rows/Partition โ”‚ 850,423          โ”‚
โ”‚ Max Rows (Jan)         โ”‚ 1,104,518        โ”‚
โ”‚ Min Rows (Nov)         โ”‚ 612,847          โ”‚
โ”‚ Skew Ratio             โ”‚ 1.8x             โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

๐ŸŸข GOOD: Well-Balanced Partitions
โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
  โ€ข Partition count is optimal (10-100 range)
  โ€ข Skew ratio is acceptable (<3x)
  โ€ข Each partition has sufficient data

๐Ÿ’ก Optimization Opportunities
โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
  1. Enable Z-Order for frequently filtered columns:
     spark.sql("OPTIMIZE delta.`Tables/green_tripdata_2017` 
                ZORDER BY (vendor_id, payment_type)")
     
     Benefits:
       โ€ข 2-5x faster queries on vendor_id, payment_type
       โ€ข No partition overhead
       โ€ข Maintains good compression
  
  2. Consider liquid clustering for high-cardinality columns:
     ALTER TABLE green_tripdata_2017 
     CLUSTER BY (vendor_id, payment_type, pickup_location)
     
     Benefits:
       โ€ข Automatic optimization on writes
       โ€ข Better for evolving query patterns
       โ€ข Handles high-cardinality columns

๐ŸŽฏ Partition Health: โœ… OPTIMAL
  Your partitioning strategy is working well!

Comprehensive Storage Analysis

import sparkwise
sparkwise.analyze_storage("Tables/green_tripdata_2017")

Output:

โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•
๐Ÿ” COMPREHENSIVE STORAGE ANALYSIS: green_tripdata_2017
โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•

[Shows combined output of all three analyses above:]
  1. Small File Detection (with recommendations)
  2. VACUUM ROI Calculation (with cost analysis)
  3. Partition Effectiveness (with optimization suggestions)

โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•
๐Ÿ“‹ PRIORITY ACTION ITEMS
โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•
  ๐Ÿ”ด CRITICAL (Do Now):
    โ€ข Run OPTIMIZE to compact 498 small files
    โ€ข Enable Auto-Optimize for future writes
  
  ๐ŸŸก HIGH (This Week):
    โ€ข Add Z-Order on vendor_id, payment_type
    โ€ข Run VACUUM to reclaim 4.59 GB
  
  ๐ŸŸข MEDIUM (This Month):
    โ€ข Review partition strategy quarterly
    โ€ข Monitor file growth patterns
    โ€ข Set up automated OPTIMIZE jobs

๐Ÿ’ฐ Total Potential Savings:
  โ€ข Storage: $0.11/month (after VACUUM)
  โ€ข Compute: 20-30% reduction (after OPTIMIZE)
  โ€ข Query Performance: 3-5x faster

๐Ÿ“ฆ What's Included

Core Modules

  • diagnose - Main diagnostic engine with 5 check categories
  • ask - Interactive configuration Q&A system
  • profile - Session profiling
  • profile_executors - Executor-level metrics
  • profile_jobs - Job/stage/task analysis
  • profile_resources - Resource efficiency scoring
  • predict_scalability - Compare Starter vs Custom Pool configurations
  • analyze_efficiency - Quantify wasted compute with VCore-hour metrics
  • show_timeline - Visualize stage execution patterns
  • detect_skew - Advanced skew detection with mitigation strategies
  • analyze_query - SQL query plan analysis with anti-pattern detection
  • analyze_storage - Comprehensive storage optimization (v1.4.0)
  • check_small_files - Small file detection with thresholds (v1.4.0)
  • vacuum_roi - VACUUM ROI calculator with OneLake pricing (v1.4.0)
  • check_partitions - Partition effectiveness analysis (v1.4.0)

Knowledge Base (133 Configurations)

  • 33 Spark configs - Core settings for shuffle, memory, AQE, serialization
  • 45 Delta configs - Delta Lake optimizations, V-Order, Deletion Vectors
  • 10 Fabric configs - Native Engine, resource profiles, OneLake storage
  • 45 Runtime 1.2 configs - Latest Fabric Runtime 1.2 features

Latest Features

  • โœ… Storage optimization suite - Small files, VACUUM ROI, partition analysis (v1.4.0)
  • โœ… OneLake cost tracking - Real pricing ($0.023/GB/month) for storage decisions
  • โœ… Advanced skew detection - Task duration, partition-level, and join analysis
  • โœ… SQL query plan analyzer - Anti-patterns, Native Engine checks, Z-Order suggestions
  • โœ… Real metrics profiling - VCore-hour calculations, efficiency scoring
  • โœ… Scalability prediction - Starter vs Custom Pool cost comparison
  • โœ… Fabric resource profiles (writeHeavy, readHeavyForSpark, readHeavyForPBI)
  • โœ… Advanced Delta optimizations (Fast Optimize, Adaptive File Size, File Level Target)
  • โœ… Driver Mode Snapshot for faster metadata operations
  • โœ… Priority-based recommendation tables
  • โœ… Color-coded terminal output with Rich library

๐ŸŽฏ Use Cases

Data Engineers

  • Optimize ETL pipelines - Detect bottlenecks, tune parallelism, reduce costs
  • Validate configurations - Ensure proper resource profiles and pool usage
  • Debug job failures - Understand errors with plain English explanations
  • Manage storage costs - Track OneLake usage, optimize file layouts, VACUUM ROI
  • Monitor table health - Detect small files, partition skew, storage bloat

Data Scientists

  • Improve notebook performance - Enable Native Engine, optimize memory usage
  • Understand Spark behavior - Learn configurations through interactive Q&A
  • Profile experiments - Track resource usage and efficiency
  • Optimize data access - Identify caching opportunities, partition pruning

Platform Admins

  • Standardize best practices - Share optimal configurations across teams
  • Monitor capacity usage - Identify jobs forcing Custom Pool usage
  • Cost optimization - Detect over-provisioned or misconfigured workloads
  • Storage governance - Track OneLake costs, enforce OPTIMIZE/VACUUM policies
  • Performance tracking - Monitor VCore-hour usage, identify waste

๐ŸŽ“ Examples

Check out the examples directory:

๐Ÿงช Running Tests

# Install test dependencies
pip install pytest pytest-cov

# Run all tests
pytest

# Run with coverage
pytest --cov=sparkwise --cov-report=html

# Run specific test file
pytest tests/test_advisor.py

๐Ÿค Contributing

Contributions are welcome! Please read our Contributing Guide for details.

Development Setup

# Clone the repository
git clone https://github.com/santhoshravindran7/sparkwise.git
cd sparkwise

# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install in development mode
pip install -e ".[dev]"

# Run tests
pytest

๐Ÿ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.

๐Ÿ™ Acknowledgments

Built with โค๏ธ for the Microsoft Fabric Data Engineering and Data Science community.

๐Ÿ“ฌ Contact & Support

๐ŸŽ‰ What's New in v1.4.0

๐Ÿ’พ Storage Optimization Suite

  • โœ… Small file detection - Identify tables with excessive files <10MB (configurable threshold)
  • โœ… VACUUM ROI calculator - Estimate storage savings vs compute cost with OneLake pricing ($0.023/GB/month)
  • โœ… Partition effectiveness - Analyze partition count, skew ratios, detect over/under-partitioning
  • โœ… Comprehensive analysis - Run all storage checks with one command
  • โœ… CLI integration - sparkwise storage analyze|small-files|vacuum-roi|partitions
  • โœ… Actionable recommendations - Get SQL commands for OPTIMIZE, VACUUM, Z-Order, partitioning

Use Cases

  • Cost optimization - Track OneLake storage costs, identify VACUUM opportunities
  • Performance tuning - Detect small file problems impacting query speed
  • Data governance - Monitor table health, enforce optimization policies
  • Capacity planning - Understand storage growth patterns, predict costs

Example

import sparkwise

# Run comprehensive storage analysis
sparkwise.analyze_storage("Tables/mytable")

# Get small file recommendations
sparkwise.check_small_files("Tables/mytable", threshold_mb=10)

# Calculate VACUUM ROI
sparkwise.vacuum_roi("Tables/mytable", retention_hours=168)

# Analyze partition effectiveness
sparkwise.check_partitions("Tables/mytable")

Previous Releases:

v0.1.0 - Initial Release
  • โœจ Complete profiling suite (session, executor, job, resource profilers)
  • ๐ŸŽจ Rich terminal output with color-coded priorities
  • ๐Ÿ“Š Priority-based recommendation tables
  • ๐Ÿ”ง Fabric resource profile support (writeHeavy, readHeavy profiles)
  • โšก 4 new advanced Delta optimizations
  • ๐Ÿ“š 133 documented configurations (up from 100)
  • ๐ŸŽฏ Context-aware Optimize Write recommendations
  • ๐Ÿš€ CLI support for all profiling operations

Make Spark tuning fun again! ๐Ÿš€โœจ

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sparkwise-1.4.1.tar.gz (110.0 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

sparkwise-1.4.1-py3-none-any.whl (108.0 kB view details)

Uploaded Python 3

File details

Details for the file sparkwise-1.4.1.tar.gz.

File metadata

  • Download URL: sparkwise-1.4.1.tar.gz
  • Upload date:
  • Size: 110.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.11

File hashes

Hashes for sparkwise-1.4.1.tar.gz
Algorithm Hash digest
SHA256 1b88968ee938ebb9f5f03ac70957d72b7ae61cdc0628d2176d609eb02d1160cf
MD5 2cb213b5b78681391283489694b12ad9
BLAKE2b-256 d1724de92edb697f4eb78ea4ee47c9bd9b31942eaf4e3959c1be3fdb67e38628

See more details on using hashes here.

File details

Details for the file sparkwise-1.4.1-py3-none-any.whl.

File metadata

  • Download URL: sparkwise-1.4.1-py3-none-any.whl
  • Upload date:
  • Size: 108.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.11

File hashes

Hashes for sparkwise-1.4.1-py3-none-any.whl
Algorithm Hash digest
SHA256 d1f8479cb8788462ec26361b793472f7be5f1510ffbe39265194e38e39e119dd
MD5 9c15989ae75c8fffc007feb2aefc7cba
BLAKE2b-256 c3daf1fcf9387facb2de74167a693353d27a389dd78ce8c0168c59c0058759a6

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page