A modern, intuitive Python package for data lakehouse operations
SuperLake: Unified Data Lakehouse Management for Apache Spark & Delta Lake
SuperLake is a powerful Python framework for building, managing, and monitoring modern data lakehouse architectures on Apache Spark and Delta Lake. Designed for data engineers and analytics teams, SuperLake streamlines ETL pipeline orchestration, Delta table management, and operational monitoring—all in one extensible package.
Main SuperLake Classes
- SuperSpark: Unified SparkSession manager for Delta Lake. Handles Spark initialization, Delta Lake integration, warehouse/external paths, and catalog configuration (classic Spark, Databricks, Unity Catalog). Ensures consistent Spark setup for all pipelines and environments.
- SuperDeltaTable: Advanced Delta table abstraction. Supports managed/external tables, schema evolution (Merge, Overwrite, Keep), SCD2 (Slowly Changing Dimension) logic, partitioning, z-order, compression, and table properties. Provides robust methods for create, read, write, merge, SCD2 merge, delete, drop, optimize, vacuum, and schema alignment. Works seamlessly across Spark, Databricks, and Unity Catalog.
- SuperPipeline: Orchestrates end-to-end ETL pipelines (bronze → silver). Manages idempotent ingestion, CDC (Change Data Capture), transformation, and deletion logic. Integrates with SuperTracer for run tracking and supports force_cdc, force_caching, and robust error handling. Designed for medallion architecture and production-grade reliability.
- SuperSimplePipeline / SuperGoldPipeline: Simplified pipeline for gold-layer aggregations or single-table jobs. Runs a function (e.g., aggregation, modeling) and saves results to a Delta table, with full logging, tracing, and error handling.
- SuperDataframe: Utility class for DataFrame cleaning, transformation, and schema management. Features include column name/value cleaning, type casting, dropping/renaming columns, null handling, deduplication, distributed pivot, surrogate key generation, and schema-aligned union across DataFrames.
- SuperLogger: Unified logging and metrics for all pipeline operations. Supports contextual logging, metrics collection, and optional Azure Application Insights integration. Enables info, warning, error, and metric logging with sub-pipeline context.
- SuperTracer: Pipeline run trace manager. Persists run metadata (e.g., bronze/silver/gold updates, skips, deletions) in a Delta table for full auditability and idempotency. Enables robust recovery and monitoring of pipeline execution state.
- SuperOrchestrator: (For advanced users) Dependency-aware pipeline orchestrator. Discovers, groups, and executes pipelines based on dependency graphs. Supports parallelization, cycle detection, partial graph execution, and robust error handling for complex lakehouse projects.
- MetricsCollector: (Monitoring) Collects and aggregates table, data quality, performance, and storage metrics. Supports custom metric definitions and saving metrics to Delta tables for monitoring and alerting.
- AlertManager: (Monitoring) Flexible alerting engine. Supports custom alert rules, severity levels, and handlers (email, Slack, Teams, etc.) for real-time notifications based on metrics or pipeline events.
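To make the SCD2 (Slowly Changing Dimension type 2) idea mentioned for SuperDeltaTable more concrete, here is a minimal plain-Python sketch of the general technique: when a tracked attribute changes, the current row is closed and a new current row is appended, so history is preserved. This is not SuperLake's API or implementation (which runs on Spark and Delta Lake). The names `Scd2Row` and `scd2_merge`, and the column names `valid_from`, `valid_to`, and `is_current`, are assumptions made for illustration only.

```python
from dataclasses import dataclass, replace
from datetime import datetime
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Scd2Row:
    """One version of a record (hypothetical layout, for illustration)."""
    key: str
    attrs: Tuple                      # tracked attribute values
    valid_from: datetime
    valid_to: Optional[datetime] = None
    is_current: bool = True


def scd2_merge(history: List[Scd2Row],
               incoming: Dict[str, Tuple],
               now: datetime) -> List[Scd2Row]:
    """Return a new history after applying `incoming` ({key: attrs})."""
    result: List[Scd2Row] = []
    keys_with_current_row = set()
    for row in history:
        new_attrs = incoming.get(row.key)
        if row.is_current and new_attrs is not None:
            keys_with_current_row.add(row.key)
            if new_attrs != row.attrs:
                # Attributes changed: close the old version, open a new one.
                result.append(replace(row, valid_to=now, is_current=False))
                result.append(Scd2Row(row.key, new_attrs, now))
                continue
        # Unchanged current rows and already-closed rows are kept as-is.
        result.append(row)
    # Keys without a current row are inserted as brand-new current rows.
    for key, attrs in incoming.items():
        if key not in keys_with_current_row:
            result.append(Scd2Row(key, attrs, now))
    return result
```

Applying two batches in sequence shows the history growing: a changed key yields one closed row plus one current row, while an unchanged key keeps its single current row.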
Features
- Delta Table Management
  - Managed and external Delta tables (classic Spark, Databricks, Unity Catalog)
  - Schema evolution: Merge, Overwrite, Keep (add/drop/modify columns)
  - SCD2 (Slowly Changing Dimension) support with automatic history tracking
  - Partitioning, z-order, compression, and generated columns
  - Table properties, descriptions, and catalog registration
  - Optimize and vacuum operations for performance and storage
- ETL Pipeline Orchestration
  - Medallion architecture: bronze (raw), silver (cleaned), gold (aggregated)
  - Idempotent, traceable pipeline execution (SuperTracer)
  - Change Data Capture (CDC) and deletion logic
  - Force CDC and force caching for robust reruns and testing
  - Custom transformation and deletion functions
  - Full support for test, dev, and production environments
- DataFrame Utilities
  - Column name/value cleaning and normalization
  - Type casting and schema alignment
  - Drop, rename, and deduplicate columns/rows
  - Null value handling and replacement
  - Distributed pivot and schema-aligned union (type promotion)
  - Surrogate key generation (SHA-256 hash of fields)
- Monitoring & Logging
  - Unified logging (SuperLogger) with contextual sub-pipeline names
  - Metrics collection (row counts, durations, custom metrics)
  - Optional Azure Application Insights integration for enterprise observability
  - Pipeline run tracing (SuperTracer) for full auditability
- Alerting & Notifications
  - Custom alert rules and severity levels (info, warning, error, critical)
  - Handlers for email, Slack, Teams, and custom integrations
  - Real-time notifications based on metrics or pipeline events
- Orchestration (Advanced)
  - Dependency graph analysis and cycle detection
  - Group-based orchestration (roots-to-leaves or leaves-to-roots)
  - Parallel or serial execution of pipeline groups
  - Thread-safe status tracking and contextual logging
  - Partial graph execution and cascading skips on failure
- Metrics & Data Quality
  - Table, data quality, performance, and storage metrics
  - Null counts, distinct counts, basic statistics, and version history
  - Save metrics to Delta tables for monitoring and alerting
- Extensibility & Modularity
  - Modular design: use only what you need (core, monitoring, orchestration)
  - Easy to add new data sources, models, and custom pipeline logic
  - Open source, MIT-licensed, and community-driven
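The group-based orchestration and cycle detection listed above can be illustrated with a small standard-library sketch. It layers a dependency graph into groups, roots first, so that every pipeline in a group could run in parallel once the previous group has finished. This is an illustration of the general technique only, not SuperOrchestrator's actual code. The function name `group_pipelines` and the pipeline names in the usage note are hypothetical.

```python
from typing import Dict, Iterable, List, Set


def group_pipelines(dependencies: Dict[str, Iterable[str]]) -> List[List[str]]:
    """Layer pipelines into execution groups, roots first.

    `dependencies` maps a pipeline name to the names it depends on.
    Raises ValueError if the graph contains a cycle.
    """
    # Copy into mutable sets and register nodes that only appear as dependencies.
    remaining: Dict[str, Set[str]] = {
        name: set(deps) for name, deps in dependencies.items()
    }
    for deps in dependencies.values():
        for dep in deps:
            remaining.setdefault(dep, set())

    groups: List[List[str]] = []
    while remaining:
        # Pipelines with no unmet dependencies are ready to run.
        ready = sorted(name for name, deps in remaining.items() if not deps)
        if not ready:
            # Every remaining pipeline waits on another one: a cycle exists.
            raise ValueError(f"dependency cycle among: {sorted(remaining)}")
        groups.append(ready)
        for name in ready:
            del remaining[name]
        for deps in remaining.values():
            deps.difference_update(ready)
    return groups
```

For example, with two hypothetical silver pipelines that each depend on a bronze pipeline and one gold pipeline that depends on both silver pipelines, the function returns three groups: the bronze pipelines, then the silver pipelines, then the gold pipeline. Reversing the returned list gives a leaves-to-roots order.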
Why SuperLake?
- Accelerate Data Engineering: Focus on business logic, not boilerplate.
- Production-Ready: Built-in monitoring, error handling, and alerting for reliable data operations.
- Extensible & Modular: Use only what you need—core data management, monitoring, or both.
- Open Source: MIT-licensed and community-driven.
Installation
pip install superlake
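After installing, one way to confirm that the package is visible to the current interpreter is to query the installed distribution metadata with the standard library. This sketch uses only `importlib.metadata` and makes no assumption about SuperLake's own modules. The helper name `installed_version` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Prints the installed version string, or None if superlake is not installed.
    print(installed_version("superlake"))
```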
Quick Start
Best way to get started:
Check out the superlake-lakehouse repository for a full example project and ready-to-use templates.
Download files
File details
Details for the file superlake-1.0.0.tar.gz.
File metadata
- Download URL: superlake-1.0.0.tar.gz
- Upload date:
- Size: 46.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 4b4511768a4934d626a14edbea4984476cc1a9199e493b2edd2af3952173b279 |
| MD5 | c20be75efd00a1b04869cbe0393e6503 |
| BLAKE2b-256 | 0796651173320c3777c9fbc186a6961317ab0037db6f391772418954f2f2320e |
File details
Details for the file superlake-1.0.0-py3-none-any.whl.
File metadata
- Download URL: superlake-1.0.0-py3-none-any.whl
- Upload date:
- Size: 38.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8df9a60f7520df5f0ca2ab9edaa720a798124133846f62528cd16a718884fd52 |
| MD5 | bff24dd07185079358f9971dc93bb238 |
| BLAKE2b-256 | ad5f097b46111f4f615dfea82fbb393c62d53eb75e535e28227abce112c010fc |
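The published SHA256 digests can be used to check a downloaded file before installing it. The sketch below uses only the standard library. The helper names `sha256_of_file` and `matches_digest` are hypothetical, and it assumes the source distribution was saved in the current working directory under its published file name.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with an expected hex string."""
    return sha256_of_file(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    # Digest copied from the file hashes listed for superlake-1.0.0.tar.gz.
    expected = "4b4511768a4934d626a14edbea4984476cc1a9199e493b2edd2af3952173b279"
    archive = Path("superlake-1.0.0.tar.gz")  # assumed download location
    if archive.exists():
        print("digest matches:", matches_digest(archive, expected))
    else:
        print("file not found:", archive)
```

Reading the file in chunks keeps memory use constant regardless of archive size. A mismatch means the download should not be installed.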