A modern, intuitive Python package for data lakehouse operations

Development Status
- 3 - Alpha
Intended Audience
- Developers
License
- OSI Approved :: MIT License

Project description

SuperLake: Unified Data Lakehouse Management for Apache Spark & Delta Lake

SuperLake is a Python framework for building, managing, and monitoring data lakehouse architectures on Apache Spark and Delta Lake. Designed for data engineers and analytics teams, it combines ETL pipeline orchestration, Delta table management, and operational monitoring in one extensible package.

Main SuperLake Classes

SuperSpark: Unified SparkSession manager for Delta Lake. Handles Spark initialization, Delta Lake integration, warehouse/external paths, and catalog configuration (classic Spark, Databricks, Unity Catalog). Ensures consistent Spark setup for all pipelines and environments.
SuperDeltaTable: Advanced Delta table abstraction. Supports managed/external tables, schema evolution (Merge, Overwrite, Keep), SCD2 (Slowly Changing Dimension) logic, partitioning, z-order, compression, and table properties. Provides robust methods for create, read, write, merge, SCD2 merge, delete, drop, optimize, vacuum, and schema alignment. Works seamlessly across Spark, Databricks, and Unity Catalog.
SuperPipeline: Orchestrates end-to-end ETL pipelines (bronze → silver). Manages idempotent ingestion, CDC (Change Data Capture), transformation, and deletion logic. Integrates with SuperTracer for run tracking and supports force_cdc, force_caching, and robust error handling. Designed for medallion architecture and production-grade reliability.
SuperSimplePipeline / SuperGoldPipeline: Simplified pipeline for gold-layer aggregations or single-table jobs. Runs a function (e.g., aggregation, modeling) and saves results to a Delta table, with full logging, tracing, and error handling.
SuperDataframe: Utility class for DataFrame cleaning, transformation, and schema management. Features include column name/value cleaning, type casting, dropping/renaming columns, null handling, deduplication, distributed pivot, surrogate key generation, and schema-aligned union across DataFrames.
SuperLogger: Unified logging and metrics for all pipeline operations. Supports contextual logging, metrics collection, and optional Azure Application Insights integration. Enables info, warning, error, and metric logging with sub-pipeline context.
SuperTracer: Pipeline run trace manager. Persists run metadata (e.g., bronze/silver/gold updates, skips, deletions) in a Delta table for full auditability and idempotency. Enables robust recovery and monitoring of pipeline execution state.
SuperOrchestrator: (For advanced users) Dependency-aware pipeline orchestrator. Discovers, groups, and executes pipelines based on dependency graphs. Supports parallelization, cycle detection, partial graph execution, and robust error handling for complex lakehouse projects.
MetricsCollector: (Monitoring) Collects and aggregates table, data quality, performance, and storage metrics. Supports custom metric definitions and saving metrics to Delta tables for monitoring and alerting.
AlertManager: (Monitoring) Flexible alerting engine. Supports custom alert rules, severity levels, and handlers (email, Slack, Teams, etc.) for real-time notifications based on metrics or pipeline events.

Features

Delta Table Management
- Managed and external Delta tables (classic Spark, Databricks, Unity Catalog)
- Schema evolution: Merge, Overwrite, Keep (add/drop/modify columns)
- SCD2 (Slowly Changing Dimension) support with automatic history tracking
- Partitioning, z-order, compression, and generated columns
- Table properties, descriptions, and catalog registration
- Optimize and vacuum operations for performance and storage
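The SCD2 history tracking listed above can be illustrated without Spark. The following is a minimal pure-Python sketch of the general SCD2 technique, not SuperLake's implementation: the function name `scd2_merge` and the column names `valid_from`, `valid_to`, and `is_current` are hypothetical and chosen only for this example.

```python
from datetime import date


def scd2_merge(history, incoming, key, tracked, today):
    """Return a new SCD2 history after applying `incoming` rows.

    history  : list of dicts carrying valid_from / valid_to / is_current
    incoming : list of dicts with the latest source values
    key      : name of the business key column
    tracked  : columns whose change opens a new version
    """
    result = [dict(row) for row in history]  # copy so the input is not mutated
    current = {row[key]: row for row in result if row["is_current"]}

    for new in incoming:
        old = current.get(new[key])
        if old is not None and all(old[c] == new[c] for c in tracked):
            continue  # unchanged: keep the current version open
        if old is not None:
            # changed: close the previous version
            old["valid_to"] = today
            old["is_current"] = False
        # new key or changed row: open a new current version
        result.append({**new, "valid_from": today, "valid_to": None, "is_current": True})
    return result


history = [
    {"id": 1, "city": "Paris", "valid_from": date(2025, 1, 1), "valid_to": None, "is_current": True},
]
incoming = [{"id": 1, "city": "Lyon"}, {"id": 2, "city": "Rome"}]
updated = scd2_merge(history, incoming, key="id", tracked=["city"], today=date(2025, 6, 1))
```

Here the row for `id` 1 is closed and replaced by a new current version, while `id` 2 is inserted as a first version. On Delta Lake the same pattern is usually expressed as a merge rather than a Python loop.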
ETL Pipeline Orchestration
- Medallion architecture: bronze (raw), silver (cleaned), gold (aggregated)
- Idempotent, traceable pipeline execution (SuperTracer)
- Change Data Capture (CDC) and deletion logic
- Force CDC and force caching for robust reruns and testing
- Custom transformation and deletion functions
- Full support for test, dev, and production environments
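As a rough illustration of idempotent, traceable execution, the sketch below records each successful run and skips a batch that has already been processed. It is an assumption-laden stand-in, not the SuperTracer API: `RunTrace`, `run_once`, and the `force` flag are hypothetical names, and the trace lives in memory rather than in a Delta table.

```python
class RunTrace:
    """In-memory stand-in for a persisted run-trace table (hypothetical)."""

    def __init__(self):
        self._done = set()

    def has_run(self, pipeline, batch_id):
        return (pipeline, batch_id) in self._done

    def record(self, pipeline, batch_id):
        self._done.add((pipeline, batch_id))


def run_once(trace, pipeline, batch_id, step, force=False):
    """Run `step` unless this (pipeline, batch) pair already succeeded."""
    if trace.has_run(pipeline, batch_id) and not force:
        return "skipped"
    step()  # an exception here leaves no trace entry, so a rerun retries
    trace.record(pipeline, batch_id)
    return "ran"


trace = RunTrace()
calls = []
first = run_once(trace, "bronze_orders", "2025-06-01", lambda: calls.append(1))
second = run_once(trace, "bronze_orders", "2025-06-01", lambda: calls.append(1))
forced = run_once(trace, "bronze_orders", "2025-06-01", lambda: calls.append(1), force=True)
```

Recording the trace only after the step succeeds is the design choice that makes reruns safe: a failed run leaves no entry, so the next attempt repeats the work instead of skipping it.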
DataFrame Utilities
- Column name/value cleaning and normalization
- Type casting and schema alignment
- Drop, rename, and deduplicate columns/rows
- Null value handling and replacement
- Distributed pivot and schema-aligned union (type promotion)
- Surrogate key generation (SHA-256 hash of fields)
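Surrogate key generation from a SHA-256 hash of fields can be sketched with the standard library alone. The helper name `surrogate_key`, the `|` separator, and the treatment of `None` as an empty string are assumptions made for this example; SuperLake's own conventions may differ.

```python
import hashlib


def surrogate_key(row, fields, sep="|"):
    """Hash the chosen fields of a row into a hex SHA-256 digest."""
    # None becomes an empty string; the separator keeps ("ab", "c") and ("a", "bc") apart
    parts = ["" if row[f] is None else str(row[f]) for f in fields]
    return hashlib.sha256(sep.join(parts).encode("utf-8")).hexdigest()


row = {"customer_id": 42, "country": "FR", "note": None}
key = surrogate_key(row, ["customer_id", "country"])
```

Because the digest depends only on the selected field values and their order, the same input row always yields the same key, which is what makes hashed keys usable across reruns.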
Monitoring & Logging
- Unified logging (SuperLogger) with contextual sub-pipeline names
- Metrics collection (row counts, durations, custom metrics)
- Optional Azure Application Insights integration for enterprise observability
- Pipeline run tracing (SuperTracer) for full auditability
Alerting & Notifications
- Custom alert rules and severity levels (info, warning, error, critical)
- Handlers for email, Slack, Teams, and custom integrations
- Real-time notifications based on metrics or pipeline events
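The idea of alert rules with severity levels and pluggable handlers can be sketched as follows. This is not the AlertManager API: `AlertRule`, `evaluate`, and the handler signature are hypothetical, and the handler here only appends to a list where a real one might send an email or a chat message.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class AlertRule:
    name: str
    severity: str                      # one of: info, warning, error, critical
    condition: Callable[[dict], bool]  # returns True when the alert should fire


def evaluate(rules, metrics, handlers):
    """Send every firing rule to the handlers registered for its severity."""
    fired = []
    for rule in rules:
        if rule.condition(metrics):
            fired.append(rule.name)
            for handler in handlers.get(rule.severity, []):
                handler(rule, metrics)
    return fired


sent = []
rules = [
    AlertRule("empty_table", "critical", lambda m: m["row_count"] == 0),
    AlertRule("slow_run", "warning", lambda m: m["duration_s"] > 600),
]
handlers = {"critical": [lambda rule, m: sent.append(f"[{rule.severity}] {rule.name}")]}
fired = evaluate(rules, {"row_count": 0, "duration_s": 12}, handlers)
```

Keying handlers by severity keeps routing separate from rule logic, so a new notification channel can be added without touching the rules.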
Orchestration (Advanced)
- Dependency graph analysis and cycle detection
- Group-based orchestration (roots-to-leaves or leaves-to-roots)
- Parallel or serial execution of pipeline groups
- Thread-safe status tracking and contextual logging
- Partial graph execution and cascading skips on failure
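Dependency graph analysis, cycle detection, and roots-to-leaves grouping can be illustrated with Python's standard `graphlib` module (Python 3.9+). The function `execution_groups` and the pipeline names are hypothetical; this sketches the general technique and says nothing about how SuperOrchestrator is implemented.

```python
from graphlib import CycleError, TopologicalSorter


def execution_groups(dependencies):
    """Split pipelines into groups; each group depends only on earlier groups.

    dependencies maps a pipeline name to the set of pipelines it depends on.
    Raises graphlib.CycleError if the graph contains a cycle.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # cycle detection happens here
    groups = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for a stable order
        groups.append(ready)
        sorter.done(*ready)
    return groups


deps = {
    "silver_orders": {"bronze_orders"},
    "silver_customers": {"bronze_customers"},
    "gold_sales": {"silver_orders", "silver_customers"},
}
groups = execution_groups(deps)
```

Pipelines within one group have no dependencies on each other, so a group is a natural unit for parallel execution, while the groups themselves run in order.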
Metrics & Data Quality
- Table, data quality, performance, and storage metrics
- Null counts, distinct counts, basic statistics, and version history
- Save metrics to Delta tables for monitoring and alerting
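The kind of data quality metrics named above (null counts, distinct counts, basic statistics) can be shown on plain Python rows. The function `column_metrics` and its output keys are hypothetical; on Spark the same figures would normally come from DataFrame aggregations rather than list comprehensions.

```python
def column_metrics(rows, column):
    """Compute row count, null count, distinct count, and min/max for one column."""
    values = [row.get(column) for row in rows]
    non_null = [v for v in values if v is not None]
    return {
        "row_count": len(values),
        "null_count": len(values) - len(non_null),
        "distinct_count": len(set(non_null)),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
    }


rows = [{"amount": 10}, {"amount": None}, {"amount": 10}, {"amount": 25}]
metrics = column_metrics(rows, "amount")
```

A dictionary like this maps directly onto one row of a metrics table, which is the shape needed for saving metrics and alerting on them later.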
Extensibility & Modularity
- Modular design: use only what you need (core, monitoring, orchestration)
- Easy to add new data sources, models, and custom pipeline logic
- Open source, MIT-licensed, and community-driven

Why SuperLake?

Accelerate Data Engineering: Focus on business logic, not boilerplate.
Production-Ready: Built-in monitoring, error handling, and alerting for reliable data operations.
Extensible & Modular: Use only what you need, whether core data management, monitoring, or both.
Open Source: MIT-licensed and community-driven.

Installation

pip install superlake

Quick Start

The best way to get started is the superlake-lakehouse repository, which contains a full example project and ready-to-use templates.

Release history

- 1.0.3 (Jun 14, 2025)
- 1.0.2 (Jun 7, 2025)
- 1.0.1 (Jun 2, 2025)
- 1.0.0 (Jun 1, 2025), this version
- 0.1.4 (May 7, 2025)
- 0.1.3 (May 6, 2025)
- 0.1.2 (May 6, 2025)
- 0.1.1 (May 6, 2025)
- 0.1.0 (May 6, 2025)

Download files

Source Distribution

superlake-1.0.0.tar.gz (46.4 kB view details)

Uploaded Jun 1, 2025 Source

Built Distribution

superlake-1.0.0-py3-none-any.whl (38.4 kB view details)

Uploaded Jun 1, 2025 Python 3

File details

Details for the file superlake-1.0.0.tar.gz.

File metadata

Download URL: superlake-1.0.0.tar.gz
Upload date: Jun 1, 2025
Size: 46.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for superlake-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`4b4511768a4934d626a14edbea4984476cc1a9199e493b2edd2af3952173b279`
MD5	`c20be75efd00a1b04869cbe0393e6503`
BLAKE2b-256	`0796651173320c3777c9fbc186a6961317ab0037db6f391772418954f2f2320e`
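A downloaded file can be checked against a published SHA256 digest with the standard library. The helper names `sha256_of_file` and `matches_published_hash` are hypothetical, and the sketch assumes the archive has already been saved locally.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare a local file against a published SHA-256 digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, with the source distribution saved in the working directory, `matches_published_hash("superlake-1.0.0.tar.gz", "4b4511768a4934d626a14edbea4984476cc1a9199e493b2edd2af3952173b279")` should return `True` for an intact download.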

File details

Details for the file superlake-1.0.0-py3-none-any.whl.

File metadata

Download URL: superlake-1.0.0-py3-none-any.whl
Upload date: Jun 1, 2025
Size: 38.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for superlake-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`8df9a60f7520df5f0ca2ab9edaa720a798124133846f62528cd16a718884fd52`
MD5	`bff24dd07185079358f9971dc93bb238`
BLAKE2b-256	`ad5f097b46111f4f615dfea82fbb393c62d53eb75e535e28227abce112c010fc`
