Skip to main content

A modern, intuitive Python package for data lakehouse operations

Project description

SuperLake: Unified Data Lakehouse Management for Apache Spark & Delta Lake

SuperLake is a powerful Python framework for building, managing, and monitoring modern data lakehouse architectures on Apache Spark and Delta Lake. Designed for data engineers and analytics teams, SuperLake streamlines ETL pipeline orchestration, Delta table management, and operational monitoring—all in one extensible package.

Main SuperLake Classes

  • SuperSpark: Unified SparkSession manager for Delta Lake. Handles Spark initialization, Delta Lake integration, warehouse/external paths, and catalog configuration (classic Spark, Databricks, Unity Catalog). Ensures consistent Spark setup for all pipelines and environments.

  • SuperDeltaTable: Advanced Delta table abstraction. Supports managed/external tables, schema evolution (Merge, Overwrite, Keep), SCD2 (Slowly Changing Dimension) logic, partitioning, z-order, compression, and table properties. Provides robust methods for create, read, write, merge, SCD2 merge, delete, drop, optimize, vacuum, and schema alignment. Works seamlessly across Spark, Databricks, and Unity Catalog.

  • SuperPipeline: Orchestrates end-to-end ETL pipelines (bronze → silver). Manages idempotent ingestion, CDC (Change Data Capture), transformation, and deletion logic. Integrates with SuperTracer for run tracking and supports force_cdc, force_caching, and robust error handling. Designed for medallion architecture and production-grade reliability.

  • SuperSimplePipeline / SuperGoldPipeline: Simplified pipeline for gold-layer aggregations or single-table jobs. Runs a function (e.g., aggregation, modeling) and saves results to a Delta table, with full logging, tracing, and error handling.

  • SuperDataframe: Utility class for DataFrame cleaning, transformation, and schema management. Features include column name/value cleaning, type casting, dropping/renaming columns, null handling, deduplication, distributed pivot, surrogate key generation, and schema-aligned union across DataFrames.

  • SuperLogger: Unified logging and metrics for all pipeline operations. Supports contextual logging, metrics collection, and optional Azure Application Insights integration. Enables info, warning, error, and metric logging with sub-pipeline context.

  • SuperTracer: Pipeline run trace manager. Persists run metadata (e.g., bronze/silver/gold updates, skips, deletions) in a Delta table for full auditability and idempotency. Enables robust recovery and monitoring of pipeline execution state.

  • SuperOrchestrator: (For advanced users) Dependency-aware pipeline orchestrator. Discovers, groups, and executes pipelines based on dependency graphs. Supports parallelization, cycle detection, partial graph execution, and robust error handling for complex lakehouse projects.

  • MetricsCollector: (Monitoring) Collects and aggregates table, data quality, performance, and storage metrics. Supports custom metric definitions and saving metrics to Delta tables for monitoring and alerting.

  • AlertManager: (Monitoring) Flexible alerting engine. Supports custom alert rules, severity levels, and handlers (email, Slack, Teams, etc.) for real-time notifications based on metrics or pipeline events.

Features

  • Delta Table Management

    • Managed and external Delta tables (classic Spark, Databricks, Unity Catalog)
    • Schema evolution: Merge, Overwrite, Keep (add/drop/modify columns)
    • SCD2 (Slowly Changing Dimension) support with automatic history tracking
    • Partitioning, z-order, compression, and generated columns
    • Table properties, descriptions, and catalog registration
    • Optimize and vacuum operations for performance and storage
  • ETL Pipeline Orchestration

    • Medallion architecture: bronze (raw), silver (cleaned), gold (aggregated)
    • Idempotent, traceable pipeline execution (SuperTracer)
    • Change Data Capture (CDC) and deletion logic
    • Force CDC and force caching for robust reruns and testing
    • Custom transformation and deletion functions
    • Full support for test, dev, and production environments
  • DataFrame Utilities

    • Column name/value cleaning and normalization
    • Type casting and schema alignment
    • Drop, rename, and deduplicate columns/rows
    • Null value handling and replacement
    • Distributed pivot and schema-aligned union (type promotion)
    • Surrogate key generation (SHA-256 hash of fields)
  • Monitoring & Logging

    • Unified logging (SuperLogger) with contextual sub-pipeline names
    • Metrics collection (row counts, durations, custom metrics)
    • Optional Azure Application Insights integration for enterprise observability
    • Pipeline run tracing (SuperTracer) for full auditability
  • Alerting & Notifications

    • Custom alert rules and severity levels (info, warning, error, critical)
    • Handlers for email, Slack, Teams, and custom integrations
    • Real-time notifications based on metrics or pipeline events
  • Orchestration (Advanced)

    • Dependency graph analysis and cycle detection
    • Group-based orchestration (roots-to-leaves or leaves-to-roots)
    • Parallel or serial execution of pipeline groups
    • Thread-safe status tracking and contextual logging
    • Partial graph execution and cascading skips on failure
  • Metrics & Data Quality

    • Table, data quality, performance, and storage metrics
    • Null counts, distinct counts, basic statistics, and version history
    • Save metrics to Delta tables for monitoring and alerting
  • Extensibility & Modularity

    • Modular design: use only what you need (core, monitoring, orchestration)
    • Easy to add new data sources, models, and custom pipeline logic
    • Open source, MIT-licensed, and community-driven

Why SuperLake?

  • Accelerate Data Engineering: Focus on business logic, not boilerplate.
  • Production-Ready: Built-in monitoring, error handling, and alerting for reliable data operations.
  • Extensible & Modular: Use only what you need—core data management, monitoring, or both.
  • Open Source: MIT-licensed and community-driven.

Installation

pip install superlake

Quick Start

Best way to get started:
Check out the superlake-lakehouse repository for a full example project and ready-to-use templates.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

superlake-1.0.0.tar.gz (46.4 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

superlake-1.0.0-py3-none-any.whl (38.4 kB view details)

Uploaded Python 3

File details

Details for the file superlake-1.0.0.tar.gz.

File metadata

  • Download URL: superlake-1.0.0.tar.gz
  • Upload date:
  • Size: 46.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for superlake-1.0.0.tar.gz
Algorithm Hash digest
SHA256 4b4511768a4934d626a14edbea4984476cc1a9199e493b2edd2af3952173b279
MD5 c20be75efd00a1b04869cbe0393e6503
BLAKE2b-256 0796651173320c3777c9fbc186a6961317ab0037db6f391772418954f2f2320e

See more details on using hashes here.

File details

Details for the file superlake-1.0.0-py3-none-any.whl.

File metadata

  • Download URL: superlake-1.0.0-py3-none-any.whl
  • Upload date:
  • Size: 38.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for superlake-1.0.0-py3-none-any.whl
Algorithm Hash digest
SHA256 8df9a60f7520df5f0ca2ab9edaa720a798124133846f62528cd16a718884fd52
MD5 bff24dd07185079358f9971dc93bb238
BLAKE2b-256 ad5f097b46111f4f615dfea82fbb393c62d53eb75e535e28227abce112c010fc

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page