
A Python-based data contract runtime for consistent quality across engines.


LakeLogic

Trust Your Data. Scale Your Logic.

Write Once. Run Anywhere. — The open-source runtime for data contracts with quarantine.

LakeLogic is a SQL-first, infrastructure-agnostic quality gate that ensures your business decisions are based on data you can trust. It scales your validation logic from local Polars to petabyte-scale Spark without rewriting a single rule.



The Core Value: Write Once. Run Anywhere.

Stop paying the "Infrastructure Lock-In Tax." In a traditional stack, moving from a Warehouse (Snowflake) to a Lakehouse (Databricks) means months of rewriting validation rules. LakeLogic decouples your Business Logic from your Execution Engine.

  1. Cost Efficiency (The Spark Tax ROI): Run 80% of your maintenance checks on Polars or DuckDB for pennies, while reserving Spark for your largest production workloads.
  2. Risk Mitigation (100% Reconciliation): Ensure Source = Good + Quarantined. Mathematically prove that no record was lost or double-counted across your layers.
  3. Stakeholder Trust (Visual Traceability): Use aggregate roll-ups to give your business users a visual drill-down from board-level KPIs back to raw source records.
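The reconciliation guarantee can be pictured with a toy sketch in plain Python (illustrative only, not LakeLogic's API): every source record lands in exactly one of the good or quarantined outputs, so counts always reconcile.

```python
# Illustrative sketch of the reconciliation invariant: Source = Good + Quarantined.
source = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": ""},
    {"id": 3, "email": "c@example.com"},
]

# A toy rule: email must be non-empty.
good = [r for r in source if r["email"]]
quarantined = [r for r in source if not r["email"]]

# Nothing lost, nothing double-counted.
assert len(source) == len(good) + len(quarantined)
assert {r["id"] for r in good}.isdisjoint({r["id"] for r in quarantined})
print(len(source), len(good), len(quarantined))  # 3 2 1
```

LakeLogic applies the same invariant across engines; the point here is only that the partition is exhaustive and disjoint.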

Key Features

  • SQL-First Logic: Use the SQL expressions you already know for transformations and quality rules.
  • Schema Enforcement: Type casting, required fields, and unknown-field handling.
  • Intelligent Quarantine: Records that fail rules are detoured, tagged with error messages, and saved for correction.
  • Lineage Injection: Tag records with source path, run ID, and processing timestamp.
  • Materialization: Write validated data to local CSV/Parquet targets or Delta/Iceberg when running on Spark.
  • Referential Integrity: Validate keys against dimensions using local reference tables.
  • Notifications (Demo): Built-in adapters log alerts for quarantine and rule failures.
  • External Logic Hooks: Run dedicated Python modules or notebooks for advanced Gold processing.
  • 🆕 Delta Lake Support (Spark-Free): Read/write/merge Delta tables with Polars, DuckDB, or Pandas—no Spark required!
  • 🆕 Catalog Table Names: Use Unity Catalog, Fabric LakeDB, and Synapse table names (catalog.schema.table) directly.
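To make the quarantine and lineage features concrete, here is a minimal conceptual sketch in plain Python. The column names (`_source_path`, `_run_id`, `_processed_at`, `_error`) and the `apply_rule` helper are hypothetical, chosen for illustration; LakeLogic's actual tagging is driven by the contract.

```python
from datetime import datetime, timezone
import uuid

# Hypothetical helper: partition records by a rule, tagging every record
# with lineage metadata and quarantined records with an error reason.
def apply_rule(records, rule, error_msg, source_path):
    run_id = str(uuid.uuid4())
    ts = datetime.now(timezone.utc).isoformat()
    good, bad = [], []
    for rec in records:
        tagged = {**rec, "_source_path": source_path, "_run_id": run_id, "_processed_at": ts}
        if rule(rec):
            good.append(tagged)
        else:
            bad.append({**tagged, "_error": error_msg})  # quarantined with reason
    return good, bad

good, bad = apply_rule(
    [{"age": 30}, {"age": -1}],
    rule=lambda r: r["age"] >= 0,
    error_msg="age must be non-negative",
    source_path="bronze/customers.csv",
)
print(len(good), len(bad), bad[0]["_error"])  # 1 1 age must be non-negative
```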

Installation

# Get the full engine suite
uv pip install "lakelogic[all]"

# Or just use Polars for local speed
uv pip install "lakelogic[polars]"

# Delta Lake support (Spark-free)
uv pip install "lakelogic[delta]"

# Profiling + PII detection (bootstrap)
uv pip install "lakelogic[profiling]"

See the full installation guide in docs/installation.md.

Quick Start

# 1. Run the Quality Gate (Automatic Engine Selection)
from lakelogic import DataProcessor

processor = DataProcessor(contract="silver_crm_customers.yaml")
source_df, good_df, bad_df = processor.run_source("bronze_crm_customers.csv")

# good_df -> Ready for Silver Layer
# bad_df  -> Sent to Quarantine

🆕 Delta Lake & Catalog Support (Spark-Free!)

Unity Catalog (Databricks)

from lakelogic import DataProcessor

# Use Unity Catalog table names directly (no Spark required!)
processor = DataProcessor(engine="polars", contract="contracts/customers.yaml")
source_df, good_df, bad_df = processor.run_source("main.default.customers")

# LakeLogic automatically:
# 1. Resolves table name to storage path
# 2. Uses Delta-RS for fast, Spark-free operations
# 3. Validates data with your contract rules

print(f"Total: {len(source_df)} | Valid: {len(good_df)} | Invalid: {len(bad_df)}")

Fabric LakeDB (Microsoft)

# Use Fabric table names directly
processor = DataProcessor(engine="polars", contract="contracts/sales.yaml")
source_df, good_df, bad_df = processor.run_source("myworkspace.sales_lakehouse.customers")

print(f"Total: {len(source_df)} | Valid: {len(good_df)} | Invalid: {len(bad_df)}")

Synapse Analytics (Azure)

# Use Synapse table names directly
processor = DataProcessor(engine="polars", contract="contracts/sales.yaml")
source_df, good_df, bad_df = processor.run_source("salesdb.dbo.customers")

print(f"Total: {len(source_df)} | Valid: {len(good_df)} | Invalid: {len(bad_df)}")
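All three platforms share the same three-part naming convention, which is what lets one `run_source` call work everywhere. A hedged sketch of how such a name might be split (LakeLogic's actual resolver maps these parts to a storage path; `parse_table_name` is an illustrative helper, not part of the API):

```python
# Split a catalog.schema.table name into its components.
def parse_table_name(name: str) -> dict:
    parts = name.split(".")
    if len(parts) != 3:
        raise ValueError(f"expected catalog.schema.table, got {name!r}")
    catalog, schema, table = parts
    return {"catalog": catalog, "schema": schema, "table": table}

print(parse_table_name("main.default.customers"))
# {'catalog': 'main', 'schema': 'default', 'table': 'customers'}
```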

Learn more: Delta Lake Support | Catalog Table Names

Get Started

📚 Read the Docs | 🚀 Quickstart Guide | 💬 Discussions

Run Your First Contract (5 Minutes)

# Clone the repo
git clone https://github.com/LineageLogic/LakeLogic.git
cd LakeLogic/examples/01_getting_started/basic_validation

# Run the example
lakelogic run --contract contract.yaml --source data/sample_customers.csv

You'll see:

  • ✅ Good records that passed validation
  • ❌ Quarantined records with error reasons
  • 📊 Quality metrics and health scores

Explore 90+ Examples

The examples/ directory contains runnable examples organized by skill level:

  • Getting Started - Your first contract in 5 minutes
  • Tutorials - Medallion architecture, reference joins, notifications
  • Patterns - Bronze quality gates, SCD2, deduplication, late-arriving data
  • Production - Complete insurance ELT pipeline with multi-entity contracts
  • Integrations - Airflow, Prefect, Dagster, Databricks job templates
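For orchestrators that shell out to the CLI (e.g. an Airflow BashOperator), the `lakelogic run` command from the quick start can be assembled safely with quoting. The `build_run_command` helper below is an assumption for illustration, not part of LakeLogic:

```python
import shlex

# Hypothetical helper: build the `lakelogic run` command for a shell-based
# orchestrator task, quoting paths that may contain spaces or metacharacters.
def build_run_command(contract: str, source: str) -> str:
    return f"lakelogic run --contract {shlex.quote(contract)} --source {shlex.quote(source)}"

print(build_run_command("contract.yaml", "data/sample_customers.csv"))
# lakelogic run --contract contract.yaml --source data/sample_customers.csv
```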

Documentation

Contributing

See docs/installation.md#developer-installation to get started.


License

Apache-2.0
