A Python-based data contract runtime for consistent quality across engines.
LakeLogic
Trust Your Data. Scale Your Logic.
Write Once. Run Anywhere. — The open-source runtime for data contracts with quarantine.
LakeLogic is a SQL-first, infrastructure-agnostic quality gate that ensures your business decisions are based on data you can trust. It scales your validation logic from local Polars to petabyte-scale Spark without rewriting a single rule.
The Core Value: Write Once, Run Anywhere
Stop paying the "Infrastructure Lock-In Tax." In a traditional stack, moving from a Warehouse (Snowflake) to a Lakehouse (Databricks) means months of rewriting validation rules. LakeLogic decouples your Business Logic from your Execution Engine.
- Cost Efficiency (The Spark Tax ROI): Run 80% of your maintenance checks on Polars or DuckDB for pennies, while reserving Spark for your largest production workloads.
- Risk Mitigation (100% Reconciliation): Ensure Source = Good + Quarantined. Mathematically prove that no record was lost or double-counted across your layers.
- Stakeholder Trust (Visual Traceability): Use aggregate roll-ups to give your business users a visual drill-down from board-level KPIs back to raw source records.
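The reconciliation guarantee can be sketched in plain Python — a toy quality gate (the rule names and `_errors` tag below are illustrative, not LakeLogic's API) that splits records and asserts the Source = Good + Quarantined invariant:

```python
def quality_gate(records, rules):
    """Split records into good and quarantined, tagging failures with the
    names of the rules they broke. A minimal sketch of the pattern;
    LakeLogic's real engine applies SQL rules via Polars/DuckDB/Spark."""
    good, quarantined = [], []
    for record in records:
        errors = [name for name, check in rules.items() if not check(record)]
        if errors:
            quarantined.append({**record, "_errors": errors})
        else:
            good.append(record)
    # Reconciliation: no record lost or double-counted.
    assert len(records) == len(good) + len(quarantined)
    return good, quarantined

rules = {
    "email_present": lambda r: bool(r.get("email")),
    "age_non_negative": lambda r: r.get("age", 0) >= 0,
}
source = [
    {"id": 1, "email": "a@x.com", "age": 34},
    {"id": 2, "email": "", "age": 28},
    {"id": 3, "email": "c@x.com", "age": -1},
]
good, bad = quality_gate(source, rules)
```

Because every record lands in exactly one of the two outputs, the source count always equals the sum of the good and quarantined counts.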
Key Features
- SQL-First Logic: Use the SQL expressions you already know for transformations and quality rules.
- Schema Enforcement: Type casting, required fields, and unknown-field handling.
- Intelligent Quarantine: Records that fail rules are detoured, tagged with error messages, and saved for correction.
- Lineage Injection: Tag records with source path, run ID, and processing timestamp.
- Materialization: Write validated data to local CSV/Parquet targets or Delta/Iceberg when running on Spark.
- Referential Integrity: Validate keys against dimensions using local reference tables.
- Contract Inference: Auto-generate contracts from landing-zone files with lakelogic bootstrap.
- dbt Import: Convert dbt schema.yml/sources.yml into LakeLogic contracts with lakelogic import-dbt.
- Synthetic Data Generation: Generate realistic test data from any contract with DataGenerator.
- External Logic Hooks: Run dedicated Python modules or notebooks for advanced Gold processing.
- Policy Packs: Apply standardised rule sets and defaults across all contracts.
- Notifications: Built-in adapters log alerts for quarantine and rule failures.
- Observability: Prometheus metrics endpoint, summary tables, and execution tracing.
- Delta Lake Support (Spark-Free): Read/write/merge Delta tables with Polars, DuckDB, or Pandas — no Spark required.
- Catalog Table Names: Use Unity Catalog, Fabric LakeDB, and Synapse table names (catalog.schema.table) directly.
- Streaming Ingestion: Kafka, WebSocket, SSE, Azure Service Bus, GCP Pub/Sub, AWS SQS.
- Database CDC: Azure SQL, PostgreSQL, MySQL, MongoDB, Oracle, SQL Server change capture.
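Many of these features meet in a single contract file. A hypothetical sketch of what such a contract might look like — the keys below are illustrative guesses, not the real schema (see the Contract Template in the docs for the authoritative YAML reference):

```yaml
# contracts/silver_crm_customers.yaml — illustrative only, not the real schema
name: silver_crm_customers
source:
  path: bronze_crm_customers.csv
schema:
  columns:
    - name: customer_id
      type: string
      required: true
    - name: email
      type: string
  on_unknown_fields: quarantine      # unknown-field handling
rules:
  - name: email_format
    expr: "email LIKE '%@%'"         # SQL-first quality rule
quarantine:
  path: quarantine/crm_customers/
lineage:
  inject: [source_path, run_id, processed_at]
```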
Installation
# Get the full engine suite
uv pip install "lakelogic[all]"
# Or just use Polars for local speed
uv pip install "lakelogic[polars]"
# Delta Lake support (Spark-free)
uv pip install "lakelogic[delta]"
# Profiling + PII detection (bootstrap)
uv pip install "lakelogic[profiling]"
# Database CDC connectors
uv pip install "lakelogic[databases]"
# Streaming sources
uv pip install "lakelogic[streaming]"
See the full installation guide in docs/installation.md.
Quick Start
from lakelogic import DataProcessor
# 1. Run the Quality Gate (Automatic Engine Selection)
processor = DataProcessor(contract="silver_crm_customers.yaml")
good_df, bad_df = processor.run_source()
# good_df -> Ready for Silver Layer
# bad_df -> Sent to Quarantine
run_source() automatically reads the source path from your contract. You can also pass an explicit path:
good_df, bad_df = processor.run_source("bronze_crm_customers.csv")
The return value is a ValidationResult that unpacks as two DataFrames. Access the raw (pre-validation) frame via result.raw:
result = processor.run_source()
print(f"Total: {len(result.raw)} | Valid: {len(result.good)} | Quarantined: {len(result.bad)}")
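Those three counts are enough to derive simple quality metrics. A minimal sketch, assuming a health score defined as the share of records that passed validation (LakeLogic's built-in metrics may be computed differently):

```python
def health_score(total: int, valid: int) -> float:
    """Share of source records that passed validation, in [0.0, 1.0].

    Illustrative formula only; an empty source counts as fully healthy."""
    return valid / total if total else 1.0

score = health_score(total=1_000, valid=940)
print(f"Health: {score:.1%}")  # Health: 94.0%
```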
Delta Lake & Catalog Support (Spark-Free!)
Unity Catalog (Databricks)
from lakelogic import DataProcessor
# Use Unity Catalog table names directly (no Spark required!)
processor = DataProcessor(engine="polars", contract="contracts/customers.yaml")
good_df, bad_df = processor.run_source("main.default.customers")
# LakeLogic automatically:
# 1. Resolves table name to storage path
# 2. Uses Delta-RS for fast, Spark-free operations
# 3. Validates data with your contract rules
Fabric LakeDB (Microsoft)
processor = DataProcessor(engine="polars", contract="contracts/sales.yaml")
good_df, bad_df = processor.run_source("myworkspace.sales_lakehouse.customers")
Synapse Analytics (Azure)
processor = DataProcessor(engine="polars", contract="contracts/sales.yaml")
good_df, bad_df = processor.run_source("salesdb.dbo.customers")
Learn more: Delta Lake Support | Catalog Table Names
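Before any storage path can be resolved, a three-part name must be split into its catalog, schema, and table components. A minimal sketch of that first step in plain Python (the actual resolver, which consults the catalog API for the storage location, is out of scope here):

```python
from typing import NamedTuple

class TableName(NamedTuple):
    catalog: str
    schema: str
    table: str

def parse_table_name(name: str) -> TableName:
    """Split a catalog.schema.table identifier into its three parts."""
    parts = name.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"Expected catalog.schema.table, got {name!r}")
    return TableName(*parts)

tn = parse_table_name("main.default.customers")
# tn.catalog == "main", tn.schema == "default", tn.table == "customers"
```

The same parsing applies to Fabric (workspace.lakehouse.table) and Synapse (database.schema.table) names, since all three use a dotted three-part convention.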
dbt Integration
Import existing dbt projects directly — no rewrite needed:
# Convert a dbt model to a LakeLogic contract
lakelogic import-dbt --schema models/schema.yml --model customers --output contracts/
# Or use the Python API
from lakelogic import DataProcessor
proc = DataProcessor.from_dbt("models/schema.yml", model="customers")
good_df, bad_df = proc.run_source()
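Conceptually, the import walks each model's columns and turns common dbt tests into contract rules. A toy sketch, assuming a simplified rule format (the `required`/`unique` rule dicts are illustrative, not LakeLogic's real output):

```python
def dbt_tests_to_rules(model: dict) -> list[dict]:
    """Map common dbt column tests onto validation-rule dicts.

    Illustrative only; the real import-dbt command also handles
    sources.yml and richer test configs such as accepted_values."""
    rules = []
    for col in model.get("columns", []):
        for test in col.get("tests", []):
            if test == "not_null":
                rules.append({"rule": "required", "column": col["name"]})
            elif test == "unique":
                rules.append({"rule": "unique", "column": col["name"]})
    return rules

# A model entry as it might appear after parsing models/schema.yml
model = {
    "name": "customers",
    "columns": [
        {"name": "customer_id", "tests": ["not_null", "unique"]},
        {"name": "email", "tests": ["not_null"]},
    ],
}
rules = dbt_tests_to_rules(model)
```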
Get Started
📚 Read the Docs | 🚀 Quickstart Guide | 💬 Discussions
Run Your First Contract (5 Minutes)
# Clone the repo
git clone https://github.com/LineageLogic/LakeLogic.git
cd LakeLogic/examples/01_quickstart
# Run the example
lakelogic run --contract users_contract.yaml --source data/sample_customers.csv
You'll see:
- ✅ Good records that passed validation
- ❌ Quarantined records with error reasons
- 📊 Quality metrics and health scores
Explore the Examples
The examples/ directory contains 24 runnable notebooks across 4 tested categories:
| Category | Directory | What You'll Learn |
|---|---|---|
| Quickstart | 01_quickstart/ | Your first contract in 5 minutes, database governance, dbt+PII |
| Core Patterns | 02_core_patterns/ | Medallion architecture, bronze quality gates, SCD2, deduplication, reference joins, soft deletes |
| Advanced Workflows | 03_advanced_workflows/ | Insurance ELT pipeline, GDPR compliance, late-arriving data, external Python logic, environment promotion, bootstrap, date dimensions, multi-tenant isolation, partitioned merge, payments lifecycle, streaming, synthetic data generation |
| Compliance | 04_compliance_governance/ | HIPAA PII masking |
Looking for more? Additional examples for data sources, cloud platforms, orchestration, and production patterns are in examples/_archive/. These are functional but not yet fully tested.
Documentation
- Full Documentation — Complete guides and API reference
- How It Works — Medallion architecture and core concepts
- CLI Reference — Command-line usage
- API Reference — Python API documentation
- Reprocessing Guide — Handle late-arriving data
- Contract Template — Full YAML reference for all contract fields
- Streaming — Real-time ingestion guide
Contributing
See CONTRIBUTING.md to get started, or docs/installation.md#developer-installation for environment setup.
License
Apache-2.0