A Python-based data contract runtime for consistent quality across engines.
LakeLogic
Your Data Estate. Under Contract.
A declarative, contract-driven medallion pipeline engine for data mesh architectures.
Describe your data products in YAML — LakeLogic materializes them as Delta/Iceberg tables with lineage, quality, and SCD2 built in.
Write once. Run on Spark, Polars, or DuckDB. The vendor-neutral alternative to Databricks Lakeflow Pipelines.
Data Mesh Alignment
LakeLogic is the missing runtime layer for Data Mesh — where domain ownership and federated governance stop being principles and start being enforced.
| Pillar | How LakeLogic Delivers |
|---|---|
| Domain Ownership | Contracts are owned and defined by domain teams (e.g., CRM, Finance) who know the data best. |
| Data as a Product | The contract IS the product interface — a versioned, schema-enforced, SLA-backed guarantee that consuming teams can depend on. |
| Self-Serve Platform | A standardized runtime that any team can use to deploy quality gates without infra silos. |
| Federated Governance | PII masking rules, SLA thresholds, and schema standards defined once in a central registry — automatically enforced at every domain pipeline. |
Quick Start
```shell
pip install lakelogic
```

```python
from lakelogic import DataProcessor

result = DataProcessor("contract.yaml").run_source()
print(f"Valid: {result.good_count} | Quarantined: {result.bad_count}")
```
Technical Capabilities
Data Quality & Trust
- 100% Reconciliation — Mathematically guaranteed: `source = good + bad`. Every row is accounted for — nothing silently dropped
- Pydantic-Powered Validation — Every contract, system, and domain config is parsed through strict Pydantic models with `Literal` type enforcement — invalid YAML is caught at load time, not at runtime
- SQL-First Rules — Define business logic in the language your team already speaks — no SDK, no custom DSL
- SLO Monitoring & Anomaly Detection — Native freshness, row-count, and statistical anomaly detection with automatic multi-channel alerting when thresholds are breached
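The reconciliation guarantee can be pictured with a small, purely illustrative sketch (the function and error fields below are stand-ins, not LakeLogic's API): every source row lands in exactly one of the good or quarantine sets, so the counts always balance.

```python
import re

def split_rows(rows):
    """Partition rows into (good, bad); every input row lands on exactly one side."""
    good, bad = [], []
    for row in rows:
        # mirrors a contract row rule like: email LIKE '%@%.%'
        if re.fullmatch(r".+@.+\..+", row.get("email") or ""):
            good.append(row)
        else:
            bad.append({**row, "error_reason": "email failed rule: LIKE '%@%.%'"})
    return good, bad

source = [
    {"customer_id": 1, "email": "a@example.com"},
    {"customer_id": 2, "email": "not-an-email"},
]
good, bad = split_rows(source)
assert len(source) == len(good) + len(bad)  # the reconciliation invariant
```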
Compliance & Governance
- GDPR & HIPAA Compliance — Contract-driven `forget_subjects()` with nullify, hash, or redact strategies and an immutable audit trail
- Automatic Lineage — Every row stamped with run IDs and source paths — traceable from landing zone to Gold layer
- Pipeline Cost Intelligence — Per-entity compute cost attribution with domain-level budget governance, autoscaling-aware estimation, and Databricks Unity Catalog billing integration
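Conceptually, the three forgetting strategies behave like this stdlib-only sketch (a stand-in to show the semantics, not LakeLogic's implementation):

```python
import hashlib

def forget_value(value: str, strategy: str):
    """Apply one of the three declared forgetting strategies to a PII value."""
    if strategy == "nullify":
        return None                      # drop the value entirely
    if strategy == "hash":
        # one-way hash: joins still work, the raw value is gone
        return hashlib.sha256(value.encode()).hexdigest()
    if strategy == "redact":
        return "***REDACTED***"          # fixed placeholder
    raise ValueError(f"unknown strategy: {strategy}")
```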
Engine & Scale
- Engine Agnostic — Write once, run on Spark, Polars, or DuckDB — same contract, zero code changes
- Dimensional Modeling — Native SCD Type 2 (slowly changing dimensions), merge/upsert (SCD1), append-only fact tables, periodic snapshot overwrites, and partition-aware writes — all declared in YAML, no manual `MERGE INTO` SQL required
- Incremental-First — Built-in watermarking, CDC, and file-mtime tracking
- Parallel Processing — Concurrent multi-contract execution with data-layer-aware orchestration and topological dependency ordering
- Backfill & Reprocessing — Targeted late-arriving data reprocessing with partition-aware filters — no full reload required
- External Logic — Plug in custom Python scripts or notebooks for complex Gold-layer transformations while preserving full contract validation and lineage
- Production Resilience — Built-in exponential-backoff retries, per-entity timeouts, and circuit-breaker thresholds (`max_consecutive_failures`) — pipelines self-heal from transient failures without operator intervention
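Topological dependency ordering of the kind described above can be sketched with the standard library; the contract names and edges here are invented for illustration:

```python
from graphlib import TopologicalSorter

# each contract maps to the set of contracts it depends on (made-up names)
deps = {
    "silver_customers": {"bronze_customers"},
    "silver_orders": {"bronze_orders"},
    "gold_customer_kpis": {"silver_customers", "silver_orders"},
}

# static_order yields every contract after all of its dependencies,
# so Bronze runs before Silver, and Silver before Gold
order = list(TopologicalSorter(deps).static_order())
```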
Developer Experience
- Structured Diagnostics & Observability — Deep contextual logging out of the box (powered by `loguru`) with precise timestamps, severity levels, exact function paths, and execution tags to drastically cut troubleshooting time
- Dry Run Mode — Validate contracts, resolve dependencies, and preview execution plans without touching any data
- DDL-Only Mode — Generate and apply schema DDL (CREATE/ALTER) from contracts without running the pipeline — perfect for CI/CD migrations
- DAG Dependency Viewer — Visualize cross-contract lineage and execution order before running — understand your pipeline graph at a glance
- Data Reset & Reload — Surgically reset and reload specific entities or data layers (Bronze/Silver/Gold) without impacting the rest of the lakehouse
- Multi-Channel Alerts — Powered by Apprise for Slack, Email (SMTP/SendGrid), Teams, and Webhook notifications with ownership-based auto-routing and full Jinja2 templating support for custom formatting
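Templated alert formatting works along these lines; LakeLogic uses Jinja2, but this stdlib stand-in (with invented field names) shows the idea self-contained:

```python
from string import Template

# placeholder-based alert body; $fields are illustrative, not LakeLogic's schema
alert = Template("[$severity] $entity: $bad_count rows quarantined (run $run_id)")
message = alert.substitute(
    severity="WARN",
    entity="silver_customers",
    bad_count=13,
    run_id="run-2024-001",
)
```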
Data Generation & AI
- Synthetic Data — Built-in `DataGenerator` (powered by Faker) with streaming simulation, time-windowed output, referential integrity, and AI-powered edge-case injection — generate realistic error rows (SQL injection, type confusion, boundary values) for stress testing and quarantine validation
- AI Contract Onboarding — `lakelogic infer` auto-generates contracts from sample data with LLM-powered enrichment: automatic PII detection, column labelling, and quality-rule suggestions
- Unstructured Processing — LLM extraction from PDFs, images, and audio with the same contract validation and lineage
- Automated Run Logs — Every pipeline run emits structured JSON with row counts, quality scores, durations, and error details — queryable as a Delta table
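A structured run-log record of the kind described above might look like the following sketch; the field names are illustrative, not LakeLogic's actual log schema:

```python
import datetime
import json

record = {
    "run_id": "run-2024-001",
    "entity": "silver_customers",
    "source_count": 1000,
    "good_count": 987,
    "bad_count": 13,
    "quality_score": 987 / 1000,
    "duration_s": 42.7,
    "finished_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
}
# one JSON object per run: machine-parseable, queryable downstream
line = json.dumps(record)
```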
Integrations
- dbt Adapter — Import dbt `schema.yml` models and sources as LakeLogic contracts — reuse existing dbt definitions without rewriting
- dlt (Data Load Tool) — Native `DltAdapter` supporting 100+ verified sources (Stripe, Shopify, SQL databases, Google Analytics, and more) plus declarative REST API ingestion — all with contract-driven quality gates on arrival
What a Contract Looks Like
One YAML file replaces hundreds of lines of validation code:
```yaml
version: "1.0"

info:
  title: "Silver Customers"
  domain: "CRM"
  system: "Salesforce"

model:
  fields:
    - name: customer_id
      type: integer
      required: true
    - name: email
      type: string
      pii: true
      masking: "hash"
    - name: status
      type: string

transformations:
  - deduplicate: [customer_id]
  - sql: "SELECT *, UPPER(status) AS status_norm FROM source"
    phase: pre

quality:
  row_rules:
    - sql: "email LIKE '%@%.%'"
    - sql: "status IN ('active', 'churned', 'pending')"
  dataset_rules:
    - unique: customer_id

materialization:
  strategy: merge
  merge_keys: [customer_id]
  format: delta
```
Same contract, any engine — swap `engine="polars"` for `"spark"` or `"duckdb"`. Zero code changes.
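The engine-agnostic design can be pictured as a dispatch table keyed by engine name; the backend functions below are conceptual stand-ins, not LakeLogic internals:

```python
# stand-in backends: each one knows how to execute a contract on its engine
def run_with_spark(contract):
    return f"spark:{contract}"

def run_with_polars(contract):
    return f"polars:{contract}"

def run_with_duckdb(contract):
    return f"duckdb:{contract}"

ENGINES = {
    "spark": run_with_spark,
    "polars": run_with_polars,
    "duckdb": run_with_duckdb,
}

def run(contract, engine="spark"):
    """Same contract in, engine chosen by name: swapping engines changes no pipeline code."""
    return ENGINES[engine](contract)
```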
Analogy: A contract is like a building inspection checklist. The inspector (LakeLogic) checks every room (row) against the blueprint (schema), flags violations (quarantine), and stamps a certificate (lineage) — regardless of whether the building was constructed with bricks (Spark), timber (Polars), or prefab (DuckDB).
What this buys you
| Without LakeLogic | With LakeLogic |
|---|---|
| 500+ lines of PySpark/Pandas validation per table | 40 lines of YAML |
| Bad rows silently dropped or crash the pipeline | Bad rows quarantined with error reasons |
| Schema drift discovered in production dashboards | Schema drift caught at ingestion |
| Manual dedup scripts per team | `deduplicate: [key]` — one line |
| PII scattered across notebooks | `pii: true, masking: hash` — automatic |
| No audit trail | Every row stamped with run ID, source path, timestamp |
> [!TIP]
> View the Complete Contract Reference for every available configuration option.
Architecture
LakeLogic enforces Data Contracts as quality gates across the Medallion Architecture (Bronze → Silver → Gold).
Each layer uses its own contract:
| Layer | Role | Guarantee |
|---|---|---|
| Bronze | Capture everything raw, no validation | Immutable record of source |
| Silver | Full validation, business rules, dedup | Trusted, queryable data |
| Gold | Aggregations, KPIs, ML features | Analytics-ready datasets |
| Quarantine | Failed rows isolated with error reasons | Nothing silently dropped |
Key Guarantee: `source_count = good_count + bad_count` — 100% reconciliation, always.
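The `merge` materialization in the contract above covers SCD1 upserts; its SCD Type 2 counterpart closes the current version of a changed row and appends a new one. The sketch below shows that mechanic in plain Python; the column names (`is_current`, `valid_from`, `valid_to`) are illustrative, not LakeLogic's fixed schema:

```python
def scd2_merge(dim, incoming, key, as_of):
    """Apply SCD Type 2 semantics: history is preserved, only one current version per key."""
    out = [dict(r) for r in dim]
    for new in incoming:
        current = next((r for r in out if r[key] == new[key] and r["is_current"]), None)
        if current is None:
            # brand-new key: open its first version
            out.append({**new, "is_current": True, "valid_from": as_of, "valid_to": None})
        elif any(current[k] != new[k] for k in new):
            # attributes changed: close the old version, open a new one
            current["is_current"] = False
            current["valid_to"] = as_of
            out.append({**new, "is_current": True, "valid_from": as_of, "valid_to": None})
        # unchanged rows are left untouched
    return out

dim = [{"customer_id": 1, "status": "active",
        "is_current": True, "valid_from": "2024-01-01", "valid_to": None}]
history = scd2_merge(dim, [{"customer_id": 1, "status": "churned"}],
                     key="customer_id", as_of="2024-06-01")
```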
Examples
For a complete list of runnable guides and end-to-end notebooks, please visit the Examples section of our Documentation.
Documentation
For full guides, API references, tutorials, and contract templates, please visit the LakeLogic Documentation Site.
Contributing
See CONTRIBUTING.md to get started, or docs/installation.md#developer-installation for environment setup.
License
Apache-2.0