
LakeLogic

Your Data Estate. Under Contract.


A declarative, contract-driven medallion pipeline engine for data mesh architectures.

Describe your data products in YAML — LakeLogic materializes them as Delta/Iceberg tables with lineage, quality, and SCD2 built in.

Write once. Run on Spark, Polars, or DuckDB. The vendor-neutral alternative to Databricks Lakeflow Pipelines.


Data Mesh Alignment

LakeLogic is the missing runtime layer for Data Mesh — where domain ownership and federated governance stop being principles and start being enforced.

| Pillar | How LakeLogic Delivers |
| --- | --- |
| Domain Ownership | Contracts are owned and defined by domain teams (e.g., CRM, Finance) who know the data best. |
| Data as a Product | The contract IS the product interface — a versioned, schema-enforced, SLA-backed guarantee that consuming teams can depend on. |
| Self-Serve Platform | A standardized runtime that any team can use to deploy quality gates without infra silos. |
| Federated Governance | PII masking rules, SLA thresholds, and schema standards defined once in a central registry — automatically enforced at every domain pipeline. |

Quick Start

```bash
pip install lakelogic
```

```python
from lakelogic import DataProcessor

result = DataProcessor("contract.yaml").run_source()
print(f"Valid: {result.good_count}  |  Quarantined: {result.bad_count}")
```

Technical Capabilities

Data Quality & Trust

  • 100% Reconciliation — Mathematically guaranteed: source = good + bad. Every row is accounted for — nothing silently dropped
  • Pydantic-Powered Validation — Every contract, system, and domain config is parsed through strict Pydantic models with Literal type enforcement — invalid YAML is caught at load time, not at runtime
  • SQL-First Rules — Define business logic in the language your team already speaks — no SDK, no custom DSL
  • SLO Monitoring & Anomaly Detection — Native freshness, row count, and statistical anomaly detection with automatic multi-channel alerting when thresholds breach
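The reconciliation guarantee is simple to state in code. A minimal sketch (hypothetical rule and field names, not LakeLogic's internal API) shows why source = good + bad holds by construction: every row lands in exactly one bucket.

```python
# Minimal sketch of 100% reconciliation: each source row lands in exactly
# one of two buckets, so source_count == good_count + bad_count always.
def split_rows(rows, rule):
    good, bad = [], []
    for row in rows:
        (good if rule(row) else bad).append(row)
    return good, bad

rows = [
    {"customer_id": 1, "email": "a@x.com"},
    {"customer_id": 2, "email": "not-an-email"},
    {"customer_id": 3, "email": "c@y.org"},
]
good, bad = split_rows(rows, lambda r: "@" in r["email"])
assert len(rows) == len(good) + len(bad)  # reconciliation always holds
```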

✏️ Try it out in Google Colab: Data Quality & Trust

Compliance & Governance

  • GDPR & HIPAA Compliance — Contract-driven forget_subjects() with nullify, hash, or redact strategies and immutable audit trail
  • Automatic Lineage — Every row stamped with Run IDs and source paths — traceable from landing zone to Gold layer
  • Pipeline Cost Intelligence — Per-entity compute cost attribution with domain-level budget governance, autoscaling-aware estimation, and Databricks Unity Catalog billing integration
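The three erasure strategies can be sketched in a few lines. This is illustrative only — the `mask` helper is hypothetical, not the actual `forget_subjects()` implementation:

```python
import hashlib

# Sketch of the three erasure strategies: nullify, hash, redact.
def mask(value, strategy):
    if strategy == "nullify":
        return None  # drop the value entirely
    if strategy == "hash":
        # irreversible pseudonym: joins still work, the value does not leak
        return hashlib.sha256(value.encode()).hexdigest()
    if strategy == "redact":
        return "***REDACTED***"  # fixed placeholder
    raise ValueError(f"unknown strategy: {strategy}")

email = "jane@example.com"
print(mask(email, "nullify"))  # None
print(mask(email, "redact"))   # ***REDACTED***
```

Hashing is the usual choice when the masked column still needs to join across tables; nullify and redact destroy that property by design.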

✏️ Try it out in Google Colab: Compliance & Governance

Engine & Scale

  • Engine Agnostic — Write once, run on Spark, Polars, or DuckDB — same contract, zero code changes
  • Dimensional Modeling — Native SCD Type 2 (slowly changing dimensions), merge/upsert (SCD1), append-only fact tables, periodic snapshot overwrites, and partition-aware writes — all declared in YAML, no manual MERGE INTO SQL required
  • Incremental-First — Built-in watermarking, CDC, and file-mtime tracking
  • Parallel Processing — Concurrent multi-contract execution with data-layer-aware orchestration and topological dependency ordering
  • Backfill & Reprocessing — Targeted late-arriving data reprocessing with partition-aware filters — no full reload required
  • External Logic — Plug in custom Python scripts or notebooks for complex Gold-layer transformations while preserving full contract validation and lineage
  • Production Resilience — Built-in exponential-backoff retries, per-entity timeouts, and circuit-breaker thresholds (max_consecutive_failures) — pipelines self-heal transient failures without operator intervention
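The resilience pattern above can be sketched in plain Python (illustrative only; the parameter name mirrors the contract's max_consecutive_failures setting, and the helper is hypothetical):

```python
import time

# Sketch of exponential-backoff retries with a failure budget.
def run_with_retries(task, max_consecutive_failures=3, base_delay=0.01):
    for attempt in range(max_consecutive_failures):
        try:
            return task()
        except Exception:
            if attempt == max_consecutive_failures - 1:
                raise  # budget exhausted: surface the failure (trip the breaker)
            time.sleep(base_delay * 2 ** attempt)  # 0.01s, 0.02s, 0.04s, ...

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

print(run_with_retries(flaky))  # ok (succeeds on the third attempt)
```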

✏️ Try it out in Google Colab: Engine & Scale

Developer Experience

  • Structured Diagnostics & Observability — Deep contextual logging out-of-the-box (powered by loguru) featuring precise timestamps, severity levels, exact function paths, and execution tags to drastically cut troubleshooting time
  • Dry Run Mode — Validate contracts, resolve dependencies, and preview execution plans without touching any data
  • DDL-Only Mode — Generate and apply schema DDL (CREATE/ALTER) from contracts without running the pipeline — perfect for CI/CD migrations
  • DAG Dependency Viewer — Visualize cross-contract lineage and execution order before running — understand your pipeline graph at a glance
  • Data Reset & Reload — Surgically reset and reload specific entities or data layers (Bronze/Silver/Gold) without impacting the rest of the lakehouse
  • Multi-Channel Alerts — Powered by Apprise for Slack, Email (SMTP/SendGrid), Teams, and Webhook notifications with ownership-based auto-routing and full Jinja2 templating support for custom formatting
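The kind of dependency resolution the DAG viewer performs can be sketched with the standard library's graphlib (the contract names here are hypothetical; this is not LakeLogic's resolver):

```python
from graphlib import TopologicalSorter

# Each contract maps to the contracts it reads from; a topological sort
# yields a safe execution order (upstreams always before downstreams).
deps = {
    "gold_revenue":     {"silver_orders", "silver_customers"},
    "silver_orders":    {"bronze_orders"},
    "silver_customers": {"bronze_customers"},
}
order = list(TopologicalSorter(deps).static_order())
print(order)  # bronze layers first, gold last
```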

✏️ Try it out in Google Colab: Developer Experience

Data Generation & AI

  • Synthetic Data — Built-in DataGenerator (powered by Faker) with streaming simulation, time-windowed output, referential integrity, and edge case injection — generate realistic error rows (SQL injection, type confusion, boundary values) for stress testing and quarantine validation
  • Descriptive AI Test Data — Steer synthetic data generation with natural language prompts (e.g. "Generate users who are French or Japanese only, enterprise-tier, over 60 years old with SQL injection attempts in email fields") — output strictly adheres to the YAML contract schema
  • AI Contract Onboarding — `lakelogic infer` auto-generates contracts from sample data with LLM-powered enrichment: automatic PII detection, column labelling, and quality rule suggestions
  • Unstructured Processing — LLM extraction from PDFs, images, and audio with the same contract validation and lineage
  • Automated Run Logs — Every pipeline run emits structured JSON with row counts, quality scores, durations, and error details — queryable as a Delta table
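Edge case injection is easy to picture with a stdlib-only sketch (the real DataGenerator uses Faker and honors the contract schema; the function and field names here are hypothetical):

```python
import random

# Sketch of edge-case injection: mix valid rows with deliberately
# malformed ones so quarantine paths get exercised in testing.
EDGE_CASE_EMAILS = [
    "'; DROP TABLE users;--",  # SQL injection attempt
    "no-at-sign",              # format confusion
    "",                        # boundary value
]

def generate_rows(n, error_rate=0.2, seed=42):
    rng = random.Random(seed)  # seeded for reproducible test data
    rows = []
    for i in range(n):
        bad = rng.random() < error_rate
        rows.append({
            "customer_id": i,
            "email": rng.choice(EDGE_CASE_EMAILS) if bad else f"user{i}@example.com",
        })
    return rows

rows = generate_rows(10)
print(sum("@" not in r["email"] for r in rows), "malformed rows injected")
```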

✏️ Try it out in Google Colab: Data Generation & AI

Integrations

  • dbt Adapter — Import dbt schema.yml models and sources as LakeLogic contracts — reuse existing dbt definitions without rewriting
  • dlt (Data Load Tool) — Native DltAdapter supporting 100+ verified sources (Stripe, Shopify, SQL databases, Google Analytics, and more) plus declarative REST API ingestion — all with contract-driven quality gates on arrival

✏️ Try it out in Google Colab: Integrations


What a Contract Looks Like

One YAML file replaces hundreds of lines of validation code:

```yaml
version: "1.0"
info:
  title: "Silver Customers"
  domain: "CRM"
  system: "Salesforce"

model:
  fields:
    - name: customer_id
      type: integer
      required: true
    - name: email
      type: string
      pii: true
      masking: "hash"
    - name: status
      type: string

transformations:
  - deduplicate: [customer_id]
  - sql: "SELECT *, UPPER(status) AS status_norm FROM source"
    phase: pre

quality:
  row_rules:
    - sql: "email LIKE '%@%.%'"
    - sql: "status IN ('active', 'churned', 'pending')"
  dataset_rules:
    - unique: customer_id

materialization:
  strategy: merge
  merge_keys: [customer_id]
  format: delta
```
Same contract, any engine — swap engine="polars" for "spark" or "duckdb". Zero code changes.
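Because row_rules are plain SQL predicates, any engine can apply them as-is. A minimal sketch with sqlite3 (standing in for the actual engines; table and column names follow the contract above) shows how one rule splits rows into good and quarantined sets:

```python
import sqlite3

# Apply the contract's row rule "email LIKE '%@%.%'" as a plain SQL
# predicate, splitting source rows into good and quarantined sets.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE source (customer_id INTEGER, email TEXT, status TEXT)")
con.executemany(
    "INSERT INTO source VALUES (?, ?, ?)",
    [(1, "a@x.com", "active"), (2, "bad-email", "active"), (3, "c@y.org", "churned")],
)
rule = "email LIKE '%@%.%'"
good = con.execute(f"SELECT * FROM source WHERE {rule}").fetchall()
bad = con.execute(f"SELECT * FROM source WHERE NOT ({rule})").fetchall()
print(len(good), len(bad))  # 2 1
```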

Analogy: A contract is like a building inspection checklist. The inspector (LakeLogic) checks every room (row) against the blueprint (schema), flags violations (quarantine), and stamps a certificate (lineage) — regardless of whether the building was constructed with bricks (Spark), timber (Polars), or prefab (DuckDB).

What this buys you

| Without LakeLogic | With LakeLogic |
| --- | --- |
| 500+ lines of PySpark/Pandas validation per table | 40 lines of YAML |
| Bad rows silently dropped or crash the pipeline | Bad rows quarantined with error reasons |
| Schema drift discovered in production dashboards | Schema drift caught at ingestion |
| Manual dedup scripts per team | `deduplicate: [key]` — one line |
| PII scattered across notebooks | `pii: true`, `masking: hash` — automatic |
| No audit trail | Every row stamped with run ID, source path, timestamp |

[!TIP] View the Complete Contract Reference for every available configuration option.


Architecture

LakeLogic enforces Data Contracts as quality gates across the Medallion Architecture (Bronze → Silver → Gold).

LakeLogic Architecture

Each layer uses its own contract:

| Layer | Role | Guarantee |
| --- | --- | --- |
| Bronze | Capture everything raw, no validation | Immutable record of source |
| Silver | Full validation, business rules, dedup | Trusted, queryable data |
| Gold | Aggregations, KPIs, ML features | Analytics-ready datasets |
| Quarantine | Failed rows isolated with error reasons | Nothing silently dropped |

Key Guarantee: `source_count = good_count + bad_count` — 100% reconciliation, always.

Examples

For a complete list of runnable guides and end-to-end notebooks, please visit the Examples section of our Documentation.


Documentation

For full guides, API references, tutorials, and contract templates, please visit the LakeLogic Documentation Site.

Contributing

See CONTRIBUTING.md to get started, or docs/installation.md#developer-installation for environment setup.


License

Apache-2.0
