
A Python-based data contract runtime for consistent quality across engines.


LakeLogic

Your data pipeline breaks silently. LakeLogic catches it.

One YAML contract. Any engine. Every row validated, quarantined, or promoted — automatically.



The Problem

You write quality checks in Spark. Then you need to run locally with Polars. Now you're maintaining two codebases. Your bronze layer has no validation. Your silver layer silently drops rows. Nobody knows which records failed or why.

The Solution

# contract.yaml — this is your entire quality gate
version: "1.0"
info:
  title: Silver Customers
  owner: data-team
model:
  fields:
    - name: customer_id
      type: integer
      required: true
    - name: email
      type: string
    - name: revenue
      type: float
    - name: status
      type: string
source:
  type: landing
  path: "data/customers/*.csv"
  load_mode: incremental
quality:
  row_rules:
    - sql: "customer_id IS NOT NULL AND email IS NOT NULL"
    - sql: "status IN ('active', 'churned', 'pending')"
    - sql: "revenue >= 0"
    - sql: "email LIKE '%@%.%'"
materialization:
  strategy: merge
  target_path: "silver/customers"
  format: parquet
  merge_keys: [customer_id]
quarantine:
  enabled: true
  target: "quarantine/customers"

from lakelogic import DataProcessor

result = DataProcessor("contract.yaml").run_source()

print(f"✅ Valid: {len(result.good)}  |  ❌ Quarantined: {len(result.bad)}")

Same contract runs on Polars, Spark, DuckDB, or Pandas. Zero code changes.


Install

pip install lakelogic                    # Core + Polars
pip install "lakelogic[spark]"           # + PySpark
pip install "lakelogic[delta]"           # + Delta Lake (Spark-free)
pip install "lakelogic[notifications]"   # + Apprise + Jinja2 alerts
pip install "lakelogic[all]"             # Everything

What You Get

🔒 Schema & Quality Gate

Define fields, types, required constraints, and SQL-based rules in YAML. Bad rows are quarantined with tagged error reasons — never silently dropped.
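
Conceptually, the gate evaluates each row against every rule and tags failures. A minimal plain-Python sketch of that idea (illustrative only, not LakeLogic's internals — the rule names and the `_error_reason` column are our assumptions):

```python
from typing import Callable

# Each rule mirrors one `sql:` entry from the contract above, expressed
# here as a Python predicate for illustration.
RULES: dict[str, Callable[[dict], bool]] = {
    "not_null": lambda r: r["customer_id"] is not None and r["email"] is not None,
    "valid_status": lambda r: r["status"] in ("active", "churned", "pending"),
    "non_negative_revenue": lambda r: r["revenue"] >= 0,
}

def quality_gate(rows: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split rows into good/bad; bad rows carry the names of failed rules."""
    good, bad = [], []
    for row in rows:
        failed = [name for name, check in RULES.items() if not check(row)]
        if failed:
            bad.append({**row, "_error_reason": ",".join(failed)})
        else:
            good.append(row)
    return good, bad

rows = [
    {"customer_id": 1, "email": "a@x.com", "status": "active", "revenue": 10.0},
    {"customer_id": None, "email": "b@x.com", "status": "active", "revenue": 5.0},
    {"customer_id": 3, "email": "c@x.com", "status": "unknown", "revenue": -1.0},
]
good, bad = quality_gate(rows)
```

Note that a failing row keeps every reason it failed for, so quarantine output stays debuggable.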

🔄 Engine Portability

One contract, four engines. Develop locally on Polars in milliseconds. Deploy to Spark at scale. Same validation semantics everywhere.

📊 Declarative Transformations

Rename, derive, deduplicate, pivot, unpivot, bucket, join, filter, JSON extract, date range explode — all in YAML, all engine-agnostic.

🔗 Automatic Lineage

Every row is stamped with _lakelogic_source, _lakelogic_processed_at, and _lakelogic_run_id. Upstream lineage columns are preserved with _upstream_* prefix across layers.
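
Conceptually, the stamp is just three extra columns attached per run — a sketch of the idea (not LakeLogic's actual implementation; the ISO-8601 timestamp format and hex run id are assumptions):

```python
import uuid
from datetime import datetime, timezone

def stamp_lineage(rows: list[dict], source: str) -> list[dict]:
    """Add the three lineage columns described above to every row."""
    run_id = uuid.uuid4().hex
    processed_at = datetime.now(timezone.utc).isoformat()
    return [
        {
            **row,
            "_lakelogic_source": source,
            "_lakelogic_processed_at": processed_at,
            "_lakelogic_run_id": run_id,
        }
        for row in rows
    ]

stamped = stamp_lineage([{"customer_id": 1}], source="data/customers/2024.csv")
```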

📦 Incremental Processing

Watermark-based incremental loads, file-mtime tracking, run logs, and CDC support. Process only what's new.
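
The watermark pattern reduces to "keep rows newer than the last high-water mark, then advance it" — a standalone sketch (column name `updated_at` is an assumption, not LakeLogic's API):

```python
def incremental_load(rows: list[dict], watermark: str) -> tuple[list[dict], str]:
    """Keep only rows newer than the watermark, then advance it."""
    new_rows = [r for r in rows if r["updated_at"] > watermark]
    new_watermark = max((r["updated_at"] for r in new_rows), default=watermark)
    return new_rows, new_watermark

rows = [
    {"id": 1, "updated_at": "2024-01-01"},
    {"id": 2, "updated_at": "2024-03-15"},
    {"id": 3, "updated_at": "2024-02-10"},
]
fresh, wm = incremental_load(rows, watermark="2024-01-31")
```

ISO-8601 date strings compare correctly as plain strings, which is why no date parsing is needed here.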

🔔 Notifications

Slack, Teams, Email, Discord, and 90+ channels via Apprise. Built-in Jinja2 templates per event. Just add a target URL.

🏗️ Materialization

Write validated data to CSV, Parquet, Delta Lake, or Unity Catalog tables. Supports append, overwrite, merge, and SCD2 strategies.
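
The `merge` strategy from the contract above is an upsert on `merge_keys`: matching rows are replaced, new rows are appended. A plain-Python sketch of that semantics (not LakeLogic's implementation):

```python
def merge(target: list[dict], incoming: list[dict], keys: list[str]) -> list[dict]:
    """Upsert incoming rows into target on merge keys: update matches, append new."""
    index = {tuple(r[k] for k in keys): r for r in target}
    for row in incoming:
        index[tuple(row[k] for k in keys)] = row
    return list(index.values())

target = [
    {"customer_id": 1, "status": "active"},
    {"customer_id": 2, "status": "pending"},
]
incoming = [
    {"customer_id": 2, "status": "churned"},
    {"customer_id": 3, "status": "active"},
]
merged = merge(target, incoming, keys=["customer_id"])
```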

🧪 Synthetic Data

Generate realistic test data from any contract: lakelogic generate --contract contract.yaml --rows 1000
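
At its core, generation means sampling values that satisfy the contract's schema and row rules. A standalone sketch for the Silver Customers contract above (all helper names are ours, not LakeLogic's):

```python
import random
import string

def synth_rows(n: int, seed: int = 42) -> list[dict]:
    """Generate rows shaped like the Silver Customers contract."""
    rng = random.Random(seed)  # seeded for reproducibility
    statuses = ["active", "churned", "pending"]
    rows = []
    for i in range(1, n + 1):
        name = "".join(rng.choices(string.ascii_lowercase, k=8))
        rows.append({
            "customer_id": i,                            # required, unique
            "email": f"{name}@example.com",              # passes LIKE '%@%.%'
            "revenue": round(rng.uniform(0, 10_000), 2), # passes revenue >= 0
            "status": rng.choice(statuses),              # passes the IN rule
        })
    return rows

sample = synth_rows(1000)
```

Every generated row satisfies the four `row_rules` in the contract, so the synthetic data passes its own quality gate.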

🔌 dbt Import

Already using dbt? Convert your schema.yml in one command: lakelogic import-dbt --schema models/schema.yml --output contracts/


Quick Start (5 Minutes)

1. Bootstrap a contract from your data

lakelogic bootstrap --landing data/ --output contracts/

This scans your files, infers schemas, detects PII, and generates ready-to-use contracts.
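
Schema inference typically tries the narrowest type first and widens on failure. A simplified sketch of that approach (our illustration of the general technique, not LakeLogic's bootstrap code):

```python
def infer_type(values: list[str]) -> str:
    """Infer a contract field type from string samples: int, then float, then string."""
    for candidate, cast in (("integer", int), ("float", float)):
        try:
            for v in values:
                cast(v)
            return candidate
        except ValueError:
            continue  # widen to the next candidate type
    return "string"

def infer_schema(header: list[str], sample: list[list[str]]) -> list[dict]:
    """Build contract-style field entries from a CSV header plus sample rows."""
    return [
        {"name": col, "type": infer_type([row[i] for row in sample])}
        for i, col in enumerate(header)
    ]

schema = infer_schema(
    ["customer_id", "email", "revenue"],
    [["1", "a@x.com", "10.5"], ["2", "b@x.com", "7"]],
)
```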

2. Run the quality gate

lakelogic run --contract contracts/customers.yaml --source data/customers.csv

3. See the results

✅ Good records: 847 → output/customers_good.parquet
❌ Quarantined:  23  → output/customers_quarantine.parquet
📊 Quality score: 97.4%
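
The quality score shown above is consistent with the good-row fraction (our assumption about the formula):

```python
good, quarantined = 847, 23

# Quality score as the share of rows that passed the gate.
score = round(good / (good + quarantined) * 100, 1)
print(f"Quality score: {score}%")  # prints "Quality score: 97.4%"
```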

4. Check your environment

lakelogic doctor
LakeLogic Doctor
═══════════════════════════════════════
  Version     : 0.2.0
  Python      : 3.11.7
  OS          : Windows 11

  Engines
  ───────
  ✅ polars    1.18.0
  ✅ duckdb    1.1.3
  ✅ pandas    2.2.1
  ⬚  pyspark  not installed

  Extras
  ──────
  ✅ deltalake  0.22.3
  ✅ jinja2     3.1.4
  ✅ apprise    1.9.0
  ⬚  dataprofiler  not installed
═══════════════════════════════════════

Architecture

┌───────────────────────────────────────────────────────────────┐
│                         Contract YAML                         │
│  schema · SQL quality rules · transforms · lineage · target   │
└───────────────────────────────┬───────────────────────────────┘
                                │
                        ┌───────▼───────┐
                        │ DataProcessor │
                        └───────┬───────┘
                                │
        ┌───────────┬───────────┼───────────┬───────────┐
        ▼           ▼           ▼           ▼           │
    ┌────────┐  ┌────────┐  ┌────────┐  ┌────────┐      │
    │ Polars │  │ Spark  │  │ DuckDB │  │ Pandas │      │
    └───┬────┘  └───┬────┘  └───┬────┘  └───┬────┘      │
        │           │           │           │           │
        └───────────┴─────┬─────┴───────────┘           │
                          │                             │
                 ┌────────▼────────┐                    │
                 │ Validated Data  │                    │
                 │  ┌────┐ ┌─────┐ │                    │
                 │  │Good│ │ Bad │ │                    │
                 │  └──┬─┘ └──┬──┘ │                    │
                 └─────┼──────┼────┘                    │
                       │      │                         │
                 ┌─────▼┐  ┌──▼────────┐                │
                 │Target│  │Quarantine │                │
                 └──────┘  └───────────┘                │

Explore the Examples

The examples/ directory contains runnable notebooks across three learning tracks:

Folder                      What You'll Learn
01_quickstart/              Remote CSV ingestion, database governance, dbt + PII quality
02_core_patterns/           Bronze quality gate, medallion architecture, SCD2, deduplication, reference joins, soft deletes
03_compliance_governance/   HIPAA & GDPR Policy Packs, automated PII masking, audit-ready quarantine

Documentation

Contributing

See CONTRIBUTING.md to get started, or docs/installation.md#developer-installation for environment setup.


License

Apache-2.0
