
A Python-based data contract runtime for consistent quality across engines.


LakeLogic

Your data pipeline breaks silently. LakeLogic catches it.

One YAML contract. Any engine. Every row validated, quarantined, or promoted — automatically.



The Problem

You write quality checks in Spark. Then you need to run locally with Polars. Now you're maintaining two codebases. Your bronze layer has no validation. Your silver layer silently drops rows. Nobody knows which records failed or why.

The Solution

# contract.yaml — this is your entire quality gate
version: "1.0"
info:
  title: Silver Customers
  owner: data-team
model:
  fields:
    - name: customer_id
      type: integer
      required: true
    - name: email
      type: string
    - name: revenue
      type: float
    - name: status
      type: string
source:
  type: landing
  path: "data/customers/*.csv"
  load_mode: incremental
quality:
  row_rules:
    - sql: "customer_id IS NOT NULL AND email IS NOT NULL"
    - sql: "status IN ('active', 'churned', 'pending')"
    - sql: "revenue >= 0"
    - sql: "email LIKE '%@%.%'"
materialization:
  strategy: merge
  target_path: "silver/customers"
  format: parquet
  merge_keys: [customer_id]
quarantine:
  enabled: true
  target: "quarantine/customers"
Then run it from Python:

from lakelogic import DataProcessor

result = DataProcessor("contract.yaml").run_source()

print(f"✅ Valid: {len(result.good)}  |  ❌ Quarantined: {len(result.bad)}")

Same contract runs on Polars, Spark, DuckDB, or Pandas. Zero code changes.
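Under the hood the row rules are ordinary SQL predicates, which is what makes them portable across engines. As a rough illustration of the good/bad split — not LakeLogic's actual implementation — the same predicate text can partition rows in any SQL engine, here the stdlib sqlite3:

```python
import sqlite3

# Hypothetical sample rows — LakeLogic would read these from the contract's
# source path; the data here is made up for illustration.
rows = [
    (1, "a@x.com", 10.0, "active"),    # passes every rule
    (2, None,      5.0,  "active"),    # fails: email IS NOT NULL
    (3, "b@y.org", -2.0, "churned"),   # fails: revenue >= 0
    (4, "c@z.io",  None, "pending"),   # NULL revenue: rule evaluates to NULL
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers (customer_id INT, email TEXT, revenue REAL, status TEXT)")
con.executemany("INSERT INTO customers VALUES (?, ?, ?, ?)", rows)

# Row rules from the contract, combined verbatim.
rule = "customer_id IS NOT NULL AND email IS NOT NULL AND revenue >= 0"

good = con.execute(f"SELECT * FROM customers WHERE {rule}").fetchall()
# COALESCE guards SQL's three-valued logic: a NULL rule result (row 4)
# must land in quarantine rather than vanish from both sets.
bad = con.execute(f"SELECT * FROM customers WHERE COALESCE(({rule}), 0) = 0").fetchall()

print(len(good), len(bad))  # → 1 3
```

The `COALESCE` wrapper is the interesting part: without it, a rule that evaluates to NULL would fall out of both `WHERE rule` and `WHERE NOT rule`, silently losing the row.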


Install

pip install lakelogic                    # Core + Polars
pip install "lakelogic[spark]"           # + PySpark
pip install "lakelogic[delta]"           # + Delta Lake (Spark-free)
pip install "lakelogic[notifications]"   # + Apprise + Jinja2 alerts
pip install "lakelogic[all]"             # Everything

What You Get

🔒 Schema & Quality Gate

Define fields, types, required constraints, and SQL-based rules in YAML. Bad rows are quarantined with tagged error reasons — never silently dropped.
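The tagging idea can be sketched in a few lines of plain Python; the rule names and the `_error_reason` column below are illustrative assumptions, not LakeLogic's documented names:

```python
# Illustrative sketch of quarantine tagging — not LakeLogic's implementation.
RULES = [
    ("status_valid",   lambda r: r["status"] in {"active", "churned", "pending"}),
    ("revenue_nonneg", lambda r: r["revenue"] >= 0),
]

def partition(rows):
    """Split rows into (good, bad); bad rows carry every rule they violated."""
    good, bad = [], []
    for row in rows:
        failed = [name for name, check in RULES if not check(row)]
        if failed:
            bad.append({**row, "_error_reason": ",".join(failed)})
        else:
            good.append(row)
    return good, bad

good, bad = partition([
    {"customer_id": 1, "status": "active",  "revenue": 10.0},
    {"customer_id": 2, "status": "deleted", "revenue": -1.0},
])
print(bad[0]["_error_reason"])  # → status_valid,revenue_nonneg
```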

🔄 Engine Portability

One contract, four engines. Develop locally on Polars in milliseconds. Deploy to Spark at scale. Same validation semantics everywhere.

📊 Declarative Transformations

Rename, derive, deduplicate, pivot, unpivot, bucket, join, filter, JSON extract, date range explode — all in YAML, all engine-agnostic.

🔗 Automatic Lineage

Every row is stamped with _lakelogic_source, _lakelogic_processed_at, and _lakelogic_run_id. Upstream lineage columns are preserved with _upstream_* prefix across layers.
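The stamping itself is conceptually simple. This sketch uses the column names documented above, but the implementation is illustrative, not LakeLogic's:

```python
import uuid
from datetime import datetime, timezone

def stamp_lineage(rows, source_path):
    # One run id per batch, so every row of a run can be traced together.
    run_id = str(uuid.uuid4())
    processed_at = datetime.now(timezone.utc).isoformat()
    return [
        {**row,
         "_lakelogic_source": source_path,
         "_lakelogic_processed_at": processed_at,
         "_lakelogic_run_id": run_id}
        for row in rows
    ]

stamped = stamp_lineage([{"customer_id": 1}, {"customer_id": 2}],
                        "data/customers/a.csv")
```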

📦 Incremental Processing

Watermark-based incremental loads, file-mtime tracking, run logs, and CDC support. Process only what's new.
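A minimal sketch of the watermark idea, assuming a sortable `updated_at` field (illustrative, not LakeLogic's internals):

```python
# Remember the highest timestamp processed so far and only accept
# rows strictly newer than it on the next run.
def incremental_load(rows, watermark, ts_field="updated_at"):
    fresh = [r for r in rows if watermark is None or r[ts_field] > watermark]
    new_watermark = max((r[ts_field] for r in fresh), default=watermark)
    return fresh, new_watermark

rows = [
    {"customer_id": 1, "updated_at": "2024-01-01"},
    {"customer_id": 2, "updated_at": "2024-01-02"},
    {"customer_id": 3, "updated_at": "2024-01-03"},
]
fresh, wm = incremental_load(rows, watermark="2024-01-01")
print(len(fresh), wm)  # → 2 2024-01-03
```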

🔔 Notifications

Slack, Teams, Email, Discord, and 90+ channels via Apprise. Built-in Jinja2 templates per event. Just add a target URL.
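A contract snippet for this might look like the following — the `notifications` key names here are guesses for illustration, not LakeLogic's documented schema, though the target URLs use Apprise's real URL formats:

```yaml
# Illustrative only: key names are assumptions, not the documented schema.
notifications:
  on_quarantine:
    targets:
      - "slack://TokenA/TokenB/TokenC"        # Apprise Slack webhook format
      - "mailto://user:password@example.com"  # Apprise email format
```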

🏗️ Materialization

Write validated data to CSV, Parquet, Delta Lake, or Unity Catalog tables. Supports append, overwrite, merge, and SCD2 strategies.
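Of these, SCD2 is the least obvious strategy. The sketch below shows the core move — close the current version of a record, open a new one — using illustrative history columns (`valid_from`/`valid_to`) that are not necessarily LakeLogic's:

```python
from datetime import date

# Minimal SCD2 merge sketch — illustrative only; LakeLogic's real history
# columns and change detection are defined by the contract, not shown here.
def scd2_merge(history, incoming, key="customer_id", today=None):
    today = today or date.today().isoformat()
    merged = [dict(r) for r in history]  # copy so history isn't mutated
    open_rows = {r[key]: r for r in merged if r["valid_to"] is None}
    for rec in incoming:
        live = open_rows.get(rec[key])
        if live and live["status"] == rec["status"]:
            continue                          # unchanged: keep open version
        if live:
            live["valid_to"] = today          # close the superseded version
        merged.append({**rec, "valid_from": today, "valid_to": None})
    return merged

history = [{"customer_id": 1, "status": "active",
            "valid_from": "2024-01-01", "valid_to": None}]
merged = scd2_merge(history, [{"customer_id": 1, "status": "churned"}],
                    today="2024-06-01")
print(len(merged))  # → 2
```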

🧪 Synthetic Data

Generate realistic test data from any contract: lakelogic generate --contract contract.yaml --rows 1000

🔌 dbt Import

Already using dbt? Convert your schema.yml in one command: lakelogic import-dbt --schema models/schema.yml --output contracts/


Quick Start (5 Minutes)

1. Bootstrap a contract from your data

lakelogic bootstrap --landing data/ --output contracts/

This scans your files, infers schemas, detects PII, and generates ready-to-use contracts.

2. Run the quality gate

lakelogic run --contract contracts/customers.yaml --source data/customers.csv

3. See the results

✅ Good records: 847 → output/customers_good.parquet
❌ Quarantined:  23  → output/customers_quarantine.parquet
📊 Quality score: 97.4%
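The quality score is the share of rows that passed validation, consistent with the counts above:

```python
good, bad = 847, 23  # counts from the sample run above
score = round(100 * good / (good + bad), 1)
print(f"Quality score: {score}%")  # → Quality score: 97.4%
```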

4. Check your environment

lakelogic doctor
LakeLogic Doctor
═══════════════════════════════════════
  Version     : 0.2.0
  Python      : 3.11.7
  OS          : Windows 11

  Engines
  ───────
  ✅ polars    1.18.0
  ✅ duckdb    1.1.3
  ✅ pandas    2.2.1
  ⬚  pyspark  not installed

  Extras
  ──────
  ✅ deltalake  0.22.3
  ✅ jinja2     3.1.4
  ✅ apprise    1.9.0
  ⬚  dataprofiler  not installed
═══════════════════════════════════════

Architecture

┌──────────────────────────────────────────────────────────────────┐
│                          Contract YAML                           │
│    schema · SQL quality rules · transforms · lineage · target    │
└────────────────────────────┬─────────────────────────────────────┘
                             │
                      ┌──────▼───────┐
                      │ DataProcessor│
                      └──────┬───────┘
                             │
        ┌────────────┬───────┼───────┬────────────┐
        ▼            ▼       ▼       ▼            │
   ┌────────┐  ┌────────┐ ┌───────┐ ┌────────┐   │
   │ Polars │  │ Spark  │ │DuckDB │ │ Pandas │   │
   └───┬────┘  └───┬────┘ └──┬────┘ └───┬────┘   │
       │           │         │          │         │
       └───────────┴────┬────┴──────────┘         │
                        │                         │
               ┌────────▼────────┐                │
                │  Validated Data │                │
               │  ┌────┐ ┌─────┐ │                │
               │  │Good│ │ Bad │ │                │
               │  └──┬─┘ └──┬──┘ │                │
               └─────┼──────┼────┘                │
                     │      │                     │
               ┌─────▼┐  ┌──▼────────┐            │
               │Target│  │Quarantine │            │
               └──────┘  └───────────┘            │

Explore the Examples

The examples/ directory contains 24 runnable notebooks:

Category        What You'll Learn
Quickstart      Your first contract in 5 minutes, database governance, dbt+PII
Core Patterns   Medallion architecture, bronze quality gates, SCD2, deduplication, soft deletes
Advanced        Insurance ELT, GDPR compliance, late-arriving data, external logic, streaming, synthetic data
Compliance      HIPAA PII masking

Documentation

Contributing

See CONTRIBUTING.md to get started, or docs/installation.md#developer-installation for environment setup.


License

Apache-2.0
