A Python-based data contract runtime for consistent quality across engines.

LakeLogic

Your data pipeline breaks silently. LakeLogic catches it.

One YAML contract. Any engine. Every row validated, quarantined, or promoted — automatically.

The Problem

You write quality checks in Spark. Then you need to run locally with Polars. Now you're maintaining two codebases. Your bronze layer has no validation. Your silver layer silently drops rows. Nobody knows which records failed or why.

The Solution

# contract.yaml — this is your entire quality gate
version: "1.0"
info:
  title: Silver Customers
  owner: data-team
model:
  fields:
    - name: customer_id
      type: integer
      required: true
    - name: email
      type: string
    - name: revenue
      type: float
    - name: status
      type: string
source:
  type: landing
  path: "data/customers/*.csv"
  load_mode: incremental
quality:
  row_rules:
    - sql: "customer_id IS NOT NULL AND email IS NOT NULL"
    - sql: "status IN ('active', 'churned', 'pending')"
    - sql: "revenue >= 0"
    - sql: "email LIKE '%@%.%'"
materialization:
  strategy: merge
  target_path: "silver/customers"
  format: parquet
  merge_keys: [customer_id]
quarantine:
  enabled: true
  target: "quarantine/customers"

from lakelogic import DataProcessor

result = DataProcessor("contract.yaml").run_source()

print(f"✅ Valid: {len(result.good)}  |  ❌ Quarantined: {len(result.bad)}")

Same contract runs on Polars, Spark, DuckDB, or Pandas. Zero code changes.
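The portability claim rests on rules being declared as SQL predicates rather than engine-specific code. As a rough illustration of the idea (not LakeLogic internals — the dataset and comparison are made up for this sketch), the same predicate yields identical pass/fail sets whether an SQL engine or plain Python evaluates it:

```python
import sqlite3

rows = [
    {"customer_id": 1, "status": "active"},
    {"customer_id": 2, "status": "deleted"},   # violates the rule
    {"customer_id": 3, "status": "pending"},
]
rule = "status IN ('active', 'churned', 'pending')"

# "Engine" A: SQLite evaluates the predicate as SQL.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (customer_id INTEGER, status TEXT)")
con.executemany("INSERT INTO t VALUES (?, ?)",
                [(r["customer_id"], r["status"]) for r in rows])
sql_pass = {cid for (cid,) in
            con.execute(f"SELECT customer_id FROM t WHERE {rule}")}

# "Engine" B: plain Python evaluates the same logical predicate.
py_pass = {r["customer_id"] for r in rows
           if r["status"] in ("active", "churned", "pending")}

assert sql_pass == py_pass == {1, 3}
```

Because the contract only carries the predicate, each backend can compile it to its own native evaluation without changing the contract.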


Install

pip install lakelogic                    # Core + Polars
pip install "lakelogic[spark]"           # + PySpark
pip install "lakelogic[delta]"           # + Delta Lake (Spark-free)
pip install "lakelogic[notifications]"   # + Apprise + Jinja2 alerts
pip install "lakelogic[all]"             # Everything

What You Get

🔒 Schema & Quality Gate

Define fields, types, required constraints, and SQL-based rules in YAML. Bad rows are quarantined with tagged error reasons — never silently dropped.
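A minimal sketch of the quarantine idea, with hypothetical rule names and a simplified row model (LakeLogic's actual internals differ): every row is checked against every rule, and failing rows are kept with the reasons attached.

```python
# Named checks standing in for the contract's SQL row rules.
rules = {
    "revenue_non_negative": lambda r: r["revenue"] >= 0,
    "status_known": lambda r: r["status"] in ("active", "churned", "pending"),
}
rows = [
    {"customer_id": 1, "revenue": 120.0, "status": "active"},
    {"customer_id": 2, "revenue": -5.0, "status": "unknown"},
]

good, bad = [], []
for row in rows:
    reasons = [name for name, check in rules.items() if not check(row)]
    if reasons:
        bad.append({**row, "_error_reasons": reasons})  # tagged, not dropped
    else:
        good.append(row)
```

Here customer 2 lands in `bad` with both rule names recorded, so downstream consumers can see exactly why each record was held back.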

🔄 Engine Portability

One contract, four engines. Develop locally on Polars in milliseconds. Deploy to Spark at scale. Same validation semantics everywhere.

📊 Declarative Transformations

Rename, derive, deduplicate, pivot, unpivot, bucket, join, filter, JSON extract, date range explode — all in YAML, all engine-agnostic.

🔗 Automatic Lineage

Every row is stamped with _lakelogic_source, _lakelogic_processed_at, and _lakelogic_run_id. Upstream lineage columns are preserved with _upstream_* prefix across layers.
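The stamping itself can be pictured like this — the column names come from the docs above, but the helper and its logic are an illustrative assumption, not LakeLogic source:

```python
import uuid
from datetime import datetime, timezone

def stamp_lineage(rows, source_path):
    """Attach lineage columns to every row of a batch (hypothetical helper)."""
    run_id = str(uuid.uuid4())
    now = datetime.now(timezone.utc).isoformat()
    return [
        {**row,
         "_lakelogic_source": source_path,
         "_lakelogic_processed_at": now,
         "_lakelogic_run_id": run_id}
        for row in rows
    ]

stamped = stamp_lineage([{"customer_id": 1}], "data/customers/2024-01.csv")
```

One `run_id` per batch means every row written in a run can be traced back to the exact execution that produced it.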

📦 Incremental Processing

Watermark-based incremental loads, file-mtime tracking, run logs, and CDC support. Process only what's new.
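The watermark mechanism can be sketched as follows — a simplified model using file mtimes; `select_new_files` is a hypothetical helper, not part of the library:

```python
def select_new_files(files, last_watermark):
    """files: list of (path, mtime); return files newer than the watermark
    plus the advanced watermark to persist for the next run."""
    fresh = [(path, mtime) for path, mtime in files if mtime > last_watermark]
    new_watermark = max((mtime for _, mtime in fresh), default=last_watermark)
    return fresh, new_watermark

files = [("a.csv", 100), ("b.csv", 250), ("c.csv", 300)]
fresh, wm = select_new_files(files, last_watermark=200)
# only b.csv and c.csv are reprocessed; the watermark advances to 300
```

Persisting the returned watermark between runs is what makes the load incremental: an unchanged file never crosses the threshold twice.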

🔔 Notifications

Slack, Teams, Email, Discord, and 90+ channels via Apprise. Built-in Jinja2 templates per event. Just add a target URL.

🏗️ Materialization

Write validated data to CSV, Parquet, Delta Lake, or Unity Catalog tables. Supports append, overwrite, merge, and SCD2 strategies.
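The merge strategy is essentially an upsert keyed on `merge_keys`. A toy sketch with plain dicts (a hypothetical helper, not LakeLogic's writer): incoming rows replace existing rows that share the key, and new keys are appended.

```python
def merge(existing, incoming, key):
    """Upsert incoming rows into existing rows by a single merge key."""
    merged = {row[key]: row for row in existing}
    merged.update({row[key]: row for row in incoming})  # incoming wins
    return sorted(merged.values(), key=lambda r: r[key])

existing = [{"customer_id": 1, "revenue": 10.0},
            {"customer_id": 2, "revenue": 20.0}]
incoming = [{"customer_id": 2, "revenue": 25.0},   # update
            {"customer_id": 3, "revenue": 30.0}]   # insert
result = merge(existing, incoming, "customer_id")
```

SCD2 differs in that instead of overwriting, the old version of a changed row is closed out with validity timestamps and the new version is appended alongside it.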

🧪 Synthetic Data

Generate realistic test data from any contract: lakelogic generate --contract contract.yaml --rows 1000

🔌 dbt Import

Already using dbt? Convert your schema.yml in one command: lakelogic import-dbt --schema models/schema.yml --output contracts/


Quick Start (5 Minutes)

1. Bootstrap a contract from your data

lakelogic bootstrap --landing data/ --output contracts/

This scans your files, infers schemas, detects PII, and generates ready-to-use contracts.
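Schema inference of this kind can be approximated in a few lines. This sketch (assumed behavior — the real command also detects PII and writes complete contracts) just maps sampled CSV values to the contract's type names:

```python
import csv
import io

def infer_field_type(values):
    """Pick the narrowest type name that parses every sampled value."""
    for cast, name in ((int, "integer"), (float, "float")):
        try:
            for v in values:
                cast(v)
            return name
        except ValueError:
            continue
    return "string"

sample = io.StringIO("customer_id,email,revenue\n1,a@x.com,9.5\n2,b@y.com,12\n")
columns = {}
for row in csv.DictReader(sample):
    for col, val in row.items():
        columns.setdefault(col, []).append(val)
schema = {col: infer_field_type(vals) for col, vals in columns.items()}
# schema maps customer_id -> integer, email -> string, revenue -> float
```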

2. Run the quality gate

lakelogic run --contract contracts/customers.yaml --source data/customers.csv

3. See the results

✅ Good records: 847 → output/customers_good.parquet
❌ Quarantined:   23 → output/customers_quarantine.parquet
📊 Quality score: 97.4%
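The quality score is simply the share of rows that passed the gate:

```python
good, quarantined = 847, 23
score = 100 * good / (good + quarantined)
print(f"{score:.1f}%")  # → 97.4%
```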

4. Check your environment

lakelogic doctor
LakeLogic Doctor
═══════════════════════════════════════
  Version     : 0.8.0
  Python      : 3.11.7
  OS          : Windows 11

  Engines
  ───────
  ✅ polars    1.18.0
  ✅ duckdb    1.1.3
  ✅ pandas    2.2.1
  ⬚  pyspark  not installed

  Extras
  ──────
  ✅ deltalake  0.22.3
  ✅ jinja2     3.1.4
  ✅ apprise    1.9.0
  ⬚  dataprofiler  not installed
═══════════════════════════════════════

Architecture

┌──────────────────────────────────────────────────────────────────┐
│                         Contract YAML                           │
│  schema · SQL quality rules · transforms · lineage · target     │
└────────────────────────────┬─────────────────────────────────────┘
                             │
                      ┌──────▼──────┐
                      │DataProcessor│
                      └──────┬──────┘
                             │
        ┌────────────┬───────┼───────┬────────────┐
        ▼            ▼       ▼       ▼            │
   ┌────────┐  ┌────────┐ ┌───────┐ ┌────────┐   │
   │ Polars │  │ Spark  │ │DuckDB │ │ Pandas │   │
   └───┬────┘  └───┬────┘ └──┬────┘ └───┬────┘   │
       │           │         │          │         │
       └───────────┴────┬────┴──────────┘         │
                        │                         │
               ┌────────▼────────┐                │
               │ Validated Data  │
               │  ┌────┐ ┌─────┐ │                │
               │  │Good│ │ Bad │ │                │
               │  └──┬─┘ └──┬──┘ │                │
               └─────┼──────┼────┘                │
                     │      │                     │
               ┌─────▼┐  ┌──▼────────┐            │
               │Target│  │Quarantine │            │
               └──────┘  └───────────┘            │

Explore the Examples

The examples/ directory contains 24 runnable notebooks:

Category        What You'll Learn
─────────       ─────────────────
Quickstart      Your first contract in 5 minutes, database governance, dbt+PII
Core Patterns   Medallion architecture, bronze quality gates, SCD2, deduplication, soft deletes
Advanced        Insurance ELT, GDPR compliance, late-arriving data, external logic, streaming, synthetic data
Compliance      HIPAA PII masking

Contributing

See CONTRIBUTING.md to get started, or docs/installation.md#developer-installation for environment setup.


License

Apache-2.0
