
A Python-based data contract runtime for consistent quality across engines.


LakeLogic

Your data pipeline breaks silently. LakeLogic catches it.

One YAML contract. Any engine. Every row validated, quarantined, or promoted — automatically.



The Problem

You write quality checks in Spark. Then you need to run locally with Polars. Now you're maintaining two codebases. Your bronze layer has no validation. Your silver layer silently drops rows. Nobody knows which records failed or why.

The Solution

# contract.yaml — this is your entire quality gate
version: "1.0"
info:
  title: Silver Customers
  owner: data-team
model:
  fields:
    - name: customer_id
      type: integer
      required: true
    - name: email
      type: string
    - name: revenue
      type: float
    - name: status
      type: string
source:
  type: landing
  path: "data/customers/*.csv"
  load_mode: incremental
quality:
  row_rules:
    - sql: "customer_id IS NOT NULL AND email IS NOT NULL"
    - sql: "status IN ('active', 'churned', 'pending')"
    - sql: "revenue >= 0"
    - sql: "email LIKE '%@%.%'"
materialization:
  strategy: merge
  target_path: "silver/customers"
  format: parquet
  merge_keys: [customer_id]
quarantine:
  enabled: true
  target: "quarantine/customers"
Then run the gate from Python:

from lakelogic import DataProcessor

result = DataProcessor("contract.yaml").run_source()

print(f"✅ Valid: {len(result.good)}  |  ❌ Quarantined: {len(result.bad)}")

Same contract runs on Polars, Spark, DuckDB, or Pandas. Zero code changes.
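To make the promote/quarantine split concrete, here is a minimal plain-Python sketch of the idea. The rule names, lambdas, and `_error_reasons` column are illustrative only; the real library evaluates the contract's SQL rules inside the chosen engine.

```python
# Illustrative sketch: validate rows, tag failures, never silently drop.
# Rule names and the _error_reasons field are hypothetical, not LakeLogic's schema.
rules = [
    ("customer_id and email required", lambda r: r["customer_id"] is not None and r["email"] is not None),
    ("status must be known",           lambda r: r["status"] in ("active", "churned", "pending")),
    ("revenue must be non-negative",   lambda r: r["revenue"] >= 0),
]

rows = [
    {"customer_id": 1, "email": "a@x.com", "revenue": 10.0, "status": "active"},
    {"customer_id": 2, "email": None,      "revenue": 5.0,  "status": "active"},
    {"customer_id": 3, "email": "c@x.com", "revenue": -1.0, "status": "gone"},
]

good, bad = [], []
for row in rows:
    reasons = [name for name, check in rules if not check(row)]
    if reasons:
        bad.append({**row, "_error_reasons": reasons})  # quarantined with reasons
    else:
        good.append(row)                                # promoted
```

Every quarantined row keeps its full payload plus the list of rules it violated, so failures are debuggable after the fact.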


Install

pip install lakelogic                    # Core + Polars
pip install "lakelogic[spark]"           # + PySpark
pip install "lakelogic[delta]"           # + Delta Lake (Spark-free)
pip install "lakelogic[notifications]"   # + Apprise + Jinja2 alerts
pip install "lakelogic[all]"             # Everything

What You Get

🔒 Schema & Quality Gate

Define fields, types, required constraints, and SQL-based rules in YAML. Bad rows are quarantined with tagged error reasons — never silently dropped.

🔄 Engine Portability

One contract, four engines. Develop locally on Polars in milliseconds. Deploy to Spark at scale. Same validation semantics everywhere.
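The portability claim rests on a simple design: the contract describes *what* to check, and each backend supplies *how*. A toy sketch of that dispatch pattern, with made-up engine classes (not LakeLogic's internals):

```python
# Two stand-in "engines" that execute the same rule differently but
# must agree on the result. Class and method names are hypothetical.
class ListEngine:
    def run_rule(self, rows, predicate):
        return [r for r in rows if predicate(r)]

class IterEngine:  # stands in for Spark, DuckDB, etc.
    def run_rule(self, rows, predicate):
        return list(filter(predicate, rows))

rows = [{"revenue": 5}, {"revenue": -2}, {"revenue": 0}]
rule = lambda r: r["revenue"] >= 0  # one rule, engine-agnostic

results = {eng.__class__.__name__: eng.run_rule(rows, rule)
           for eng in (ListEngine(), IterEngine())}
# Both backends return the same validated subset.
```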

📊 Declarative Transformations

Rename, derive, deduplicate, pivot, unpivot, bucket, join, filter, JSON extract, date range explode — all in YAML, all engine-agnostic.
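The shape of such a declarative pipeline is a list of steps applied in order. This toy interpreter shows the mechanics; the step names and keys are illustrative, not LakeLogic's exact transform schema:

```python
# A tiny interpreter for an ordered, declarative transform list.
transforms = [
    {"op": "rename", "from": "rev", "to": "revenue"},
    {"op": "derive", "name": "tier", "fn": lambda r: "high" if r["revenue"] > 100 else "low"},
    {"op": "filter", "fn": lambda r: r["revenue"] >= 0},
]

def apply_transforms(rows, steps):
    for step in steps:
        if step["op"] == "rename":
            rows = [{(step["to"] if k == step["from"] else k): v
                     for k, v in r.items()} for r in rows]
        elif step["op"] == "derive":
            rows = [{**r, step["name"]: step["fn"](r)} for r in rows]
        elif step["op"] == "filter":
            rows = [r for r in rows if step["fn"](r)]
    return rows

out = apply_transforms([{"rev": 150}, {"rev": -5}], transforms)
```

Because each step is data rather than code, the same list can be compiled to Polars expressions, Spark transforms, or SQL.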

🔗 Automatic Lineage

Every row is stamped with _lakelogic_source, _lakelogic_processed_at, and _lakelogic_run_id. Upstream lineage columns are preserved with _upstream_* prefix across layers.
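Stamping those columns is conceptually simple: one run id per batch, one timestamp, one source path. A minimal sketch using the column names from above:

```python
import uuid
from datetime import datetime, timezone

def stamp_lineage(rows, source):
    """Attach lineage columns to every row in a batch (illustrative sketch)."""
    run_id = str(uuid.uuid4())                        # one id shared by the batch
    now = datetime.now(timezone.utc).isoformat()
    return [{**r,
             "_lakelogic_source": source,
             "_lakelogic_processed_at": now,
             "_lakelogic_run_id": run_id} for r in rows]

stamped = stamp_lineage([{"customer_id": 1}], "data/customers/2024.csv")
```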

📦 Incremental Processing

Watermark-based incremental loads, file-mtime tracking, run logs, and CDC support. Process only what's new.
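The core of a watermark-based load fits in a few lines: filter to rows newer than the stored watermark, then advance it. A sketch with an assumed `updated_at` field (the real implementation persists the watermark between runs):

```python
def incremental_load(rows, watermark, ts_field="updated_at"):
    """Keep only rows newer than the watermark, then advance it."""
    fresh = [r for r in rows if r[ts_field] > watermark]
    new_watermark = max((r[ts_field] for r in fresh), default=watermark)
    return fresh, new_watermark

rows = [{"id": 1, "updated_at": "2024-01-01"},
        {"id": 2, "updated_at": "2024-02-01"}]
fresh, wm = incremental_load(rows, "2024-01-15")
```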

🔔 Notifications

Slack, Teams, Email, Discord, and 90+ channels via Apprise. Built-in Jinja2 templates per event. Just add a target URL.

🏗️ Materialization

Write validated data to CSV, Parquet, Delta Lake, or Unity Catalog tables. Supports append, overwrite, merge, and SCD2 strategies.
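SCD2 is the least obvious of those strategies, so here is a simplified sketch of the mechanic: close the current version of a changed row and append a new one. Column names `valid_from`/`valid_to` are the conventional choice, not necessarily LakeLogic's:

```python
def scd2_merge(current, incoming, key, today):
    """Simplified SCD Type 2: close changed rows, append new versions."""
    latest = {r[key]: r for r in current if r["valid_to"] is None}
    out = list(current)
    for row in incoming:
        old = latest.get(row[key])
        changed = old is None or any(old.get(k) != v for k, v in row.items())
        if not changed:
            continue                      # identical to current version: no-op
        if old is not None:
            old["valid_to"] = today       # close the previous version
        out.append({**row, "valid_from": today, "valid_to": None})
    return out

history = [{"customer_id": 1, "status": "active",
            "valid_from": "2024-01-01", "valid_to": None}]
history = scd2_merge(history, [{"customer_id": 1, "status": "churned"}],
                     "customer_id", "2024-06-01")
```

After the merge, the table holds both versions: the old row closed on 2024-06-01 and the new current row with an open `valid_to`.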

🧪 Synthetic Data

Generate realistic test data from any contract: lakelogic generate --contract contract.yaml --rows 1000

🔌 dbt Import

Already using dbt? Convert your schema.yml in one command: lakelogic import-dbt --schema models/schema.yml --output contracts/
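The mapping behind such a conversion is straightforward: dbt column tests translate mechanically into SQL row rules. A hypothetical sketch of that translation (the importer's actual output format may differ):

```python
# Illustrative mapping from dbt column tests to SQL row rules.
def dbt_tests_to_rules(column, tests):
    rules = []
    for t in tests:
        if t == "not_null":
            rules.append(f"{column} IS NOT NULL")
        elif isinstance(t, dict) and "accepted_values" in t:
            vals = ", ".join(f"'{v}'" for v in t["accepted_values"]["values"])
            rules.append(f"{column} IN ({vals})")
    return rules

rules = dbt_tests_to_rules(
    "status",
    ["not_null", {"accepted_values": {"values": ["active", "churned"]}}],
)
```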


Quick Start (5 Minutes)

1. Bootstrap a contract from your data

lakelogic bootstrap --landing data/ --output contracts/

This scans your files, infers schemas, detects PII, and generates ready-to-use contracts.
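Schema inference at its simplest is "try the narrowest type that parses every sample value". A rough sketch of that idea (LakeLogic's actual inference is richer, including PII detection):

```python
def infer_type(values):
    """Very rough column-type inference from sample string values."""
    def all_parse(cast):
        try:
            for v in values:
                cast(v)
            return True
        except ValueError:
            return False
    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "float"
    return "string"

samples = {"customer_id": ["1", "2"], "revenue": ["1.5", "2"], "email": ["a@x.com"]}
schema = {col: infer_type(vals) for col, vals in samples.items()}
```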

2. Run the quality gate

lakelogic run --contract contracts/customers.yaml --source data/customers.csv

3. See the results

✅ Good records: 847 → output/customers_good.parquet
❌ Quarantined:  23  → output/customers_quarantine.parquet
📊 Quality score: 97.4%
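The quality score shown is simply the share of rows that passed, to one decimal place:

```python
good, quarantined = 847, 23
score = round(100 * good / (good + quarantined), 1)  # 847 / 870 -> 97.4
```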

4. Check your environment

lakelogic doctor
LakeLogic Doctor
═══════════════════════════════════════
  Version     : 1.1.0
  Python      : 3.11.7
  OS          : Windows 11

  Engines
  ───────
  ✅ polars    1.18.0
  ✅ duckdb    1.1.3
  ✅ pandas    2.2.1
  ⬚  pyspark  not installed

  Extras
  ──────
  ✅ deltalake  0.22.3
  ✅ jinja2     3.1.4
  ✅ apprise    1.9.0
  ⬚  dataprofiler  not installed
═══════════════════════════════════════

Architecture

┌─────────────────────────────────────────────────────────────┐
│                        Contract YAML                        │
│  schema · SQL quality rules · transforms · lineage · target │
└──────────────────────────────┬──────────────────────────────┘
                               │
                       ┌───────▼───────┐
                       │ DataProcessor │
                       └───────┬───────┘
                               │
         ┌───────────┬─────────┴─────────┬───────────┐
         ▼           ▼                   ▼           ▼
     ┌────────┐  ┌────────┐          ┌────────┐  ┌────────┐
     │ Polars │  │ Spark  │          │ DuckDB │  │ Pandas │
     └───┬────┘  └───┬────┘          └───┬────┘  └───┬────┘
         │           │                   │           │
         └───────────┴─────────┬─────────┴───────────┘
                               │
                    ┌──────────▼─────────┐
                    │   Validated Data   │
                    │  ┌──────┐  ┌─────┐ │
                    │  │ Good │  │ Bad │ │
                    │  └──┬───┘  └──┬──┘ │
                    └─────┼─────────┼────┘
                          │         │
                      ┌───▼────┐ ┌──▼─────────┐
                      │ Target │ │ Quarantine │
                      └────────┘ └────────────┘

Explore the Examples

The examples/ directory contains runnable notebooks across three learning tracks:

Folder                      What You'll Learn
01_quickstart/              Remote CSV ingestion, database governance, dbt + PII quality
02_core_patterns/           Bronze quality gate, medallion architecture, SCD2, deduplication, reference joins, soft deletes
03_compliance_governance/   HIPAA & GDPR Policy Packs, automated PII masking, audit-ready quarantine

Documentation

Contributing

See CONTRIBUTING.md to get started, or docs/installation.md#developer-installation for environment setup.


License

Apache-2.0
