
A Python-based data contract runtime for consistent quality across engines.


LakeLogic

Your data pipeline breaks silently. LakeLogic catches it.

One YAML contract. Any engine. Every row validated, quarantined, or promoted — automatically.



The Problem

You write quality checks in Spark. Then you need to run locally with Polars. Now you're maintaining two codebases. Your bronze layer has no validation. Your silver layer silently drops rows. Nobody knows which records failed or why.

The Solution

```yaml
# contract.yaml — this is your entire quality gate
version: "1.0"
info:
  title: Silver Customers
  owner: data-team
model:
  fields:
    - name: customer_id
      type: integer
      required: true
    - name: email
      type: string
    - name: revenue
      type: float
    - name: status
      type: string
source:
  type: landing
  path: "data/customers/*.csv"
  load_mode: incremental
quality:
  row_rules:
    - sql: "customer_id IS NOT NULL AND email IS NOT NULL"
    - sql: "status IN ('active', 'churned', 'pending')"
    - sql: "revenue >= 0"
    - sql: "email LIKE '%@%.%'"
materialization:
  strategy: merge
  target_path: "silver/customers"
  format: parquet
  merge_keys: [customer_id]
quarantine:
  enabled: true
  target: "quarantine/customers"
```

```python
from lakelogic import DataProcessor

result = DataProcessor("contract.yaml").run_source()

print(f"✅ Valid: {len(result.good)}  |  ❌ Quarantined: {len(result.bad)}")
```

Same contract runs on Polars, Spark, DuckDB, or Pandas. Zero code changes.
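To make the promote-or-quarantine behavior concrete, here is a pure-Python sketch of the same four row rules. This is illustrative only, not LakeLogic's internals, and the `_error_reasons` field name is invented for this example:

```python
# Conceptual sketch of a quality gate: evaluate each row rule and route
# rows to either the good set or the quarantine set, tagging the reason
# for every failure instead of silently dropping the row.
import re

ROW_RULES = [
    ("customer_id and email present", lambda r: r.get("customer_id") is not None and r.get("email") is not None),
    ("status is a known value",       lambda r: r.get("status") in {"active", "churned", "pending"}),
    ("revenue is non-negative",       lambda r: (r.get("revenue") or 0) >= 0),
    ("email looks like an address",   lambda r: re.match(r".+@.+\..+", r.get("email") or "") is not None),
]

def quality_gate(rows):
    good, bad = [], []
    for row in rows:
        reasons = [name for name, rule in ROW_RULES if not rule(row)]
        if reasons:
            bad.append({**row, "_error_reasons": reasons})  # quarantined, never dropped
        else:
            good.append(row)
    return good, bad

rows = [
    {"customer_id": 1, "email": "a@example.com", "revenue": 10.0, "status": "active"},
    {"customer_id": 2, "email": "not-an-email",  "revenue": -5.0, "status": "gone"},
]
good, bad = quality_gate(rows)
```

The second row fails three rules at once, and all three reasons travel with it into quarantine.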


Install

```shell
pip install lakelogic                    # Core + Polars
pip install "lakelogic[spark]"           # + PySpark
pip install "lakelogic[delta]"           # + Delta Lake (Spark-free)
pip install "lakelogic[notifications]"   # + Apprise + Jinja2 alerts
pip install "lakelogic[all]"             # Everything
```

What You Get

🔒 Schema & Quality Gate

Define fields, types, required constraints, and SQL-based rules in YAML. Bad rows are quarantined with tagged error reasons — never silently dropped.

🔄 Engine Portability

One contract, four engines. Develop locally on Polars in milliseconds. Deploy to Spark at scale. Same validation semantics everywhere.
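As a mental model of "one contract, many engines" (this is not LakeLogic's actual adapter code): the rule is declared once, and interchangeable executors apply it with identical semantics, whether execution is row-oriented or columnar.

```python
# Hypothetical sketch of engine portability: one rule, two executors,
# identical results. Real engine adapters are far more involved.

def passes(revenue):                 # the contract's "revenue >= 0" rule
    return revenue >= 0

def row_engine(rows):                # row-at-a-time, Pandas-style
    return [r for r in rows if passes(r["revenue"])]

def columnar_engine(cols):           # column-at-a-time, Polars/DuckDB-style
    keep = [passes(v) for v in cols["revenue"]]
    return {name: [v for v, k in zip(col, keep) if k]
            for name, col in cols.items()}

rows = [{"id": 1, "revenue": 5.0}, {"id": 2, "revenue": -1.0}]
cols = {"id": [1, 2], "revenue": [5.0, -1.0]}
row_result = row_engine(rows)
col_result = columnar_engine(cols)
```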

📊 Declarative Transformations

Rename, derive, deduplicate, pivot, unpivot, bucket, join, filter, JSON extract, date range explode — all in YAML, all engine-agnostic.
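Conceptually, each declarative step compiles to a small function applied in order. A toy sketch, using a list of dicts as a stand-in for a real dataframe (the step names here are illustrative, not LakeLogic's exact YAML schema):

```python
# Toy engine-agnostic transform pipeline: each declarative step is
# interpreted and applied in sequence.
def apply_transforms(rows, steps):
    for step in steps:
        if step["op"] == "rename":
            rows = [{(step["to"] if k == step["from"] else k): v
                     for k, v in r.items()} for r in rows]
        elif step["op"] == "filter":
            rows = [r for r in rows if step["predicate"](r)]
        elif step["op"] == "dedupe":
            seen, out = set(), []
            for r in rows:
                key = tuple(r[k] for k in step["keys"])
                if key not in seen:
                    seen.add(key)
                    out.append(r)
            rows = out
    return rows

rows = [
    {"id": 1, "e_mail": "a@x.com", "revenue": 10.0},
    {"id": 1, "e_mail": "a@x.com", "revenue": 10.0},   # duplicate
    {"id": 2, "e_mail": "b@x.com", "revenue": -3.0},
]
steps = [
    {"op": "rename", "from": "e_mail", "to": "email"},
    {"op": "filter", "predicate": lambda r: r["revenue"] >= 0},
    {"op": "dedupe", "keys": ["id"]},
]
result = apply_transforms(rows, steps)
```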

🔗 Automatic Lineage

Every row is stamped with _lakelogic_source, _lakelogic_processed_at, and _lakelogic_run_id. Upstream lineage columns are preserved with _upstream_* prefix across layers.
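A sketch of what that stamping amounts to, using the column names quoted above (the stamping logic itself is illustrative, not LakeLogic's source):

```python
# Stamp every row with source, processing timestamp, and a run id that
# is shared by all rows of the same run.
import uuid
from datetime import datetime, timezone

def stamp_lineage(rows, source_path):
    run_id = str(uuid.uuid4())                   # one id per run
    ts = datetime.now(timezone.utc).isoformat()  # one timestamp per run
    return [{**r,
             "_lakelogic_source": source_path,
             "_lakelogic_processed_at": ts,
             "_lakelogic_run_id": run_id} for r in rows]

stamped = stamp_lineage([{"customer_id": 1}, {"customer_id": 2}],
                        "data/customers/batch_01.csv")
```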

📦 Incremental Processing

Watermark-based incremental loads, file-mtime tracking, run logs, and CDC support. Process only what's new.
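Watermark-based loading in miniature (an illustrative sketch, not the library's implementation): remember the newest timestamp processed and skip anything at or below it on the next run.

```python
# Keep only rows strictly newer than the stored watermark, then advance it.
def incremental(rows, watermark):
    fresh = [r for r in rows if r["updated_at"] > watermark]
    new_mark = max((r["updated_at"] for r in fresh), default=watermark)
    return fresh, new_mark

batch1 = [{"id": 1, "updated_at": 10}, {"id": 2, "updated_at": 20}]
fresh1, mark = incremental(batch1, watermark=0)   # both rows are new

batch2 = batch1 + [{"id": 3, "updated_at": 30}]   # re-delivered + one new row
fresh2, mark = incremental(batch2, mark)          # only id 3 survives
```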

🔔 Notifications

Slack, Teams, Email, Discord, and 90+ channels via Apprise. Built-in Jinja2 templates per event. Just add a target URL.

🏗️ Materialization

Write validated data to CSV, Parquet, Delta Lake, or Unity Catalog tables. Supports append, overwrite, merge, and SCD2 strategies.
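As a mental model, the merge strategy is an upsert on `merge_keys`. A minimal sketch with plain dicts (conceptual only; real merges into Parquet, Delta, or Unity Catalog happen engine-side):

```python
# Upsert-on-key: incoming rows update matching target rows by key,
# otherwise they are inserted.
def merge(target, incoming, keys):
    by_key = {tuple(r[k] for k in keys): r for r in target}
    for r in incoming:
        by_key[tuple(r[k] for k in keys)] = r  # update if present, else insert
    return list(by_key.values())

target = [{"customer_id": 1, "status": "active"},
          {"customer_id": 2, "status": "pending"}]
incoming = [{"customer_id": 2, "status": "churned"},   # update
            {"customer_id": 3, "status": "active"}]    # insert
merged = merge(target, incoming, keys=["customer_id"])
```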

🧪 Synthetic Data

Generate realistic test data from any contract:

```shell
lakelogic generate --contract contract.yaml --rows 1000
```

🔌 dbt Import

Already using dbt? Convert your schema.yml in one command:

```shell
lakelogic import-dbt --schema models/schema.yml --output contracts/
```


Quick Start (5 Minutes)

1. Bootstrap a contract from your data

```shell
lakelogic bootstrap --landing data/ --output contracts/
```

This scans your files, infers schemas, detects PII, and generates ready-to-use contracts.

2. Run the quality gate

```shell
lakelogic run --contract contracts/customers.yaml --source data/customers.csv
```

3. See the results

```
✅ Good records: 847 → output/customers_good.parquet
❌ Quarantined:  23  → output/customers_quarantine.parquet
📊 Quality score: 97.4%
```
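The reported score is consistent with the share of rows that passed, assuming it is derived as good / (good + quarantined):

```python
# Check the arithmetic behind the reported 97.4% quality score,
# assuming score = good / (good + quarantined).
good, quarantined = 847, 23
score = round(100 * good / (good + quarantined), 1)
```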

4. Check your environment

```shell
lakelogic doctor
```

```
LakeLogic Doctor
═══════════════════════════════════════
  Version     : 0.2.0
  Python      : 3.11.7
  OS          : Windows 11

  Engines
  ───────
  ✅ polars    1.18.0
  ✅ duckdb    1.1.3
  ✅ pandas    2.2.1
  ⬚  pyspark  not installed

  Extras
  ──────
  ✅ deltalake  0.22.3
  ✅ jinja2     3.1.4
  ✅ apprise    1.9.0
  ⬚  dataprofiler  not installed
═══════════════════════════════════════
```

Architecture

```
┌──────────────────────────────────────────────────────────────┐
│                        Contract YAML                         │
│  schema · SQL quality rules · transforms · lineage · target  │
└──────────────────────────────┬───────────────────────────────┘
                               │
                       ┌───────▼───────┐
                       │ DataProcessor │
                       └───────┬───────┘
                               │
       ┌───────────┬───────────┼───────────┬───────────┐
       ▼           ▼           ▼           ▼           │
   ┌────────┐  ┌────────┐  ┌────────┐  ┌────────┐      │
   │ Polars │  │ Spark  │  │ DuckDB │  │ Pandas │      │
   └───┬────┘  └───┬────┘  └───┬────┘  └───┬────┘      │
       │           │           │           │           │
       └───────────┴─────┬─────┴───────────┘           │
                         │                             │
                ┌────────▼────────┐                    │
                │ Validated Data  │                    │
                │ ┌──────┐ ┌─────┐│                    │
                │ │ Good │ │ Bad ││                    │
                │ └──┬───┘ └──┬──┘│                    │
                └────┼────────┼───┘                    │
                     │        │                        │
                 ┌───▼────┐ ┌─▼──────────┐             │
                 │ Target │ │ Quarantine │             │
                 └────────┘ └────────────┘             │
```

Explore the Examples

The examples/ directory contains runnable notebooks across three learning tracks:

| Folder | What You'll Learn |
|---|---|
| 01_quickstart/ | Remote CSV ingestion, database governance, dbt + PII quality |
| 02_core_patterns/ | Bronze quality gate, medallion architecture, SCD2, deduplication, reference joins, soft deletes |
| 03_compliance_governance/ | HIPAA & GDPR Policy Packs, automated PII masking, audit-ready quarantine |

Documentation

Contributing

See CONTRIBUTING.md to get started, or docs/installation.md#developer-installation for environment setup.


License

Apache-2.0

