A Python-based data contract runtime for consistent quality across engines.

LakeLogic

Your data estate. Under Contract.

Catch breaking data changes before they reach production. One YAML contract. Any engine. Every row validated, quarantined, or promoted — automatically.


🌐 Data Mesh Alignment

LakeLogic is built for the decentralized data estate, directly supporting the four pillars of Data Mesh:

| Pillar | How LakeLogic Delivers |
|---|---|
| Domain Ownership | Contracts are owned and defined by domain teams (e.g., CRM, Finance) who know the data best. |
| Data as a Product | Contracts serve as the explicit "product interface," guaranteeing quality for consumers. |
| Self-Serve Platform | A standardized runtime that any team can use to deploy quality gates without infra silos. |
| Federated Governance | Global standards (e.g., PII masking) are defined centrally but enforced locally at every layer. |

Quick Start (60 Seconds)

pip install "lakelogic[all]"

1. Bootstrap a contract

lakelogic bootstrap --landing data/ --output contracts/ --ai

Scans data, infers schemas, detects PII, and generates rules using AI.

2. Run the quality gate

lakelogic run --contract contracts/customers.yaml --source data/customers.csv

3. Or use Python directly

from lakelogic import DataProcessor

result = DataProcessor("contract.yaml").run_source()
print(f"Valid: {result.good_count}  |  Quarantined: {result.bad_count}")

Contract Example

This single YAML file replaces hundreds of lines of validation code:

# REQUIRED: Contract version for compatibility tracking
version: "1.0"

# REQUIRED: Metadata — who owns this data and where it lives in the org
info:
  title: Silver Customers                 # Human-readable name for logs and monitoring
  owner: data-team                        # Team responsible for this contract
  domain: CRM                             # Data mesh domain (CRM, Finance, Marketing...)
  system: Salesforce                      # Source system this data originates from
  classification: "confidential"          # Data sensitivity: public | internal | confidential | restricted
  status: "production"                    # Lifecycle stage: development | staging | production | deprecated

# OPTIONAL: Custom tags for governance, cost tracking, and SLA enforcement
metadata:
  pii_present: true                       # Flags this dataset as containing personal data
  retention_days: 2555                    # Operational retention policy (7 years) — used by automated purge jobs
  sla_tier: "tier1"                       # SLA priority: tier1 = critical (< 4hr response)

# REQUIRED: Schema definition — expected columns, types, and constraints
# Field descriptions serve two purposes:
#   1. Business documentation — so analysts understand each field without asking
#   2. LLM context — used by `lakelogic bootstrap --ai` to generate smarter rules
model:
  fields:
    - name: customer_id
      type: integer
      required: true                      # Generates automatic NOT NULL quality rule
      description: "Unique identifier for each customer record"
    - name: email
      type: string
      pii: true                           # Marks as personally identifiable — enables auto-masking
      description: "Primary email address used for account login and communications"
    - name: revenue
      type: float
      description: "Lifetime revenue attributed to this customer in base currency"
    - name: status
      type: string
      description: "Current account state: active, churned, or pending onboarding"

# OPTIONAL: Schema evolution and unknown field handling
schema_policy:
  evolution: "strict"                     # Schema change behavior: strict | compatible | allow
  unknown_fields: "quarantine"            # Unknown columns: quarantine | drop | allow

# REQUIRED: Where to load data from (supports files, S3, ADLS, databases)
source:
  type: landing                           # Acquisition pattern: landing (files) | table (DB) | stream (Kafka)
  path: "data/customers/*.csv"            # Glob pattern — also supports s3://, abfss://, Unity Catalog tables
  load_mode: incremental                  # Only process new/changed data: full | incremental | cdc

# OPTIONAL: Reference data for joins and enrichment
links:
  - name: "dim_countries"                  # Logical name used in lookup/join transformations
    path: "./reference/countries.parquet"   # File path, S3 URI, or Unity Catalog table
    type: "parquet"                         # Format: parquet | csv | table
    broadcast: true                        # Broadcast join for small dimensions (Spark)

# OPTIONAL: Environment-specific overrides (activate via LAKELOGIC_ENV)
environments:
  dev:
    path: "dev/customers"                  # Cheaper storage for development
    format: "parquet"
  prod:
    path: "s3://prod-lake/silver/customers"
    format: "delta"

# OPTIONAL: Data transformations — pre (before validation) and post (after validation)
transformations:
  - rename:                               # Fix source naming drift before schema checks
      from: "cust_id"
      to: "customer_id"
    phase: "pre"                          # PRE = applied before quality rules run
  - deduplicate:                          # Keep most recent record per business key
      columns: ["customer_id"]
      order_by: "updated_at"
  - sql: |                                # Full SQL for complex enrichment logic
      SELECT *, UPPER(status) as status_code,
        revenue * 0.1 as tax_estimate
      FROM source
    phase: "post"                         # POST = applied after validation, on good data only

# OPTIONAL: Quality rules — rows that fail are quarantined, not silently dropped
quality:
  row_rules:                              # Row-level: each row evaluated independently
    - sql: "customer_id IS NOT NULL AND email IS NOT NULL"   # Completeness check
    - sql: "status IN ('active', 'churned', 'pending')"     # Enum validation
    - sql: "revenue >= 0"                                    # Range validation
    - sql: "email LIKE '%@%.%'"                              # Format validation
  dataset_rules:                          # Dataset-level: aggregate checks on all good rows
    - unique: "customer_id"               # No duplicate business keys

# OPTIONAL: Data provenance and audit trail
lineage:
  enabled: true                           # Stamps every row with run_id, source path, timestamps

# REQUIRED: Output — where and how to write validated data
materialization:
  strategy: merge                         # Write mode: overwrite | append | merge (upsert)
  target_path: "silver/customers"         # Destination path (also supports Unity Catalog table names)
  format: delta                           # Storage format: delta | parquet | iceberg | csv
  merge_keys: [customer_id]              # Business keys for merge/upsert operations
  partition_by:                           # Partition columns for query performance
    - "country"
    - "created_date"
  cluster_by: ["customer_id"]            # Clustering columns (Delta/Iceberg optimization)
  reprocess_policy: "overwrite_partition" # Idempotent re-runs: overwrite_partition | append | fail

# OPTIONAL: Soft deletes — GDPR "right to erasure" without losing audit trail
soft_deletes:
  enabled: true                           # Mark rows as deleted instead of hard-deleting
  flag_field: "_is_deleted"               # Boolean column added to target table
  reason_field: "_delete_reason"          # e.g. "GDPR request", "duplicate"
  timestamp_field: "_deleted_at"          # When the deletion was recorded

# OPTIONAL: Quarantine — isolate failed rows with error reasons for replay
quarantine:
  enabled: true                           # If false, pipeline hard-fails on any quality error
  target: "quarantine/customers"          # Where bad rows are written (with _lakelogic_errors column)
  notifications:                          # Alert channels when rows are quarantined
    - target: "https://hooks.slack.com/services/YOUR/WEBHOOK"  # Slack, Teams, email auto-detected
      on_events: ["quarantine", "failure", "schema_drift"]

# OPTIONAL: Service Level Objectives — data reliability monitoring
service_levels:
  freshness:
    threshold: "24h"                      # Data must be refreshed within this window
    field: "updated_at"                   # Timestamp field to check staleness against
  availability:
    threshold: 99.9                       # % of runs that must produce valid output

# OPTIONAL: Regulatory compliance metadata — used for audit-ready reports
compliance:
  gdpr:
    applicable: true                      # Whether GDPR applies to this dataset
    legal_basis: "legitimate_interest"    # Art. 6(1) lawful basis for processing
    purpose: "Customer engagement tracking"  # Why this data is processed (Art. 5(1)(b))
    retention_period: "24 months"         # Legal retention limit for PII — separate from operational retention
  eu_ai_act:
    applicable: false                     # Whether EU AI Act applies (for ML feature datasets)

> [!TIP] View the Complete Contract Reference for every available configuration option.
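To make the `quality.row_rules` and quarantine semantics concrete, here is a minimal sketch in plain Python. This is an illustration of the behavior described above, not LakeLogic's implementation; the rule names and the `run_quality_gate` helper are hypothetical, but the `_lakelogic_errors` column and the reconciliation guarantee match the contract documentation.

```python
# Sketch of row-rule evaluation: each row is checked independently, failed rows
# are quarantined with their error reasons, and nothing is silently dropped.

ROW_RULES = [
    ("not_null", lambda r: r["customer_id"] is not None and r["email"] is not None),
    ("status_enum", lambda r: r["status"] in ("active", "churned", "pending")),
    ("revenue_range", lambda r: r["revenue"] >= 0),
]

def run_quality_gate(rows):
    """Split rows into good and quarantined, attaching error reasons to failures."""
    good, quarantined = [], []
    for row in rows:
        errors = [name for name, check in ROW_RULES if not check(row)]
        if errors:
            quarantined.append({**row, "_lakelogic_errors": errors})
        else:
            good.append(row)
    # Reconciliation guarantee: source_count = good_count + bad_count.
    assert len(rows) == len(good) + len(quarantined)
    return good, quarantined

rows = [
    {"customer_id": 1, "email": "a@x.com", "status": "active", "revenue": 10.0},
    {"customer_id": None, "email": "b@x.com", "status": "active", "revenue": 5.0},
    {"customer_id": 3, "email": "c@x.com", "status": "unknown", "revenue": -1.0},
]
good, bad = run_quality_gate(rows)  # 1 good row, 2 quarantined
```

Because every failed row carries its error reasons, quarantined data can be inspected, fixed at the source, and replayed rather than reconstructed.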


Architecture

LakeLogic enforces Data Contracts as quality gates across the Medallion Architecture (Bronze → Silver → Gold).

LakeLogic Architecture

Each layer uses its own contract:

| Layer | Role | Guarantee |
|---|---|---|
| Bronze | Capture everything raw, no validation | Immutable record of source |
| Silver | Full validation, business rules, dedup | Trusted, queryable data |
| Gold | Aggregations, KPIs, ML features | Analytics-ready datasets |
| Quarantine | Failed rows isolated with error reasons | Nothing silently dropped |

Key Guarantee: source_count = good_count + bad_count — 100% reconciliation, always.


Business Impact

| Benefit | Detail |
|---|---|
| Cut Compute Spend by 80% | Not every job needs Spark. Run maintenance tasks on Polars or DuckDB locally. |
| Guaranteed Integrity | Dirty data goes to quarantine — dashboards are never poisoned. |
| Full Transparency | Trace any KPI back to raw source records and the contract that validated them. |
| Parallel Development | Two engineers work on two tables simultaneously without touching the same file. |
| Easier Debugging | Logs tell you exactly which module failed — no searching through monster scripts. |


Examples

The examples directory contains runnable notebooks:

| Folder | What You'll Learn |
|---|---|
| 01_quickstart/ | Remote CSV ingestion, database governance |
| 02_core_patterns/ | Bronze quality gate, medallion architecture, SCD2, deduplication, soft deletes |
| 03_compliance_governance/ | HIPAA & GDPR Policy Packs, automated PII masking, audit-ready quarantine |
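The soft-delete pattern covered in `02_core_patterns/` can be sketched in a few lines of plain Python. This is only an illustration of the semantics, not LakeLogic's code; the `soft_delete` helper is hypothetical, but the flag, reason, and timestamp field names match the contract's `soft_deletes` block above.

```python
from datetime import datetime, timezone

# Sketch of soft deletes: rows are flagged, never physically removed, so a GDPR
# "right to erasure" request can be honored without destroying the audit trail.

def soft_delete(table, key, key_value, reason):
    """Flag matching rows using the contract's flag/reason/timestamp fields."""
    now = datetime.now(timezone.utc).isoformat()
    for row in table:
        if row[key] == key_value:
            row["_is_deleted"] = True
            row["_delete_reason"] = reason
            row["_deleted_at"] = now
    return table

customers = [
    {"customer_id": 1, "email": "a@x.com", "_is_deleted": False},
    {"customer_id": 2, "email": "b@x.com", "_is_deleted": False},
]
soft_delete(customers, "customer_id", 2, "GDPR request")
# Customer 2 is now flagged as deleted but still present for auditing.
```

Downstream consumers simply filter on `_is_deleted = false`, while auditors retain the full history of what was removed, when, and why.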

Documentation

Technical Capabilities

  • Engine Agnostic — Auto-optimizes for Spark, Polars, DuckDB, or Pandas
  • Incremental-First — Built-in watermarking, CDC, and file-mtime tracking
  • SQL-First Rules — Define business logic in the language your team already speaks
  • Automatic Lineage — Every row stamped with Run IDs and source paths
  • 100% Reconciliation — Mathematically guaranteed: source = good + bad
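The incremental-first behavior can be illustrated with a small watermarking sketch: each run processes only rows newer than the stored high-water mark, then advances the mark. This is plain Python with a hypothetical `incremental_load` helper, not LakeLogic's actual internals.

```python
# Sketch of watermark-based incremental loading: skip already-processed rows
# by comparing each row's `updated_at` to the last recorded high-water mark.

def incremental_load(rows, watermark):
    """Return (new_rows, new_watermark) for rows carrying an `updated_at` field."""
    new_rows = [r for r in rows if r["updated_at"] > watermark]
    new_watermark = max((r["updated_at"] for r in new_rows), default=watermark)
    return new_rows, new_watermark

rows = [
    {"customer_id": 1, "updated_at": "2024-01-01"},
    {"customer_id": 2, "updated_at": "2024-01-03"},
]
batch, wm = incremental_load(rows, "2024-01-02")
# Only customer 2 is processed; the watermark advances to 2024-01-03.
```

Persisting the watermark between runs is what makes re-runs cheap and idempotent: a run that sees no new data returns an empty batch and leaves the watermark unchanged.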

Contributing

See CONTRIBUTING.md to get started, or docs/installation.md#developer-installation for environment setup.


License

Apache-2.0
