A Python-based data contract runtime for consistent quality across engines.

LakeLogic

Your data estate. Under Contract.

Catch breaking data changes before they reach production. One YAML contract. Any engine. Every row validated, quarantined, or promoted — automatically.


🌐 Data Mesh Alignment

LakeLogic is built for the decentralized data estate, directly supporting the four pillars of Data Mesh:

| Pillar | How LakeLogic Delivers |
|---|---|
| Domain Ownership | Contracts are owned and defined by domain teams (e.g., CRM, Finance) who know the data best. |
| Data as a Product | Contracts serve as the explicit "product interface," guaranteeing quality for consumers. |
| Self-Serve Platform | A standardized runtime that any team can use to deploy quality gates without infra silos. |
| Federated Governance | Global standards (e.g., PII masking) are defined centrally but enforced locally at every layer. |

Quick Start (60 Seconds)

pip install "lakelogic[all]"

1. Bootstrap a contract

lakelogic bootstrap --landing data/ --output contracts/ --ai

Scans data, infers schemas, detects PII, and generates rules using AI.

2. Run the quality gate

lakelogic run --contract contracts/customers.yaml --source data/customers.csv

3. Or use Python directly

from lakelogic import DataProcessor

result = DataProcessor("contract.yaml").run_source()
print(f"Valid: {result.good_count}  |  Quarantined: {result.bad_count}")

Contract Example

This single YAML file replaces hundreds of lines of validation code:

# REQUIRED: Contract version for compatibility tracking
version: "1.0"

# REQUIRED: Metadata — who owns this data and where it lives in the org
info:
  title: Silver Customers                 # Human-readable name for logs and monitoring
  owner: data-team                        # Team responsible for this contract
  domain: CRM                             # Data mesh domain (CRM, Finance, Marketing...)
  system: Salesforce                      # Source system this data originates from
  classification: "confidential"          # Data sensitivity: public | internal | confidential | restricted
  status: "production"                    # Lifecycle stage: development | staging | production | deprecated

# OPTIONAL: Custom tags for governance, cost tracking, and SLA enforcement
metadata:
  pii_present: true                       # Flags this dataset as containing personal data
  retention_days: 2555                    # Operational retention policy (7 years) — used by automated purge jobs
  sla_tier: "tier1"                       # SLA priority: tier1 = critical (< 4hr response)

# REQUIRED: Schema definition — expected columns, types, and constraints
# Field descriptions serve two purposes:
#   1. Business documentation — so analysts understand each field without asking
#   2. LLM context — used by `lakelogic bootstrap --ai` to generate smarter rules
model:
  fields:
    - name: customer_id
      type: integer
      required: true                      # Generates automatic NOT NULL quality rule
      description: "Unique identifier for each customer record"
    - name: email
      type: string
      pii: true                           # Marks as personally identifiable — enables auto-masking
      description: "Primary email address used for account login and communications"
    - name: revenue
      type: float
      description: "Lifetime revenue attributed to this customer in base currency"
    - name: status
      type: string
      description: "Current account state: active, churned, or pending onboarding"

# OPTIONAL: Schema evolution and unknown field handling
schema_policy:
  evolution: "strict"                     # Schema change behavior: strict | compatible | allow
  unknown_fields: "quarantine"            # Unknown columns: quarantine | drop | allow

# REQUIRED: Where to load data from (supports files, S3, ADLS, databases)
source:
  type: landing                           # Acquisition pattern: landing (files) | table (DB) | stream (Kafka)
  path: "data/customers/*.csv"            # Glob pattern — also supports s3://, abfss://, Unity Catalog tables
  load_mode: incremental                  # Only process new/changed data: full | incremental | cdc

# OPTIONAL: Reference data for joins and enrichment
links:
  - name: "dim_countries"                  # Logical name used in lookup/join transformations
    path: "./reference/countries.parquet"   # File path, S3 URI, or Unity Catalog table
    type: "parquet"                         # Format: parquet | csv | table
    broadcast: true                        # Broadcast join for small dimensions (Spark)

# OPTIONAL: Environment-specific overrides (activate via LAKELOGIC_ENV)
environments:
  dev:
    path: "dev/customers"                  # Cheaper storage for development
    format: "parquet"
  prod:
    path: "s3://prod-lake/silver/customers"
    format: "delta"

# OPTIONAL: Data transformations — pre (before validation) and post (after validation)
transformations:
  - rename:                               # Fix source naming drift before schema checks
      from: "cust_id"
      to: "customer_id"
    phase: "pre"                          # PRE = applied before quality rules run
  - deduplicate:                          # Keep most recent record per business key
      columns: ["customer_id"]
      order_by: "updated_at"
  - sql: |                                # Full SQL for complex enrichment logic
      SELECT *, UPPER(status) as status_code,
        revenue * 0.1 as tax_estimate
      FROM source
    phase: "post"                         # POST = applied after validation, on good data only

# OPTIONAL: Quality rules — rows that fail are quarantined, not silently dropped
quality:
  row_rules:                              # Row-level: each row evaluated independently
    - sql: "customer_id IS NOT NULL AND email IS NOT NULL"   # Completeness check
    - sql: "status IN ('active', 'churned', 'pending')"     # Enum validation
    - sql: "revenue >= 0"                                    # Range validation
    - sql: "email LIKE '%@%.%'"                              # Format validation
  dataset_rules:                          # Dataset-level: aggregate checks on all good rows
    - unique: "customer_id"               # No duplicate business keys

# OPTIONAL: Data provenance and audit trail
lineage:
  enabled: true                           # Stamps every row with run_id, source path, timestamps

# REQUIRED: Output — where and how to write validated data
materialization:
  strategy: merge                         # Write mode: overwrite | append | merge (upsert)
  target_path: "silver/customers"         # Destination path (also supports Unity Catalog table names)
  format: delta                           # Storage format: delta | parquet | iceberg | csv
  merge_keys: [customer_id]              # Business keys for merge/upsert operations
  partition_by:                           # Partition columns for query performance
    - "country"
    - "created_date"
  cluster_by: ["customer_id"]            # Clustering columns (Delta/Iceberg optimization)
  reprocess_policy: "overwrite_partition" # Idempotent re-runs: overwrite_partition | append | fail

# OPTIONAL: Soft deletes — GDPR "right to erasure" without losing audit trail
soft_deletes:
  enabled: true                           # Mark rows as deleted instead of hard-deleting
  flag_field: "_is_deleted"               # Boolean column added to target table
  reason_field: "_delete_reason"          # e.g. "GDPR request", "duplicate"
  timestamp_field: "_deleted_at"          # When the deletion was recorded

# OPTIONAL: Quarantine — isolate failed rows with error reasons for replay
quarantine:
  enabled: true                           # If false, pipeline hard-fails on any quality error
  target: "quarantine/customers"          # Where bad rows are written (with _lakelogic_errors column)
  notifications:                          # Alert channels when rows are quarantined
    - target: "https://hooks.slack.com/services/YOUR/WEBHOOK"  # Slack, Teams, email auto-detected
      on_events: ["quarantine", "failure", "schema_drift"]

# OPTIONAL: Service Level Objectives — data reliability monitoring
service_levels:
  freshness:
    threshold: "24h"                      # Data must be refreshed within this window
    field: "updated_at"                   # Timestamp field to check staleness against
  availability:
    threshold: 99.9                       # % of runs that must produce valid output

# OPTIONAL: Regulatory compliance metadata — used for audit-ready reports
compliance:
  gdpr:
    applicable: true                      # Whether GDPR applies to this dataset
    legal_basis: "legitimate_interest"    # Art. 6(1) lawful basis for processing
    purpose: "Customer engagement tracking"  # Why this data is processed (Art. 5(1)(b))
    retention_period: "24 months"         # Legal retention limit for PII — separate from operational retention
  eu_ai_act:
    applicable: false                     # Whether EU AI Act applies (for ML feature datasets)

> [!TIP] View the Complete Contract Reference for every available configuration option.
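To make the `quality.row_rules` and quarantine semantics concrete, here is a minimal sketch in plain Python. This is an illustration of the behavior described above, not LakeLogic's implementation; the rule names and the `run_quality_gate` helper are hypothetical, but the `_lakelogic_errors` column and the reconciliation guarantee match the contract documentation.

```python
# Sketch of row-rule evaluation: each row is checked independently, failed rows
# are quarantined with their error reasons, and nothing is silently dropped.

ROW_RULES = [
    ("not_null", lambda r: r["customer_id"] is not None and r["email"] is not None),
    ("status_enum", lambda r: r["status"] in ("active", "churned", "pending")),
    ("revenue_range", lambda r: r["revenue"] >= 0),
]

def run_quality_gate(rows):
    """Split rows into good and quarantined, attaching error reasons to failures."""
    good, quarantined = [], []
    for row in rows:
        errors = [name for name, check in ROW_RULES if not check(row)]
        if errors:
            quarantined.append({**row, "_lakelogic_errors": errors})
        else:
            good.append(row)
    # Reconciliation guarantee: source_count = good_count + bad_count.
    assert len(rows) == len(good) + len(quarantined)
    return good, quarantined

rows = [
    {"customer_id": 1, "email": "a@x.com", "status": "active", "revenue": 10.0},
    {"customer_id": None, "email": "b@x.com", "status": "active", "revenue": 5.0},
    {"customer_id": 3, "email": "c@x.com", "status": "unknown", "revenue": -1.0},
]
good, bad = run_quality_gate(rows)  # 1 good row, 2 quarantined
```

Because every failed row carries its error reasons, quarantined data can be inspected, fixed at the source, and replayed rather than reconstructed.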


Architecture

LakeLogic enforces Data Contracts as quality gates across the Medallion Architecture (Bronze → Silver → Gold).

LakeLogic Architecture

Each layer uses its own contract:

| Layer | Role | Guarantee |
|---|---|---|
| Bronze | Capture everything raw, no validation | Immutable record of source |
| Silver | Full validation, business rules, dedup | Trusted, queryable data |
| Gold | Aggregations, KPIs, ML features | Analytics-ready datasets |
| Quarantine | Failed rows isolated with error reasons | Nothing silently dropped |

Key Guarantee: source_count = good_count + bad_count — 100% reconciliation, always.


Business Impact

| Benefit | Detail |
|---|---|
| Cut Compute Spend by 80% | Not every job needs Spark. Run maintenance tasks on Polars or DuckDB locally. |
| Guaranteed Integrity | Dirty data goes to quarantine — dashboards are never poisoned. |
| Full Transparency | Trace any KPI back to raw source records and the contract that validated them. |
| Parallel Development | Two engineers work on two tables simultaneously without touching the same file. |
| Easier Debugging | Logs tell you exactly which module failed — no searching through monster scripts. |


Examples

The examples directory contains runnable notebooks:

| Folder | What You'll Learn |
|---|---|
| 01_quickstart/ | Remote CSV ingestion, database governance |
| 02_core_patterns/ | Bronze quality gate, medallion architecture, SCD2, deduplication, soft deletes |
| 03_compliance_governance/ | HIPAA & GDPR Policy Packs, automated PII masking, audit-ready quarantine |
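The soft-delete pattern covered in `02_core_patterns/` can be sketched in a few lines of plain Python. This is only an illustration of the semantics, not LakeLogic's code; the `soft_delete` helper is hypothetical, but the flag, reason, and timestamp field names match the contract's `soft_deletes` block above.

```python
from datetime import datetime, timezone

# Sketch of soft deletes: rows are flagged, never physically removed, so a GDPR
# "right to erasure" request can be honored without destroying the audit trail.

def soft_delete(table, key, key_value, reason):
    """Flag matching rows using the contract's flag/reason/timestamp fields."""
    now = datetime.now(timezone.utc).isoformat()
    for row in table:
        if row[key] == key_value:
            row["_is_deleted"] = True
            row["_delete_reason"] = reason
            row["_deleted_at"] = now
    return table

customers = [
    {"customer_id": 1, "email": "a@x.com", "_is_deleted": False},
    {"customer_id": 2, "email": "b@x.com", "_is_deleted": False},
]
soft_delete(customers, "customer_id", 2, "GDPR request")
# Customer 2 is now flagged as deleted but still present for auditing.
```

Downstream consumers simply filter on `_is_deleted = false`, while auditors retain the full history of what was removed, when, and why.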

Documentation

Technical Capabilities

  • Engine Agnostic — Auto-optimizes for Spark, Polars, DuckDB, or Pandas
  • Incremental-First — Built-in watermarking, CDC, and file-mtime tracking
  • SQL-First Rules — Define business logic in the language your team already speaks
  • Automatic Lineage — Every row stamped with Run IDs and source paths
  • 100% Reconciliation — Mathematically guaranteed: source = good + bad
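The incremental-first behavior can be illustrated with a small watermarking sketch: each run processes only rows newer than the stored high-water mark, then advances the mark. This is plain Python with a hypothetical `incremental_load` helper, not LakeLogic's actual internals.

```python
# Sketch of watermark-based incremental loading: skip already-processed rows
# by comparing each row's `updated_at` to the last recorded high-water mark.

def incremental_load(rows, watermark):
    """Return (new_rows, new_watermark) for rows carrying an `updated_at` field."""
    new_rows = [r for r in rows if r["updated_at"] > watermark]
    new_watermark = max((r["updated_at"] for r in new_rows), default=watermark)
    return new_rows, new_watermark

rows = [
    {"customer_id": 1, "updated_at": "2024-01-01"},
    {"customer_id": 2, "updated_at": "2024-01-03"},
]
batch, wm = incremental_load(rows, "2024-01-02")
# Only customer 2 is processed; the watermark advances to 2024-01-03.
```

Persisting the watermark between runs is what makes re-runs cheap and idempotent: a run that sees no new data returns an empty batch and leaves the watermark unchanged.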

Contributing

See CONTRIBUTING.md to get started, or docs/installation.md#developer-installation for environment setup.


License

Apache-2.0
