lakesense

Intelligent ML data observability for the lakehouse — sketch-based profiling with LLM interpretation.

lakesense profiles your datasets using mergeable probabilistic sketches (MinHash, HyperLogLog, KLL) and deterministic column profiles, builds dynamic baselines per job, and uses an LLM agent to investigate and explain drift signals — with pluggable alerting and storage.

Why lakesense?

Existing tools stop at drift detection — they tell you a number changed. lakesense adds an interpretation layer: a two-tier pipeline in which free heuristic checks run on every job, a fast LLM assessment fires only when those checks flag warn or alert, and an investigative agent runs only on alert.

Key properties:

  • Probabilistic sketches — MinHash, HLL, KLL for O(1)-memory profiling with mergeable baselines (see the sketch after this list)
  • Full column profiling — null rates, int ranges, categorical distributions, boolean ratios, string lengths, schema drift
  • Distributed compute — Spark provider for distributed sketch computation via mapInPandas
  • Zero-infra quickstart — Parquet backend, no catalog or cluster required
  • Plugin architecture — bring your own storage, alerting, and agent tools
  • Two-tier cost control — heuristics always run free; LLM only invoked on warn/alert; expensive agent only on alert
  • No-network mode — works 100% locally using heuristic rules when no API key is set
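
The "mergeable" property is what makes cheap baselines possible: sketches computed by separate runs combine without re-reading raw data. A minimal illustration using the third-party datasketch library (lakesense's internal sketch types may differ):

from datasketch import HyperLogLog

# Two days of traffic are sketched independently, e.g. by two separate jobs.
day1, day2 = HyperLogLog(p=12), HyperLogLog(p=12)
for uid in ("u1", "u2", "u3"):
    day1.update(uid.encode("utf8"))
for uid in ("u3", "u4"):
    day2.update(uid.encode("utf8"))

# Fold the partial sketches into a baseline without touching the raw rows.
baseline = HyperLogLog(p=12)
baseline.merge(day1)
baseline.merge(day2)
print(round(baseline.count()))  # about 4 distinct users across both days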

Quickstart

pip install lakesense

import asyncio
import pandas as pd
from lakesense.core import SketchFramework
from lakesense.storage.parquet import ParquetBackend
from lakesense.sketches.providers.pandas import PandasProvider
from lakesense.sketches.merge import BaselineConfig

# 1. Compute sketches from your data
df = pd.read_parquet("features/latest.parquet")
provider = PandasProvider()
records = provider.sketch(
    data=df,
    dataset_id="user_features",
    job_id="train_job_42",
    text_columns=["description"],
    id_columns=["user_id"],
    numeric_columns=["session_count", "revenue"],
)

# 2. Run the interpretation pipeline
framework = SketchFramework(storage=ParquetBackend("./sketches"))

result = asyncio.run(framework.run({
    "dataset_id": "user_features",
    "job_id":     "train_job_42",
    "sketch_records": records,
    "baseline_config": BaselineConfig(dataset_id="user_features", window_days=7),
}))

print(result.severity)   # ok | warn | alert
print(result.summary)    # "Jaccard similarity dropped 34% vs 7-day baseline..."

Heuristic rules run on every job (free, instant). Set OPENAI_API_KEY or ANTHROPIC_API_KEY to add LLM-powered interpretation — the LLM is only invoked when heuristics flag warn/alert, so healthy runs never incur an API call.

Run the full quickstart example (no API key needed):

pip install lakesense[duckdb]
python examples/quickstart.py

Architecture

Every run   →  Tier 1: sketch compute + baseline merge + heuristics (LLM on warn/alert)  →  severity + summary
warn/alert  →  Tier 2: plugins (investigative agent, Slack, PagerDuty)                   →  root cause + action

Tier 1 — base interpretation (always runs)

  1. Compute sketches (MinHash, HLL, KLL) and column profiles from the dataset
  2. Merge historical sketches into a baseline (rolling window, snapshot, or EWMA)
  3. Compute drift signals (Jaccard delta, cardinality ratio, quantile shifts, null rate, schema drift)
  4. Run heuristic rules — if severity is ok, return immediately (no LLM cost)
  5. On warn/alert — call the LLM for nuanced interpretation + summary (LLM can upgrade severity but not downgrade below the heuristic floor)
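
The floor rule in step 5 is simple to picture. A minimal sketch, assuming severities form an ordered enum (names here are illustrative, not lakesense's actual types):

from enum import IntEnum

class Severity(IntEnum):
    OK = 0
    WARN = 1
    ALERT = 2

def resolve_severity(heuristic: Severity, llm: Severity) -> Severity:
    # The LLM may escalate (e.g. WARN -> ALERT) but never undercut heuristics.
    return max(heuristic, llm)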

Tier 2 — plugin chain (on warn/alert only)

Plugins run in registration order, each receiving the result enriched by prior plugins:

framework = (
    SketchFramework(storage=ParquetBackend("./sketches"))
    .register(InvestigativeAgentPlugin())   # root cause analysis
    .register(SlackAlertPlugin(webhook=WEBHOOK))  # needs owners from agent
)

Sketch providers

Provider            Use case                               Install
PandasProvider      Single-machine, local dev              pip install lakesense
SparkProvider       Distributed compute via mapInPandas    pip install lakesense[spark]
StreamingProvider   Incremental / micro-batch              pip install lakesense
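
For distributed datasets, SparkProvider is intended as a drop-in for the PandasProvider call in the quickstart. A sketch assuming it mirrors that signature (the module path and arguments are assumptions; consult the API reference):

from pyspark.sql import SparkSession
from lakesense.sketches.providers.spark import SparkProvider  # path assumed

spark = SparkSession.builder.getOrCreate()
sdf = spark.read.parquet("s3://bucket/features/latest/")

# Partitions are sketched in parallel via mapInPandas; the mergeable partial
# sketches are then combined into a single record set.
records = SparkProvider().sketch(
    data=sdf,
    dataset_id="user_features",
    job_id="train_job_42",
    id_columns=["user_id"],
    numeric_columns=["session_count", "revenue"],
)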

Sketch types

Sketch            Use case                                                      Merge cost
MinHash (Theta)   Text/set similarity, near-duplicate detection                 O(num_perm)
HyperLogLog       Cardinality estimation (unique users, items)                  O(registers)
KLL               Quantile estimation, distribution shape shifts                approx. via sorted sample
Profile           Deterministic column metrics (nulls, ranges, categoricals)    scalar comparison
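
To make "quantile estimation" concrete, here is the idea expressed with the Apache datasketches library as a stand-in (lakesense ships its own KLL wrapper, which may differ): two KLL sketches represent the baseline and current runs, and the median shift falls out of two get_quantile calls.

import random
from datasketches import kll_floats_sketch

baseline, current = kll_floats_sketch(200), kll_floats_sketch(200)
for _ in range(10_000):
    baseline.update(random.gauss(100.0, 10.0))  # historical values
    current.update(random.gauss(130.0, 10.0))   # today's run, shifted upward

shift = current.get_quantile(0.5) / baseline.get_quantile(0.5) - 1.0
print(f"median shifted {shift:+.0%}")           # roughly +30%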

Storage backends

Backend          Use case                       Install
ParquetBackend   Zero-infra, local dev          pip install lakesense
DuckDBBackend    Local + SQL queries            pip install lakesense[duckdb]
IcebergBackend   Production lakehouse (v0.2)    pip install lakesense[iceberg]
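
Swapping backends is a one-line change at framework construction. A sketch assuming DuckDBBackend takes a database path the way ParquetBackend takes a directory (constructor argument and module path are assumptions; check the docs):

from lakesense.storage.duckdb import DuckDBBackend  # module path assumed

framework = SketchFramework(storage=DuckDBBackend("./sketches.duckdb"))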

Baseline strategies

from lakesense.sketches.merge import BaselineConfig, BaselineStrategy

# Rolling window — merge all runs in the last N days
BaselineConfig(dataset_id="ds", strategy=BaselineStrategy.ROLLING_WINDOW, window_days=7)

# Snapshot — pin a known-good run as reference
BaselineConfig(dataset_id="ds", strategy=BaselineStrategy.SNAPSHOT,
               snapshot_id="2024-01-15T00:00:00+00:00")

# EWMA — exponentially weight recent runs more
BaselineConfig(dataset_id="ds", strategy=BaselineStrategy.EWMA, decay_factor=0.85)
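
Assuming the conventional EWMA weighting, a run that is k steps old contributes weight proportional to decay_factor**k, so with decay_factor=0.85 a run's influence halves roughly every four runs. Lower values adapt to recent data faster at the cost of a noisier baseline.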

Writing a custom plugin

from lakesense.core import SketchPlugin, InterpretationResult, Severity

class PagerDutyPlugin(SketchPlugin):
    def __init__(self, routing_key: str):
        self._key = routing_key

    def should_run(self, result: InterpretationResult) -> bool:
        # Page only on alerts the investigative agent has already enriched.
        return result.severity == Severity.ALERT and result.is_agent_enriched()

    async def run(self, result: InterpretationResult) -> InterpretationResult:
        await self._page(result)
        result.metadata["pagerduty"] = "paged"
        return result

    async def _page(self, result: InterpretationResult) -> None:
        # Deliver the alert to PagerDuty's Events API using self._key
        # (the HTTP call is elided here for brevity).
        ...
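
Registration works the same way as for the built-in plugins shown earlier:

framework = (
    SketchFramework(storage=ParquetBackend("./sketches"))
    .register(InvestigativeAgentPlugin())             # enriches result first
    .register(PagerDutyPlugin(routing_key="..."))     # pages on enriched alerts
)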

Roadmap

  • v0.1 — core sketches, column profiles, Parquet + DuckDB storage, Tier 1 LLM interpret, Spark provider
  • v0.2 — agent plugin, DataHub lineage, Slack plugin, IcebergBackend
  • v0.3 — DeltaBackend, Airflow operator, OpenLineage support
  • v0.4 — JIRA plugin, column-level lineage

Contributing

See CONTRIBUTING.md. PRs welcome — especially new storage backends and plugins.

pip install -e ".[dev]"
pytest tests/unit/
ruff check .
mypy lakesense/

License

Apache 2.0 — see LICENSE.
