Intelligent ML data observability for the lakehouse — sketch-based profiling with LLM interpretation
lakesense
Intelligent ML data observability for the lakehouse.
lakesense profiles your datasets using mergeable probabilistic sketches (MinHash, HyperLogLog, KLL) and deterministic column profiles, builds dynamic baselines per job, and uses an LLM agent to investigate and explain drift signals — with pluggable alerting and storage.
Why lakesense?
Most existing tools stop at drift detection: they tell you a number changed. lakesense adds an interpretation layer, a two-tier pipeline that runs free heuristic rules on every job, calls an LLM for interpretation only when those rules flag warn or alert, and fires the investigative agent only when something is actually wrong.
Key properties:
- Probabilistic sketches — MinHash, HLL, KLL for O(1) memory profiling with mergeable baselines
- Full column profiling — null rates, int ranges, categorical distributions, boolean ratios, string lengths, schema drift
- Distributed compute — Spark provider for distributed sketch computation via mapInPandas
- Zero-infra quickstart — Parquet backend, no catalog or cluster required
- Plugin architecture — bring your own storage, alerting, and agent tools
- Two-tier cost control — heuristics always run free; LLM only invoked on warn/alert; expensive agent only on alert
- No-network mode — works 100% locally using heuristic rules when no API key is set
Quickstart
pip install lakesense
import asyncio

import pandas as pd

from lakesense.core import SketchFramework
from lakesense.storage.parquet import ParquetBackend
from lakesense.sketches.providers.pandas import PandasProvider
from lakesense.sketches.merge import BaselineConfig

# 1. Compute sketches from your data
df = pd.read_parquet("features/latest.parquet")
provider = PandasProvider()
records = provider.sketch(
    data=df,
    dataset_id="user_features",
    job_id="train_job_42",
    text_columns=["description"],
    id_columns=["user_id"],
    numeric_columns=["session_count", "revenue"],
)

# 2. Run the interpretation pipeline
framework = SketchFramework(storage=ParquetBackend("./sketches"))
result = asyncio.run(framework.run({
    "dataset_id": "user_features",
    "job_id": "train_job_42",
    "sketch_records": records,
    "baseline_config": BaselineConfig(dataset_id="user_features", window_days=7),
}))
print(result.severity)  # ok | warn | alert
print(result.summary)   # "Jaccard similarity dropped 34% vs 7-day baseline..."
Heuristic rules run on every job (free, instant). Set OPENAI_API_KEY or ANTHROPIC_API_KEY
to add LLM-powered interpretation — the LLM is only invoked when heuristics flag warn/alert,
so healthy runs never incur an API call.
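The gating described above can be illustrated with a minimal, self-contained sketch. It does not use the lakesense API: `heuristic_severity`, `StubLLM`, `assess`, and the thresholds are all hypothetical names and values chosen for illustration. The only point it makes is that the LLM call sits behind the heuristic check, so an `ok` run never reaches it.

```python
import asyncio
from enum import IntEnum
from typing import Optional, Tuple


class Severity(IntEnum):
    OK = 0
    WARN = 1
    ALERT = 2


def heuristic_severity(jaccard_delta: float) -> Severity:
    """Hypothetical rule; thresholds are illustrative, not lakesense defaults."""
    if jaccard_delta >= 0.30:
        return Severity.ALERT
    if jaccard_delta >= 0.10:
        return Severity.WARN
    return Severity.OK


class StubLLM:
    """Stands in for a real LLM client and counts how often it is called."""

    def __init__(self) -> None:
        self.calls = 0

    async def interpret(self, jaccard_delta: float) -> str:
        self.calls += 1
        return f"Jaccard similarity dropped {jaccard_delta:.0%} vs baseline"


async def assess(jaccard_delta: float, llm: Optional[StubLLM]) -> Tuple[Severity, str]:
    severity = heuristic_severity(jaccard_delta)
    # Healthy runs, and runs with no LLM configured, stop at the heuristics.
    if severity is Severity.OK or llm is None:
        return severity, "heuristics only"
    return severity, await llm.interpret(jaccard_delta)
```

Passing `llm=None` mimics the no-API-key case: the severity still comes back, only the generated summary is missing.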
Run the full quickstart example (no API key needed):
pip install lakesense[duckdb]
python examples/quickstart.py
Architecture
Every run → Tier 1: sketch compute + baseline merge + heuristic rules (LLM interpret only on warn/alert) → severity + summary
warn/alert → Tier 2: plugins (investigative agent, Slack, PagerDuty) → root cause + action
Tier 1 — base interpretation (always runs)
- Compute sketches (MinHash, HLL, KLL) and column profiles from the dataset
- Merge historical sketches into a baseline (rolling window, snapshot, or EWMA)
- Compute drift signals (Jaccard delta, cardinality ratio, quantile shifts, null rate, schema drift)
- Run heuristic rules; if severity is ok, return immediately (no LLM cost)
- On warn/alert, call the LLM for nuanced interpretation and a summary (the LLM can upgrade severity but not downgrade it below the heuristic floor)
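To make the drift signals and the severity floor concrete, here is a rough sketch that computes three of the signals from exact values rather than from sketches. Everything in it is an assumption for illustration: `DriftSignals`, `compute_signals`, and `apply_floor` are hypothetical names, "Jaccard delta" is taken to mean one minus the similarity to the baseline, and lakesense's own definitions may differ.

```python
from dataclasses import dataclass

SEVERITY_ORDER = ("ok", "warn", "alert")


@dataclass(frozen=True)
class DriftSignals:
    jaccard_delta: float      # 1 - similarity between baseline and current values
    cardinality_ratio: float  # current distinct count / baseline distinct count
    null_rate_delta: float    # current null rate - baseline null rate


def jaccard(a: set, b: set) -> float:
    """Exact Jaccard similarity; a MinHash sketch would estimate this value."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def null_rate(values: list) -> float:
    return sum(v is None for v in values) / len(values) if values else 0.0


def compute_signals(baseline: list, current: list) -> DriftSignals:
    base_set = {v for v in baseline if v is not None}
    cur_set = {v for v in current if v is not None}
    return DriftSignals(
        jaccard_delta=1.0 - jaccard(base_set, cur_set),
        cardinality_ratio=len(cur_set) / max(len(base_set), 1),
        null_rate_delta=null_rate(current) - null_rate(baseline),
    )


def apply_floor(heuristic: str, llm: str) -> str:
    """The LLM may raise severity, but never lower it below the heuristic result."""
    return max(heuristic, llm, key=SEVERITY_ORDER.index)
```

For example, `apply_floor("warn", "ok")` stays at `"warn"`, while `apply_floor("warn", "alert")` is raised to `"alert"`.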
Tier 2 — plugin chain (on warn/alert only)
Plugins run in registration order, each receiving the result enriched by prior plugins:
framework = (
    SketchFramework(storage=ParquetBackend("./sketches"))
    .register(InvestigativeAgentPlugin())         # root cause analysis
    .register(SlackAlertPlugin(webhook=WEBHOOK))  # needs owners from agent
)
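Why registration order matters can be shown with a small stand-in for the chain. This is a sketch under assumptions, not the lakesense implementation: `Result`, `Plugin`, `Chain`, `OwnerLookup`, and `Notifier` are hypothetical classes that only mirror the `should_run` / `run` shape used in this README.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class Result:
    severity: str
    metadata: dict = field(default_factory=dict)


class Plugin:
    def should_run(self, result: Result) -> bool:
        return True

    async def run(self, result: Result) -> Result:
        raise NotImplementedError


class OwnerLookup(Plugin):
    """Plays the role of the agent: enriches the result with owners."""

    async def run(self, result: Result) -> Result:
        result.metadata["owners"] = ["data-platform"]
        return result


class Notifier(Plugin):
    """Plays the role of an alerting plugin that needs owners to exist."""

    def should_run(self, result: Result) -> bool:
        return "owners" in result.metadata

    async def run(self, result: Result) -> Result:
        result.metadata["notified"] = list(result.metadata["owners"])
        return result


class Chain:
    def __init__(self) -> None:
        self._plugins = []

    def register(self, plugin: Plugin) -> "Chain":
        self._plugins.append(plugin)
        return self  # returning self allows fluent .register(...).register(...)

    async def run(self, result: Result) -> Result:
        if result.severity == "ok":
            return result  # Tier 2 is skipped entirely for healthy runs
        for plugin in self._plugins:
            if plugin.should_run(result):
                result = await plugin.run(result)
        return result
```

Registering `Notifier` before `OwnerLookup` means the notifier sees no owners and skips itself, which is the failure mode the ordering comment above guards against.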
Sketch providers
| Provider | Use case | Install |
|---|---|---|
| PandasProvider | Single-machine, local dev | pip install lakesense |
| SparkProvider | Distributed compute via mapInPandas | pip install lakesense[spark] |
| StreamingProvider | Incremental / micro-batch | pip install lakesense |
Sketch types
| Sketch | Use case | Merge cost |
|---|---|---|
| MinHash (Theta) | Text/set similarity, near-duplicate detection | O(num_perm) |
| HyperLogLog | Cardinality estimation (unique users, items) | O(registers) |
| KLL | Quantile estimation, distribution shape shifts | approx via sorted sample |
| Profile | Deterministic column metrics (nulls, ranges, categoricals) | scalar comparison |
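The "mergeable" property in the table can be demonstrated with a toy MinHash. This is a classic MinHash written from scratch for illustration, not the Theta-sketch variant the table mentions and not lakesense's code; `NUM_PERM`, `minhash`, `merge`, and `estimate_jaccard` are hypothetical names. Merging is an element-wise minimum, which is where the O(num_perm) merge cost comes from.

```python
import hashlib
from typing import Iterable, List

NUM_PERM = 64  # number of hash functions; illustrative value


def _hash(seed: int, item: str) -> int:
    # A salted BLAKE2b digest acts as one member of a family of hash functions.
    digest = hashlib.blake2b(
        item.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash(items: Iterable[str]) -> List[int]:
    """Signature of a non-empty set: the minimum hash under each seed."""
    items = list(items)
    return [min(_hash(seed, item) for item in items) for seed in range(NUM_PERM)]


def merge(a: List[int], b: List[int]) -> List[int]:
    """Signature of the union of the two underlying sets."""
    return [min(x, y) for x, y in zip(a, b)]


def estimate_jaccard(a: List[int], b: List[int]) -> float:
    """Fraction of positions where the two signatures agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)
```

Because `min` is associative and commutative, `merge(minhash(A), minhash(B))` equals `minhash(A | B)` exactly, so per-partition or per-run signatures can be combined into a baseline without revisiting the raw data.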
Storage backends
| Backend | Use case | Install |
|---|---|---|
| ParquetBackend | Zero-infra, local dev | pip install lakesense |
| DuckDBBackend | Local + SQL queries | pip install lakesense[duckdb] |
| IcebergBackend | Production lakehouse (v0.2) | pip install lakesense[iceberg] |
Baseline strategies
from lakesense.sketches.merge import BaselineConfig, BaselineStrategy
# Rolling window — merge all runs in the last N days
BaselineConfig(dataset_id="ds", strategy=BaselineStrategy.ROLLING_WINDOW, window_days=7)
# Snapshot — pin a known-good run as reference
BaselineConfig(dataset_id="ds", strategy=BaselineStrategy.SNAPSHOT,
               snapshot_id="2024-01-15T00:00:00+00:00")
# EWMA — exponentially weight recent runs more
BaselineConfig(dataset_id="ds", strategy=BaselineStrategy.EWMA, decay_factor=0.85)
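The difference between the rolling-window and EWMA strategies is easier to see on a single scalar metric than on sketches. The following is a simplified illustration only: lakesense merges sketches rather than averaging numbers, and the weighting shown here (a run `k` steps older than the latest gets weight `decay_factor ** k`) is an assumption about what `decay_factor` means. The function names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


def rolling_window_baseline(
    runs: List[Tuple[datetime, float]], now: datetime, window_days: int
) -> Optional[float]:
    """Unweighted mean of every run whose timestamp falls inside the window."""
    cutoff = now - timedelta(days=window_days)
    values = [value for ts, value in runs if ts >= cutoff]
    return sum(values) / len(values) if values else None


def ewma_baseline(values: List[float], decay_factor: float) -> Optional[float]:
    """Weighted mean over runs ordered oldest first; newer runs weigh more."""
    if not values:
        return None
    n = len(values)
    weights = [decay_factor ** (n - 1 - i) for i in range(n)]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)
```

With a decay factor close to 1 the EWMA behaves much like a long rolling window; smaller values make the baseline track recent runs more closely. A snapshot baseline needs no computation at all, since it simply pins one reference run.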
Writing a custom plugin
from lakesense.core import SketchPlugin, InterpretationResult, Severity


class PagerDutyPlugin(SketchPlugin):
    def __init__(self, routing_key: str):
        self._key = routing_key

    def should_run(self, result: InterpretationResult) -> bool:
        # Page only on alerts that the investigative agent has already enriched.
        return result.severity == Severity.ALERT and result.is_agent_enriched()

    async def run(self, result: InterpretationResult) -> InterpretationResult:
        await self._page(result)
        result.metadata["pagerduty"] = "paged"
        return result

    async def _page(self, result: InterpretationResult) -> None:
        # Placeholder: the delivery call to PagerDuty is not shown in the original.
        ...
Roadmap
- v0.1 — core sketches, column profiles, Parquet + DuckDB storage, Tier 1 LLM interpret, Spark provider
- v0.2 — agent plugin, DataHub lineage, Slack plugin, IcebergBackend
- v0.3 — DeltaBackend, Airflow operator, OpenLineage support
- v0.4 — JIRA plugin, column-level lineage
Contributing
See CONTRIBUTING.md. PRs welcome — especially new storage backends and plugins.
pip install -e ".[dev]"
pytest tests/unit/
ruff check .
mypy lakesense/
License
Apache 2.0 — see LICENSE.