Intelligent ML data observability for the lakehouse — sketch-based profiling with LLM interpretation

Maintainers

ramannanda9

Project description

lakesense

Intelligent ML data observability for the lakehouse.

lakesense profiles your datasets using mergeable probabilistic sketches (MinHash, HyperLogLog, KLL) and deterministic column profiles, builds dynamic baselines per job, and uses an LLM agent to investigate and explain drift signals — with pluggable alerting and storage.

Why lakesense?

Existing tools stop at drift detection: they tell you a number changed. lakesense adds an interpretation layer, a two-tier pipeline that runs free heuristic checks on every job, calls an LLM for interpretation only when those checks flag warn or alert, and fires an investigative agent only when something is actually wrong.

Key properties:

- Probabilistic sketches: MinHash, HLL, and KLL profile data in memory that stays bounded regardless of dataset size, with mergeable baselines
- Full column profiling: null rates, int ranges, categorical distributions, boolean ratios, string lengths, schema drift
- Distributed compute: Spark provider for distributed sketch computation via mapInPandas
- Zero-infra quickstart: Parquet backend, no catalog or cluster required
- Plugin architecture: bring your own storage, alerting, and agent tools
- Two-tier cost control: heuristics always run and cost nothing; the LLM is invoked only on warn/alert; the expensive agent runs only on alert
- No-network mode: works 100% locally using heuristic rules when no API key is set
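
To make "mergeable" concrete, here is a minimal pure-Python MinHash sketch. It is an illustration under stated assumptions, not lakesense's implementation: the names `NUM_PERM`, `minhash`, `merge`, and `jaccard_estimate` are hypothetical, and the hash choice (salted BLAKE2b) is arbitrary. The point it shows is that merging two signatures with an element-wise minimum gives exactly the signature of the combined data, so a baseline can be built from per-run sketches without rereading rows.

```python
import hashlib

NUM_PERM = 128  # number of hash functions; an illustrative value


def _hash(seed: int, token: str) -> int:
    """Deterministic 64-bit hash of a token under a given seed."""
    digest = hashlib.blake2b(
        token.encode(), digest_size=8, salt=seed.to_bytes(8, "big")
    ).digest()
    return int.from_bytes(digest, "big")


def minhash(tokens) -> list[int]:
    """Signature of a non-empty set: the minimum hash per seed."""
    tokens = list(tokens)
    return [min(_hash(seed, t) for t in tokens) for seed in range(NUM_PERM)]


def merge(a: list[int], b: list[int]) -> list[int]:
    """Signature of the union of the two underlying sets."""
    return [min(x, y) for x, y in zip(a, b)]


def jaccard_estimate(a: list[int], b: list[int]) -> float:
    """Fraction of matching slots approximates Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


monday = {"alice", "bob", "carol"}
tuesday = {"carol", "dave"}

merged = merge(minhash(monday), minhash(tuesday))
# Merging sketches is equivalent to sketching the merged data.
assert merged == minhash(monday | tuesday)
print(jaccard_estimate(minhash(monday), minhash(tuesday)))
```

The true Jaccard similarity of the two example sets is 1/4; the printed estimate approximates it and tightens as `NUM_PERM` grows.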

Quickstart

pip install lakesense

import asyncio
import pandas as pd
from lakesense.core import SketchFramework
from lakesense.storage.parquet import ParquetBackend
from lakesense.sketches.providers.pandas import PandasProvider
from lakesense.sketches.merge import BaselineConfig

# 1. Compute sketches from your data
df = pd.read_parquet("features/latest.parquet")
provider = PandasProvider()
records = provider.sketch(
    data=df,
    dataset_id="user_features",
    job_id="train_job_42",
    text_columns=["description"],
    id_columns=["user_id"],
    numeric_columns=["session_count", "revenue"],
)

# 2. Run the interpretation pipeline
framework = SketchFramework(storage=ParquetBackend("./sketches"))

result = asyncio.run(framework.run({
    "dataset_id": "user_features",
    "job_id":     "train_job_42",
    "sketch_records": records,
    "baseline_config": BaselineConfig(dataset_id="user_features", window_days=7),
}))

print(result.severity)   # ok | warn | alert
print(result.summary)    # "Jaccard similarity dropped 34% vs 7-day baseline..."

Heuristic rules run on every job (free, instant). Set OPENAI_API_KEY or ANTHROPIC_API_KEY to add LLM-powered interpretation — the LLM is only invoked when heuristics flag warn/alert, so healthy runs never incur an API call.

Run the full quickstart example (no API key needed):

pip install "lakesense[duckdb]"
python examples/quickstart.py

Architecture

Every run   →  Tier 1: sketch compute + baseline merge + heuristic rules (+ LLM interpret on warn/alert)  →  severity + summary
warn/alert  →  Tier 2: plugins (investigative agent, Slack, PagerDuty)  →  root cause + action

Tier 1 — base interpretation (always runs)

1. Compute sketches (MinHash, HLL, KLL) and column profiles from the dataset
2. Merge historical sketches into a baseline (rolling window, snapshot, or EWMA)
3. Compute drift signals (Jaccard delta, cardinality ratio, quantile shifts, null rate, schema drift)
4. Run heuristic rules: if severity is ok, return immediately (no LLM cost)
5. On warn/alert, call the LLM for nuanced interpretation and a summary (the LLM can upgrade severity but cannot downgrade it below the heuristic floor)
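
The gating in the last two steps can be sketched as follows. This is a minimal illustration of the described behaviour, not lakesense's code: the `Severity` enum here is a local stand-in for the library's own type, and `tier1_severity` and its `llm` callback are hypothetical names.

```python
from enum import IntEnum
from typing import Callable, Optional


class Severity(IntEnum):
    """Local stand-in; ordering lets max() express 'upgrade only'."""
    OK = 0
    WARN = 1
    ALERT = 2


def tier1_severity(
    heuristic: Severity,
    llm: Optional[Callable[[], Severity]] = None,
) -> Severity:
    """Combine the heuristic verdict with an optional LLM verdict."""
    # Healthy runs, and runs with no LLM configured, never call the LLM.
    if heuristic is Severity.OK or llm is None:
        return heuristic
    # The heuristic result is a floor: the LLM may raise it, never lower it.
    return max(heuristic, llm())


print(tier1_severity(Severity.WARN, lambda: Severity.OK).name)     # WARN
print(tier1_severity(Severity.WARN, lambda: Severity.ALERT).name)  # ALERT
```

Passing the LLM as a callable rather than a precomputed value is what keeps healthy runs free: the call is never made unless the heuristics have already flagged something.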

Tier 2 — plugin chain (on warn/alert only)

Plugins run in registration order, each receiving the result enriched by prior plugins:

framework = (
    SketchFramework(storage=ParquetBackend("./sketches"))
    .register(InvestigativeAgentPlugin())   # root cause analysis
    .register(SlackAlertPlugin(webhook=WEBHOOK))  # needs owners from agent
)
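
Why registration order matters can be shown with a small stand-alone sketch. Everything here is hypothetical and independent of lakesense: `Result`, `Plugin`, `FindOwners`, `Notify`, and `run_chain` are illustrative stand-ins, and only the `should_run` / `run` method names mirror the plugin interface shown in this README.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class Result:
    severity: str
    metadata: dict = field(default_factory=dict)


class Plugin:
    def should_run(self, result: Result) -> bool:
        return True

    async def run(self, result: Result) -> Result:
        raise NotImplementedError


class FindOwners(Plugin):
    """Stands in for an agent that enriches the result."""
    async def run(self, result: Result) -> Result:
        result.metadata["owners"] = ["data-platform"]
        return result


class Notify(Plugin):
    """Stands in for an alerting plugin that needs owners to exist."""
    def should_run(self, result: Result) -> bool:
        return "owners" in result.metadata

    async def run(self, result: Result) -> Result:
        result.metadata["notified"] = list(result.metadata["owners"])
        return result


async def run_chain(plugins: list[Plugin], result: Result) -> Result:
    # Each plugin sees whatever earlier plugins added to the result.
    for plugin in plugins:
        if plugin.should_run(result):
            result = await plugin.run(result)
    return result


out = asyncio.run(run_chain([FindOwners(), Notify()], Result("alert")))
print(out.metadata)  # {'owners': ['data-platform'], 'notified': ['data-platform']}
```

With the two plugins registered in the opposite order, `Notify.should_run` sees no owners yet and is skipped, which is why the alerting plugin is registered after the agent.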

Sketch providers

Provider	Use case	Install
`PandasProvider`	Single-machine, local dev	`pip install lakesense`
`SparkProvider`	Distributed compute via `mapInPandas`	`pip install lakesense[spark]`
`StreamingProvider`	Incremental / micro-batch	`pip install lakesense`

Sketch types

Sketch	Use case	Merge cost
MinHash (Theta)	Text/set similarity, near-duplicate detection	O(num_perm)
HyperLogLog	Cardinality estimation (unique users, items)	O(registers)
KLL	Quantile estimation, distribution shape shifts	approx via sorted sample
Profile	Deterministic column metrics (nulls, ranges, categoricals)	scalar comparison
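
The O(registers) merge cost for HyperLogLog follows from how the sketch is combined: a register-wise maximum. The following is a textbook-style sketch under stated assumptions, not lakesense's implementation; `P`, `hll_sketch`, `hll_merge`, and `hll_estimate` are hypothetical names, and SHA-1 is used only as a convenient deterministic hash.

```python
import hashlib
import math

P = 12        # precision bits (illustrative)
M = 1 << P    # number of registers: 4096


def _hash64(value: str) -> int:
    return int.from_bytes(hashlib.sha1(value.encode()).digest()[:8], "big")


def hll_sketch(values) -> list[int]:
    regs = [0] * M
    for v in values:
        h = _hash64(v)
        idx = h >> (64 - P)                       # top P bits pick a register
        rest = h & ((1 << (64 - P)) - 1)          # remaining 52 bits
        rank = (64 - P) - rest.bit_length() + 1   # leading zeros + 1
        regs[idx] = max(regs[idx], rank)
    return regs


def hll_merge(a: list[int], b: list[int]) -> list[int]:
    """One pass over the registers, independent of how many rows were seen."""
    return [max(x, y) for x, y in zip(a, b)]


def hll_estimate(regs: list[int]) -> float:
    alpha = 0.7213 / (1 + 1.079 / M)
    raw = alpha * M * M / sum(2.0 ** -r for r in regs)
    zeros = regs.count(0)
    if raw <= 2.5 * M and zeros:
        return M * math.log(M / zeros)  # small-range correction
    return raw


day1 = [f"user_{i}" for i in range(0, 20_000)]
day2 = [f"user_{i}" for i in range(10_000, 30_000)]
merged = hll_merge(hll_sketch(day1), hll_sketch(day2))
print(round(hll_estimate(merged)))  # close to the 30,000 distinct users
```

Because the two days overlap by 10,000 users, adding the per-day counts would overcount; merging the sketches does not.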

Storage backends

Backend	Use case	Install
`ParquetBackend`	Zero-infra, local dev	`pip install lakesense`
`DuckDBBackend`	Local + SQL queries	`pip install lakesense[duckdb]`
`IcebergBackend`	Production lakehouse (v0.2)	`pip install lakesense[iceberg]`

Baseline strategies

from lakesense.sketches.merge import BaselineConfig, BaselineStrategy

# Rolling window — merge all runs in the last N days
BaselineConfig(dataset_id="ds", strategy=BaselineStrategy.ROLLING_WINDOW, window_days=7)

# Snapshot — pin a known-good run as reference
BaselineConfig(dataset_id="ds", strategy=BaselineStrategy.SNAPSHOT,
               snapshot_id="2024-01-15T00:00:00+00:00")

# EWMA — exponentially weight recent runs more
BaselineConfig(dataset_id="ds", strategy=BaselineStrategy.EWMA, decay_factor=0.85)
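
To show what exponential weighting does to a baseline, here is a small sketch applied to a scalar metric such as a column's null rate. How lakesense applies `decay_factor` to sketches internally is not described here, so treat this as an assumption-laden illustration; `ewma_baseline` is a hypothetical helper.

```python
def ewma_baseline(values_oldest_first: list[float], decay_factor: float = 0.85) -> float:
    """Weighted mean: the newest run has weight 1, each older run decays."""
    n = len(values_oldest_first)
    if n == 0:
        raise ValueError("need at least one historical value")
    # The run that is k steps older than the newest gets weight decay_factor ** k.
    weights = [decay_factor ** (n - 1 - i) for i in range(n)]
    total = sum(w * v for w, v in zip(weights, values_oldest_first))
    return total / sum(weights)


null_rates = [0.02, 0.02, 0.10]               # oldest to newest
rolling = sum(null_rates) / len(null_rates)   # equal-weight rolling window
ewma = ewma_baseline(null_rates)

# The recent jump pulls the exponentially weighted baseline up faster.
print(f"rolling={rolling:.4f} ewma={ewma:.4f}")
```

A lower `decay_factor` forgets old runs faster, which suits datasets whose normal behaviour shifts over time; a value near 1 behaves almost like the rolling window.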

Writing a custom plugin

from lakesense.core import SketchPlugin, InterpretationResult, Severity

class PagerDutyPlugin(SketchPlugin):
    def __init__(self, routing_key: str):
        self._key = routing_key

    def should_run(self, result: InterpretationResult) -> bool:
        return result.severity == Severity.ALERT and result.is_agent_enriched()

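    async def run(self, result: InterpretationResult) -> InterpretationResult:
        # _page is not defined in this snippet: implement it as a coroutine
        # that sends the event to PagerDuty using self._key.
        await self._page(result)
        result.metadata["pagerduty"] = "paged"
        return result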

Roadmap

- v0.1: core sketches, column profiles, Parquet + DuckDB storage, Tier 1 LLM interpret, Spark provider
- v0.2: agent plugin, DataHub lineage, Slack plugin, IcebergBackend
- v0.3: DeltaBackend, Airflow operator, OpenLineage support
- v0.4: JIRA plugin, column-level lineage

Contributing

See CONTRIBUTING.md. PRs welcome — especially new storage backends and plugins.

pip install -e ".[dev]"
pytest tests/unit/
ruff check .
mypy lakesense/

License

Apache 2.0 — see LICENSE.

Release history

0.3.0: May 6, 2026
0.2.3 (yanked; reason: incorrectly tagged): May 5, 2026
0.2.2: Apr 15, 2026
0.2.1: Mar 31, 2026
0.2.0: Mar 30, 2026
0.1.0 (this version): Mar 30, 2026

Download files

Source Distribution

lakesense-0.1.0.tar.gz (43.8 kB)

Uploaded Mar 30, 2026 Source

Built Distribution

lakesense-0.1.0-py3-none-any.whl (45.7 kB)

Uploaded Mar 30, 2026 Python 3

File details

Details for the file lakesense-0.1.0.tar.gz.

File metadata

Download URL: lakesense-0.1.0.tar.gz
Upload date: Mar 30, 2026
Size: 43.8 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for lakesense-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`95888b7bb093c1f5490252018265fbb5f55c333223dfb87ffa597290f24bb1fb`
MD5	`ffc78d310912874e9c7571ec82650506`
BLAKE2b-256	`cead1077f6d8ceb6db05dd05c6618a2601e3087e12112f54fc4249e22624c4b4`

Provenance

The following attestation bundles were made for lakesense-0.1.0.tar.gz:

Publisher: publish.yml on ramannanda9/lakesense

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: lakesense-0.1.0.tar.gz
- Subject digest: 95888b7bb093c1f5490252018265fbb5f55c333223dfb87ffa597290f24bb1fb
- Sigstore transparency entry: 1201078410
- Sigstore integration time: Mar 30, 2026
Source repository:
- Permalink: ramannanda9/lakesense@5a87813ffb085bb37786a2b8de8fcf38c42c6883
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/ramannanda9
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@5a87813ffb085bb37786a2b8de8fcf38c42c6883
- Trigger Event: push

File details

Details for the file lakesense-0.1.0-py3-none-any.whl.

File metadata

Download URL: lakesense-0.1.0-py3-none-any.whl
Upload date: Mar 30, 2026
Size: 45.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for lakesense-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`6dfe225cebd3c9c35a6c71d6faa57a720bec4de96202c243c6834901eb4d0bba`
MD5	`c4ddaaf1404347e456b59902f119caa7`
BLAKE2b-256	`aded971459a50094569a382de4c692d1dcbaea97fa060be223b3b65442833037`

Provenance

The following attestation bundles were made for lakesense-0.1.0-py3-none-any.whl:

Publisher: publish.yml on ramannanda9/lakesense

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: lakesense-0.1.0-py3-none-any.whl
- Subject digest: 6dfe225cebd3c9c35a6c71d6faa57a720bec4de96202c243c6834901eb4d0bba
- Sigstore transparency entry: 1201078415
- Sigstore integration time: Mar 30, 2026
Source repository:
- Permalink: ramannanda9/lakesense@5a87813ffb085bb37786a2b8de8fcf38c42c6883
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/ramannanda9
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@5a87813ffb085bb37786a2b8de8fcf38c42c6883
- Trigger Event: push
