Leakage & drift QA for ML datasets — catches target leakage, train/test contamination, distribution drift, temporal leakage, schema mismatches, and duplicate columns before they wreck your model.

LeakLens

Leakage & drift QA for ML datasets. Think Ruff or ESLint, but for the things that quietly invalidate an ML experiment instead of the things that break your code.

Most EDA tools answer "what does my data look like?" LeakLens answers a different question: "is my train/test split even valid?"

pip install leaklens

Why

The most common way an ML project fails isn't a bad model — it's a corrupted experiment. Target leakage, contaminated splits, drifted test sets, and preprocessing fit before the split all produce results that look great in your notebook and fall apart in production. LeakLens checks for exactly these failure modes, deterministically, with no invented "AI confidence scores."

Quick start

from leaklens import LeakLens

report = LeakLens(train_df, test_df, target="price").run()

report.summary()          # prints + returns a plain-text summary
report.to_html("report.html")   # full report with plotly visualizations
report.to_json()          # structured output for CI pipelines
report.issues             # raw list[Issue] if you want to handle them yourself

LeakLens also works with a single dataframe; the checks that compare train against test are skipped automatically:

report = LeakLens(df, target="churned").run()

Polars is supported — pass a polars.DataFrame and it's converted internally.

What it checks

| Check | What it catches |
| --- | --- |
| Target leakage | A feature that maps almost 1:1 to the target, or has suspiciously high correlation / Cramér's V with it |
| Train/test contamination | Identical rows appearing in both train and test (the "sampled both sets from the same pool" bug) |
| Distribution drift | KS test (numeric columns) and PSI (categorical columns) between train and test |
| Temporal leakage | Test-set dates that fall before the latest date in train, i.e. a split that isn't actually chronological |
| Duplicate columns | Near-perfectly correlated numeric columns, or categorical columns with identical values (price vs selling_price) |
| Schema mismatch | Columns present in one set but not the other |
| Dtype mismatch | The same column typed differently between train and test |
| Unseen categories | Categories in test that never appeared in train (breaks most encoders) |
| Constant / near-constant features | Columns with zero or near-zero variance |
| Preprocessing-before-split leakage | Static analysis of a training script: flags scaler.fit_transform(X) called before train_test_split() |

No fairness scores, no "deployment readiness %," no invented metrics — every issue maps to a specific, explainable calculation.
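
To make "a specific, explainable calculation" concrete, here is a minimal standard-library sketch of the Population Stability Index (PSI) for one categorical column. It is not LeakLens's implementation: the function name `psi`, the `eps` floor for categories missing from one side, and the sample data are all assumptions made for illustration.

```python
import math
from collections import Counter


def psi(train_values, test_values, eps=1e-6):
    """Population Stability Index between two non-empty categorical samples."""
    train_counts = Counter(train_values)
    test_counts = Counter(test_values)
    categories = set(train_counts) | set(test_counts)

    total = 0.0
    for cat in categories:
        # Share of each category, floored at eps so the logarithm is defined
        # when a category is absent from one side (assumed smoothing choice).
        p_train = max(train_counts[cat] / len(train_values), eps)
        p_test = max(test_counts[cat] / len(test_values), eps)
        total += (p_test - p_train) * math.log(p_test / p_train)
    return total


# Hypothetical samples: one test set matches train, the other is skewed.
train = ["a"] * 50 + ["b"] * 50
same = ["a"] * 5 + ["b"] * 5
shifted = ["a"] * 9 + ["b"] * 1

print(psi(train, same))
print(psi(train, shifted))
```

Identical category shares give a PSI of zero, and the skewed sample lands well above the 0.20 value used for psi_critical in the Configuration example.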

Preprocessing-leakage check (script analysis)

This check works on your training script rather than the data. It parses the script's AST and checks call order:

report = LeakLens(train_df, target="price", script="train.py").run()

Example finding:

❌ Line 6: 'scaler.fit(...)' is called before train_test_split() on line 8 —
   this leaks test data into preprocessing statistics.
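
The idea of checking call order on an AST can be sketched with Python's built-in `ast` module. This is an illustrative approximation, not LeakLens's analyzer: the function name `find_fit_before_split`, the `FIT_METHODS` set, and the sample script are assumptions, and the sketch compares line numbers only, so it ignores control flow and function boundaries.

```python
import ast

# Assumed set of method names treated as "fitting" a preprocessor.
FIT_METHODS = {"fit", "fit_transform"}


def find_fit_before_split(source):
    """Return (fit_line, method, split_line) for fit calls before the first split."""
    tree = ast.parse(source)
    split_lines = []
    fit_calls = []

    for node in ast.walk(tree):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Handles both train_test_split(...) and module.train_test_split(...).
        name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
        if name == "train_test_split":
            split_lines.append(node.lineno)
        elif isinstance(func, ast.Attribute) and func.attr in FIT_METHODS:
            fit_calls.append((node.lineno, func.attr))

    if not split_lines:
        return []
    first_split = min(split_lines)
    return sorted(
        (line, method, first_split)
        for line, method in fit_calls
        if line < first_split
    )


# Hypothetical training script; it is parsed, never executed.
SCRIPT = """\
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_train, X_test = train_test_split(X_scaled)
model.fit(X_train, y_train)
"""

for fit_line, method, split_line in find_fit_before_split(SCRIPT):
    print(f"Line {fit_line}: '{method}' is called before train_test_split() on line {split_line}")
```

The `model.fit` call after the split is not reported, because only fit calls on earlier lines than the first split are collected.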

Configuration

Every threshold is tunable:

from leaklens import LeakLens, Config

config = Config(
    target_corr_threshold=0.90,
    ks_alpha=0.01,
    psi_critical=0.20,
)
report = LeakLens(train_df, test_df, target="price", config=config).run()

CLI (optional)

pip install "leaklens[cli]"
leaklens check train.csv test.csv --target price --report report.html

Output shape

Every finding is a plain dataclass — no magic strings:

from leaklens import Issue, Severity

Issue(
    title="Target Leakage",
    severity=Severity.CRITICAL,
    column="transaction_id",
    message="98.2% of values map to a single target value...",
    analyzer="target_leakage",
    details={"mapped_fraction": 0.982},
)
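
Because findings are plain dataclasses, triaging them takes ordinary Python. The sketch below uses stand-in `Issue` and `Severity` definitions so it runs without LeakLens installed. Only `Severity.CRITICAL` and the six field names come from the example above; the `INFO` and `WARNING` levels, their ordering, the `blocking_issues` helper, and the second issue are assumptions.

```python
from dataclasses import dataclass, field
from enum import Enum


class Severity(Enum):
    # Only CRITICAL appears in the README; the other levels are assumed.
    INFO = 1
    WARNING = 2
    CRITICAL = 3


@dataclass
class Issue:
    # Stand-in mirroring the fields shown in the README example.
    title: str
    severity: Severity
    column: str
    message: str
    analyzer: str
    details: dict = field(default_factory=dict)


def blocking_issues(issues, minimum=Severity.CRITICAL):
    """Keep only the issues at or above the given severity (hypothetical helper)."""
    return [issue for issue in issues if issue.severity.value >= minimum.value]


issues = [
    Issue(
        title="Target Leakage",
        severity=Severity.CRITICAL,
        column="transaction_id",
        message="98.2% of values map to a single target value...",
        analyzer="target_leakage",
        details={"mapped_fraction": 0.982},
    ),
    # Hypothetical second finding, included only to show the filter at work.
    Issue(
        title="Unseen Categories",
        severity=Severity.WARNING,
        column="region",
        message="...",
        analyzer="unseen_categories",
    ),
]

for issue in blocking_issues(issues):
    print(f"{issue.severity.name}: {issue.column} ({issue.analyzer})")
```

With the real package, the same filter would presumably run over report.issues, for example to decide whether a CI job should fail.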

Project layout

leaklens/
├── core.py               # LeakLens orchestrator
├── config.py             # tunable thresholds
├── exceptions.py
├── report.py             # HTML rendering
├── models/               # Issue, Report, Severity dataclasses
├── analyzers/            # one module per check, all return list[Issue]
├── visualizations/       # plotly figure builders
└── templates/            # report.html (jinja2)

Analyzers never print, plot, or score — they only emit Issue objects. Rendering is handled separately, so adding a new output format (or a new check) doesn't touch existing code.
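
The separation described above might look roughly like the following sketch, which pairs one analyzer with one renderer. Nothing here is taken from the LeakLens source: the stand-in `Issue` class (with a plain string for severity), the `unseen_categories` and `render_text` functions, and the dict-of-lists input format are all assumptions chosen to keep the example free of dependencies.

```python
from dataclasses import dataclass, field


@dataclass
class Issue:
    # Stand-in for the real dataclass; severity is a plain string here.
    title: str
    severity: str
    column: str
    message: str
    analyzer: str
    details: dict = field(default_factory=dict)


def unseen_categories(train, test):
    """Analyzer: emit one Issue per shared column with test-only values."""
    issues = []
    for column in sorted(train.keys() & test.keys()):
        unseen = sorted(set(test[column]) - set(train[column]))
        if unseen:
            issues.append(
                Issue(
                    title="Unseen Categories",
                    severity="warning",
                    column=column,
                    message=f"{len(unseen)} test value(s) never appear in train",
                    analyzer="unseen_categories",
                    details={"unseen": unseen},
                )
            )
    return issues


def render_text(issues):
    """Renderer: formatting lives here, so analyzers stay free of output code."""
    return "\n".join(f"[{i.severity}] {i.column}: {i.message}" for i in issues)


# Hypothetical columns, given as plain lists keyed by column name.
train = {"city": ["Oslo", "Lima"], "plan": ["free", "pro"]}
test = {"city": ["Oslo", "Cairo"], "plan": ["pro"]}

found = unseen_categories(train, test)
print(render_text(found))
```

Adding a second output format under this layout means writing another function like `render_text`, with no change to the analyzer.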

Roadmap

  • v1.5: CLI polish, Jupyter widget output, more visualizations
  • v2.0: GitHub Actions integration (fail a PR if leakage is detected), MLflow logging

Development

pip install -e ".[dev]"
pytest --cov=leaklens

License

MIT
