Leakage & drift QA for ML datasets — catches target leakage, train/test contamination, distribution drift, temporal leakage, schema mismatches, and duplicate columns before they wreck your model.
LeakLens
Leakage & drift QA for ML datasets. Think Ruff or ESLint, but for the things that quietly invalidate an ML experiment instead of the things that break your code.
Most EDA tools answer "what does my data look like?" LeakLens answers a different question: "is my train/test split even valid?"
pip install leaklens
Why
The most common way an ML project fails isn't a bad model — it's a corrupted experiment. Target leakage, contaminated splits, drifted test sets, and preprocessing fit before the split all produce results that look great in your notebook and fall apart in production. LeakLens checks for exactly these failure modes, deterministically, with no invented "AI confidence scores."
Quick start
from leaklens import LeakLens
report = LeakLens(train_df, test_df, target="price").run()
report.summary() # prints + returns a plain-text summary
report.to_html("report.html") # full report with plotly visualizations
report.to_json() # structured output for CI pipelines
report.issues # raw list[Issue] if you want to handle them yourself
It also works with a single dataframe; the checks that compare train against test are skipped automatically:
report = LeakLens(df, target="churned").run()
Polars is supported — pass a polars.DataFrame and it's converted internally.
What it checks
| Check | What it catches |
|---|---|
| Target leakage | A feature that maps almost 1:1 to the target, or has suspiciously high correlation / Cramér's V with it |
| Train/test contamination | Identical rows appearing in both train and test (the "sampled both sets from the same pool" bug) |
| Distribution drift | KS test (numeric columns) and PSI (categorical columns) between train and test |
| Temporal leakage | Test-set dates that fall before the latest date in train — a split that isn't actually chronological |
| Duplicate columns | Near-perfectly correlated numeric columns, or categorical columns with identical values (e.g. price vs selling_price) |
| Schema mismatch | Columns present in one set but not the other |
| Dtype mismatch | The same column typed differently between train and test |
| Unseen categories | Categories in test that never appeared in train (breaks most encoders) |
| Constant / near-constant features | Columns with zero or near-zero variance |
| Preprocessing-before-split leakage | Static analysis of a training script — flags scaler.fit_transform(X) called before train_test_split() |
No fairness scores, no "deployment readiness %," no invented metrics — every issue maps to a specific, explainable calculation.
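As an illustration of what "a specific, explainable calculation" can look like, here is a minimal sketch of the Population Stability Index for one categorical column. It uses the textbook PSI formula in plain Python. The function name `categorical_psi` and the `eps` smoothing are assumptions made for this example, not LeakLens internals.

```python
import math
from collections import Counter


def categorical_psi(train_values, test_values, eps=1e-6):
    """Population Stability Index between two categorical samples.

    PSI = sum over categories of (q - p) * ln(q / p), where p and q are the
    category shares in train and test. `eps` floors empty shares so the
    logarithm stays finite; that smoothing choice is an assumption.
    """
    train_counts = Counter(train_values)
    test_counts = Counter(test_values)
    categories = set(train_counts) | set(test_counts)

    psi = 0.0
    for category in categories:
        p = max(train_counts[category] / len(train_values), eps)
        q = max(test_counts[category] / len(test_values), eps)
        psi += (q - p) * math.log(q / p)
    return psi


# Toy data: the test set is heavily skewed towards "NY".
train_city = ["NY"] * 50 + ["LA"] * 50
test_city = ["NY"] * 90 + ["LA"] * 10

same = categorical_psi(train_city, train_city)     # identical distributions
shifted = categorical_psi(train_city, test_city)   # skewed test distribution
```

Identical distributions give a PSI of zero, while the skewed toy sample lands well above 0.20, the `psi_critical` value used in the configuration example in this README.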
Preprocessing-leakage check (script analysis)
This check reads your training script rather than your data: it parses the script's AST and checks call order:
report = LeakLens(train_df, target="price", script="train.py").run()
❌ Line 6: 'scaler.fit(...)' is called before train_test_split() on line 8 —
this leaks test data into preprocessing statistics.
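The call-order idea can be sketched with the standard-library `ast` module. This is not LeakLens's implementation: the function name `find_fit_before_split` and the `FIT_METHODS` set are hypothetical, and the sketch compares line numbers only, so it assumes a flat, top-to-bottom script.

```python
import ast

# Assumed set of method names that count as "fitting" a preprocessor.
FIT_METHODS = {"fit", "fit_transform"}


def find_fit_before_split(source):
    """Return (fit_line, split_line) pairs where a fit call precedes the split."""
    tree = ast.parse(source)
    fit_lines = []
    split_lines = []

    for node in ast.walk(tree):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Method calls such as scaler.fit_transform(X)
        if isinstance(func, ast.Attribute) and func.attr in FIT_METHODS:
            fit_lines.append(node.lineno)
        # train_test_split(...) called either as a bare name or as an attribute
        name = func.attr if isinstance(func, ast.Attribute) else getattr(func, "id", None)
        if name == "train_test_split":
            split_lines.append(node.lineno)

    if not split_lines:
        return []
    first_split = min(split_lines)
    return [(line, first_split) for line in sorted(fit_lines) if line < first_split]


# The script is only parsed, never executed, so X does not need to exist.
SCRIPT = """\
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_train, X_test = train_test_split(X_scaled)
"""

findings = find_fit_before_split(SCRIPT)
```

Because the comparison is purely lexical, a fit call inside a function defined above the split but called after it would be flagged by this sketch. A production check would need to account for that.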
Configuration
Every threshold is tunable:
from leaklens import LeakLens, Config
config = Config(
    target_corr_threshold=0.90,
    ks_alpha=0.01,
    psi_critical=0.20,
)
report = LeakLens(train_df, test_df, target="price", config=config).run()
CLI (optional)
pip install leaklens[cli]
leaklens check train.csv test.csv --target price --report report.html
Output shape
Every finding is a plain dataclass — no magic strings:
from leaklens import Issue, Severity
Issue(
    title="Target Leakage",
    severity=Severity.CRITICAL,
    column="transaction_id",
    message="98.2% of values map to a single target value...",
    analyzer="target_leakage",
    details={"mapped_fraction": 0.982},
)
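This README does not define `mapped_fraction`. One plausible reading, used here purely as an assumption, is the share of rows whose feature value only ever appears with a single target value. A minimal sketch under that assumption, with a hypothetical helper name and toy data:

```python
from collections import defaultdict


def mapped_fraction(feature_values, target_values):
    """Share of rows whose feature value always co-occurs with one target value.

    Assumed definition for illustration; LeakLens may compute it differently.
    """
    targets_seen = defaultdict(set)
    for feature, target in zip(feature_values, target_values):
        targets_seen[feature].add(target)

    deterministic_rows = sum(
        1 for feature in feature_values if len(targets_seen[feature]) == 1
    )
    return deterministic_rows / len(feature_values)


# "gold" always maps to 0 and "free" always maps to 1; "trial" maps to both.
plan = ["gold", "gold", "free", "free", "trial", "trial"]
churned = [0, 0, 1, 1, 0, 1]

fraction = mapped_fraction(plan, churned)  # 4 of 6 rows are deterministic
```

Under this definition a column of unique identifiers scores 1.0 trivially, so a high value is a prompt to inspect the column, not proof of leakage.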
Project layout
leaklens/
├── core.py # LeakLens orchestrator
├── config.py # tunable thresholds
├── exceptions.py
├── report.py # HTML rendering
├── models/ # Issue, Report, Severity dataclasses
├── analyzers/ # one module per check, all return list[Issue]
├── visualizations/ # plotly figure builders
└── templates/ # report.html (jinja2)
Analyzers never print, plot, or score — they only emit Issue objects. Rendering is handled separately, so adding a new output format (or a new check) doesn't touch existing code.
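The separation described above might look roughly like this. The `Issue` and `Severity` classes below are stand-ins that copy the fields shown under "Output shape"; `Severity.WARNING`, `constant_feature_analyzer`, and `render_text` are hypothetical names, and the analyzer takes a plain dict of columns to keep the sketch dependency-free.

```python
from dataclasses import dataclass, field
from enum import Enum


class Severity(Enum):
    # Only CRITICAL appears in this README; WARNING is an assumed member.
    WARNING = "warning"
    CRITICAL = "critical"


@dataclass
class Issue:
    # Stand-in with the fields shown in "Output shape"; not the library's class.
    title: str
    severity: Severity
    column: str
    message: str
    analyzer: str
    details: dict = field(default_factory=dict)


def constant_feature_analyzer(columns):
    """Hypothetical analyzer: takes {column name: list of values}, returns issues."""
    issues = []
    for name, values in columns.items():
        n_unique = len(set(values))
        if n_unique <= 1:
            issues.append(
                Issue(
                    title="Constant Feature",
                    severity=Severity.WARNING,
                    column=name,
                    message=f"'{name}' has a single distinct value.",
                    analyzer="constant_features",
                    details={"n_unique": n_unique},
                )
            )
    return issues


def render_text(issues):
    """Rendering lives apart from analysis and only consumes Issue objects."""
    return [f"[{i.severity.name}] {i.column}: {i.message}" for i in issues]


issues = constant_feature_analyzer({"country": ["US", "US", "US"], "age": [31, 45, 28]})
lines = render_text(issues)
```

Since the analyzer returns data and the renderer only reads it, a JSON or HTML renderer could be added next to `render_text` without changing the analyzer.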
Roadmap
- v1.5: CLI polish, Jupyter widget output, more visualizations
- v2.0: GitHub Actions integration (fail a PR if leakage is detected), MLflow logging
Development
pip install -e ".[dev]"
pytest --cov=leaklens
License
MIT
File details
Details for the file leaklens-1.1.0.tar.gz.
File metadata
- Download URL: leaklens-1.1.0.tar.gz
- Upload date:
- Size: 37.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | adb7a8bc50c5beeb3d6e80cdf16b0cc8b9573df6030403af13f87d9cab705e06 |
| MD5 | ced8865b7f421459fc37774c7c0d4005 |
| BLAKE2b-256 | 2621751b2e070e14ca1c11d56b63e07eada44d7ba5a5fc6b1aef24a5d1d9540b |
File details
Details for the file leaklens-1.1.0-py3-none-any.whl.
File metadata
- Download URL: leaklens-1.1.0-py3-none-any.whl
- Upload date:
- Size: 40.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9e1f2d8baa259b0f39672a6fd6c070c7155530baac4117d98aec86a7ff7a543c |
| MD5 | bfc4a9efd268650f56ad6a6f4c4e2360 |
| BLAKE2b-256 | cca704e88efdc452b2a989db0dc5163c2b1fd06e8c1dbaa6b82effa68eeeadbb |