Leakage & drift QA for ML datasets — catches target leakage, train/test contamination, distribution drift, temporal leakage, schema mismatches, and duplicate columns before they wreck your model.

LeakLens

Leakage & drift QA for ML datasets. Think Ruff or ESLint, but for the things that quietly invalidate an ML experiment instead of the things that break your code.

Most EDA tools answer "what does my data look like?" LeakLens answers a different question: "is my train/test split even valid?"

pip install leaklens

Why

The most common way an ML project fails isn't a bad model — it's a corrupted experiment. Target leakage, contaminated splits, drifted test sets, and preprocessing fit before the split all produce results that look great in your notebook and fall apart in production. LeakLens checks for exactly these failure modes, deterministically, with no invented "AI confidence scores."

Quick start

import pandas as pd
from leaklens import LeakLens

# Any pair of pandas DataFrames works; CSV files are only an example.
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

report = LeakLens(train_df, test_df, target="price").run()

report.summary()                # prints + returns a plain-text summary
report.to_html("report.html")   # full report with plotly visualizations
report.to_json()                # structured output for CI pipelines
report.issues                   # raw list[Issue] if you want to handle them yourself

Works with a single dataframe too (drops the train/test-comparison checks automatically):

report = LeakLens(df, target="churned").run()

Polars is supported — pass a polars.DataFrame and it's converted internally.

What it checks

Check	What it catches
Target leakage	A feature that maps almost 1:1 to the target, or has suspiciously high correlation / Cramér's V with it
Train/test contamination	Identical rows appearing in both train and test (the "sampled both sets from the same pool" bug)
Distribution drift	KS test (numeric columns) and PSI (categorical columns) between train and test
Temporal leakage	Test-set dates that fall before the latest date in train — a split that isn't actually chronological
Duplicate columns	Near-perfectly correlated numeric columns, or categorical columns with identical values (`price` vs `selling_price`)
Schema mismatch	Columns present in one set but not the other
Dtype mismatch	The same column typed differently between train and test
Unseen categories	Categories in test that never appeared in train (breaks most encoders)
Constant / near-constant features	Columns with zero or near-zero variance
Preprocessing-before-split leakage	Static analysis of a training script — flags `scaler.fit_transform(X)` called before `train_test_split()`

No fairness scores, no "deployment readiness %," no invented metrics — every issue maps to a specific, explainable calculation.
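
To make "specific, explainable calculation" concrete, here is a rough pandas sketch of two of the checks in the table: train/test contamination and temporal leakage. It is not LeakLens's internal code. The helper names (`shared_rows`, `early_test_rows`) and the sample columns (`id`, `sold_on`) are made up for illustration.

```python
import pandas as pd


def shared_rows(train: pd.DataFrame, test: pd.DataFrame) -> pd.DataFrame:
    """Distinct rows that appear in both frames (exact match on the shared columns)."""
    cols = [c for c in train.columns if c in test.columns]
    return train[cols].drop_duplicates().merge(
        test[cols].drop_duplicates(), on=cols, how="inner"
    )


def early_test_rows(train: pd.DataFrame, test: pd.DataFrame, date_col: str) -> pd.DataFrame:
    """Test rows dated before the latest date in train."""
    cutoff = pd.to_datetime(train[date_col]).max()
    return test[pd.to_datetime(test[date_col]) < cutoff]


# Hypothetical data: row id=3 sits in both sets, row id=4 predates the train cutoff.
train = pd.DataFrame({
    "id": [1, 2, 3],
    "sold_on": ["2024-01-05", "2024-02-10", "2024-03-15"],
    "price": [100, 150, 200],
})
test = pd.DataFrame({
    "id": [3, 4],
    "sold_on": ["2024-03-15", "2024-02-20"],
    "price": [200, 175],
})

overlap = shared_rows(train, test)
early = early_test_rows(train, test, "sold_on")
print(f"{len(overlap)} contaminated row(s), {len(early)} test row(s) before the train cutoff")
```

Both results are plain row sets, so a finding can always be traced back to the exact records that triggered it.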

Preprocessing-leakage check (script analysis)

This check reads your training script rather than your data: it parses the script's AST and checks call order. Pass the script path alongside the dataframe:

report = LeakLens(train_df, target="price", script="train.py").run()

❌ Line 6: 'scaler.fit(...)' is called before train_test_split() on line 8 —
   this leaks test data into preprocessing statistics.
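
For a feel of how a call-order check like this can work, here is a minimal sketch built on Python's standard `ast` module. It is an illustration under simplifying assumptions, not LeakLens's actual analyzer: the function name `fits_before_split` and the `FIT_METHODS` set are hypothetical, and it compares line numbers only, without tracking which variable is being fit.

```python
import ast

# Assumption: these method names are treated as "fitting a preprocessor".
FIT_METHODS = {"fit", "fit_transform"}


def fits_before_split(source: str) -> list[tuple[int, int]]:
    """Return (fit_line, split_line) pairs where a fit call precedes the first split."""
    fit_lines: list[int] = []
    split_lines: list[int] = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        if isinstance(func, ast.Attribute) and func.attr in FIT_METHODS:
            fit_lines.append(node.lineno)
        elif isinstance(func, ast.Name) and func.id == "train_test_split":
            split_lines.append(node.lineno)
        elif isinstance(func, ast.Attribute) and func.attr == "train_test_split":
            split_lines.append(node.lineno)
    if not split_lines:
        return []  # no split in the script, so there is nothing to compare against
    first_split = min(split_lines)
    return [(line, first_split) for line in sorted(fit_lines) if line < first_split]


# The script is only parsed, never executed, so undefined names like X are fine.
script = """\
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_train, X_test = train_test_split(X_scaled)
"""

for fit_line, split_line in fits_before_split(script):
    print(f"Line {fit_line}: fit call before train_test_split() on line {split_line}")
```

A line-number comparison is deliberately crude: it would also flag a `fit` inside a helper function defined above the split, even if that helper is only called afterwards.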

Configuration

Every threshold is tunable:

from leaklens import LeakLens, Config

config = Config(
    target_corr_threshold=0.90,
    ks_alpha=0.01,
    psi_critical=0.20,
)
report = LeakLens(train_df, test_df, target="price", config=config).run()
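
To give a sense of what a threshold like `psi_critical=0.20` is compared against, here is the conventional Population Stability Index for a categorical column, in plain Python. How LeakLens computes PSI internally is not documented here, so treat this as a sketch; the `eps` floor for categories missing from one side is an assumption of this example.

```python
import math
from collections import Counter


def psi(train_values, test_values, eps=1e-6):
    """Population Stability Index between two samples of a categorical column."""
    train_counts, test_counts = Counter(train_values), Counter(test_values)
    n_train, n_test = len(train_values), len(test_values)
    total = 0.0
    for category in set(train_counts) | set(test_counts):
        # Floor each share at eps so a category missing on one side doesn't hit log(0).
        p = max(train_counts[category] / n_train, eps)
        q = max(test_counts[category] / n_test, eps)
        total += (q - p) * math.log(q / p)
    return total


# Hypothetical column: a 50/50 mix in train shifts to 80/20 in test.
train_col = ["a"] * 50 + ["b"] * 50
test_col = ["a"] * 80 + ["b"] * 20

score = psi(train_col, test_col)
print(f"PSI = {score:.3f}")  # about 0.416, well above a 0.20 threshold
```

Identical distributions give a PSI of 0, and the score grows as the category shares move apart.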

CLI (optional)

pip install "leaklens[cli]"
leaklens check train.csv test.csv --target price --report report.html

Output shape

Every finding is a plain dataclass — no magic strings:

from leaklens import Issue, Severity

Issue(
    title="Target Leakage",
    severity=Severity.CRITICAL,
    column="transaction_id",
    message="98.2% of values map to a single target value...",
    analyzer="target_leakage",
    details={"mapped_fraction": 0.982},
)
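
Because findings are plain objects, gating a CI job on them takes a few lines. The sketch below uses stand-in `Issue` and `Severity` classes so it runs without LeakLens installed. Only `Severity.CRITICAL` and the six fields shown above come from this README; the other severity levels, their ordering, and the `should_fail` helper are assumptions.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Optional


class Severity(IntEnum):
    """Stand-in enum. Only CRITICAL is confirmed; INFO and WARNING are assumed."""
    INFO = 1
    WARNING = 2
    CRITICAL = 3


@dataclass
class Issue:
    """Stand-in mirroring the fields shown in the README example."""
    title: str
    severity: Severity
    column: Optional[str]
    message: str
    analyzer: str
    details: dict = field(default_factory=dict)


def should_fail(issues, threshold=Severity.CRITICAL) -> bool:
    """True if any issue is at or above the given severity."""
    return any(issue.severity >= threshold for issue in issues)


issues = [
    Issue("Target Leakage", Severity.CRITICAL, "transaction_id",
          "98.2% of values map to a single target value...",
          "target_leakage", {"mapped_fraction": 0.982}),
    Issue("Constant feature", Severity.WARNING, "currency",
          "column has a single distinct value", "constant_features"),
]

exit_code = 1 if should_fail(issues) else 0
print(f"exit code: {exit_code}")
```

With the real package, the same loop would presumably run over `report.issues`, with `sys.exit(exit_code)` as the last line of the CI script.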

Project layout

leaklens/
├── core.py              # LeakLens orchestrator
├── config.py            # tunable thresholds
├── exceptions.py
├── report.py            # HTML rendering
├── models/              # Issue, Report, Severity dataclasses
├── analyzers/           # one module per check, all return list[Issue]
├── visualizations/      # plotly figure builders
└── templates/           # report.html (jinja2)

Analyzers never print, plot, or score — they only emit Issue objects. Rendering is handled separately, so adding a new output format (or a new check) doesn't touch existing code.
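
The separation can be pictured with a toy version of the pattern: analyzers are functions that return issues, an orchestrator concatenates them, and renderers only ever see the resulting list. Everything here (the trimmed `Issue`, `constant_features`, `duplicate_columns`, `run`, `render_json`) is a hypothetical stand-in, not the package's real module contents.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Issue:
    """Trimmed stand-in for the real Issue dataclass."""
    title: str
    column: str
    message: str
    analyzer: str


def constant_features(columns: dict) -> list:
    """Flag columns with exactly one distinct value."""
    return [
        Issue("Constant feature", name, f"only one distinct value: {values[0]!r}",
              "constant_features")
        for name, values in columns.items()
        if len(set(values)) == 1
    ]


def duplicate_columns(columns: dict) -> list:
    """Flag a column whose values are identical to an earlier column's."""
    issues = []
    names = list(columns)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if columns[first] == columns[second]:
                issues.append(Issue("Duplicate column", second,
                                    f"identical to {first!r}", "duplicate_columns"))
    return issues


def run(columns: dict, analyzers) -> list:
    """Orchestrator: call every analyzer and concatenate what they emit."""
    return [issue for analyze in analyzers for issue in analyze(columns)]


def render_json(issues) -> str:
    """One of possibly many renderers; it never touches the analyzers."""
    return json.dumps([asdict(issue) for issue in issues], indent=2)


data = {"price": [10, 12, 9], "currency": ["USD", "USD", "USD"], "selling_price": [10, 12, 9]}
found = run(data, [constant_features, duplicate_columns])
print(render_json(found))
```

Adding a check means appending one function to the analyzer list, and adding an output format means writing one more `render_*` function; neither change touches the other side.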

Roadmap

v1.5: CLI polish, Jupyter widget output, more visualizations
v2.0: GitHub Actions integration (fail a PR if leakage is detected), MLflow logging

Development

pip install -e ".[dev]"
pytest --cov=leaklens

License

MIT

Release history

1.2.0: Jul 4, 2026
1.1.2: Jul 3, 2026
1.1.1: Jul 1, 2026
1.1.0: Jul 1, 2026
1.0.0: Jun 30, 2026 (the version described on this page)

Download files

Source Distribution

leaklens-1.0.0.tar.gz (33.2 kB)

Uploaded Jun 30, 2026 Source

Built Distribution

leaklens-1.0.0-py3-none-any.whl (36.1 kB)

Uploaded Jun 30, 2026 Python 3

File details

Details for the file leaklens-1.0.0.tar.gz.

File metadata

Download URL: leaklens-1.0.0.tar.gz
Upload date: Jun 30, 2026
Size: 33.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.8

File hashes

Hashes for leaklens-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`c1c1023990794011cd4c6b90a1c0267c6ac20a810ab39f840a91b48fa58a7460`
MD5	`799d63442580950beb5cf5a6c38a37c8`
BLAKE2b-256	`0d7a84f47a55e99cca98808b33d34bbada722348104e3b6388f94fca7b20ddc7`

File details

Details for the file leaklens-1.0.0-py3-none-any.whl.

File metadata

Download URL: leaklens-1.0.0-py3-none-any.whl
Upload date: Jun 30, 2026
Size: 36.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.8

File hashes

Hashes for leaklens-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e907de62c34e25a37b65a7f2cf26ba91dab6acb192364a07bae98e86ec0775fd`
MD5	`c76e83957e59f7d5a49db0b0ed4a7b8a`
BLAKE2b-256	`53d6cb501140cc998f640fa4979af4f961f7586da43179ad8e03fff98816dde0`
