Leakage & drift QA for ML datasets — catches target leakage, train/test contamination, distribution drift, temporal leakage, schema mismatches, and duplicate columns before they wreck your model.

LeakLens

The pre-flight checklist for machine learning datasets.

Most EDA tools answer "what does my data look like?" LeakLens answers a different question: "is my train/test split even valid?"

Think Ruff or ESLint — but for the structural mistakes that quietly invalidate an ML experiment instead of the ones that break your code.

pip install leaklens

Why

The most common way an ML project fails isn't a bad model — it's a corrupted experiment. Target leakage, contaminated splits, drifted test sets, and preprocessing fit before the split all produce results that look great in your notebook and fall apart in production.

LeakLens catches exactly these failure modes — deterministically, with no invented scores or black-box metrics. Every finding maps to a specific, reproducible calculation you can explain in an interview or a code review.

Quick start

from leaklens import LeakLens

report = LeakLens(train_df, test_df, target="price").run()

report.summary()             # plain-text console output
report.to_html("report.html")  # full interactive dashboard
report.to_json()             # structured output for CI pipelines
report.issues                # raw List[Issue] for custom handling

Works with a single dataframe too — train/test comparison checks are skipped automatically:

report = LeakLens(df, target="churned").run()

Polars DataFrames are supported: pass them in as-is and they are converted to pandas internally.

What it checks

Check	Method	What it catches
Target Leakage	Correlation / Cramer's V / group mapping	Features that directly or indirectly encode the target
Missing Value Leakage	Point-biserial r / Cramer's V on missingness indicator	Columns where whether a value is missing correlates with the target
Train/Test Contamination	Full-row hash matching	Identical rows appearing in both splits
Distribution Drift	KS test (numeric) + PSI (categorical)	Train and test drawn from different distributions
Temporal Leakage	Date-column auto-detection + chronology check	Test-set dates that fall before the latest train date
Duplicate Columns	Correlation threshold + Series.equals()	Near-identical columns adding redundant collinearity
Schema Validation	Column diff + dtype comparison + category sets	Missing columns, dtype mismatches, unseen categories, constant features
Preprocessing Leakage	AST static analysis	Scalers, encoders, PCA, SMOTE etc. fit before `train_test_split()`
Class Imbalance	Majority/minority ratio	Skewed target distributions where accuracy alone is misleading
Target Drift	KS test (numeric) + PSI (categorical)	The target's own distribution shifting between train and test, distinct from feature drift
Multicollinearity	Variance Inflation Factor	Features predicted by a combination of others, not just pairwise duplicates
Invalid Values	Domain-rule checks + IQR	Negative values in non-negative columns, extreme outliers
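To illustrate the "full-row hash matching" row in the table, here is a minimal sketch of how train/test contamination can be detected with pandas. The helper name contaminated_rows is hypothetical and this is not LeakLens's actual implementation; it assumes both frames use the same dtypes for shared columns.

```python
import pandas as pd

def contaminated_rows(train: pd.DataFrame, test: pd.DataFrame) -> pd.DataFrame:
    """Return the test rows whose full-row hash also appears in train (hypothetical helper)."""
    shared = [c for c in train.columns if c in test.columns]
    # hash_pandas_object yields one uint64 per row; index=False ignores the index.
    train_hashes = pd.util.hash_pandas_object(train[shared], index=False)
    test_hashes = pd.util.hash_pandas_object(test[shared], index=False)
    return test[test_hashes.isin(train_hashes).to_numpy()]

train = pd.DataFrame({"age": [22, 35, 58], "fare": [7.25, 53.10, 26.55]})
test = pd.DataFrame({"age": [35, 41], "fare": [53.10, 8.05]})

leaked = contaminated_rows(train, test)
print(leaked)  # the (35, 53.10) row appears in both splits
```

Hashing whole rows keeps the comparison cheap on wide frames, at the cost of missing near-duplicates that differ in a single value.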

Every finding includes:

The raw metric value that triggered it (KS statistic, PSI, correlation, etc.)
The exact threshold it crossed (from your Config, not hidden)
Root cause analysis for drift findings (mean shift, variance shift, 95th-percentile shift)
A plain-English "why it matters" explanation
Concrete fix steps

No fairness scores, no deployment readiness %, no AI confidence badges — just real statistics.

The report

report.to_html("report.html") generates a self-contained interactive dashboard:

Verdict banner — DO NOT TRAIN / TRAIN WITH CAUTION / SAFE TO TRAIN, derived directly from critical/warning counts
Dataset fingerprint — schema hash + distribution hash in the header, so you can confirm two runs used identical data
Full drift ranking — every checked column shown, stable ones in grey, drifted ones in red/amber — not just the ones that failed
KDE distribution overlays — smooth kernel density curves for numeric drift, not blurry histograms
Root cause bullets — "Mean increased (29.4 → 55.3), 95th percentile shifted (54.9 → 83.5)"
Expandable issue accordions — detected value, threshold used, why it matters, fix steps
Severity-colored recommendation cards — one per issue type, concise and scannable
JSON export — one-click download of all findings as structured JSON
Print / Save as PDF — properly handles collapsed accordions, chart resolution, and color preservation

Root cause analysis

For numeric drift findings, LeakLens explains what shape of drift occurred — not just that it happened:

❌ Distribution Drift [Age]
   KS p-value = 0.0000

   Likely Cause:
   ✔ Mean increased (29.4 → 55.3)
   ✔ 95th percentile shifted (54.9 → 83.5)

Pure statistics, no model involved.
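A rough sketch of how this kind of output could be produced with a two-sample KS test followed by summary-statistic comparisons. The function name explain_numeric_drift, the rel_tol parameter, and the 10% relative-change rule are assumptions made for illustration; the source does not say how LeakLens decides which shifts to report.

```python
import numpy as np
from scipy.stats import ks_2samp

def explain_numeric_drift(train, test, alpha=0.05, rel_tol=0.10):
    """KS test plus simple summary comparisons (hypothetical helper)."""
    train = np.asarray(train, dtype=float)
    test = np.asarray(test, dtype=float)
    result = ks_2samp(train, test)
    causes = []
    if result.pvalue < alpha:
        summaries = {
            "Mean": np.mean,
            "Standard deviation": np.std,
            "95th percentile": lambda x: np.percentile(x, 95),
        }
        for label, stat in summaries.items():
            before, after = stat(train), stat(test)
            # Assumed rule: report a statistic only if it moved by more than rel_tol.
            if abs(after - before) > rel_tol * max(abs(before), 1e-12):
                direction = "increased" if after > before else "decreased"
                causes.append(f"{label} {direction} ({before:.1f} → {after:.1f})")
    return result.statistic, result.pvalue, causes

# Synthetic data: the test split is roughly 25 years older on average.
rng = np.random.default_rng(0)
train_age = rng.normal(30, 10, size=500)
test_age = rng.normal(55, 10, size=500)

ks_stat, p_value, causes = explain_numeric_drift(train_age, test_age)
print(f"KS statistic = {ks_stat:.3f}, p-value = {p_value:.4f}")
for cause in causes:
    print("  ", cause)
```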

Missing value leakage

A check most libraries don't have — detects when the absence of a value is itself predictive:

⚠ Missing Value Leakage [Cabin]
  Whether 'Cabin' is missing correlates with the target
  (point-biserial r = -0.313, p < 0.0001)

In the Titanic dataset, this fires automatically: cabin missingness genuinely correlates with survival, so any imputation strategy that fills in a "typical" value would either silently erase that signal or inadvertently leak it.
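The statistic itself is straightforward to reproduce. Below is a minimal sketch using scipy's pointbiserialr on a missingness indicator; the helper name missingness_signal and the synthetic frame are made up for illustration and are not taken from LeakLens or the real Titanic data.

```python
import pandas as pd
from scipy.stats import pointbiserialr

def missingness_signal(df, column, target):
    """Correlate 'is this value missing?' with the target (hypothetical helper)."""
    is_missing = df[column].isna().astype(int)
    if is_missing.nunique() < 2:
        return None  # never or always missing: nothing to correlate
    result = pointbiserialr(is_missing, df[target])
    return float(result[0]), float(result[1])

# Synthetic example: 'cabin' is missing for 80% of non-survivors
# but only 20% of survivors.
df = pd.DataFrame({
    "survived": [0] * 100 + [1] * 100,
    "cabin": [None] * 80 + ["C85"] * 20 + [None] * 20 + ["C85"] * 80,
})

r, p = missingness_signal(df, "cabin", "survived")
print(f"point-biserial r = {r:.3f}, p = {p:.2g}")
```

A negative r here means rows with a missing value are less likely to have a positive target, which is the same direction as the Cabin finding above.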

Dataset fingerprint

Every report includes a schema hash and distribution hash:

report.meta["train_fingerprint"]
# {'schema_hash': 'e0cf60009fb642c4', 'distribution_hash': '78e61c68b140fe5b',
#  'n_rows': 624, 'n_columns': 14}

Useful for confirming that two training runs actually used the same data, not silently different versions of a file.
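One way such a fingerprint could be built is sketched below. This is an assumption about the general idea, not LeakLens's actual hashing scheme: here the schema hash covers column names and dtypes, and the "distribution hash" is simply an order-insensitive hash of the row contents. The function name fingerprint is hypothetical.

```python
import hashlib

import numpy as np
import pandas as pd

def fingerprint(df: pd.DataFrame) -> dict:
    """Short, reproducible hashes of a frame's schema and contents (illustrative)."""
    schema = "|".join(f"{name}:{dtype}" for name, dtype in df.dtypes.items())
    schema_hash = hashlib.sha256(schema.encode()).hexdigest()[:16]

    # One uint64 per row, sorted so that row order does not change the result.
    row_hashes = np.sort(pd.util.hash_pandas_object(df, index=False).to_numpy())
    distribution_hash = hashlib.sha256(row_hashes.tobytes()).hexdigest()[:16]

    return {
        "schema_hash": schema_hash,
        "distribution_hash": distribution_hash,
        "n_rows": len(df),
        "n_columns": df.shape[1],
    }

df = pd.DataFrame({"age": [22, 35, 58], "fare": [7.25, 53.10, 26.55]})
edited = df.copy()
edited.loc[0, "age"] = 99  # same schema, different data

print(fingerprint(df))
print(fingerprint(edited))
```

With this scheme, an edited file keeps the same schema hash but gets a different distribution hash, which is the situation the fingerprint is meant to expose.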

Preprocessing leakage (AST check)

Parses your training script's syntax tree — no execution needed:

report = LeakLens(train_df, target="price", script="train.py").run()

Catches any of these fit before train_test_split():

StandardScaler  MinMaxScaler  RobustScaler
PCA  TruncatedSVD  KernelPCA
KNNImputer  SimpleImputer  IterativeImputer
LabelEncoder  OneHotEncoder  OrdinalEncoder  TargetEncoder
SMOTE  ADASYN  RandomOverSampler  RandomUnderSampler
SelectKBest  SelectPercentile  RFE  VarianceThreshold
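For a sense of how a check like this can work without running the script, here is a simplified sketch built on Python's ast module. It is not LeakLens's analyzer: the function name fits_before_split is hypothetical, only a few of the estimators above are included, and "before" is approximated by comparing source line numbers.

```python
import ast

# A subset of the estimators listed above, for illustration only.
TRANSFORMERS = {"StandardScaler", "MinMaxScaler", "PCA", "SimpleImputer", "SMOTE"}
FIT_METHODS = {"fit", "fit_transform", "fit_resample"}

def _callee(call):
    """Name of the function or class being called, if it is a simple name."""
    func = call.func
    if isinstance(func, ast.Name):
        return func.id
    if isinstance(func, ast.Attribute):
        return func.attr
    return None

def fits_before_split(source):
    """Return (line, class) pairs for transformers fit before train_test_split()."""
    nodes = list(ast.walk(ast.parse(source)))
    calls = [n for n in nodes if isinstance(n, ast.Call)]

    split_lines = [c.lineno for c in calls if _callee(c) == "train_test_split"]
    if not split_lines:
        return []  # no split in the script, nothing to compare against
    split_line = min(split_lines)

    # Map variable names to the transformer class assigned to them.
    instances = {}
    for node in nodes:
        if isinstance(node, ast.Assign) and isinstance(node.value, ast.Call):
            cls = _callee(node.value)
            if cls in TRANSFORMERS:
                for target in node.targets:
                    if isinstance(target, ast.Name):
                        instances[target.id] = cls

    findings = []
    for call in calls:
        if not (isinstance(call.func, ast.Attribute) and call.func.attr in FIT_METHODS):
            continue
        receiver = call.func.value
        cls = None
        if isinstance(receiver, ast.Name):
            cls = instances.get(receiver.id)
        elif isinstance(receiver, ast.Call) and _callee(receiver) in TRANSFORMERS:
            cls = _callee(receiver)  # e.g. StandardScaler().fit_transform(X)
        if cls and call.lineno < split_line:
            findings.append((call.lineno, cls))
    return sorted(findings)

# The script is only parsed, never executed, so X does not need to exist.
leaky_script = "\n".join([
    "from sklearn.model_selection import train_test_split",
    "from sklearn.preprocessing import StandardScaler",
    "scaler = StandardScaler()",
    "X_scaled = scaler.fit_transform(X)",
    "X_train, X_test = train_test_split(X_scaled)",
])
print(fits_before_split(leaky_script))
```

A line-number comparison is a deliberate simplification: it does not follow control flow or function boundaries, so a production check would need more care.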

Configuration

Every threshold is tunable — nothing is hardcoded:

from leaklens import LeakLens, Config

config = Config(
    target_corr_threshold=0.90,      # default 0.95
    ks_alpha=0.01,                   # default 0.05
    psi_critical=0.20,               # default 0.25
    missingness_corr_threshold=0.25, # default 0.30
    min_sample_size=20,              # default 30 — drift checks are skipped below this,
                                      # since KS/PSI results on tiny samples aren't reliable
)
report = LeakLens(train_df, test_df, target="price", config=config).run()
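To make the psi_critical threshold more concrete, here is a rough sketch of a Population Stability Index for a categorical column, using the standard formula: the sum over categories of (test share - train share) * ln(test share / train share). The function name categorical_psi and the eps smoothing for unseen categories are assumptions for illustration, not LeakLens internals.

```python
import numpy as np
import pandas as pd

def categorical_psi(train: pd.Series, test: pd.Series, eps: float = 1e-6) -> float:
    """Population Stability Index between two categorical samples (illustrative)."""
    categories = sorted(set(train.dropna()) | set(test.dropna()), key=str)
    p = train.value_counts(normalize=True).reindex(categories, fill_value=0.0).to_numpy()
    q = test.value_counts(normalize=True).reindex(categories, fill_value=0.0).to_numpy()
    # Clip so that categories absent from one split do not produce log(0).
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return float(np.sum((q - p) * np.log(q / p)))

train_plan = pd.Series(["A"] * 50 + ["B"] * 50)
test_plan = pd.Series(["A"] * 90 + ["B"] * 10)

psi_same = categorical_psi(train_plan, train_plan)
psi_shift = categorical_psi(train_plan, test_plan)
print(f"identical splits: {psi_same:.3f}, shifted split: {psi_shift:.3f}")
```

With the 50/50 to 90/10 shift above, the PSI comes out near 0.88, well past both the default psi_critical of 0.25 and the stricter 0.20 used in the Config example.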

CLI + GitHub Actions

pip install "leaklens[cli]"
leaklens train.csv test.csv --target price --report report.html --fail-on critical

--fail-on controls CI exit code: critical (default), warning, or none.
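The sketch below shows one plausible reading of these three settings. It is an assumption, not the CLI's actual code: the function name exit_code, the "info" level, the exit status 1, and the rule that warning also fails on critical findings are all illustrative.

```python
# Assumed severity ordering; only "warning" and "critical" appear in the docs above.
RANK = {"info": 0, "warning": 1, "critical": 2}

def exit_code(severities, fail_on="critical"):
    """Map the severities found in a run to a process exit code (hypothetical)."""
    if fail_on == "none":
        return 0
    if fail_on not in ("warning", "critical"):
        raise ValueError(f"unsupported --fail-on value: {fail_on!r}")
    # Fail if any finding is at least as severe as the chosen level.
    return int(any(RANK[s] >= RANK[fail_on] for s in severities))

found = ["warning", "warning"]
print(exit_code(found, fail_on="critical"))  # warnings alone do not fail the build
print(exit_code(found, fail_on="warning"))   # now they do
```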

Add these steps to a job in .github/workflows/leaklens-qa.yml to fail PRs that introduce data leakage:

- name: LeakLens QA
  run: |
    pip install "leaklens[cli]"
    leaklens data/train.csv data/test.csv \
      --target price \
      --report leaklens_report.html \
      --fail-on critical

- name: Upload report
  if: always()
  uses: actions/upload-artifact@v4
  with:
    name: leaklens-report
    path: leaklens_report.html

A full workflow template is included at .github/workflows/leaklens-qa.yml.

Output shape

Every finding is a structured dataclass — no magic strings, no parsing required:

Issue(
    title="Distribution Drift",
    severity=Severity.CRITICAL,
    column="Age",
    analyzer="drift",
    message="KS test p-value=0.0000 — distributions differ significantly.",
    details={
        "ks_stat": 0.660,
        "p_value": 0.0,
        "root_cause": ["Mean increased (29.4 → 55.3)", "95th percentile shifted (54.9 → 83.5)"],
    },
)
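Because findings are plain dataclasses, custom handling is ordinary Python. The sketch below defines stand-in Issue and Severity types that mirror the fields shown above so that it runs on its own; in real use these would come from the leaklens package, and the Severity members other than CRITICAL, as well as the helper critical_columns, are assumptions.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, List, Optional

class Severity(Enum):  # stand-in; member names besides CRITICAL are assumed
    INFO = "info"
    WARNING = "warning"
    CRITICAL = "critical"

@dataclass
class Issue:  # stand-in mirroring the fields shown above
    title: str
    severity: Severity
    column: Optional[str]
    analyzer: str
    message: str
    details: Dict[str, Any] = field(default_factory=dict)

def critical_columns(issues: List[Issue]) -> List[str]:
    """Columns with at least one critical finding, in first-seen order."""
    seen: List[str] = []
    for issue in issues:
        if issue.severity is Severity.CRITICAL and issue.column and issue.column not in seen:
            seen.append(issue.column)
    return seen

issues = [
    Issue("Distribution Drift", Severity.CRITICAL, "Age", "drift",
          "KS test p-value=0.0000", {"ks_stat": 0.660, "p_value": 0.0}),
    Issue("Missing Value Leakage", Severity.WARNING, "Cabin", "missing_value_leakage",
          "Missingness correlates with the target", {"r": -0.313}),
]
print(critical_columns(issues))
```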

Project layout

leaklens/
├── core.py                   # LeakLens orchestrator
├── config.py                 # all thresholds, fully overridable
├── exceptions.py
├── report.py                 # HTML rendering via Jinja2
├── cli.py                    # optional Typer CLI
├── utils.py                  # to_pandas(), fingerprint, identifier detection
├── models/                   # Issue, Report, Severity dataclasses
├── analyzers/                # one module per check, all return List[Issue]
│   ├── target_leakage.py
│   ├── contamination.py
│   ├── drift.py              # KS + PSI + root cause analysis
│   ├── temporal.py
│   ├── duplicate_columns.py
│   ├── schema.py
│   ├── missing_value_leakage.py
│   └── preprocessing.py      # AST-based static analysis
├── visualizations/           # Plotly KDE figure builders
└── templates/
    └── report.html           # full dashboard, single Jinja2 template

Analyzers never print, plot, or score — they only emit Issue objects. Rendering is handled separately, so adding a new output format or a new check doesn't touch existing detection logic.

Installation

# Core library
pip install leaklens

# With CLI (for GitHub Actions / terminal use)
pip install "leaklens[cli]"

# For development
pip install "leaklens[dev]"

Requirements: Python 3.9+, pandas, numpy, scipy, plotly, jinja2

Development

git clone https://github.com/Prajwal18py/leaklens
cd leaklens
pip install -e ".[dev]"
pytest tests/ -v

The exact test count can vary slightly by environment; run pytest tests/ -v to see it. The suite has one synthetic case per analyzer behavior, plus a dedicated fuzz suite (test_robustness.py) that throws pathological inputs at the library (all-null columns, duplicate column names, mixed dtypes, infinities, single-row frames) and asserts it never crashes. Every analyzer runs inside defensive error handling: if one check fails on unusual data, it is reported as a skipped check rather than taking down the whole report.
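The "skipped check" behavior can be pictured with a small sketch like the one below. It is not the LeakLens orchestrator: run_checks and both toy analyzers are hypothetical, and they return plain strings instead of Issue objects to keep the example short.

```python
import pandas as pd

def constant_columns(df):
    """Toy analyzer: report columns with at most one distinct value."""
    return [f"Constant feature: {c}" for c in df.columns if df[c].nunique(dropna=False) <= 1]

def fragile_check(df):
    """Toy analyzer that breaks on an all-null column."""
    return [f"Max of {c}: {int(df[c].max())}" for c in df.columns]

def run_checks(df, analyzers):
    """Run every analyzer; record a failure as a skipped check instead of raising."""
    issues, skipped = [], []
    for name, analyze in analyzers.items():
        try:
            issues.extend(analyze(df))
        except Exception as exc:  # deliberately broad: one bad check must not stop the rest
            skipped.append((name, f"{type(exc).__name__}: {exc}"))
    return issues, skipped

# Pathological input: one column is entirely null.
df = pd.DataFrame({"all_null": [None, None, None], "x": [1, 2, 3]})
issues, skipped = run_checks(df, {"constant": constant_columns, "fragile": fragile_check})

print(issues)
print(skipped)
```

Keeping the failure reason alongside the check name means the report can still say which check was skipped and why.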

What LeakLens is NOT

Not an EDA library (use ydata-profiling or Sweetviz for that)
Not an AutoML tool
Not a model trainer or evaluator
Not a fairness auditor
No LLM calls, no black-box scores, no invented metrics

LeakLens does one thing: validates that your train/test split is structurally sound before you train anything on it.

License

MIT © Prajwal A.

GitHub · PyPI · Issues

Release history

1.2.0 (this version): Jul 4, 2026
1.1.2: Jul 3, 2026
1.1.1: Jul 1, 2026
1.1.0: Jul 1, 2026
1.0.0: Jun 30, 2026

Download files

Source Distribution

leaklens-1.2.0.tar.gz (54.0 kB), uploaded Jul 4, 2026, Source

Built Distribution

leaklens-1.2.0-py3-none-any.whl (52.8 kB), uploaded Jul 4, 2026, Python 3

File details

Details for the file leaklens-1.2.0.tar.gz.

File metadata

Download URL: leaklens-1.2.0.tar.gz
Upload date: Jul 4, 2026
Size: 54.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.8

File hashes

Hashes for leaklens-1.2.0.tar.gz
Algorithm	Hash digest
SHA256	`326455835ec6bc72e8a5b6ac26d20cbc7cc267b744453c3f4743f2cb26501466`
MD5	`2dd746d250a2426d1b722b0e234ac5ce`
BLAKE2b-256	`1c7a6b792d91f42a64a07a68a97b5b9b346b41869e96b4ffa3c4e52e74f05875`

File details

Details for the file leaklens-1.2.0-py3-none-any.whl.

File metadata

Download URL: leaklens-1.2.0-py3-none-any.whl
Upload date: Jul 4, 2026
Size: 52.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.8

File hashes

Hashes for leaklens-1.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e3e46fccb43635ac86bff1dd2a53c8772dec282a544b0cb08a737b09bea4408f`
MD5	`d5a74866b90b0e09867c1272c25c164b`
BLAKE2b-256	`78f48c3bfe61a5eaac27da8adf1d4d5908011473b92abbc3c4dc092ddd676264`
