Skip to main content

Leakage & drift QA for ML datasets — catches target leakage, train/test contamination, distribution drift, temporal leakage, schema mismatches, and duplicate columns before they wreck your model.

Project description

LeakLens

The pre-flight checklist for machine learning datasets.

Most EDA tools answer "what does my data look like?" LeakLens answers a different question: "is my train/test split even valid?"

Think Ruff or ESLint — but for the structural mistakes that quietly invalidate an ML experiment instead of the ones that break your code.

pip install leaklens

PyPI version Python License: MIT


LeakLens Dashboard

LeakLens Banner

LeakLens Clean

Why

The most common way an ML project fails isn't a bad model — it's a corrupted experiment. Target leakage, contaminated splits, drifted test sets, and preprocessing fit before the split all produce results that look great in your notebook and fall apart in production.

LeakLens catches exactly these failure modes — deterministically, with no invented scores or black-box metrics. Every finding maps to a specific, reproducible calculation you can explain in an interview or a code review.


Quick start

from leaklens import LeakLens

report = LeakLens(train_df, test_df, target="price").run()

report.summary()             # plain-text console output
report.to_html("report.html")  # full interactive dashboard
report.to_json()             # structured output for CI pipelines
report.issues                # raw List[Issue] for custom handling

Works with a single dataframe too — train/test comparison checks are skipped automatically:

report = LeakLens(df, target="churned").run()

Polars DataFrames are supported — passed in as-is and converted internally.


What it checks

Check Method What it catches
Target Leakage Correlation / Cramer's V / group mapping Features that directly or indirectly encode the target
Missing Value Leakage Point-biserial r / Cramer's V on missingness indicator Columns where whether a value is missing correlates with the target
Train/Test Contamination Full-row hash matching Identical rows appearing in both splits
Distribution Drift KS test (numeric) + PSI (categorical) Train and test drawn from different distributions
Temporal Leakage Date-column auto-detection + chronology check Test-set dates that fall before the latest train date
Duplicate Columns Correlation threshold + Series.equals() Near-identical columns adding redundant collinearity
Schema Validation Column diff + dtype comparison + category sets Missing columns, dtype mismatches, unseen categories, constant features
Preprocessing Leakage AST static analysis Scalers, encoders, PCA, SMOTE etc. fit before train_test_split()
Class Imbalance Majority/minority ratio Skewed target distributions where accuracy alone is misleading
Target Drift KS test (numeric) + PSI (categorical) The target's own distribution shifting between train and test, distinct from feature drift
Multicollinearity Variance Inflation Factor Features predicted by a combination of others, not just pairwise duplicates
Invalid Values Domain-rule checks + IQR Negative values in non-negative columns, extreme outliers

Every finding includes:

  • The raw metric value that triggered it (KS statistic, PSI, correlation, etc.)
  • The exact threshold it crossed (from your Config, not hidden)
  • Root cause analysis for drift findings (mean shift, variance shift, 95th-percentile shift)
  • A plain-English "why it matters" explanation
  • Concrete fix steps

No fairness scores, no deployment readiness %, no AI confidence badges — just real statistics.


The report

report.to_html("report.html") generates a self-contained interactive dashboard:

  • Verdict bannerDO NOT TRAIN / TRAIN WITH CAUTION / SAFE TO TRAIN, derived directly from critical/warning counts
  • Dataset fingerprint — schema hash + distribution hash in the header, so you can confirm two runs used identical data
  • Full drift ranking — every checked column shown, stable ones in grey, drifted ones in red/amber — not just the ones that failed
  • KDE distribution overlays — smooth kernel density curves for numeric drift, not blurry histograms
  • Root cause bullets — "Mean increased (29.4 → 55.3), 95th percentile shifted (54.9 → 83.5)"
  • Expandable issue accordions — detected value, threshold used, why it matters, fix steps
  • Severity-colored recommendation cards — one per issue type, concise and scannable
  • JSON export — one-click download of all findings as structured JSON
  • Print / Save as PDF — properly handles collapsed accordions, chart resolution, and color preservation

Root cause analysis

For numeric drift findings, LeakLens explains what shape of drift occurred — not just that it happened:

❌ Distribution Drift [Age]
   KS p-value = 0.0000

   Likely Cause:
   ✔ Mean increased (29.4 → 55.3)
   ✔ 95th percentile shifted (54.9 → 83.5)

Pure statistics, no model involved.


Missing value leakage

A check most libraries don't have — detects when the absence of a value is itself predictive:

⚠ Missing Value Leakage [Cabin]
  Whether 'Cabin' is missing correlates with the target
  (point-biserial r = -0.313, p < 0.0001)

In the Titanic dataset, this fires automatically — cabin missingness genuinely correlates with survival, and any imputation strategy that fills in a "typical" value would silently erase that signal or inadvertently leak it.


Dataset fingerprint

Every report includes a schema hash and distribution hash:

report.meta["train_fingerprint"]
# {'schema_hash': 'e0cf60009fb642c4', 'distribution_hash': '78e61c68b140fe5b',
#  'n_rows': 624, 'n_columns': 14}

Useful for confirming that two training runs actually used the same data, not silently different versions of a file.


Preprocessing leakage (AST check)

Parses your training script's syntax tree — no execution needed:

report = LeakLens(train_df, target="price", script="train.py").run()

Catches any of these fit before train_test_split():

StandardScaler  MinMaxScaler  RobustScaler
PCA  TruncatedSVD  KernelPCA
KNNImputer  SimpleImputer  IterativeImputer
LabelEncoder  OneHotEncoder  OrdinalEncoder  TargetEncoder
SMOTE  ADASYN  RandomOverSampler  RandomUnderSampler
SelectKBest  SelectPercentile  RFE  VarianceThreshold

Configuration

Every threshold is tunable — nothing is hardcoded:

from leaklens import LeakLens, Config

config = Config(
    target_corr_threshold=0.90,      # default 0.95
    ks_alpha=0.01,                   # default 0.05
    psi_critical=0.20,               # default 0.25
    missingness_corr_threshold=0.25, # default 0.30
    min_sample_size=20,              # default 30 — drift checks are skipped below this,
                                      # since KS/PSI results on tiny samples aren't reliable
)
report = LeakLens(train_df, test_df, target="price", config=config).run()

CLI + GitHub Actions

pip install "leaklens[cli]"
leaklens train.csv test.csv --target price --report report.html --fail-on critical

--fail-on controls CI exit code: critical (default), warning, or none.

Drop this into .github/workflows/leaklens-qa.yml to fail PRs that introduce data leakage:

- name: LeakLens QA
  run: |
    pip install "leaklens[cli]"
    leaklens data/train.csv data/test.csv \
      --target price \
      --report leaklens_report.html \
      --fail-on critical

- name: Upload report
  if: always()
  uses: actions/upload-artifact@v4
  with:
    name: leaklens-report
    path: leaklens_report.html

A full workflow template is included at .github/workflows/leaklens-qa.yml.


Output shape

Every finding is a structured dataclass — no magic strings, no parsing required:

Issue(
    title="Distribution Drift",
    severity=Severity.CRITICAL,
    column="Age",
    analyzer="drift",
    message="KS test p-value=0.0000 — distributions differ significantly.",
    details={
        "ks_stat": 0.660,
        "p_value": 0.0,
        "root_cause": ["Mean increased (29.4 → 55.3)", "95th percentile shifted (54.9 → 83.5)"],
    },
)

Project layout

leaklens/
├── core.py                   # LeakLens orchestrator
├── config.py                 # all thresholds, fully overridable
├── exceptions.py
├── report.py                 # HTML rendering via Jinja2
├── cli.py                    # optional Typer CLI
├── utils.py                  # to_pandas(), fingerprint, identifier detection
├── models/                   # Issue, Report, Severity dataclasses
├── analyzers/                # one module per check, all return List[Issue]
│   ├── target_leakage.py
│   ├── contamination.py
│   ├── drift.py              # KS + PSI + root cause analysis
│   ├── temporal.py
│   ├── duplicate_columns.py
│   ├── schema.py
│   ├── missing_value_leakage.py
│   └── preprocessing.py      # AST-based static analysis
├── visualizations/           # Plotly KDE figure builders
└── templates/
    └── report.html           # full dashboard, single Jinja2 template

Analyzers never print, plot, or score — they only emit Issue objects. Rendering is handled separately, so adding a new output format or a new check doesn't touch existing detection logic.


Installation

# Core library
pip install leaklens

# With CLI (for GitHub Actions / terminal use)
pip install "leaklens[cli]"

# For development
pip install "leaklens[dev]"

Requirements: Python 3.9+, pandas, numpy, scipy, plotly, jinja2


Development

git clone https://github.com/Prajwal18py/leaklens
cd leaklens
pip install -e ".[dev]"
pytest tests/ -v

Actual test count may vary slightly by environment — run pytest tests/ -v for the exact number, one synthetic case per analyzer behavior, plus a dedicated fuzz suite (test_robustness.py) that throws pathological inputs at the library (all-null columns, duplicate column names, mixed dtypes, infinities, single-row frames) and asserts it never crashes. Every analyzer runs inside defensive error handling — if one check fails on unusual data, it's reported as a skipped check rather than taking down the whole report.


What LeakLens is NOT

  • Not an EDA library (use ydata-profiling, Sweetviz for that)
  • Not an AutoML tool
  • Not a model trainer or evaluator
  • Not a fairness auditor
  • No LLM calls, no black-box scores, no invented metrics

LeakLens does one thing: validates that your train/test split is structurally sound before you train anything on it.


License

MIT © Prajwal A.

GitHub · PyPI · Issues

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

leaklens-1.2.0.tar.gz (54.0 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

leaklens-1.2.0-py3-none-any.whl (52.8 kB view details)

Uploaded Python 3

File details

Details for the file leaklens-1.2.0.tar.gz.

File metadata

  • Download URL: leaklens-1.2.0.tar.gz
  • Upload date:
  • Size: 54.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.8

File hashes

Hashes for leaklens-1.2.0.tar.gz
Algorithm Hash digest
SHA256 326455835ec6bc72e8a5b6ac26d20cbc7cc267b744453c3f4743f2cb26501466
MD5 2dd746d250a2426d1b722b0e234ac5ce
BLAKE2b-256 1c7a6b792d91f42a64a07a68a97b5b9b346b41869e96b4ffa3c4e52e74f05875

See more details on using hashes here.

File details

Details for the file leaklens-1.2.0-py3-none-any.whl.

File metadata

  • Download URL: leaklens-1.2.0-py3-none-any.whl
  • Upload date:
  • Size: 52.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.8

File hashes

Hashes for leaklens-1.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 e3e46fccb43635ac86bff1dd2a53c8772dec282a544b0cb08a737b09bea4408f
MD5 d5a74866b90b0e09867c1272c25c164b
BLAKE2b-256 78f48c3bfe61a5eaac27da8adf1d4d5908011473b92abbc3c4dc092ddd676264

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page