Skip to main content

Leakage & drift QA for ML datasets — catches target leakage, train/test contamination, distribution drift, temporal leakage, schema mismatches, and duplicate columns before they wreck your model.

Project description

LeakLens

The pre-flight checklist for machine learning datasets.

Most EDA tools answer "what does my data look like?" LeakLens answers a different question: "is my train/test split even valid?"

Think Ruff or ESLint — but for the structural mistakes that quietly invalidate an ML experiment instead of the ones that break your code.

pip install leaklens

PyPI version Python License: MIT


Why

The most common way an ML project fails isn't a bad model — it's a corrupted experiment. Target leakage, contaminated splits, drifted test sets, and preprocessing fit before the split all produce results that look great in your notebook and fall apart in production.

LeakLens catches exactly these failure modes — deterministically, with no invented scores or black-box metrics. Every finding maps to a specific, reproducible calculation you can explain in an interview or a code review.


Quick start

from leaklens import LeakLens

report = LeakLens(train_df, test_df, target="price").run()

report.summary()             # plain-text console output
report.to_html("report.html")  # full interactive dashboard
report.to_json()             # structured output for CI pipelines
report.issues                # raw List[Issue] for custom handling

Works with a single dataframe too — train/test comparison checks are skipped automatically:

report = LeakLens(df, target="churned").run()

Polars DataFrames are supported — passed in as-is and converted internally.


What it checks

Check Method What it catches
Target Leakage Correlation / Cramer's V / group mapping Features that directly or indirectly encode the target
Missing Value Leakage Point-biserial r / Cramer's V on missingness indicator Columns where whether a value is missing correlates with the target
Train/Test Contamination Full-row hash matching Identical rows appearing in both splits
Distribution Drift KS test (numeric) + PSI (categorical) Train and test drawn from different distributions
Temporal Leakage Date-column auto-detection + chronology check Test-set dates that fall before the latest train date
Duplicate Columns Correlation threshold + Series.equals() Near-identical columns adding redundant collinearity
Schema Validation Column diff + dtype comparison + category sets Missing columns, dtype mismatches, unseen categories, constant features
Preprocessing Leakage AST static analysis Scalers, encoders, PCA, SMOTE etc. fit before train_test_split()

Every finding includes:

  • The raw metric value that triggered it (KS statistic, PSI, correlation, etc.)
  • The exact threshold it crossed (from your Config, not hidden)
  • Root cause analysis for drift findings (mean shift, variance shift, 95th-percentile shift)
  • A plain-English "why it matters" explanation
  • Concrete fix steps

No fairness scores, no deployment readiness %, no AI confidence badges — just real statistics.


The report

report.to_html("report.html") generates a self-contained interactive dashboard:

  • Verdict bannerDO NOT TRAIN / TRAIN WITH CAUTION / SAFE TO TRAIN, derived directly from critical/warning counts
  • Dataset fingerprint — schema hash + distribution hash in the header, so you can confirm two runs used identical data
  • Full drift ranking — every checked column shown, stable ones in grey, drifted ones in red/amber — not just the ones that failed
  • KDE distribution overlays — smooth kernel density curves for numeric drift, not blurry histograms
  • Root cause bullets — "Mean increased (29.4 → 55.3), 95th percentile shifted (54.9 → 83.5)"
  • Expandable issue accordions — detected value, threshold used, why it matters, fix steps
  • Severity-colored recommendation cards — one per issue type, concise and scannable
  • JSON export — one-click download of all findings as structured JSON
  • Print / Save as PDF — properly handles collapsed accordions, chart resolution, and color preservation

Root cause analysis

For numeric drift findings, LeakLens explains what shape of drift occurred — not just that it happened:

❌ Distribution Drift [Age]
   KS p-value = 0.0000

   Likely Cause:
   ✔ Mean increased (29.4 → 55.3)
   ✔ 95th percentile shifted (54.9 → 83.5)

Pure statistics, no model involved.


Missing value leakage

A check most libraries don't have — detects when the absence of a value is itself predictive:

⚠ Missing Value Leakage [Cabin]
  Whether 'Cabin' is missing correlates with the target
  (point-biserial r = -0.313, p < 0.0001)

In the Titanic dataset, this fires automatically — cabin missingness genuinely correlates with survival, and any imputation strategy that fills in a "typical" value would silently erase that signal or inadvertently leak it.


Dataset fingerprint

Every report includes a schema hash and distribution hash:

report.meta["train_fingerprint"]
# {'schema_hash': 'e0cf60009fb642c4', 'distribution_hash': '78e61c68b140fe5b',
#  'n_rows': 624, 'n_columns': 14}

Useful for confirming that two training runs actually used the same data, not silently different versions of a file.


Preprocessing leakage (AST check)

Parses your training script's syntax tree — no execution needed:

report = LeakLens(train_df, target="price", script="train.py").run()

Catches any of these fit before train_test_split():

StandardScaler  MinMaxScaler  RobustScaler
PCA  TruncatedSVD  KernelPCA
KNNImputer  SimpleImputer  IterativeImputer
LabelEncoder  OneHotEncoder  OrdinalEncoder  TargetEncoder
SMOTE  ADASYN  RandomOverSampler  RandomUnderSampler
SelectKBest  SelectPercentile  RFE  VarianceThreshold

Configuration

Every threshold is tunable — nothing is hardcoded:

from leaklens import LeakLens, Config

config = Config(
    target_corr_threshold=0.90,      # default 0.95
    ks_alpha=0.01,                   # default 0.05
    psi_critical=0.20,               # default 0.25
    missingness_corr_threshold=0.25, # default 0.30
)
report = LeakLens(train_df, test_df, target="price", config=config).run()

CLI + GitHub Actions

pip install "leaklens[cli]"
leaklens train.csv test.csv --target price --report report.html --fail-on critical

--fail-on controls CI exit code: critical (default), warning, or none.

Drop this into .github/workflows/leaklens-qa.yml to fail PRs that introduce data leakage:

- name: LeakLens QA
  run: |
    pip install "leaklens[cli]"
    leaklens data/train.csv data/test.csv \
      --target price \
      --report leaklens_report.html \
      --fail-on critical

- name: Upload report
  if: always()
  uses: actions/upload-artifact@v4
  with:
    name: leaklens-report
    path: leaklens_report.html

A full workflow template is included at .github/workflows/leaklens-qa.yml.


Output shape

Every finding is a structured dataclass — no magic strings, no parsing required:

Issue(
    title="Distribution Drift",
    severity=Severity.CRITICAL,
    column="Age",
    analyzer="drift",
    message="KS test p-value=0.0000 — distributions differ significantly.",
    details={
        "ks_stat": 0.660,
        "p_value": 0.0,
        "root_cause": ["Mean increased (29.4 → 55.3)", "95th percentile shifted (54.9 → 83.5)"],
    },
)

Project layout

leaklens/
├── core.py                   # LeakLens orchestrator
├── config.py                 # all thresholds, fully overridable
├── exceptions.py
├── report.py                 # HTML rendering via Jinja2
├── cli.py                    # optional Typer CLI
├── utils.py                  # to_pandas(), fingerprint, identifier detection
├── models/                   # Issue, Report, Severity dataclasses
├── analyzers/                # one module per check, all return List[Issue]
│   ├── target_leakage.py
│   ├── contamination.py
│   ├── drift.py              # KS + PSI + root cause analysis
│   ├── temporal.py
│   ├── duplicate_columns.py
│   ├── schema.py
│   ├── missing_value_leakage.py
│   └── preprocessing.py      # AST-based static analysis
├── visualizations/           # Plotly KDE figure builders
└── templates/
    └── report.html           # full dashboard, single Jinja2 template

Analyzers never print, plot, or score — they only emit Issue objects. Rendering is handled separately, so adding a new output format or a new check doesn't touch existing detection logic.


Installation

# Core library
pip install leaklens

# With CLI (for GitHub Actions / terminal use)
pip install "leaklens[cli]"

# For development
pip install "leaklens[dev]"

Requirements: Python 3.9+, pandas, numpy, scipy, plotly, jinja2


Development

git clone https://github.com/Prajwal18py/leaklens
cd leaklens
pip install -e ".[dev]"
pytest tests/ -v

37 tests, one synthetic case per analyzer behavior.


What LeakLens is NOT

  • Not an EDA library (use ydata-profiling, Sweetviz for that)
  • Not an AutoML tool
  • Not a model trainer or evaluator
  • Not a fairness auditor
  • No LLM calls, no black-box scores, no invented metrics

LeakLens does one thing: validates that your train/test split is structurally sound before you train anything on it.


License

MIT © Prajwal A.

GitHub · PyPI · Issues

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

leaklens-1.1.1.tar.gz (42.4 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

leaklens-1.1.1-py3-none-any.whl (42.6 kB view details)

Uploaded Python 3

File details

Details for the file leaklens-1.1.1.tar.gz.

File metadata

  • Download URL: leaklens-1.1.1.tar.gz
  • Upload date:
  • Size: 42.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.8

File hashes

Hashes for leaklens-1.1.1.tar.gz
Algorithm Hash digest
SHA256 88baf81883908319c71b26b19f107990095834ad672ddb692cc487a46bc4b1e2
MD5 cee844dcc35a7102fdb09611b2814c32
BLAKE2b-256 f7c81696648441aa87d1cd883ce5d2bbeed8e1c7f10f00f56fca6fef7c0a7348

See more details on using hashes here.

File details

Details for the file leaklens-1.1.1-py3-none-any.whl.

File metadata

  • Download URL: leaklens-1.1.1-py3-none-any.whl
  • Upload date:
  • Size: 42.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.8

File hashes

Hashes for leaklens-1.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 a0d00870343f870003dd5d20ca1f061fe5621fec73d77c20fb0a4323769bd47e
MD5 48d5ea3ee8b74da20337407f76f8522d
BLAKE2b-256 e2bf7484e2fc936dc11c26a2d64b41ad5ec4fca0f76f051dd02a0aa7612a61d6

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page