
FeatureLeakageLens

Pre-training feature leakage auditor for tabular ML datasets.


FeatureLeakageLens audits tabular ML datasets for suspicious feature leakage patterns before model training. It accepts a DataFrame, runs six checks, and returns a structured PASS / WARN / FAIL report for the data scientist to review before a single model is fit.

About

The worst feature leakage is invisible. A model trained on a post-outcome column doesn't look miscalibrated — it looks exceptional. AUC near 1.0, precision through the roof, validation loss flatlining early. Nothing in the training curves signals the problem. The failure only surfaces when the model hits production and the feature isn't available yet, because it was generated after the event you were trying to predict.

By then, the model has shipped, the business has made decisions on it, and months of development are sunk.

Feature leakage is caught late not because teams are careless, but because the dataset review step is informal. There's no standard checklist, no structured output to attach to a model card, and no CI gate to fail. A data scientist checks the columns they think of checking, in the order they happen to think of them, before moving on to the work that feels more like real ML.

FeatureLeakageLens makes that review step explicit, systematic, and documentable. It checks column names for post-outcome terms, scans for suspiciously correlated features, looks for categorical target-rate proxies, verifies timestamps, flags high-cardinality IDs, and tests for train/test distribution shift — before any model is trained. The output is a per-feature audit report in JSON, Markdown, and HTML, with a clear status and evidence you can attach to a pull request or model card review.
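The cheapest of these checks needs no data at all, only column names. As an illustration of the idea (not the library's implementation, and with a hypothetical keyword list), the post-outcome name heuristic amounts to:

```python
# Illustrative sketch of a post-outcome name heuristic: flag columns whose
# names contain terms that usually describe events recorded after the
# outcome. The keyword list below is hypothetical, not the library's.
POST_OUTCOME_TERMS = {"outcome", "result", "paid", "received", "settled", "closed"}

def scan_column_names(columns):
    """Return the columns whose names contain a post-outcome term."""
    flagged = []
    for col in columns:
        name = col.lower()
        if any(term in name for term in POST_OUTCOME_TERMS):
            flagged.append(col)
    return flagged
```

A column like `payment_received_flag` would be flagged for review; whether it was actually available at prediction time is still a question for the domain expert.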

The truth boundary is stated on every report: this tool flags suspicious patterns. The domain expert confirms whether a feature was actually available at prediction time.

Architecture

```mermaid
flowchart TD
    IN["LeakageAuditConfig + DataFrame\n──────────────────────────\ntarget_col · split_col\noutcome_time_col · feature_time_cols\nthresholds"]

    subgraph CHEAP ["Tier 1 — Name & Structure  (zero-compute)"]
        NH["Post-outcome Name Heuristic\nkeyword scan · col names only\n→ WARN"]
        ID["ID / Proxy Scan\nn_unique / n_rows ≥ id_threshold\n→ WARN"]
    end

    subgraph STAT ["Tier 2 — Statistical Checks  (requires data)"]
        TC["Target Correlation\nPearson |r| ≥ high_corr_threshold\n→ WARN"]
        CP["Categorical Proxy\nmax_rate − min_rate ≥ cat_proxy_threshold\n→ WARN"]
        SD["Split Distribution\nnorm. mean diff (numeric) · TVD (categorical)\n→ WARN · INSUFFICIENT_INPUT if no split_col"]
    end

    subgraph TEMPORAL ["Tier 3 — Temporal Integrity  (structural violation)"]
        TA["Temporal Availability\nfeature_ts > outcome_ts per row\n→ FAIL  (not WARN)"]
    end

    IN --> CHEAP
    IN --> STAT
    IN --> TEMPORAL

    CHEAP & STAT & TEMPORAL --> AGG

    AGG["FAIL › WARN › INSUFFICIENT_INPUT › PASS"]
    AGG --> OUT["LeakageReport\nJSON · Markdown · HTML\nexplicit truth boundary"]
```
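The Tier 2 split-distribution statistics are standard: a normalised mean difference for numeric columns and total variation distance (TVD) for categoricals. A minimal pandas sketch of both, separate from the library's own code:

```python
import pandas as pd

def normalised_mean_diff(train: pd.Series, test: pd.Series) -> float:
    """|mean_train - mean_test| scaled by the pooled standard deviation."""
    pooled_std = pd.concat([train, test]).std()
    if pooled_std == 0:
        return 0.0
    return abs(train.mean() - test.mean()) / pooled_std

def total_variation_distance(train: pd.Series, test: pd.Series) -> float:
    """Half the L1 distance between the two value-frequency distributions."""
    p = train.value_counts(normalize=True)
    q = test.value_counts(normalize=True)
    return 0.5 * p.subtract(q, fill_value=0).abs().sum()
```

TVD is 0 for identical distributions and 1 for disjoint ones, which makes a single threshold easy to reason about across categorical columns.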

The 6 checks

| Tier | Check | Method | Status |
|------|-------|--------|--------|
| Name & Structure | Post-outcome name heuristic | Keyword scan on column names | WARN |
| Name & Structure | ID / proxy scan | n_unique / n_rows ≥ threshold | WARN |
| Statistical | Target correlation scan | Pearson \|r\| ≥ threshold | WARN |
| Statistical | Categorical proxy scan | Target-rate gap across values | WARN |
| Statistical | Split distribution scan | Normalised mean diff + TVD | WARN or INSUFFICIENT_INPUT |
| Temporal | Temporal availability | feature_ts > outcome_ts per row | FAIL |

Only temporal availability can produce FAIL — it is the one check with no ambiguity. Every other finding requires domain confirmation.
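The precedence FAIL › WARN › INSUFFICIENT_INPUT › PASS implies a simple aggregation rule: the report's overall status is the most severe per-finding status. A sketch of that rule (the function name is hypothetical, not the library's API):

```python
# Severity ordering implied by the aggregation precedence above.
SEVERITY = {"FAIL": 3, "WARN": 2, "INSUFFICIENT_INPUT": 1, "PASS": 0}

def aggregate_status(finding_statuses):
    """Overall status is the most severe status across all findings."""
    if not finding_statuses:
        return "PASS"
    return max(finding_statuses, key=SEVERITY.__getitem__)
```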

Truth boundary

FeatureLeakageLens does not prove leakage. It flags suspicious patterns for review. Human judgment is required to confirm whether a feature was actually available at prediction time. It is not a replacement for feature-store governance, data contracts, or production monitoring.

Install

```shell
pip install featureleakagelens
```

Quickstart

```python
import pandas as pd
from featureleakagelens import LeakageAuditConfig, audit_dataframe

df = pd.read_csv("data/demo_leakage_dataset.csv",
                 parse_dates=["application_ts", "outcome_ts", "payment_received_ts"])

config = LeakageAuditConfig(
    target_col="defaulted",
    split_col="split",
    outcome_time_col="outcome_ts",
    feature_time_cols={"payment_received_flag": "payment_received_ts"},
)

report = audit_dataframe(df, config)

print(report.status)       # FAIL / WARN / INSUFFICIENT_INPUT / PASS
report.save("outputs/")    # writes JSON, Markdown, HTML
```
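Because `report.status` is a plain string, wiring the audit into CI is a one-liner: exit nonzero on FAIL so the pipeline blocks a structural leak. A hedged sketch using only the objects shown in the quickstart:

```python
import sys

def ci_gate(status: str) -> int:
    # Exit code 1 fails the CI job on a structural leak (FAIL);
    # WARNs pass the gate but should be reviewed by a domain expert.
    return 1 if status == "FAIL" else 0

# In a CI script, after running the audit:
# sys.exit(ci_gate(report.status))
```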

Run the demo

```shell
git clone https://github.com/SidharthKriplani/featureleakagelens
cd featureleakagelens
pip install -e .
python scripts/generate_demo_reports.py
open outputs/featureleakagelens_report.html
```

Resume-safe claim

Built FeatureLeakageLens, a pre-training feature leakage auditor for tabular ML datasets that screens for post-outcome column names, high target correlation, categorical target-rate proxies, future-timestamp leakage, ID/proxy columns, and train/test distribution shift, producing structured JSON/Markdown/HTML audit reports with per-finding PASS/WARN/FAIL/INSUFFICIENT_INPUT statuses and an explicit truth boundary.

Roadmap

  • Mutual information scan for nonlinear proxy detection
  • Group leakage check for cross-validation folds (same entity in train and test)
  • Time-series walk-forward split validator

License

MIT
