
FeatureLeakageLens

Pre-training feature leakage auditor for tabular ML datasets.


FeatureLeakageLens audits tabular ML datasets for suspicious feature leakage patterns before model training. It accepts a DataFrame, runs six checks, and returns a structured PASS / WARN / FAIL report for the data scientist to review before a single model is fit.

About

The worst feature leakage is invisible. A model trained on a post-outcome column doesn't look miscalibrated — it looks exceptional. AUC near 1.0, precision through the roof, validation loss flatlining early. Nothing in the training curves signals the problem. The failure only surfaces when the model hits production and the feature isn't available yet, because it was generated after the event you were trying to predict.

By then, the model has shipped, the business has made decisions on it, and months of development are sunk.

Feature leakage is caught late not because teams are careless, but because the dataset review step is informal. There's no standard checklist, no structured output to attach to a model card, and no CI gate to fail. A data scientist checks the columns they think of checking, in the order they happen to think of them, before moving on to the work that feels more like real ML.

FeatureLeakageLens makes that review step explicit, systematic, and documentable. It checks column names for post-outcome terms, scans for suspiciously correlated features, looks for categorical target-rate proxies, verifies timestamps, flags high-cardinality IDs, and tests for train/test distribution shift — before any model is trained. The output is a per-feature audit report in JSON, Markdown, and HTML, with a clear status and evidence you can attach to a pull request or model card review.
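The cheapest of these checks needs no data at all, only column names. As an illustration of the idea (not the library's implementation, and with a hypothetical keyword list), the post-outcome name heuristic amounts to:

```python
# Illustrative sketch of a post-outcome name heuristic: flag columns whose
# names contain terms that usually describe events recorded after the
# outcome. The keyword list below is hypothetical, not the library's.
POST_OUTCOME_TERMS = {"outcome", "result", "paid", "received", "settled", "closed"}

def scan_column_names(columns):
    """Return the columns whose names contain a post-outcome term."""
    flagged = []
    for col in columns:
        name = col.lower()
        if any(term in name for term in POST_OUTCOME_TERMS):
            flagged.append(col)
    return flagged
```

A column like `payment_received_flag` would be flagged for review; whether it was actually available at prediction time is still a question for the domain expert.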

The truth boundary is stated on every report: this tool flags suspicious patterns. The domain expert confirms whether a feature was actually available at prediction time.

Architecture

```mermaid
flowchart TD
    IN["LeakageAuditConfig + DataFrame\n──────────────────────────\ntarget_col · split_col\noutcome_time_col · feature_time_cols\nthresholds"]

    subgraph CHEAP ["Tier 1 — Name & Structure  (zero-compute)"]
        NH["Post-outcome Name Heuristic\nkeyword scan · col names only\n→ WARN"]
        ID["ID / Proxy Scan\nn_unique / n_rows ≥ id_threshold\n→ WARN"]
    end

    subgraph STAT ["Tier 2 — Statistical Checks  (requires data)"]
        TC["Target Correlation\nPearson |r| ≥ high_corr_threshold\n→ WARN"]
        CP["Categorical Proxy\nmax_rate − min_rate ≥ cat_proxy_threshold\n→ WARN"]
        SD["Split Distribution\nnorm. mean diff (numeric) · TVD (categorical)\n→ WARN · INSUFFICIENT_INPUT if no split_col"]
    end

    subgraph TEMPORAL ["Tier 3 — Temporal Integrity  (structural violation)"]
        TA["Temporal Availability\nfeature_ts > outcome_ts per row\n→ FAIL  (not WARN)"]
    end

    IN --> CHEAP
    IN --> STAT
    IN --> TEMPORAL

    CHEAP & STAT & TEMPORAL --> AGG

    AGG["FAIL › WARN › INSUFFICIENT_INPUT › PASS"]
    AGG --> OUT["LeakageReport\nJSON · Markdown · HTML\nexplicit truth boundary"]
```
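The Tier 2 split-distribution statistics are standard: a normalised mean difference for numeric columns and total variation distance (TVD) for categoricals. A minimal pandas sketch of both, separate from the library's own code:

```python
import pandas as pd

def normalised_mean_diff(train: pd.Series, test: pd.Series) -> float:
    """|mean_train - mean_test| scaled by the pooled standard deviation."""
    pooled_std = pd.concat([train, test]).std()
    if pooled_std == 0:
        return 0.0
    return abs(train.mean() - test.mean()) / pooled_std

def total_variation_distance(train: pd.Series, test: pd.Series) -> float:
    """Half the L1 distance between the two value-frequency distributions."""
    p = train.value_counts(normalize=True)
    q = test.value_counts(normalize=True)
    return 0.5 * p.subtract(q, fill_value=0).abs().sum()
```

TVD is 0 for identical distributions and 1 for disjoint ones, which makes a single threshold easy to reason about across categorical columns.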

The 6 checks

| Tier | Check | Method | Status |
|------|-------|--------|--------|
| Name & Structure | Post-outcome name heuristic | Keyword scan on column names | WARN |
| Name & Structure | ID / proxy scan | n_unique / n_rows ≥ threshold | WARN |
| Statistical | Target correlation scan | Pearson \|r\| ≥ threshold | WARN |
| Statistical | Categorical proxy scan | Target-rate gap across values | WARN |
| Statistical | Split distribution scan | Normalised mean diff + TVD | WARN or INSUFFICIENT_INPUT |
| Temporal | Temporal availability | feature_ts > outcome_ts per row | FAIL |

Only temporal availability can produce FAIL — it is the one check with no ambiguity. Every other finding requires domain confirmation.
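The precedence FAIL › WARN › INSUFFICIENT_INPUT › PASS implies a simple aggregation rule: the report's overall status is the most severe per-finding status. A sketch of that rule (the function name is hypothetical, not the library's API):

```python
# Severity ordering implied by the aggregation precedence above.
SEVERITY = {"FAIL": 3, "WARN": 2, "INSUFFICIENT_INPUT": 1, "PASS": 0}

def aggregate_status(finding_statuses):
    """Overall status is the most severe status across all findings."""
    if not finding_statuses:
        return "PASS"
    return max(finding_statuses, key=SEVERITY.__getitem__)
```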

Truth boundary

FeatureLeakageLens does not prove leakage. It flags suspicious patterns for review. Human judgment is required to confirm whether a feature was actually available at prediction time. It is not a replacement for feature-store governance, data contracts, or production monitoring.

Install

```shell
pip install featureleakagelens
```

Quickstart

```python
import pandas as pd
from featureleakagelens import LeakageAuditConfig, audit_dataframe

df = pd.read_csv("data/demo_leakage_dataset.csv",
                 parse_dates=["application_ts", "outcome_ts", "payment_received_ts"])

config = LeakageAuditConfig(
    target_col="defaulted",
    split_col="split",
    outcome_time_col="outcome_ts",
    feature_time_cols={"payment_received_flag": "payment_received_ts"},
)

report = audit_dataframe(df, config)

print(report.status)       # FAIL / WARN / INSUFFICIENT_INPUT / PASS
report.save("outputs/")    # writes JSON, Markdown, HTML
```
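Because `report.status` is a plain string, wiring the audit into CI is a one-liner: exit nonzero on FAIL so the pipeline blocks a structural leak. A hedged sketch using only the objects shown in the quickstart:

```python
import sys

def ci_gate(status: str) -> int:
    # Exit code 1 fails the CI job on a structural leak (FAIL);
    # WARNs pass the gate but should be reviewed by a domain expert.
    return 1 if status == "FAIL" else 0

# In a CI script, after running the audit:
# sys.exit(ci_gate(report.status))
```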

Run the demo

```shell
git clone https://github.com/SidharthKriplani/featureleakagelens
cd featureleakagelens
pip install -e .
python scripts/generate_demo_reports.py
open outputs/featureleakagelens_report.html
```

Resume-safe claim

Built FeatureLeakageLens, a pre-training feature leakage auditor for tabular ML datasets that screens for post-outcome column names, high target correlation, categorical target-rate proxies, future-timestamp leakage, ID/proxy columns, and train/test distribution shift, producing structured JSON/Markdown/HTML audit reports with per-finding PASS/WARN/FAIL/INSUFFICIENT_INPUT statuses and an explicit truth boundary.

Roadmap

  • Mutual information scan for nonlinear proxy detection
  • Group leakage check for cross-validation folds (same entity in train and test)
  • Time-series walk-forward split validator

License

MIT
