Pre-training leakage audit reports for tabular ML datasets.
FeatureLeakageLens
Pre-training feature leakage auditor for tabular ML datasets.
FeatureLeakageLens audits tabular ML datasets for suspicious feature leakage patterns before model training. It accepts a DataFrame, runs six checks, and returns a structured PASS / WARN / FAIL report the data scientist reviews before fitting a single model.
About
The worst feature leakage is invisible. A model trained on a post-outcome column doesn't look miscalibrated — it looks exceptional. AUC near 1.0, precision through the roof, validation loss flatlining early. Nothing in the training curves signals the problem. The failure only surfaces when the model hits production and the feature isn't available yet, because it was generated after the event you were trying to predict.
By then, the model has shipped, the business has made decisions on it, and months of development are sunk.
Feature leakage is caught late not because teams are careless, but because the dataset review step is informal. There's no standard checklist, no structured output to attach to a model card, and no CI gate to fail. A data scientist checks the columns they think of checking, in the order they happen to think of them, before moving on to the work that feels more like real ML.
FeatureLeakageLens makes that review step explicit, systematic, and documentable. It checks column names for post-outcome terms, scans for suspiciously correlated features, looks for categorical target-rate proxies, verifies timestamps, flags high-cardinality IDs, and tests for train/test distribution shift — before any model is trained. The output is a per-feature audit report in JSON, Markdown, and HTML, with a clear status and evidence you can attach to a pull request or model card review.
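The two cheapest checks in that list, the column-name scan and the high-cardinality ID scan, are simple enough to illustrate in a few lines of pandas. This is a minimal sketch, not the library's implementation: the keyword list, the 0.95 default for `id_threshold`, and both function names are assumptions made for the example.

```python
import pandas as pd

# Hypothetical keyword list; the terms the library actually scans for may differ.
POST_OUTCOME_KEYWORDS = ("outcome", "result", "paid", "payment", "refund", "chargeback", "closed")


def name_heuristic(columns, target_col):
    """Return feature names that contain a post-outcome keyword (WARN candidates)."""
    return [
        col for col in columns
        if col != target_col
        and any(keyword in col.lower() for keyword in POST_OUTCOME_KEYWORDS)
    ]


def id_like_columns(df, target_col, id_threshold=0.95):
    """Return columns whose unique-value ratio n_unique / n_rows meets the threshold."""
    n_rows = len(df)
    if n_rows == 0:
        return []
    return [
        col for col in df.columns
        if col != target_col and df[col].nunique() / n_rows >= id_threshold
    ]


df = pd.DataFrame({
    "application_id": [101, 102, 103, 104],
    "income_band": ["low", "low", "high", "high"],
    "payment_received_flag": [1, 0, 1, 0],
    "defaulted": [0, 1, 0, 1],
})

print(name_heuristic(df.columns, "defaulted"))  # ['payment_received_flag']
print(id_like_columns(df, "defaulted"))         # ['application_id']
```

Neither function reads the target values, which is why checks of this kind can run before any statistics are computed.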
The truth boundary is stated on every report: this tool flags suspicious patterns. The domain expert confirms whether a feature was actually available at prediction time.
Architecture
flowchart TD
    IN["LeakageAuditConfig + DataFrame\n──────────────────────────\ntarget_col · split_col\noutcome_time_col · feature_time_cols\nthresholds"]
    subgraph CHEAP ["Tier 1 — Name & Structure (zero-compute)"]
        NH["Post-outcome Name Heuristic\nkeyword scan · col names only\n→ WARN"]
        ID["ID / Proxy Scan\nn_unique / n_rows ≥ id_threshold\n→ WARN"]
    end
    subgraph STAT ["Tier 2 — Statistical Checks (requires data)"]
        TC["Target Correlation\nPearson |r| ≥ high_corr_threshold\n→ WARN"]
        CP["Categorical Proxy\nmax_rate − min_rate ≥ cat_proxy_threshold\n→ WARN"]
        SD["Split Distribution\nnorm. mean diff (numeric) · TVD (categorical)\n→ WARN · INSUFFICIENT_INPUT if no split_col"]
    end
    subgraph TEMPORAL ["Tier 3 — Temporal Integrity (structural violation)"]
        TA["Temporal Availability\nfeature_ts > outcome_ts per row\n→ FAIL (not WARN)"]
    end
    IN --> CHEAP
    IN --> STAT
    IN --> TEMPORAL
    CHEAP & STAT & TEMPORAL --> AGG
    AGG["FAIL › WARN › INSUFFICIENT_INPUT › PASS"]
    AGG --> OUT["LeakageReport\nJSON · Markdown · HTML\nexplicit truth boundary"]
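The aggregation step at the end of the diagram (FAIL, then WARN, then INSUFFICIENT_INPUT, then PASS) amounts to reporting the most severe per-check status. A minimal sketch, assuming statuses are plain strings; `SEVERITY` and `aggregate` are hypothetical names, not the library's API, and treating an empty list as PASS is an assumption as well.

```python
# Severity order taken from the diagram: FAIL > WARN > INSUFFICIENT_INPUT > PASS.
SEVERITY = {"PASS": 0, "INSUFFICIENT_INPUT": 1, "WARN": 2, "FAIL": 3}


def aggregate(statuses):
    """Return the most severe status; an empty input is treated as PASS (an assumption)."""
    return max(statuses, key=SEVERITY.__getitem__, default="PASS")


print(aggregate(["PASS", "WARN", "INSUFFICIENT_INPUT"]))  # WARN
print(aggregate(["PASS", "WARN", "FAIL"]))                # FAIL
```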
The 6 checks
| Tier | Check | Method | Status |
|---|---|---|---|
| Name & Structure | Post-outcome name heuristic | Keyword scan on column names | WARN |
| Name & Structure | ID / proxy scan | n_unique / n_rows ≥ threshold | WARN |
| Statistical | Target correlation scan | Pearson \|r\| ≥ threshold | WARN |
| Statistical | Categorical proxy scan | Target-rate gap across values | WARN |
| Statistical | Split distribution scan | Normalised mean diff + TVD | WARN or INSUFFICIENT_INPUT |
| Temporal | Temporal availability | feature_ts > outcome_ts per row | FAIL |
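The Statistical rows of the table reduce to a few pandas operations. The sketch below is illustrative only: the function names and default thresholds are assumptions, the parameter names `high_corr_threshold` and `cat_proxy_threshold` are borrowed from the architecture diagram, and only the TVD half of the split check is shown because the normalisation used for the numeric mean difference is not specified here.

```python
import pandas as pd


def high_target_correlation(df, target_col, high_corr_threshold=0.9):
    """Numeric features whose absolute Pearson correlation with the target meets the threshold."""
    numeric = df.select_dtypes("number").drop(columns=[target_col], errors="ignore")
    corr = numeric.corrwith(df[target_col]).abs()
    return corr[corr >= high_corr_threshold].index.tolist()


def categorical_proxies(df, target_col, cat_proxy_threshold=0.8):
    """Non-numeric features whose per-value target rates span at least the threshold."""
    flagged = []
    for col in df.select_dtypes(exclude="number").columns:
        rates = df.groupby(col)[target_col].mean()  # target rate for each category value
        if rates.max() - rates.min() >= cat_proxy_threshold:
            flagged.append(col)
    return flagged


def total_variation_distance(train, test):
    """TVD between category frequencies of two Series: half the summed absolute difference."""
    p = train.value_counts(normalize=True)
    q = test.value_counts(normalize=True)
    return 0.5 * p.subtract(q, fill_value=0.0).abs().sum()


df = pd.DataFrame({
    "defaulted": [0, 0, 0, 1, 1, 1],
    "days_past_due": [0, 1, 0, 30, 45, 60],
    "age": [25, 40, 31, 38, 27, 33],
    "account_status": ["open", "open", "open", "charged_off", "charged_off", "charged_off"],
    "region": ["north", "south", "north", "south", "north", "south"],
})
train_region = pd.Series(["north", "north", "south", "south"])
test_region = pd.Series(["north", "south", "south", "south"])

print(high_target_correlation(df, "defaulted"))  # ['days_past_due']
print(categorical_proxies(df, "defaulted"))      # ['account_status']
print(total_variation_distance(train_region, test_region))
```

Each of these only measures association with the target or drift between splits, so a flagged column is a prompt for review rather than a verdict.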
Only temporal availability can produce FAIL — it is the one check with no ambiguity. Every other finding requires domain confirmation.
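The per-row comparison behind that FAIL can be sketched as follows. This is an assumption-laden illustration rather than the library's code: `temporal_violations` is a hypothetical helper, its `feature_time_cols` argument mirrors the feature-to-timestamp mapping used in the Quickstart, and rows with a missing timestamp are simply not counted.

```python
import pandas as pd


def temporal_violations(df, outcome_time_col, feature_time_cols):
    """Count rows where a feature's timestamp is later than the outcome timestamp.

    feature_time_cols maps feature name -> timestamp column.
    Comparisons involving a missing timestamp (NaT) evaluate to False.
    """
    counts = {}
    for feature, ts_col in feature_time_cols.items():
        counts[feature] = int((df[ts_col] > df[outcome_time_col]).sum())
    return counts


df = pd.DataFrame({
    "outcome_ts": pd.to_datetime(["2024-03-01", "2024-03-05", "2024-03-09"]),
    "payment_received_ts": pd.to_datetime(["2024-02-20", "2024-03-07", None]),
})

violations = temporal_violations(
    df, "outcome_ts", {"payment_received_flag": "payment_received_ts"}
)
status = "FAIL" if any(violations.values()) else "PASS"
print(violations, status)  # {'payment_received_flag': 1} FAIL
```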
Truth boundary
FeatureLeakageLens does not prove leakage. It flags suspicious patterns for review. Human judgment is required to confirm whether a feature was actually available at prediction time. It is not a replacement for feature-store governance, data contracts, or production monitoring.
Install
pip install featureleakagelens
Quickstart
import pandas as pd
from featureleakagelens import LeakageAuditConfig, audit_dataframe

# Parse the timestamp columns up front so they load as datetimes rather than strings.
df = pd.read_csv(
    "data/demo_leakage_dataset.csv",
    parse_dates=["application_ts", "outcome_ts", "payment_received_ts"],
)

config = LeakageAuditConfig(
    target_col="defaulted",
    split_col="split",
    outcome_time_col="outcome_ts",
    # feature name -> timestamp column checked against outcome_time_col
    feature_time_cols={"payment_received_flag": "payment_received_ts"},
)

report = audit_dataframe(df, config)
print(report.status)     # aggregate status: FAIL / WARN / INSUFFICIENT_INPUT / PASS
report.save("outputs/")  # writes JSON, Markdown, HTML
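Because the About section points to the missing CI gate, one possible use of the aggregate status is to turn it into a process exit code. The sketch below is self-contained and operates on a plain status string. `exit_code_for` and its `fail_on` policy are hypothetical, and since the type of `report.status` is not documented here, converting it to a string before calling the helper is an assumption.

```python
def exit_code_for(status, fail_on=("FAIL",)):
    """Map an aggregate audit status to an exit code; non-zero would block a CI job."""
    return 1 if status in fail_on else 0


# Default policy: only FAIL blocks the pipeline.
print(exit_code_for("FAIL"))                            # 1
print(exit_code_for("WARN"))                            # 0
# Stricter policy: unresolved warnings also block.
print(exit_code_for("WARN", fail_on=("FAIL", "WARN")))  # 1
```

In a CI script the returned value would be passed to `sys.exit` after the audit has run and the report has been saved.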
Run the demo
git clone https://github.com/SidharthKriplani/featureleakagelens
cd featureleakagelens
pip install -e .
python scripts/generate_demo_reports.py
open outputs/featureleakagelens_report.html
Resume-safe claim
Built FeatureLeakageLens, a pre-training feature leakage auditor for tabular ML datasets that checks for post-outcome name heuristics, target correlation, categorical target-rate proxies, future timestamp leakage, ID/proxy columns, and train/test distribution shift, producing structured JSON/Markdown/HTML audit reports with per-finding PASS/WARN/FAIL/INSUFFICIENT_INPUT status and explicit truth boundary.
Roadmap
- Mutual information scan for nonlinear proxy detection
- Group leakage check for cross-validation folds (same entity in train and test)
- Time-series walk-forward split validator
License
MIT
Download files
File details
Details for the file featureleakagelens-0.2.0.tar.gz.
File metadata
- Download URL: featureleakagelens-0.2.0.tar.gz
- Upload date:
- Size: 12.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ac0beeabb12d184975fb6edfac01148b440306660ad8908fedc02b865ed0bf3a |
| MD5 | abab9230e5fd69673431b51db1acf9be |
| BLAKE2b-256 | 5a8699d8057975c294620cebed99da889f6f01eb0d4033ddac03e7994bf9444d |
Provenance
The following attestation bundles were made for featureleakagelens-0.2.0.tar.gz:
Publisher: publish.yml on SidharthKriplani/featureleakagelens
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: featureleakagelens-0.2.0.tar.gz
  - Subject digest: ac0beeabb12d184975fb6edfac01148b440306660ad8908fedc02b865ed0bf3a
- Sigstore transparency entry: 1441015884
- Sigstore integration time:
- Permalink: SidharthKriplani/featureleakagelens@17d5f8328288deee9f2caa0cc1efc6288be34a79
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/SidharthKriplani
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@17d5f8328288deee9f2caa0cc1efc6288be34a79
- Trigger Event: release
File details
Details for the file featureleakagelens-0.2.0-py3-none-any.whl.
File metadata
- Download URL: featureleakagelens-0.2.0-py3-none-any.whl
- Upload date:
- Size: 9.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0d30a50f1f4f8f051539a2a770957d76c9608fb1aa56b3033629faddc76236d4 |
| MD5 | 9fc83b45ccefcb6589ebde68da6584e9 |
| BLAKE2b-256 | 1e778cbb82098845468d15b1c63445c9abf9ab9c2bead131b46809964bdb458e |
Provenance
The following attestation bundles were made for featureleakagelens-0.2.0-py3-none-any.whl:
Publisher: publish.yml on SidharthKriplani/featureleakagelens
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: featureleakagelens-0.2.0-py3-none-any.whl
  - Subject digest: 0d30a50f1f4f8f051539a2a770957d76c9608fb1aa56b3033629faddc76236d4
- Sigstore transparency entry: 1441016027
- Sigstore integration time:
- Permalink: SidharthKriplani/featureleakagelens@17d5f8328288deee9f2caa0cc1efc6288be34a79
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/SidharthKriplani
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@17d5f8328288deee9f2caa0cc1efc6288be34a79
- Trigger Event: release