Automatic data leakage detection for tabular and time-series ML workflows.
🚨 FeatureLeak: Stop Data Leakage Before It Stops You
A Python tool for catching data leakage in machine learning workflows.
Why FeatureLeak?
Data leakage is a silent killer of machine learning projects. It inflates your metrics, undermines your models in production, and can cost you months of wasted effort. FeatureLeak is an automated guardrail: it scans your data for common forms of leakage before you train a model.
- Instantly scan your data for 10+ leakage types
- Zero configuration needed—works out of the box
- Actionable, human-readable reports
- CLI & Python API for seamless integration
🚀 Quick Demo
Python API
from featureleak import LeakScanner
import pandas as pd
df = pd.read_csv('data.csv')
scanner = LeakScanner()
report = scanner.scan(df, target='target')
print(report.summary()) # Human-readable summary
print(report.issues) # List of detected issues
Command Line
featureleak scan data.csv --target target --output report.json
What Can FeatureLeak Catch?
- Target Leakage: Features that "cheat" by revealing the answer
- Temporal Leakage: Using future info in past predictions
- Train-Test Contamination: Overlap between train/test sets
- Entity Leakage: Same entity in both train and test
- Aggregation Leakage: Pre-aggregated stats leaking test info
- Identifier Leakage: Unique IDs that act as shortcuts
- Missingness Leakage: Patterns of missing data that reveal the target
- Duplicate Leakage: Duplicated rows across splits
- Preprocessing Inconsistencies: Different transforms for train/test
- Distribution Shift: Major differences between train and test
...and more!
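To make two of the categories above concrete, here is a minimal pandas sketch of an entity-leakage check and a temporal-leakage check. It is an illustration only, not FeatureLeak's implementation: the helper names `shared_entities` and `temporal_overlap` and the column names `customer_id` and `event_time` are assumptions made up for this example.

```python
import pandas as pd


def shared_entities(train: pd.DataFrame, test: pd.DataFrame, entity_col: str) -> set:
    """Return entity IDs that appear in both splits (possible entity leakage)."""
    return set(train[entity_col].tolist()) & set(test[entity_col].tolist())


def temporal_overlap(train: pd.DataFrame, test: pd.DataFrame, time_col: str) -> bool:
    """True if any training row is dated at or after the earliest test row."""
    return bool(train[time_col].max() >= test[time_col].min())


# Toy data; the column names are hypothetical.
train = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "event_time": pd.to_datetime(["2024-01-01", "2024-01-05", "2024-02-10"]),
})
test = pd.DataFrame({
    "customer_id": [3, 4],
    "event_time": pd.to_datetime(["2024-02-01", "2024-02-15"]),
})

entities = shared_entities(train, test, "customer_id")
overlaps = temporal_overlap(train, test, "event_time")
print(entities, overlaps)
```

In this toy split, customer 3 appears on both sides and the last training row is dated after the first test row, so both checks flag a problem. A real scanner would likely need more nuance, for example per-entity time ordering.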
📊 Example Output
FeatureLeak Report
──────────────────────────────
Risk score: 75/100 (High)
Total issues: 3
High risk features: 1
Medium risk features: 2
1. [HIGH] target_leakage
Feature 'previous_target' is 0.98 correlated with target
Suggested fix: Remove or investigate this feature
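A finding like the one above can be approximated by hand with a plain correlation screen. The sketch below assumes a numeric target and uses absolute Pearson correlation; how FeatureLeak actually scores target leakage is not stated here, and the helper name `high_corr_features` and the toy columns are hypothetical.

```python
import pandas as pd


def high_corr_features(df: pd.DataFrame, target: str, threshold: float = 0.98) -> dict:
    """Map each numeric feature to |Pearson r| with the target when it meets the threshold."""
    numeric = df.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target).abs()
    return corr[corr >= threshold].to_dict()


# Toy data: 'previous_target' is a copy of the label, 'age' is only weakly related.
df = pd.DataFrame({
    "previous_target": [0, 1, 0, 1, 1, 0],
    "age": [25, 32, 47, 51, 38, 29],
    "target": [0, 1, 0, 1, 1, 0],
})

flagged = high_corr_features(df, target="target")
print(flagged)
```

Pearson correlation only captures linear relationships, so a screen like this can miss leaky categorical or non-monotonic features.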
🔧 Configuration (Optional)
scanner = LeakScanner(
    target_corr_threshold=0.98,  # Correlation threshold for target leakage
    overlap_threshold=0.0,       # Allowable train-test overlap
    sample_size=10000,           # Sample for large datasets
)
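One plausible reading of an overlap threshold is "the fraction of test rows that also occur in the training set". The sketch below computes that fraction with a pandas merge and compares it with a threshold of 0.0. This interpretation, and the helper name `overlap_fraction`, are assumptions for illustration; FeatureLeak may define overlap differently.

```python
import pandas as pd


def overlap_fraction(train: pd.DataFrame, test: pd.DataFrame) -> float:
    """Fraction of test rows whose values also appear as a full row in train."""
    cols = list(test.columns)
    # De-duplicate train so the left join keeps exactly one output row per test row.
    train_rows = train[cols].drop_duplicates()
    merged = test.merge(train_rows, on=cols, how="left", indicator=True)
    return float((merged["_merge"] == "both").mean())


# Toy splits: the row (3, "c") appears in both.
train = pd.DataFrame({"x": [1, 2, 3, 4], "y": ["a", "b", "c", "d"]})
test = pd.DataFrame({"x": [3, 5], "y": ["c", "e"]})

overlap_threshold = 0.0  # mirrors the configuration value shown above
fraction = overlap_fraction(train, test)
exceeds = fraction > overlap_threshold
print(fraction, exceeds)
```

With a threshold of 0.0, any shared row counts as contamination. A looser threshold would tolerate a small share of repeated rows, which can be legitimate in some datasets.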
💡 Why Data Scientists Love FeatureLeak
- Saves you from embarrassing mistakes before deployment
- Works with any tabular data (CSV, pandas DataFrame)
- Handles time series, entity-based, and large datasets
- No black box: Every issue comes with a clear explanation and fix
- Open source, MIT licensed
📦 Installation
pip install featureleak
Requirements: Python 3.10+, pandas 2.0+, numpy 1.24+, scikit-learn 1.3+
🛠️ Integrate Anywhere
- Python API: Use in notebooks, scripts, or pipelines
- CLI: Scan datasets from the terminal or CI/CD
- JSON Reports: Easy to parse and automate
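As a rough idea of how a JSON report could gate a CI job, the sketch below counts issues by severity and decides whether to fail the build. The report schema is not documented in this README, so the `issues` list and its `severity` and `type` keys are hypothetical; check a real `report.json` and adjust the field names before relying on this.

```python
import json


def count_by_severity(report: dict) -> dict:
    """Count issues per severity level (lower-cased)."""
    counts = {}
    for issue in report.get("issues", []):
        severity = str(issue.get("severity", "unknown")).lower()
        counts[severity] = counts.get(severity, 0) + 1
    return counts


def should_fail(report: dict, blocking=("high",)) -> bool:
    """True if the report contains at least one issue at a blocking severity."""
    counts = count_by_severity(report)
    return any(counts.get(level, 0) > 0 for level in blocking)


# Hypothetical report content standing in for the contents of report.json.
sample = json.loads(
    '{"issues": ['
    '{"severity": "HIGH", "type": "target_leakage"},'
    '{"severity": "MEDIUM", "type": "identifier_leakage"}'
    ']}'
)

print(count_by_severity(sample), should_fail(sample))
```

In a pipeline, the same functions could be applied to the file written by `--output report.json`, exiting with a non-zero status when `should_fail` returns `True`.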
📝 Documentation & Help
- Run featureleak --help for CLI options
- See examples in the docs
- Open an issue or PR—contributions welcome!
📄 License
MIT License. See LICENSE.
📣 Citing FeatureLeak
If you use FeatureLeak in research, please cite:
@software{featureleak2026,
author = {McBride, Christian},
title = {FeatureLeak: Automated Data Leakage Detection},
year = {2026},
url = {https://github.com/yourusername/feature_leak}
}
🙏 Acknowledgments
Built with pandas, scikit-learn, and numpy.
File details
Details for the file featureleak-0.1.0.tar.gz.
File metadata
- Download URL: featureleak-0.1.0.tar.gz
- Upload date:
- Size: 50.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.20
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0fab64d444ee5d210a48bb82dc3f54b00f7f6c99c5607157d722d7f09b0fc654 |
| MD5 | 2b188589df51ba402989466f523256bc |
| BLAKE2b-256 | 639b3e314206774cc3213b79d11265b184326bb29c74eeb45bb11b522faa5238 |
File details
Details for the file featureleak-0.1.0-py3-none-any.whl.
File metadata
- Download URL: featureleak-0.1.0-py3-none-any.whl
- Upload date:
- Size: 62.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.20
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0a69724207c876bf052b08b19073f709dca4828640bb83daa27fe3fe83158b44 |
| MD5 | 7f818d8eeae8af0aec420fa46f7b10f5 |
| BLAKE2b-256 | 62f0cba74399bac6d8ca4886eb365b1bbe71d1d1a3f66135f832b638ff0249eb |