
Automatic data leakage detection for tabular and time-series ML workflows.


🚨 FeatureLeak: Stop Data Leakage Before It Stops You

A Python tool for catching data leakage in machine learning before you train.

MIT License


Why FeatureLeak?

Data leakage is the silent killer of machine learning projects. It inflates your offline metrics, produces models that underperform in production, and can cost you months of wasted effort. FeatureLeak is an automated guardrail: it scans your data for common forms of leakage before you train a model.

  • Instantly scan your data for 10+ leakage types
  • Zero configuration needed: works out of the box with default thresholds
  • Actionable, human-readable reports
  • CLI & Python API for seamless integration

🚀 Quick Demo

Python API

from featureleak import LeakScanner
import pandas as pd

df = pd.read_csv('data.csv')
scanner = LeakScanner()
report = scanner.scan(df, target='target')

print(report.summary())  # Human-readable summary
print(report.issues)     # List of detected issues

Command Line

featureleak scan data.csv --target target --output report.json

What Can FeatureLeak Catch?

  • Target Leakage: Features that "cheat" by revealing the answer
  • Temporal Leakage: Using future info in past predictions
  • Train-Test Contamination: Overlap between train/test sets
  • Entity Leakage: Same entity in both train and test
  • Aggregation Leakage: Pre-aggregated stats leaking test info
  • Identifier Leakage: Unique IDs that act as shortcuts
  • Missingness Leakage: Patterns of missing data that reveal the target
  • Duplicate Leakage: Duplicated rows across splits
  • Preprocessing Inconsistencies: Different transforms for train/test
  • Distribution Shift: Major differences between train and test

...and more!
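To make two of the categories above concrete, here is a minimal sketch of what a target leakage check and a missingness leakage check can look like in plain pandas. This is an illustration only, not FeatureLeak's implementation: the helper names `high_corr_features` and `missingness_corr`, the synthetic columns, and the 0.5 cutoff used for missingness are all assumptions made for this example.

```python
import numpy as np
import pandas as pd


def high_corr_features(df: pd.DataFrame, target: str, threshold: float = 0.98) -> dict:
    """Numeric features whose absolute Pearson correlation with the target reaches the threshold."""
    numeric = df.select_dtypes(include="number")
    corr = numeric.drop(columns=[target]).corrwith(numeric[target]).abs()
    return corr[corr >= threshold].to_dict()


def missingness_corr(df: pd.DataFrame, target: str) -> dict:
    """Absolute correlation between each column's is-missing indicator and the target."""
    scores = {}
    for col in df.columns.drop(target):
        is_missing = df[col].isna()
        # Only columns that are partly missing carry a usable indicator.
        if is_missing.any() and not is_missing.all():
            scores[col] = abs(is_missing.astype(float).corr(df[target].astype(float)))
    return scores


# Synthetic data (hypothetical): one near-copy of the target, one column whose
# missingness depends on the target, and one harmless noise column.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=400)
missing_mask = rng.random(400) < np.where(y == 1, 0.9, 0.1)
df = pd.DataFrame({
    "noise": rng.normal(size=400),
    "previous_target": y + rng.normal(scale=0.01, size=400),
    "lab_result": np.where(missing_mask, np.nan, rng.normal(size=400)),
    "target": y,
})

leaky = high_corr_features(df, "target")
missing_scores = missingness_corr(df, "target")
suspicious_missing = {c: s for c, s in missing_scores.items() if s > 0.5}

print(leaky)
print(suspicious_missing)
```

A real scanner would also need to handle categorical features and non-linear relationships, which a Pearson correlation does not capture.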


📊 Example Output

FeatureLeak Report
──────────────────────────────
Risk score: 75/100 (High)
Total issues: 3
High risk features: 1
Medium risk features: 2

1. [HIGH] target_leakage
     Feature 'previous_target' is 0.98 correlated with target
     Suggested fix: Remove or investigate this feature

🔧 Configuration (Optional)

from featureleak import LeakScanner

scanner = LeakScanner(
    target_corr_threshold=0.98,  # Correlation threshold for target leakage
    overlap_threshold=0.0,       # Allowable train-test overlap
    sample_size=10000,           # Sample for large datasets
)
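For intuition about what an overlap threshold and a sample size can mean, here is a small pandas-only sketch. It shows one plausible reading of those two settings and is not taken from FeatureLeak's source: `overlap_fraction` and `maybe_sample` are hypothetical helpers, and the exact semantics of the real parameters may differ.

```python
import pandas as pd


def overlap_fraction(train: pd.DataFrame, test: pd.DataFrame) -> float:
    """Fraction of test rows that also occur, value for value, in train."""
    cols = list(test.columns)
    marked = test.merge(
        train[cols].drop_duplicates(), how="left", on=cols, indicator=True
    )
    return float((marked["_merge"] == "both").mean())


def maybe_sample(df: pd.DataFrame, sample_size: int = 10_000, seed: int = 0) -> pd.DataFrame:
    """Return df unchanged if it is small, otherwise a reproducible random sample."""
    if len(df) <= sample_size:
        return df
    return df.sample(n=sample_size, random_state=seed)


train = pd.DataFrame({"a": [1, 2, 3, 4], "b": [10, 20, 30, 40]})
test = pd.DataFrame({"a": [3, 4, 5, 6], "b": [30, 40, 50, 60]})

frac = overlap_fraction(train, test)  # 2 of the 4 test rows also appear in train
exceeds = frac > 0.0                  # with a threshold of 0.0, any overlap is flagged

sampled = maybe_sample(pd.DataFrame({"x": range(25_000)}))
print(frac, exceeds, len(sampled))
```

Under this reading, a threshold of 0.0 is the strictest setting: a single shared row is enough to raise an issue.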

💡 Why Data Scientists Love FeatureLeak

  • Saves you from embarrassing mistakes before deployment
  • Works with any tabular data (CSV, pandas DataFrame)
  • Handles time series, entity-based, and large datasets
  • No black box: Every issue comes with a clear explanation and fix
  • Open source, MIT licensed
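For time series and entity-based data, the underlying checks are simple to state. The sketch below, in plain pandas, compares a row-wise split with a cutoff-based split. It is an illustration under assumptions, not FeatureLeak's code: `temporal_overlap`, `shared_entities`, and the `customer_id` and `timestamp` columns are hypothetical.

```python
import pandas as pd


def temporal_overlap(train: pd.DataFrame, test: pd.DataFrame, time_col: str) -> bool:
    """True if any training row is timestamped at or after the earliest test row."""
    return bool(train[time_col].max() >= test[time_col].min())


def shared_entities(train: pd.DataFrame, test: pd.DataFrame, entity_col: str) -> set:
    """Entities that appear in both splits."""
    return set(train[entity_col]) & set(test[entity_col])


events = pd.DataFrame({
    "customer_id": ["a", "a", "b", "c", "c", "d"],
    "timestamp": pd.to_datetime([
        "2024-01-01", "2024-02-01", "2024-01-15",
        "2024-03-01", "2024-04-01", "2024-04-15",
    ]),
    "amount": [10, 12, 7, 30, 28, 5],
})

# A row-wise split that ignores time and entity boundaries.
bad_train, bad_test = events.iloc[[0, 2, 3, 5]], events.iloc[[1, 4]]
bad_temporal = temporal_overlap(bad_train, bad_test, "timestamp")
bad_entities = shared_entities(bad_train, bad_test, "customer_id")

# A cutoff-based split: everything before March trains, the rest tests.
cutoff = pd.Timestamp("2024-03-01")
good_train = events[events["timestamp"] < cutoff]
good_test = events[events["timestamp"] >= cutoff]
good_temporal = temporal_overlap(good_train, good_test, "timestamp")
good_entities = shared_entities(good_train, good_test, "customer_id")

print(bad_temporal, bad_entities)    # True, customers 'a' and 'c' are shared
print(good_temporal, good_entities)  # False, no shared customers
```

In this toy data the cutoff split also happens to separate customers. In general the two properties are independent, and a grouped split is needed to guarantee entity separation.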

📦 Installation

pip install featureleak

Requirements: Python 3.10+, pandas 2.0+, numpy 1.24+, scikit-learn 1.3+


🛠️ Integrate Anywhere

  • Python API: Use in notebooks, scripts, or pipelines
  • CLI: Scan datasets from the terminal or CI/CD
  • JSON Reports: Easy to parse and automate
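As a sketch of how a JSON report could gate a CI job, the snippet below counts issues by severity and returns a non-zero exit code when a high-severity issue is present. The report schema is not documented on this page, so the keys used here (`risk_score`, `issues`, `severity`, `type`, `feature`) are assumptions modelled loosely on the example output above. Check a real report file before relying on them.

```python
import json


def count_by_severity(report: dict) -> dict:
    """Count issues per severity level (keys are hypothetical, see the note above)."""
    counts = {}
    for issue in report.get("issues", []):
        severity = str(issue.get("severity", "unknown")).lower()
        counts[severity] = counts.get(severity, 0) + 1
    return counts


def gate(report: dict, max_high: int = 0) -> int:
    """Return an exit code: 1 if there are more high-severity issues than allowed, else 0."""
    return 1 if count_by_severity(report).get("high", 0) > max_high else 0


# Hypothetical report content; in CI this would come from json.load() on the
# file written by the CLI's --output option.
sample_report = json.loads("""
{
  "risk_score": 75,
  "issues": [
    {"severity": "HIGH", "type": "target_leakage", "feature": "previous_target"}
  ]
}
""")

exit_code = gate(sample_report)
print(count_by_severity(sample_report), exit_code)
```

In a pipeline step, passing `exit_code` to `sys.exit` would fail the job whenever the gate trips.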

📝 Documentation & Help

  • Run featureleak --help for CLI options
  • See examples in the docs
  • Open an issue or PR. Contributions are welcome!

📄 License

MIT License. See LICENSE.


📣 Citing FeatureLeak

If you use FeatureLeak in research, please cite:

@software{featureleak2026,
    author = {McBride, Christian},
    title = {FeatureLeak: Automated Data Leakage Detection},
    year = {2026},
    url = {https://github.com/yourusername/feature_leak}
}

🙏 Acknowledgments

Built with pandas, scikit-learn, and numpy.
