Data leakage detection and audit tool for machine learning
Leakage Buster
Professional Data Leakage Detection & Audit Tool
Detects time leakage, KFold leakage, and CV consistency issues, and produces detailed reports with fix suggestions.
🚀 Quick Start
Installation
# Install from PyPI (dependencies are detected automatically)
pip install leakage-buster
# With optional PDF export support
pip install "leakage-buster[pdf]"
# With optional Polars engine (faster processing)
pip install "leakage-buster[polars]"
# With all optional features
pip install "leakage-buster[pdf,polars]"
✅ Automatic Dependency Detection
When you install via pip install leakage-buster, pip automatically:
- Installs all required dependencies (pandas, numpy, scikit-learn, jinja2, etc.)
- Resolves version conflicts between them
- Builds the full dependency tree
No manual dependency management is needed.
📖 For a detailed installation guide, see INSTALLATION_GUIDE.md
Basic Usage
# Quick test with example data
leakage-buster run \
--train examples/quick_start_example.csv \
--target target \
--time-col date \
--out test_results
# Basic audit with your data
leakage-buster run \
--train your_data.csv \
--target target_column \
--out audit_results
# With time column
leakage-buster run \
--train your_data.csv \
--target target_column \
--time-col date_column \
--out audit_results
# Advanced features
leakage-buster run \
--train your_data.csv \
--target target_column \
--time-col date_column \
--simulate-cv time \
--auto-fix plan \
--export pdf \
--out audit_results
📋 What It Does
Problem It Solves
- Time Leakage: Future data accidentally used in training
- Target Leakage: Target information leaked into features
- CV Leakage: A cross-validation strategy that does not match the data
- Statistical Leakage: Issues with target encoding, WOE, and rolling statistics
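To make the time leakage item concrete, here is a minimal sketch of a cutoff-based split in plain Python. It is an illustration only, not how Leakage Buster works internally; the function name `split_by_cutoff` and the sample rows are hypothetical.

```python
from datetime import date


def split_by_cutoff(rows, time_key, cutoff):
    """Rows dated strictly before `cutoff` go to training, the rest to validation."""
    train = [r for r in rows if r[time_key] < cutoff]
    valid = [r for r in rows if r[time_key] >= cutoff]
    return train, valid


# Hypothetical rows, deliberately out of chronological order.
rows = [
    {"date": date(2024, 1, 5), "target": 1},
    {"date": date(2024, 1, 1), "target": 0},
    {"date": date(2024, 1, 4), "target": 1},
    {"date": date(2024, 1, 2), "target": 0},
    {"date": date(2024, 1, 3), "target": 1},
]

train, valid = split_by_cutoff(rows, "date", date(2024, 1, 4))

# No training row is dated on or after the earliest validation row,
# so the model never sees the future during training.
assert max(r["date"] for r in train) < min(r["date"] for r in valid)
print(len(train), "training rows,", len(valid), "validation rows")
```

A random shuffle of the same rows would mix later dates into the training set, which is the situation the time leakage check is meant to catch.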
Detection Capabilities
| Detection Type | Description |
|---|---|
| Target Leakage | Highly correlated features (abs(corr) or R² ≥ 0.98) |
| Statistical Leakage | TE/WOE/rolling statistics issues |
| Time Leakage | Time column parsing and validation |
| Group Leakage | Columns with a high duplicate rate → GroupKFold |
| CV Consistency | TimeSeriesSplit vs KFold vs GroupKFold |
| Policy Audit | Offline/online calibration differences |
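The target leakage row can be illustrated with a small correlation screen. The sketch below uses only the standard library and the 0.98 threshold from the table; the helper names (`pearson`, `flag_suspicious`) and the sample data are hypothetical, and the tool's actual detector may work differently.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def flag_suspicious(features, target, threshold=0.98):
    """Return the names of features whose abs(corr) with the target meets the threshold."""
    return sorted(
        name
        for name, values in features.items()
        if abs(pearson(values, target)) >= threshold
    )


target = [0, 1, 0, 1, 1, 0]
features = {
    "leaky_copy": [0.0, 1.0, 0.0, 1.0, 1.0, 0.0],  # identical to the target
    "negated": [1, 0, 1, 0, 0, 1],                 # perfectly anti-correlated
    "noise": [3, 1, 4, 1, 5, 9],                   # unrelated values
}

flagged = flag_suspicious(features, target)
print(flagged)
```

Taking the absolute value matters here: a feature that is a perfect inverse of the target leaks just as much as a copy.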
🔧 Core Features
🔍 Comprehensive Detection
- Target Leakage: High correlation, categorical purity anomalies
- Statistical Leakage: Target encoding (TE), WOE, rolling statistics, aggregation traces
- Time Leakage: Time column parsing, time-aware suggestions
- Group Leakage: High duplicate columns → GroupKFold recommendations
- CV Strategy: TimeSeriesSplit vs KFold vs GroupKFold recommendations
- Policy Audit: Offline/online calibration consistency checks
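The GroupKFold recommendation exists because rows that share an entity (a user, a device, a session) must not be split across training and validation folds. The following is a rough greedy sketch of group-aware fold assignment in plain Python, similar in spirit to scikit-learn's GroupKFold but not a reimplementation of it; `group_folds` and the sample IDs are hypothetical.

```python
from collections import Counter


def group_folds(groups, n_splits):
    """Assign a fold index to each row so that rows sharing a group share a fold."""
    counts = Counter(groups)
    fold_sizes = [0] * n_splits
    fold_of_group = {}
    # Place the largest groups first, each into the currently smallest fold.
    for group, size in sorted(counts.items(), key=lambda kv: (-kv[1], str(kv[0]))):
        fold = fold_sizes.index(min(fold_sizes))
        fold_of_group[group] = fold
        fold_sizes[fold] += size
    return [fold_of_group[g] for g in groups]


# Hypothetical user IDs: a column with many repeated values.
user_ids = ["u1", "u1", "u2", "u3", "u3", "u3", "u4"]
folds = group_folds(user_ids, n_splits=2)

# Every user appears in exactly one fold.
for user in set(user_ids):
    assert len({f for u, f in zip(user_ids, folds) if u == user}) == 1
print(list(zip(user_ids, folds)))
```

With a plain KFold, rows from the same user would usually land on both sides of a split, which inflates validation scores.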
⚡ High Performance
- Multi-Engine Support: pandas (default), polars (optional)
- Parallel Processing: Multi-core detection with --n-jobs
- Memory Control: Smart memory management with --memory-cap
- Large Data Support: Chunking and sampling for million-row datasets
- Performance Optimization: Automatic data type optimization
🔧 Semi-Automatic Repair
- Fix Plan Generation: Structured repair suggestions JSON
- Automatic Fix Application: Apply fixes based on plan
- Smart Suggestions: Delete/recalculate/recommend CV & groups
- Evidence Tracking: Record source risks and reasoning
📊 Professional Reports
- Interactive Reports: Risk radar charts, risk matrices, collapsible evidence
- Multi-Format Export: HTML, PDF, SARIF (GitHub Code Scanning)
- Detailed Metadata: Git hash, random seed, system info
- Responsive Design: Mobile and print support
🐍 Stable SDK
- Python API: audit(), plan_fixes(), apply_fixes()
- Type Safety: Complete type annotations and Pydantic models
- CI-Friendly: Standardized exit codes and error handling
- Well Documented: Detailed API documentation and examples
📋 Complete Parameter Reference
Basic Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| --train | str | Required | Training data CSV file path |
| --target | str | Required | Target column name |
| --time-col | str | None | Time column name (optional) |
| --out | str | Required | Output directory |
CV Strategy Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| --cv-type | str | None | CV strategy: kfold/timeseries/group |
| --simulate-cv | str | None | Enable time-series simulation: time |
| --leak-threshold | float | 0.02 | Leakage threshold |
| --cv-policy-file | str | None | CV policy config file (YAML) |
Performance Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| --engine | str | pandas | Data engine: pandas/polars |
| --n-jobs | int | -1 | Parallel jobs (-1 = auto) |
| --memory-cap | int | 4096 | Memory limit (MB) |
| --sample-ratio | float | None | Sampling ratio for large datasets |
Export Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| --export | str | None | Export format: pdf |
| --export-sarif | str | None | SARIF file path (GitHub Code Scanning) |
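For readers unfamiliar with SARIF, the sketch below builds a minimal SARIF 2.1.0 skeleton in Python. It illustrates the file format only and is not the file that --export-sarif actually writes: the rule ID, the finding, and the severity-to-level mapping are assumptions made for this example.

```python
import json

# Assumed mapping from audit severity to SARIF result levels. SARIF itself
# defines the levels "none", "note", "warning" and "error".
SEVERITY_TO_LEVEL = {"high": "error", "medium": "warning", "low": "note"}


def to_sarif(findings, tool_name="leakage-buster"):
    """Wrap a list of finding dicts in a minimal SARIF 2.1.0 document."""
    results = [
        {
            "ruleId": finding["rule_id"],
            "level": SEVERITY_TO_LEVEL[finding["severity"]],
            "message": {"text": finding["message"]},
        }
        for finding in findings
    ]
    return {
        "$schema": "https://json.schemastore.org/sarif-2.1.0.json",
        "version": "2.1.0",
        "runs": [{"tool": {"driver": {"name": tool_name}}, "results": results}],
    }


# Hypothetical finding, for illustration only.
findings = [
    {
        "rule_id": "target-leakage",
        "severity": "high",
        "message": "Feature 'x' has abs(corr) >= 0.98 with the target",
    }
]

sarif = to_sarif(findings)
print(json.dumps(sarif, indent=2))
```

A real report would normally also attach a location to each result so that a code scanning UI can point at a file.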
Auto-Fix Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| --auto-fix | str | None | Auto-fix mode: plan/apply |
| --fix-json | str | None | Fix plan JSON output path |
| --fixed-train | str | None | Fixed data CSV output path |
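When the CLI is driven from a script, it can help to assemble the argument list in one place. The helper below is a hypothetical convenience wrapper, not part of the package; it uses only flags listed in the tables above and does not run anything itself.

```python
import shlex


def build_audit_command(train, target, out, time_col=None, auto_fix=None,
                        fix_json=None, fixed_train=None):
    """Build an argument list for `leakage-buster run` from documented flags."""
    if auto_fix not in (None, "plan", "apply"):
        raise ValueError("auto_fix must be 'plan' or 'apply'")
    cmd = ["leakage-buster", "run", "--train", train, "--target", target]
    optional = [
        ("--time-col", time_col),
        ("--auto-fix", auto_fix),
        ("--fix-json", fix_json),
        ("--fixed-train", fixed_train),
    ]
    for flag, value in optional:
        if value is not None:  # only emit flags the caller actually set
            cmd += [flag, value]
    cmd += ["--out", out]
    return cmd


plan_cmd = build_audit_command(
    "data/train.csv", "target", "results/audit",
    auto_fix="plan", fix_json="results/fix_plan.json",
)
print(shlex.join(plan_cmd))
```

Passing the resulting list to `subprocess.run` avoids shell quoting problems with paths that contain spaces.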
📊 Usage Examples
Example 1: Basic Audit
# Detect all types of leakage
leakage-buster run \
--train data/train.csv \
--target target \
--out results/basic_audit
# Output files:
# - results/basic_audit/report.html
# - results/basic_audit/fix_transforms.py
# - results/basic_audit/meta.json
Example 2: Time Series Analysis
# Time-aware analysis with simulation
leakage-buster run \
--train data/time_series.csv \
--target target \
--time-col timestamp \
--simulate-cv time \
--leak-threshold 0.05 \
--out results/time_audit
Example 3: High Performance
# Use Polars engine with parallel processing
leakage-buster run \
--train data/large_dataset.csv \
--target target \
--engine polars \
--n-jobs 8 \
--memory-cap 8192 \
--sample-ratio 0.1 \
--out results/perf_audit
Example 4: Auto-Fix
# Generate fix plan
leakage-buster run \
--train data/problematic_data.csv \
--target target \
--auto-fix plan \
--fix-json results/fix_plan.json \
--out results/audit
# Apply fixes
leakage-buster run \
--train data/problematic_data.csv \
--target target \
--auto-fix apply \
--fixed-train results/fixed_data.csv \
--out results/final_audit
Example 5: Professional Export
# Export PDF report and SARIF for GitHub
leakage-buster run \
--train data/production_data.csv \
--target target \
--time-col date \
--export pdf \
--export-sarif results/leakage.sarif \
--out results/production_audit
🐳 Docker Usage
Build Image
docker build -t leakage-buster .
Run Container
# Basic usage (the working directory is quoted so paths with spaces still mount)
docker run -v "$(pwd)":/data leakage-buster run \
--train /data/data.csv --target y --out /data/output
# High performance
docker run -v "$(pwd)":/data leakage-buster run \
--train /data/data.csv --target y --out /data/output \
--engine pandas --n-jobs 8 --memory-cap 4096
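🔄 CI/CD Integration
GitHub Actions Example
- name: Run leakage audit
  run: |
    # GitHub Actions runs bash with -e, so capture the exit code instead of letting the step abort
    code=0
    leakage-buster run --train data/train.csv --target y --time-col date --out runs/audit || code=$?
    if [ "$code" -eq 3 ]; then
      echo "❌ High leakage detected! Build failed."
      exit 1
    elif [ "$code" -ne 0 ] && [ "$code" -ne 2 ]; then
      exit "$code"
    fi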
Exit Codes
| Code | Meaning | Description |
|---|---|---|
| 0 | Success | No risks detected |
| 2 | Warning | Medium- or low-severity risks detected |
| 3 | High Risk | High-severity leakage detected; requires immediate attention |
| 4 | Error | Configuration error; the audit could not run |
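The same exit codes can be consumed from a Python build script. The sketch below is one possible gating policy, not something shipped with the package: `should_fail_build` and `run_audit` are hypothetical helpers, and treating unknown codes as failures is a design choice made for this example.

```python
import subprocess

# Exit codes as documented in the table above.
EXIT_MEANINGS = {0: "Success", 2: "Warning", 3: "High Risk", 4: "Error"}


def should_fail_build(code, fail_on_warning=False):
    """Decide whether a CI job should fail for a given exit code."""
    if code == 0:
        return False
    if code == 2:
        return fail_on_warning
    # 3 (high risk), 4 (configuration error) and any undocumented code fail.
    return True


def run_audit(cmd):
    """Run the CLI and return its exit code without raising on non-zero status."""
    return subprocess.run(cmd, check=False).returncode


for code in (0, 2, 3, 4):
    verdict = "fail" if should_fail_build(code) else "pass"
    print(f"exit {code} ({EXIT_MEANINGS[code]}): {verdict}")
```

In a pipeline, `should_fail_build(run_audit([...]))` would then decide whether to call `sys.exit(1)`.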
🐍 Python SDK
Basic Usage
import pandas as pd
from leakage_buster.api import audit, plan_fixes, apply_fixes_to_dataframe
# Load data
df = pd.read_csv('data/train.csv')
# Run audit
audit_result = audit(df, target='target', time_col='date')
# Generate fix plan
fix_plan = plan_fixes(audit_result, 'data/train.csv')
# Apply fixes
fixed_df = apply_fixes_to_dataframe(df, fix_plan)
# Check results
print(f"Found {len(audit_result.risks)} risks")
print(f"High risks: {audit_result.has_high_risk}")
Advanced Usage
from leakage_buster.api import audit
from leakage_buster.core import DataLoader, ParallelProcessor
# Custom data loading
loader = DataLoader(engine='polars', memory_cap=4096)
df = loader.load('data/large_dataset.csv')
# Parallel processing
processor = ParallelProcessor(n_jobs=8)
audit_result = audit(df, target='target', parallel_processor=processor)
# Access detailed results
for risk in audit_result.risks:
    print(f"Risk: {risk.name}")
    print(f"Severity: {risk.severity}")
    print(f"Evidence: {risk.evidence}")
    print(f"Leak Score: {risk.leak_score}")
📊 Performance Benchmarks
Test Environment
- CPU: 8-core Intel i7
- Memory: 16GB RAM
- Data: 150K rows × 250 columns
Performance Metrics
| Metric | pandas | polars | Improvement (polars vs. pandas) |
|---|---|---|---|
| Load Time | 15.2s | 8.7s | 1.7x |
| Audit Time | 45.3s | 28.1s | 1.6x |
| Memory Usage | 2.1GB | 1.4GB | 1.5x |
| Parallel Efficiency | 6.2x | 7.8x | 1.3x |
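The improvement column can be reproduced from the two measurement columns. In the short check below, the direction of each ratio is an assumption about how the figures were derived: pandas divided by polars where lower is better (time, memory), and polars divided by pandas where higher is better (parallel efficiency).

```python
# (metric, pandas value, polars value, higher_is_better), taken from the table.
BENCHMARKS = [
    ("Load Time (s)", 15.2, 8.7, False),
    ("Audit Time (s)", 45.3, 28.1, False),
    ("Memory Usage (GB)", 2.1, 1.4, False),
    ("Parallel Efficiency (x)", 6.2, 7.8, True),
]


def improvement(pandas_value, polars_value, higher_is_better):
    """Ratio by which polars improves on pandas, rounded to one decimal place."""
    if higher_is_better:
        ratio = polars_value / pandas_value
    else:
        ratio = pandas_value / polars_value
    return round(ratio, 1)


ratios = {
    name: improvement(pandas_value, polars_value, higher_is_better)
    for name, pandas_value, polars_value, higher_is_better in BENCHMARKS
}
for name, ratio in ratios.items():
    print(f"{name}: {ratio}x")
```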
🧪 Testing
Run Tests
# Run all tests
pytest -q
# Run performance tests
pytest tests/perf/test_perf_medium.py -k perf -s
# Skip slow tests
pytest -q -k "not slow"
Test Coverage
- Unit Tests: 100% core functionality coverage
- Integration Tests: CLI and API end-to-end tests
- Performance Tests: Medium-scale dataset tests
- Security Tests: Bandit and Safety scans
📈 Version History
v1.0.1 (Current)
- 🔧 Fixed CI test failures
- 🔧 Removed debug files from the GitHub repository
- ✨ Added PyPI publishing support
- 🔧 Fixed README badge links
v1.0.0
- ✨ Performance & fault tolerance: pandas/polars engines, parallel processing, memory control
- ✨ Professional reports: risk radar charts, interactive UI, multi-format export
- ✨ Docker support: lightweight image, health checks, complete metadata
- ✨ PyPI ready: complete metadata, optional dependencies, test configuration
v0.5-rc
- ✨ Semi-automatic repair system
- ✨ Stable Python SDK
- ✨ Standardized exit codes
v0.4.0
- ✨ Calibration consistency audit
- ✨ PDF/SARIF export
- ✨ Upgraded report template
v0.3.0
- ✨ Statistical leakage detection
- ✨ Time series simulator
- ✨ Quantified leak scores
v0.2.0
- ✨ Extended detection framework
- ✨ JSON schema contract
v0.1.0
- 🎉 Initial release
🤝 Contributing
We welcome contributions of all kinds!
How to Contribute
- Report Issues: GitHub Issues
- Feature Requests: GitHub Discussions
- Code Contributions: Fork → Develop → Pull Request
- Documentation: Direct edits or PR submissions
Development Environment
git clone https://github.com/li147852xu/leakage-buster.git
cd leakage-buster
pip install -e ".[dev]"
pytest
📄 License
MIT License - See LICENSE file for details
🙏 Acknowledgments
Thanks to all contributors and users for their support!
Leakage Buster - Leaving data leakage nowhere to hide! 🕵️‍♂️