Detect data leakage in ML datasets and pipelines at runtime — focused, framework-agnostic, CI-friendly.
leaklint
Runtime data-leakage detection for ML datasets and pipelines. Point it at your
actual data and splits — leaklint tells you, ranked and in plain language, where
leakage is hiding.
Data leakage (information from outside the training data sneaking in) is one of the most common reasons a model looks great offline and fails in production. Kapoor & Narayanan (Patterns, 2023) document it affecting 294 papers across 17 fields.
Why another tool?
| Approach | Examples | Limitation |
|---|---|---|
| IDE / static code analysis | LeakageDetector (PyCharm/VSCode) | reads your code, not your data; not an importable library |
| broad validation suites | deepchecks | leakage is a small, shallow part of a heavy suite |
| narrow black-box trick | leak-detect | dormant since 2020; instruments one function |
leaklint is focused, framework-agnostic (numpy / pandas / any sklearn-style
estimator), zero-config, and CI-friendly — it inspects the data + splits directly.
Install
pip install -e . # from this repo (PyPI release TBD)
Usage
from leaklint import detect_leakage

report = detect_leakage(
    X_train, y_train, X_test, y_test,
    groups_train=cust_ids_tr, groups_test=cust_ids_te,  # optional
    time_train=dates_tr, time_test=dates_te,            # optional
)
report.summary()          # ranked, human-readable findings
if not report.clean:      # gate your CI
    raise SystemExit(1)
What it detects
- Exact train/test duplicate rows (overlap contamination): NaN- and float-safe hashing
- Near-duplicate rows across splits (jittered copies / augmentation overlap): opt-in (enable_near_dup=True); distance-based, can false-positive on discrete/low-cardinality data
- Group leakage: same entity/group in both train and test (partial overlap counts)
- Temporal leakage: training rows dated at/after the test period (mixed-tz / sub-day safe)
- Leaky / target-proxy features: a single feature that ~perfectly predicts the target, via AUC/correlation and mutual information (catches non-linear, non-monotonic proxies)
- Target / mean-encoding: a feature whose values equal the per-group target mean
- Identifier features: near-unique id-like columns that shouldn't be features
- Within-train duplicates: inflate cross-validated scores
- Preprocessing leakage: a scaler/imputer (transformer=) fit on train+test, plus encoders / feature-selectors (categories or selected-features fit on the full data)
- Cross-validation fold leakage: pass cv_splits=[(train_idx, test_idx), ...] to catch index bleed / duplicate rows shared across folds
Multi-output / multi-label targets, NaN, all-constant columns, and integer-encoded
categoricals (categorical_features=[...], excluded from near-dup distance) are handled.
All stochastic steps are deterministic via random_state.
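Several of the checks listed above reduce to set and comparison logic. The following is a minimal sketch of that idea, not leaklint's implementation: the function names are hypothetical, and "the test period" is assumed to start at the earliest test timestamp.

```python
from datetime import datetime


def group_overlap(groups_train, groups_test):
    """Return the group ids that appear in both splits (partial overlap counts)."""
    return set(groups_train) & set(groups_test)


def temporal_violations(time_train, time_test):
    """Return indices of training rows dated at or after the earliest test row.

    Assumption: the test period begins at min(time_test).
    """
    test_start = min(time_test)
    return [i for i, t in enumerate(time_train) if t >= test_start]


def fold_index_bleed(cv_splits):
    """For each (train_idx, test_idx) fold, list indices present on both sides."""
    return [sorted(set(tr) & set(te)) for tr, te in cv_splits]


# Toy data: customer "c2" sits in both splits, one training row is dated
# inside the test period, and the first CV fold shares row index 2.
groups_tr = ["c1", "c2", "c3"]
groups_te = ["c2", "c4"]
dates_tr = [datetime(2024, 1, 5), datetime(2024, 2, 1), datetime(2024, 3, 10)]
dates_te = [datetime(2024, 3, 1), datetime(2024, 3, 20)]
splits = [([0, 1, 2], [2, 3]), ([0, 1], [2, 3])]

shared_groups = group_overlap(groups_tr, groups_te)
late_rows = temporal_violations(dates_tr, dates_te)
bleed = fold_index_bleed(splits)
```

Returning the offending ids and indices, rather than a bare boolean, is what makes a ranked, human-readable report possible.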
scikit-learn pipelines
from leaklint import audit_pipeline
report = audit_pipeline(fitted_pipeline, X_train, y_train, X_test, y_test)
Pulls the transformer steps out of the pipeline and checks them for fit-on-full-data leakage, plus the usual data-level checks.
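One way to picture a fit-on-full-data check: a scaler stores statistics such as a per-feature mean (scikit-learn's StandardScaler exposes this as mean_), and that stored value can be compared against the mean of the training rows alone and of train+test combined. The sketch below assumes that approach for a single feature; looks_fit_on_full_data is a hypothetical helper, not part of leaklint.

```python
import math
from statistics import fmean


def looks_fit_on_full_data(fitted_mean, train_col, test_col, rel_tol=1e-9):
    """Guess whether a stored mean came from train+test rather than train alone."""
    train_mean = fmean(train_col)
    full_mean = fmean(list(train_col) + list(test_col))
    matches_train = math.isclose(fitted_mean, train_mean, rel_tol=rel_tol, abs_tol=1e-12)
    matches_full = math.isclose(fitted_mean, full_mean, rel_tol=rel_tol, abs_tol=1e-12)
    # Flag only when the statistic matches the full data and not the training data;
    # if both match, the two means coincide and nothing can be concluded.
    return matches_full and not matches_train


train = [1.0, 2.0, 3.0]
test = [10.0, 20.0]
leaky_mean = fmean(train + test)  # what a scaler fit on train+test would store
clean_mean = fmean(train)         # what a scaler fit on train only would store

flag_leaky = looks_fit_on_full_data(leaky_mean, train, test)
flag_clean = looks_fit_on_full_data(clean_mean, train, test)
```

When train and test happen to share the same mean, this comparison is inconclusive, which is one reason such a signal is a suspicion rather than proof.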
Use in CI
CLI (exits non-zero when leakage is found, so it gates the build):
leaklint --train train.csv --test test.csv --target label
# optional: --groups customer_id --time signup_date --enable-near-dup
GitHub Actions:
- run: pip install leaklint
- run: leaklint --train data/train.csv --test data/test.csv --target label --sarif > leaklint.sarif
Machine-readable output for artifact storage / diff-on-PR:
report.to_json() # {"clean": bool, "findings": [...]}
report.to_sarif() # SARIF 2.1.0 (GitHub code-scanning compatible)
leaklint --train t.csv --test e.csv --target y --json # or --sarif
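A CI step that stores the JSON as an artifact can also gate on it. The sketch below relies only on the two documented top-level keys, "clean" and "findings"; the shape of each individual finding is not specified here, so the example treats entries as opaque, and the helper names are hypothetical.

```python
import json


def should_fail_build(report_json: str) -> bool:
    """Return True when the report says leakage was found."""
    report = json.loads(report_json)
    return not report["clean"]


def finding_count(report_json: str) -> int:
    """Count findings without assuming anything about their fields."""
    return len(json.loads(report_json)["findings"])


# Illustrative payloads; the finding entry is a placeholder, not real output.
clean_payload = '{"clean": true, "findings": []}'
dirty_payload = '{"clean": false, "findings": [{"placeholder": "opaque finding"}]}'

fail_clean = should_fail_build(clean_payload)
fail_dirty = should_fail_build(dirty_payload)
n_dirty = finding_count(dirty_payload)
```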
Per-check severity (not everyone's "group leakage" is fatal) via leaklint.yaml or a dict:
# leaklint.yaml (auto-discovered, or pass --config / config=)
severity:
  group_leakage: low               # downgrade
  within_train_duplicates: ignore  # mute
Or as a local pre-commit hook (runs the check before each commit):
# .pre-commit-config.yaml
repos:
  - repo: local
    hooks:
      - id: leaklint
        name: leaklint
        entry: leaklint --train data/train.csv --test data/test.csv --target label
        language: system
        pass_filenames: false
Threshold validation
Measured by scripts/validate_thresholds.py on 13 public OpenML datasets:
- Near-duplicate eps (default = data-relative 0.1 × median nearest-neighbour distance): recall 1.00 on injected near-duplicates across all 13 datasets; clean-data false-positive rate mean 0.028, max 0.163 (spambase). Because that false-positive rate is dataset-dependent, near-dup ships opt-in (enable_near_dup=True).
- Identifier near-unique threshold (≥ 95% distinct): 0 / 251 columns flagged across the (id-free) benchmark datasets → empirical FPR ≈ 0.0000, so valid ordinals aren't flagged.
- Leaky-feature AUC threshold is configurable (leaky_auc_threshold, default 0.99). AUC is rank-based and threshold-free, so it is robust to class imbalance for ranking; under extreme imbalance its variance grows, so treat near-threshold hits as suspicions, not proof.
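Two of the thresholds above can be illustrated in plain Python. This is a sketch under stated assumptions, not leaklint's code: feature_auc and data_relative_eps are hypothetical names, the brute-force loops stand in for the tree-based search the library describes, computing the median within a single set of rows is an assumption, and so is treating max(AUC, 1 - AUC) as the direction-agnostic signal.

```python
import math
from statistics import median


def feature_auc(values, labels):
    """Rank-based AUC of one feature against binary labels (ties count as 0.5)."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def is_leaky(values, labels, threshold=0.99):
    """Flag a feature whose AUC is near 1 in either direction (assumed rule)."""
    auc = feature_auc(values, labels)
    return max(auc, 1.0 - auc) >= threshold


def data_relative_eps(rows, factor=0.1):
    """factor x median nearest-neighbour distance, computed by brute force."""
    nearest = [
        min(math.dist(a, b) for j, b in enumerate(rows) if j != i)
        for i, a in enumerate(rows)
    ]
    return factor * median(nearest)


labels = [0, 0, 1, 1]
proxy = [0.1, 0.2, 0.8, 0.9]   # separates the classes perfectly
noisy = [0.1, 0.8, 0.2, 0.9]   # only partly informative

auc_proxy = feature_auc(proxy, labels)
auc_noisy = feature_auc(noisy, labels)

# Nearest-neighbour distances for these rows are 1, 1, 2 and 4 (median 1.5).
rows = [(0.0, 0.0), (0.0, 1.0), (0.0, 3.0), (4.0, 3.0)]
eps = data_relative_eps(rows)
```

Scaling eps by the data's own nearest-neighbour distances keeps the default meaningful across datasets with very different feature scales, which a fixed absolute tolerance would not.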
Scale
- Exact-duplicate detection is hash-based and chunked (chunk_size=), with O(n) memory.
- Near-duplicate uses tree-based nearest-neighbour search with a sampling fallback (max_rows=, default 20k; sampling lowers recall but never invents matches) and is deterministic via random_state. Optional progress=True shows a tqdm bar.
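To make the hash-based, chunked approach concrete, here is a minimal sketch of exact train/test overlap detection. It is an assumption-laden illustration rather than leaklint's code: the helper names, the canonicalisation rules (all NaNs map to one key, -0.0 folds into 0.0, integers compare equal to the matching float), and the choice of BLAKE2b are all hypothetical.

```python
import hashlib
import math


def _canonical(value):
    """Map a cell to a stable string so equal values produce equal hashes."""
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        value = float(value)
        if math.isnan(value):
            return "nan"   # NaN != NaN, so give every NaN the same key
        if value == 0.0:
            return "0.0"   # fold -0.0 into 0.0
    return repr(value)


def row_digest(row):
    """Fixed-size digest of one row."""
    joined = "\x1f".join(_canonical(v) for v in row)
    return hashlib.blake2b(joined.encode("utf-8"), digest_size=16).digest()


def chunks(rows, chunk_size):
    """Yield successive slices of at most chunk_size rows."""
    for start in range(0, len(rows), chunk_size):
        yield rows[start:start + chunk_size]


def exact_overlap(train_rows, test_rows, chunk_size=2):
    """Return indices of test rows whose digest also occurs in train."""
    seen = set()
    for chunk in chunks(train_rows, chunk_size):
        seen.update(row_digest(r) for r in chunk)  # one digest per train row
    hits, offset = [], 0
    for chunk in chunks(test_rows, chunk_size):
        hits.extend(offset + i for i, r in enumerate(chunk) if row_digest(r) in seen)
        offset += len(chunk)
    return hits


train = [(1.0, float("nan")), (2.0, 3.0), (-0.0, 5.0)]
test = [(1.0, float("nan")), (9.0, 9.0), (0.0, 5.0), (2, 3)]
overlap = exact_overlap(train, test)
```

Only the digests are retained, so memory grows with the number of rows rather than with row width.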
Honest limitations
- Some leakage is context-dependent (e.g., "is this feature known at prediction time?"). leaklint flags mechanically detectable leakage and suspicious signals; it does not claim to certify a dataset leak-free.
- The near-duplicate threshold is a documented heuristic (near_dup_eps=), tunable.
- Complementary to static-analysis tools (they catch code-level mistakes like scaler.fit(X) before the split that a data-only view can miss).
Roadmap (not yet implemented)
- scipy.sparse input support for high-dimensional text/embedding data.
- LSH (e.g. datasketch) for near-dup beyond the sampling cap; today large data is handled by the sampling fallback + tree search rather than true sub-linear LSH.
- True out-of-core streaming from disk (current chunking still expects an in-memory frame).