Detect data leakage in ML datasets and pipelines at runtime — focused, framework-agnostic, CI-friendly.

leaklint

Runtime data-leakage detection for ML datasets and pipelines. Point it at your actual data and splits — leaklint tells you, ranked and in plain language, where leakage is hiding.

Data leakage (information from outside the training data sneaking into model building or evaluation) is one of the most common reasons a model looks great offline and fails in production. Kapoor & Narayanan (Patterns, 2023) document leakage affecting 294 papers across 17 fields.

Why another tool?

Approach	Examples	Limitation
IDE / static code analysis	LeakageDetector (PyCharm/VSCode)	reads your code, not your data; not an importable library
broad validation suites	deepchecks	leakage is a small, shallow part of a heavy suite
narrow black-box trick	leak-detect	dormant since 2020; instruments one function

leaklint is focused, framework-agnostic (numpy / pandas / any sklearn-style estimator), zero-config, and CI-friendly — it inspects the data + splits directly.

Install

pip install leaklint    # from PyPI
pip install -e .        # or from a checkout of this repo

Usage

from leaklint import detect_leakage

report = detect_leakage(
    X_train, y_train, X_test, y_test,
    groups_train=cust_ids_tr, groups_test=cust_ids_te,  # optional
    time_train=dates_tr,      time_test=dates_te,        # optional
)
report.summary()          # ranked, human-readable findings
if not report.clean:      # gate your CI
    raise SystemExit(1)

What it detects

Exact train/test duplicate rows (overlap contamination) — NaN- and float-safe hashing
Near-duplicate rows across splits (jittered copies / augmentation overlap) — opt-in (enable_near_dup=True); distance-based, can false-positive on discrete/low-cardinality data
Group leakage — same entity/group in both train and test (partial overlap counts)
Temporal leakage — training rows dated at/after the test period (mixed-tz / sub-day safe)
Leaky / target-proxy features — a single feature that ~perfectly predicts the target, via AUC/correlation and mutual information (catches non-linear, non-monotonic proxies)
Target / mean-encoding — a feature whose values equal the per-group target mean
Identifier features — near-unique id-like columns that shouldn't be features
Within-train duplicates — inflate cross-validated scores
Preprocessing leakage — a scaler/imputer (transformer=) fit on train+test, plus encoders / feature-selectors (categories or selected-features fit on the full data)
Cross-validation fold leakage — pass cv_splits=[(train_idx, test_idx), ...] to catch index bleed / duplicate rows shared across folds
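The three split-level checks at the top of the list (exact duplicates, group overlap, temporal overlap) can be illustrated with a standard-library sketch. This is not leaklint's implementation: the function names, the BLAKE2b digest, and the choice to treat naive timestamps as UTC are all assumptions made for illustration.

```python
import hashlib
import math
import struct
from datetime import datetime, timedelta, timezone


def row_digest(row):
    """Hash a numeric row so that NaN matches NaN and -0.0 matches 0.0."""
    h = hashlib.blake2b(digest_size=16)
    for value in row:
        x = float(value)
        if math.isnan(x):
            h.update(b"\x00" * 9)  # one fixed-width token for every NaN
        else:
            # x + 0.0 turns -0.0 into 0.0 before packing the raw bytes
            h.update(b"\x01" + struct.pack("<d", x + 0.0))
    return h.digest()


def exact_overlap(train_rows, test_rows):
    """Indices of test rows whose digest also occurs in train."""
    seen = {row_digest(r) for r in train_rows}
    return [i for i, r in enumerate(test_rows) if row_digest(r) in seen]


def group_overlap(groups_train, groups_test):
    """Groups present on both sides of the split (partial overlap counts)."""
    return set(groups_train) & set(groups_test)


def late_train_rows(time_train, time_test):
    """Indices of train rows dated at or after the earliest test timestamp."""
    def to_utc(ts):
        # Assumption of this sketch: naive timestamps are read as UTC.
        if ts.tzinfo is None:
            return ts.replace(tzinfo=timezone.utc)
        return ts.astimezone(timezone.utc)

    cutoff = min(to_utc(t) for t in time_test)
    return [i for i, t in enumerate(time_train) if to_utc(t) >= cutoff]


# Toy data (made up for this example).
train = [[1.0, float("nan")], [0.0, 2.0]]
test = [[1.0, float("nan")], [-0.0, 2.0], [3.0, 4.0]]
dup_idx = exact_overlap(train, test)

shared = group_overlap(["a", "b"], ["b", "c"])

utc = timezone.utc
time_train = [
    datetime(2024, 1, 1, tzinfo=utc),
    datetime(2024, 3, 1, 2, 0, tzinfo=timezone(timedelta(hours=5))),  # Feb 29 21:00 UTC
    datetime(2024, 3, 1, 12, 0, tzinfo=utc),
]
time_test = [datetime(2024, 3, 1, tzinfo=utc)]
late_idx = late_train_rows(time_train, time_test)
```

Comparing in UTC is what keeps the second training timestamp from being flagged: its local date is March 1, but it falls before the test cutoff once the offset is applied.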

Multi-output / multi-label targets, NaN, all-constant columns, and integer-encoded categoricals (categorical_features=[...], excluded from near-dup distance) are handled. All stochastic steps are deterministic via random_state.

scikit-learn pipelines

from leaklint import audit_pipeline
report = audit_pipeline(fitted_pipeline, X_train, y_train, X_test, y_test)

audit_pipeline pulls the transformer steps out of the fitted pipeline and checks them for fit-on-full-data leakage, then runs the usual data-level checks.
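One way to reason about fit-on-full-data leakage: a scaler's learned mean should match the training column, not the concatenation of train and test. The sketch below is a hypothetical check written for illustration, not leaklint's own logic. The function name and tolerance are assumptions; with scikit-learn, the fitted value could come from StandardScaler.mean_.

```python
from statistics import fmean


def fitted_on_full_data(fitted_mean, train_col, test_col, rel_tol=1e-9):
    """True if a fitted mean matches train+test but not train alone."""
    train_mean = fmean(train_col)
    full_mean = fmean(list(train_col) + list(test_col))
    scale = max(abs(train_mean), abs(full_mean), 1.0)
    matches_full = abs(fitted_mean - full_mean) <= rel_tol * scale
    matches_train = abs(fitted_mean - train_mean) <= rel_tol * scale
    # If train and test happen to share the same mean, the two cases are
    # indistinguishable from the statistic alone, so nothing is flagged.
    return matches_full and not matches_train


train_col = [1.0, 2.0, 3.0]
test_col = [10.0, 11.0]

leaky = fitted_on_full_data(5.4, train_col, test_col)   # mean of all five values
honest = fitted_on_full_data(2.0, train_col, test_col)  # mean of train only
```

A check of this kind only sees fitted statistics, which is why it complements, rather than replaces, static analysis of the code that produced them.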

Use in CI

CLI (exits non-zero when leakage is found, so it gates the build):

leaklint --train train.csv --test test.csv --target label
# optional: --groups customer_id --time signup_date --enable-near-dup

GitHub Actions:

- run: pip install leaklint
- run: leaklint --train data/train.csv --test data/test.csv --target label --sarif > leaklint.sarif

Machine-readable output for artifact storage / diff-on-PR:

report.to_json()      # {"clean": bool, "findings": [...]}
report.to_sarif()     # SARIF 2.1.0 (GitHub code-scanning compatible)

leaklint --train t.csv --test e.csv --target y --json     # or --sarif
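For a diff-on-PR job that reads a stored report, a small consumer might look like the sketch below. It relies only on the two top-level keys shown above ("clean" and "findings"); the structure of each finding is not documented here, so findings are treated as opaque. The gate function is hypothetical.

```python
import json


def gate(report_json: str) -> int:
    """Return a process exit code from a stored JSON report."""
    report = json.loads(report_json)
    findings = report.get("findings", [])
    if report.get("clean", False):
        return 0
    # Only the count is reported, since the per-finding schema is unknown here.
    print(f"{len(findings)} finding(s) reported")
    return 1


clean_code = gate('{"clean": true, "findings": []}')
dirty_code = gate('{"clean": false, "findings": [{}, {}]}')
```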

Per-check severity (not everyone's "group leakage" is fatal) via leaklint.yaml or a dict:

# leaklint.yaml   (auto-discovered, or pass --config / config=)
severity:
  group_leakage: low        # downgrade
  within_train_duplicates: ignore   # mute
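The same overrides expressed as a dict, with a sketch of how downgrade and mute could be resolved. Only the "low" and "ignore" levels and the two check names in the YAML come from the example above; the default levels, the exact_duplicates check name, and the apply_severity helper are hypothetical.

```python
def apply_severity(findings, overrides):
    """Apply per-check overrides; drop checks set to 'ignore'.

    findings: list of (check_name, default_severity) tuples.
    """
    resolved = []
    for check, default in findings:
        level = overrides.get(check, default)
        if level == "ignore":
            continue  # muted checks never reach the report
        resolved.append((check, level))
    return resolved


config = {
    "severity": {
        "group_leakage": "low",               # downgrade
        "within_train_duplicates": "ignore",  # mute
    }
}

# Hypothetical raw findings with made-up default severities.
raw = [
    ("group_leakage", "high"),
    ("within_train_duplicates", "medium"),
    ("exact_duplicates", "high"),
]
resolved = apply_severity(raw, config["severity"])
```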

Or as a local pre-commit hook (runs the check before each commit):

# .pre-commit-config.yaml
- repo: local
  hooks:
    - id: leaklint
      name: leaklint
      entry: leaklint --train data/train.csv --test data/test.csv --target label
      language: system
      pass_filenames: false

Threshold validation

Measured by scripts/validate_thresholds.py on 13 public OpenML datasets:

Near-duplicate eps (default = data-relative 0.1 × median nearest-neighbour distance): recall 1.00 on injected near-duplicates across all 13 datasets; clean-data false-positive rate mean 0.028, max 0.163 (spambase). Because that FP-rate is dataset-dependent, near-dup ships opt-in (enable_near_dup=True).
Identifier near-unique threshold (≥ 95% distinct): 0 / 251 columns flagged across the (id-free) benchmark datasets → empirical FPR ≈ 0.0000, i.e. no valid ordinal column was flagged in this benchmark.
Leaky-feature AUC threshold is configurable (leaky_auc_threshold, default 0.99). AUC is rank-based and threshold-free, so it is robust to class imbalance for ranking; under extreme imbalance its variance grows, so treat near-threshold hits as suspicions, not proof.
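To make two of the thresholds above concrete, here is a minimal standard-library sketch of a single-feature AUC and a distinct-value ratio. It is illustrative only and is not how leaklint computes them. The function names are assumptions, and the pairwise AUC is quadratic where a real implementation would use ranks.

```python
def single_feature_auc(values, labels):
    """AUC of one feature used directly as a score for binary labels.

    Counts, over all (positive, negative) pairs, how often the positive
    has the larger value; ties count one half.
    """
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def looks_leaky(values, labels, threshold=0.99):
    """Flag a feature that separates the classes in either direction."""
    auc = single_feature_auc(values, labels)
    return max(auc, 1.0 - auc) >= threshold


def distinct_ratio(column):
    """Share of distinct values; near 1.0 suggests an id-like column."""
    return len(set(column)) / len(column)


auc_mixed = single_feature_auc([1, 2, 3, 4], [0, 1, 0, 1])
proxy = looks_leaky([0.9, 0.8, 0.2, 0.1], [0, 0, 1, 1])  # perfectly inverted
id_like = distinct_ratio(list(range(100))) >= 0.95
```

Taking max(auc, 1 - auc) means an inversely related proxy is flagged as readily as a direct one.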

Scale

Exact-duplicate detection is hash-based and chunked (chunk_size=) — O(n) memory.
Near-duplicate uses tree-based nearest-neighbour search with a sampling fallback (max_rows=, default 20k; sampling lowers recall but never invents matches) and is deterministic via random_state. Optional progress=True shows a tqdm bar.
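The data-relative eps can be sketched with a brute-force search. This is a toy under stated assumptions, not leaklint's tree-based code: it assumes the median nearest-neighbour distance is measured on the training rows, it skips zero-distance pairs on the assumption that exact duplicates are reported separately, and all function names are hypothetical.

```python
import math
from statistics import median


def nearest_distances(rows):
    """Distance from each row to its nearest other row (brute force, O(n^2))."""
    return [
        min(math.dist(a, b) for j, b in enumerate(rows) if j != i)
        for i, a in enumerate(rows)
    ]


def data_relative_eps(train_rows, factor=0.1):
    """eps = factor x median nearest-neighbour distance."""
    return factor * median(nearest_distances(train_rows))


def near_duplicates(train_rows, test_rows, eps):
    """(test_idx, train_idx) pairs within eps of each other, excluding exact matches."""
    hits = []
    for i, t in enumerate(test_rows):
        for j, r in enumerate(train_rows):
            d = math.dist(t, r)
            if 0.0 < d <= eps:
                hits.append((i, j))
    return hits


# Evenly spaced toy points: every nearest-neighbour distance is 1.0.
train_rows = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
test_rows = [(1.05, 0.0), (10.0, 10.0)]

eps = data_relative_eps(train_rows)
pairs = near_duplicates(train_rows, test_rows, eps)
```

Because eps scales with the data's own spacing, the same default factor can be applied to features on very different scales, though the false-positive rates reported above show it is still dataset-dependent.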

Honest limitations

Some leakage is context-dependent (e.g., "is this feature known at prediction time?"). leaklint flags mechanically detectable leakage and suspicious signals; it does not claim to certify a dataset leak-free.
The near-duplicate threshold is a documented heuristic (near_dup_eps=), tunable.
Complementary to static-analysis tools (they catch code-level mistakes like scaler.fit(X) before the split that a data-only view can miss).

Roadmap (not yet implemented)

scipy.sparse input support for high-dimensional text/embedding data.
LSH (e.g. datasketch) for near-dup beyond the sampling cap; today large data is handled by the sampling fallback + tree search rather than true sub-linear LSH.
True out-of-core streaming from disk (current chunking still expects an in-memory frame).

This version

0.1.0

Jun 23, 2026
