Detect data leakage in ML datasets and pipelines at runtime — focused, framework-agnostic, CI-friendly.

leaklint

Runtime data-leakage detection for ML datasets and pipelines. Point it at your actual data and splits — leaklint tells you, ranked and in plain language, where leakage is hiding.

Data leakage (information from outside the training data sneaking in) is a common reason a model looks great offline and fails in production. A survey of the literature found it affecting 294 papers across 17 fields (Kapoor & Narayanan, Patterns 2023).

Why another tool?

Approach                   | Examples                         | Limitation
IDE / static code analysis | LeakageDetector (PyCharm/VSCode) | reads your code, not your data; not an importable library
broad validation suites    | deepchecks                       | leakage is a small, shallow part of a heavy suite
narrow black-box trick     | leak-detect                      | dormant since 2020; instruments one function

leaklint is focused, framework-agnostic (numpy / pandas / any sklearn-style estimator), zero-config, and CI-friendly — it inspects the data + splits directly.

Install

pip install leaklint    # released package
pip install -e .        # editable install from a clone of this repo

Usage

from leaklint import detect_leakage

report = detect_leakage(
    X_train, y_train, X_test, y_test,
    groups_train=cust_ids_tr, groups_test=cust_ids_te,  # optional
    time_train=dates_tr,      time_test=dates_te,        # optional
)
report.summary()          # ranked, human-readable findings
if not report.clean:      # gate your CI
    raise SystemExit(1)

What it detects

  • Exact train/test duplicate rows (overlap contamination) — NaN- and float-safe hashing
  • Near-duplicate rows across splits (jittered copies / augmentation overlap) — opt-in (enable_near_dup=True); distance-based, can false-positive on discrete/low-cardinality data
  • Group leakage — same entity/group in both train and test (partial overlap counts)
  • Temporal leakage — training rows dated at/after the test period (mixed-tz / sub-day safe)
  • Leaky / target-proxy features — a single feature that ~perfectly predicts the target, via AUC/correlation and mutual information (catches non-linear, non-monotonic proxies)
  • Target / mean-encoding — a feature whose values equal the per-group target mean
  • Identifier features — near-unique id-like columns that shouldn't be features
  • Within-train duplicates — inflate cross-validated scores
  • Preprocessing leakage — a scaler/imputer (transformer=) fit on train+test, plus encoders / feature-selectors (categories or selected-features fit on the full data)
  • Cross-validation fold leakage — pass cv_splits=[(train_idx, test_idx), ...] to catch index bleed / duplicate rows shared across folds
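As a rough illustration of the first item, the sketch below shows one way NaN- and float-safe row hashing could work. It is not leaklint's implementation: `row_digest` and `cross_split_duplicates` are hypothetical names, and the sketch assumes purely numeric rows.

```python
import hashlib
import math
import struct

# One fixed byte pattern stands in for every NaN, whatever its payload bits.
CANONICAL_NAN = struct.pack("<d", math.nan)


def row_digest(row):
    """Hash a numeric row so that rows with equal values give equal digests."""
    h = hashlib.blake2b(digest_size=16)
    for value in row:
        x = float(value)              # 1 and 1.0 hash the same
        if math.isnan(x):
            h.update(CANONICAL_NAN)
        else:
            if x == 0.0:
                x = 0.0               # fold -0.0 into +0.0
            h.update(struct.pack("<d", x))
    return h.digest()


def cross_split_duplicates(train_rows, test_rows):
    """Return indices of test rows whose digest also occurs in the training rows."""
    train_digests = {row_digest(r) for r in train_rows}
    return [i for i, r in enumerate(test_rows) if row_digest(r) in train_digests]
```

Hashing fixed-width IEEE-754 bytes rather than string representations avoids formatting differences, and the set of digests grows linearly with the number of training rows.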

Multi-output / multi-label targets, NaN values, all-constant columns, and integer-encoded categoricals are handled (pass categorical_features=[...] to exclude the latter from the near-duplicate distance). All stochastic steps are reproducible for a fixed random_state.
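The group and temporal checks in the list above can be approximated in a few lines of standard-library Python. This is a sketch under assumptions, not leaklint's code: the helper names are hypothetical, and naive timestamps are treated as UTC, which may differ from how leaklint resolves mixed time zones.

```python
from datetime import datetime, timedelta, timezone


def shared_groups(groups_train, groups_test):
    """Return the group ids that occur in both splits."""
    return set(groups_train) & set(groups_test)


def to_utc(ts):
    """Assumption: naive timestamps are UTC; aware ones are converted to UTC."""
    if ts.tzinfo is None:
        return ts.replace(tzinfo=timezone.utc)
    return ts.astimezone(timezone.utc)


def late_training_rows(time_train, time_test):
    """Indices of training rows dated at or after the earliest test timestamp."""
    test_start = min(to_utc(t) for t in time_test)
    return [i for i, t in enumerate(time_train) if to_utc(t) >= test_start]


# Example inputs: the second training row carries a +01:00 offset, so it is
# compared on the UTC timeline rather than by its wall-clock value.
plus_one = timezone(timedelta(hours=1))
example_train_times = [
    datetime(2024, 1, 1, 12, 0),
    datetime(2024, 1, 2, 0, 30, tzinfo=plus_one),
    datetime(2024, 1, 2, 0, 0),
]
example_test_times = [datetime(2024, 1, 2, 0, 0), datetime(2024, 1, 3, 0, 0)]
```

Comparing everything on one timeline is what makes sub-day and mixed-offset timestamps comparable; a date-only comparison would misclassify the second training row.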

scikit-learn pipelines

from leaklint import audit_pipeline
report = audit_pipeline(fitted_pipeline, X_train, y_train, X_test, y_test)

Pulls the transformer steps out of the pipeline and checks them for fit-on-full-data leakage, plus the usual data-level checks.
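One way to picture a fit-on-full-data check is to compare a statistic the transformer learned with the same statistic recomputed on train only and on train plus test. The sketch below does this for a single column mean; it is a simplified heuristic with a hypothetical function name, not a description of what audit_pipeline does internally. With a scikit-learn StandardScaler, the learned value would come from its `mean_` attribute.

```python
from statistics import fmean


def looks_fit_on_full_data(learned_mean, train_col, test_col, rel_tol=1e-9):
    """Heuristic: does a learned column mean match train+test rather than train?"""
    train_mean = fmean(train_col)
    full_mean = fmean(list(train_col) + list(test_col))
    scale = max(abs(train_mean), abs(full_mean), 1.0)
    matches_train = abs(learned_mean - train_mean) <= rel_tol * scale
    matches_full = abs(learned_mean - full_mean) <= rel_tol * scale
    # If both match (train and test share a mean) the two cases cannot be
    # told apart, so nothing is flagged.
    return matches_full and not matches_train
```

A single statistic can only raise suspicion; agreement with the train-only value does not prove the rest of the pipeline is leak-free.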

Use in CI

CLI (exits non-zero when leakage is found, so it gates the build):

leaklint --train train.csv --test test.csv --target label
# optional: --groups customer_id --time signup_date --enable-near-dup

GitHub Actions:

- run: pip install leaklint
- run: leaklint --train data/train.csv --test data/test.csv --target label --sarif > leaklint.sarif

Machine-readable output for artifact storage / diff-on-PR:

report.to_json()      # {"clean": bool, "findings": [...]}
report.to_sarif()     # SARIF 2.1.0 (GitHub code-scanning compatible)
leaklint --train t.csv --test e.csv --target y --json     # or --sarif
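A stored JSON report can also drive a later pipeline step. The sketch below relies only on the two top-level keys shown above ("clean" and "findings") and makes no assumption about the fields inside each finding; `exit_code_from_report` is a hypothetical helper, not part of leaklint.

```python
import json


def exit_code_from_report(report_json):
    """Map a leaklint-style JSON report string to a process exit code."""
    report = json.loads(report_json)
    clean = bool(report["clean"])
    findings = report.get("findings", [])
    print(f"{len(findings)} finding(s); clean={clean}")
    return 0 if clean else 1
```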

Per-check severity (not everyone's "group leakage" is fatal) via leaklint.yaml or a dict:

# leaklint.yaml   (auto-discovered, or pass --config / config=)
severity:
  group_leakage: low        # downgrade
  within_train_duplicates: ignore   # mute
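To make the effect of such overrides concrete, here is a small sketch of re-grading findings by check name. The finding layout (dicts with "check" and "severity" keys), the "high" and "medium" levels, and the `exact_duplicates` check name are assumptions made for illustration; only `group_leakage`, `within_train_duplicates`, `low`, and `ignore` come from the config above.

```python
def apply_severity(findings, overrides):
    """Re-grade findings by check name and drop those set to "ignore"."""
    kept = []
    for finding in findings:
        level = overrides.get(finding["check"], finding["severity"])
        if level == "ignore":
            continue                      # muted check: not reported at all
        kept.append({**finding, "severity": level})
    return kept


# Dict form of the YAML overrides shown above.
overrides = {"group_leakage": "low", "within_train_duplicates": "ignore"}

# Hypothetical findings, for illustration only.
findings = [
    {"check": "group_leakage", "severity": "high"},
    {"check": "within_train_duplicates", "severity": "medium"},
    {"check": "exact_duplicates", "severity": "high"},
]
```

Checks without an override keep their original severity, so a config only needs to list the exceptions.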

Or as a local pre-commit hook (runs the check before each commit):

# .pre-commit-config.yaml
- repo: local
  hooks:
    - id: leaklint
      name: leaklint
      entry: leaklint --train data/train.csv --test data/test.csv --target label
      language: system
      pass_filenames: false

Threshold validation

Measured by scripts/validate_thresholds.py on 13 public OpenML datasets:

  • Near-duplicate eps (default = data-relative 0.1 × median nearest-neighbour distance): recall 1.00 on injected near-duplicates across all 13 datasets; clean-data false-positive rate mean 0.028, max 0.163 (spambase). Because that FP-rate is dataset-dependent, near-dup ships opt-in (enable_near_dup=True).
  • Identifier near-unique threshold (≥ 95% distinct): 0 / 251 columns flagged across the (id-free) benchmark datasets, i.e. an empirical false-positive rate of 0 on this benchmark, where no valid ordinal column was flagged.
  • Leaky-feature AUC threshold is configurable (leaky_auc_threshold, default 0.99). AUC is rank-based and threshold-free, so it is robust to class imbalance for ranking; under extreme imbalance its variance grows, so treat near-threshold hits as suspicions, not proof.
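The leaky-feature idea can be sketched as scoring the target with one raw feature and computing a rank-based AUC. The code below is an illustration for a binary target, not leaklint's implementation: the function names are hypothetical, treating an AUC near 0 as equally suspicious as one near 1 is an assumption, and the pairwise loop is quadratic, so it suits small examples only.

```python
def feature_auc(values, labels):
    """AUC of one feature used as a score for a 0/1 target (ties count 0.5)."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def is_suspect(values, labels, threshold=0.99):
    """Flag a feature that separates the classes almost perfectly either way."""
    auc = feature_auc(values, labels)
    return max(auc, 1.0 - auc) >= threshold
```

Because the score depends only on the ordering of values, rescaling a feature does not change the result; with very few positives, a single row can move the AUC noticeably, which is why near-threshold hits deserve a second look.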

Scale

  • Exact-duplicate detection is hash-based and chunked (chunk_size=) — O(n) memory.
  • Near-duplicate uses tree-based nearest-neighbour search with a sampling fallback (max_rows=, default 20k; sampling lowers recall but never invents matches) and is deterministic via random_state. Optional progress=True shows a tqdm bar.
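The data-relative eps and the sampling cap can be illustrated with a brute-force sketch. It swaps the tree-based search for a plain quadratic scan, computes the median on the training rows (an assumption), and uses hypothetical function names; the `factor`, `max_rows`, and `random_state` defaults mirror the values quoted in this README.

```python
import math
import random
from statistics import median


def relative_eps(rows, factor=0.1, max_rows=20_000, random_state=0):
    """eps = factor x median nearest-neighbour distance, on at most max_rows rows."""
    rows = list(rows)
    if len(rows) < 2:
        raise ValueError("need at least two rows to measure neighbour distances")
    if len(rows) > max_rows:
        # Seeded sampling keeps the result reproducible across runs.
        rows = random.Random(random_state).sample(rows, max_rows)
    nearest = [
        min(math.dist(a, b) for j, b in enumerate(rows) if j != i)
        for i, a in enumerate(rows)
    ]
    return factor * median(nearest)


def near_duplicates(train_rows, test_rows, eps):
    """Pairs (train_idx, test_idx) within eps of each other but not identical."""
    return [
        (i, j)
        for i, a in enumerate(train_rows)
        for j, b in enumerate(test_rows)
        if 0.0 < math.dist(a, b) <= eps
    ]
```

Sampling only removes candidate rows, so any pair it reports is a real pair in the data; the cost is that some true near-duplicates go unseen.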

Honest limitations

  • Some leakage is context-dependent (e.g., "is this feature known at prediction time?"). leaklint flags mechanically detectable leakage and suspicious signals; it does not claim to certify a dataset leak-free.
  • The near-duplicate threshold is a documented heuristic (near_dup_eps=), tunable.
  • Complementary to static-analysis tools (they catch code-level mistakes like scaler.fit(X) before the split that a data-only view can miss).

Roadmap (not yet implemented)

  • scipy.sparse input support for high-dimensional text/embedding data.
  • LSH (e.g. datasketch) for near-dup beyond the sampling cap; today large data is handled by the sampling fallback + tree search rather than true sub-linear LSH.
  • True out-of-core streaming from disk (current chunking still expects an in-memory frame).
