Leakage checks for machine-learning pipelines using permutation tests.

Leakly: Leakage checks for any machine-learning pipeline

Leakly uses label permutation to test whether a pipeline still performs above chance when no true signal is present.

If it does, the pipeline may be leaking test-set information through preprocessing, feature selection, tuning, or another step.

Principle

  1. Permute labels to remove real signal.
  2. Run the full pipeline exactly as a user would run it.
  3. Compare the resulting score distribution with the chance level.
  4. Treat above-chance performance on permuted labels as a sign of possible leakage.
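
The four steps above can be sketched in plain Python, independent of Leakly's own API. This is a minimal illustration under stated assumptions: `permute_labels`, `holdout_accuracy`, and `permutation_scores` are hypothetical names, the data are synthetic, and the toy pipeline (a one-feature threshold classifier on a fixed hold-out split) only stands in for a real analysis.

```python
import random
import statistics


def permute_labels(y, seed):
    """Return a shuffled copy of y; class counts are preserved."""
    shuffled = list(y)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def holdout_accuracy(X, y):
    """Toy pipeline: fit a midpoint threshold on the first half, score on the second."""
    half = len(X) // 2
    train_X, train_y = X[:half], y[:half]
    test_X, test_y = X[half:], y[half:]
    pos = [x for x, label in zip(train_X, train_y) if label == 1]
    neg = [x for x, label in zip(train_X, train_y) if label == 0]
    if not pos or not neg:  # degenerate split: fall back to chance
        return 0.5
    mean_pos, mean_neg = statistics.mean(pos), statistics.mean(neg)
    threshold = (mean_pos + mean_neg) / 2
    sign = 1 if mean_pos >= mean_neg else -1
    preds = [1 if sign * (x - threshold) > 0 else 0 for x in test_X]
    return sum(p == t for p, t in zip(preds, test_y)) / len(test_y)


def permutation_scores(pipeline, X, y, n_permutations=100):
    """Steps 1-2: permute the labels, then run the whole pipeline on each permutation."""
    return [pipeline(X, permute_labels(y, seed)) for seed in range(n_permutations)]


# Synthetic data with no real signal (assumption for the demo).
rng = random.Random(0)
X = [rng.gauss(0, 1) for _ in range(200)]
y = [i % 2 for i in range(200)]

scores = permutation_scores(holdout_accuracy, X, y)
# Steps 3-4: a leak-free pipeline should average close to the 0.5 chance level.
print(f"mean permuted accuracy: {statistics.mean(scores):.3f}")
```

Because the threshold is fitted on the training half only, the mean permuted accuracy should sit near 0.5; a mean well above that would be the warning sign described in step 4.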

Leakly includes example configurations for a leaky pipeline and a non-leaky pipeline so users can see the effect immediately.

Figure: Example permutation AUC summary.

Install

pip install Leakly

For notebook environments that need the optional notebook dependencies:

pip install "Leakly[notebook]"

For the current GitHub checkout:

git clone https://github.com/DeMONLab-BioFINDER/Leakly.git
cd Leakly
pip install -e .

Quick Start

The example notebook, example.ipynb, can be opened in Google Colab.

Key Python snippet

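from leakly import (
    MLPipeline,
    SummaryPlotter,
    load_example_leakage_config,
    permute_label,
)

# `data` is assumed to be loaded already and to expose X, y, and covariates.
scores = []
for seed in range(100):
    # Shuffle the labels so that no real signal remains.
    permuted_y = permute_label(data.y, random_state=seed)
    # Replace MLPipeline with any pipeline that returns a test score.
    pipeline = MLPipeline(
        data.X,
        permuted_y,
        covariates=data.covariates,
        config=load_example_leakage_config(),
    )
    score = pipeline.fit().evaluate()
    scores.append(score)

# Plot the permuted-score distribution against the chance level (AUC = 0.5).
SummaryPlotter(scores, chance_level=0.5).plot("assets/AUC.png")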

FAQ

Can Leakly check my own pipeline?

Yes. Leakly can evaluate any pipeline that takes X, y, and optional covariates and returns a test score. The key is to run the full pipeline exactly as in the real analysis, including preprocessing, feature selection, tuning, and evaluation.
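
As an illustration of the interface described above, a user pipeline can be written as a single function that takes X, y, and optional covariates and returns a test score. This is only a sketch under assumptions: `my_pipeline`, `fit_standardizer`, and `standardize` are hypothetical names, covariates are simply appended as extra feature columns, labels are assumed to be binary (0/1), and how such a function is wired into Leakly's own classes is not covered here.

```python
import random


def fit_standardizer(rows):
    """Learn per-column mean and standard deviation from the given rows."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [
        (sum((v - m) ** 2 for v in c) / len(c)) ** 0.5 or 1.0  # guard constant columns
        for c, m in zip(cols, means)
    ]
    return means, stds


def standardize(rows, means, stds):
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def my_pipeline(X, y, covariates=None, test_fraction=0.3, seed=0):
    """Hypothetical pipeline: split first, fit every step on training rows, return accuracy."""
    if covariates is not None:
        # Assumption: covariates are simply appended as extra feature columns.
        X = [list(row) + list(cov) for row, cov in zip(X, covariates)]
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    n_test = max(1, int(len(idx) * test_fraction))
    test_idx, train_idx = idx[:n_test], idx[n_test:]

    # Preprocessing statistics come from the training rows only.
    means, stds = fit_standardizer([X[i] for i in train_idx])
    train = standardize([X[i] for i in train_idx], means, stds)
    test = standardize([X[i] for i in test_idx], means, stds)

    # Nearest-centroid classifier for binary labels (0/1), fitted on the training rows.
    centroids = {}
    for label in (0, 1):
        members = [row for row, i in zip(train, train_idx) if y[i] == label]
        centroids[label] = [sum(col) / len(col) for col in zip(*members)]

    def predict(row):
        return min(
            centroids,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(row, centroids[c])),
        )

    hits = sum(predict(row) == y[i] for row, i in zip(test, test_idx))
    return hits / n_test


# Synthetic demo data (assumption): 60 samples, 5 noise features, 1 covariate.
rng = random.Random(1)
X = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(60)]
y = [i % 2 for i in range(60)]
covariates = [[rng.gauss(0, 1)] for _ in range(60)]

score = my_pipeline(X, y, covariates=covariates)
print(f"test accuracy: {score:.2f}")
```

In a permutation loop, a call such as `my_pipeline(X, permuted_y, covariates=covariates)` would take the place of the example pipeline, so that the split, the scaling, and the classifier are all rerun for every set of permuted labels.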

Why can a leaky pipeline score well on permuted labels?

After label permutation, there should be no real biological, clinical, or statistical link between features and outcomes. A valid pipeline should therefore perform near chance.

A leaky pipeline may still score well if information from the full dataset enters the analysis before the train/test split or outside the cross-validation loop. Common sources include feature selection, scaling, imputation, covariate adjustment, dimensionality reduction, and hyperparameter tuning performed on all samples rather than on the training data alone.

This is especially problematic in high-dimensional data, such as neuroimaging, omics, or biomarker studies, where random label-specific patterns can appear meaningful by chance. If the test set influences preprocessing or feature selection, the model may "remember" these random patterns and show inflated performance.
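
The effect can be reproduced on pure noise. The sketch below, written in plain Python rather than with Leakly, compares feature selection performed on all samples (leaky) with feature selection repeated inside each cross-validation fold (clean). The function names, the nearest-centroid classifier, and the sizes (40 samples, 300 features, 10 selected features) are illustrative assumptions, not part of the package.

```python
import random
import statistics


def top_features(X, y, rows, k):
    """Rank features by absolute class-mean difference using only the given rows."""
    def gap(j):
        pos = [X[i][j] for i in rows if y[i] == 1]
        neg = [X[i][j] for i in rows if y[i] == 0]
        return abs(statistics.fmean(pos) - statistics.fmean(neg))
    return sorted(range(len(X[0])), key=gap, reverse=True)[:k]


def fold_accuracy(X, y, train, test, feats):
    """Nearest-centroid classifier restricted to the selected features."""
    cent = {}
    for c in (0, 1):
        members = [i for i in train if y[i] == c]
        cent[c] = [statistics.fmean(X[i][j] for i in members) for j in feats]
    hits = 0
    for i in test:
        dist = {c: sum((X[i][j] - m) ** 2 for j, m in zip(feats, cent[c])) for c in (0, 1)}
        hits += min(dist, key=dist.get) == y[i]
    return hits / len(test)


def cv_accuracy(X, y, leaky, k=10, n_folds=5):
    """Cross-validated accuracy with feature selection outside (leaky) or inside the folds."""
    all_rows = list(range(len(X)))
    # The leak: features are chosen once, using every sample including future test rows.
    global_feats = top_features(X, y, all_rows, k) if leaky else None
    scores = []
    for f in range(n_folds):
        test = all_rows[f::n_folds]
        train = [i for i in all_rows if i % n_folds != f]
        feats = global_feats if leaky else top_features(X, y, train, k)
        scores.append(fold_accuracy(X, y, train, test, feats))
    return statistics.fmean(scores)


# Pure-noise features: any apparent signal is an artifact of the procedure.
rng = random.Random(0)
n_samples, n_features = 40, 300
X = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_samples)]
y = [0, 1] * (n_samples // 2)

leaky_scores, clean_scores = [], []
for seed in range(10):
    y_perm = list(y)
    random.Random(seed).shuffle(y_perm)
    leaky_scores.append(cv_accuracy(X, y_perm, leaky=True))
    clean_scores.append(cv_accuracy(X, y_perm, leaky=False))

print(f"leaky mean accuracy: {statistics.fmean(leaky_scores):.2f}")
print(f"clean mean accuracy: {statistics.fmean(clean_scores):.2f}")
```

The only difference between the two runs is where `top_features` is called. In the leaky variant the selected features were chosen with the test rows' permuted labels in view, so the classifier can score well above chance even though the features are noise.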

How many permutations should I run?

Use 100 permutations for a quick check and 1,000 or more for publication-level evidence.
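
A rough way to see why more permutations give firmer evidence: the standard error of the mean permuted score shrinks with the square root of the number of permutations. The sketch below uses simulated scores, not output from Leakly, and `summarize` is a hypothetical helper; the assumed spread of 0.05 around a chance-level AUC of 0.5 is for illustration only.

```python
import math
import random
import statistics


def summarize(scores, chance_level=0.5):
    """Mean permuted score, its standard error, and distance from chance in SE units."""
    mean = statistics.fmean(scores)
    se = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, se, (mean - chance_level) / se


# Simulated stand-in for permuted AUCs from a leak-free pipeline (assumption).
rng = random.Random(0)
simulated = [rng.gauss(0.5, 0.05) for _ in range(1000)]

for n in (100, 1000):
    mean, se, z = summarize(simulated[:n])
    print(f"n={n}: mean={mean:.3f}, standard error={se:.4f}, z={z:.2f}")
```

With ten times as many permutations the standard error falls by roughly a factor of three, so smaller departures from chance become distinguishable from sampling noise.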

License

MIT. See LICENSE.
