Leakage checks for machine-learning pipelines using permutation tests.

Leakly: Leakage checks for any machine-learning pipeline

Leakly uses label permutation to test whether a pipeline still performs above chance when no true signal is present.

If it does, the pipeline may be leaking test-set information through preprocessing, feature selection, tuning, or another step.

Principle

  1. Permute labels to remove real signal.
  2. Run the full pipeline exactly as a user would run it.
  3. Compare the score distribution with chance level.
  4. Treat above-chance performance on permuted labels as a sign of possible leakage.
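
The steps above can be sketched in plain Python. This is a minimal illustration using only the standard library, not Leakly's implementation: `permutation_scores`, `toy_pipeline`, and the toy data are hypothetical names introduced here, and accuracy stands in for whatever score a real pipeline reports.

```python
import random
import statistics


def permutation_scores(pipeline_fn, X, y, n_permutations=100, seed=0):
    """Run pipeline_fn on shuffled labels and collect one score per run."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_permutations):
        permuted_y = list(y)
        rng.shuffle(permuted_y)  # step 1: break any feature-label link
        scores.append(pipeline_fn(X, permuted_y))  # step 2: run the full pipeline
    return scores


def toy_pipeline(X, y):
    """Hypothetical stand-in for a real pipeline (split, fit, score).

    Fits a nearest-class-mean rule on the first half of the samples and
    returns its accuracy on the second half.
    """
    half = len(X) // 2
    train = list(zip(X[:half], y[:half]))
    test = list(zip(X[half:], y[half:]))
    zeros = [x for x, label in train if label == 0]
    ones = [x for x, label in train if label == 1]
    if not zeros or not ones:
        only = 1 if ones else 0  # only one class was seen in training
        return sum(label == only for _, label in test) / len(test)
    mean0, mean1 = statistics.fmean(zeros), statistics.fmean(ones)
    # Predict the class whose training mean is closer to the test value.
    correct = sum(
        (abs(x - mean1) < abs(x - mean0)) == (label == 1) for x, label in test
    )
    return correct / len(test)


data_rng = random.Random(42)
X = [data_rng.gauss(0.0, 1.0) for _ in range(40)]  # one pure-noise feature
y = [0, 1] * 20  # balanced binary labels

scores = permutation_scores(toy_pipeline, X, y, n_permutations=200)
mean_score = statistics.fmean(scores)
# step 3: compare with chance (0.5 for balanced binary accuracy)
print(f"mean permuted accuracy: {mean_score:.3f} (chance = 0.5)")
```

Because `toy_pipeline` fits only on its training half, its permuted scores should centre on chance. Step 4 is the judgment call: a mean clearly above chance would point to information crossing from the test samples into training.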

Leakly includes example configurations for a leaky pipeline and a non-leaky pipeline so users can see the effect immediately.

Figure: example permutation AUC summary.

Install

pip install Leakly

For notebook environments that need the optional notebook dependencies:

pip install "Leakly[notebook]"

For the current GitHub checkout:

git clone https://github.com/DeMONLab-BioFINDER/Leakly.git
cd Leakly
pip install -e .

Quick Start

Open example.ipynb in Colab

Key Python snippet

from leakly import (
    MLPipeline,
    SummaryPlotter,
    load_example_leakage_config,
    permute_label,
)

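# `data` is not defined in this snippet; it is assumed to be an already-loaded
# dataset that exposes X, y, and covariates.
scores = []
for seed in range(100):
    # Shuffle the labels so that no real feature-label signal remains.
    permuted_y = permute_label(data.y, random_state=seed)
    # MLPipeline can be replaced with any pipeline that returns a test score.
    pipeline = MLPipeline(
        data.X,
        permuted_y,
        covariates=data.covariates,
        config=load_example_leakage_config(),
    )
    score = pipeline.fit().evaluate()
    scores.append(score)

# Plot the permuted scores against the chance level (0.5 for AUC).
SummaryPlotter(scores, chance_level=0.5).plot("assets/AUC.png")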

FAQ

Can Leakly check my own pipeline?

Yes. Leakly can evaluate any pipeline that takes X, y, and optional covariates and returns a test score. The key is to run the full pipeline exactly as in the real analysis, including preprocessing, feature selection, tuning, and evaluation.

Why can a leaky pipeline score well on permuted labels?

After label permutation, there should be no real biological, clinical, or statistical link between features and outcomes. A valid pipeline should therefore perform near chance.

A leaky pipeline may still score well if information from the full dataset enters the analysis before the train/test split or outside the cross-validation loop. Common sources include feature selection, scaling, imputation, covariate adjustment, dimensionality reduction, or hyperparameter tuning performed on all samples.

This is especially problematic in high-dimensional data, such as neuroimaging, omics, or biomarker studies, where random label-specific patterns can appear meaningful by chance. If the test set influences preprocessing or feature selection, the model may "remember" these random patterns and show inflated performance.
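
To make the mechanism concrete, here is a small self-contained sketch (standard library only, not Leakly code). It builds pure-noise features, permutes the labels, and compares two cross-validation variants of the same nearest-centroid classifier: one that selects features using all samples, and one that selects them inside each training fold. The function names, sample sizes, and the classifier itself are illustrative assumptions.

```python
import random
import statistics


def top_features(X, y, rows, k):
    """Rank features by |class-mean difference| using only the given rows."""
    ones = [i for i in rows if y[i] == 1]
    zeros = [i for i in rows if y[i] == 0]

    def gap(j):
        return abs(
            statistics.fmean(X[i][j] for i in ones)
            - statistics.fmean(X[i][j] for i in zeros)
        )

    return sorted(range(len(X[0])), key=gap, reverse=True)[:k]


def cv_accuracy(X, y, k, select_inside_fold, n_folds=5):
    """K-fold accuracy of a nearest-centroid classifier on the top-k features."""
    n = len(X)
    all_rows = list(range(n))
    # Leaky variant: features are chosen once, with test samples included.
    leaked = None if select_inside_fold else top_features(X, y, all_rows, k)
    correct = 0
    for fold in range(n_folds):
        test = all_rows[fold::n_folds]
        train = [i for i in all_rows if i % n_folds != fold]
        feats = top_features(X, y, train, k) if select_inside_fold else leaked
        # Class centroids are always computed from training rows only.
        centroids = {}
        for label in (0, 1):
            members = [i for i in train if y[i] == label]
            centroids[label] = [
                statistics.fmean(X[i][j] for i in members) for j in feats
            ]
        for i in test:
            dist = {
                label: sum((X[i][j] - c) ** 2 for j, c in zip(feats, centre))
                for label, centre in centroids.items()
            }
            correct += min(dist, key=dist.get) == y[i]
    return correct / n


rng = random.Random(0)
n_samples, n_features = 40, 500  # far more features than samples
X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]
y = [0, 1] * (n_samples // 2)

leaky, honest = [], []
for _ in range(10):
    rng.shuffle(y)  # permuted labels: no real signal remains
    leaky.append(cv_accuracy(X, y, k=10, select_inside_fold=False))
    honest.append(cv_accuracy(X, y, k=10, select_inside_fold=True))

leaky_mean = statistics.fmean(leaky)
honest_mean = statistics.fmean(honest)
print(f"selection on all samples:   {leaky_mean:.2f}")
print(f"selection inside each fold: {honest_mean:.2f}")
```

On noise like this, the variant that selects features on all samples would be expected to score well above 0.5, while the in-fold variant stays near chance. The only difference between the two is whether feature selection sees the test samples.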

How many permutations should I run?

Use 100 for a quick check. Use 1,000 or more for publication-level evidence.
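
One plausible reason more permutations give stronger evidence is that the standard error of the mean permuted score shrinks with the square root of the number of runs. The sketch below uses simulated scores, not real Leakly output. `summarize` is a hypothetical helper, and the normal-approximation interval is an assumption that suits roughly symmetric score distributions.

```python
import math
import random
import statistics


def summarize(scores, chance_level=0.5):
    """Mean permuted score, its standard error, and distance from chance."""
    mean = statistics.fmean(scores)
    sem = statistics.stdev(scores) / math.sqrt(len(scores))
    return {
        "mean": mean,
        "sem": sem,
        "ci95": (mean - 1.96 * sem, mean + 1.96 * sem),
        "z_vs_chance": (mean - chance_level) / sem,
    }


rng = random.Random(1)
# Simulated scores for a slightly leaky pipeline (true mean 0.52, not real data).
simulated = [rng.gauss(0.52, 0.06) for _ in range(1000)]

quick = summarize(simulated[:100])
thorough = summarize(simulated)
print(f"100 runs:  mean={quick['mean']:.3f}  sem={quick['sem']:.4f}")
print(f"1000 runs: mean={thorough['mean']:.3f}  sem={thorough['sem']:.4f}")
```

With ten times as many runs, the standard error drops by roughly a factor of three, so a small but consistent offset from chance becomes easier to distinguish from sampling noise.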

License

MIT. See LICENSE.
