A registry-based, multi-GPU framework for reproducible image-unlearning evaluation.

⚡ SUPREME - A Multi-GPU Framework for Reproducible Image Unlearning Method Evaluation

🔬 Tech Stack
Core: Python 3.9, PyTorch, Lightning Fabric, HuggingFace Transformers
Accelerators: CUDA 12.1, MPS, TPU via PyTorch XLA
Distributed & precision: DeepSpeed, bitsandbytes, NVIDIA TransformerEngine

🛠️ Tooling
Experiment tracking: Weights & Biases, TensorBoard
Environment: Docker, Dev Containers
Debug & profile: debugpy, Scalene Profiler
Code quality: Ruff, pre-commit

📄 Publication
arXiv Preprint | Under double-blind review at the WIPE-OUT 2 Workshop, ECML-PKDD 2026 | Project Page

📦 Repository
CI (lint, build, tests) | PyPI | TestPyPI | MIT License


📖 Overview

SUPREME is an open-source framework for evaluating machine unlearning methods on image classification tasks at scale.

Machine unlearning removes the influence of a chosen subset of training data (a class, a sub-class, or a random sample) from an already-trained model, without retraining from scratch. A good unlearned model should behave as if it had never seen the forgotten data while still classifying everything else accurately. Comparing the many proposed methods fairly demands a standardised, repeatable harness, and SUPREME is that harness.
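To make the forget/retain terminology concrete, here is a minimal sketch of how a training set might be partitioned for the full-class and random-sample scenarios. The function names, the `forget_frac` parameter (treated here as a fraction, not a percentage), and the toy labels are illustrative assumptions, not SUPREME's own code.

```python
import random
from typing import List, Sequence, Tuple


def split_full_class(labels: Sequence[int], forget_class: int) -> Tuple[List[int], List[int]]:
    """Forget every sample of one class; retain all other samples."""
    forget = [i for i, y in enumerate(labels) if y == forget_class]
    retain = [i for i, y in enumerate(labels) if y != forget_class]
    return forget, retain


def split_random(n_samples: int, forget_frac: float, seed: int) -> Tuple[List[int], List[int]]:
    """Forget a uniformly random fraction of the training indices."""
    rng = random.Random(seed)  # seeded so the split is repeatable
    n_forget = max(1, round(n_samples * forget_frac))
    forget = sorted(rng.sample(range(n_samples), n_forget))
    forget_set = set(forget)
    retain = [i for i in range(n_samples) if i not in forget_set]
    return forget, retain


labels = [0, 1, 2, 0, 1, 2, 0, 1, 2, 0]  # toy label vector
forget, retain = split_full_class(labels, forget_class=0)
print(forget)  # [0, 3, 6, 9]
print(retain)  # [1, 2, 4, 5, 7, 8]
```

The unlearning method only ever sees the trained model plus these two index sets; the retrained reference is a fresh model trained on the retain set alone.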

The gap it fills. Existing image-classification unlearning frameworks - MUBox, DeepUnlearn, and ERASURE - run on a single device, which caps how many methods, scenarios, and seeds can be evaluated in reasonable time. SUPREME distributes the entire train → unlearn → evaluate pipeline across multiple GPUs and nodes, removing that bottleneck. It does for image-classification unlearning what Open-Unlearning did for LLM unlearning in the text domain: turn a single-device research problem into a scalable, reproducible benchmark. To our knowledge it is the first multi-GPU framework for the field.

What it offers out of the box:

  • A complete, automated pipeline. Train a baseline on the full dataset, unlearn the chosen subset with the selected method, then evaluate the result against a from-scratch retrained reference, all from one command. Re-runs detect and skip work that is already done.
  • A broad component library. 5 datasets, 2 model architectures, 2 baselines, 9 unlearning methods, 9 evaluation metrics (covering forgetting, utility, privacy, behavioural/parametric equivalence, and efficiency), and 3 unlearning scenarios (full-class, subclass, random-sample), all selectable through command-line flags.
  • Distributed, multi-precision execution. Built on PyTorch and Lightning Fabric. DDP, FSDP, and DeepSpeed ZeRO 1/2/3 apply to all three stages, with mixed precision (fp16 / bf16, FP8, 4-/8-bit) and CUDA / Apple Silicon (MPS) / TPU / CPU back-ends. SLURM helpers fan experiments out across a cluster.
  • Statistically honest evaluation. A single random seed misrepresents how an unlearning method really behaves, because randomness enters at three independent points: training (weight initialisation and data shuffling produce different base models), unlearning (the unlearning algorithm itself is stochastic), and evaluation (sampling and metric computation add their own noise). SUPREME varies the seed at each of these three stages separately, so you can see how much of the spread in a result comes from the base model, from the unlearning run, and from measurement, and report the full distribution rather than a single point estimate. The seed count at each stage is configurable per run.
  • Extensibility without forking. It is pip-installable (pip install supreme-unlearning) and registry-based: add a dataset, model, method, or metric from your own package by implementing a small interface and registering its module path, with no edits to framework code (see docs/extending.md).
  • Efficient reuse. Experiments that share a training configuration train the model once and reuse it, guarded by a file lock so parallel SLURM jobs and concurrent local runs stay consistent.
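The three-stage seed design above lends itself to a simple descriptive breakdown of where the spread in a metric comes from. The sketch below assumes results are stored as a nested list `scores[i][j][k]` (training seed i, unlearning seed j, evaluation seed k) and uses population variances; the function name and the data layout are assumptions for illustration, not the framework's reporting code.

```python
from statistics import mean, pvariance
from typing import Dict, List


def spread_by_stage(scores: List[List[List[float]]]) -> Dict[str, float]:
    """Split the variance of scores[i][j][k] into per-stage components."""
    # Evaluation: variance across k, averaged over every (i, j) pair.
    eval_var = mean(pvariance(ks) for js in scores for ks in js)
    # Unlearning: variance across j of the k-averaged score, averaged over i.
    unlearn_var = mean(pvariance([mean(ks) for ks in js]) for js in scores)
    # Training: variance across i of the (j, k)-averaged score.
    train_var = pvariance([mean(mean(ks) for ks in js) for js in scores])
    return {"training": train_var, "unlearning": unlearn_var, "evaluation": eval_var}


# Toy grid: 2 training seeds x 2 unlearning seeds x 2 evaluation seeds.
scores = [
    [[0.90, 0.92], [0.88, 0.90]],
    [[0.80, 0.82], [0.78, 0.80]],
]
for stage, var in spread_by_stage(scores).items():
    print(f"{stage}: {var:.6f}")
```

For a balanced grid like this one, the three components add up to the overall population variance of all scores (the law of total variance), which makes the breakdown easy to sanity-check.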

SUPREME evolved from the codebases of Selective Synaptic Dampening (SSD) and bad-teaching unlearning, generalising them from single-method, single-device scripts into a standardised, distributed evaluation platform.

For the formal pipeline algorithm and mathematical notation (seed formulas, set definitions, operation signatures), see src/supreme/README.md and docs/notation.md.


📦 SUPREME as a Library

SUPREME is a pip-installable Python library (import supreme), not just a set of scripts. Install it, register your own components, and drive the full train → unlearn → evaluate pipeline from Python, with no edits to the framework:

pip install supreme-unlearning

import supreme

# Run the built-in pipeline programmatically
supreme.run_training(["-net", "ViT", "-dataset", "Cifar10", "-seed", "260"])
supreme.run_unlearning(["-method", "ssd", "-net", "ViT", "-dataset", "Cifar10"])

# Plug in code you wrote yourself, living in your own package.
# Replace "your_package.your_method" with your real import path.
supreme.register_unlearning_method("mymethod", "your_package.your_method")
supreme.run_unlearning(["-method", "mymethod", "-net", "ViT", "-dataset", "Cifar10"])

Public API: supreme.run_training, supreme.run_unlearning, supreme.register_model, supreme.register_baseline, supreme.register_unlearning_method, supreme.register_metric, supreme.register_dataset, and supreme.project_config. Everything under supreme.utils.* is internal. The API is defined in src/supreme/__init__.py; resolution and plugin entry points live in src/supreme/registry.py. Full walkthrough: docs/extending.md and the notebook notebooks/custom_components.ipynb.
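As a rough mental model of a name-to-module-path registry, the sketch below stores the path at registration time and imports the module only when the name is looked up. It is a generic illustration: `register_method`, `resolve_method`, and the `_METHODS` dictionary are hypothetical names, and the actual logic in src/supreme/registry.py may differ.

```python
import importlib
from types import ModuleType
from typing import Dict

_METHODS: Dict[str, str] = {}  # name -> dotted module path (hypothetical store)


def register_method(name: str, module_path: str) -> None:
    """Record where a method lives; nothing is imported yet."""
    if name in _METHODS and _METHODS[name] != module_path:
        raise ValueError(f"{name!r} is already registered as {_METHODS[name]!r}")
    _METHODS[name] = module_path


def resolve_method(name: str) -> ModuleType:
    """Import the registered module on first use."""
    try:
        module_path = _METHODS[name]
    except KeyError:
        raise KeyError(f"unknown method {name!r}; registered: {sorted(_METHODS)}") from None
    return importlib.import_module(module_path)


# A standard-library module stands in for "your_package.your_method".
register_method("demo", "json")
print(resolve_method("demo").__name__)  # json
```

Deferring the import keeps registration cheap and means a broken plugin only fails when it is actually selected.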

Where the code lives

| Path | What's there |
| --- | --- |
| src/supreme/__init__.py | Public API surface (run_*, register_*) |
| src/supreme/registry.py | Name → component resolution and plugin entry points |
| src/supreme/methods/unlearning_methods/ | Unlearning method implementations |
| src/supreme/methods/baselines/ | Retrain / Original baselines |
| src/supreme/models/ | ResNet18, ViT |
| src/supreme/datasets/datasets.py | The 5 datasets |
| src/supreme/eval_metrics/ | The 9 evaluation metrics |
| src/supreme/utils/training/train_main.py | Training-stage entry point (supreme-train) |
| src/supreme/utils/unlearning/unlearn_main.py | Unlearn/evaluate entry point (supreme-unlearn) |
| src/supreme/utils/fabric/ | Lightning Fabric setup (accelerators, precision, distributed strategies) |

🗃️ Available Components

Registry-based components are user-extensible - implement the relevant interface and register the module path, either in-tree or from your own package (runtime API or packaging entry points, no edits to SUPREME). See docs/extending.md. The components provided via Lightning Fabric cover the supported hardware and execution configurations.

Registry-based (user-extensible)

| Component | Available implementations |
| --- | --- |
| Datasets | CIFAR-10, CIFAR-20, CIFAR-100, PinsFaceRecognition, Caltech-101 |
| Models | ResNet18, Vision Transformer (ViT) |
| Baselines | Retrain, Original |
| Unlearning methods | Fine-Tuning (FT), Bad Teacher (BadT), Random Labels (RL), UNSIR, SSD, LFSSD, ASSD, SCRUB, JIT |
| Evaluation metrics | Accuracy, Loss/Error, ZRF, Activation Distance, JS-Divergence, Layer-wise Distance, Membership Inference Attack, Completeness, Resource Consumption, Time |
| Unlearning scenarios | Full-class, Subclass, Random sample |
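Several of the listed metrics compare the unlearned model's outputs with those of the retrained reference. As one example, the sketch below computes the textbook Jensen-Shannon divergence between two predicted class distributions. It illustrates the general definition only; how SUPREME's JS-Divergence metric batches, averages, or normalises its inputs is not shown here and should be checked in src/supreme/eval_metrics/.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence: symmetric, bounded by ln 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture distribution
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


unlearned = [0.7, 0.2, 0.1]  # toy softmax output of the unlearned model
retrained = [0.6, 0.3, 0.1]  # toy softmax output of the retrained model
print(js_divergence(unlearned, retrained))
print(js_divergence(unlearned, unlearned))  # 0.0
```

A value near zero means the two models assign almost the same probabilities to this input, which is the behaviour a successful unlearning method aims for.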

Provided via Lightning Fabric

| Component | Available implementations |
| --- | --- |
| Accelerators | CPU, CUDA, MPS, TPU |
| Precision modes | 64-true, 32-true, 16-mixed, bf16-mixed, 16-true, bf16-true, transformer-engine, transformer-engine-float16 (FP8), nf4, nf4-dq, fp4, fp4-dq, int8, int8-training |
| Distributed strategies | DDP, FSDP, DeepSpeed (ZeRO Stage 1/2/3) |
| Loggers | Weights & Biases, TensorBoard, CSV |

⚡ Quickstart

# 1. Clone
git clone https://github.com/pedroandreou/supreme-unlearning.git
cd supreme-unlearning

# 2. Set up environment - the Makefile is the entry point for local dev: it creates
#    the venv (named `unlearning` by default; override with VENV=<name>), installs the
#    pinned deps + SUPREME (editable), and enables the git hook. (Prompts if it
#    already exists; pass ON_EXISTING=reuse to skip.)
make cuda                  # NVIDIA GPU (Linux / WSL2).  Apple Silicon / CPU: `make mps`
source unlearning/bin/activate

# 3. Configure W&B + HF tokens
cp .env.example .env
# edit .env with your WANDB_KEY, WANDB_USERNAME and HUGGING_FACE_HUB_TOKEN

# 4. Smoke test - one seed, one method, one dataset
bash src/supreme/run_local.sh \
  --gpu 0 --models ViT --training-seeds 260 \
  --methods retrain,finetune,ssd \
  --strategies random_ --datasets Cifar10 \
  --forget-percs 0.01

Full environment setup (Docker Dev Container, MPS prerequisites, etc.) is documented in docs/environment_setup.md. The Docker image is NVIDIA-only (Linux / WSL2); macOS users follow the virtual-env path above.


🧪 Running Experiments

The pipeline runs train → unlearn → evaluate automatically. Re-running is safe: per-stage outputs (training checkpoints, unlearning checkpoints, already-logged W&B results) are detected and skipped.
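The skip-if-done behaviour, combined with the file lock that guards shared training runs, can be pictured roughly as follows. This is a standard-library sketch under stated assumptions: `run_once`, the `.lock` / `.tmp` suffixes, and the polling loop are hypothetical, and SUPREME's own detection and locking code may work differently.

```python
import os
import tempfile
import time
from pathlib import Path
from typing import Callable


def run_once(output: Path, produce: Callable[[Path], None], poll_s: float = 0.1) -> bool:
    """Run `produce` unless `output` already exists. True if this call did the work."""
    lock = output.with_suffix(output.suffix + ".lock")
    while not output.exists():
        try:
            # O_EXCL makes creation fail if another process holds the lock.
            fd = os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            time.sleep(poll_s)  # someone else is producing the output; wait
            continue
        try:
            os.close(fd)
            if output.exists():  # finished by another process just before we locked
                return False
            tmp = output.with_suffix(output.suffix + ".tmp")
            produce(tmp)
            os.replace(tmp, output)  # publish atomically so readers never see a partial file
            return True
        finally:
            os.unlink(lock)
    return False


with tempfile.TemporaryDirectory() as workdir:
    checkpoint = Path(workdir) / "model.ckpt"
    calls = []

    def fake_train(path: Path) -> None:
        calls.append(1)
        path.write_text("weights")

    first = run_once(checkpoint, fake_train)
    second = run_once(checkpoint, fake_train)
    print(first, second, len(calls))  # True False 1
```

A crashed holder would leave a stale lock file behind in this sketch, and lock semantics vary between cluster filesystems, so a production version needs a timeout or a more robust locking primitive.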

Local (workstation, GPU server, interactive cluster node)

# All 10 seeds, all methods, all datasets - defaults
bash src/supreme/run_local.sh --gpu 0

# Filter the sweep
bash src/supreme/run_local.sh \
  --gpu 0,1 \
  --models ViT \
  --training-seeds 260,261,262 \
  --methods retrain,finetune,bad_teacher,ssd \
  --strategies fullclass,random_ \
  --datasets PinsFaceRecognition

| Flag | Description | Default |
| --- | --- | --- |
| --gpu | GPU ID(s): 0 for a single GPU, 0,1,2,3 for multi-GPU | 0 |
| --models | ResNet18, ViT | both |
| --training-seeds | Comma-separated training seeds (outer loop, I) | 260–269 |
| --unlearning-seeds | Space-separated indices for J (e.g. "0 1 2" for J=3) | "0" (matched) |
| --evaluation-seeds | Space-separated indices for K | "0" (matched) |
| --methods | Unlearning methods to run | all 11 (2 baselines + 9 methods) |
| --strategies | fullclass, subclass, random_ | all |
| --datasets | Datasets to use | all 5 |
| --forget-percs | Forget % for random_ strategy | 0.001–0.10 |
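The three seed flags describe a nested grid: each training seed is combined with each unlearning index and each evaluation index. The sketch below parses the two argument formats (comma-separated training seeds, space-separated indices) and enumerates the resulting runs. The helper names are hypothetical, and the mapping from an index to a concrete seed value (defined in docs/notation.md) is deliberately not modelled.

```python
from itertools import product
from typing import List, Tuple


def parse_training_seeds(arg: str) -> List[int]:
    """Comma-separated seeds, e.g. "260,261,262"."""
    return [int(s) for s in arg.split(",") if s.strip()]


def parse_indices(arg: str) -> List[int]:
    """Space-separated indices, e.g. "0 1 2"."""
    return [int(s) for s in arg.split()]


def seed_grid(training: str, unlearning: str = "0", evaluation: str = "0") -> List[Tuple[int, int, int]]:
    """Every (training seed, unlearning index, evaluation index) combination."""
    return list(product(parse_training_seeds(training),
                        parse_indices(unlearning),
                        parse_indices(evaluation)))


grid = seed_grid("260,261,262", unlearning="0 1 2", evaluation="0 1")
print(len(grid))  # 18 (3 training seeds x 3 unlearning indices x 2 evaluation indices)
print(grid[0])    # (260, 0, 0)
```

With the default "0" for both index flags, the grid collapses to one run per training seed.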

SLURM (HPC, login node)

# Preview the grid (no submission)
./src/supreme/run_slurm.sh --dry-run

# Submit all experiments, max 12 concurrent jobs
./src/supreme/run_slurm.sh --max-concurrent 12

# Subset
./src/supreme/run_slurm.sh \
  --datasets Cifar10,Cifar20 \
  --models ViT \
  --training-seeds 260,261,262

# Multi-GPU DDP per job
./src/supreme/run_slurm.sh --gpus 4

Each submitted job runs one (seed, dataset, model) cell independently; cells run in parallel across the cluster. Distributed-strategy selection (DDP / FSDP / DeepSpeed) is documented in docs/implementation_notes.md → Distributed Strategies.


🔁 Reproducing the paper

Reproducing the paper's numbers is a two-step process: run the experiment grid on Pins Face Recognition (both architectures, both scenarios, all 10 seeds) and then render the three paper LaTeX tables from the W&B-logged results using src/supreme/utils/wandb_utils/results_analysis/pins_paper_tables.ipynb. The exact command, the table-rendering workflow, and the troubleshooting notes are documented in docs/reproducing_the_paper.md. For a runnable, step-by-step walkthrough (install → smoke test → full grid → tables → extending), see the notebook notebooks/reproduce_experiments.ipynb.


➕ Extending SUPREME

SUPREME is reusable as a library (see SUPREME as a Library for installation and the public API). You register your own components from your own package with no edits to framework code, either at runtime via supreme.register_* or, for an installed plugin package, via packaging entry points (supreme.models, supreme.unlearning_methods, supreme.metrics, supreme.datasets, supreme.plugins).
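For the entry-point route, an installed plugin advertises its components under one of the group names listed above, and the host package reads them through the standard `importlib.metadata` API. The sketch below shows only that generic discovery step. The `discover` helper is hypothetical, and the exact form SUPREME expects for each entry-point value is an assumption to check against docs/extending.md.

```python
from importlib.metadata import entry_points
from typing import Dict


def discover(group: str) -> Dict[str, str]:
    """Map entry-point names to their target strings for one group."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+: selectable entry-point collection.
        selected = eps.select(group=group)
    else:
        # Python 3.9: plain dict keyed by group name.
        selected = eps.get(group, [])
    return {ep.name: ep.value for ep in selected}


# Prints an empty dict unless a plugin declaring this group is installed.
print(discover("supreme.unlearning_methods"))
```

Calling `ep.load()` on an entry point would import its target, so a host can list available plugins cheaply and import only the one that was requested.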

A runnable, end-to-end walkthrough - pip install supreme-unlearning, then register your own method/metric/model/dataset from your own project - is in the notebook notebooks/custom_components.ipynb.

Adding a dataset, model, method, or metric follows a consistent register-and-implement pattern. Walkthroughs and Fabric-integration rules live in docs/extending.md:

| What to add | Walkthrough |
| --- | --- |
| New dataset | docs/extending.md → Adding a new dataset |
| New model | docs/extending.md → Adding a new model |
| New unlearning method | docs/extending.md → Adding a new unlearning method |
| New evaluation metric | docs/extending.md → Adding a new evaluation metric |

🤝 Contributing

Contributions are welcome - bug reports, new components, and documentation alike.

CI (.github/workflows/ci.yml) lints, format-checks, and validates the package build on every push and PR. A version tag like v0.1.0 triggers .github/workflows/publish.yml to build and publish the release to PyPI (a manual run targets TestPyPI as a dry-run). The CUDA images are published to GHCR manually via .github/workflows/docker.yml (runtime image) and .github/workflows/devcontainer.yml (prebuilt dev container). Notable changes per release are tracked in CHANGELOG.md.


📚 Documentation

| Document | Covers |
| --- | --- |
| docs/contributing.md | How to report issues, add components, and open a pull request |
| CHANGELOG.md | Notable changes per release (Keep a Changelog / SemVer) |
| community/ | Community-contributed methods, templates, and the results leaderboard |
| docs/notation.md | Symbol glossary: seeds, datasets, models, indices, counts |
| src/supreme/README.md | Formal algorithm specification (matched and decoupled protocols) |
| docs/environment_setup.md | Virtual-env and Docker Dev Container setup, .env template, prerequisites |
| docs/reproducing_the_paper.md | Single command for the paper's experiment grid plus the W&B-export-to-LaTeX-tables workflow |
| docs/script_arguments.md | Full argument reference for train_main.py and unlearn_main.py |
| docs/extending.md | How to add new datasets, models, methods, and metrics |
| docs/tooling.md | Debugger, profiler, Fabric callbacks, process tracker, split export, W&B exporter |
| docs/wandb_integration.md | W&B runtime behaviour: rank-0 logging, offline mode, sync workflow, metric synchronisation |
| docs/wandb_fields.md | Paper-to-W&B metric mapping and per-metric field paths |
| docs/implementation_notes.md | Distributed strategies, gradient handling, batch-size scaling, memory, known limitations |
| docs/adding_pinsfacerecognition.md | Manual Kaggle download for the Pins Face Recognition dataset |
| docs/future_work.md | Planned extensions |

📝 Citing this work

If you use SUPREME in your research, please cite our work. When you use a specific unlearning method, please also cite its original paper (linked in each method's source-file header); the foundational SSD/LFSSD and Bad Teacher papers are included below.

@misc{supreme2026,
  title  = {SUPREME: A Multi-GPU Framework for Reproducible Image Unlearning Method Evaluation},
  author = {Petros Andreou and Jamie Lanyon and Axel Finke and Georgina Cosma},
  year   = {2026},
  eprint = {2606.00380},
  archivePrefix = {arXiv},
  primaryClass = {cs.LG},
  url    = {https://arxiv.org/abs/2606.00380}
}
@inproceedings{foster2024ssd,
  title     = {Fast Machine Unlearning Without Retraining Through Selective Synaptic Dampening},
  author    = {Foster, Jack and Schoepf, Stefan and Brintrup, Alexandra},
  booktitle = {Proceedings of the AAAI Conference on Artificial Intelligence},
  year      = {2024},
  url       = {https://arxiv.org/abs/2308.07707}
}
@inproceedings{foster2024lossfree,
  title     = {Loss-Free Machine Unlearning},
  author    = {Foster, Jack and Schoepf, Stefan and Brintrup, Alexandra},
  booktitle = {ICLR 2024 Tiny Papers Track},
  year      = {2024},
  url       = {https://arxiv.org/abs/2402.19308}
}
@inproceedings{chundawat2023badteacher,
  title     = {Can Bad Teaching Induce Forgetting? Unlearning in Deep Networks using an Incompetent Teacher},
  author    = {Chundawat, Vikram S and Tarun, Ayush K and Mandal, Murari and Kankanhalli, Mohan},
  booktitle = {Proceedings of the AAAI Conference on Artificial Intelligence},
  year      = {2023},
  url       = {https://arxiv.org/abs/2205.08096}
}

This work was conducted at Loughborough University.


🙏 Acknowledgements

Several unlearning methods reimplement or adapt published research code. We thank the authors of the following projects, and ask that you cite the original papers (linked in each method's source-file header) when using the corresponding methods:


📄 License

This project is licensed under the MIT License. See the LICENSE file for details.

