HeGM: Heterogeneous GPU Migration framework for deep learning (PyTorch today, TensorFlow-ready)
HeGM lets you live-migrate long-running deep learning jobs between GPUs
using CRIU (Checkpoint/Restore in Userspace), without requiring any code
changes in your training script. Your existing
`for step, batch in enumerate(dataloader)` loop works as-is; HeGM
transparently resumes from the correct batch and step number after migration.
It targets PyTorch today and is structured to support additional backends (e.g. TensorFlow) in the future.
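For illustration, a stand-in training script like the following (hypothetical; the list of batches replaces a real `torch.utils.data.DataLoader`, and the "loss" is a placeholder) needs no HeGM-specific changes to run under `hegm-launcher`:

```python
# train.py -- no HeGM imports needed: hegm-launcher injects its hooks
# via sitecustomize.py, and the enumerate() patch handles resume.

def train(dataloader, epochs=1):
    steps = []
    for epoch in range(epochs):
        for step, batch in enumerate(dataloader):  # the unmodified loop
            loss = sum(batch) / len(batch)         # placeholder "training"
            steps.append(step)
    return steps

if __name__ == "__main__":
    batches = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    print(train(batches))  # [0, 1, 2]
```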
Installation
Install the Python package
The HeGM Python package (launcher + runtime hooks) is published on PyPI as
`hegm`.
On any machine/container where you run training:
python -m pip install --upgrade pip
pip install hegm
This installs:
- the `hegm` package (transparent `enumerate()` patching, hooks, backends)
- a `sitecustomize.py` module in site-packages that automatically loads HeGM
- the `hegm-launcher` CLI
You can then start training under HeGM with:
hegm-launcher python -u train.py
In Kubernetes Pods, your command block typically becomes:
hegm-launcher python -u /workspace/train.py
Runtime prerequisites (cluster)
HeGM relies on CRIU and CRI-O builds that understand GPU checkpoint/restore
and the `/checkpoint/*/lock` convention. Install these on your nodes first:

- CRIU (GPU migration fork): `leehun-criu`, branch
  `2026-01-26/gpu-migration-support`. See that repository's README for
  build/install instructions.
- CRI-O (restore-from-file fork): `leehun-cri-o`, branch
  `2026-02-03/support-restore-from-file`. Install and configure it as your
  Kubernetes container runtime.
In addition you need:
- A Kubernetes cluster with GPU nodes and the NVIDIA drivers/runtime configured.
- `kubectl` access to the cluster.
Deploying HeGM example
From this repository:
- Create the demo namespace and storage (if not already present):
  `examples/dra/ns.yaml`, `examples/dra/storage.yaml`
- Create the ConfigMaps:
  `examples/dra/hegm-scripts.yaml` (launcher + HeGM payload) and
  `examples/dra/training-script.yaml` (your `train.py`)
- Create the resource claim(s):
  `examples/dra/resource-claim.yaml` (or `resource-claim-restore.yaml`)
- Launch a training pod:
  - Single worker: `examples/dra/training-pod.yaml`
  - Two workers in one Pod (for PID-isolation testing):
    `examples/dra/multiple-training-pod.yaml`
- Trigger a checkpoint from outside the Pod. Prefer the `hegm-ctrl` CLI,
  which also exposes a REST API inside the Pod:
  - All workers using the GPU:
    `kubectl exec ... -- hegm-ctrl checkpoint --all`
  - Specific PID(s):
    `kubectl exec ... -- hegm-ctrl checkpoint --pid <pid1> <pid2> ...`

  Under the hood this sends `SIGUSR1` to the worker PID(s) and waits until
  the corresponding `/checkpoint/<PPID>/lock` file(s) appear, indicating
  that the checkpoint is ready for CRIU/CRI-O to snapshot.
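The signal-and-wait protocol above can be sketched in a few lines of Python (a minimal sketch of what a controller does; the helper names and polling interval are illustrative, and in practice `hegm-ctrl` handles this for you):

```python
import glob
import os
import signal
import time

def wait_for_lock(lock_path, timeout=30.0, poll=0.2):
    """Poll until lock_path appears, i.e. the worker finished checkpointing."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(lock_path):
            return True
        time.sleep(poll)
    return False

def trigger_checkpoint(pid, ppid, timeout=30.0):
    """Send SIGUSR1 to a worker, then wait for its launcher's lock file."""
    os.kill(pid, signal.SIGUSR1)
    return wait_for_lock(f"/checkpoint/{ppid}/lock", timeout=timeout)

def ready_instances(root="/checkpoint"):
    """Discover all launcher instances whose checkpoints are ready."""
    return sorted(glob.glob(os.path.join(root, "*", "lock")))
```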
Control API (CLI + REST)
When at least one `hegm-launcher` is active, HeGM starts a small HTTP server
inside the container (default `0.0.0.0:8298`, configurable via
`HEGM_CTRL_PORT`). It is implemented by the `hegm-ctrl` module.
- CLI (inside the Pod):

  - Trigger a checkpoint for all GPU-using worker processes:

    ```
    hegm-ctrl checkpoint --all
    ```

  - Trigger a checkpoint for specific worker PID(s):

    ```
    hegm-ctrl checkpoint --pid 1234 5678
    ```

  - Resume from checkpoints:

    ```
    # All pending checkpoints (all /checkpoint/*/lock)
    hegm-ctrl resume --all
    # Specific parent PID(s)
    hegm-ctrl resume --ppid 1 2 3
    ```

  `checkpoint` waits until the corresponding lock file(s)
  `/checkpoint/<PPID>/lock` appear (or a timeout), so when the command
  returns the checkpoint is ready for CRIU/CRI-O.

- REST API (from outside the Pod), assuming `HEGM_CTRL_PORT=8298` and using
  the Pod IP:

  - `POST /checkpoint` to signal specific worker PIDs:

    ```json
    { "pids": [1234, 5678] }
    ```

    or to auto-detect GPU/worker PIDs:

    ```json
    { "all": true }
    ```

    Response:

    ```json
    { "results": { "1234": "checkpoint-ready", "5678": "signalled (timeout-waiting-lock)" } }
    ```

  - `POST /resume` to resume specific parent PIDs:

    ```json
    { "ppids": [1, 2, 3] }
    ```

    or to resume all pending checkpoints:

    ```json
    { "all": true }
    ```

    Response:

    ```json
    { "results": { "1": "lock-removed", "2": "no-lock" } }
    ```
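From outside the Pod, the REST endpoints can be driven with a few lines of Python (a sketch; the Pod IP is a placeholder, and `build_payload`/`post` are hypothetical helpers, not part of HeGM):

```python
import json
import urllib.request

def build_payload(pids=None, all_workers=False):
    """Build the JSON body for POST /checkpoint."""
    return {"all": True} if all_workers else {"pids": list(pids or [])}

def post(pod_ip, path, payload, port=8298):
    """POST a JSON payload to the in-container hegm-ctrl server."""
    req = urllib.request.Request(
        f"http://{pod_ip}:{port}{path}",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Example (requires a running Pod; the IP is made up):
# post("10.0.0.12", "/checkpoint", build_payload(pids=[1234, 5678]))
```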
- Use your CRI-O / CRIU integration to checkpoint and restore the container,
  pointing CRI-O at the CRIU checkpoint tarball and re-using the same
  `hegm-scripts` and `train-script` ConfigMaps (see
  `examples/dra/restore-pod.yaml`).
High-level architecture
HeGM is split into two main pieces:
- Launcher (`launcher.py` / `hegm-launcher`): a small supervisor process that:
  - spawns your training script as a child process (the Worker)
  - injects `sitecustomize.py` via `PYTHONPATH`
  - watches the Worker's exit code: `0` means training finished (exit
    normally); `99` means the Worker saved a checkpoint and exited for
    migration
  - buffers the checkpoint file into RAM so CRIU can carry it across nodes
  - creates a per-process lock file so an external controller knows when it
    is safe to snapshot
  - starts a lightweight in-container REST API (via the `hegm-ctrl` module)
    so external controllers can trigger checkpoint/resume over HTTP

- Payload (`sitecustomize.py` + the `hegm/` package): automatically loaded
  into the Worker by Python. It:
  - installs a PEP‑451 import hook for `torch`
  - monkey-patches `torch.nn.Module` and `torch.optim.Optimizer` to track
    models and optimizers
  - patches `builtins.enumerate` so `enumerate(DataLoader)` transparently
    resumes from the correct batch after a checkpoint restore
  - tracks global training steps and RNG state
  - handles `SIGUSR1` by saving a checkpoint and exiting with code `99`
All of this is delivered to your Pods via a ConfigMap (examples/dra/hegm-scripts.yaml).
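The Launcher's exit-code protocol can be sketched as follows (a minimal sketch; the real launcher also handles `PYTHONPATH` injection, RAM buffering of the checkpoint, and lock-file creation):

```python
import subprocess
import sys

CHECKPOINT_EXIT_CODE = 99  # Worker saved a checkpoint and wants to migrate

def supervise(argv):
    """Run the training script and interpret its exit code."""
    proc = subprocess.run(argv)
    if proc.returncode == 0:
        return "finished"
    if proc.returncode == CHECKPOINT_EXIT_CODE:
        # Here the real launcher would buffer the checkpoint into RAM
        # and create /checkpoint/<PID>/lock for the external controller.
        return "checkpointed"
    return f"failed ({proc.returncode})"

if __name__ == "__main__":
    print(supervise([sys.executable, "-c", "import sys; sys.exit(99)"]))
```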
Key files
- `launcher.py`
  - Entry point you run instead of `python train.py`.
  - For each Launcher process, checkpoints and lock files are isolated by
    PID: `/checkpoint/<PID>/latest.pt`, `/checkpoint/<PID>/lock`.
  - The external controller can discover all ready instances via
    `/checkpoint/*/lock`.
- `sitecustomize.py`
  - Thin bootstrap that simply does `import hegm`.
  - Needs to stay at the top level of your `PYTHONPATH` so Python's
    `sitecustomize` mechanism can find it.
- `hegm/`
  - `__init__.py`: patches `enumerate()` for transparent DataLoader resume,
    public API (`hegm.global_step()`), `SIGUSR1` handler, import hooks.
  - `_config.py`: env-driven configuration and lightweight logging.
  - `_hook.py`: generic PEP‑451 import hook logic.
  - `backends/__init__.py`: `AbstractBackend` interface and registry.
  - `backends/pytorch.py`: PyTorch backend (tracking, checkpointing, RNG,
    GPU teardown).
  - `ctrl.py`: external control helpers (`hegm-ctrl` CLI and REST API).
Why launcher.py and sitecustomize.py live at the top level
The package code lives under `hegm/`, but we intentionally keep:

- `sitecustomize.py` at the top level, so that Python's automatic
  `sitecustomize` import works as soon as `/opt/hegm` is on `PYTHONPATH`.
- `launcher.py` at the top level, so Kubernetes manifests can invoke it as
  `/opt/hegm/launcher.py` without needing to worry about Python module
  import paths.
Internally, both of these files are very thin:
- `sitecustomize.py` just imports the `hegm` package.
- `launcher.py` is a small script that wires the environment and hands off
  most logic to the shared configuration/checkpointing scheme.
In other words, the reusable logic already lives in hegm/; the two
top‑level scripts are just ergonomic entry points for Python and Kubernetes.
We can still add more entry points later (e.g. hegm.launcher console
script) without changing this layout.
Examples
Working DRA examples live under examples/dra/:
- `training-pod.yaml` – single training job.
- `multiple-training-pod.yaml` – two `launcher.py` processes in one Pod,
  useful for testing PID‑isolated checkpoints and locks.
- `restore-pod.yaml` – example restore workflow using CRIU.
See docs/HeGM_Architecture.md for a more detailed architectural walk‑through.
Download files
Download the file for your platform.
File details
Details for the file hegm-0.3.1.tar.gz.
File metadata
- Download URL: hegm-0.3.1.tar.gz
- Upload date:
- Size: 38.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `698f47e70ba78e4c7519e6fd320d58ac8813704e207d6569ee8d9b41662192d6` |
| MD5 | `32f51a967a5a52f18dc7f161f3e5b585` |
| BLAKE2b-256 | `2c42fc67102d869d407be57f569e2cb9371b3c88e3fb59e27d78cfa8a5ae2d82` |
File details
Details for the file hegm-0.3.1-py3-none-any.whl.
File metadata
- Download URL: hegm-0.3.1-py3-none-any.whl
- Upload date:
- Size: 38.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `9fff20f0b561aad71d1603ea9d5577aae2665a9aebe547e08fa9827f572f34c3` |
| MD5 | `c64bd090a543145ca0b7346cc168856f` |
| BLAKE2b-256 | `7918af3d4ffdcc82ac7d806f816e0ede23a498daa717bdb38cc741e27a76ecb6` |