
HeGM: Heterogeneous GPU Migration framework for deep learning (PyTorch today, TensorFlow-ready)


HeGM lets you live-migrate long-running deep learning jobs between GPUs using CRIU (Checkpoint/Restore in Userspace), without requiring any code changes in your training script. Your existing for step, batch in enumerate(dataloader) loop works as-is; HeGM transparently resumes from the correct batch and step number after migration.
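For illustration, this is the loop shape HeGM targets, unchanged. A plain list stands in for a torch.utils.data.DataLoader here so the sketch runs without PyTorch; under HeGM the same enumerate() call transparently resumes at the right batch after a restore:

```python
# The ordinary training loop HeGM supervises -- no HeGM-specific code needed.
# (A plain list stands in for a DataLoader so this runs anywhere.)
dataloader = ["batch-%d" % i for i in range(5)]

steps_seen = []
for step, batch in enumerate(dataloader):
    # ... forward / backward / optimizer.step() would go here ...
    steps_seen.append(step)

print(steps_seen)  # every step, in order
```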

It targets PyTorch today and is structured to support additional backends (e.g. TensorFlow) in the future.


Installation

Install the Python package

The HeGM Python package (launcher + runtime hooks) is published on PyPI as hegm.

On any machine/container where you run training:

python -m pip install --upgrade pip
pip install hegm

This installs:

  • the hegm package (transparent enumerate() patching, hooks, backends)
  • a sitecustomize.py module in site-packages that automatically loads HeGM
  • the hegm-launcher CLI

You can then start training under HeGM with:

hegm-launcher python -u train.py

In Kubernetes Pods, your command block typically becomes:

hegm-launcher python -u /workspace/train.py

Runtime prerequisites (cluster)

HeGM relies on CRIU and CRI-O builds that understand GPU checkpoint/restore and the /checkpoint/*/lock convention. Install these on your nodes first:

  • CRIU (GPU migration fork)
    leehun-criu, branch 2026-01-26/gpu-migration-support
    See that repository's README for build/install instructions.

  • CRI-O (restore-from-file fork)
    leehun-cri-o, branch 2026-02-03/support-restore-from-file
    Install and configure it as your Kubernetes container runtime.

In addition you need:

  • A Kubernetes cluster with GPU nodes and the NVIDIA drivers/runtime configured.
  • kubectl access to the cluster.

Deploying the HeGM example

From this repository:

  1. Create the demo namespace and storage (if not already present):
    • examples/dra/ns.yaml
    • examples/dra/storage.yaml
  2. Create the ConfigMaps:
    • examples/dra/hegm-scripts.yaml (Launcher + HeGM payload)
    • examples/dra/training-script.yaml (your train.py)
  3. Create the resource claim(s):
    • examples/dra/resource-claim.yaml (or resource-claim-restore.yaml)
  4. Launch a training pod:
    • Single worker: examples/dra/training-pod.yaml
    • Two workers in one Pod (for PID-isolation testing): examples/dra/multiple-training-pod.yaml
  5. Trigger checkpoint from outside the pod:
    • Prefer using the hegm-ctrl CLI, which also exposes a REST API inside the Pod:
      • All workers using the GPU:
        kubectl exec ... -- hegm-ctrl checkpoint --all
      • Specific PID(s):
        kubectl exec ... -- hegm-ctrl checkpoint --pid <pid1> <pid2> ...
    • Under the hood this sends SIGUSR1 to the worker PID(s) and waits until the corresponding /checkpoint/<PPID>/lock file(s) appear, indicating that the checkpoint is ready for CRIU/CRI-O to snapshot.
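The signal-and-wait handshake in step 5 can be sketched as follows. This is a simplified stand-in for what hegm-ctrl does, not its actual implementation; the lock path is passed in explicitly here, whereas the real tool derives it from the /checkpoint/&lt;PPID&gt;/lock convention:

```python
import os
import signal
import tempfile
import time

def trigger_checkpoint(pid, lock_path, timeout=10.0, poll=0.05):
    """Send SIGUSR1 to a worker, then wait for its lock file to appear.
    Status strings mirror the ones in the REST responses below."""
    os.kill(pid, signal.SIGUSR1)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(lock_path):
            return "checkpoint-ready"
        time.sleep(poll)
    return "signalled (timeout-waiting-lock)"

# Demo: this process acts as its own "worker" -- its SIGUSR1 handler
# creates the lock file, standing in for the launcher's checkpoint logic.
lock = os.path.join(tempfile.mkdtemp(), "lock")
signal.signal(signal.SIGUSR1, lambda signum, frame: open(lock, "w").close())
result = trigger_checkpoint(os.getpid(), lock)
```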

Control API (CLI + REST)

When at least one hegm-launcher is active, HeGM starts a small HTTP server inside the container (default 0.0.0.0:8298, configurable via HEGM_CTRL_PORT). It is implemented by the hegm.ctrl module, which also backs the hegm-ctrl CLI.

  • CLI (inside the Pod):

    • Trigger a checkpoint for all workers using the GPU:

      hegm-ctrl checkpoint --all
      
    • Trigger checkpoint for specific worker PID(s):

      hegm-ctrl checkpoint --pid 1234 5678
      
    • Resume from checkpoints:

      # All pending checkpoints (all /checkpoint/*/lock)
      hegm-ctrl resume --all
      
      # Specific parent PID(s)
      hegm-ctrl resume --ppid 1 2 3
      

    checkpoint waits until the corresponding lock file(s) /checkpoint/<PPID>/lock appear (or a timeout expires), so when the command returns successfully, the checkpoint is ready for CRIU/CRI-O.
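The resume side can be sketched as follows. This is an illustrative stand-in for hegm-ctrl resume, not the real implementation; it assumes resume works by removing the per-PPID lock file, which matches the "lock-removed" / "no-lock" status strings documented in this section:

```python
import os

def resume(checkpoint_root, ppids):
    """Remove /checkpoint/<PPID>/lock for each given parent PID, signalling
    the launcher that it may continue. Returns a per-PPID status map shaped
    like the REST /resume response."""
    results = {}
    for ppid in ppids:
        lock = os.path.join(checkpoint_root, str(ppid), "lock")
        if os.path.exists(lock):
            os.remove(lock)
            results[str(ppid)] = "lock-removed"
        else:
            results[str(ppid)] = "no-lock"
    return results
```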

  • REST API (from outside the Pod):

    Assuming HEGM_CTRL_PORT=8298 and using the Pod IP:

    • POST /checkpoint

      // Signal specific worker PIDs
      {
        "pids": [1234, 5678]
      }
      
      // Or auto-detect GPU/worker PIDs
      {
        "all": true
      }
      

      Response:

      {
        "results": {
          "1234": "checkpoint-ready",
          "5678": "signalled (timeout-waiting-lock)"
        }
      }
      
    • POST /resume

      // Resume specific parent PIDs
      {
        "ppids": [1, 2, 3]
      }
      
      // Or resume all pending checkpoints
      {
        "all": true
      }
      

      Response:

      {
        "results": {
          "1": "lock-removed",
          "2": "no-lock"
        }
      }
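From outside the Pod, a minimal standard-library client for these endpoints might look like this sketch (the Pod IP in the usage comment is a placeholder; the port follows the HEGM_CTRL_PORT default above):

```python
import json
from urllib import request

def post_json(base_url, path, payload):
    """POST a JSON body to the HeGM control API and return the decoded reply.
    base_url is e.g. "http://<pod-ip>:8298"."""
    req = request.Request(
        base_url + path,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())

# Usage (placeholder Pod IP):
#   post_json("http://10.0.0.12:8298", "/checkpoint", {"all": True})
#   post_json("http://10.0.0.12:8298", "/resume", {"ppids": [1, 2, 3]})
```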
      
Finally, use your CRI-O / CRIU integration to checkpoint and restore the container, pointing CRI-O at the CRIU checkpoint tarball and re-using the same hegm-scripts and training-script ConfigMaps (see examples/dra/restore-pod.yaml).

High-level architecture

HeGM is split into two main pieces:

  • Launcher (launcher.py / hegm-launcher): a small supervisor process that:

    • spawns your training script as a child process (the Worker)
    • injects sitecustomize.py via PYTHONPATH
    • watches the Worker's exit code:
      • 0 → training finished, exit normally
      • 99 → Worker saved a checkpoint and exited for migration
    • buffers the checkpoint file into RAM so CRIU can carry it across nodes
    • creates a per-process lock file so an external controller knows when it is safe to snapshot
    • starts a lightweight in-container REST API (via the hegm-ctrl module) so external controllers can trigger checkpoint/resume over HTTP
  • Payload (sitecustomize.py + hegm/ package): automatically loaded into the Worker by Python. It:

    • installs a PEP‑451 import hook for torch
    • monkey-patches torch.nn.Module and torch.optim.Optimizer to track models and optimizers
    • patches builtins.enumerate so enumerate(DataLoader) transparently resumes from the correct batch after a checkpoint restore
    • tracks global training steps and RNG state
    • handles SIGUSR1 by saving a checkpoint and exiting with code 99
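The enumerate() patch can be sketched conceptually like this. The real hook only special-cases DataLoader and reads the resume point from the restored checkpoint state; the resume_at parameter here is a hypothetical knob for illustration only:

```python
import builtins

_real_enumerate = builtins.enumerate  # keep a handle on the original

def resumable_enumerate(iterable, start=0, resume_at=0):
    """Conceptual sketch of a resumable enumerate(): after a restore, the
    first `resume_at` batches are skipped so the loop continues at the step
    recorded in the checkpoint, with step numbers unchanged."""
    for step, item in _real_enumerate(iterable, start):
        if step < start + resume_at:
            continue  # fast-forward through already-processed batches
        yield step, item

# After a hypothetical restore at batch 3:
print(list(resumable_enumerate(["a", "b", "c", "d", "e"], resume_at=3)))
# → [(3, 'd'), (4, 'e')]
```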

All of this is delivered to your Pods via a ConfigMap (examples/dra/hegm-scripts.yaml).

Key files

  • launcher.py

    • Entry point you run instead of python train.py.
    • For each Launcher process, checkpoints and lock files are isolated by PID:
      • /checkpoint/<PID>/latest.pt
      • /checkpoint/<PID>/lock
    • The external controller can discover all ready instances via /checkpoint/*/lock.
  • sitecustomize.py

    • Thin bootstrap that simply does import hegm.
    • Needs to stay at the top level of your PYTHONPATH so Python’s sitecustomize mechanism can find it.
  • hegm/

    • __init__.py: patches enumerate() for transparent DataLoader resume, public API (hegm.global_step()), SIGUSR1 handler, import hooks.
    • _config.py: env‑driven configuration and lightweight logging.
    • _hook.py: generic PEP‑451 import hook logic.
    • backends/__init__.py: AbstractBackend interface and registry.
    • backends/pytorch.py: PyTorch backend (tracking, checkpointing, RNG, GPU teardown).
    • ctrl.py: external control helpers (hegm-ctrl CLI and REST API).
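The /checkpoint/*/lock discovery convention described under launcher.py can be sketched as follows. This is an illustrative helper following the PID-directory layout above, not part of the HeGM API:

```python
import glob
import os

def ready_launchers(checkpoint_root="/checkpoint"):
    """Return the PIDs of launcher instances whose checkpoints are ready,
    i.e. those that own a /checkpoint/<PID>/lock file."""
    pids = []
    for lock in glob.glob(os.path.join(checkpoint_root, "*", "lock")):
        pid = os.path.basename(os.path.dirname(lock))
        if pid.isdigit():  # ignore directories that are not PID-named
            pids.append(int(pid))
    return sorted(pids)
```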

Why launcher.py and sitecustomize.py live at the top level

The package code lives under hegm/, but we intentionally keep:

  • sitecustomize.py at the top level so that Python’s automatic sitecustomize import works as soon as /opt/hegm is on PYTHONPATH.
  • launcher.py at the top level so Kubernetes manifests can invoke it as /opt/hegm/launcher.py without needing to worry about Python module import paths.

Internally, both of these files are very thin:

  • sitecustomize.py just imports the hegm package.
  • launcher.py is a small script that wires up the environment and hands most of the logic off to the shared hegm package.

In other words, the reusable logic already lives in hegm/; the two top‑level scripts are just ergonomic entry points for Python and Kubernetes. We can still add more entry points later (e.g. hegm.launcher console script) without changing this layout.
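A minimal sketch of the launcher's supervision contract (exit code 0 = finished, 99 = checkpointed for migration, as described in the architecture section). This is simplified for illustration; the real launcher.py also buffers the checkpoint into RAM and manages the lock file:

```python
import os
import subprocess
import sys

def run_worker(argv, hegm_path="/opt/hegm"):
    """Spawn the training script with HeGM on PYTHONPATH (so sitecustomize.py
    loads), then classify its exit code per the launcher's contract."""
    env = dict(os.environ)
    env["PYTHONPATH"] = hegm_path + os.pathsep + env.get("PYTHONPATH", "")
    code = subprocess.call(argv, env=env)
    if code == 0:
        return "finished"
    if code == 99:
        return "checkpointed-for-migration"
    return "failed (%d)" % code

# Example: a stand-in worker that exits with the migration code.
print(run_worker([sys.executable, "-c", "import sys; sys.exit(99)"]))
# → checkpointed-for-migration
```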

Examples

Working DRA examples live under examples/dra/:

  • training-pod.yaml – single training job.
  • multiple-training-pod.yaml – two launcher.py processes in one Pod, useful for testing PID‑isolated checkpoints and locks.
  • restore-pod.yaml – example restore workflow using CRIU.

See docs/HeGM_Architecture.md for a more detailed architectural walk‑through.
