
HeGM: Heterogeneous GPU Migration framework for deep learning (PyTorch today, TensorFlow-ready)


HeGM lets you live-migrate long-running deep learning jobs between GPUs using CRIU (Checkpoint/Restore in Userspace), without requiring any code changes in your training script. Your existing for step, batch in enumerate(dataloader) loop works as-is; HeGM transparently resumes from the correct batch and step number after migration.
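For illustration, this is the loop shape HeGM targets, unchanged. A plain list stands in for a torch.utils.data.DataLoader here so the sketch runs without PyTorch; under HeGM the same enumerate() call transparently resumes at the right batch after a restore:

```python
# The ordinary training loop HeGM supervises -- no HeGM-specific code needed.
# (A plain list stands in for a DataLoader so this runs anywhere.)
dataloader = ["batch-%d" % i for i in range(5)]

steps_seen = []
for step, batch in enumerate(dataloader):
    # ... forward / backward / optimizer.step() would go here ...
    steps_seen.append(step)

print(steps_seen)  # every step, in order
```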

It targets PyTorch today and is structured to support additional backends (e.g. TensorFlow) in the future.


Installation

Install the Python package

The HeGM Python package (launcher + runtime hooks) is published on PyPI as hegm.

On any machine/container where you run training:

python -m pip install --upgrade pip
pip install hegm

This installs:

  • the hegm package (transparent enumerate() patching, hooks, backends)
  • a sitecustomize.py module in site-packages that automatically loads HeGM
  • the hegm-launcher CLI

You can then start training under HeGM with:

hegm-launcher python -u train.py

In Kubernetes Pods, your command block typically becomes:

hegm-launcher python -u /workspace/train.py

Runtime prerequisites (cluster)

HeGM relies on CRIU and CRI-O builds that understand GPU checkpoint/restore and the /checkpoint/*/lock convention. Install these on your nodes first:

  • CRIU (GPU migration fork)
    leehun-criu, branch 2026-01-26/gpu-migration-support
    See that repository's README for build/install instructions.

  • CRI-O (restore-from-file fork)
    leehun-cri-o, branch 2026-02-03/support-restore-from-file
    Install and configure it as your Kubernetes container runtime.

In addition you need:

  • A Kubernetes cluster with GPU nodes and the NVIDIA drivers/runtime configured.
  • kubectl access to the cluster.

Deploying the HeGM example

From this repository:

  1. Create the demo namespace and storage (if not already present):
    • examples/dra/ns.yaml
    • examples/dra/storage.yaml
  2. Create the ConfigMaps:
    • examples/dra/hegm-scripts.yaml (Launcher + HeGM payload)
    • examples/dra/training-script.yaml (your train.py)
  3. Create the resource claim(s):
    • examples/dra/resource-claim.yaml (or resource-claim-restore.yaml)
  4. Launch a training pod:
    • Single worker: examples/dra/training-pod.yaml
    • Two workers in one Pod (for PID-isolation testing): examples/dra/multiple-training-pod.yaml
  5. Trigger checkpoint from outside the pod:
    • Prefer using the hegm-ctrl CLI, which also exposes a REST API inside the Pod:
      • All workers using the GPU:
        kubectl exec ... -- hegm-ctrl checkpoint --all
      • Specific PID(s):
        kubectl exec ... -- hegm-ctrl checkpoint --pid <pid1> <pid2> ...
    • Under the hood this sends SIGUSR1 to the worker PID(s) and waits until the corresponding /checkpoint/<PPID>/lock file(s) appear, indicating that the checkpoint is ready for CRIU/CRI-O to snapshot.
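The signal-and-wait handshake in step 5 can be sketched as follows. This is a simplified stand-in for what hegm-ctrl does, not its actual implementation; the lock path is passed in explicitly here, whereas the real tool derives it from the /checkpoint/&lt;PPID&gt;/lock convention:

```python
import os
import signal
import tempfile
import time

def trigger_checkpoint(pid, lock_path, timeout=10.0, poll=0.05):
    """Send SIGUSR1 to a worker, then wait for its lock file to appear.
    Status strings mirror the ones in the REST responses below."""
    os.kill(pid, signal.SIGUSR1)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(lock_path):
            return "checkpoint-ready"
        time.sleep(poll)
    return "signalled (timeout-waiting-lock)"

# Demo: this process acts as its own "worker" -- its SIGUSR1 handler
# creates the lock file, standing in for the launcher's checkpoint logic.
lock = os.path.join(tempfile.mkdtemp(), "lock")
signal.signal(signal.SIGUSR1, lambda signum, frame: open(lock, "w").close())
result = trigger_checkpoint(os.getpid(), lock)
```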

Control API (CLI + REST)

When at least one hegm-launcher is active, HeGM starts a small HTTP server inside the container (default 0.0.0.0:8298, configurable via HEGM_CTRL_PORT). It is implemented by the hegm.ctrl module, which also backs the hegm-ctrl CLI.

  • CLI (inside the Pod):

    • Trigger a checkpoint for all workers using the GPU:

      hegm-ctrl checkpoint --all
      
    • Trigger checkpoint for specific worker PID(s):

      hegm-ctrl checkpoint --pid 1234 5678
      
    • Resume from checkpoints:

      # All pending checkpoints (all /checkpoint/*/lock)
      hegm-ctrl resume --all
      
      # Specific parent PID(s)
      hegm-ctrl resume --ppid 1 2 3
      

    checkpoint waits until the corresponding lock file(s) /checkpoint/<PPID>/lock appear (or a timeout expires), so when the command returns successfully, the checkpoint is ready for CRIU/CRI-O.
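The resume side can be sketched as follows. This is an illustrative stand-in for hegm-ctrl resume, not the real implementation; it assumes resume works by removing the per-PPID lock file, which matches the "lock-removed" / "no-lock" status strings documented in this section:

```python
import os

def resume(checkpoint_root, ppids):
    """Remove /checkpoint/<PPID>/lock for each given parent PID, signalling
    the launcher that it may continue. Returns a per-PPID status map shaped
    like the REST /resume response."""
    results = {}
    for ppid in ppids:
        lock = os.path.join(checkpoint_root, str(ppid), "lock")
        if os.path.exists(lock):
            os.remove(lock)
            results[str(ppid)] = "lock-removed"
        else:
            results[str(ppid)] = "no-lock"
    return results
```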

  • REST API (from outside the Pod):

    Assuming HEGM_CTRL_PORT=8298 and using the Pod IP:

    • POST /checkpoint

      // Signal specific worker PIDs
      {
        "pids": [1234, 5678]
      }
      
      // Or auto-detect GPU/worker PIDs
      {
        "all": true
      }
      

      Response:

      {
        "results": {
          "1234": "checkpoint-ready",
          "5678": "signalled (timeout-waiting-lock)"
        }
      }
      
    • POST /resume

      // Resume specific parent PIDs
      {
        "ppids": [1, 2, 3]
      }
      
      // Or resume all pending checkpoints
      {
        "all": true
      }
      

      Response:

      {
        "results": {
          "1": "lock-removed",
          "2": "no-lock"
        }
      }
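From outside the Pod, a minimal standard-library client for these endpoints might look like this sketch (the Pod IP in the usage comment is a placeholder; the port follows the HEGM_CTRL_PORT default above):

```python
import json
from urllib import request

def post_json(base_url, path, payload):
    """POST a JSON body to the HeGM control API and return the decoded reply.
    base_url is e.g. "http://<pod-ip>:8298"."""
    req = request.Request(
        base_url + path,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())

# Usage (placeholder Pod IP):
#   post_json("http://10.0.0.12:8298", "/checkpoint", {"all": True})
#   post_json("http://10.0.0.12:8298", "/resume", {"ppids": [1, 2, 3]})
```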
      
Finally, use your CRI-O / CRIU integration to checkpoint and restore the container, pointing CRI-O at the CRIU checkpoint tarball and re-using the same hegm-scripts and training-script ConfigMaps (see examples/dra/restore-pod.yaml).

High-level architecture

HeGM is split into two main pieces:

  • Launcher (launcher.py / hegm-launcher): a small supervisor process that:

    • spawns your training script as a child process (the Worker)
    • injects sitecustomize.py via PYTHONPATH
    • watches the Worker's exit code:
      • 0 → training finished, exit normally
      • 99 → Worker saved a checkpoint and exited for migration
    • buffers the checkpoint file into RAM so CRIU can carry it across nodes
    • creates a per-process lock file so an external controller knows when it is safe to snapshot
    • starts a lightweight in-container REST API (via the hegm-ctrl module) so external controllers can trigger checkpoint/resume over HTTP
  • Payload (sitecustomize.py + hegm/ package): automatically loaded into the Worker by Python. It:

    • installs a PEP‑451 import hook for torch
    • monkey-patches torch.nn.Module and torch.optim.Optimizer to track models and optimizers
    • patches builtins.enumerate so enumerate(DataLoader) transparently resumes from the correct batch after a checkpoint restore
    • tracks global training steps and RNG state
    • handles SIGUSR1 by saving a checkpoint and exiting with code 99
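The enumerate() patch can be sketched conceptually like this. The real hook only special-cases DataLoader and reads the resume point from the restored checkpoint state; the resume_at parameter here is a hypothetical knob for illustration only:

```python
import builtins

_real_enumerate = builtins.enumerate  # keep a handle on the original

def resumable_enumerate(iterable, start=0, resume_at=0):
    """Conceptual sketch of a resumable enumerate(): after a restore, the
    first `resume_at` batches are skipped so the loop continues at the step
    recorded in the checkpoint, with step numbers unchanged."""
    for step, item in _real_enumerate(iterable, start):
        if step < start + resume_at:
            continue  # fast-forward through already-processed batches
        yield step, item

# After a hypothetical restore at batch 3:
print(list(resumable_enumerate(["a", "b", "c", "d", "e"], resume_at=3)))
# → [(3, 'd'), (4, 'e')]
```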

All of this is delivered to your Pods via a ConfigMap (examples/dra/hegm-scripts.yaml).

Key files

  • launcher.py

    • Entry point you run instead of python train.py.
    • For each Launcher process, checkpoints and lock files are isolated by PID:
      • /checkpoint/<PID>/latest.pt
      • /checkpoint/<PID>/lock
    • The external controller can discover all ready instances via /checkpoint/*/lock.
  • sitecustomize.py

    • Thin bootstrap that simply does import hegm.
    • Needs to stay at the top level of your PYTHONPATH so Python’s sitecustomize mechanism can find it.
  • hegm/

    • __init__.py: patches enumerate() for transparent DataLoader resume, public API (hegm.global_step()), SIGUSR1 handler, import hooks.
    • _config.py: env‑driven configuration and lightweight logging.
    • _hook.py: generic PEP‑451 import hook logic.
    • backends/__init__.py: AbstractBackend interface and registry.
    • backends/pytorch.py: PyTorch backend (tracking, checkpointing, RNG, GPU teardown).
    • ctrl.py: external control helpers (hegm-ctrl CLI and REST API).
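The /checkpoint/*/lock discovery convention described under launcher.py can be sketched as follows. This is an illustrative helper following the PID-directory layout above, not part of the HeGM API:

```python
import glob
import os

def ready_launchers(checkpoint_root="/checkpoint"):
    """Return the PIDs of launcher instances whose checkpoints are ready,
    i.e. those that own a /checkpoint/<PID>/lock file."""
    pids = []
    for lock in glob.glob(os.path.join(checkpoint_root, "*", "lock")):
        pid = os.path.basename(os.path.dirname(lock))
        if pid.isdigit():  # ignore directories that are not PID-named
            pids.append(int(pid))
    return sorted(pids)
```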

Why launcher.py and sitecustomize.py live at the top level

The package code lives under hegm/, but we intentionally keep:

  • sitecustomize.py at the top level so that Python’s automatic sitecustomize import works as soon as /opt/hegm is on PYTHONPATH.
  • launcher.py at the top level so Kubernetes manifests can invoke it as /opt/hegm/launcher.py without needing to worry about Python module import paths.

Internally, both of these files are very thin:

  • sitecustomize.py just imports the hegm package.
  • launcher.py is a small script that wires up the environment and hands most of the logic off to the shared hegm package.

In other words, the reusable logic already lives in hegm/; the two top‑level scripts are just ergonomic entry points for Python and Kubernetes. We can still add more entry points later (e.g. hegm.launcher console script) without changing this layout.
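A minimal sketch of the launcher's supervision contract (exit code 0 = finished, 99 = checkpointed for migration, as described in the architecture section). This is simplified for illustration; the real launcher.py also buffers the checkpoint into RAM and manages the lock file:

```python
import os
import subprocess
import sys

def run_worker(argv, hegm_path="/opt/hegm"):
    """Spawn the training script with HeGM on PYTHONPATH (so sitecustomize.py
    loads), then classify its exit code per the launcher's contract."""
    env = dict(os.environ)
    env["PYTHONPATH"] = hegm_path + os.pathsep + env.get("PYTHONPATH", "")
    code = subprocess.call(argv, env=env)
    if code == 0:
        return "finished"
    if code == 99:
        return "checkpointed-for-migration"
    return "failed (%d)" % code

# Example: a stand-in worker that exits with the migration code.
print(run_worker([sys.executable, "-c", "import sys; sys.exit(99)"]))
# → checkpointed-for-migration
```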

Examples

Working DRA examples live under examples/dra/:

  • training-pod.yaml – single training job.
  • multiple-training-pod.yaml – two launcher.py processes in one Pod, useful for testing PID‑isolated checkpoints and locks.
  • restore-pod.yaml – example restore workflow using CRIU.

See docs/HeGM_Architecture.md for a more detailed architectural walk‑through.
