NVIDIA Resiliency Package
NVIDIA Resiliency Extension
The NVIDIA Resiliency Extension (NVRx) integrates multiple resiliency-focused solutions for PyTorch-based workloads. Users can modularly integrate NVRx capabilities into their own infrastructure to maximize AI training productivity at scale. NVRx maximizes goodput by enabling system-wide health checks, quickly detecting faults at runtime and resuming training automatically. NVRx minimizes loss of work by enabling fast and frequent checkpointing.
For detailed documentation and usage information about each component, please refer to https://nvidia.github.io/nvidia-resiliency-ext/.
⚠️ NOTE: This project is still experimental and under active development. The code, features, and documentation are evolving rapidly. Please expect frequent updates and breaking changes. Contributions are welcome and we encourage you to watch for updates.
Core Components and Capabilities
- Fault Tolerance
  - Detection of hung ranks.
  - Restarting training in-job, without the need to reallocate SLURM nodes.
- In-Process Restarting
  - Detecting failures and enabling quick recovery.
- Async Checkpointing
  - Providing an efficient framework for asynchronous checkpointing.
- Local Checkpointing
  - Providing an efficient framework for local checkpointing.
- Straggler Detection
  - Monitoring GPU and CPU performance of ranks.
  - Identifying slower ranks that may impede overall training efficiency.
- PyTorch Lightning Callbacks
  - Facilitating seamless NVRx integration with PyTorch Lightning.
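To make the idea of "identifying slower ranks" concrete, the following is a conceptual sketch only. It does not use the NVRx API; the function name `find_stragglers`, the input dictionary of per-rank step times, and the 0.7 threshold are all assumptions chosen for illustration. It scores each rank relative to the median step time and flags ranks that fall well below it.

```python
from statistics import median


def find_stragglers(step_times, threshold=0.7):
    """Return ranks whose relative performance score is below `threshold`.

    step_times: dict mapping rank -> average step time in seconds
                (hypothetical input, not an NVRx data structure).
    The score is median_time / rank_time, so a rank at the median scores 1.0
    and a rank taking twice the median time scores 0.5.
    """
    if not step_times:
        return []
    reference = median(step_times.values())
    stragglers = []
    for rank, seconds in sorted(step_times.items()):
        # A non-positive time is treated as "not slow" rather than dividing by zero.
        score = reference / seconds if seconds > 0 else float("inf")
        if score < threshold:
            stragglers.append(rank)
    return stragglers


if __name__ == "__main__":
    timings = {0: 1.00, 1: 1.02, 2: 0.98, 3: 1.90}
    print(find_stragglers(timings))
```

Comparing against the median rather than the mean keeps a single very slow rank from shifting the reference point. For the actual detection interface, refer to the NVRx documentation linked above.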
Installation
From sources
git clone https://github.com/NVIDIA/nvidia-resiliency-ext
cd nvidia-resiliency-ext
pip install .
From PyPI wheel
pip install nvidia-resiliency-ext
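After either install path, one way to confirm that the distribution is visible to the current interpreter is to query its metadata with the standard library. This is a minimal sketch; the helper name `installed_version` is hypothetical, and only the distribution name `nvidia-resiliency-ext` comes from the command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name="nvidia-resiliency-ext"):
    """Return the installed version string of `dist_name`, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    print(found if found else "nvidia-resiliency-ext is not installed")
```

Querying metadata avoids importing the package itself, so the check works even on a machine where PyTorch or CUDA is not yet set up.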
Platform Support
| Category | Supported Versions / Requirements |
|---|---|
| Architecture | x86_64, arm64 |
| Operating System | Ubuntu 22.04, 24.04 |
| Python Version | >= 3.10, < 3.13 |
| PyTorch Version | >= 2.3.1 (injob & chkpt), >= 2.5.1 (inprocess) |
| CUDA & CUDA Toolkit | >= 12.5 (12.8 required for GPU health check) |
| NVML Driver | >= 535 (570 required for GPU health check) |
| NCCL Version | >= 2.21.5 (injob & chkpt), >= 2.26.2 (inprocess) |
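The version bounds in the table can be checked before launching a job. The sketch below is an illustration, not part of NVRx: the helper names (`parse_version`, `meets_minimum`, `python_supported`) are hypothetical, and only the numeric bounds are taken from the table. It checks the Python range and the two PyTorch minimums; the same comparison could be applied to the NCCL and driver rows.

```python
import sys

# Minimums copied from the Platform Support table.
MIN_TORCH_INJOB_CHKPT = (2, 3, 1)
MIN_TORCH_INPROCESS = (2, 5, 1)


def parse_version(text):
    """Turn '2.5.1+cu124' into (2, 5, 1), ignoring local and pre-release suffixes."""
    parts = []
    for piece in str(text).split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if digits == "":
            break  # stop at the first non-numeric component, e.g. 'dev2024'
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(version_text, minimum):
    """True when the parsed version is at least `minimum` (a tuple of ints)."""
    return parse_version(version_text) >= minimum


def python_supported(info=None):
    """True when the interpreter is >= 3.10 and < 3.13."""
    source = info if info is not None else sys.version_info
    major_minor = tuple(source[:2])
    return (3, 10) <= major_minor < (3, 13)


if __name__ == "__main__":
    print("Python supported:", python_supported())
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed")
    else:
        v = torch.__version__
        print("PyTorch OK for injob & chkpt:", meets_minimum(v, MIN_TORCH_INJOB_CHKPT))
        print("PyTorch OK for inprocess:", meets_minimum(v, MIN_TORCH_INPROCESS))
```

The parser is deliberately simple; for strict PEP 440 handling, a dedicated version-parsing library would be a more robust choice.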