NVIDIA Resiliency Package

NVIDIA Resiliency Extension

The NVIDIA Resiliency Extension (NVRx) integrates multiple resiliency-focused solutions for PyTorch-based workloads. Users can modularly integrate NVRx capabilities into their own infrastructure to maximize AI training productivity at scale. NVRx maximizes goodput by enabling system-wide health checks, quickly detecting faults at runtime and resuming training automatically. NVRx minimizes loss of work by enabling fast and frequent checkpointing.

For detailed documentation and usage information about each component, please refer to https://nvidia.github.io/nvidia-resiliency-ext/.

⚠️ NOTE: This project is still experimental and under active development. The code, features, and documentation are evolving rapidly. Please expect frequent updates and breaking changes. Contributions are welcome and we encourage you to watch for updates.

Figure highlighting core NVRx features including automatic restart, hierarchical checkpointing, fault detection and health checks

Core Components and Capabilities

Installation

From sources

  • git clone https://github.com/NVIDIA/nvidia-resiliency-ext
  • cd nvidia-resiliency-ext
  • pip install .

From PyPI wheel

  • pip install nvidia-resiliency-ext
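After either installation route, it can be useful to confirm that the distribution is visible to the interpreter you intend to use. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of NVRx.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # The distribution name matches the one used in the pip command above.
    version = installed_version("nvidia-resiliency-ext")
    if version is None:
        print("nvidia-resiliency-ext is not installed in this environment")
    else:
        print(f"nvidia-resiliency-ext {version} is installed")
```

Running this inside the same virtual environment used for `pip install` avoids false negatives caused by checking a different interpreter.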

Platform Support

Category               Supported Versions / Requirements
Architecture           x86_64, arm64
Operating System       Ubuntu 22.04, 24.04
Python Version         >= 3.10, < 3.13
PyTorch Version        >= 2.3.1 (injob & chkpt), >= 2.5.1 (inprocess)
CUDA & CUDA Toolkit    >= 12.5 (12.8 required for GPU health check)
NVML Driver            >= 535 (570 required for GPU health check)
NCCL Version           >= 2.21.5 (injob & chkpt), >= 2.26.2 (inprocess)
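The Python, PyTorch, and architecture rows of the table above can be checked programmatically before a job is launched. The sketch below is illustrative only: the function names are hypothetical, the version parser is deliberately simple (it reads only the leading numeric components), and treating `amd64` and `aarch64` as aliases of the listed architectures is an assumption. CUDA, driver, and NCCL checks are left out.

```python
import platform
import re
import sys
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Extract the leading numeric components, e.g. '2.5.1+cu124' -> (2, 5, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"no numeric version found in {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def python_supported(version: Tuple[int, ...]) -> bool:
    """Table row: Python >= 3.10, < 3.13 (compared on major.minor)."""
    return (3, 10) <= tuple(version[:2]) < (3, 13)


def torch_supported(version: str, component: str) -> bool:
    """Table row: >= 2.3.1 for injob and chkpt, >= 2.5.1 for inprocess."""
    minimum = (2, 5, 1) if component == "inprocess" else (2, 3, 1)
    return parse_version(version) >= minimum


def arch_supported(machine: str) -> bool:
    """Table row: x86_64 or arm64; the alias spellings are an assumption."""
    return machine.lower() in {"x86_64", "amd64", "arm64", "aarch64"}


if __name__ == "__main__":
    print("python ok:", python_supported(sys.version_info[:3]))
    print("arch ok:", arch_supported(platform.machine()))
    try:
        import torch
    except ImportError:
        print("torch is not installed; skipping the PyTorch check")
    else:
        for component in ("injob", "chkpt", "inprocess"):
            ok = torch_supported(str(torch.__version__), component)
            print(f"torch ok for {component}:", ok)
```

Keeping each requirement in its own small function makes it easy to run only the checks relevant to the components a workload actually uses.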

Download files

Download the file for your platform.

Source Distributions

No source distribution files are available for this release.

Built Distributions

  • nvidia_resiliency_ext-0.4.0-cp312-cp312-manylinux_2_31_x86_64.whl (440.1 kB): CPython 3.12, manylinux (glibc 2.31+), x86-64
  • nvidia_resiliency_ext-0.4.0-cp312-cp312-manylinux_2_31_aarch64.whl (435.5 kB): CPython 3.12, manylinux (glibc 2.31+), ARM64
  • nvidia_resiliency_ext-0.4.0-cp311-cp311-manylinux_2_31_x86_64.whl (441.4 kB): CPython 3.11, manylinux (glibc 2.31+), x86-64
  • nvidia_resiliency_ext-0.4.0-cp311-cp311-manylinux_2_31_aarch64.whl (436.6 kB): CPython 3.11, manylinux (glibc 2.31+), ARM64
  • nvidia_resiliency_ext-0.4.0-cp310-cp310-manylinux_2_31_x86_64.whl (440.4 kB): CPython 3.10, manylinux (glibc 2.31+), x86-64
  • nvidia_resiliency_ext-0.4.0-cp310-cp310-manylinux_2_31_aarch64.whl (436.1 kB): CPython 3.10, manylinux (glibc 2.31+), ARM64
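The wheel names listed above follow the standard wheel naming layout of distribution, version, optional build tag, Python tag, ABI tag, and platform tag. As a rough sketch of how to read those fields when choosing a file, the snippet below splits a filename into its parts; `WheelTags`, `parse_wheel_filename`, and `fits` are hypothetical helpers, and the matching rule is a simplification of what pip actually does.

```python
from typing import NamedTuple


class WheelTags(NamedTuple):
    distribution: str
    version: str
    python_tag: str
    abi_tag: str
    platform_tag: str


def parse_wheel_filename(filename: str) -> WheelTags:
    """Split a wheel filename into its name, version, and compatibility tags."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    # Five fields, or six when an optional build tag follows the version.
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected number of fields in {filename!r}")
    python_tag, abi_tag, platform_tag = parts[-3:]
    return WheelTags(parts[0], parts[1], python_tag, abi_tag, platform_tag)


def fits(tags: WheelTags, python_tag: str, machine: str) -> bool:
    """Simplified match: same CPython tag and a platform tag ending in the machine name."""
    return tags.python_tag == python_tag and tags.platform_tag.endswith(machine)


if __name__ == "__main__":
    name = "nvidia_resiliency_ext-0.4.0-cp312-cp312-manylinux_2_31_x86_64.whl"
    tags = parse_wheel_filename(name)
    print(tags)
    print("fits CPython 3.12 on x86_64:", fits(tags, "cp312", "x86_64"))
```

In normal use `pip install nvidia-resiliency-ext` selects the right wheel automatically; reading the tags by hand is mainly useful when mirroring files or installing offline.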

File details

Details for the file nvidia_resiliency_ext-0.4.0-cp312-cp312-manylinux_2_31_x86_64.whl.

File hashes

Hashes for nvidia_resiliency_ext-0.4.0-cp312-cp312-manylinux_2_31_x86_64.whl
Algorithm Hash digest
SHA256 3ed370156f8a64565dca9469a989f92026f8947addf9521ef265687da4289589
MD5 e2fd9ee516f8e3fdbc26dd57cc3117f6
BLAKE2b-256 bf6e080d7446b11cbc89bf32fd1f00f78ba073c81f6782a45be557a331d6faca
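A downloaded wheel can be checked against the published SHA256 digest before it is installed. The sketch below uses only `hashlib` from the standard library; the helper names `sha256_of_file` and `matches_sha256` are hypothetical, and the command-line handling is just one possible way to call them.

```python
import hashlib
import sys


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_sha256(path: str, expected_hex: str) -> bool:
    """Compare a file's digest with an expected value, ignoring case and whitespace."""
    return sha256_of_file(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    # Usage (script name is arbitrary): python verify_wheel.py <wheel path> <expected sha256>
    if len(sys.argv) != 3:
        sys.exit("usage: verify_wheel.py <wheel path> <expected sha256>")
    wheel_path, expected = sys.argv[1], sys.argv[2]
    print("digest matches" if matches_sha256(wheel_path, expected) else "digest MISMATCH")
```

For the cp312 x86_64 wheel, the expected value would be the SHA256 entry listed above, `3ed370156f8a64565dca9469a989f92026f8947addf9521ef265687da4289589`. pip can also enforce digests itself through hash-checking mode with a requirements file.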

File details

Details for the file nvidia_resiliency_ext-0.4.0-cp312-cp312-manylinux_2_31_aarch64.whl.

File hashes

Hashes for nvidia_resiliency_ext-0.4.0-cp312-cp312-manylinux_2_31_aarch64.whl
Algorithm Hash digest
SHA256 097ee479ae9ae3e7f40456bf06a2bd8a05246df3dcb17aae2823aee3e7e08e27
MD5 c9a4b1c041520e001fd7cae6c157d9f6
BLAKE2b-256 c4cee06526126b00fe4e9761beb13b7b8fc2047e2a7dca06b6d95191fe728091

File details

Details for the file nvidia_resiliency_ext-0.4.0-cp311-cp311-manylinux_2_31_x86_64.whl.

File hashes

Hashes for nvidia_resiliency_ext-0.4.0-cp311-cp311-manylinux_2_31_x86_64.whl
Algorithm Hash digest
SHA256 499b5db52fcf416ddbe00d44f24b837b0aacd0586935312a13ffbb509bb8315b
MD5 c0744172a82f4d34b16a0791a571e0c7
BLAKE2b-256 41193d655cd6d294039bd369c3e9ce6ac5ec766cbabe4fda38cfaf0703118ba1

File details

Details for the file nvidia_resiliency_ext-0.4.0-cp311-cp311-manylinux_2_31_aarch64.whl.

File hashes

Hashes for nvidia_resiliency_ext-0.4.0-cp311-cp311-manylinux_2_31_aarch64.whl
Algorithm Hash digest
SHA256 8a661ccd54d2d3fa78d0e750a807d7d77245dfe66e33f7368ec418c8880fa4d2
MD5 aed524d9e39473ad159321d6a55e42fd
BLAKE2b-256 264992f44ca5e818c1adfb5aaaf18be12b2ef0391af0e5badf12e75008723672

File details

Details for the file nvidia_resiliency_ext-0.4.0-cp310-cp310-manylinux_2_31_x86_64.whl.

File hashes

Hashes for nvidia_resiliency_ext-0.4.0-cp310-cp310-manylinux_2_31_x86_64.whl
Algorithm Hash digest
SHA256 8a601bb437074c35552d4a7def69d1859972086ecdd1ebe18fdac1f98bb26f0c
MD5 cef0110bb8fb307747c9c27114d877c3
BLAKE2b-256 fabb702e4b9df71bbf5089370c71ca7389f741fa5aafa2d19ec80f362abe7f7a

File details

Details for the file nvidia_resiliency_ext-0.4.0-cp310-cp310-manylinux_2_31_aarch64.whl.

File hashes

Hashes for nvidia_resiliency_ext-0.4.0-cp310-cp310-manylinux_2_31_aarch64.whl
Algorithm Hash digest
SHA256 f83f8a8f2f86637d429e074d2d08d33dc4ebd3968c08cad3131cadc8f2b86890
MD5 dfecb8913c4791fee8abc82aa2f62c2e
BLAKE2b-256 a4a5ea0ae066c21432f3df573894524715a654f7a56dcfc9822635a495849b85
