torchft

Easy Per Step Fault Tolerance for PyTorch

| Documentation | Poster | Design Doc |


This repository implements techniques for per-step fault tolerance, so you can keep training when errors occur without interrupting the entire training job.

This is based on the large scale training techniques presented at PyTorch Conference 2024.

Overview

torchft is designed to provide the primitives required to implement fault tolerance in any application/train script as well as the primitives needed to implement custom fault tolerance strategies.

Out of the box, torchft provides the following algorithms:

  • Fault Tolerant DDP
  • Fault Tolerant HSDP: fault tolerance across the replicated dimension with any mix of FSDP/TP/etc across the other dimensions.
  • LocalSGD
  • DiLoCo

To implement these, torchft provides some key reusable components:

  1. Coordination primitives that can determine which workers are healthy via heartbeating on a per-step basis
  2. Fault tolerant ProcessGroup implementations that report errors sanely and can be reinitialized gracefully.
  3. Checkpoint transports that can be used to do live recovery from a healthy peer when doing scale up operations.
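
As a rough illustration of the first component, the sketch below tracks worker health from heartbeats. It is a toy model under stated assumptions, not torchft's implementation: the `HeartbeatTracker` class, its method names, and the timeout value are all hypothetical.

```python
import time
from typing import Dict, List, Optional


class HeartbeatTracker:
    """Toy health tracker: a worker is healthy if it sent a heartbeat recently."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s
        self._last_seen: Dict[str, float] = {}

    def heartbeat(self, worker_id: str, now: Optional[float] = None) -> None:
        # Record the time of the latest heartbeat from this worker.
        self._last_seen[worker_id] = time.monotonic() if now is None else now

    def healthy_workers(self, now: Optional[float] = None) -> List[str]:
        # A worker counts as healthy if its last heartbeat is within the timeout.
        current = time.monotonic() if now is None else now
        return sorted(
            worker
            for worker, seen in self._last_seen.items()
            if current - seen <= self.timeout_s
        )


tracker = HeartbeatTracker(timeout_s=5.0)
tracker.heartbeat("replica_0", now=0.0)
tracker.heartbeat("replica_1", now=0.0)
tracker.heartbeat("replica_0", now=4.0)  # replica_1 stops heartbeating

print(tracker.healthy_workers(now=6.0))  # ['replica_0']
```

Passing `now` explicitly keeps the example deterministic; a real coordinator would rely on a clock and network messages instead.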

The following component diagram shows the high level components and how they relate to each other:

Component Diagram

See torchft's documentation for more details.

Examples

torchtitan (Fault Tolerant HSDP)

torchtitan provides an out of the box fault tolerant HSDP training loop built on top of torchft that can be used to train models such as Llama 3 70B.

It also serves as a good example of how you can integrate torchft into your own training script for use with HSDP.

See torchtitan's documentation for end to end usage.

Fault Tolerant DDP

We have a minimal DDP train loop that highlights all of the key components in torchft.

See train_ddp.py for more info.

DiLoCo

LocalSGD and DiLoCo are currently experimental.

See the diloco_train_loop/local_sgd_train_loop tests for an example of how to integrate these algorithms into your training loop.
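
For readers unfamiliar with the idea, here is a minimal pure-Python sketch of LocalSGD: each replica takes several local steps without communicating, then all replicas average their parameters. It does not use torchft's API; the function name `local_sgd` and the scalar toy objectives are assumptions made for illustration. DiLoCo builds on the same local/sync pattern but is not shown here.

```python
from typing import Callable, List


def local_sgd(
    params: List[float],
    grad_fns: List[Callable[[float], float]],
    lr: float,
    sync_every: int,
    total_steps: int,
) -> List[float]:
    """Run SGD per replica, averaging parameters every `sync_every` steps."""
    params = list(params)
    for step in range(1, total_steps + 1):
        # Local step: each replica updates its own copy with no communication.
        params = [w - lr * grad(w) for w, grad in zip(params, grad_fns)]
        if step % sync_every == 0:
            # Sync step: replace every copy with the cross-replica average.
            mean = sum(params) / len(params)
            params = [mean] * len(params)
    return params


# Two replicas minimise (w - 1)^2 and (w - 3)^2; the shared optimum is w = 2.
grad_fns = [lambda w: 2 * (w - 1.0), lambda w: 2 * (w - 3.0)]
final = local_sgd([0.0, 0.0], grad_fns, lr=0.1, sync_every=4, total_steps=200)
print(final)
```

Synchronising only every few steps trades some gradient freshness for far less communication, which is why this family of algorithms tolerates slow or flaky links between replicas.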

Design

torchft is designed to allow for fault tolerance when training with replicated weights, such as in DDP or HSDP (FSDP with DDP).

See the design doc for the most detailed explanation.

Lighthouse

torchft implements a lighthouse server that coordinates across the different replica groups, along with a per-replica-group manager and a fault tolerance library that can be used in a standard PyTorch training loop.

This allows for membership changes at training-step granularity, which can greatly improve efficiency by avoiding stop-the-world restarts when errors occur.
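
To make step-granularity membership concrete, the toy below averages gradients over whichever replicas are alive in each step, so a failure shrinks the group for that step instead of halting the job. This is a simplified sketch, not torchft's mechanism: the function `average_over_live` and the use of `None` to mark a failed replica are hypothetical.

```python
from typing import Dict, Optional


def average_over_live(grads: Dict[str, Optional[float]]) -> float:
    """Average gradients from replicas that reported one; None marks a failure."""
    live = [g for g in grads.values() if g is not None]
    if not live:
        raise RuntimeError("no healthy replicas in this step")
    return sum(live) / len(live)


weight = 1.0
lr = 0.5
steps = [
    {"replica_0": 0.2, "replica_1": 0.4},   # both replicas healthy
    {"replica_0": 0.2, "replica_1": None},  # replica_1 fails during this step
    {"replica_0": 0.2, "replica_1": 0.4},   # replica_1 has rejoined
]
for grads in steps:
    # Training continues every step with whoever is currently a member.
    weight -= lr * average_over_live(grads)

print(weight)
```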

Lighthouse Diagram

Fault Tolerant HSDP Algorithm

torchft provides an implementation of a fault tolerant HSDP/DDP algorithm. The following diagram shows the high level operations that need to happen in the train loop to ensure everything stays consistent during a healing operation.

HSDP Diagram

See the design doc linked above for more details.
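
The healing idea can be sketched in a few lines. The example assumes that a lagging replica copies its state from the replica with the highest step count before rejoining; that selection rule, the `Replica` dataclass, and the `heal` function are illustrative assumptions, and the design doc remains the authoritative description.

```python
import copy
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Replica:
    name: str
    step: int
    weights: Dict[str, float] = field(default_factory=dict)


def heal(replicas: List[Replica]) -> List[str]:
    """Bring lagging replicas up to date from the most advanced replica."""
    source = max(replicas, key=lambda r: r.step)
    healed = []
    for replica in replicas:
        if replica.step < source.step:
            # Copy weights and step so the replica rejoins in a consistent state.
            replica.weights = copy.deepcopy(source.weights)
            replica.step = source.step
            healed.append(replica.name)
    return healed


replicas = [
    Replica("replica_0", step=10, weights={"w": 0.7}),
    Replica("replica_1", step=10, weights={"w": 0.7}),
    Replica("replica_2", step=0, weights={"w": 0.0}),  # just joined
]
print(heal(replicas))  # ['replica_2']
```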

Installing from PyPI

We have nightly builds available at https://pypi.org/project/torchft-nightly/

To install torchft with minimal dependencies you can run:

pip install torchft-nightly

If you want all development dependencies you can install:

pip install torchft-nightly[dev]

Installing from Source

Prerequisites

Before proceeding, ensure you have the following installed:

  • Rust (with necessary dependencies)
  • protobuf-compiler and the corresponding development package for Protobuf.
  • PyTorch 2.7 RC+ or Nightly

Note that the Rust versions available in many conda environments may be outdated. To install the latest version of Rust, we recommend downloading it directly from the official website using the command below:

curl --proto '=https' --tlsv1.2 https://sh.rustup.rs -sSf | sh

To install the required packages on a Debian-based system (such as Ubuntu) using apt, run:

sudo apt install protobuf-compiler libprotobuf-dev

or for a Red Hat-based system, run:

sudo dnf install protobuf-compiler protobuf-devel

Installation

pip install .

This uses pyo3 and maturin to build the package, so you'll need maturin installed.

If the installation command fails to invoke cargo update due to an inability to fetch the manifest, it may be caused by the proxy, proxySSLCert, and proxySSLKey settings in your .gitconfig file affecting the cargo command. To resolve this issue, try temporarily removing these fields from your .gitconfig before running the installation command.
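
If you are unsure whether your .gitconfig contains these fields, a small helper can list them. This is only a convenience sketch: the function name `find_proxy_settings` is hypothetical, and the scan is a plain line match rather than a full Git config parser.

```python
import re
from typing import List

# Match the three keys named above; Git config keys are case-insensitive.
_PROXY_KEY = re.compile(r"^\s*(proxy|proxySSLCert|proxySSLKey)\s*=", re.IGNORECASE)


def find_proxy_settings(gitconfig_text: str) -> List[str]:
    """Return the lines of a .gitconfig that set proxy-related keys."""
    return [
        line.strip()
        for line in gitconfig_text.splitlines()
        if _PROXY_KEY.match(line)
    ]


# Made-up sample content for demonstration.
sample = """[user]
    name = Example
[http]
    proxy = http://proxy.example:8080
    proxySSLCert = /path/to/cert.pem
"""
print(find_proxy_settings(sample))
```

To check your own configuration, pass the contents of the file instead, for example `find_proxy_settings((pathlib.Path.home() / ".gitconfig").read_text())` after importing `pathlib`.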

To install in editable mode with the Rust extensions and development dependencies, you can use the normal pip install command:

pip install -e '.[dev]'

Usage

Lighthouse

The lighthouse is used for fault tolerance across replicated workers (DDP/FSDP) when using synchronous training.

You can start a lighthouse server by running:

RUST_BACKTRACE=1 torchft_lighthouse --min_replicas 1 --quorum_tick_ms 100 --join_timeout_ms 10000

Example Training Loop (DDP)

See train_ddp.py for the full example.

Invoke with:

TORCHFT_LIGHTHOUSE=http://localhost:29510 torchrun --master_port 29501 --nnodes 1 --nproc_per_node 1 train_ddp.py

train.py:

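import torch
from torch import nn, optim

from torchft import Manager, DistributedDataParallel, Optimizer, ProcessGroupGloo

# ProcessGroupGloo is used here, so keep tensors on the CPU.
device = torch.device("cpu")

# The manager coordinates this replica group with the lighthouse.
# The two callbacks (elided here) save and restore training state during recovery.
manager = Manager(
    pg=ProcessGroupGloo(),
    load_state_dict=...,
    state_dict=...,
)

# Wrap the model and optimizer so they take part in fault tolerant training.
m = nn.Linear(2, 3)
m = DistributedDataParallel(manager, m)
optimizer = Optimizer(manager, optim.AdamW(m.parameters()))

for i in range(1000):
    batch = torch.rand(2, 2, device=device)

    optimizer.zero_grad()

    out = m(batch)
    loss = out.sum()

    loss.backward()

    optimizer.step()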
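
The `state_dict` and `load_state_dict` arguments are left elided above. One plausible shape for such a callback pair is sketched below, using a hypothetical `TinyModule` stand-in so the example runs without PyTorch or torchft; the `"model"` and `"optim"` keys and the `make_callbacks` helper are assumptions for illustration. See train_ddp.py for the callbacks the project actually uses.

```python
from typing import Any, Callable, Dict, Tuple


class TinyModule:
    """Hypothetical stand-in for an object with state_dict()/load_state_dict()."""

    def __init__(self, **values: float) -> None:
        self.values: Dict[str, float] = dict(values)

    def state_dict(self) -> Dict[str, float]:
        return dict(self.values)

    def load_state_dict(self, state: Dict[str, float]) -> None:
        self.values = dict(state)


def make_callbacks(
    model: TinyModule, optimizer: TinyModule
) -> Tuple[Callable[[], Dict[str, Any]], Callable[[Dict[str, Any]], None]]:
    def state_dict() -> Dict[str, Any]:
        # Snapshot everything a recovering replica needs to resume training.
        return {"model": model.state_dict(), "optim": optimizer.state_dict()}

    def load_state_dict(state: Dict[str, Any]) -> None:
        # Restore a snapshot produced by state_dict() on a healthy peer.
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optim"])

    return state_dict, load_state_dict


# A healthy replica and a freshly restarted one.
healthy_state, _ = make_callbacks(TinyModule(w=0.7), TinyModule(lr=0.01))
fresh_model, fresh_optim = TinyModule(w=0.0), TinyModule(lr=0.01)
_, fresh_load = make_callbacks(fresh_model, fresh_optim)

fresh_load(healthy_state())
print(fresh_model.values)  # {'w': 0.7}
```

Bundling model and optimizer state into one dictionary means a recovering replica receives both in a single transfer.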

Running DDP

After starting the lighthouse server by running:

RUST_BACKTRACE=1 torchft_lighthouse --min_replicas 1 --quorum_tick_ms 100 --join_timeout_ms 10000

A test DDP script can then be launched using torchX:

torchx run

Or DiLoCo with:

USE_STREAMING=True torchx run ./torchft/torchx.py:hsdp --script='train_diloco.py'

See .torchxconfig, torchx.py and the torchX documentation to understand how DDP is being run.

torchx.py can also launch HSDP jobs when workers_per_replica is set to a value greater than 1, if the training script supports it. For an example HSDP training implementation with torchft enabled, see torchtitan.

Alternatively, to test on a node with two GPUs, you can launch two replica groups running train_ddp.py as follows:

On shell 1 (the first replica group starts initial training):

export REPLICA_GROUP_ID=0
export NUM_REPLICA_GROUPS=2

CUDA_VISIBLE_DEVICES=0 TORCHFT_LIGHTHOUSE=http://localhost:29510 torchrun --master_port=29600 --nnodes=1 --nproc_per_node=1 -- train_ddp.py

On shell 2 (a second replica group joins):

export REPLICA_GROUP_ID=1
export NUM_REPLICA_GROUPS=2

CUDA_VISIBLE_DEVICES=1 TORCHFT_LIGHTHOUSE=http://localhost:29510 torchrun --master_port=29601 --nnodes=1 --nproc_per_node=1 -- train_ddp.py

In the outputs from both shells, you should observe process group reconfiguration and live checkpoint recovery.
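
The REPLICA_GROUP_ID and NUM_REPLICA_GROUPS variables exported above are available to the training script. How train_ddp.py consumes them is not shown here; the sketch below shows one assumed use, giving each replica group a distinct slice of the data. The function `replica_shard` and the strided sharding scheme are hypothetical.

```python
import os
from typing import List, Mapping


def replica_shard(items: List[int], env: Mapping[str, str] = os.environ) -> List[int]:
    """Return the slice of `items` owned by this replica group."""
    group_id = int(env.get("REPLICA_GROUP_ID", "0"))
    num_groups = int(env.get("NUM_REPLICA_GROUPS", "1"))
    if not 0 <= group_id < num_groups:
        raise ValueError(f"REPLICA_GROUP_ID={group_id} is out of range")
    # Strided sharding: group k takes items k, k + n, k + 2n, ...
    return items[group_id::num_groups]


samples = list(range(6))
shell_1 = {"REPLICA_GROUP_ID": "0", "NUM_REPLICA_GROUPS": "2"}
shell_2 = {"REPLICA_GROUP_ID": "1", "NUM_REPLICA_GROUPS": "2"}

print(replica_shard(samples, shell_1))  # [0, 2, 4]
print(replica_shard(samples, shell_2))  # [1, 3, 5]
```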

Example Parameter Server

torchft has a fault tolerant parameter server implementation built on its reconfigurable ProcessGroups. This does not require/use a Lighthouse server.

See parameter_server_test.py for an example.

Contributing

We welcome PRs! See the CONTRIBUTING file.

License

torchft is BSD 3-Clause licensed. See LICENSE for more details.
