torchft-nightly

torchft

Easy Per Step Fault Tolerance for PyTorch

| Documentation | Poster | Design Doc |

This repository implements techniques for per-step fault tolerance, so you can keep training when errors occur without interrupting the entire training job.

This is based on the large scale training techniques presented at PyTorch Conference 2024.

Overview

torchft is designed to provide the primitives required to implement fault tolerance in any application/train script as well as the primitives needed to implement custom fault tolerance strategies.

Out of the box, torchft provides the following algorithms:

Fault Tolerant DDP
Fault Tolerant HSDP: fault tolerance across the replicated dimension with any mix of FSDP/TP/etc across the other dimensions.
LocalSGD
DiLoCo

To implement these, torchft provides some key reusable components:

Coordination primitives that can determine which workers are healthy via heartbeating on a per-step basis
Fault tolerant ProcessGroup implementations that report errors sanely and can be reinitialized gracefully.
Checkpoint transports that can be used to do live recovery from a healthy peer when doing scale up operations.
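
To make the heartbeat idea concrete, here is a minimal sketch of heartbeat-based health tracking. It is not torchft's implementation: the `HeartbeatTracker` class, its `timeout_s` value, and the replica names are all hypothetical, and the sketch only illustrates the general technique of treating a worker as healthy while its last heartbeat is recent.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class HeartbeatTracker:
    """Toy health tracker: a worker is healthy if it heartbeated recently."""

    timeout_s: float = 5.0  # hypothetical value, not a torchft default
    last_seen: Dict[str, float] = field(default_factory=dict)

    def heartbeat(self, worker_id: str, now: Optional[float] = None) -> None:
        # Record the time of the latest heartbeat from this worker.
        self.last_seen[worker_id] = time.monotonic() if now is None else now

    def healthy(self, now: Optional[float] = None) -> List[str]:
        # A worker counts as healthy while its last heartbeat is within the timeout.
        now = time.monotonic() if now is None else now
        return sorted(w for w, t in self.last_seen.items() if now - t <= self.timeout_s)


tracker = HeartbeatTracker(timeout_s=5.0)
tracker.heartbeat("replica_0", now=0.0)
tracker.heartbeat("replica_1", now=0.0)
tracker.heartbeat("replica_0", now=4.0)  # replica_1 goes silent after t=0
print(tracker.healthy(now=6.0))  # ['replica_0']
```

Passing `now` explicitly keeps the example deterministic; a real coordinator would rely on a monotonic clock.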

The following component diagram shows the high level components and how they relate to each other:

Component Diagram

See torchft's documentation for more details.

Examples

torchtitan (Fault Tolerant HSDP)

torchtitan provides an out of the box fault tolerant HSDP training loop built on top of torchft that can be used to train models such as Llama 3 70B.

It also serves as a good example of how you can integrate torchft into your own training script for use with HSDP.

See torchtitan's documentation for end to end usage.

Fault Tolerant DDP

We have a minimal DDP train loop that highlights all of the key components in torchft.

See train_ddp.py for more info.

DiLoCo

LocalSGD and DiLoCo are currently experimental.

See the diloco_train_loop/local_sgd_train_loop tests for an example of how to integrate these algorithms into your training loop.
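
For readers unfamiliar with the two algorithms, the following toy sketch shows the general idea on scalar parameters. It does not use torchft's API. All function and variable names are hypothetical, and plain SGD stands in for both the inner and the outer optimizer, which is a simplification of DiLoCo as usually described.

```python
def local_sgd_round(params, grad_fn, lr, sync_every):
    """Run `sync_every` local SGD steps per replica, then average parameters."""
    for _ in range(sync_every):
        params = [p - lr * grad_fn(i, p) for i, p in enumerate(params)]
    mean = sum(params) / len(params)
    return [mean] * len(params)  # every replica restarts from the average


def diloco_round(global_param, grad_fn, num_replicas, inner_lr, outer_lr, sync_every):
    """Inner steps from a shared starting point, then one outer step on the
    averaged pseudo-gradient (global minus local parameters)."""
    local = [global_param] * num_replicas
    for _ in range(sync_every):
        local = [p - inner_lr * grad_fn(i, p) for i, p in enumerate(local)]
    pseudo_grad = sum(global_param - p for p in local) / num_replicas
    return global_param - outer_lr * pseudo_grad  # plain SGD as outer optimizer


# Replica i minimizes (p - targets[i]) ** 2, so its gradient is 2 * (p - targets[i]).
targets = [1.0, 3.0]


def grad(i, p):
    return 2.0 * (p - targets[i])


params = [0.0, 0.0]
for _ in range(50):
    params = local_sgd_round(params, grad, lr=0.1, sync_every=4)
print(params)  # both replicas hold the same value, close to 2.0

global_param = 0.0
for _ in range(50):
    global_param = diloco_round(global_param, grad, num_replicas=2,
                                inner_lr=0.1, outer_lr=0.7, sync_every=4)
print(global_param)  # close to 2.0
```

In both cases the replicas communicate only once every `sync_every` steps, which is the property that makes these algorithms attractive when synchronization is expensive.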

Design

torchft is designed to allow for fault tolerance when training with replicated weights, such as in DDP or HSDP (FSDP with DDP).

See the design doc for the most detailed explanation.

Lighthouse

torchft implements a lighthouse server that coordinates across the different replica groups and then a per replica group manager and fault tolerance library that can be used in a standard PyTorch training loop.

This allows membership to change at training-step granularity, which can greatly improve efficiency by avoiding a stop-the-world restart of training when errors occur.

Lighthouse Diagram
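
As a rough illustration of how a coordinator might decide when a quorum is ready, consider the sketch below. It is an assumption-laden simplification, not the lighthouse's actual logic: the `compute_quorum` function is hypothetical, and the `min_replicas` and `join_timeout_ms` parameters only borrow their names from the flags of the torchft_lighthouse command.

```python
from typing import Dict, List, Optional, Set


def compute_quorum(
    joined: Dict[str, float],  # replica id -> time (ms) at which it asked to join
    healthy: Set[str],         # replicas with a recent heartbeat
    now_ms: float,
    min_replicas: int,
    join_timeout_ms: float,
) -> Optional[List[str]]:
    """Return the sorted quorum members, or None if no quorum can form yet."""
    # Only replicas that are both waiting to join and still alive can take part.
    candidates = {r: t for r, t in joined.items() if r in healthy}
    if not candidates or len(candidates) < min_replicas:
        return None  # not enough live participants
    everyone_joined = healthy.issubset(candidates)
    waited_long_enough = now_ms - min(candidates.values()) >= join_timeout_ms
    if everyone_joined or waited_long_enough:
        return sorted(candidates)
    return None  # give stragglers a chance to join before the timeout


healthy = {"replica_0", "replica_1"}
joined = {"replica_0": 0.0}
print(compute_quorum(joined, healthy, 100, min_replicas=1, join_timeout_ms=10_000))     # None
print(compute_quorum(joined, healthy, 10_000, min_replicas=1, join_timeout_ms=10_000))  # ['replica_0']
```

The trade-off this sketch tries to show: waiting briefly for stragglers yields larger quorums, while the timeout bounds how long healthy replicas sit idle.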

Fault Tolerant HSDP Algorithm

torchft provides an implementation of a fault tolerant HSDP/DDP algorithm. The following diagram shows the high level operations that need to happen in the train loop to ensure everything stays consistent during a healing operation.

HSDP Diagram

See the design doc linked above for more details.
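
The per-step flow can be approximated with a small simulation. This is a toy model under stated assumptions, not torchft's algorithm: the `Replica` class and `train_step` function are hypothetical, weights are single floats, and healing is modeled as copying state from the most advanced live peer.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Replica:
    name: str
    step: int = 0
    weight: float = 0.0
    alive: bool = True


def train_step(replicas: List[Replica], grads: Dict[str, float], lr: float = 0.1) -> None:
    """One step: form a quorum, heal stragglers, average gradients, commit."""
    quorum = [r for r in replicas if r.alive]
    if not quorum:
        return
    # Heal: replicas that are behind copy step and weights from the most advanced peer.
    leader = max(quorum, key=lambda r: r.step)
    for r in quorum:
        if r.step < leader.step:
            r.step, r.weight = leader.step, leader.weight
    # Average gradients over the quorum only, so a dead replica does not block the step.
    avg_grad = sum(grads[r.name] for r in quorum) / len(quorum)
    # Commit: every quorum member applies the same update and advances together.
    for r in quorum:
        r.weight -= lr * avg_grad
        r.step += 1


a, b = Replica("a"), Replica("b")
grads = {"a": 1.0, "b": 3.0}
train_step([a, b], grads)  # both healthy
b.alive = False
train_step([a, b], grads)  # b has failed: a continues alone
b.alive = True
train_step([a, b], grads)  # b rejoins, heals from a, then trains
print(a.step, b.step, a.weight == b.weight)  # 3 3 True
```

The point of the sketch is the invariant: after every committed step, all members of the quorum hold identical weights and the same step count.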

Installing from PyPI

We have nightly builds available at https://pypi.org/project/torchft-nightly/

To install torchft with minimal dependencies you can run:

pip install torchft-nightly

If you want all development dependencies you can install:

pip install torchft-nightly[dev]

Installing from Source

Prerequisites

Before proceeding, ensure you have the following installed:

Rust (with necessary dependencies)
protobuf-compiler and the corresponding development package for Protobuf.
PyTorch 2.7 RC+ or Nightly

Note that the Rust versions available in many conda environments may be outdated. To install the latest version of Rust, we recommend downloading it directly from the official website using the command below:

curl --proto '=https' --tlsv1.2 https://sh.rustup.rs -sSf | sh

To install the required packages on a Debian-based system (such as Ubuntu) using apt, run:

sudo apt install protobuf-compiler libprotobuf-dev

or for a Red Hat-based system, run:

sudo dnf install protobuf-compiler protobuf-devel

Installation

pip install .

This uses pyo3 and maturin to build the package, so you'll need maturin installed.

If the installation command fails to invoke cargo update due to an inability to fetch the manifest, it may be caused by the proxy, proxySSLCert, and proxySSLKey settings in your .gitconfig file affecting the cargo command. To resolve this issue, try temporarily removing these fields from your .gitconfig before running the installation command.

To install in editable mode with the Rust extensions and development dependencies, you can use the normal pip install command:

pip install -e '.[dev]'

Usage

Lighthouse

The lighthouse is used for fault tolerance across replicated workers (DDP/FSDP) when using synchronous training.

You can start a lighthouse server by running:

RUST_BACKTRACE=1 torchft_lighthouse --min_replicas 1 --quorum_tick_ms 100 --join_timeout_ms 10000

Example Training Loop (DDP)

See train_ddp.py for the full example.

Invoke with:

TORCHFT_LIGHTHOUSE=http://localhost:29510 torchrun --master_port 29501 --nnodes 1 --nproc_per_node 1 train_ddp.py

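train_ddp.py (abridged):

import torch
from torch import nn, optim

from torchft import Manager, DistributedDataParallel, Optimizer, ProcessGroupGloo

# Assumption for this excerpt: tensors stay on the CPU to match ProcessGroupGloo.
device = "cpu"

# The manager coordinates quorum and recovery for this replica group.
manager = Manager(
    pg=ProcessGroupGloo(),
    load_state_dict=...,  # callback that restores state when healing from a peer
    state_dict=...,  # callback that returns the state to send to a peer
)

m = nn.Linear(2, 3)
m = DistributedDataParallel(manager, m)
optimizer = Optimizer(manager, optim.AdamW(m.parameters()))

for i in range(1000):
    batch = torch.rand(2, 2, device=device)

    # The wrapped optimizer ties each step to the manager's quorum.
    optimizer.zero_grad()

    out = m(batch)
    loss = out.sum()

    loss.backward()

    optimizer.step()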

Running DDP

After starting the lighthouse server by running:

RUST_BACKTRACE=1 torchft_lighthouse --min_replicas 1 --quorum_tick_ms 100 --join_timeout_ms 10000

A test DDP script can be launched with torchX:

torchx run

Or DiLoCo with:

USE_STREAMING=True torchx run ./torchft/torchx.py:hsdp --script='train_diloco.py'

See .torchxconfig, torchx.py and the torchX documentation to understand how DDP is being run.

torchx.py can also launch HSDP jobs when workers_per_replica is set to a value greater than 1, provided the training script supports it. For an example HSDP training implementation with torchft enabled, see torchtitan.

Alternatively, to test on a node with two GPUs, you can launch two replica groups running train_ddp.py by:

On shell 1 (one replica group starts initial training):

export REPLICA_GROUP_ID=0
export NUM_REPLICA_GROUPS=2

CUDA_VISIBLE_DEVICES=0 TORCHFT_LIGHTHOUSE=http://localhost:29510 torchrun --master_port=29600 --nnodes=1 --nproc_per_node=1 -- train_ddp.py

On shell 2 (a second replica group joins):

export REPLICA_GROUP_ID=1
export NUM_REPLICA_GROUPS=2

CUDA_VISIBLE_DEVICES=1 TORCHFT_LIGHTHOUSE=http://localhost:29510 torchrun --master_port=29601 --nnodes=1 --nproc_per_node=1 -- train_ddp.py

In the output of both shells, you should see process group reconfiguration and live checkpoint recovery.
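
The two environment variables above tell each process which replica group it belongs to. As a hedged sketch of how a training script might use them, the hypothetical `replica_shard` helper below splits a dataset so that each replica group sees a disjoint slice. How train_ddp.py actually consumes these variables is not shown here, and the default values in the sketch are assumptions.

```python
import os
from typing import Mapping


def replica_shard(num_samples: int, environ: Mapping[str, str] = os.environ) -> range:
    """Return the sample indices owned by this replica group."""
    group_id = int(environ.get("REPLICA_GROUP_ID", "0"))      # assumed default
    num_groups = int(environ.get("NUM_REPLICA_GROUPS", "1"))  # assumed default
    if not 0 <= group_id < num_groups:
        raise ValueError(
            f"REPLICA_GROUP_ID={group_id} is out of range for {num_groups} groups"
        )
    # Strided sharding: group g takes samples g, g + num_groups, g + 2 * num_groups, ...
    return range(group_id, num_samples, num_groups)


env = {"REPLICA_GROUP_ID": "1", "NUM_REPLICA_GROUPS": "2"}
print(list(replica_shard(10, env)))  # [1, 3, 5, 7, 9]
```

Passing the environment as an argument keeps the helper easy to test without mutating the real process environment.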

Example Parameter Server

torchft has a fault tolerant parameter server implementation built on its reconfigurable ProcessGroups. This does not require or use a Lighthouse server.

See parameter_server_test.py for an example.

Contributing

We welcome PRs! See the CONTRIBUTING file.

License

torchft is BSD 3-Clause licensed. See LICENSE for more details.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release. See tutorial on generating distribution archives.

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

torchft_nightly-2026.5.21-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.5 MB)

Uploaded May 21, 2026. CPython 3.13, manylinux: glibc 2.17+ x86-64

torchft_nightly-2026.5.21-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.5 MB)

Uploaded May 21, 2026. CPython 3.12, manylinux: glibc 2.17+ x86-64

torchft_nightly-2026.5.21-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.5 MB)

Uploaded May 21, 2026. CPython 3.11, manylinux: glibc 2.17+ x86-64

torchft_nightly-2026.5.21-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.5 MB)

Uploaded May 21, 2026. CPython 3.10, manylinux: glibc 2.17+ x86-64

torchft_nightly-2026.5.21-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.5 MB)

Uploaded May 21, 2026. CPython 3.9, manylinux: glibc 2.17+ x86-64

File details

Details for the file torchft_nightly-2026.5.21-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

Download URL: torchft_nightly-2026.5.21-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Upload date: May 21, 2026
Size: 2.5 MB
Tags: CPython 3.13, manylinux: glibc 2.17+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.20

File hashes

Hashes for torchft_nightly-2026.5.21-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm	Hash digest
SHA256	`e6ccf142cf9308cb880dadeab4593448cf2d0d6f11db9fdb74ae40f14aada627`
MD5	`737f44f2da6144082979bedbd617306f`
BLAKE2b-256	`905680eaa5b8903895bcae3fce61d086c3cf7bdd97292753c91a3660e639be8f`

See more details on using hashes here.
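
A downloaded wheel can be checked against the SHA256 digest listed above with Python's standard library. The helper below is a minimal sketch: the `sha256_matches` name is hypothetical, and the example assumes the wheel was saved to the current directory under its original file name.

```python
import hashlib
from pathlib import Path


def sha256_matches(path: str, expected_hex: str, chunk_size: int = 1 << 20) -> bool:
    """Return True if the file's SHA256 digest equals `expected_hex`."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as f:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex.lower()


wheel = "torchft_nightly-2026.5.21-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl"
expected = "e6ccf142cf9308cb880dadeab4593448cf2d0d6f11db9fdb74ae40f14aada627"
if Path(wheel).exists():  # only runs if the wheel is present in the current directory
    print(sha256_matches(wheel, expected))
```

A mismatch means the file is corrupted or is not the file this page describes, and it should not be installed.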

File details

Details for the file torchft_nightly-2026.5.21-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

Download URL: torchft_nightly-2026.5.21-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Upload date: May 21, 2026
Size: 2.5 MB
Tags: CPython 3.12, manylinux: glibc 2.17+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.20

File hashes

Hashes for torchft_nightly-2026.5.21-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm	Hash digest
SHA256	`a4fa4d246ac9304b20bdd6cfbe2829624f784cb3cc8a8065aff85b43446e9588`
MD5	`00fc661837d3189216f2d75576b8bd78`
BLAKE2b-256	`095c77a32a2afd6ae3e233e042f3b5d9bc09329c0c08a32b089c83036abe9d7a`

See more details on using hashes here.

File details

Details for the file torchft_nightly-2026.5.21-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

Download URL: torchft_nightly-2026.5.21-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Upload date: May 21, 2026
Size: 2.5 MB
Tags: CPython 3.11, manylinux: glibc 2.17+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.20

File hashes

Hashes for torchft_nightly-2026.5.21-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm	Hash digest
SHA256	`0c48ab67ed974b2efc9ad2f4edecc843076ff7817f3fde2bd3d680872a76bcf4`
MD5	`a93dd1e7d0f6cf08c868a2f89cdd39f8`
BLAKE2b-256	`dfa2c208e3edf6779805f222add76999db374504b69a2e38074b5cd1e6b4dc24`

See more details on using hashes here.

File details

Details for the file torchft_nightly-2026.5.21-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

Download URL: torchft_nightly-2026.5.21-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Upload date: May 21, 2026
Size: 2.5 MB
Tags: CPython 3.10, manylinux: glibc 2.17+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.20

File hashes

Hashes for torchft_nightly-2026.5.21-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm	Hash digest
SHA256	`e955fa54d60d25b45602075e94c66a36365d1ca29e0c034404257cbba023b036`
MD5	`74b8331730c59d39eb06f9e32b845f62`
BLAKE2b-256	`865bb721486a50b97be9db9092839b29015066d28c31673873b80afd4e142c7e`

See more details on using hashes here.

File details

Details for the file torchft_nightly-2026.5.21-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

Download URL: torchft_nightly-2026.5.21-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Upload date: May 21, 2026
Size: 2.5 MB
Tags: CPython 3.9, manylinux: glibc 2.17+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.20

File hashes

Hashes for torchft_nightly-2026.5.21-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm	Hash digest
SHA256	`b1f7da8d89d2ebada41aa1648b33ebca32d94d85a9572bd1d8dfa727ec6cce57`
MD5	`9c278c7ac4e404fc370a0a684a57cf7f`
BLAKE2b-256	`466a09ba5ac4442550a689234f84c05f88ac29bebd70905f44aedfe2bc39e2d8`

See more details on using hashes here.
