
torchft

Easy Per Step Fault Tolerance for PyTorch

| Documentation | Poster | Design Doc |


⚠️ WARNING: This is an alpha prototype for PyTorch fault tolerance and may have bugs or breaking changes, as it is under active development. We'd love to collaborate, and contributions are welcome. Please reach out if you're interested in torchft or want to discuss fault tolerance in PyTorch.

This repository implements techniques for per-step fault tolerance, so training can continue when errors occur without restarting the entire training job.

It is based on the large-scale training techniques presented at PyTorch Conference 2024.

Design

torchft is designed to provide fault tolerance when training with replicated weights, such as in DDP or HSDP (FSDP combined with DDP).

torchft implements a lighthouse server that coordinates across the different replica groups, plus a per-replica-group manager and fault tolerance library that can be used in a standard PyTorch training loop.

This allows membership changes at training-step granularity, which can greatly improve efficiency by avoiding stop-the-world recovery on errors.
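
Concretely, each training step runs under a quorum: the manager first agrees on which replica groups are participating, gradients are averaged only within that quorum, and the optimizer update is committed only if the step completed cleanly. The pseudocode below illustrates this flow; the method names (start_quorum, should_commit) are illustrative and may not match the exact Manager API:

# Pseudocode: per-step fault tolerance (illustrative method names)
for step in range(num_steps):
    manager.start_quorum()       # agree on the current set of live replica groups
    out = model(batch)
    loss = out.sum()
    loss.backward()              # gradients are allreduced within the quorum
    if manager.should_commit():  # commit only if no participant failed mid-step
        optimizer.step()
    optimizer.zero_grad()

If a replica group fails mid-step, the update is skipped and the remaining groups form a new quorum on the next step instead of restarting the whole job.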

Prerequisites

Before proceeding, ensure you have the following installed:

  • Rust (with necessary dependencies)
  • protobuf-compiler and the corresponding development package for Protobuf.

Note that the Rust versions available in many conda environments may be outdated. To install the latest version of Rust, we recommend downloading it directly from the official website, as shown in the command below:

$ curl --proto '=https' --tlsv1.2 https://sh.rustup.rs -sSf | sh

To install the required packages on a Debian-based system (such as Ubuntu) using apt, run:

sudo apt install protobuf-compiler libprotobuf-dev

or for a Red Hat-based system, run:

sudo dnf install protobuf-compiler protobuf-devel

Installation

$ pip install .

This uses pyo3 and maturin to build the package; you'll need maturin installed.

If the installation command fails to invoke cargo update because it cannot fetch the manifest, the cause may be the proxy, proxySSLCert, and proxySSLKey settings in your .gitconfig file affecting the cargo command. To resolve this, temporarily remove these fields from your .gitconfig before running the installation command (see the example below).
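
For example, a .gitconfig containing a section like the following (the values here are hypothetical) can trigger the failure; comment out or remove these lines while installing, then restore them afterwards:

[http]
    # These settings can interfere with cargo's manifest fetch:
    proxy = http://proxy.example.com:8080
    proxySSLCert = /path/to/client-cert.pem
    proxySSLKey = /path/to/client-key.pem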

To install in editable mode with the Rust extensions, use the normal pip install command:

$ pip install -e .

Usage

Lighthouse

The lighthouse is used for fault tolerance across replicated workers (DDP/FSDP) when using synchronous training.

You can start a lighthouse server by running:

$ RUST_BACKTRACE=1 torchft_lighthouse --min_replicas 1 --quorum_tick_ms 100 --join_timeout_ms 1000

Example Training Loop (DDP)

See train_ddp.py for the full example.

Invoke with:

$ TORCHFT_MANAGER_PORT=29512 TORCHFT_LIGHTHOUSE=http://localhost:29510 torchrun --master_port 29501 --nnodes 1 --nproc_per_node 1 train.py

train.py:

import torch
from torch import nn, optim

from torchft import Manager, DistributedDataParallel, Optimizer, ProcessGroupGloo

device = "cuda" if torch.cuda.is_available() else "cpu"

manager = Manager(
    pg=ProcessGroupGloo(),
    load_state_dict=...,  # user-provided callback that restores model/optimizer state from a peer
    state_dict=...,       # user-provided callback that returns the current model/optimizer state
)

# Wrap the model and optimizer so the manager can coordinate each step.
m = nn.Linear(2, 3).to(device)
m = DistributedDataParallel(manager, m)
optimizer = Optimizer(manager, optim.AdamW(m.parameters()))

for i in range(1000):
    batch = torch.rand(2, 2, device=device)

    optimizer.zero_grad()

    out = m(batch)
    loss = out.sum()

    loss.backward()

    optimizer.step()

Example Parameter Server

torchft has a fault tolerant parameter server implementation built on its reconfigurable ProcessGroups. This does not require or use a Lighthouse server.

See parameter_server_test.py for an example.
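
The key primitive is that torchft's ProcessGroups can be reconfigured in place as membership changes. Below is a minimal sketch, assuming a configure(store_addr, rank, world_size) method and an illustrative store address format; refer to parameter_server_test.py for the actual usage:

from torchft import ProcessGroupGloo

pg = ProcessGroupGloo()

# Initial configuration: two workers rendezvous via a shared store address
# ("localhost:29600/run_0" is an assumed/illustrative format).
pg.configure("localhost:29600/run_0", rank=0, world_size=2)

# Later, after a worker joins or leaves, the same object can be
# reconfigured in place rather than tearing down the whole job.
pg.configure("localhost:29600/run_1", rank=0, world_size=3)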

Contributing

We welcome PRs! See the CONTRIBUTING file.

License

torchft is BSD 3-Clause licensed. See LICENSE for more details.
