Easy Per Step Fault Tolerance for PyTorch
| Documentation | Poster | Design Doc |
⚠️ WARNING: This is an alpha prototype for PyTorch fault tolerance and may have bugs or breaking changes, as it is under active development. We'd love to collaborate, and contributions are welcome. Please reach out if you're interested in torchft or want to discuss fault tolerance in PyTorch.
This repository implements techniques for per-step fault tolerance so you can keep training when errors occur, without restarting the entire training job.
This is based off of the large scale training techniques presented at PyTorch Conference 2024.
Design
torchft is designed to provide fault tolerance when training with replicated weights, such as in DDP or HSDP (FSDP with DDP).
torchft implements a lighthouse server that coordinates across the different replica groups, plus a per-replica-group manager and fault tolerance library that can be used in a standard PyTorch training loop.
This allows membership changes at training-step granularity, which can greatly improve efficiency by avoiding stop-the-world recovery on errors.
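The per-step recovery idea can be sketched with a toy simulation (plain Python, not the torchft API): at each step the surviving replica groups form a fresh quorum and continue training, so a crash shrinks the quorum instead of restarting the whole job.

```python
def run_steps(num_steps, replicas, fail_at):
    """Simulate per-step quorum membership.

    replicas: names of replica groups at the start of training.
    fail_at: maps a replica name to the step at which it crashes.
    Returns a list of (step, surviving replicas) pairs.
    """
    history = []
    alive = set(replicas)
    for step in range(num_steps):
        # Replicas scheduled to fail at this step drop out of the quorum.
        alive -= {r for r, s in fail_at.items() if s == step}
        if not alive:
            break  # no survivors: training cannot continue
        # The surviving quorum runs this training step.
        history.append((step, sorted(alive)))
    return history

# Replica "b" crashes at step 2; "a" and "c" keep training that same step.
steps = run_steps(4, ["a", "b", "c"], {"b": 2})
```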
Prerequisites
Before proceeding, ensure you have the following installed:
- Rust (with the necessary dependencies)
- protobuf-compiler and the corresponding development package for Protobuf
Note that the Rust versions available in many conda environments may be outdated. To install the latest version of Rust, we recommend downloading it directly from the official website using the command below:
$ curl --proto '=https' --tlsv1.2 https://sh.rustup.rs -sSf | sh
To install the required packages on a Debian-based system (such as Ubuntu) using apt, run:
sudo apt install protobuf-compiler libprotobuf-dev
or for a Red Hat-based system, run:
sudo dnf install protobuf-compiler protobuf-devel
Installation
$ pip install .
This uses pyo3+maturin to build the package; you'll need maturin installed.
If the installation command fails when invoking cargo update because it cannot fetch the manifest, the cause may be the proxy, proxySSLCert, and proxySSLKey settings in your .gitconfig interfering with the cargo command. To resolve this, try temporarily removing these fields from your .gitconfig before rerunning the installation command.
To install in editable mode with the Rust extensions, use the normal pip install command:
$ pip install -e .
Usage
Lighthouse
The lighthouse is used for fault tolerance across replicated workers (DDP/FSDP) when using synchronous training.
You can start a lighthouse server by running:
$ RUST_BACKTRACE=1 torchft_lighthouse --min_replicas 1 --quorum_tick_ms 100 --join_timeout_ms 1000
Example Training Loop (DDP)
See train_ddp.py for the full example.
Invoke with:
$ TORCHFT_MANAGER_PORT=29512 TORCHFT_LIGHTHOUSE=http://localhost:29510 torchrun --master_port 29501 --nnodes 1 --nproc_per_node 1 train.py
train.py:
import torch
import torch.nn as nn
import torch.optim as optim

from torchft import Manager, DistributedDataParallel, Optimizer, ProcessGroupGloo

manager = Manager(
    pg=ProcessGroupGloo(),
    load_state_dict=...,
    state_dict=...,
)

m = nn.Linear(2, 3)
m = DistributedDataParallel(manager, m)
optimizer = Optimizer(manager, optim.AdamW(m.parameters()))

device = "cuda" if torch.cuda.is_available() else "cpu"
m.to(device)

for i in range(1000):
    batch = torch.rand(2, 2, device=device)

    optimizer.zero_grad()

    out = m(batch)
    loss = out.sum()
    loss.backward()

    optimizer.step()
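The load_state_dict and state_dict arguments above are callables the Manager uses to transfer state to a recovering replica. A minimal sketch of what such callables might look like (the names and dict structure here are illustrative, with plain dicts standing in for real torch state_dicts):

```python
# Toy stand-ins: in real code these would snapshot the model and
# optimizer state_dicts so a rejoining replica can catch up mid-run.
model_state = {"weight": [1.0, 2.0]}
optim_state = {"step": 0}

def state_dict():
    # Called on a healthy replica to capture everything a recovering
    # replica needs to rejoin training.
    return {"model": dict(model_state), "optim": dict(optim_state)}

def load_state_dict(sd):
    # Called on a recovering replica to adopt the healthy replica's state.
    model_state.update(sd["model"])
    optim_state.update(sd["optim"])

snapshot = state_dict()
optim_state["step"] = 5   # this replica has diverged
load_state_dict(snapshot) # restore it from the snapshot
```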
Example Parameter Server
torchft includes a fault tolerant parameter server implementation built on its reconfigurable ProcessGroups. It does not require or use a Lighthouse server.
See parameter_server_test.py for an example.
Contributing
We welcome PRs! See the CONTRIBUTING file.
License
torchft is BSD 3-Clause licensed. See LICENSE for more details.