torch-ft
Prototype repo for PyTorch fault tolerance
This repository implements a lighthouse server that coordinates across replica groups, together with a per-replica-group manager and a fault tolerance library that can be used in a standard PyTorch training loop.
This allows membership to change at training-step granularity, which can greatly improve efficiency by avoiding stop-the-world restarts of training when errors occur.
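To make the idea of step-granularity membership concrete, here is a toy, pure-Python sketch. It is not torchft's implementation: the `quorum_step` function, the replica names, and the scalar "gradients" are all hypothetical. It only illustrates that each step averages over whichever replica groups are healthy at that step, so a failure shrinks the quorum instead of halting every replica.

```python
from typing import Dict, Optional


def quorum_step(grads: Dict[str, Optional[float]], min_replicas: int = 1) -> Optional[float]:
    """Average the gradients of the replicas that are alive for this step.

    A value of None marks a replica that failed during the step (a convention
    made up for this sketch). Returns None when too few replicas remain,
    meaning the step should be skipped rather than applied.
    """
    live = [g for g in grads.values() if g is not None]
    if len(live) < min_replicas:
        return None
    return sum(live) / len(live)


# Step 1: all three replica groups participate.
step1 = quorum_step({"replica_0": 1.0, "replica_1": 2.0, "replica_2": 3.0})
# Step 2: replica_1 fails; the remaining two continue without a restart.
step2 = quorum_step({"replica_0": 1.0, "replica_1": None, "replica_2": 3.0})
# Step 3: nothing healthy is left, so no update is produced.
step3 = quorum_step({"replica_0": None, "replica_1": None, "replica_2": None})
print(step1, step2, step3)
```

The `min_replicas` parameter here only mirrors the name of the lighthouse flag; how the real lighthouse forms a quorum is not described in this document.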
Installation
$ pip install .
This uses pyo3 and maturin to build the package, so you'll need maturin installed.
To install in editable mode with the Rust extensions, you can use the normal pip install command:
$ pip install -e .
Lighthouse
You can start a lighthouse server by running:
$ RUST_BACKTRACE=1 torchft_lighthouse --min_replicas 1 --quorum_tick_ms 100 --join_timeout_ms 1000
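If you would rather start the lighthouse from a Python launcher script, one possible sketch is to build the same command line and hand it to `subprocess`. The helper names `lighthouse_command` and `start_lighthouse` are hypothetical; the executable name and flags are exactly those in the command above, and no other flags are assumed.

```python
import os
import subprocess
from typing import List


def lighthouse_command(min_replicas: int = 1, quorum_tick_ms: int = 100,
                       join_timeout_ms: int = 1000) -> List[str]:
    """Build the argv for the lighthouse using only the flags shown above."""
    return [
        "torchft_lighthouse",
        "--min_replicas", str(min_replicas),
        "--quorum_tick_ms", str(quorum_tick_ms),
        "--join_timeout_ms", str(join_timeout_ms),
    ]


def start_lighthouse(**kwargs: int) -> subprocess.Popen:
    """Start the lighthouse as a child process with Rust backtraces enabled."""
    env = dict(os.environ, RUST_BACKTRACE="1")
    return subprocess.Popen(lighthouse_command(**kwargs), env=env)


# Printing the argv is a cheap way to check it before launching anything.
print(" ".join(lighthouse_command()))
```

Calling `start_lighthouse()` requires `torchft_lighthouse` to be on your PATH, which should be the case once the package is installed.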
Example Training Loop
See train.py for the full example.
Invoke with:
$ TORCHFT_MANAGER_PORT=29512 TORCHFT_LIGHTHOUSE=http://localhost:29510 torchrun --master_port 29501 --nnodes 1 --nproc_per_node 1 train.py
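The invocation above passes configuration through two environment variables. As a sketch, a training script could validate them before doing any work. The function `read_torchft_env` is hypothetical and is not part of torchft; how the library itself consumes these variables is not shown here.

```python
import os
from typing import Mapping, Tuple
from urllib.parse import urlparse


def read_torchft_env(env: Mapping[str, str] = os.environ) -> Tuple[str, int]:
    """Return (lighthouse_url, manager_port) or raise ValueError with a hint."""
    lighthouse = env.get("TORCHFT_LIGHTHOUSE", "")
    parsed = urlparse(lighthouse)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        raise ValueError(f"TORCHFT_LIGHTHOUSE must be an http(s) URL, got {lighthouse!r}")
    port_text = env.get("TORCHFT_MANAGER_PORT", "")
    if not port_text.isdecimal() or not 0 < int(port_text) < 65536:
        raise ValueError(f"TORCHFT_MANAGER_PORT must be a TCP port, got {port_text!r}")
    return lighthouse, int(port_text)


# Example with the values from the command above.
example = {"TORCHFT_LIGHTHOUSE": "http://localhost:29510", "TORCHFT_MANAGER_PORT": "29512"}
print(read_torchft_env(example))
```

Failing early with a clear message is usually easier to debug than a connection error raised deep inside the training loop.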
train.py:
import torch
from torch import nn, optim

from torchft import Manager, DistributedDataParallel, Optimizer, ProcessGroupGloo

# The manager handles fault tolerance for this replica group.
manager = Manager(
    pg=ProcessGroupGloo(),
    load_state_dict=...,
    state_dict=...,
)

device = "cpu"  # assumption: this excerpt never defines device

m = nn.Linear(2, 3)
m = DistributedDataParallel(manager, m)
optimizer = Optimizer(manager, optim.AdamW(m.parameters()))

for i in range(1000):
    batch = torch.rand(2, 2, device=device)
    optimizer.zero_grad()
    out = m(batch)
    loss = out.sum()
    loss.backward()
    optimizer.step()
Running Tests / Lint
$ cargo fmt
$ cargo test
License
Apache 2.0 -- see LICENSE for more details.
Copyright (c) Tristan Rice 2024
Built Distributions
File details
Details for the file torchft-0.1.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.
File metadata
- Download URL: torchft-0.1.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- Upload date:
- Size: 2.0 MB
- Tags: CPython 3.12, manylinux: glibc 2.17+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.12.4
File hashes
Algorithm | Hash digest
---|---
SHA256 | 33a87022eb439002c72aae0f30343f1d6ba3d6746efe7e214c9da1ec84f6b545
MD5 | d3d1042ffff688990fa51a2c0ccde466
BLAKE2b-256 | d935dadba40bc3d38cc24ed7ba8aaa8ada86ae17ce6237a6c949d591faa98297
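The digests listed for each file can be checked after download. A minimal sketch using only Python's `hashlib` follows; the function names `file_digests` and `matches` are hypothetical, and the sketch assumes that "BLAKE2b-256" means BLAKE2b with a 32-byte digest.

```python
import hashlib
from typing import Dict


def file_digests(path: str, chunk_size: int = 1 << 20) -> Dict[str, str]:
    """Compute SHA256, MD5 and BLAKE2b-256 hex digests of a file in one pass."""
    hashers = {
        "SHA256": hashlib.sha256(),
        "MD5": hashlib.md5(),
        "BLAKE2b-256": hashlib.blake2b(digest_size=32),
    }
    with open(path, "rb") as f:
        # Read in chunks so large wheels are never loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            for h in hashers.values():
                h.update(chunk)
    return {name: h.hexdigest() for name, h in hashers.items()}


def matches(path: str, expected_sha256: str) -> bool:
    """Return True when the file's SHA256 equals the published digest."""
    return file_digests(path)["SHA256"] == expected_sha256.strip().lower()
```

For example, `matches(wheel_path, published_sha256)` compares a downloaded wheel against the SHA256 value copied from the table for that file.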
File details
Details for the file torchft-0.1.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.
File metadata
- Download URL: torchft-0.1.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- Upload date:
- Size: 2.0 MB
- Tags: CPython 3.11, manylinux: glibc 2.17+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.12.4
File hashes
Algorithm | Hash digest
---|---
SHA256 | ce57ed03819c48824f48892da6fede5f74b133dfd3cd8672f1aba4f5a1982b67
MD5 | a999964c2ae50a72adffd3732794af62
BLAKE2b-256 | 5f8f3a9d4219740d9b05a4669900856e8d3b0045ebc779531e90fb6e5c2d0302
File details
Details for the file torchft-0.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.
File metadata
- Download URL: torchft-0.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- Upload date:
- Size: 2.0 MB
- Tags: CPython 3.10, manylinux: glibc 2.17+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.12.4
File hashes
Algorithm | Hash digest
---|---
SHA256 | ad25b21d10c5206124cf5d44aee5b30f6b0585a2e27c0946502198a019f04920
MD5 | 462941e7b0d05cab9575462442c51052
BLAKE2b-256 | 0c26d5ac7a3b2d4a720ed2585ab0eca477967ce282a448fe6cb38fc57857547e
File details
Details for the file torchft-0.1.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.
File metadata
- Download URL: torchft-0.1.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- Upload date:
- Size: 2.0 MB
- Tags: CPython 3.9, manylinux: glibc 2.17+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.12.4
File hashes
Algorithm | Hash digest
---|---
SHA256 | 243626e9c6919b81666accf9a5ef3dd6d3ec56768ec27fb4b4a941d785965933
MD5 | 64a997925ed98d58bf49417dbe0b02a2
BLAKE2b-256 | bcc310fcb7822b7080b74b85efcc041bf802ace7b12c5a368410d3c0bd143691
File details
Details for the file torchft-0.1.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.
File metadata
- Download URL: torchft-0.1.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- Upload date:
- Size: 2.0 MB
- Tags: CPython 3.8, manylinux: glibc 2.17+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.12.4
File hashes
Algorithm | Hash digest
---|---
SHA256 | 4309cebe34fb5a3a0e9e9aca4fb0aff9136150c6428c6a846ea5f37a194dcff3
MD5 | 8e5df98f36ab211757cf9aa760f169a1
BLAKE2b-256 | 31bfc8fd90952e761fb4492c12025311678230165ead1bf4ac86c16b479d3997