torch-ft

Prototype repo for PyTorch fault tolerance

This implements a lighthouse server that coordinates across the different replica groups, along with a per-replica-group manager and a fault tolerance library that can be used in a standard PyTorch training loop.

This allows membership changes at training-step granularity, which can greatly improve efficiency by avoiding stop-the-world restarts of training when errors occur.
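To make the idea of step-level membership concrete, here is a toy sketch in plain Python. It does not use torchft at all: the replica group names, the `schedule` of healthy groups, and the scalar "gradients" are all hypothetical, and the averaging rule is only an assumption about how a step might proceed with whichever groups are present.

```python
from typing import Dict, List, Set


def average_over_members(grads: Dict[str, float], healthy: Set[str]) -> float:
    """Average the contributions of the replica groups that are healthy this step."""
    live = [grads[name] for name in sorted(healthy) if name in grads]
    if not live:
        raise RuntimeError("no healthy replica groups this step")
    return sum(live) / len(live)


def run_steps(schedule: List[Set[str]], grads: Dict[str, float], lr: float = 0.1) -> List[float]:
    """Apply one SGD-style update per step, using only that step's members."""
    weight = 0.0
    history = []
    for healthy in schedule:
        # Membership is re-evaluated every step, so a failed group is simply
        # left out of this step instead of halting every other group.
        weight -= lr * average_over_members(grads, healthy)
        history.append(weight)
    return history


# Hypothetical run: group "b" drops out for step 2 and rejoins at step 3.
schedule = [{"a", "b"}, {"a"}, {"a", "b"}]
grads = {"a": 1.0, "b": 3.0}
history = run_steps(schedule, grads)
print(history)
```

Training keeps making progress during the step where `"b"` is missing, which is the efficiency gain described above.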

Installation

$ pip install .

The package is built with pyo3 and maturin, so you'll need maturin installed.

To install in editable mode with the Rust extensions, use the normal pip editable install command:

$ pip install -e .

Lighthouse

You can start a lighthouse server by running:

$ RUST_BACKTRACE=1 torchft_lighthouse --min_replicas 1 --quorum_tick_ms 100 --join_timeout_ms 1000
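If you start the lighthouse from a Python launcher script rather than a shell, the same command can be assembled programmatically. This is a minimal sketch: the helper names `lighthouse_command` and `lighthouse_env` are hypothetical, and the flags and default values are copied from the command above rather than taken from any other documentation.

```python
import os
import shlex
from typing import Dict, List, Optional


def lighthouse_command(
    min_replicas: int = 1,
    quorum_tick_ms: int = 100,
    join_timeout_ms: int = 1000,
) -> List[str]:
    """Build the argv for torchft_lighthouse using the flags shown above."""
    return [
        "torchft_lighthouse",
        "--min_replicas", str(min_replicas),
        "--quorum_tick_ms", str(quorum_tick_ms),
        "--join_timeout_ms", str(join_timeout_ms),
    ]


def lighthouse_env(base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Copy the environment and enable Rust backtraces, as in the shell command."""
    env = dict(os.environ if base is None else base)
    env["RUST_BACKTRACE"] = "1"
    return env


# Print the command instead of running it, so this sketch has no side effects.
print(shlex.join(lighthouse_command()))
```

To actually launch the server, pass the result to `subprocess.run(lighthouse_command(), env=lighthouse_env(), check=True)`.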

Example Training Loop

See train.py for the full example.

Invoke with:

$ TORCHFT_MANAGER_PORT=29512 TORCHFT_LIGHTHOUSE=http://localhost:29510 torchrun --master_port 29501 --nnodes 1 --nproc_per_node 1 train.py
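When running more than one replica group on a single host, each invocation presumably needs its own manager port and torchrun master port so they do not collide. The sketch below generates such command lines; the helper names and the "base port plus index" scheme are assumptions for illustration, not something torchft prescribes. Index 0 reproduces the command above.

```python
import shlex
from typing import Dict, List, Tuple


def replica_group_launch(
    index: int,
    lighthouse: str = "http://localhost:29510",
    base_manager_port: int = 29512,
    base_master_port: int = 29501,
    script: str = "train.py",
) -> Tuple[Dict[str, str], List[str]]:
    """Return (environment overrides, argv) for one replica group."""
    env = {
        # Assumption: offset both ports by the replica group index.
        "TORCHFT_MANAGER_PORT": str(base_manager_port + index),
        "TORCHFT_LIGHTHOUSE": lighthouse,
    }
    argv = [
        "torchrun",
        "--master_port", str(base_master_port + index),
        "--nnodes", "1",
        "--nproc_per_node", "1",
        script,
    ]
    return env, argv


def as_shell_line(env: Dict[str, str], argv: List[str]) -> str:
    """Render the launch as a single shell line with VAR=value prefixes."""
    prefix = " ".join(f"{key}={shlex.quote(value)}" for key, value in env.items())
    return f"{prefix} {shlex.join(argv)}"


for i in range(2):
    print(as_shell_line(*replica_group_launch(i)))
```

All replica groups point at the same `TORCHFT_LIGHTHOUSE` address, since the lighthouse is what coordinates across them.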

train.py:

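import torch
from torch import nn, optim

from torchft import Manager, DistributedDataParallel, Optimizer, ProcessGroupGloo

# Assumption: run on CPU, which matches the Gloo process group used below.
device = "cpu"

# Per-replica-group manager. The two state_dict callbacks are elided here;
# see train.py for the full example.
manager = Manager(
    pg=ProcessGroupGloo(),
    load_state_dict=...,
    state_dict=...,
)

# Wrap the model and optimizer in the torchft versions, which take the manager.
m = nn.Linear(2, 3)
m = DistributedDataParallel(manager, m)
optimizer = Optimizer(manager, optim.AdamW(m.parameters()))

for i in range(1000):
    batch = torch.rand(2, 2, device=device)

    optimizer.zero_grad()

    out = m(batch)
    loss = out.sum()

    loss.backward()

    optimizer.step()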
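The `load_state_dict=...` and `state_dict=...` arguments are elided in the snippet above. One plausible shape for them is a pair of closures over the model and optimizer, sketched below. This assumes the manager calls `state_dict()` with no arguments and `load_state_dict(state)` with whatever the former returned; check train.py for the actual callbacks. The `make_state_fns` helper and the `_Stateful` stand-in class are hypothetical, and the stand-in only exists so the sketch runs without PyTorch.

```python
from typing import Any, Callable, Dict, Tuple


def make_state_fns(model: Any, optimizer: Any) -> Tuple[Callable[[], Dict[str, Any]], Callable[[Dict[str, Any]], None]]:
    """Build (state_dict, load_state_dict) callbacks covering model and optimizer."""

    def state_dict() -> Dict[str, Any]:
        # Snapshot everything a recovering replica group would need.
        return {"model": model.state_dict(), "optim": optimizer.state_dict()}

    def load_state_dict(state: Dict[str, Any]) -> None:
        # Restore from a snapshot produced by state_dict() on a healthy peer.
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optim"])

    return state_dict, load_state_dict


class _Stateful:
    """Stand-in exposing the same two methods as torch modules and optimizers."""

    def __init__(self, **values: float) -> None:
        self._values = dict(values)

    def state_dict(self) -> Dict[str, float]:
        return dict(self._values)

    def load_state_dict(self, state: Dict[str, float]) -> None:
        self._values = dict(state)


# A healthy group takes a snapshot and a recovering group loads it.
healthy_save, _ = make_state_fns(_Stateful(w=1.5), _Stateful(lr=0.1))
recovering_save, recovering_load = make_state_fns(_Stateful(w=0.0), _Stateful(lr=0.0))
recovering_load(healthy_save())
print(recovering_save())
```

With real PyTorch objects, the same helper could be called as `make_state_fns(m, optimizer)` once both exist, since it relies only on the `state_dict` and `load_state_dict` methods.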

Running Tests / Lint

$ cargo fmt
$ cargo test

License

Apache 2.0 -- see LICENSE for more details.

Copyright (c) Tristan Rice 2024
