PyTorch Elastic Training
TorchElastic
TorchElastic allows you to launch distributed PyTorch jobs in a fault-tolerant and elastic manner. For the latest documentation, please refer to our website.
Requirements
torchelastic requires
- python3 (3.8+)
- torch
- etcd
Installation
pip install torchelastic
Quickstart
Fault-tolerant on 4 nodes, 8 trainers/node, total 4 * 8 = 32 trainers.
Run the following on all nodes.
python -m torchelastic.distributed.launch \
    --nnodes=4 \
    --nproc_per_node=8 \
    --rdzv_id=JOB_ID \
    --rdzv_backend=etcd \
    --rdzv_endpoint=ETCD_HOST:ETCD_PORT \
    YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)
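Each trainer process started by the launcher needs to know its place in the job. The following is a minimal sketch of how a training script might read that information, assuming the launcher exports the `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` environment variables to every trainer. `WorkerEnv` and `read_worker_env` are hypothetical names introduced for illustration and are not part of torchelastic.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class WorkerEnv:
    """Identity of one trainer process (hypothetical helper type)."""
    rank: int        # global index of this trainer across all nodes
    local_rank: int  # index of this trainer on its own node
    world_size: int  # total number of trainers in the job


def read_worker_env(environ: Mapping[str, str] = os.environ) -> WorkerEnv:
    """Read trainer identity from the environment.

    Assumption: the launcher sets RANK, LOCAL_RANK and WORLD_SIZE.
    The defaults let the script also run as a single standalone process.
    """
    return WorkerEnv(
        rank=int(environ.get("RANK", "0")),
        local_rank=int(environ.get("LOCAL_RANK", "0")),
        world_size=int(environ.get("WORLD_SIZE", "1")),
    )


if __name__ == "__main__":
    env = read_worker_env()
    print(f"trainer {env.rank} of {env.world_size} (local rank {env.local_rank})")
```

In a real training script these values would typically be used to pick a device from `local_rank` and to initialize the process group. With the command above, `world_size` would be expected to equal 32.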
Elastic on 1 ~ 4 nodes, 8 trainers/node, total 8 ~ 32 trainers. The job starts as soon as 1 node is healthy; you may add nodes up to a total of 4.
python -m torchelastic.distributed.launch \
    --nnodes=1:4 \
    --nproc_per_node=8 \
    --rdzv_id=JOB_ID \
    --rdzv_backend=etcd \
    --rdzv_endpoint=ETCD_HOST:ETCD_PORT \
    YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)
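The trainer counts quoted in the two examples follow directly from the `--nnodes` and `--nproc_per_node` values. The sketch below reproduces that arithmetic for both the fixed form (`4`) and the `MIN:MAX` form (`1:4`). `parse_nnodes` and `trainer_range` are hypothetical helpers written for this illustration; they are not the launcher's own argument parser.

```python
from typing import Tuple


def parse_nnodes(spec: str) -> Tuple[int, int]:
    """Turn an --nnodes value such as "4" or "1:4" into (min_nodes, max_nodes)."""
    parts = spec.split(":")
    if len(parts) == 1:
        # Fixed-size job: minimum and maximum are the same.
        min_nodes = max_nodes = int(parts[0])
    elif len(parts) == 2:
        min_nodes, max_nodes = int(parts[0]), int(parts[1])
    else:
        raise ValueError(f"expected N or MIN:MAX, got {spec!r}")
    if not 0 < min_nodes <= max_nodes:
        raise ValueError(f"invalid node range in {spec!r}")
    return min_nodes, max_nodes


def trainer_range(nnodes: str, nproc_per_node: int) -> Tuple[int, int]:
    """Smallest and largest total trainer count the job can run with."""
    min_nodes, max_nodes = parse_nnodes(nnodes)
    return min_nodes * nproc_per_node, max_nodes * nproc_per_node


if __name__ == "__main__":
    print(trainer_range("4", 8))    # fault-tolerant example
    print(trainer_range("1:4", 8))  # elastic example
```

Because the total can change between 8 and 32 while an elastic job runs, a training script should avoid hard-coding the trainer count and read it at startup instead.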
Contributing
We welcome PRs. See the CONTRIBUTING file.
License
torchelastic is BSD licensed, as found in the LICENSE file.
Download files
File details
Details for the file torchelastic-0.2.2.tar.gz.
File metadata
- Download URL: torchelastic-0.2.2.tar.gz
- Upload date:
- Size: 90.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.24.0 setuptools/47.1.0 requests-toolbelt/0.9.1 tqdm/4.57.0 CPython/3.8.4
File hashes
Algorithm | Hash digest
---|---
SHA256 | dd214aa807bf50120ff7a6544fd6b12a7996bf4767438998242f8b8b6959e11f
MD5 | 26edf446974517c052ef47ab0890c938
BLAKE2b-256 | 4fb56b598fe8881a2de40e5a01100ab5932c8b791b9249ccc99c0d5006443c93
File details
Details for the file torchelastic-0.2.2-py3.8.egg.
File metadata
- Download URL: torchelastic-0.2.2-py3.8.egg
- Upload date:
- Size: 245.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.24.0 setuptools/47.1.0 requests-toolbelt/0.9.1 tqdm/4.57.0 CPython/3.8.4
File hashes
Algorithm | Hash digest
---|---
SHA256 | cb4f1d7844987ba95fa09ef00774f09ab3a5bf0ffeb61a4b330ca0d3fcbc1f74
MD5 | e960dc140e6caa62b2c0628fc1cc6929
BLAKE2b-256 | becc9e30a540a55a568673bfef28ed74c8e155ab82f9f3a1d72a26d45cf4dfc4
File details
Details for the file torchelastic-0.2.2-py3-none-any.whl.
File metadata
- Download URL: torchelastic-0.2.2-py3-none-any.whl
- Upload date:
- Size: 111.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.24.0 setuptools/47.1.0 requests-toolbelt/0.9.1 tqdm/4.57.0 CPython/3.8.4
File hashes
Algorithm | Hash digest
---|---
SHA256 | 99c9f67f371c73e4c80b1ec71c36be91e5fdd106edcf4c848415bf55cfed6416
MD5 | 73903dfd9e2ed5c7753aacef2daa409b
BLAKE2b-256 | 0fcfa1c438dce530fee452acbce43a561c1cbbd8c158a1766b927184a1692fee