PyTorch Elastic Training
TorchElastic
TorchElastic allows you to launch distributed PyTorch jobs in a fault-tolerant and elastic manner. For the latest documentation, please refer to our website.
Requirements
torchelastic requires:
- python3 (3.6+)
- torch
- etcd
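The etcd requirement means a reachable etcd server must back the rendezvous. As a rough sketch only (assuming etcd is installed and assuming the v2 API must be enabled for torchelastic's rendezvous client; verify both against your etcd and torchelastic versions), a single-node development server could be started like:

```
# Single-node etcd for local experimentation only (not production).
# --enable-v2 is assumed here because torchelastic's rendezvous
# uses the etcd v2 API; check your versions before relying on it.
etcd --enable-v2 \
     --listen-client-urls http://0.0.0.0:2379 \
     --advertise-client-urls http://127.0.0.1:2379
```

The host and client port of this server are what `--rdzv_endpoint=ETCD_HOST:ETCD_PORT` points at in the launch commands below.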
Installation
pip install torchelastic
Quickstart
Fault-tolerant on 4 nodes, 8 trainers/node, total 4 * 8 = 32 trainers.
Run the following on all nodes.
python -m torchelastic.distributed.launch \
        --nnodes=4 \
        --nproc_per_node=8 \
        --rdzv_id=JOB_ID \
        --rdzv_backend=etcd \
        --rdzv_endpoint=ETCD_HOST:ETCD_PORT \
        YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)
Elastic on 1 ~ 4 nodes, 8 trainers/node, total 8 ~ 32 trainers. The job starts as soon as 1 node is healthy; you may add up to 4 nodes.
python -m torchelastic.distributed.launch \
        --nnodes=1:4 \
        --nproc_per_node=8 \
        --rdzv_id=JOB_ID \
        --rdzv_backend=etcd \
        --rdzv_endpoint=ETCD_HOST:ETCD_PORT \
        YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)
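The launcher spawns --nproc_per_node worker processes per node and hands each one its rendezvous results as environment variables (RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT in torchelastic 0.2; verify against your version). A minimal YOUR_TRAINING_SCRIPT.py might therefore look like the following sketch; the helper name `dist_env` is illustrative, not part of torchelastic:

```python
import os


def dist_env():
    """Parse the variables the torchelastic launcher exports to each worker."""
    return {
        "rank": int(os.environ["RANK"]),
        "local_rank": int(os.environ["LOCAL_RANK"]),
        "world_size": int(os.environ["WORLD_SIZE"]),
    }


def main():
    # Imported here so dist_env() stays usable without torch installed.
    import torch.distributed as dist

    env = dist_env()
    # MASTER_ADDR / MASTER_PORT are also exported by the launcher,
    # so the env:// init method needs no extra arguments.
    dist.init_process_group(backend="gloo", init_method="env://")
    print(f"worker {env['rank']}/{env['world_size']} "
          f"(local rank {env['local_rank']}) initialized")
    # ... training loop goes here ...
    dist.destroy_process_group()


if __name__ == "__main__" and "RANK" in os.environ:
    main()  # only meaningful when launched via torchelastic
```

Because workers can be re-spawned after a failure or a scale event, anything stateful (model, optimizer, epoch counter) should be checkpointed and reloaded at startup rather than assumed to survive across restarts.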
Contributing
We welcome PRs. See the CONTRIBUTING file.
License
torchelastic is BSD licensed, as found in the LICENSE file.
Download files
Source Distribution: torchelastic-0.2.0rc1.tar.gz (51.8 kB)
Built Distribution: torchelastic-0.2.0rc1-py3-none-any.whl

Hashes for torchelastic-0.2.0rc1-py3-none-any.whl:

Algorithm | Hash digest
---|---
SHA256 | ae61fd100ea34f5055e9c0fae2337956283ceb2ff5117111dff43e035fa6309a
MD5 | e9d3a3808de6ea7e48cafbf8518493e4
BLAKE2b-256 | 84b68034f69a51ecc99d0f256899f27486111da7ab5c454390a06de7416bbddf