PyTorch Elastic Training

TorchElastic

TorchElastic allows you to launch distributed PyTorch jobs in a fault-tolerant and elastic manner. For the latest documentation, please refer to our website.

Requirements

torchelastic requires

  • python3 (3.6+)
  • torch
  • etcd
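
The requirements above can be checked on a node before launching. The following is a minimal sketch, not part of torchelastic: `check_requirements` is a hypothetical helper, and it assumes that etcd is satisfied by an `etcd` binary on `PATH` (an etcd server reachable over the network would serve equally well and is not detected here).

```python
import importlib.util
import shutil
import sys


def check_requirements():
    """Return a dict mapping each README requirement to True or False."""
    return {
        # The README asks for Python 3.6 or newer.
        "python>=3.6": sys.version_info >= (3, 6),
        # find_spec returns None when the package cannot be imported.
        "torch": importlib.util.find_spec("torch") is not None,
        # Assumption: a local `etcd` binary on PATH counts as "etcd available".
        "etcd": shutil.which("etcd") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_requirements().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```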

Installation

pip install torchelastic

Quickstart

Fault-tolerant job on 4 nodes with 8 trainers per node, for a total of 4 * 8 = 32 trainers. Run the following on every node:

python -m torchelastic.distributed.launch \
            --nnodes=4 \
            --nproc_per_node=8 \
            --rdzv_id=JOB_ID \
            --rdzv_backend=etcd \
            --rdzv_endpoint=ETCD_HOST:ETCD_PORT \
            YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)
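
If the launch is driven from Python rather than a shell, the same command can be assembled as an argument list. This is only a sketch: `build_launch_command` is a hypothetical helper, it uses just the flags shown above, and the job ID, endpoint, and script name remain placeholders to be replaced with real values.

```python
import sys


def build_launch_command(nnodes, nproc_per_node, rdzv_id, rdzv_endpoint,
                         script, script_args=()):
    """Assemble the Quickstart launcher invocation as an argv list."""
    return [
        sys.executable, "-m", "torchelastic.distributed.launch",
        f"--nnodes={nnodes}",
        f"--nproc_per_node={nproc_per_node}",
        f"--rdzv_id={rdzv_id}",
        "--rdzv_backend=etcd",
        f"--rdzv_endpoint={rdzv_endpoint}",
        script,
        *script_args,  # everything after the script goes to the script itself
    ]


# Placeholder values copied from the Quickstart; substitute real ones.
cmd = build_launch_command(4, 8, "JOB_ID", "ETCD_HOST:ETCD_PORT",
                           "YOUR_TRAINING_SCRIPT.py", ["--arg1"])
print(" ".join(cmd))
```

Passing `cmd` to `subprocess.run(cmd, check=True)` on each node would then start the job; a list avoids shell quoting issues in the script arguments.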

Elastic job on 1 to 4 nodes with 8 trainers per node, for a total of 8 to 32 trainers. The job starts as soon as 1 node is healthy, and you may add nodes up to a maximum of 4.

python -m torchelastic.distributed.launch \
            --nnodes=1:4 \
            --nproc_per_node=8 \
            --rdzv_id=JOB_ID \
            --rdzv_backend=etcd \
            --rdzv_endpoint=ETCD_HOST:ETCD_PORT \
            YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)
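
The trainer totals quoted for the two modes follow directly from the `--nnodes` and `--nproc_per_node` values. The sketch below illustrates that arithmetic; `parse_nnodes` and `trainer_range` are hypothetical helpers written for this example and are not torchelastic's own argument parser.

```python
def parse_nnodes(spec):
    """Parse an --nnodes value such as "4" or "1:4" into (min_nodes, max_nodes)."""
    parts = str(spec).split(":")
    if len(parts) == 1:
        lo = hi = int(parts[0])  # fixed size: min and max coincide
    elif len(parts) == 2:
        lo, hi = int(parts[0]), int(parts[1])  # elastic: MIN:MAX
    else:
        raise ValueError(f"expected N or MIN:MAX, got {spec!r}")
    if not 0 < lo <= hi:
        raise ValueError(f"need 0 < min <= max, got {spec!r}")
    return lo, hi


def trainer_range(nnodes_spec, nproc_per_node):
    """Return (min_trainers, max_trainers) for a given launch configuration."""
    lo, hi = parse_nnodes(nnodes_spec)
    return lo * nproc_per_node, hi * nproc_per_node


print(trainer_range("4", 8))    # (32, 32)
print(trainer_range("1:4", 8))  # (8, 32)
```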

Contributing

We welcome PRs. See the CONTRIBUTING file.

License

torchelastic is BSD licensed, as found in the LICENSE file.

Download files

Source Distribution

torchelastic-0.2.0rc1.tar.gz (51.8 kB)

Uploaded Source

Built Distribution

torchelastic-0.2.0rc1-py3-none-any.whl (67.6 kB)

Uploaded Python 3

File details

Details for the file torchelastic-0.2.0rc1.tar.gz.

File metadata

  • Download URL: torchelastic-0.2.0rc1.tar.gz
  • Upload date:
  • Size: 51.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/45.1.0.post20200127 requests-toolbelt/0.9.1 tqdm/4.42.0 CPython/3.6.10

File hashes

Hashes for torchelastic-0.2.0rc1.tar.gz
Algorithm    Hash digest
SHA256       fbe2a076247144ca5fd4525ff3c1ad5adc444ca22fd2c9342f3f782fe2cd70a8
MD5          1a1b7d2e97927968c48e616f8fc45ac9
BLAKE2b-256  054335c77fb4fb4ee239f496c66e19aa28006a3b8ebb108176d6510be8e78176

File details

Details for the file torchelastic-0.2.0rc1-py3-none-any.whl.

File metadata

  • Download URL: torchelastic-0.2.0rc1-py3-none-any.whl
  • Upload date:
  • Size: 67.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/45.1.0.post20200127 requests-toolbelt/0.9.1 tqdm/4.42.0 CPython/3.6.10

File hashes

Hashes for torchelastic-0.2.0rc1-py3-none-any.whl
Algorithm    Hash digest
SHA256       ae61fd100ea34f5055e9c0fae2337956283ceb2ff5117111dff43e035fa6309a
MD5          e9d3a3808de6ea7e48cafbf8518493e4
BLAKE2b-256  84b68034f69a51ecc99d0f256899f27486111da7ab5c454390a06de7416bbddf
