
PyTorch Elastic Training



TorchElastic

TorchElastic allows you to launch distributed PyTorch jobs in a fault-tolerant and elastic manner. For the latest documentation, please refer to our website.

Requirements

torchelastic requires:

  • python3 (3.6+)
  • torch
  • etcd

Installation

pip install torchelastic

Quickstart

Fault-tolerant job on 4 nodes with 8 trainers per node, for a total of 4 * 8 = 32 trainers. Run the following on all nodes.

python -m torchelastic.distributed.launch \
            --nnodes=4 \
            --nproc_per_node=8 \
            --rdzv_id=JOB_ID \
            --rdzv_backend=etcd \
            --rdzv_endpoint=ETCD_HOST:ETCD_PORT \
            YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)
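
Each trainer started by the launcher needs to know its place in the job. The sketch below shows one way a training script might read that information. It assumes the launcher exports the usual torch.distributed-style environment variables (RANK, LOCAL_RANK, WORLD_SIZE); the helper name read_dist_env is hypothetical and not part of torchelastic.

```python
import os


def read_dist_env(env=os.environ):
    """Return rank info from torch.distributed-style environment variables.

    Assumption: the launcher exports RANK, LOCAL_RANK and WORLD_SIZE.
    The defaults describe a single-process run when they are absent.
    """
    return {
        "rank": int(env.get("RANK", 0)),              # global trainer index
        "local_rank": int(env.get("LOCAL_RANK", 0)),  # trainer index on this node
        "world_size": int(env.get("WORLD_SIZE", 1)),  # total number of trainers
    }


if __name__ == "__main__":
    info = read_dist_env()
    print(
        f"trainer {info['rank']} of {info['world_size']} "
        f"(local rank {info['local_rank']})"
    )
```

Passing the environment as an argument keeps the helper easy to test without launching a real job.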

Elastic job on 1 to 4 nodes with 8 trainers per node, for a total of 8 to 32 trainers. The job starts as soon as 1 node is healthy; you can add nodes up to a total of 4.

python -m torchelastic.distributed.launch \
            --nnodes=1:4 \
            --nproc_per_node=8 \
            --rdzv_id=JOB_ID \
            --rdzv_backend=etcd \
            --rdzv_endpoint=ETCD_HOST:ETCD_PORT \
            YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)
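
The trainer counts quoted in the two examples follow directly from --nnodes and --nproc_per_node. As a rough illustration of that arithmetic, the sketch below parses an --nnodes value in either the fixed ("4") or the MIN:MAX ("1:4") form and returns the smallest and largest possible number of trainers. The function names are hypothetical, and this is not the launcher's own parser.

```python
def parse_nnodes(value):
    """Parse an --nnodes value such as "4" or "1:4" into (min_nodes, max_nodes)."""
    parts = value.split(":")
    if len(parts) == 1:
        # Fixed-size job: minimum and maximum are the same.
        min_nodes = max_nodes = int(parts[0])
    elif len(parts) == 2:
        min_nodes, max_nodes = int(parts[0]), int(parts[1])
    else:
        raise ValueError(f"expected N or MIN:MAX, got {value!r}")
    if not 0 < min_nodes <= max_nodes:
        raise ValueError(f"invalid node range: {value!r}")
    return min_nodes, max_nodes


def trainer_range(nnodes, nproc_per_node):
    """Return (min_trainers, max_trainers) for the given launch settings."""
    min_nodes, max_nodes = parse_nnodes(nnodes)
    return min_nodes * nproc_per_node, max_nodes * nproc_per_node


if __name__ == "__main__":
    print(trainer_range("4", 8))    # fault-tolerant example
    print(trainer_range("1:4", 8))  # elastic example
```

A training script that scales its learning rate or batch size with the number of trainers would need to handle every value in this range.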

Contributing

We welcome PRs. See the CONTRIBUTING file.

License

torchelastic is BSD licensed, as found in the LICENSE file.

