PyTorch Elastic Training

TorchElastic

TorchElastic allows you to launch distributed PyTorch jobs in a fault-tolerant and elastic manner. For the latest documentation, please refer to our website.

Requirements

torchelastic requires:

  • python3 (3.6+)
  • torch
  • etcd

Installation

pip install torchelastic

Quickstart

Fault-tolerant on 4 nodes, 8 trainers/node, total 4 * 8 = 32 trainers. Run the following on all nodes.

python -m torchelastic.distributed.launch \
            --nnodes=4 \
            --nproc_per_node=8 \
            --rdzv_id=JOB_ID \
            --rdzv_backend=etcd \
            --rdzv_endpoint=ETCD_HOST:ETCD_PORT \
            YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)
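
If the same command has to be started on every node from a job script, it can help to assemble it programmatically. The following is a minimal sketch, not part of torchelastic: `build_launch_command` is a hypothetical helper, and it only uses the flags shown in the command above. The values passed in the usage section are the same placeholders as in this README.

```python
import shlex
import sys
from typing import List, Sequence


def build_launch_command(
    nnodes: str,
    nproc_per_node: int,
    rdzv_id: str,
    rdzv_endpoint: str,
    script: str,
    script_args: Sequence[str] = (),
    rdzv_backend: str = "etcd",
) -> List[str]:
    """Assemble the launcher invocation as an argv list (hypothetical helper)."""
    return [
        sys.executable, "-m", "torchelastic.distributed.launch",
        f"--nnodes={nnodes}",
        f"--nproc_per_node={nproc_per_node}",
        f"--rdzv_id={rdzv_id}",
        f"--rdzv_backend={rdzv_backend}",
        f"--rdzv_endpoint={rdzv_endpoint}",
        script,
        *script_args,  # everything after the script name goes to the training script
    ]


if __name__ == "__main__":
    cmd = build_launch_command(
        nnodes="4",
        nproc_per_node=8,
        rdzv_id="JOB_ID",
        rdzv_endpoint="ETCD_HOST:ETCD_PORT",
        script="YOUR_TRAINING_SCRIPT.py",
        script_args=["--arg1", "value1"],
    )
    # Print the command in a shell-safe form for inspection.
    print(" ".join(shlex.quote(part) for part in cmd))
    # To actually start the job on this node, one could call:
    #   subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems and line-continuation mistakes when the command is passed to `subprocess.run`.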

Elastic on 1 ~ 4 nodes, 8 trainers/node, total 8 ~ 32 trainers. The job starts as soon as 1 node is healthy, and you may add nodes up to a maximum of 4.

python -m torchelastic.distributed.launch \
            --nnodes=1:4 \
            --nproc_per_node=8 \
            --rdzv_id=JOB_ID \
            --rdzv_backend=etcd \
            --rdzv_endpoint=ETCD_HOST:ETCD_PORT \
            YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)
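
The trainer totals quoted in the two examples follow directly from `--nnodes` and `--nproc_per_node`. As an illustration only, the arithmetic can be sketched as below. `parse_nnodes` and `trainer_range` are hypothetical helpers written for this README, not torchelastic functions, and the parsing simply assumes the `N` and `MIN:MAX` forms shown in the commands above.

```python
from typing import Tuple


def parse_nnodes(spec: str) -> Tuple[int, int]:
    """Parse an --nnodes value of the form 'N' or 'MIN:MAX' into (min, max)."""
    parts = spec.split(":")
    if len(parts) == 1:
        lo = hi = int(parts[0])  # fixed size: min and max are the same
    elif len(parts) == 2:
        lo, hi = int(parts[0]), int(parts[1])
    else:
        raise ValueError(f"expected 'N' or 'MIN:MAX', got {spec!r}")
    if not 0 < lo <= hi:
        raise ValueError(f"need 0 < MIN <= MAX, got {spec!r}")
    return lo, hi


def trainer_range(nnodes: str, nproc_per_node: int) -> Tuple[int, int]:
    """Return the (minimum, maximum) total number of trainers."""
    lo, hi = parse_nnodes(nnodes)
    return lo * nproc_per_node, hi * nproc_per_node


print(trainer_range("4", 8))    # (32, 32): fault-tolerant, fixed at 4 * 8
print(trainer_range("1:4", 8))  # (8, 32): elastic, between 1 * 8 and 4 * 8
```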

Contributing

We welcome PRs. See the CONTRIBUTING file.

License

torchelastic is BSD licensed, as found in the LICENSE file.
