PyTorch Elastic Training

PyTorch Elastic

PyTorch Elastic allows you to launch distributed PyTorch jobs in a fault-tolerant and elastic manner. For the latest documentation, please refer to our website.

Requirements

torchelastic requires

  • python3 (3.6+)
  • torch
  • etcd
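
A quick way to sanity-check the requirements above on a node is sketched below. The helper name `check_requirements` is hypothetical and not part of torchelastic. The etcd check only looks for a local `etcd` binary on the PATH, which is an assumption: the etcd server may just as well run on another host, in which case a missing local binary is not a problem.

```python
import importlib.util
import shutil
import sys


def check_requirements(min_python=(3, 6)):
    """Return a dict mapping each requirement to True (found) or False."""
    return {
        # torchelastic requires python3 (3.6+)
        "python": sys.version_info >= min_python,
        # True if the torch package can be located for import
        "torch": importlib.util.find_spec("torch") is not None,
        # Assumption: only checks for a local etcd binary on the PATH;
        # a remote etcd server would not be detected here.
        "etcd (local binary)": shutil.which("etcd") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_requirements().items():
        print(f"{name}: {'found' if ok else 'not found'}")
```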

Installation

pip install torchelastic

Quickstart

Fault-tolerant on 4 nodes, 8 trainers/node, total 4 * 8 = 32 trainers. Run the following on all nodes.

python -m torchelastic.distributed.launch \
            --nnodes=4 \
            --nproc_per_node=8 \
            --rdzv_id=JOB_ID \
            --rdzv_backend=etcd \
            --rdzv_endpoint=ETCD_HOST:ETCD_PORT \
            YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)
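
Inside the training script, each of the 32 trainers typically needs to know its own rank and the total number of trainers. The sketch below assumes the launcher exports the usual `torch.distributed` environment variables `RANK`, `WORLD_SIZE` and `LOCAL_RANK`; check the documentation for your version to confirm. The helper `read_worker_env` and the example values are hypothetical and only illustrate the arithmetic of the command above.

```python
import os


def read_worker_env(env=None):
    """Read rank information for one trainer process.

    Assumption: the launcher exports RANK, WORLD_SIZE and LOCAL_RANK.
    The defaults describe a single standalone process.
    """
    env = os.environ if env is None else env
    return {
        "rank": int(env.get("RANK", 0)),
        "world_size": int(env.get("WORLD_SIZE", 1)),
        "local_rank": int(env.get("LOCAL_RANK", 0)),
    }


# Made-up environment for one trainer in the 4-node, 8-trainers/node job.
nnodes, nproc_per_node = 4, 8
example_env = {
    "RANK": "9",
    "WORLD_SIZE": str(nnodes * nproc_per_node),  # 4 * 8 = 32
    "LOCAL_RANK": "1",
}
info = read_worker_env(example_env)
print(info)
```

In a real script, `read_worker_env()` would be called without arguments so that it reads `os.environ`.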

Elastic on 1 to 4 nodes, 8 trainers/node, total 8 to 32 trainers. The job starts as soon as 1 node is healthy; you may then add nodes up to a total of 4.

python -m torchelastic.distributed.launch \
            --nnodes=1:4 \
            --nproc_per_node=8 \
            --rdzv_id=JOB_ID \
            --rdzv_backend=etcd \
            --rdzv_endpoint=ETCD_HOST:ETCD_PORT \
            YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)
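
The trainer counts quoted for the two commands follow directly from the `--nnodes` and `--nproc_per_node` values. The sketch below works through that arithmetic. The functions `parse_nnodes` and `trainer_range` are hypothetical helpers written for this illustration; they are not torchelastic's own argument parser, and they assume `--nnodes` takes either a single number or a `MIN:MAX` pair as in the examples above.

```python
def parse_nnodes(spec):
    """Parse 'N' or 'MIN:MAX' into a (min_nodes, max_nodes) tuple."""
    parts = spec.split(":")
    if len(parts) == 1:
        # A single number means a fixed-size (fault-tolerant) job.
        min_nodes = max_nodes = int(parts[0])
    elif len(parts) == 2:
        # MIN:MAX means an elastic job.
        min_nodes, max_nodes = int(parts[0]), int(parts[1])
    else:
        raise ValueError(f"expected N or MIN:MAX, got {spec!r}")
    if not 0 < min_nodes <= max_nodes:
        raise ValueError(f"need 0 < MIN <= MAX, got {spec!r}")
    return min_nodes, max_nodes


def trainer_range(nnodes_spec, nproc_per_node):
    """Return the (min, max) total number of trainers for a job."""
    min_nodes, max_nodes = parse_nnodes(nnodes_spec)
    return min_nodes * nproc_per_node, max_nodes * nproc_per_node


print(trainer_range("4", 8))    # (32, 32)
print(trainer_range("1:4", 8))  # (8, 32)
```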

Contributing

We welcome PRs. See the CONTRIBUTING file.

License

torchelastic is BSD licensed, as found in the LICENSE file.

Download files

Source Distribution

torchelastic-0.2.0.dev0.tar.gz (51.2 kB)

Uploaded Source

Built Distribution

torchelastic-0.2.0.dev0-py3-none-any.whl (66.8 kB)

Uploaded Python 3

File details

Details for the file torchelastic-0.2.0.dev0.tar.gz.

File metadata

  • Download URL: torchelastic-0.2.0.dev0.tar.gz
  • Upload date:
  • Size: 51.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/45.1.0.post20200127 requests-toolbelt/0.9.1 tqdm/4.42.0 CPython/3.6.10

File hashes

Hashes for torchelastic-0.2.0.dev0.tar.gz

Algorithm    Hash digest
SHA256       dfe4a8f047f480571f57079711d742f628117854945f4be4af3a8a46de36ac16
MD5          10d3d1e2b6329f05fff33af4c1f282d3
BLAKE2b-256  aafdcdd15aa73c018b3a9d82134550dd5fca9eb3e08980db36e9775b4c0daa21

File details

Details for the file torchelastic-0.2.0.dev0-py3-none-any.whl.

File metadata

  • Download URL: torchelastic-0.2.0.dev0-py3-none-any.whl
  • Upload date:
  • Size: 66.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/45.1.0.post20200127 requests-toolbelt/0.9.1 tqdm/4.42.0 CPython/3.6.10

File hashes

Hashes for torchelastic-0.2.0.dev0-py3-none-any.whl

Algorithm    Hash digest
SHA256       7450cc8d78ba6f797ce219e180ef51240ae731436fdb5955f49038a7c41b0080
MD5          fada23a98e6ce09885625b3065b6b19e
BLAKE2b-256  55d54342d920a695f43e49060badf059a0420dc5677a13c1c444f8b7be85276d
