PyTorch Elastic Training

TorchElastic

TorchElastic allows you to launch distributed PyTorch jobs in a fault-tolerant and elastic manner. For the latest documentation, please refer to our website.

Requirements

torchelastic requires:

  • python3 (3.8+)
  • torch
  • etcd
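
As a rough pre-flight check for the requirements listed above, the sketch below reports whether the interpreter is new enough and whether `torch` is importable. The helper name `check_requirements` is hypothetical and not part of torchelastic. etcd is not checked here, since it is assumed to be a service reached over the network rather than a local Python module.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 8)  # taken from the requirements list above


def check_requirements(modules=("torch",)):
    """Return a dict mapping each requirement to True (met) or False (missing).

    Hypothetical helper for illustration; not part of torchelastic.
    """
    status = {"python>=3.8": sys.version_info[:2] >= MIN_PYTHON}
    for name in modules:
        # find_spec returns None when a top-level module cannot be found.
        status[name] = importlib.util.find_spec(name) is not None
    return status


if __name__ == "__main__":
    for requirement, ok in check_requirements().items():
        print(f"{requirement}: {'ok' if ok else 'MISSING'}")
```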

Installation

pip install torchelastic

Quickstart

Fault-tolerant job on 4 nodes with 8 trainers per node, for a total of 4 * 8 = 32 trainers. Run the following on all nodes.

python -m torchelastic.distributed.launch \
            --nnodes=4 \
            --nproc_per_node=8 \
            --rdzv_id=JOB_ID \
            --rdzv_backend=etcd \
            --rdzv_endpoint=ETCD_HOST:ETCD_PORT \
            YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)
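
When the same command has to be issued on every node, it can help to assemble it programmatically. The following is a minimal sketch: `build_launch_command` is a hypothetical helper, not part of torchelastic, and it only uses the flags shown in the command above. `JOB_ID`, `ETCD_HOST:ETCD_PORT`, and the script name remain placeholders.

```python
import shlex


def build_launch_command(nnodes, nproc_per_node, rdzv_id, rdzv_endpoint,
                         script, script_args=()):
    """Assemble the launcher invocation as an argv list.

    Hypothetical helper for illustration; it mirrors the flags in the
    quickstart command and assumes the etcd rendezvous backend.
    """
    return [
        "python", "-m", "torchelastic.distributed.launch",
        f"--nnodes={nnodes}",
        f"--nproc_per_node={nproc_per_node}",
        f"--rdzv_id={rdzv_id}",
        "--rdzv_backend=etcd",
        f"--rdzv_endpoint={rdzv_endpoint}",
        script,
        *script_args,
    ]


# Placeholders below must be replaced with real values before running.
argv = build_launch_command(
    nnodes=4,
    nproc_per_node=8,
    rdzv_id="JOB_ID",
    rdzv_endpoint="ETCD_HOST:ETCD_PORT",
    script="YOUR_TRAINING_SCRIPT.py",
    script_args=["--arg1"],
)

# shlex.join produces a single shell-safe line (Python 3.8+).
print(shlex.join(argv))
```

Passing the list form to `subprocess.run(argv)` avoids shell quoting issues and removes the need for line continuations.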

Elastic job on 1 to 4 nodes with 8 trainers per node, for a total of 8 to 32 trainers. The job starts as soon as 1 node is healthy, and you may scale up to a maximum of 4 nodes.

python -m torchelastic.distributed.launch \
            --nnodes=1:4 \
            --nproc_per_node=8 \
            --rdzv_id=JOB_ID \
            --rdzv_backend=etcd \
            --rdzv_endpoint=ETCD_HOST:ETCD_PORT \
            YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)
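
The trainer counts quoted for both modes follow directly from the `--nnodes` value and `--nproc_per_node`. The sketch below works through that arithmetic. `parse_nnodes` and `trainer_range` are hypothetical names used for illustration only; the launcher does its own parsing, and this merely mirrors the `N` and `MIN:MAX` forms shown above.

```python
def parse_nnodes(spec):
    """Parse an --nnodes value: 'N' (fixed size) or 'MIN:MAX' (elastic).

    Hypothetical helper; returns a (min_nodes, max_nodes) tuple.
    """
    parts = str(spec).split(":")
    if len(parts) == 1:
        min_nodes = max_nodes = int(parts[0])
    elif len(parts) == 2:
        min_nodes, max_nodes = int(parts[0]), int(parts[1])
    else:
        raise ValueError(f"expected 'N' or 'MIN:MAX', got {spec!r}")
    if not 0 < min_nodes <= max_nodes:
        raise ValueError(f"invalid node range in {spec!r}")
    return min_nodes, max_nodes


def trainer_range(nnodes_spec, nproc_per_node):
    """Return (min_trainers, max_trainers) for a given launch configuration."""
    min_nodes, max_nodes = parse_nnodes(nnodes_spec)
    return min_nodes * nproc_per_node, max_nodes * nproc_per_node


print(trainer_range("4", 8))    # fixed size: (32, 32)
print(trainer_range("1:4", 8))  # elastic:    (8, 32)
```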

Contributing

We welcome PRs. See the CONTRIBUTING file.

License

torchelastic is BSD licensed, as found in the LICENSE file.

Download files

Source Distribution

torchelastic-0.2.1.tar.gz (64.4 kB)

Uploaded Source

Built Distributions

torchelastic-0.2.1-py3.8.egg (180.0 kB)

Uploaded Egg

torchelastic-0.2.1-py3-none-any.whl (84.0 kB)

Uploaded Python 3

File details

Details for the file torchelastic-0.2.1.tar.gz.

File metadata

  • Download URL: torchelastic-0.2.1.tar.gz
  • Upload date:
  • Size: 64.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/49.6.0.post20200917 requests-toolbelt/0.9.1 tqdm/4.50.0 CPython/3.8.5

File hashes

Hashes for torchelastic-0.2.1.tar.gz
Algorithm Hash digest
SHA256 f13b1b1fbbe4c19b9f79445c7c10c9cc23e00edafadb68d97062c82b0d183842
MD5 0ee44738bfd8e8feab3d9046d41cb665
BLAKE2b-256 0f3c1dfb5e5837b16d20bfcae8e94f14d838517d1f2d94b12821926ea75b13d5

See more details on using hashes here.

File details

Details for the file torchelastic-0.2.1-py3.8.egg.

File metadata

  • Download URL: torchelastic-0.2.1-py3.8.egg
  • Upload date:
  • Size: 180.0 kB
  • Tags: Egg
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/49.6.0.post20200917 requests-toolbelt/0.9.1 tqdm/4.50.0 CPython/3.8.5

File hashes

Hashes for torchelastic-0.2.1-py3.8.egg
Algorithm Hash digest
SHA256 44a7d26d1df30256cff47f4601d68c3226039a75859546d0a88d574e866c9c73
MD5 c98a7663a6f197dcef51e0ab4122f9cf
BLAKE2b-256 00049d87d921ecdd48be850adf6d906e9fd00db817ebfbd15685c5c7cf926093

File details

Details for the file torchelastic-0.2.1-py3-none-any.whl.

File metadata

  • Download URL: torchelastic-0.2.1-py3-none-any.whl
  • Upload date:
  • Size: 84.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/49.6.0.post20200917 requests-toolbelt/0.9.1 tqdm/4.50.0 CPython/3.8.5

File hashes

Hashes for torchelastic-0.2.1-py3-none-any.whl
Algorithm Hash digest
SHA256 d5cde4de50cfca3930bf952aaaee83b7e8425d3f1976b9f1df9626d9f4f7ae89
MD5 e7c8174b8136dc6877d545938d255656
BLAKE2b-256 c12b8d8b9227905c8aa7a8c06fc3191072345c0e74615af1d050b9e5adec3d88
