A tool for launching and running commands on multiple EC2 instances
ec2-cluster
Simple library and CLI to work with clusters of EC2 instances. Multi-purpose, but created to make deep learning distributed training infrastructure easier.
ec2-cluster is designed for simple distributed tasks where Kubernetes is overkill or where fast cluster spin up/down is crucial. It provides the ability to launch a cluster, to retrieve IP addresses for all nodes in the cluster, to delete the cluster and to execute commands on some or all of the instances.
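As an illustration of what "execute commands on some or all of the instances" can look like, here is a minimal sketch that fans a command out over ssh. It is not ec2-cluster's own API: `ssh_argv`, `run_on_nodes`, the `indices` and `runner` parameters, and the default `ubuntu` user are all hypothetical names chosen for this example, and it assumes key-based ssh access to every node is already in place.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def ssh_argv(user, ip, command):
    """Build the argv for running `command` on one host via the ssh client."""
    # BatchMode=yes makes ssh fail instead of prompting for a password.
    return ["ssh", "-o", "BatchMode=yes", f"{user}@{ip}", command]


def run_on_nodes(ips, command, user="ubuntu", indices=None, runner=None):
    """Run `command` on all nodes, or only on the nodes at `indices`.

    Returns a dict mapping each targeted IP to whatever `runner` returned.
    `runner` is injectable so the fan-out logic can be exercised without ssh.
    """
    if runner is None:
        def runner(argv):
            return subprocess.run(argv, capture_output=True, text=True)

    targets = list(ips) if indices is None else [ips[i] for i in indices]
    if not targets:
        return {}

    # One thread per target so slow nodes do not serialize the whole run.
    with ThreadPoolExecutor(max_workers=len(targets)) as pool:
        results = pool.map(lambda ip: runner(ssh_argv(user, ip, command)), targets)
        return dict(zip(targets, results))
```

Passing `indices=[0]` would target only the first node, which is a common pattern when one node acts as the training master.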
Other benefits:
- Resilient to EC2 capacity limits. If instances are not available, ec2-cluster will retry until all nodes in the cluster are created or until the user-set timeout is reached.
- Easy to quickly launch duplicate clusters for parallel training runs.
- Supports custom orchestration logic that runs when a cluster is launched, e.g. enabling passwordless ssh between all instances for Horovod-based training.
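The retry behaviour described in the first benefit can be sketched as a loop with a deadline. This is only an illustration of the idea, not ec2-cluster's implementation: `CapacityError`, `launch_with_retry`, and the `launch_fn` callback are hypothetical names, and the real library may structure its retries differently.

```python
import time


class CapacityError(Exception):
    """Stand-in for an EC2 'insufficient capacity' failure (hypothetical name)."""


def launch_with_retry(launch_fn, node_count, timeout_secs, wait_secs=5.0,
                      clock=time.monotonic, sleep=time.sleep):
    """Call launch_fn until node_count nodes exist or timeout_secs elapse.

    launch_fn(n) is a caller-supplied function that tries to create n nodes
    and returns the list of IDs it managed to create. It raises
    CapacityError when nothing could be created.
    """
    deadline = clock() + timeout_secs
    created = []
    while len(created) < node_count:
        try:
            # Only ask for the nodes that are still missing.
            created.extend(launch_fn(node_count - len(created)))
        except CapacityError:
            pass  # no capacity right now; wait below and try again
        if len(created) >= node_count:
            break
        if clock() >= deadline:
            raise TimeoutError(
                f"only {len(created)} of {node_count} nodes created before timeout")
        sleep(wait_secs)
    return created
```

The `clock` and `sleep` parameters exist only so the loop can be tested without real waiting. A caller that hits the timeout would typically delete any partially created nodes before giving up.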
Examples
Library
CLI
Goals
- Provide the minimal set of features to run distributed deep learning training jobs on EC2 instances.
- Provide libraries, not a framework or platform.
- Make cluster environments reproducible to allow for parallelization of experiments.
- Make cluster launches fast.
- Be resilient to EC2 capacity limitations.
- Encourage ephemeral infrastructure design.
- Focus on iterative, not disruptive, improvements on the common methodology of manually launching EC2 instances, ssh-ing to them, configuring environments by hand and running scripts.
Overview
ec2-cluster can be consumed in two ways:
- A CLI for launching, describing and deleting clusters.
- A Python library for scripting.
This library has three main components:
- infra: creating cluster infrastructure
- orch: orchestrating simple runtime cluster configuration (e.g. generate a hostfile with runtime IPs)
- control: running commands on the cluster
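To make the orch example concrete, generating a hostfile from runtime IPs could look roughly like the sketch below. It assumes the Open MPI style `host slots=N` line format; `build_hostfile`, `write_hostfile`, and `slots_per_host` are hypothetical names for this illustration and are not taken from ec2-cluster's API.

```python
def build_hostfile(ips, slots_per_host):
    """Return hostfile text with one 'IP slots=N' line per node."""
    if slots_per_host < 1:
        raise ValueError("slots_per_host must be at least 1")
    return "".join(f"{ip} slots={slots_per_host}\n" for ip in ips)


def write_hostfile(path, ips, slots_per_host):
    """Write the hostfile to `path`, overwriting any existing file."""
    with open(path, "w") as f:
        f.write(build_hostfile(ips, slots_per_host))
```

Here `slots_per_host` would usually be the number of GPUs (or worker processes) per instance, and `ips` the private addresses retrieved after the cluster has launched.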
CLI Quick Start
Library Quick Start
File details
Details for the file ec2_cluster-0.3.3rc2.tar.gz.
File metadata
- Download URL: ec2_cluster-0.3.3rc2.tar.gz
- Upload date:
- Size: 21.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/2.0.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/39.1.0 requests-toolbelt/0.9.1 tqdm/4.36.1 CPython/3.7.4
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0940b05a562a88389ed92f542e54cb7a056c7ddc696169f5183b75dca5fbc024 |
| MD5 | eba3bc05e6fe17ba71f70278582d8358 |
| BLAKE2b-256 | 4336b8aa2fdc72c5e06fac4d342f8c6cdf9ad27d0c0eb17a017410eb55df5dbd |
File details
Details for the file ec2_cluster-0.3.3rc2-py3-none-any.whl.
File metadata
- Download URL: ec2_cluster-0.3.3rc2-py3-none-any.whl
- Upload date:
- Size: 22.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/2.0.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/39.1.0 requests-toolbelt/0.9.1 tqdm/4.36.1 CPython/3.7.4
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 415de84d8aa9cf1582239d365ba4ae0e3ae4bd210f11e6a155871a81c7205df1 |
| MD5 | 9524081f75373b06d6f7b78b0d0ff078 |
| BLAKE2b-256 | f63ada4e83b7a09e9c5fb2c5c079d1767c42009ab44727b16cc1c50af1a4453a |