A tool for launching and running commands on multiple EC2 instances
Project description
ec2-cluster
Simple library and CLI to work with clusters of EC2 instances. Multi-purpose, but created to make deep learning distributed training infrastructure easier.
ec2-cluster
is designed for simple distributed tasks where Kubernetes is overkill or where fast cluster spin up/down is crucial. It provides the ability to launch a cluster, to retrieve IP addresses for all nodes in the cluster, to delete the cluster and to execute commands on some or all of the instances.
Other benefits:
- Resilient to EC2 capacity limits. If instances are not available,
ec2-cluster
will retry until the all nodes in the cluster are created or until the user-set timeout is reached. - Easy to quickly launch duplicate clusters for parallel training runs.
- Can write orchestration logic that needs to be run when launching a cluster, e.g. enabling passwordless ssh between all instances for Horovod-based training
Examples
Library
CLI
Goals
- Provide the minimal set of features to run distributed deep learning training jobs on EC2 instances.
- Provide libraries, not a framework or platform.
- Make cluster environments reproducible to allow for parallelization of experiments
- Make cluster launches fast
- Be resilient to EC2 capacity limitations
- Encourage ephemeral infrastructure design
- Focus on iterative, not disruptive, improvements on the common methodology of manually launching EC2 instances, ssh-ing to them, configuring environments by hand and running scripts
Overview
ec2-cluster
can be consumed in two ways:
- A CLI for launching, describing and deleting clusters.
- A python library for scripting.
This library has three main components:L
- infra: creating cluster infrastructure
- orch: orchestrating simple runtime cluster configuration (e.g. generate a hostfile with runtime IPs)
- control: running commands on the cluster
CLI Quick Start
Library Quick Start
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
ec2_cluster-0.3.3rc2.tar.gz
(21.1 kB
view hashes)
Built Distribution
Close
Hashes for ec2_cluster-0.3.3rc2-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 415de84d8aa9cf1582239d365ba4ae0e3ae4bd210f11e6a155871a81c7205df1 |
|
MD5 | 9524081f75373b06d6f7b78b0d0ff078 |
|
BLAKE2b-256 | f63ada4e83b7a09e9c5fb2c5c079d1767c42009ab44727b16cc1c50af1a4453a |