Skip to main content

A tool for launching and running commands on multiple EC2 instances

Project description

ec2-cluster

Simple library and CLI to work with clusters of EC2 instances. Multi-purpose, but created to make deep learning distributed training infrastructure easier.

ec2-cluster is designed for simple distributed tasks where Kubernetes is overkill or where fast cluster spin up/down is crucial. It provides the ability to launch a cluster, to retrieve IP addresses for all nodes in the cluster, to delete the cluster and to execute commands on some or all of the instances.

Other benefits:

  • Resilient to EC2 capacity limits. If instances are not available, ec2-cluster will retry until the all nodes in the cluster are created or until the user-set timeout is reached.
  • Easy to quickly launch duplicate clusters for parallel training runs.
  • Can write orchestration logic that needs to be run when launching a cluster, e.g. enabling passwordless ssh between all instances for Horovod-based training

Examples

Library

CLI

Goals

  • Provide the minimal set of features to run distributed deep learning training jobs on EC2 instances.
  • Provide libraries, not a framework or platform.
  • Make cluster environments reproducible to allow for parallelization of experiments
  • Make cluster launches fast
  • Be resilient to EC2 capacity limitations
  • Encourage ephemeral infrastructure design
  • Focus on iterative, not disruptive, improvements on the common methodology of manually launching EC2 instances, ssh-ing to them, configuring environments by hand and running scripts

Overview

ec2-cluster can be consumed in two ways:

  • A CLI for launching, describing and deleting clusters.
  • A python library for scripting.

This library has three main components:L

  • infra: creating cluster infrastructure
  • orch: orchestrating simple runtime cluster configuration (e.g. generate a hostfile with runtime IPs)
  • control: running commands on the cluster

CLI Quick Start

Library Quick Start

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ec2_cluster-0.3.3rc2.tar.gz (21.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

ec2_cluster-0.3.3rc2-py3-none-any.whl (22.6 kB view details)

Uploaded Python 3

File details

Details for the file ec2_cluster-0.3.3rc2.tar.gz.

File metadata

  • Download URL: ec2_cluster-0.3.3rc2.tar.gz
  • Upload date:
  • Size: 21.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/2.0.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/39.1.0 requests-toolbelt/0.9.1 tqdm/4.36.1 CPython/3.7.4

File hashes

Hashes for ec2_cluster-0.3.3rc2.tar.gz
Algorithm Hash digest
SHA256 0940b05a562a88389ed92f542e54cb7a056c7ddc696169f5183b75dca5fbc024
MD5 eba3bc05e6fe17ba71f70278582d8358
BLAKE2b-256 4336b8aa2fdc72c5e06fac4d342f8c6cdf9ad27d0c0eb17a017410eb55df5dbd

See more details on using hashes here.

File details

Details for the file ec2_cluster-0.3.3rc2-py3-none-any.whl.

File metadata

  • Download URL: ec2_cluster-0.3.3rc2-py3-none-any.whl
  • Upload date:
  • Size: 22.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/2.0.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/39.1.0 requests-toolbelt/0.9.1 tqdm/4.36.1 CPython/3.7.4

File hashes

Hashes for ec2_cluster-0.3.3rc2-py3-none-any.whl
Algorithm Hash digest
SHA256 415de84d8aa9cf1582239d365ba4ae0e3ae4bd210f11e6a155871a81c7205df1
MD5 9524081f75373b06d6f7b78b0d0ff078
BLAKE2b-256 f63ada4e83b7a09e9c5fb2c5c079d1767c42009ab44727b16cc1c50af1a4453a

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page