
A tool for launching and running commands on multiple EC2 instances

Project description

ec2-cluster

Simple CLI and Python library to spin up clusters of EC2 instances and run shell commands on them using boto3 and fabric. General-purpose, but created to make infrastructure for distributed deep learning training easier. Also very useful for running performance tests across multiple EC2 instance types.

Quickstart

This code will launch a cluster of EC2 instances, run the command hostname on all of them, collect the results of the command, and then tear down the cluster.

import ec2_cluster as ec3

with ec3.infra.ConfigCluster("cluster.yaml") as cluster:
    sh = cluster.get_shell()
    results = sh.run_on_all("hostname")
    hostnames = [result.stdout for result in results]
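
ConfigCluster reads its cluster definition from the YAML file passed to the constructor. As a hedged illustration only — the field names below are hypothetical, and the real schema is defined by ec2-cluster's documentation — a cluster.yaml might look like:

```yaml
# Hypothetical cluster.yaml. Field names are illustrative; consult the
# ec2-cluster documentation for the actual schema.
cluster_name: hostname-demo   # used to build unique EC2 Name tags
region: us-east-1
instance_type: t3.micro
num_instances: 2
ami: ami-0123456789abcdef0    # placeholder; any AMI your account can launch
key_name: my-ssh-key          # key pair used for SSH access
```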

Long-running tasks

ec2-cluster is also designed for long-running tasks where you may not want to keep your local machine awake for the full duration. The library relies on EC2 tags to keep track of EC2 instances, letting you interact with a cluster across sessions without needing an always-on control plane.

Below is one way to launch a complicated, long-running job and download the results at some later time. check_on_job() can be run in a different session, or even on a different machine, than launch_job(), as long as both have the same cluster.yaml and use the same AWS account.

import ec2_cluster as ec3

def launch_job():
    cluster = ec3.infra.ConfigCluster("cluster.yaml")
    cluster.launch(verbose=True)
    sh = cluster.get_shell()
    sh.copy_from_local_to_all("job_script.py", "job_script.py")
    sh.run_on_all("python job_script.py > job.log 2>&1 &")  # Launch script as background process


def check_on_job():
    cluster = ec3.infra.ConfigCluster("cluster.yaml")
    sh = cluster.get_shell()
    statuses = sh.run_on_all("tail -n 1 job.log", hide=True)
    for status in statuses:
        if status.stdout.rstrip("\n") != "job_script.py complete":
            print("Job is not yet complete on all instances")
            return

    print("Job is finally done on all instances!")

    sh.copy_from_all_to_local("job.log", "./results/")
    cluster.terminate(verbose=True)

This will create a directory on your local machine:

results/
├── 0
│   ├── ip.txt
│   └── job.log
├── 1
│   ├── ip.txt
│   └── job.log
└── 2
    ├── ip.txt
    └── job.log

Overview

ec2-cluster is designed for simple distributed tasks where Kubernetes is overkill. The only setup required is the ability to launch EC2 instances with boto3 and SSH access to those instances (the latter is needed only if you want to run commands on them).

ec2-cluster provides the ability to launch a cluster, retrieve the IP addresses of all nodes in the cluster, delete the cluster, and execute commands on some or all of the instances.

Unlike most cluster management tools, all cluster management in ec2-cluster happens client-side. ec2-cluster creates EC2 instances and gives them globally unique names (using the EC2 Name tag), which are later used to identify which instances belong to a given cluster. ec2-cluster then provides convenience classes to run commands on some or all of the instances via SSH (using the fabric library).
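
Because the cluster's identity lives entirely in EC2 tags, finding its members client-side is a single tagged describe_instances call. A sketch with plain boto3 — the helper names and the exact "<cluster>-*" tag pattern are my own illustration, not ec2-cluster's API:

```python
def cluster_filters(cluster_name: str) -> list:
    """Build EC2 describe_instances filters that match a cluster's
    instances by Name tag. The "<cluster>-*" pattern is an assumption
    about the naming scheme, not ec2-cluster's actual format."""
    return [
        {"Name": "tag:Name", "Values": [f"{cluster_name}-*"]},
        {"Name": "instance-state-name", "Values": ["pending", "running"]},
    ]

def find_cluster_instances(cluster_name: str, region: str) -> list:
    """Return the EC2 instance dicts belonging to the named cluster."""
    import boto3  # deferred import: the filter helper above needs no AWS deps
    ec2 = boto3.client("ec2", region_name=region)
    resp = ec2.describe_instances(Filters=cluster_filters(cluster_name))
    return [inst for r in resp["Reservations"] for inst in r["Instances"]]
```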

Other benefits

  • Resilient to EC2 capacity limits. If instances are not available, ec2-cluster will retry until all nodes in the cluster are created or until the user-set timeout is reached.
  • Easy to quickly launch duplicate clusters for parallel training runs.
  • Can run orchestration logic that is needed when launching a cluster, e.g. enabling passwordless SSH between all instances for Horovod-based training.
  • Offers iterative, not disruptive, improvements on the common methodology of manually launching EC2 instances, SSHing to them, configuring environments by hand, and running scripts.
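
The capacity-retry behavior in the first bullet can be sketched as a generic retry loop. This is an illustration of the idea, not ec2-cluster's actual implementation; launch_fn stands in for whatever call requests the instances (e.g. a boto3 run_instances wrapper) and is assumed to raise on a capacity error such as InsufficientInstanceCapacity:

```python
import time

def launch_with_retry(launch_fn, timeout_s: float, poll_s: float = 10.0):
    """Call launch_fn until it succeeds or timeout_s elapses.

    launch_fn: zero-arg callable that raises on capacity errors.
    Returns launch_fn's result, or raises TimeoutError at the deadline.
    """
    deadline = time.monotonic() + timeout_s
    last_err = None
    while time.monotonic() < deadline:
        try:
            return launch_fn()
        except Exception as err:  # e.g. InsufficientInstanceCapacity
            last_err = err
            time.sleep(poll_s)
    raise TimeoutError(f"cluster not fully launched before timeout: {last_err}")
```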

Usage

ec2-cluster can be consumed in two ways:

  • A CLI for launching, describing and deleting clusters.
  • A Python library for scripting.

This library has three main components:

  • infra: creating cluster infrastructure
  • orch: orchestrating simple runtime cluster configuration (e.g. generate a hostfile with runtime IPs)
  • control: running commands on the cluster

CLI Quick Start

Library Quick Start

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ec2_cluster-0.5.0.tar.gz (24.6 kB)

Uploaded Source

Built Distribution

ec2_cluster-0.5.0-py3-none-any.whl (25.4 kB)

Uploaded Python 3

File details

Details for the file ec2_cluster-0.5.0.tar.gz.

File metadata

  • Download URL: ec2_cluster-0.5.0.tar.gz
  • Size: 24.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/2.0.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/39.1.0 requests-toolbelt/0.9.1 tqdm/4.36.1 CPython/3.7.4

File hashes

Hashes for ec2_cluster-0.5.0.tar.gz
  • SHA256: b2dc0d2b4e485f19ba25f196569af5ca1b4bfee7adfa7b37041d7de5af6903ee
  • MD5: b898fc7d8161bc27d753e68cc7d528eb
  • BLAKE2b-256: 20324033d4793d1fc5dcbf52c279ad8361d06eeec8b4b4e22d8a6d73a65fcb91


File details

Details for the file ec2_cluster-0.5.0-py3-none-any.whl.

File metadata

  • Download URL: ec2_cluster-0.5.0-py3-none-any.whl
  • Size: 25.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/2.0.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/39.1.0 requests-toolbelt/0.9.1 tqdm/4.36.1 CPython/3.7.4

File hashes

Hashes for ec2_cluster-0.5.0-py3-none-any.whl
  • SHA256: 4bc22b6ec6589ac18aeb74f4e255bc7c600c6a4de1f9d3a9229a6cd154480707
  • MD5: 0975fadde73c4b20699832595e23c396
  • BLAKE2b-256: 6643556852e51137a7cc121d67d6739661df2802aa5da134d387774c54451d7e

