Skip to main content

A *simpler* and *cheaper* way to distribute python (training) code on machines of your choice in the (AWS) cloud.

Project description

Simple Sagemaker

A simpler and cheaper way to distribute python (training) code on machines of your choice in the (AWS) cloud.

Note: this (initial) work is still in progress, currently warps PyTorch implementation...

Requirements

  1. Python 3.6+
  2. AWS account credentials
  3. Configuration of AWS credentials for boto3, as explained on boto3 docs.

Getting started

To install Simple Sagemaker

pip install simple-sagemaker

Then, here's a very simple example. Assuming one would like to run the following worker1.py on the cloud:

import torch

for i in range(torch.cuda.device_count()):
    print(f"-***- Device {i}: {torch.cuda.get_device_properties(i)}")

It's as easy as running the following command to get it running on a ml.p3.2xlarge spot instance:

ssm -p simple-sagemaker-example-cli -t task1 -e worker1.py -o ./output/example1 --it ml.p3.2xlarge

The output, including logs is saved to ./output/example1. The relevant part from the log file (./output/example1/logs/logs0) is:

...
-***- Device 0: _CudaDeviceProperties(name='Tesla V100-SXM2-16GB', major=7, minor=0, total_memory=16160MB, multi_processor_count=80)
...

And now to a real advanced and fully featured version, yet simple to implement: TBD

</code></pre>
<h2>More examples (below)</h2>
<p>Command line based examples:</p>
<ul>
<li><a href="#Passing-command-line-arguments">Passing command line arguments</a></li>
<li><a href="#Task-state-and-output">Task state and output</a></li>
<li><a href="#Providing-input-data">Providing input data</a></li>
<li><a href="#Chaining-tasks">Chaining tasks</a></li>
<li><a href="#Defining-code-dependencies">Defining code dependencies</a></li>
<li><a href="#Configuring-the-docker-image">Configuring the docker image</a></li>
</ul>
<p>Code only:</p>
<ul>
<li><a href="#Single-file-example">Single file example</a></li>
</ul>
<h1>Main features</h1>
<ol>
<li>Simpler - Except for holding and AWS account credentials, no assumptions, AWS pre-configuration nor AWS knowledge is assumed (well, almost :). Behind the scenes you get:
<ul>
<li>Jobs IAM role creation, including policies for accesing needed S3 buckets</li>
<li>Building and uploading a customized docker image to AWS (ECS service)</li>
<li>Synchronizing local source code / input data to a S3 bucket</li>
<li>Downloading the results from S3</li>
<li>...</li>
</ul>
</li>
<li>Cheaper - <a href="https://aws.amazon.com/sagemaker/pricing/">"pay only for what you use"</a>, and save <a href="https://docs.aws.amazon.com/sagemaker/latest/dg/model-managed-spot-training.html">up to 90% of the cost</a> with spot instances, which got used by default!</li>
<li>Abstraction of how data is maintianed on AWS (S3 service)
<ul>
<li>No need to mess with S3 paths, the data is automatically</li>
<li>State is automaticall maintained between consequetive execution of <strong>jobs</strong> that belongs to the same <strong>task</strong></li>
</ul>
</li>
<li>A simple way to define how data flows between <strong>tasks</strong> of the same <strong>project</strong>, e.g. how the first <strong>task</strong>'s outputs is used as an input for a second <strong>task</strong></li>
<li>(Almost) no code changes are to youe existing code - the API is mostly wrapped by a command line interface (named <em><strong>ssm</strong></em>) to control the execution (a.k.a implement the <strong>runner</strong>, see below)
<ul>
<li>In most cases it's only about 2 line for getting the environment configuration (e.g. input/output/state paths and running parameters) and passing it on to the original code</li>
</ul>
</li>
<li>Easy customization of the docker image (based on a pre-built one)</li>
<li>The rest of the SageMaker advantages, which (mostly) behaves "normally" as defined by AWS, e.g.
<ul>
<li>(Amazon SageMaker Developer Guide)[https://docs.aws.amazon.com/sagemaker/latest/dg/whatis.html]</li>
<li>(Amazon SageMaker Python SDK @ Read the Docs)[https://sagemaker.readthedocs.io/en/stable/index.html]</li>
</ul>
</li>
</ol>
<h2>High level flow diagram</h2>
<p><img src="docs/high_level_flow.svg?raw=true" alt="High level flow diagram" title="High level flow" /></p>
<h1>Background</h1>
<p><em>Simple Sagemaker</em> is a thin warpper around SageMaker's training <strong>jobs</strong>, that makes distribution of python code on <a href="https://aws.amazon.com/sagemaker/pricing/">any supported instance type</a> <strong>very simple</strong>.</p>
<p>The solutions is composed of two parts, one on each side: a <strong>runner</strong> on the client machine, and a <strong>worker</strong> which is the distributed code on AWS.</p>
<ul>
<li>The <strong>runner</strong> is the main part of this package, can mostly be controlled by using the <strong>ssm</strong> command line interface (CLI), or be fully customized using code</li>
<li>The <strong>worker</strong> is basically your code, but a small <code>task_tollkit</code> library is injected to it, for extracting the environment configuration, i.e. input/output/state paths and running parameters.</li>
</ul>
<p>The <strong>runner</strong> is used to configure <strong>tasks</strong> and <strong>projects</strong>:</p>
<ul>
<li>A <strong>task</strong> is a logical step that runs on define input and provide output. It's defined by providing a local code path, entrypoint, and list of additional local dependencies</li>
<li>A SageMaker <strong>job</strong> is a <strong>task</strong> instance, i.e. a single <strong>job</strong> is created each time a <strong>task</strong> is executed
<ul>
<li>State is maintained between consecutive execution of the same <strong>task</strong></li>
</ul>
</li>
<li>A <em>prjoect</em> is a series of related <strong>tasks</strong>, with possible depencencies</li>
</ul>
<h1>S3</h1>
<p>TBD</p>
<h1>Examples</h1>
<h2>Passing command line arguments</h2>
<p>Any extra argument passed to the command line in assumed to be an hypermarameter.
To get access to all environment arguments, call <code>algo_lib.parseArgs()</code>. For example, see the following worker code <code>worker2.py</code>:</p>
<pre lang="python"><code>from task_toolkit import algo_lib

args = algo_lib.parseArgs()
print(args.hps["msg"])

Running command:

ssm -p simple-sagemaker-example-cli -t task2 -e worker2.py --msg "Hello, world!" -o ./output/example2

Output from the log file

Invoking script with the following command:

/opt/conda/bin/python worker2.py --msg Hello, world!

Hello, world!

Task state and output

State

State is maintained between executions of the same task, i.e. between execution jobs that belongs to the same task. The local path is available in args.state. When running multiple instances, the data is merged into a single directory, so in order to avoid collisions, algo_lib.initMultiWorkersState(args) initializes a per instance sub directory. On top of that, algo_lib provides an additional important API to mark the task as completed: algo_lib.markCompleted(args). If all instances of the job mark it as completed, the task is assumed to be completed by that job, which allows:

  1. To skip it next time (unlesss eforced otherwise)
  2. To use its output as input for other tasks (see below: "Chaining tasks")

Output

There're 3 main output mechanisms:

  1. Logs - any output writen to standard output
  2. Output data - args.output_data_dir is compressed into a tar.gz file, only the main instance data is kept
  3. Model - args.model_dir is compressed into a tar.gz file, data from all instance is merged, so be carful with collisions.

A complete example can be seen in worker3.py:

import os

from task_toolkit import algo_lib

args = algo_lib.parseArgs()

open(os.path.join(args.output_data_dir, "output_data_dir"), "wt").write(
    "output_data_dir file"
)
open(os.path.join(args.model_dir, "model_dir"), "wt").write("model_dir file")
open(os.path.join(args.state, "state_dir"), "wt").write("state_dir file")

# Mark the tasks as completed, to allow other tasks using its output, and to avoid re-running it (unless enforced)
algo_lib.markCompleted(args)

Running command:

ssm -p simple-sagemaker-example-cli -t task3 -e worker3.py -o ./output/example3

Output from the log file

Invoking script with the following command:

/opt/conda/bin/python worker2.py --msg Hello, world!

Hello, world!

Providing input data

Job can be configured to get a few data sources:

  • A single local path can be used with the -i argument. This path is synchronized to the task directory on the S3 bucket before running the task. On the worker side the data is accesible in args.input_data
  • Additional S3 paths (many) can be set as well. Each input source is provided with --iis [name] [S3 URI], and is accesible by the worker with args.input_[name] when [name] is the same one as was provided on the command line.
  • Setting an output of a another task on the same project, see below "Chaining tasks"

Assuming a local data folder containtin a single sample_data.txt file, a complete example can be seen in worker4.py:

import logging
import subprocess
import sys

from task_toolkit import algo_lib

logger = logging.getLogger(__name__)


def listDir(path):
    logger.info(f"*** START listing files in {path}")
    logger.info(
        subprocess.run(
            ["ls", "-la", "-R", path], stdout=subprocess.PIPE, universal_newlines=True
        ).stdout
    )
    logger.info(f"*** END file listing {path}")


if __name__ == "__main__":
    logging.basicConfig(stream=sys.stdout)
    algo_lib.setDebugLevel()
    args = algo_lib.parseArgs()
    listDir(args.input_data)
    listDir(args.input_bucket)

Running command:

ssm -p simple-sagemaker-example-cli -t task4 -e worker4.py -i ./data --iis bucket s3://awsglue-datasets/examples/us-legislators/all/persons.json -o ./output/example4

Output from the log file

...
INFO:__main__:*** START listing files in /opt/ml/input/data/data
INFO:__main__:/opt/ml/input/data/data:
total 12
drwxr-xr-x 2 root root 4096 Sep 14 21:51 .
drwxr-xr-x 4 root root 4096 Sep 14 21:51 ..
-rw-r--r-- 1 root root   19 Sep 14 21:51 sample_data.txt

INFO:__main__:*** END file listing /opt/ml/input/data/data
INFO:__main__:*** START listing files in /opt/ml/input/data/bucket
INFO:__main__:/opt/ml/input/data/bucket:
total 7796
drwxr-xr-x 2 root root    4096 Sep 14 21:51 .
drwxr-xr-x 4 root root    4096 Sep 14 21:51 ..
-rw-r--r-- 1 root root 7973806 Sep 14 21:51 persons.json

INFO:__main__:*** END file listing /opt/ml/input/data/bucket
...

Chaining tasks

The output of a completed task on the same project can be used as an input to another task, by using the --iit [name] [task name] [output type] command line parameter, where:

  • [name] - is the name of the input source, caccesible by the worker with args.input_[name]
  • [task name] - the name of the task whose output is used as input
  • [output type] - the task output type, one of "model", "output", "state"

Using the output of task3 and the same worker4.py code, we can now run:

ssm -p simple-sagemaker-example-cli -t task5 -e worker4.py --iit bucket task3 model -o ./output/example5

And get the following output from in the log file:

INFO:__main__:*** START listing files in 
INFO:__main__:
INFO:__main__:*** END file listing 
INFO:__main__:*** START listing files in /opt/ml/input/data/bucket
INFO:__main__:/opt/ml/input/data/bucket:
total 12
drwxr-xr-x 2 root root 4096 Sep 14 21:55 .
drwxr-xr-x 3 root root 4096 Sep 14 21:55 ..
-rw-r--r-- 1 root root  128 Sep 14 21:55 model.tar.gz

INFO:__main__:*** END file listing /opt/ml/input/data/bucket

Configuring the docker image

TBD

Defining code dependencies

TBD


Simple Sagemaker is a lightweight wrapper around AWS Sage Maker machine learning python wrapper around AWS SageMaker, to easily empower your data science projects

The idea is simple -

You define a series of tasks within a project, provide the code, define how the input and outputs flows through the tasks, set the running instance(s) parameters, and let simple-sagemaker do the rest

Single file example

A single file example can be found in the examples directory. First, define the runner:

dockerFileContent = """
# __BASE_IMAGE__ is automatically replaced with the correct base image
FROM __BASE_IMAGE__
RUN pip3 install pandas==1.1 scikit-learn==0.21.3
"""
file_path = Path(__file__).parent


def runner(project_name="simple-sagemaker-sf", prefix="", postfix="", output_path=None):
    from simple_sagemaker.sm_project import SageMakerProject

    sm_project = SageMakerProject(project_name=project_name)
    # define the code parameters
    sm_project.setDefaultCodeParams(source_dir=None, entryPoint=__file__, dependencies=[])
    # define the instance parameters
    sm_project.setDefaultInstanceParams(instance_count=2)
    # docker image
    sm_project.setDefaultImageParams(
        aws_repo_name="task_repo",
        repo_name="task_repo",
        img_tag="latest",
        docker_file_path_or_content=dockerFileContent,
    )
    image_uri = sm_project.buildOrGetImage(
        instance_type=sm_project.defaultInstanceParams.instance_type
    )
    # ceate the IAM role
    sm_project.createIAMRole()

    # *** Task 1 - process input data
    task1_name = "task1"
    # set the input data
    input_data_path = file_path.parent / "data"
    # run the task
    sm_project.runTask(
        task1_name,
        image_uri,
        distribution="ShardedByS3Key",  # distribute the input files among the workers
        hyperparameters={"worker": 1, "arg": "hello world!", "task": 1},
        input_data_path=str(input_data_path) if input_data_path.is_dir() else None,
        clean_state=True,  # clean the current state, also forces re-running
    )
    # download the results
    if not output_path:
        output_path = file_path.parent / "output"
    shutil.rmtree(output_path, ignore_errors=True)
    sm_project.downloadResults(task1_name, Path(output_path) / "output1")

An additional task that depends on the previous one can now be scheduled as well:

    # *** Task 2 - process the results of Task 1
    task2_name = "task2"
    # set the input
    additional_inputs = {
        "task2_data": sm_project.getInputConfig(task1_name, model=True),
        "task2_data_dist": sm_project.getInputConfig(
            task1_name, model=True, distribution="ShardedByS3Key"
        ),
    }
    # run the task
    sm_project.runTask(
        task2_name,
        image_uri,
        hyperparameters={"worker": 1, "arg": "hello world!", "task": 2},
        clean_state=True,  # clean the current state, also forces re-running
        additional_inputs=additional_inputs,
    )
    # download the results
    sm_project.downloadResults(task1_name, Path(output_path) / "output2")

    return sm_project

Then, the worker code (note: the same function is used for the two different tasks, depending on the task hyperparameter):

def worker():
    from task_toolkit import algo_lib

    algo_lib.setDebugLevel()

    logger.info("Starting worker...")
    # parse the arguments
    args = algo_lib.parseArgs()

    state_dir = algo_lib.initMultiWorkersState(args)

    logger.info(f"Hyperparams: {args.hps}")
    logger.info(f"Input data files: {list(Path(args.input_data).rglob('*'))}")
    logger.info(f"State files: { list(Path(args.state).rglob('*'))}")

    if int(args.hps["task"]) == 1:
        # update the state per running instance
        open(f"{state_dir}/state_{args.current_host}", "wt").write("state")
        # write to the model output directory
        for file in Path(args.input_data).rglob("*"):
            relp = file.relative_to(args.input_data)
            path = Path(args.model_dir) / (str(relp) + "_proc_by_" + args.current_host)
            path.write_text(file.read_text() + " processed by " + args.current_host)
        open(f"{args.model_dir}/output_{args.current_host}", "wt").write("output")
    elif int(args.hps["task"]) == 2:
        logger.info(f"Input task2_data: {list(Path(args.input_task2_data).rglob('*'))}")
        logger.info(
            f"Input task2_data_dist: {list(Path(args.input_task2_data_dist).rglob('*'))}"
        )

    # mark the task as completed
    algo_lib.markCompleted(args)
    logger.info("finished!")

To pack everything in a single file, we use the command line argumen --worker (as defined in the runner function) to distinguish between runner and worker runs

import logging
import shutil
import sys
from pathlib import Path

logger = logging.getLogger(__name__)

...

def main():
    logging.basicConfig(stream=sys.stdout, level=logging.INFO)
    if "--worker" in sys.argv:
        worker()
    else:
        runner()


if __name__ == "__main__":
    main()

Running the file, with a sibling directory named data with a sample file as on the example, prduces the following outputs for Task 1:

INFO:__main__:Hyperparams: {'arg': 'hello world!', 'task': 1, 'worker': 1}
INFO:__main__:Input data files: [PosixPath('/opt/ml/input/data/data/sample_data1.txt')]
INFO:__main__:State files: [PosixPath('/state/algo-1')]
INFO:task_toolkit.algo_lib:Marking instance algo-1 completion
INFO:task_toolkit.algo_lib:Creating instance specific state dir
INFO:__main__:finished!
INFO:__main__:Hyperparams: {'arg': 'hello world!', 'task': 1, 'worker': 1}
INFO:__main__:Input data files: [PosixPath('/opt/ml/input/data/data/sample_data2.txt')]
INFO:__main__:State files: [PosixPath('/state/algo-2')]
INFO:task_toolkit.algo_lib:Marking instance algo-2 completion
INFO:task_toolkit.algo_lib:Creating instance specific state dir
INFO:__main__:finished!

And the following for Task 2:

INFO:__main__:Hyperparams: {'arg': 'hello world!', 'task': 2, 'worker': 1}
INFO:__main__:Input data files: [PosixPath('task_toolkit'), PosixPath('example.py'), PosixPath('task_toolkit/algo_lib.py'), PosixPath('task_toolkit/__pycache__'), PosixPath('task_toolkit/__init__.py'), PosixPath('task_toolkit/__pycache__/__init__.cpython-38.pyc'), PosixPath('task_toolkit/__pycache__/algo_lib.cpython-38.pyc')]
INFO:__main__:State files: [PosixPath('/state/algo-1')]
INFO:__main__:Input task2_data: [PosixPath('/opt/ml/input/data/task2_data/model.tar.gz')]
INFO:__main__:Input task2_data_dist: [PosixPath('/opt/ml/input/data/task2_data_dist/model.tar.gz')]
INFO:task_toolkit.algo_lib:Marking instance algo-1 completion
INFO:task_toolkit.algo_lib:Creating instance specific state dir
INFO:__main__:Hyperparams: {'arg': 'hello world!', 'task': 1, 'worker': 1}
INFO:__main__:Input data files: [PosixPath('/opt/ml/input/data/data/sample_data2.txt')]
INFO:__main__:State files: [PosixPath('/state/algo-2')]
INFO:task_toolkit.algo_lib:Marking instance algo-2 completion
INFO:task_toolkit.algo_lib:Creating instance specific state dir
INFO:__main__:finished!

As mentioned, the complete code can be found in this directory,

Development

Pushing a code change

  1. Develop ...
  2. Format & lint
tox -e cf
tox -e lint
  1. Cleanup
tox -e clean
  1. Test
tox
  1. Generate & test coverage
tox -e report
  1. [Optionally] - bump the version string on /src/simple_sagemaker/init to allow the release of a new version
  2. Push your code to a development branch
    • Every push is tested for linting + some
  3. Create a pull request to the master branch
    • Every master push is fully tested
  4. If the tests succeed, the new version is publihed to PyPi

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

simple_sagemaker-0.9.9.tar.gz (90.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

simple_sagemaker-0.9.9-py3-none-any.whl (28.5 kB view details)

Uploaded Python 3

File details

Details for the file simple_sagemaker-0.9.9.tar.gz.

File metadata

  • Download URL: simple_sagemaker-0.9.9.tar.gz
  • Upload date:
  • Size: 90.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/50.3.0 requests-toolbelt/0.9.1 tqdm/4.49.0 CPython/3.8.5

File hashes

Hashes for simple_sagemaker-0.9.9.tar.gz
Algorithm Hash digest
SHA256 7efb79a7265e92c9e94d68376645b61f1c9fdc6cf8efad364b23d9c854b6eacd
MD5 4a1b62af5f1000552bc5a482e50e3571
BLAKE2b-256 b8e3997fb69259a125674ccebcc758a36bbead52181439ef825138962461e2af

See more details on using hashes here.

File details

Details for the file simple_sagemaker-0.9.9-py3-none-any.whl.

File metadata

  • Download URL: simple_sagemaker-0.9.9-py3-none-any.whl
  • Upload date:
  • Size: 28.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/50.3.0 requests-toolbelt/0.9.1 tqdm/4.49.0 CPython/3.8.5

File hashes

Hashes for simple_sagemaker-0.9.9-py3-none-any.whl
Algorithm Hash digest
SHA256 163ec9cb2eada75d566764a3b0ca6579016031fda93ea0a139825e32c3ab2e15
MD5 72fb3691774332d6b276728fb0359029
BLAKE2b-256 7d33df1ce1017220028b146a229dbb67f48460c0fc6e676b206508c7b4b73908

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page