TRAINS Agent - Auto-Magical DevOps for Deep Learning

Allegro Trains Agent

Deep Learning DevOps For Everyone - Now supporting all platforms (Linux, macOS, and Windows)

"All the Deep-Learning DevOps your research needs, and then some... Because ain't nobody got time for that"

Help improve Trains by filling our 2-min user survey

Trains Agent is an AI experiment cluster solution.

It is a zero-configuration, fire-and-forget execution agent that, combined with trains-server, provides a full AI cluster solution.

Full AutoML in 5 steps

1. Install the Trains Server (or use our open server)
2. pip install trains-agent (install the Trains Agent on any GPU machine: on-premises / cloud / ...)
3. Add Trains to your code with just 2 lines & run it once (on your machine / laptop)
4. Change the parameters in the UI & schedule for execution (or automate with an AutoML pipeline)
5. 📉 📈 👀 🍺

Using the Trains Agent, you can now set up a dynamic cluster with *epsilon DevOps

*epsilon - Because we are scientists 📐 and nothing is really zero work

(Experience Trains live at https://demoapp.trains.allegro.ai)

Simple, Flexible Experiment Orchestration

The Trains Agent was built to address the DL/ML R&D DevOps needs:

Easily add & remove machines from the cluster
Reuse machines without the need for any dedicated containers or images
Combine GPU resources across any cloud and on-prem
No need for yaml/json/template configuration of any kind
User friendly UI
Manageable resource allocation that can be used by researchers and engineers
Flexible and controllable scheduler with priority support
Automatic instance spinning in the cloud (coming soon)

But ... K8S?

We think Kubernetes is awesome.
Combined with KubeFlow it is a robust solution for production-grade DevOps.
We've observed, however, that it can be overkill as an R&D DL/ML solution. If you are considering K8S for your research, also consider that you will soon be managing hundreds of containers...

In our experience, handling and building the environments, packaging every experiment in a Docker image, managing those hundreds (or more) of containers, and building pipelines on top of it all is very complicated. It is also usually out of scope for the research team, and overwhelming even for the DevOps team.

We feel there has to be a better way, one that can be just as powerful for R&D while still allowing integration with K8S when the need arises.
(If you already have a K8S cluster for AI, detailed instructions on how to integrate Trains into your K8S cluster, with an included Helm chart, are available.)

Using the Trains Agent

Full scale HPC with a click of a button

The Trains Agent is a job scheduler that listens on one or more job queues, pulls jobs, sets up each job's environment, executes the job, and monitors its progress.

Any 'Draft' experiment can be scheduled for execution by a Trains agent.

A previously run experiment can be put into 'Draft' state by either of two methods:

Using the 'Reset' action from the experiment right-click context menu in the Trains UI - This will clear any results and artifacts the previous run had created.
Using the 'Clone' action from the experiment right-click context menu in the Trains UI - This will create a new 'Draft' experiment with the same configuration as the original experiment.

An experiment is scheduled for execution using the 'Enqueue' action from the experiment right-click context menu in the Trains UI and selecting the execution queue.

See creating an experiment and enqueuing it for execution.

Once an experiment is enqueued, it will be picked up and executed by a Trains agent monitoring this queue.

The Trains UI Workers & Queues page provides ongoing execution information:

Workers Tab: Monitor your cluster
- Review available resources
- Monitor machines statistics (CPU / GPU / Disk / Network)
Queues Tab:
- Control the scheduling order of jobs
- Cancel or abort job execution
- Move jobs between execution queues

What The Trains Agent Actually Does

The Trains Agent executes experiments using the following process:

Create a new virtual environment (or launch the selected Docker image)
Clone the code into the virtual environment (or inside the Docker container)
Install python packages based on the package requirements listed for the experiment
- Special note for PyTorch: The Trains Agent will automatically select the torch packages based on the CUDA_VERSION environment variable of the machine
Execute the code, while monitoring the process
Log all stdout/stderr in the Trains UI, including the cloning and installation process, for easy debugging
Monitor the execution and allow you to manually abort the job using the Trains UI (or, in the unfortunate case of a code crash, catch the error and signal the experiment has failed)
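
The process above can be sketched roughly as a list of commands plus a monitored runner. This is a simplified illustration, not the agent's actual implementation: the function names, the `/tmp/exp` working directory, and the use of a `requirements.txt` file are all assumptions made for the example.

```python
import subprocess
import sys
from typing import List, Tuple


def execution_plan(repo_url: str, commit: str, entry_point: str,
                   workdir: str = "/tmp/exp",
                   requirements: str = "requirements.txt") -> List[List[str]]:
    """Return the commands for one run, in order (hypothetical layout)."""
    venv = f"{workdir}/venv"
    code = f"{workdir}/code"
    return [
        ["python3", "-m", "venv", venv],                               # create a virtual environment
        ["git", "clone", repo_url, code],                              # clone the code
        ["git", "-C", code, "checkout", commit],                       # pin the recorded commit
        [f"{venv}/bin/pip", "install", "-r", f"{code}/{requirements}"],  # install packages
        [f"{venv}/bin/python", f"{code}/{entry_point}"],               # execute the experiment
    ]


def run_monitored(cmd: List[str]) -> Tuple[str, str]:
    """Run one command, capture stdout and stderr together, report the outcome."""
    proc = subprocess.run(cmd, stdout=subprocess.PIPE,
                          stderr=subprocess.STDOUT, text=True)
    status = "completed" if proc.returncode == 0 else "failed"
    return status, proc.stdout


if __name__ == "__main__":
    # The repository URL, commit, and script name are placeholders.
    for step in execution_plan("https://example.com/repo.git", "abc123", "train.py"):
        print(" ".join(step))
    print(run_monitored([sys.executable, "-c", "print('hello from the job')"]))
```

Merging stderr into stdout mirrors the idea of a single log stream that also covers cloning and installation output.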

System Design & Flow

                                                                              +-----------------+
                                                                              |  GPU  Machine   |
Development Machine                                                           |                 |
+------------------------+                                                    | +-------------+ |
|    Data Scientist's    |                            +--------------+        | |Trains Agent | |
|      DL/ML Code        |                            |    WEB UI    |        | |             | |
|                        |                            |              |        | | +---------+ | |
|                        |                            |              |        | | |  DL/ML  | | |
|                        |                            +--------------+        | | |  Code   | | |
|                        |       User Clones Exp #1  / . . . . . . . /        | | |         | | |
| +-------------------+  |           into Exp #2    / . . . . . . . /         | | +---------+ | |
| |      Trains       |  |         +---------------/-_____________-/          | |             | |
| +---------+---------+  |         |                                          | |      ^      | |
+-----------|------------+         |                                          | +------|------+ |
            |                      |                                          +--------|--------+
 Auto-Magically                    |                                                   |
 Creates Exp #1                    |                                      The Trains Agent
             \          User Change Hyper-Parameters                      Pulls Exp #2, setup the
             |                     |                                      environment & clone code.
             |                     |                                      Start execution with the
+------------|------------+        |            +--------------------+    new set of Hyper-Parameters.
|  +---------v---------+  |        |            |   Trains Server    |                 |
|  | Experiment #1     |  |        |            |                    |                 |
|  +-------------------+  |        |            |  Execution Queue   |                 |
|            ||           |        |            |                    |                 |
|  +-------------------+<----------+            |                    |                 |
|  |                   |  |                     |                    |                 |
|  | Experiment #2     |  |                     |                    |                 |
|  +-------------------<------------\           |                    |                 |
|                         |          ------------->---------------+  |                 |
|                         |  User Send Exp #2   | |Execute Exp #2 +--------------------+
|                         |  For Execution      | +---------------+  |
|     Trains Server       |                     |                    |
+-------------------------+                     +--------------------+

Installing the Trains Agent

pip install trains-agent

Trains Agent Usage Examples

The full interface and capabilities are listed with

trains-agent --help
trains-agent daemon --help

Configuring the Trains Agent

trains-agent init

Note: The Trains Agent uses a cache folder to cache pip packages, apt packages and cloned repositories. The default Trains Agent cache folder is ~/.trains

See full details in your configuration file at ~/trains.conf

Note: The Trains Agent extends the Trains configuration file ~/trains.conf. The two are designed to share the same configuration file; see the example here

Running the Trains Agent

For debugging and experimentation, start the Trains Agent in foreground mode, where all the output is printed to the screen

trains-agent daemon --queue default --foreground

For actual service mode, all the stdout will be stored automatically into a temporary file (no need to pipe). Note: with the --detached flag, the trains-agent runs in the background

trains-agent daemon --detached --queue default

GPU allocation is controlled via the standard OS environment variable NVIDIA_VISIBLE_DEVICES or the --gpus flag (or disabled with --cpu-only).

If no flag is set and the NVIDIA_VISIBLE_DEVICES variable doesn't exist, all GPUs will be allocated to the trains-agent
If the --cpu-only flag is set, or NVIDIA_VISIBLE_DEVICES is an empty string (""), no GPU will be allocated to the trains-agent

Example: spin two agents, one per GPU, on the same machine (with the --detached flag, the trains-agent runs in the background):
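
The allocation rules above can be written as a small decision function. This is a sketch of the rules as stated in the text, not the agent's code; `resolve_gpus` is a hypothetical name, and letting an explicit --gpus value take precedence over the environment variable is an assumption the text does not confirm.

```python
from typing import Mapping, Optional

ALL_GPUS = "all"  # marker used by this sketch only


def resolve_gpus(gpus_flag: Optional[str], cpu_only: bool,
                 env: Mapping[str, str]) -> Optional[str]:
    """Return "all", a device list such as "0,1", or None when no GPU is allocated."""
    if cpu_only:
        return None                      # --cpu-only disables GPU allocation
    if gpus_flag is not None:
        return gpus_flag                 # assumption: an explicit --gpus value wins
    if "NVIDIA_VISIBLE_DEVICES" not in env:
        return ALL_GPUS                  # no flag and no variable: all GPUs
    value = env["NVIDIA_VISIBLE_DEVICES"]
    return value if value != "" else None  # empty string: no GPU


if __name__ == "__main__":
    import os
    print(resolve_gpus(None, False, os.environ))
```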

trains-agent daemon --detached --gpus 0 --queue default
trains-agent daemon --detached --gpus 1 --queue default

Example: spin two agents pulling from a dedicated dual_gpu queue, two GPUs per agent

trains-agent daemon --detached --gpus 0,1 --queue dual_gpu
trains-agent daemon --detached --gpus 2,3 --queue dual_gpu

Starting the Trains Agent in docker mode

For debugging and experimentation, start the Trains Agent in foreground mode, where all the output is printed to the screen

trains-agent daemon --queue default --docker --foreground

For actual service mode, all the stdout will be stored automatically into a file (no need to pipe). Note: with the --detached flag, the trains-agent runs in the background

trains-agent daemon --detached --queue default --docker

Example: spin two agents, one per GPU, on the same machine, with the default nvidia/cuda docker:

trains-agent daemon --detached --gpus 0 --queue default --docker nvidia/cuda
trains-agent daemon --detached --gpus 1 --queue default --docker nvidia/cuda

Example: spin two agents pulling from a dedicated dual_gpu queue, two GPUs per agent, with the default nvidia/cuda docker:

trains-agent daemon --detached --gpus 0,1 --queue dual_gpu --docker nvidia/cuda
trains-agent daemon --detached --gpus 2,3 --queue dual_gpu --docker nvidia/cuda
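
When a machine has many GPUs, command lines like the ones above can be generated instead of typed by hand. The helper below is a hypothetical convenience sketch, not part of trains-agent; it only assembles the flags already shown in this section (--detached, --gpus, --queue, --docker) and prints the commands rather than running them.

```python
import shlex
from typing import List, Optional, Sequence


def agent_commands(gpu_groups: Sequence[Sequence[int]], queue: str,
                   docker_image: Optional[str] = None) -> List[List[str]]:
    """Build one `trains-agent daemon` command per group of GPU indices."""
    commands = []
    for group in gpu_groups:
        cmd = ["trains-agent", "daemon", "--detached",
               "--gpus", ",".join(str(gpu) for gpu in group),
               "--queue", queue]
        if docker_image is not None:
            cmd += ["--docker", docker_image]   # docker mode with the given image
        commands.append(cmd)
    return commands


if __name__ == "__main__":
    # Two agents, two GPUs each, pulling from the dual_gpu queue in docker mode.
    for cmd in agent_commands([(0, 1), (2, 3)], "dual_gpu", "nvidia/cuda"):
        print(shlex.join(cmd))
```

Each returned list can be passed to `subprocess.run` unchanged if the agents should actually be started.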

Starting the Trains Agent - Priority Queues

Priority queues are also supported. Example use case:

High priority queue: important_jobs
Low priority queue: default

trains-agent daemon --queue important_jobs default

The Trains Agent first tries to pull a job from the important_jobs queue; only if that queue has nothing to pull does it fetch a job from the default queue.
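
The pull order can be illustrated with in-memory queues. This is a minimal sketch of the priority rule only; `pull_next_job` and the job names are hypothetical, and the real agent pulls from queues held by the Trains Server.

```python
from collections import deque
from typing import Deque, Dict, Optional, Sequence, Tuple


def pull_next_job(queues: Dict[str, Deque[str]],
                  priority_order: Sequence[str]) -> Optional[Tuple[str, str]]:
    """Return (queue name, job) from the first non-empty queue, or None if all are empty."""
    for name in priority_order:
        pending = queues.get(name)
        if pending:                        # skip missing or empty queues
            return name, pending.popleft()
    return None


if __name__ == "__main__":
    queues = {
        "important_jobs": deque(["urgent-1"]),
        "default": deque(["job-a", "job-b"]),
    }
    order = ["important_jobs", "default"]  # same order as on the command line
    while (item := pull_next_job(queues, order)) is not None:
        print(item)
```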

Adding queues, managing job order within a queue and moving jobs between queues, is available using the Web UI, see example on our open server

Stopping the Trains Agent

To stop a Trains Agent running in the background, run the same command line used to start the agent with --stop appended.
For example, to stop the first of the two single-GPU docker-mode agents shown above:

trains-agent daemon --detached --gpus 0 --queue default --docker nvidia/cuda --stop
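
If the start commands are kept in a script, the matching stop command can be derived from each of them. The helper below is a hypothetical sketch that only does string handling; it does not run anything.

```python
import shlex


def stop_command(start_command: str) -> str:
    """Return the start command with --stop appended (left unchanged if already present)."""
    args = shlex.split(start_command)
    if "--stop" not in args:
        args.append("--stop")
    return shlex.join(args)


if __name__ == "__main__":
    print(stop_command(
        "trains-agent daemon --detached --gpus 0 --queue default --docker nvidia/cuda"
    ))
```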

How do I create an experiment on the Trains Server?

Integrate Trains with your code
Execute the code on your machine (Manually / PyCharm / Jupyter Notebook)
As your code is running, Trains creates an experiment logging all the necessary execution information:
- Git repository link and commit ID (or an entire jupyter notebook)
- Git diff (we’re not saying you never commit and push, but still...)
- Python packages used by your code (including specific versions used)
- Hyper-Parameters
- Input Artifacts
You now have a 'template' of your experiment with everything required for automated execution
In the Trains UI, right-click the experiment and select 'Clone'. A copy of your experiment will be created.
You now have a new draft experiment cloned from your original experiment. Feel free to edit it:
- Change the Hyper-Parameters
- Switch to the latest code base of the repository
- Update package versions
- Select a specific docker image to run in (see docker execution mode section)
- Or simply change nothing to run the same experiment again...
Schedule the newly created experiment for execution: Right-click the experiment and select 'enqueue'

Trains-Agent Services Mode

Trains-Agent Services is a special mode of Trains-Agent that provides the ability to launch long-lasting jobs that previously had to be executed on local / dedicated machines. It allows a single agent to launch multiple dockers (Tasks) for different use cases. A few examples: an auto-scaler service (spinning instances when the need arises and the budget allows), Controllers (implementing pipelines and more sophisticated DevOps logic), an Optimizer (such as hyper-parameter optimization or sweeping), and Applications (such as interactive Bokeh apps for increased data transparency).

Trains-Agent Services mode will spin any task enqueued into the specified queue. Every task launched by Trains-Agent Services will be registered as a new node in the system, providing tracking and transparency capabilities. Currently, trains-agent in services mode supports CPU-only configurations. Trains-agent services mode can be launched alongside GPU agents.

trains-agent daemon --services-mode --detached --queue services --create-queue --docker ubuntu:18.04 --cpu-only

Note: It is the user's responsibility to make sure the proper tasks are pushed into the specified queue.

AutoML and Orchestration Pipelines

The Trains Agent can also be used to implement AutoML orchestration and Experiment Pipelines in conjunction with the Trains package.

Sample AutoML & Orchestration examples can be found in the Trains example/automation folder.

AutoML examples

Toy Keras training experiment
- In order to create an experiment-template in the system, this code must be executed once manually
Random Search over the above Keras experiment-template
- This example will create multiple copies of the Keras experiment-template, with different hyper-parameter combinations
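
The sampling half of a random search can be sketched with the standard library alone. This is not the code from the Trains examples folder: `random_search` and the hyper-parameter names are hypothetical, and the step that clones the experiment-template and enqueues each copy is deliberately left out.

```python
import itertools
import random
from typing import Any, Dict, List, Sequence


def random_search(space: Dict[str, Sequence[Any]], n_trials: int,
                  seed: int = 0) -> List[Dict[str, Any]]:
    """Pick up to n_trials distinct hyper-parameter combinations at random."""
    names = sorted(space)
    grid = list(itertools.product(*(space[name] for name in names)))
    rng = random.Random(seed)              # seeded so a search can be repeated
    picked = rng.sample(grid, k=min(n_trials, len(grid)))
    return [dict(zip(names, combo)) for combo in picked]


if __name__ == "__main__":
    # Hypothetical search space for a small Keras model.
    space = {"batch_size": [32, 64, 128], "layer_1": [128, 256, 512]}
    for params in random_search(space, n_trials=4):
        # Each dictionary would be applied to one copy of the template.
        print(params)
```

Enumerating the full grid first keeps the sketch short and guarantees distinct combinations; for large search spaces, sampling each parameter independently would be the cheaper choice.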

Experiment Pipeline examples

First step experiment
- This example will "process data", and once done, will launch a copy of the 'second step' experiment-template
Second step experiment
- In order to create an experiment-template in the system, this code must be executed once manually

License

Apache License, Version 2.0 (see the LICENSE for more information)

This version

0.16.3

Dec 22, 2020