A utility for tracking and reproducing TensorFlow runs.

Machine learning engineers often run multiple versions of an algorithm concurrently, which makes it hard to keep track of runs and to reproduce them. This utility addresses the problem by maintaining a database in human-readable YAML format that tracks:

  • A unique name assigned to each run.

  • A description of each run.

  • The exact command used for the run.

  • The date and time of the run.

  • The most recent commit before the run.

Installation

The only external prerequisites of this tool are tmux and git. Once they are installed, run pip install tf-run-manager.

Important paths and files

When you run the runs new subcommand, the utility automatically creates the following directory structure:

<Runs Directory>/
    <Runs Database>
    checkpoints/
    tensorboard/<Run Name>/

Runs Database

A YAML file that stores historical information about TensorFlow runs.

Run Name

This is a unique value that you assign to each run. The runs section explains how the program deals with collisions.

checkpoints directory

Directory where model checkpoints are saved. Used in tf.train.Saver().save(sess, <checkpoints directory>/<Run Name>.ckpt).

tensorboard directory

Directory where events are saved. Used in tf.summary.FileWriter(<tensorboard directory>/<Run Name>/).
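
The layout above can be sketched in a few lines of Python. This is only an illustration of the path conventions described in this section: run_paths is a hypothetical helper, not part of the utility, and the .ckpt suffix follows the tf.train.Saver() example above.

```python
from pathlib import Path


def run_paths(runs_dir, run_name):
    """Return the checkpoint path and TensorBoard directory for one run.

    Hypothetical helper: it mirrors the documented layout and is not
    part of the utility itself.
    """
    root = Path(runs_dir)
    return {
        # <Runs Directory>/checkpoints/<Run Name>.ckpt
        "save_path": root / "checkpoints" / (run_name + ".ckpt"),
        # <Runs Directory>/tensorboard/<Run Name>/
        "tb_dir": root / "tensorboard" / run_name,
    }


paths = run_paths(".runs", "continuous0")
print(paths["save_path"])
print(paths["tb_dir"])
```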

Configuration

Runs can be extensively configured using command-line arguments, but the following values can also be configured in a .runsrc file:

name             default      description
---------------  -----------  -----------
runs-dir         .runs/       The name to use for your Runs Directory.
db-filename      .runs.yml    The filename under which your runs database is saved.
tb-dir-flag      --tb-dir     The flag passed to your program to specify <tensorboard directory>/<Run Name>/. If None, no flag is passed to your program.
save-path-flag   --save-path  The flag passed to your program to specify <checkpoints directory>/<Run Name>. If None, no flag is passed to your program.
column-width     30           The default column width for the runs table command.
virtualenv-path  None         The path to your virtual environment directory, if you're using one. Used in the following command: source <virtualenv-path>/bin/activate.

The program expects to find the .runsrc in the current working directory. The script should always be run from this directory as all file IO commands use relative paths.

Here is an example .runsrc file:

runs-dir: .lstm-runs/
db-filename: lstm-runs.yml
tb-dir-flag: None
save-path-flag: -s
column-width:
virtualenv-path: /home/ethan/virtualenvs/baselines/
extra-flags:
  - [goal-log-dir, <runs-dir>/goal-logs/<run-name>.log]
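
As a rough sketch of how such settings might be combined with the documented defaults, the Python below merges already-parsed values. Several things here are assumptions rather than the utility's actual behavior: the resolve_config helper is hypothetical, command-line values are assumed to override .runsrc values, and an empty entry such as column-width: is assumed to fall back to its default. The undocumented extra-flags entry is left out.

```python
# Defaults as listed in the Configuration section.
DEFAULTS = {
    "runs-dir": ".runs/",
    "db-filename": ".runs.yml",
    "tb-dir-flag": "--tb-dir",
    "save-path-flag": "--save-path",
    "column-width": 30,
    "virtualenv-path": None,
}


def resolve_config(file_values, cli_values):
    """Merge settings (hypothetical helper).

    Assumed precedence: defaults < .runsrc < command line.
    Entries whose value is None (unset) are skipped.
    """
    config = dict(DEFAULTS)
    for source in (file_values, cli_values):
        for key, value in source.items():
            if value is not None:
                config[key] = value
    return config


# Roughly what a YAML parser yields for the example file: the bare word
# None stays the string "None", while the empty column-width is None.
file_values = {
    "runs-dir": ".lstm-runs/",
    "db-filename": "lstm-runs.yml",
    "tb-dir-flag": "None",
    "save-path-flag": "-s",
    "column-width": None,
    "virtualenv-path": "/home/ethan/virtualenvs/baselines/",
}

config = resolve_config(file_values, {"column-width": 40})
print(config["runs-dir"], config["column-width"])
```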

Assumptions

This program tries to assume as little about your program as possible, while providing useful functionality. These assumptions are as follows:

  • You call the runs command from the same directory every time (all file IO paths are relative).

  • Your program lives in a Git repository.

  • The Git working tree is not dirty (if it is, the program will throw an informative error).

  • Your program accepts a --tb-dir flag, whose value it uses in tf.summary.FileWriter(<tb-dir>), and a --save-path flag, whose value it uses in tf.train.Saver().save(sess, <save-path>). If your program uses different flag names and you would rather not change them, specify them with the command-line arguments --tb-dir-flag and --save-path-flag or in your .runsrc (see the Configuration section). If you don't want a flag passed to your program, set --tb-dir-flag or --save-path-flag (or the corresponding value in your .runsrc) to None.
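
A program that satisfies the last assumption could parse the two flags as in the minimal sketch below. The script and the parse_args helper are hypothetical, and the TensorFlow calls are left as comments so the example runs without TensorFlow installed.

```python
import argparse


def parse_args(argv=None):
    """Parse the two flags the utility passes by default (hypothetical script)."""
    parser = argparse.ArgumentParser(description="Hypothetical training script")
    parser.add_argument("--tb-dir", default=None,
                        help="directory for TensorBoard event files")
    parser.add_argument("--save-path", default=None,
                        help="path for model checkpoints")
    return parser.parse_args(argv)


args = parse_args([
    "--tb-dir", ".runs/tensorboard/demo/",
    "--save-path", ".runs/checkpoints/demo.ckpt",
])

# In a real script these values would be handed to TensorFlow, e.g.
#   tf.summary.FileWriter(args.tb_dir)
#   tf.train.Saver().save(sess, args.save_path)
print(args.tb_dir, args.save_path)
```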

Subcommands

For detailed descriptions of each subcommand and its arguments, run

runs <subcommand> -h

new

Start a new run and build the file structure (see Important paths and files).

It will add an entry to the database keyed by name, with the following values:

  • command

  • commit

  • datetime

  • description

  • host

Finally, it will execute the command in tmux.

runs new 'run-name' 'python main.py' --description='Description of program'

Note: the --tb-dir and --save-path flags are appended to the command automatically, so do not include them in the <command> argument.
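
The two steps described for new (appending the flags and recording an entry) might look roughly like the sketch below. Both helpers are hypothetical, and the <flag>=<value> form is an assumption, since the exact way the utility appends the flags is not shown here. The entry uses the five keys listed above.

```python
import socket
from datetime import datetime


def build_command(command, flags):
    """Append '<flag>=<value>' for each flag; skip flags set to None.

    Hypothetical helper; the '=' separator is an assumption.
    """
    parts = [command]
    for flag, value in flags:
        if flag is not None:
            parts.append("{}={}".format(flag, value))
    return " ".join(parts)


def make_entry(command, commit, description):
    """Build a database entry with the five documented keys."""
    return {
        "command": command,
        "commit": commit,
        "datetime": datetime.now().isoformat(),
        "description": description,
        "host": socket.gethostname(),
    }


full_command = build_command(
    "python main.py",
    [("--tb-dir", ".runs/tensorboard/run-name/"),
     ("--save-path", ".runs/checkpoints/run-name.ckpt")],
)
entry = make_entry(full_command, "9fb9b5a", "Description of program")
print(full_command)
```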

delete

Delete all runs whose names match a pattern. This command also deletes the associated tensorboard and checkpoint files.

❯ runs delete "continuous.*"
Delete the following runs?
continuous0
continuous1
continuous21509805012
continuous2
continuous11509804959
continuous3
continuous31509805040

list

List all runs whose names match a pattern.

❯ runs list --pattern="continuous.*"
continuous21509805012
continuous0
continuous11509804959
continuous31509805040
continuous1
continuous2
continuous3
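
Both delete and list select runs by regular expression. The sketch below shows one way such a filter could work. matching_runs is a hypothetical helper, and anchoring the pattern at the start of the name with re.match is an assumption about the matching rule.

```python
import re


def matching_runs(names, pattern):
    """Return the run names that match the regex from their first character.

    Hypothetical helper; the utility's own matching rule may differ.
    """
    regex = re.compile(pattern)
    return [name for name in names if regex.match(name)]


names = ["continuous0", "room-cnn-no-current-pos", "continuous11509804959"]
selected = matching_runs(names, "continuous.*")
print(selected)
```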

table

Display the entries in the runs database in table form.

❯ runs table
name                           command                            commit                             datetime                    description                          host
-----------------------------  ---------------------------------  ---------------------------------  --------------------------  ---------------------------------  ------
continuous2                    CUDA_VISIBLE_DEVICES=1 python ...  90c0ad704e54d5152d897a4e978cc7...  2017-11-03T13:46:48.633364  Run multiple runs to test stoc...    rldl3
continuous3                    CUDA_VISIBLE_DEVICES=1 python ...  90c0ad704e54d5152d897a4e978cc7...  2017-11-03T13:47:09.951233  Run multiple runs to test stoc...    _
continuous1                    CUDA_VISIBLE_DEVICES=1 python ...  90c0ad704e54d5152d897a4e978cc7...  2017-11-03T13:42:39.879031  Run multiple runs to test stoc...    _
house-cnn-no-current-pos       python train.py --timesteps-pe...  9fb9b5a                            2017-10-28T18:07:44.246089  This is the refactored CNN on ...    _
room-with-original-cnn         python run_custom.py --timeste...  8a5e1c2                            2017-10-28T17:09:49.971061  Test original cnn on room.mjcf       _
continuous11509804959          CUDA_VISIBLE_DEVICES=1 python ...  90c0ad704e54d5152d897a4e978cc7...  2017-11-04T10:15:59.373633  Run multiple runs to test stoc...    _
continuous31509805040          CUDA_VISIBLE_DEVICES=1 python ...  90c0ad704e54d5152d897a4e978cc7...  2017-11-04T10:17:20.286275  Run multiple runs to test stoc...    rldl4
room-cnn-no-current-pos        python train.py --timesteps-pe...  2873fbf                            2017-10-28T18:08:10.615461  This is the refactored CNN on ...    rldl4
continuous21509805012          CUDA_VISIBLE_DEVICES=1 python ...  90c0ad704e54d5152d897a4e978cc7...  2017-11-04T10:16:52.129656  Run multiple runs to test stoc...    _

To filter by regex, use the --pattern flag.
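
In the sample output, values longer than the column width of 30 appear cut to 30 characters and followed by an ellipsis. A minimal sketch of that behavior, inferred from the output alone and not taken from the utility's source:

```python
def truncate(value, width=30):
    """Cut text to `width` characters and append '...' if it was longer.

    Hypothetical helper inferred from the sample table output.
    """
    text = str(value)
    if len(text) <= width:
        return text
    return text[:width] + "..."


print(truncate("da6030dd973c810c330d9635eb8d9c2105bdfe2f"))
print(truncate("9fb9b5a"))
```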

lookup

Look up a specific value associated with a database entry.

❯ runs lookup continuous0 commit
da6030dd973c810c330d9635eb8d9c2105bdfe2f

reproduce

Print out the commands for reproducing a run.

❯ runs reproduce continuous0
To reproduce:
 git checkout da6030dd973c810c330d9635eb8d9c2105bdfe2f
 runs new continuous0 'python run_custom.py --timesteps-per-batch=2048 --continuous-actions --neg-reward --use-cnn' --description='None'
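
The reproduce output can be assembled from a database entry alone, as in the sketch below. reproduce_lines is a hypothetical helper that copies the format of the sample output above. It is not the utility's own code.

```python
def reproduce_lines(name, entry):
    """Format the commands needed to rerun `name` from its database entry.

    Hypothetical helper modeled on the sample output.
    """
    return [
        "To reproduce:",
        " git checkout {}".format(entry["commit"]),
        " runs new {} '{}' --description='{}'".format(
            name, entry["command"], entry["description"]
        ),
    ]


entry = {
    "commit": "da6030dd973c810c330d9635eb8d9c2105bdfe2f",
    "command": "python run_custom.py --timesteps-per-batch=2048 "
               "--continuous-actions --neg-reward --use-cnn",
    "description": None,
}
lines = reproduce_lines("continuous0", entry)
print("\n".join(lines))
```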
