TensorFlow on Mesos

Project description

Join the chat at https://gitter.im/douban/tfmesos

TFMesos is a lightweight framework that helps run distributed TensorFlow machine learning tasks on Apache Mesos within Docker and Nvidia-Docker.

TFMesos dynamically allocates resources from a Mesos cluster, builds a distributed TensorFlow training cluster, and keeps different training tasks managed and isolated in the shared Mesos cluster with the help of Docker.

Prerequisites

  • For Mesos >= 1.0.0:

  1. Mesos Cluster (cf: Mesos Getting Started). All nodes in the cluster should be reachable via their hostnames, and all nodes must have identical /etc/passwd and /etc/group files.

  2. Set up the Mesos Agent to enable the Mesos Containerizer and Mesos Nvidia GPU Support (optional), e.g.: mesos-agent --containerizers=mesos --image_providers=docker --isolation=filesystem/linux,docker/runtime,cgroups/devices,gpu/nvidia

  3. (optional) A distributed filesystem (e.g. MooseFS)

  4. Ensure the latest TFMesos Docker image (tfmesos/tfmesos) is pulled on every node in the cluster

  • For Mesos < 1.0.0:

  1. Mesos Cluster (cf: Mesos Getting Started). All nodes in the cluster should be reachable via their hostnames, and all nodes must have identical /etc/passwd and /etc/group files.

  2. Docker (cf: Docker Get Started Tutorial)

  3. Mesos Docker Containerizer Support (cf: Mesos Docker Containerizer)

  4. (optional) Nvidia-docker installation (cf: Nvidia-docker installation); make sure the nvidia-docker-plugin is accessible from remote hosts (started with -l 0.0.0.0:3476)

  5. (optional) A distributed filesystem (e.g. MooseFS)

  6. Ensure the latest TFMesos Docker image (tfmesos/tfmesos) is pulled on every node in the cluster

If you are using AWS G2 instances, here is a sample script to set up most of these prerequisites.

Running a simple test

After setting up Mesos and pulling the Docker image on a single node (or a cluster), you should be able to run a simple test with the following command.

$ docker run -e MESOS_MASTER=mesos-master:5050 \
    -e DOCKER_IMAGE=tfmesos/tfmesos \
    --net=host \
    -v /path-to-your-tfmesos-code/tfmesos/examples/plus.py:/tmp/plus.py \
    --rm \
    -it \
    tfmesos/tfmesos \
    python /tmp/plus.py mesos-master:5050

Successfully running the test should result in an output of 42 on the console.
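
Under the hood, the test script drives the cluster through the tfmesos Python API. Below is a minimal sketch of what a script like plus.py does, assuming the cluster context manager exported by the package (yielding a mapping from device names to gRPC targets) and TensorFlow 1.x session APIs; consult the bundled example for the exact code.

import sys
import tensorflow as tf
from tfmesos import cluster  # assumed public API of the tfmesos package

def main(argv):
    mesos_master = argv[1]
    jobs_def = [
        dict(name='ps', num=1),      # one parameter-server task
        dict(name='worker', num=1),  # one worker task
    ]
    # `cluster` allocates the tasks on Mesos and yields gRPC targets
    # keyed by device name (assumed interface).
    with cluster(jobs_def, master=mesos_master) as targets:
        with tf.device('/job:ps/task:0'):
            a = tf.constant(10)
        with tf.device('/job:worker/task:0'):
            b = tf.constant(32)
            op = a + b
        # Run the graph against the worker's server; prints 42.
        with tf.Session(targets['/job:worker/task:0']) as sess:
            print(sess.run(op))

if __name__ == '__main__':
    main(sys.argv)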

Running in replica mode

This mode is called Between-graph replication in the official Distributed TensorFlow HowTo.

Most of the distributed training models that Google has open-sourced (such as mnist_replica and inception) use this mode. In this mode, two kinds of jobs are defined, named ‘ps’ and ‘worker’: ‘ps’ tasks act as parameter servers and ‘worker’ tasks run the actual training process.
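
Scripts written for this mode share a standard skeleton (plain distributed TensorFlow 1.x, not TFMesos-specific): parse the host lists and task identity from flags, start a tf.train.Server, let ‘ps’ tasks block in server.join(), and let workers train with variables pinned to the parameter servers. A hedged sketch with flag names matching the tfrun commands below; the toy model stands in for the real one:

import argparse
import tensorflow as tf  # TensorFlow 1.x distributed APIs

parser = argparse.ArgumentParser()
parser.add_argument('--ps_hosts', required=True)      # comma-separated
parser.add_argument('--worker_hosts', required=True)  # comma-separated
parser.add_argument('--job_name', required=True)      # 'ps' or 'worker'
parser.add_argument('--worker_index', type=int, required=True)
FLAGS = parser.parse_args()

# Every task builds the same cluster description from the flags.
cluster = tf.train.ClusterSpec({
    'ps': FLAGS.ps_hosts.split(','),
    'worker': FLAGS.worker_hosts.split(','),
})
server = tf.train.Server(cluster, job_name=FLAGS.job_name,
                         task_index=FLAGS.worker_index)

if FLAGS.job_name == 'ps':
    server.join()  # parameter servers just host variables forever
else:
    # Variables land on the ps tasks, ops stay on this worker.
    with tf.device(tf.train.replica_device_setter(
            worker_device='/job:worker/task:%d' % FLAGS.worker_index,
            cluster=cluster)):
        global_step = tf.train.get_or_create_global_step()
        w = tf.Variable(0.0)       # toy parameter (the real model goes here)
        loss = tf.square(w - 1.0)  # toy objective
        train_op = tf.train.GradientDescentOptimizer(0.1).minimize(
            loss, global_step=global_step)

    hooks = [tf.train.StopAtStepHook(last_step=100)]
    with tf.train.MonitoredTrainingSession(
            master=server.target,
            is_chief=(FLAGS.worker_index == 0),
            hooks=hooks) as sess:
        while not sess.should_stop():
            sess.run(train_op)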

Here we use our modified ‘mnist_replica’ as an example:

  1. Check out the mnist example code into a directory on a shared filesystem, e.g. /nfs/mnist

  2. Assume the Mesos master is mesos-master:5050

  3. Now we can launch this script using the following commands:

CPU:

$ docker run --rm -it -e MESOS_MASTER=mesos-master:5050 \
             --net=host \
             -v /nfs/mnist:/nfs/mnist \
             -v /etc/passwd:/etc/passwd:ro \
             -v /etc/group:/etc/group:ro \
             -u `id -u` \
             -w /nfs/mnist \
             tfmesos/tfmesos \
             tfrun -w 1 -s 1  \
             -V /nfs/mnist:/nfs/mnist \
             -- python mnist_replica.py \
             --ps_hosts {ps_hosts} --worker_hosts {worker_hosts} \
             --job_name {job_name} --worker_index {task_index}

GPU (1 GPU per worker):

$ nvidia-docker run --rm -it -e MESOS_MASTER=mesos-master:5050 \
             --net=host \
             -v /nfs/mnist:/nfs/mnist \
             -v /etc/passwd:/etc/passwd:ro \
             -v /etc/group:/etc/group:ro \
             -u `id -u` \
             -w /nfs/mnist \
             tfmesos/tfmesos \
             tfrun -w 1 -s 1 -Gw 1 -- python mnist_replica.py \
             --ps_hosts {ps_hosts} --worker_hosts {worker_hosts} \
             --job_name {job_name} --worker_index {task_index}

Note:

In this mode, tfrun is used to prepare the cluster and launch the training script on each node; worker #0 (the chief worker) is launched in the local container. tfrun substitutes {ps_hosts}, {worker_hosts}, {job_name}, and {task_index} with the corresponding values for each task.
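
Conceptually, the substitution is plain string formatting; the values below are hypothetical and stand in for whatever hosts Mesos actually assigns:

# What tfrun's substitution amounts to (illustrative values only):
template = ('python mnist_replica.py '
            '--ps_hosts {ps_hosts} --worker_hosts {worker_hosts} '
            '--job_name {job_name} --worker_index {task_index}')
print(template.format(
    ps_hosts='node-a:31000',      # hypothetical hosts picked by Mesos
    worker_hosts='node-b:31000',
    job_name='worker',            # this task's job
    task_index=0,                 # worker #0, the chief
))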

Running in fine-grained mode

This mode is called In-graph replication in the official Distributed TensorFlow HowTo.

In this mode, we have more control over the cluster spec. All nodes in the cluster are remote and run only a gRPC server; each worker is driven by a local thread that runs its training task.
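
A hedged sketch of this mode, again assuming the tfmesos cluster context manager from the simple test above: one local graph spans every remote worker, tf.device splits the work, and a single session drives the whole cluster.

import tensorflow as tf
from tfmesos import cluster  # assumed API, as in the simple test above

jobs_def = [
    dict(name='ps', num=1),
    dict(name='worker', num=2),  # two remote gRPC workers
]
with cluster(jobs_def, master='mesos-master:5050') as targets:
    partials = []
    # One graph spans all workers; each computes a partial sum.
    for i in range(2):
        with tf.device('/job:worker/task:%d' % i):
            partials.append(tf.reduce_sum(tf.fill([10], i + 1.0)))
    with tf.device('/job:ps/task:0'):
        total = tf.add_n(partials)  # aggregate on the parameter server
    # A single local session drives the whole cluster via one target.
    with tf.Session(targets['/job:worker/task:0']) as sess:
        print(sess.run(total))      # 10*1.0 + 10*2.0 = 30.0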

Here we use our modified mnist as an example:

  1. Check out the mnist example code into a directory, e.g. /tmp/mnist

  2. Assume the Mesos master is mesos-master:5050

  3. Now we can launch this script using the following commands:

CPU:

$ docker run --rm -it -e MESOS_MASTER=mesos-master:5050 \
             --net=host \
             -v /tmp/mnist:/tmp/mnist \
             -v /etc/passwd:/etc/passwd:ro \
             -v /etc/group:/etc/group:ro \
             -u `id -u` \
             -w /tmp/mnist \
             tfmesos/tfmesos \
             python mnist.py

GPU (1 GPU per worker):

$ nvidia-docker run --rm -it -e MESOS_MASTER=mesos-master:5050 \
             --net=host \
             -v /tmp/mnist:/tmp/mnist \
             -v /etc/passwd:/etc/passwd:ro \
             -v /etc/group:/etc/group:ro \
             -u `id -u` \
             -w /tmp/mnist \
             tfmesos/tfmesos \
             python mnist.py --worker-gpus 1

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

tfmesos-0.0.10.tar.gz (10.7 kB, Source)

Built Distributions

tfmesos-0.0.10-py3-none-any.whl (11.6 kB, Python 3)

tfmesos-0.0.10-py2-none-any.whl (11.6 kB, Python 2)

File details

Details for the file tfmesos-0.0.10.tar.gz.

File metadata

  • Download URL: tfmesos-0.0.10.tar.gz
  • Upload date:
  • Size: 10.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.11.0 pkginfo/1.4.2 requests/2.19.1 setuptools/40.2.0 requests-toolbelt/0.8.0 tqdm/4.25.0 CPython/2.7.14

File hashes

Hashes for tfmesos-0.0.10.tar.gz

  • SHA256: 556050dfbb41b8808d62d87926ca4fadbf25b84733ea0b752fa9f54da61172a8
  • MD5: 7cbbe54c340c86da29dad49ce29cc4df
  • BLAKE2b-256: 29b0e8b2fc4f3429c24343f5690db219500364f80e6b6c84a098ab9fc6e787e5

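To verify a downloaded file against the digests listed above, the standard-library hashlib module is enough; a minimal sketch for the sdist:

import hashlib

# Compare the local file's SHA256 against the digest listed above.
EXPECTED = '556050dfbb41b8808d62d87926ca4fadbf25b84733ea0b752fa9f54da61172a8'

h = hashlib.sha256()
with open('tfmesos-0.0.10.tar.gz', 'rb') as f:
    for chunk in iter(lambda: f.read(8192), b''):
        h.update(chunk)

assert h.hexdigest() == EXPECTED, 'hash mismatch: file may be corrupted'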

File details

Details for the file tfmesos-0.0.10-py3-none-any.whl.

File metadata

  • Download URL: tfmesos-0.0.10-py3-none-any.whl
  • Upload date:
  • Size: 11.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.11.0 pkginfo/1.4.2 requests/2.19.1 setuptools/40.2.0 requests-toolbelt/0.8.0 tqdm/4.25.0 CPython/3.5.6

File hashes

Hashes for tfmesos-0.0.10-py3-none-any.whl

  • SHA256: 8e36faef79207d8ada1ad6e84e9389f7c894aa65af895dbd55876f763f6f84ed
  • MD5: 81ce13704c55c196cd8aab21ae2dc712
  • BLAKE2b-256: 69a1630ee501c688b8a339cb6fbff8f76ce37dc25643ccc2ad670951b7c0cf80


File details

Details for the file tfmesos-0.0.10-py2-none-any.whl.

File metadata

  • Download URL: tfmesos-0.0.10-py2-none-any.whl
  • Upload date:
  • Size: 11.6 kB
  • Tags: Python 2
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.11.0 pkginfo/1.4.2 requests/2.19.1 setuptools/40.2.0 requests-toolbelt/0.8.0 tqdm/4.25.0 CPython/2.7.14

File hashes

Hashes for tfmesos-0.0.10-py2-none-any.whl

  • SHA256: 2b5f06c2c3337e1202d230d5c65d3c282ef8e51f4e0e2969ed5a68e7a34c103c
  • MD5: 5deca30671fc493e3136167c55569282
  • BLAKE2b-256: 971233c741eb4f2e557ef0e8a71da2749f31ac7ee56a34bd305f79a397d96f42

