tfmesos

TensorFlow on Mesos

Intended Audience
- Developers
License
- OSI Approved :: BSD License
Operating System
- POSIX :: Linux
Programming Language
- Python
- Python :: Implementation :: CPython
Topic
- Software Development :: Libraries

Project description

Join the chat at https://gitter.im/douban/tfmesos

TFMesos is a lightweight framework for running distributed TensorFlow machine learning tasks on Apache Mesos inside Docker and Nvidia-Docker containers.

TFMesos dynamically allocates resources from a Mesos cluster, builds a distributed training cluster for TensorFlow, and uses Docker to keep different training tasks managed and isolated within the shared Mesos cluster.

Prerequisites

For Mesos >= 1.0.0:

- A Mesos cluster (cf: Mesos Getting Started). All nodes in the cluster should be reachable by hostname, and all nodes should have identical /etc/passwd and /etc/group files.
- Set up the Mesos agent to enable the Mesos Containerizer and, optionally, Mesos Nvidia GPU support, e.g.: mesos-agent --containerizers=mesos --image_providers=docker --isolation=filesystem/linux,docker/runtime,cgroups/devices,gpu/nvidia
- (optional) A distributed filesystem (e.g. MooseFS).
- Ensure the latest TFMesos Docker image (tfmesos/tfmesos) is pulled on every node in the cluster.

For Mesos < 1.0.0:

- A Mesos cluster (cf: Mesos Getting Started). All nodes in the cluster should be reachable by hostname, and all nodes should have identical /etc/passwd and /etc/group files.
- Docker (cf: Docker Get Start Tutorial).
- Mesos Docker Containerizer support (cf: Mesos Docker Containerizer).
- (optional) Nvidia-docker (cf: Nvidia-docker installation). Make sure nvidia-plugin is accessible from remote hosts (with -l 0.0.0.0:3476).
- (optional) A distributed filesystem (e.g. MooseFS).
- Ensure the latest TFMesos Docker image (tfmesos/tfmesos) is pulled on every node in the cluster.
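
Both prerequisite lists require that nodes be reachable by hostname. The following is a minimal sketch of a pre-flight check, not part of TFMesos: it only tests name resolution from the machine it runs on, which is necessary but not sufficient for reachability. The function name and the node names in the usage comment are hypothetical.

```python
import socket


def unresolved_hosts(hostnames):
    """Return the hostnames that fail name resolution on this machine."""
    failed = []
    for name in hostnames:
        try:
            socket.gethostbyname(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            failed.append(name)
    return failed


# Usage (hypothetical node names, replace with your master and agents):
#   unresolved_hosts(["mesos-master", "mesos-agent-1", "mesos-agent-2"])
```

Running the same check on every node helps confirm that each node can resolve all the others, not only that one machine can resolve them.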

If you are using an AWS G2 instance, a sample script is available to set up most of these prerequisites.

Running a simple test

After setting up Mesos and pulling the Docker image on a single node (or a cluster), you should be able to run a simple test with the following command.

$ docker run -e MESOS_MASTER=mesos-master:5050 \
    -e DOCKER_IMAGE=tfmesos/tfmesos \
    --net=host \
    -v /path-to-your-tfmesos-code/tfmesos/examples/plus.py:/tmp/plus.py \
    --rm \
    -it \
    tfmesos/tfmesos \
    python /tmp/plus.py mesos-master:5050

A successful run prints 42 to the console.
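
If the simple test needs to be launched from a script (for example, as a smoke test after cluster setup), the same command can be assembled as an argument list. This is a sketch under assumptions: the function name is hypothetical, and it uses only the flags and paths shown in the command above.

```python
import shlex


def simple_test_command(master, example_path, image="tfmesos/tfmesos"):
    """Build the argv list for the simple-test container shown above."""
    return [
        "docker", "run",
        "-e", "MESOS_MASTER=" + master,
        "-e", "DOCKER_IMAGE=" + image,
        "--net=host",
        "-v", example_path + ":/tmp/plus.py",
        "--rm", "-it",
        image,
        "python", "/tmp/plus.py", master,
    ]


argv = simple_test_command(
    "mesos-master:5050",
    "/path-to-your-tfmesos-code/tfmesos/examples/plus.py",
)
# Print a shell-safe rendering of the command for inspection.
print(" ".join(shlex.quote(arg) for arg in argv))
```

Passing `argv` to `subprocess.run(argv, check=True)` would execute it. When no terminal is attached, the `-it` flags may need to be dropped.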

Running in replica mode

This mode is called between-graph replication in the official Distributed TensorFlow how-to.

Most distributed training models that Google has open-sourced (such as mnist_replica and inception) use this mode. Two kinds of jobs are defined, named ‘ps’ and ‘worker’: ‘ps’ tasks act as parameter servers, and ‘worker’ tasks run the actual training process.
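
To make the two job types concrete, the sketch below shows how a training script might read its role from the command line. It is not the code of mnist_replica.py. It assumes the flag names used by the mnist_replica commands in this section and assumes that the host lists arrive as comma-separated host:port strings; the function name is hypothetical.

```python
import argparse


def parse_task(argv):
    """Parse the per-task flags and return (cluster, job_name, index, is_chief)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--ps_hosts", required=True)
    parser.add_argument("--worker_hosts", required=True)
    parser.add_argument("--job_name", choices=["ps", "worker"], required=True)
    parser.add_argument("--worker_index", type=int, default=0)
    args = parser.parse_args(argv)

    # Assumption: each host list is a comma-separated string of host:port entries.
    cluster = {
        "ps": args.ps_hosts.split(","),
        "worker": args.worker_hosts.split(","),
    }
    # Worker #0 is treated as the chief worker.
    is_chief = args.job_name == "worker" and args.worker_index == 0
    return cluster, args.job_name, args.worker_index, is_chief


cluster, job_name, index, is_chief = parse_task([
    "--ps_hosts", "node1:2222",          # hypothetical addresses
    "--worker_hosts", "node2:2222,node3:2222",
    "--job_name", "worker",
    "--worker_index", "0",
])
print(cluster, job_name, index, is_chief)
```

In TensorFlow 1.x, a dictionary of this shape is what `tf.train.ClusterSpec` accepts. A ‘ps’ task would then typically block serving parameters, while a ‘worker’ task builds and runs the training graph.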

Here we use our modified ‘mnist_replica’ as an example:

- Check out the mnist example code into a directory on the shared filesystem, e.g. /nfs/mnist
- Assume the Mesos master is mesos-master:5050
- Now launch the script with the following commands:

CPU:

$ docker run --rm -it -e MESOS_MASTER=mesos-master:5050 \
             --net=host \
             -v /nfs/mnist:/nfs/mnist \
             -v /etc/passwd:/etc/passwd:ro \
             -v /etc/group:/etc/group:ro \
             -u `id -u` \
             -w /nfs/mnist \
             tfmesos/tfmesos \
             tfrun -w 1 -s 1  \
             -V /nfs/mnist:/nfs/mnist \
             -- python mnist_replica.py \
             --ps_hosts {ps_hosts} --worker_hosts {worker_hosts} \
             --job_name {job_name} --worker_index {task_index}

GPU (1 GPU per worker):

$ nvidia-docker run --rm -it -e MESOS_MASTER=mesos-master:5050 \
             --net=host \
             -v /nfs/mnist:/nfs/mnist \
             -v /etc/passwd:/etc/passwd:ro \
             -v /etc/group:/etc/group:ro \
             -u `id -u` \
             -w /nfs/mnist \
             tfmesos/tfmesos \
             tfrun -w 1 -s 1 -Gw 1 -- python mnist_replica.py \
             --ps_hosts {ps_hosts} --worker_hosts {worker_hosts} \
             --job_name {job_name} --worker_index {task_index}

Note:

In this mode, tfrun prepares the cluster and launches the training script on each node; worker #0 (the chief worker) is launched in the local container. For each task, tfrun replaces {ps_hosts}, {worker_hosts}, {job_name}, and {task_index} with the values that apply to that task.
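
The placeholder substitution can be illustrated with plain string formatting. This is only a sketch of the idea, not tfrun's implementation: the function name and host addresses are hypothetical, and joining each host list with commas is an assumption.

```python
def render_task_command(template, cluster, job_name, task_index):
    """Fill the four placeholders in a command template for one task."""
    return template.format(
        ps_hosts=",".join(cluster["ps"]),
        worker_hosts=",".join(cluster["worker"]),
        job_name=job_name,
        task_index=task_index,
    )


template = (
    "python mnist_replica.py "
    "--ps_hosts {ps_hosts} --worker_hosts {worker_hosts} "
    "--job_name {job_name} --worker_index {task_index}"
)
# Hypothetical cluster with one 'ps' task and one 'worker' task.
cluster = {"ps": ["node1:2222"], "worker": ["node2:2222"]}

commands = [
    render_task_command(template, cluster, job, index)
    for job, hosts in sorted(cluster.items())
    for index in range(len(hosts))
]
for command in commands:
    print(command)
```

Every task receives the same host lists but its own job name and index, which is how a single command line can describe the whole cluster.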

Running in fine-grained mode

This mode is called in-graph replication in the official Distributed TensorFlow how-to.

In this mode, we have more control over the cluster spec. All nodes in the cluster are remote and run only a gRPC server. Each worker is driven by a local thread that runs the training task.
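
The "one local thread per remote worker" pattern can be sketched without TensorFlow. The code below is an illustration under assumptions, not the modified mnist script: `run_workers`, `train_fn`, and the gRPC target strings are hypothetical, and in a real script `train_fn` would open a session against its target and run training steps.

```python
import threading


def run_workers(targets, train_fn):
    """Drive each remote worker from its own local thread and collect results."""
    results = [None] * len(targets)

    def drive(index, target):
        # Each thread writes only to its own slot, so no lock is needed.
        results[index] = train_fn(index, target)

    threads = [
        threading.Thread(target=drive, args=(index, target))
        for index, target in enumerate(targets)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


# Hypothetical gRPC targets of two remote workers.
targets = ["grpc://node2:2222", "grpc://node3:2222"]
print(run_workers(targets, lambda index, target: "worker %d -> %s" % (index, target)))
```

An exception raised inside `train_fn` is not propagated to the caller in this sketch; the corresponding result slot simply stays `None`.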

Here we use our modified mnist as an example:

- Check out the mnist example code into a directory, e.g. /tmp/mnist
- Assume the Mesos master is mesos-master:5050
- Now launch the script with the following commands:

CPU:

$ docker run --rm -it -e MESOS_MASTER=mesos-master:5050 \
             --net=host \
             -v /tmp/mnist:/tmp/mnist \
             -v /etc/passwd:/etc/passwd:ro \
             -v /etc/group:/etc/group:ro \
             -u `id -u` \
             -w /tmp/mnist \
             tfmesos/tfmesos \
             python mnist.py

GPU (1 GPU per worker):

$ nvidia-docker run --rm -it -e MESOS_MASTER=mesos-master:5050 \
             --net=host \
             -v /tmp/mnist:/tmp/mnist \
             -v /etc/passwd:/etc/passwd:ro \
             -v /etc/group:/etc/group:ro \
             -u `id -u` \
             -w /tmp/mnist \
             tfmesos/tfmesos \
             python mnist.py --worker-gpus 1

Release history

- 0.0.10 (this version): Aug 23, 2018
- 0.0.9: Mar 14, 2018
- 0.0.8: Jan 9, 2018
- 0.0.6: Aug 16, 2017
- 0.0.5: Jul 12, 2017
- 0.0.4: May 11, 2017
- 0.0.3: Apr 21, 2017
- 0.0.2: Dec 22, 2016
- 0.0.1: Nov 17, 2016

Download files

Source Distribution

- tfmesos-0.0.10.tar.gz (10.7 kB), uploaded Aug 23, 2018, Source

Built Distributions

- tfmesos-0.0.10-py3-none-any.whl (11.6 kB), uploaded Aug 23, 2018, Python 3
- tfmesos-0.0.10-py2-none-any.whl (11.6 kB), uploaded Aug 23, 2018, Python 2

File details

Details for the file tfmesos-0.0.10.tar.gz.

File metadata

Download URL: tfmesos-0.0.10.tar.gz
Upload date: Aug 23, 2018
Size: 10.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.11.0 pkginfo/1.4.2 requests/2.19.1 setuptools/40.2.0 requests-toolbelt/0.8.0 tqdm/4.25.0 CPython/2.7.14

File hashes

Hashes for tfmesos-0.0.10.tar.gz
Algorithm	Hash digest
SHA256	`556050dfbb41b8808d62d87926ca4fadbf25b84733ea0b752fa9f54da61172a8`
MD5	`7cbbe54c340c86da29dad49ce29cc4df`
BLAKE2b-256	`29b0e8b2fc4f3429c24343f5690db219500364f80e6b6c84a098ab9fc6e787e5`
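
A downloaded file can be compared against the SHA256 digest listed above. The sketch below uses only the Python standard library; the function name is hypothetical, and the local path in the usage comment assumes the archive was saved to the current directory.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# SHA256 digest listed for tfmesos-0.0.10.tar.gz.
EXPECTED_SHA256 = "556050dfbb41b8808d62d87926ca4fadbf25b84733ea0b752fa9f54da61172a8"

# Usage (assumes the archive is in the current directory):
#   sha256_of_file("tfmesos-0.0.10.tar.gz") == EXPECTED_SHA256
```

The same function works for the wheel files when compared against their own listed digests.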

File details

Details for the file tfmesos-0.0.10-py3-none-any.whl.

File metadata

Download URL: tfmesos-0.0.10-py3-none-any.whl
Upload date: Aug 23, 2018
Size: 11.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.11.0 pkginfo/1.4.2 requests/2.19.1 setuptools/40.2.0 requests-toolbelt/0.8.0 tqdm/4.25.0 CPython/3.5.6

File hashes

Hashes for tfmesos-0.0.10-py3-none-any.whl
Algorithm	Hash digest
SHA256	`8e36faef79207d8ada1ad6e84e9389f7c894aa65af895dbd55876f763f6f84ed`
MD5	`81ce13704c55c196cd8aab21ae2dc712`
BLAKE2b-256	`69a1630ee501c688b8a339cb6fbff8f76ce37dc25643ccc2ad670951b7c0cf80`

File details

Details for the file tfmesos-0.0.10-py2-none-any.whl.

File metadata

Download URL: tfmesos-0.0.10-py2-none-any.whl
Upload date: Aug 23, 2018
Size: 11.6 kB
Tags: Python 2
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.11.0 pkginfo/1.4.2 requests/2.19.1 setuptools/40.2.0 requests-toolbelt/0.8.0 tqdm/4.25.0 CPython/2.7.14

File hashes

Hashes for tfmesos-0.0.10-py2-none-any.whl
Algorithm	Hash digest
SHA256	`2b5f06c2c3337e1202d230d5c65d3c282ef8e51f4e0e2969ed5a68e7a34c103c`
MD5	`5deca30671fc493e3136167c55569282`
BLAKE2b-256	`971233c741eb4f2e557ef0e8a71da2749f31ac7ee56a34bd305f79a397d96f42`
