
TensorFlow Cluster on Ray

Project description

The raytf framework provides a simple interface for distributed training on Ray. It is intended to cover TensorFlow, PyTorch, and MXNet. TensorFlow is supported now; the others will be added later.

Quick Start

Tested only with Python 3.6.

  1. Install the latest version of Ray: pip install ray

  2. Install the latest version of raytf: pip install raytf

  3. Clone this project: git clone https://github.com/zuston/raytf.git

  4. Enter the example folder and run the Python script with the following commands:

cd raytf
cd example
python mnist.py

How to Use

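from raytf.raytf_driver import Driver

# When running on a single local machine, start Ray first:
# import ray
# ray.init()

# Describe the resources that each TensorFlow role needs.
tf_cluster = Driver.build(
    resources={
        "ps": {"cores": 2, "memory": 2, "gpu": 2, "instances": 1},
        "worker": {"cores": 2, "memory": 2, "gpu": 2, "instances": 1},
        "chief": {"cores": 2, "memory": 2, "gpu": 2, "instances": 1},
    },
    event_log="/tmp/opal/4",        # setting this starts a sidecar TensorBoard
    resources_reserved_timeout=10,  # seconds to wait for resources
)

# process is your own training function, defined elsewhere.
tf_cluster.start(model_process=process, args=None)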

This training code attaches to an existing permanent Ray cluster. To debug, call ray.init() to start a Ray cluster on your local machine instead.
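For context on the ps, worker, and chief roles used in the resources mapping: TensorFlow's distributed Estimator training conventionally reads the cluster layout from the TF_CONFIG environment variable, a JSON object with a cluster entry and a task entry. The sketch below derives such an object from a resources mapping. The helper name build_tf_config, the localhost addresses, and the port numbering are hypothetical, and whether raytf builds TF_CONFIG this way internally is an assumption.

```python
import json


def build_tf_config(resources, task_type, task_index, base_port=2222):
    """Build a TF_CONFIG-style dict from a raytf-like resources mapping.

    Hosts and ports are placeholders (hypothetical); a real cluster would
    use the addresses of the processes that were actually started.
    """
    cluster = {}
    port = base_port
    for role, spec in resources.items():
        addresses = []
        for _ in range(spec["instances"]):
            addresses.append(f"localhost:{port}")
            port += 1
        cluster[role] = addresses

    # Reject a task that does not exist in the cluster layout.
    if task_type not in cluster or not 0 <= task_index < len(cluster[task_type]):
        raise ValueError(f"no task {task_type}:{task_index} in cluster")

    return {"cluster": cluster, "task": {"type": task_type, "index": task_index}}


resources = {
    "ps": {"cores": 2, "memory": 2, "gpu": 2, "instances": 1},
    "worker": {"cores": 2, "memory": 2, "gpu": 2, "instances": 2},
    "chief": {"cores": 2, "memory": 2, "gpu": 2, "instances": 1},
}

# The configuration that the second worker process would see.
tf_config = build_tf_config(resources, "worker", 1)
print(json.dumps(tf_config, indent=2))
```

Each process in the cluster gets the same cluster entry and a different task entry, which is how a process knows which role it plays.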

When you specify event_log in the builder, a sidecar TensorBoard is started on one worker.

raytf already supports gang scheduling: resources are held only when everything the TensorFlow cluster requires can be satisfied at once. raytf also lets you configure how long to wait for resources, through the resources_reserved_timeout argument in the code above. Its unit is seconds.
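The all-or-nothing behaviour described above can be sketched in plain Python. This is a conceptual illustration only, not raytf's implementation: the function names, the get_available callback, and the polling interval are all hypothetical.

```python
import time


def total_demand(resources):
    """Sum the per-role requests into one cluster-wide demand."""
    demand = {"cores": 0, "memory": 0, "gpu": 0}
    for spec in resources.values():
        for key in demand:
            demand[key] += spec[key] * spec["instances"]
    return demand


def reserve_all_or_nothing(resources, get_available, timeout_s, poll_s=0.1):
    """Return the demand once it fits entirely; raise TimeoutError otherwise.

    get_available is a hypothetical callable that returns the currently
    free capacity as a dict with the same keys as the demand.
    """
    demand = total_demand(resources)
    deadline = time.monotonic() + timeout_s
    while True:
        free = get_available()
        # Gang scheduling: proceed only if every requirement is met at once.
        if all(free.get(key, 0) >= need for key, need in demand.items()):
            return demand
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"resources not available within {timeout_s}s: "
                f"need {demand}, free {free}"
            )
        time.sleep(poll_s)


resources = {
    "ps": {"cores": 2, "memory": 2, "gpu": 2, "instances": 1},
    "worker": {"cores": 2, "memory": 2, "gpu": 2, "instances": 1},
    "chief": {"cores": 2, "memory": 2, "gpu": 2, "instances": 1},
}
print(total_demand(resources))  # {'cores': 6, 'memory': 6, 'gpu': 6}
```

Holding nothing until the whole request fits avoids a partially started cluster, for example workers running while the parameter server is still waiting for a slot.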

How to build and deploy

<Requirement> python -m pip install twine

  1. python setup.py bdist_wheel --universal

  2. python -m pip install xxxxxx.whl

  3. twine upload dist/*

Tips

  1. To solve Python module import problems on a permanent Ray cluster, this project requires Ray 1.5 or later. See this RFC (https://github.com/ray-project/ray/issues/14019).

  2. This project has only been tested with TensorFlow Estimator training.
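The Ray version requirement in the first tip can be checked before a job is submitted. The sketch below is a minimal, dependency-free comparison. The helper names are hypothetical, and it treats a pre-release such as 1.5.0rc1 as satisfying the minimum, which is a simplification.

```python
import re

MIN_RAY_VERSION = (1, 5)


def parse_version(version):
    """Extract the leading numeric components, e.g. '1.5.0rc1' -> (1, 5, 0)."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component, e.g. 'dev0'
        parts.append(int(match.group()))
    return tuple(parts)


def ray_version_supported(version, minimum=MIN_RAY_VERSION):
    """Return True if the version string is at least the minimum."""
    return parse_version(version) >= minimum


for candidate in ("1.4.1", "1.5.0", "2.0.0.dev0"):
    print(candidate, ray_version_supported(candidate))
```

In a real script the version string would come from ray.__version__ after importing ray.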


