scoach

A setup for training TensorFlow models on SLURM clusters.

How does it work?

  • Inputs needed (see the examples directory):
    • A .json file with the training parameters
    • A .json file with the model definition
    • A .py file with the training code
  • scoach provides a CLI app that drives the rest of the workflow:
    • Run scoach init to set up your configuration file, following config_example.yaml.
    • On the login machine of the SLURM cluster, run scoach start. This starts a daemon that launches jobs as they are requested.
    • From any machine, run scoach run submit to submit jobs. This uploads the Python script to MinIO and submits the configurations to the database.
    • The daemon process consumes the new runs, uses Jinja2 to render the training script, and submits it to the cluster.
    • The training script then runs on the cluster using Dask workers, whose number grows as needed.
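
The render step can be pictured with a minimal sketch. It is not scoach's own code: scoach uses Jinja2, while this example swaps in Python's standard-library string.Template so that it runs without extra dependencies. The parameter names (job_name, gpus, epochs), the template text, and the train.py file name are all hypothetical.

```python
import json
from string import Template

# Hypothetical training parameters, standing in for the parameters .json file.
PARAMS_JSON = '{"job_name": "demo", "gpus": 1, "epochs": 5}'

# Hypothetical batch-script template. scoach renders its script with Jinja2;
# string.Template is used here only as a dependency-free stand-in.
BATCH_TEMPLATE = Template(
    "#!/bin/bash\n"
    "#SBATCH --job-name=$job_name\n"
    "#SBATCH --gres=gpu:$gpus\n"
    "python train.py --epochs $epochs\n"
)


def render_batch_script(params_json: str) -> str:
    """Fill the template with values parsed from a JSON parameter string."""
    params = json.loads(params_json)
    # substitute() raises KeyError if the JSON lacks a placeholder's value.
    return BATCH_TEMPLATE.substitute(params)


script = render_batch_script(PARAMS_JSON)
print(script)
```

Keeping the parameters in JSON and the script in a template means the same training code can be resubmitted with different settings by changing only the parameter file.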

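The hand-off between scoach run submit and the daemon follows a producer/consumer pattern. Below is a minimal sketch of that pattern, assuming an in-memory queue in place of the database and a stub in place of the real cluster submission. The names pending_runs, submit_run, drain_pending, and fake_submit are hypothetical and not part of scoach.

```python
from collections import deque
from typing import Callable, Deque, Dict, List

# Stand-in for the database table of submitted runs (hypothetical).
pending_runs: Deque[Dict[str, str]] = deque()


def submit_run(script_name: str, params: str) -> None:
    """Client side: record a run request for the daemon to pick up."""
    pending_runs.append({"script": script_name, "params": params})


def drain_pending(launch: Callable[[Dict[str, str]], str]) -> List[str]:
    """Daemon side: consume every pending run in FIFO order and launch it."""
    job_ids = []
    while pending_runs:
        run = pending_runs.popleft()
        job_ids.append(launch(run))
    return job_ids


def fake_submit(run: Dict[str, str]) -> str:
    """Stub launcher; a real daemon would hand a rendered script to SLURM."""
    return f"job-for-{run['script']}"


submit_run("train_a.py", '{"epochs": 5}')
submit_run("train_b.py", '{"epochs": 10}')
launched = drain_pending(fake_submit)
print(launched)
```

A long-running daemon would repeat the drain step in a loop, pausing between polls. Passing the launcher in as a function keeps the queue logic testable without a cluster.
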
To do

  • Add a --local option to scoach start for launching runs locally
  • Add support for uploading and managing datasets
  • Avoid duplicate Python scripts
