Setup for training TensorFlow models on SLURM clusters.
scoach
A setup for training TensorFlow models on SLURM clusters
How does it work?
- Inputs needed (see the examples directory):
  - A .json file with the training parameters
  - A .json file with the model definition
  - A .py file with the training code
- There's a CLI app for interacting with scoach:
  - Run scoach init to set up your configuration file, such as the one in config_example.yaml.
  - On the login machine of the SLURM cluster, run scoach start. This starts a daemon that launches jobs as they are requested.
  - On any machine, run scoach run submit to submit jobs. This uploads the Python script to MinIO and submits the configurations to the database.
- The new runs are consumed by the daemon process, which uses Jinja2 to render the training script and submits it to the cluster.
- The training script then runs on the cluster using Dask workers, whose number grows as needed.
- A
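The rendering step performed by the daemon can be sketched as follows. This is only an illustration of how Jinja2 can turn run parameters into a SLURM batch script; it is not scoach's actual template or code. The template text, the function name render_job_script, the time_limit argument, and the parameter names (epochs, batch_size) are all assumptions made for the example.

```python
import json
import shlex

from jinja2 import Template

# Hypothetical batch-script template. scoach's real template is not shown
# in this README, so every field below is an assumption.
SBATCH_TEMPLATE = Template(
    "#!/bin/bash\n"
    "#SBATCH --job-name={{ run_name }}\n"
    "#SBATCH --time={{ time_limit }}\n"
    "python {{ script }} --params {{ params_arg }}\n"
)


def render_job_script(run_name, script, params, time_limit="01:00:00"):
    """Render a SLURM batch script for one run (illustrative only)."""
    # Serialize the parameters as JSON and quote them for the shell,
    # so they reach the training script as a single argument.
    params_arg = shlex.quote(json.dumps(params, sort_keys=True))
    return SBATCH_TEMPLATE.render(
        run_name=run_name,
        script=script,
        time_limit=time_limit,
        params_arg=params_arg,
    )


# Example run with made-up parameters.
job = render_job_script("demo-run", "train.py", {"epochs": 5, "batch_size": 32})
print(job)
```

In a real daemon, the rendered text would typically be written to a file and handed to sbatch; that step is left out here because it requires a SLURM installation.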
To do
- Add a --local option to scoach start for launching runs locally
- Add support for uploading/managing datasets
- No Python script duplicates
Download files
- Source distribution: scoach-0.1.9.tar.gz (25.7 kB)
- Built distribution: scoach-0.1.9-py3-none-any.whl (41.6 kB)