
MLFlow-Slurm

Backend for executing MLFlow projects on the Slurm batch system.

Usage

Install this package in the environment from which you will be submitting jobs. If you submit jobs from inside other jobs, make sure this package is also listed in those jobs' conda or pip environments.
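For instance, a conda environment.yml for a project whose jobs submit follow-on jobs might include the package like this (the environment name and versions are placeholders):

```yaml
name: my-mlflow-project    # placeholder environment name
dependencies:
  - python=3.10
  - pip
  - pip:
      - mlflow
      - mlflow-slurm       # this backend package
```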

Pass slurm as the --backend for the run, along with a JSON config file that controls how the batch script is constructed:

mlflow run --backend slurm \
           --backend-config slurm_config.json \
           examples/sklearn_elasticnet_wine

This generates a batch script named after the run ID and submits it via the Slurm sbatch command. The run is tagged with the Slurm job ID.

Configure Jobs

You can set values in a JSON file to control job submission. The supported properties in this file are:

Config File Setting   Use
partition             Which Slurm partition the job should run in
account               Account name to run under
environment           List of additional environment variables to add to the job
exports               List of environment variables to export to the job
gpus_per_node         On GPU partitions, how many GPUs to allocate per node
gres                  Slurm generic resource (GRES) requests
mem                   Amount of memory to allocate to CPU jobs
modules               List of modules to load before starting the job
nodes                 Number of nodes to request from Slurm
ntasks                Number of tasks to run on each node
exclusive             Set to true to ensure the job does not share a node with other jobs
time                  Maximum time the job may run
sbatch-script-file    Name of the batch file to produce. Leave blank to generate a script file name based on the run ID
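For example, a slurm_config.json using a few of these settings might look like the following sketch (the partition, account, module, and variable names are site-specific placeholders, and exact accepted value formats may vary by version):

```json
{
  "partition": "gpu",
  "account": "my_project_account",
  "gpus_per_node": 2,
  "nodes": 1,
  "ntasks": 1,
  "mem": "16G",
  "time": "4:00:00",
  "modules": ["cuda"],
  "exports": ["MY_DATA_DIR"]
}
```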

Sequential Worker Jobs

There are occasions where a job cannot finish within the maximum allowable wall time. If you are able to write out a checkpoint file, you can use sequential worker jobs to continue the job where it left off. This is useful for training deep learning models and other long-running jobs.

To use this, provide the sequential_workers parameter to the mlflow run command:

  mlflow run --backend slurm -c ../../slurm_config.json -P sequential_workers=3 .

This submits the job as normal, but also submits 3 additional jobs, each depending on the previous one. As soon as the first job terminates, the next job starts, continuing until all jobs have completed.
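A minimal sketch of the checkpoint pattern this relies on, assuming a JSON checkpoint file; the file name, step counter, and train function below are illustrative, not part of this package:

```python
import json
import os

CHECKPOINT = "checkpoint.json"  # illustrative file name, not mandated by mlflow-slurm


def train(total_steps):
    """Run up to total_steps units of work, resuming from a checkpoint if one exists."""
    start = 0
    if os.path.exists(CHECKPOINT):
        # A previous worker ran out of wall time; pick up where it stopped.
        with open(CHECKPOINT) as f:
            start = json.load(f)["step"]
    for step in range(start, total_steps):
        # ... one unit of real training work goes here ...
        with open(CHECKPOINT, "w") as f:
            json.dump({"step": step + 1}, f)  # record progress after each step
    return start  # the step this worker resumed from
```

Each sequential worker runs the same entry point; only the presence of the checkpoint file distinguishes a fresh start from a resume.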

Development

The Slurm docker deployment is handy for testing and development. You can start up a Slurm environment with the included docker-compose file.

