Skip to main content

Machine Learning Experiment Job Scheduler

Project description

Lightweight Cluster/Cloud VM Job Management 🚀

Pyversions PyPI version Code style: black codecov Colab

Are you looking for a tool to manage your training runs locally, on Slurm/Open Grid Engine clusters, SSH servers or Google Cloud Platform VMs? mle-scheduler provides a lightweight API to launch and monitor job queues. It smoothly orchestrates simultaneous runs for different configurations and/or random seeds. It is meant to reduce boilerplate and to make job resource specification intuitive. It comes with two core pillars:

  • MLEJob: Launches and monitors a single job on a resource (Slurm, Open Grid Engine, GCP, SSH, etc.).
  • MLEQueue: Launches and monitors a queue of jobs with different training configurations and/or seeds.

For a quickstart check out the notebook blog or the example scripts 📖

Colab Local Slurm Grid Engine SSH GCP

Installation ⏳

A PyPI installation is available via:

pip install mle-scheduler

If you want to get the most recent commit, please install directly from the repository:

pip install git+https://github.com/mle-infrastructure/mle-hyperopt.git@main

Managing a Single Job with MLEJob Locally 🚀

from mle_scheduler import MLEJob

# python train.py -config base_config_1.yaml -exp_dir logs_single -seed_id 1
job = MLEJob(
    resource_to_run="local",
    job_filename="train.py",
    config_filename="base_config_1.yaml",
    experiment_dir="logs_single",
    seed_id=1
)

_ = job.run()

Managing a Queue of Jobs with MLEQueue Locally 🚀...🚀

from mle_scheduler import MLEQueue

# python train.py -config base_config_1.yaml -seed 0 -exp_dir logs_queue/<date>_base_config_1
# python train.py -config base_config_1.yaml -seed 1 -exp_dir logs_queue/<date>_base_config_1
# python train.py -config base_config_2.yaml -seed 0 -exp_dir logs_queue/<date>_base_config_2
# python train.py -config base_config_2.yaml -seed 1 -exp_dir logs_queue/<date>_base_config_2
queue = MLEQueue(
    resource_to_run="local",
    job_filename="train.py",
    config_filenames=["base_config_1.yaml",
                      "base_config_2.yaml"],
    random_seeds=[0, 1],
    experiment_dir="logs_queue"
)

queue.run()

Launching Slurm Cluster-Based Jobs 🐒

# Each job requests 5 CPU cores & 1 V100S GPU & loads CUDA 10.0
job_args = {
    "partition": "<SLURM_PARTITION>",  # Partition to schedule jobs on
    "env_name": "mle-toolbox",  # Env to activate at job start-up
    "use_conda_venv": True,  # Whether to use anaconda venv
    "num_logical_cores": 5,  # Number of requested CPU cores per job
    "num_gpus": 1,  # Number of requested GPUs per job
    "gpu_type": "V100S",  # GPU model requested for each job
    "modules_to_load": "nvidia/cuda/10.0"  # Modules to load at start-up
}

queue = MLEQueue(
    resource_to_run="slurm-cluster",
    job_filename="train.py",
    job_arguments=job_args,
    config_filenames=["base_config_1.yaml",
                      "base_config_2.yaml"],
    experiment_dir="logs_slurm",
    random_seeds=[0, 1]
)
queue.run()

Launching GridEngine Cluster-Based Jobs 🐘

# Each job requests 5 CPU cores & 1 V100S GPU w. CUDA 10.0 loaded
job_args = {
    "queue": "<GRID_ENGINE_QUEUE>",  # Queue to schedule jobs on
    "env_name": "mle-toolbox",  # Env to activate at job start-up
    "use_conda_venv": True,  # Whether to use anaconda venv
    "num_logical_cores": 5,  # Number of requested CPU cores per job
    "num_gpus": 1,  # Number of requested GPUs per job
    "gpu_type": "V100S",  # GPU model requested for each job
    "gpu_prefix": "cuda"  #$ -l {gpu_prefix}="{num_gpus}"
}

queue = MLEQueue(
    resource_to_run="sge-cluster",
    job_filename="train.py",
    job_arguments=job_args,
    config_filenames=["base_config_1.yaml",
                      "base_config_2.yaml"],
    experiment_dir="logs_grid_engine",
    random_seeds=[0, 1]
)
queue.run()

Launching SSH Server-Based Jobs 🦊

ssh_settings = {
    "user_name": "<SSH_USER_NAME>",  # SSH server user name
    "pkey_path": "<PKEY_PATH>",  # Private key path (e.g. ~/.ssh/id_rsa)
    "main_server": "<SSH_SERVER>",  # SSH Server address
    "jump_server": '',  # Jump host address
    "ssh_port": 22,  # SSH port
    "remote_dir": "mle-code-dir",  # Dir to sync code to on server
    "start_up_copy_dir": True,  # Whether to copy code to server
    "clean_up_remote_dir": True  # Whether to delete remote_dir on exit
}

job_args = {
    "env_name": "mle-toolbox",  # Env to activate at job start-up
    "use_conda_venv": True  # Whether to use anaconda venv
}

queue = MLEQueue(
    resource_to_run="ssh-node",
    job_filename="train.py",
    config_filenames=["base_config_1.yaml",
                      "base_config_2.yaml"],
    random_seeds=[0, 1],
    experiment_dir="logs_ssh_queue",
    job_arguments=job_args,
    ssh_settings=ssh_settings)

queue.run()

Launching GCP VM-Based Jobs 🦄

cloud_settings = {
    "project_name": "<GCP_PROJECT_NAME>",  # Name of your GCP project
    "bucket_name": "<GCS_BUCKET_NAME>", # Name of your GCS bucket
    "remote_dir": "<GCS_CODE_DIR_NAME>",  # Name of code dir in bucket
    "start_up_copy_dir": True,  # Whether to copy code to bucket
    "clean_up_remote_dir": True  # Whether to delete remote_dir on exit
}

job_args = {
    "num_gpus": 0,  # Number of requested GPUs per job
    "gpu_type": None,  # GPU requested e.g. "nvidia-tesla-v100"
    "num_logical_cores": 1,  # Number of requested CPU cores per job
}

queue = MLEQueue(
    resource_to_run="gcp-cloud",
    job_filename="train.py",
    config_filenames=["base_config_1.yaml",
                      "base_config_2.yaml"],
    random_seeds=[0, 1],
    experiment_dir="logs_gcp_queue",
    job_arguments=job_args,
    cloud_settings=cloud_settings,
)
queue.run()

Citing the MLE-Infrastructure ✏️

If you use mle-scheduler in your research, please cite it as follows:

@software{mle_infrastructure2021github,
  author = {Robert Tjarko Lange},
  title = {{MLE-Infrastructure}: A Set of Lightweight Tools for Distributed Machine Learning Experimentation},
  url = {http://github.com/mle-infrastructure},
  year = {2021},
}

Development 👷

You can run the test suite via python -m pytest -vv tests/. If you find a bug or are missing your favourite feature, feel free to create an issue and/or start contributing 🤗.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

mle_scheduler-0.0.8.tar.gz (28.6 kB view details)

Uploaded Source

Built Distribution

mle_scheduler-0.0.8-py3-none-any.whl (33.6 kB view details)

Uploaded Python 3

File details

Details for the file mle_scheduler-0.0.8.tar.gz.

File metadata

  • Download URL: mle_scheduler-0.0.8.tar.gz
  • Upload date:
  • Size: 28.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.12.5

File hashes

Hashes for mle_scheduler-0.0.8.tar.gz
Algorithm Hash digest
SHA256 765b804ef7ff78c43cebd34e3b28b1bfe830883275722bf7d65dce2adcdadf67
MD5 c5781f9c7237d4215201cea124cc6dbe
BLAKE2b-256 b941f6158fbda5790ca4de728c49475bba331adef825f66cac8e3095b0cf351b

See more details on using hashes here.

File details

Details for the file mle_scheduler-0.0.8-py3-none-any.whl.

File metadata

File hashes

Hashes for mle_scheduler-0.0.8-py3-none-any.whl
Algorithm Hash digest
SHA256 115870e97fb7db894b6ce44c52d68169645b90cfeb5d9a535aea777b2515cbed
MD5 f87823dab3f17b554d8c0d7db7019653
BLAKE2b-256 b17ed345896ae6588afd2750345416b56c2ba8b03b01700b04eea611ebfd24d9

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page