slurm sweeps

A simple tool to perform parameter sweeps on SLURM clusters.

The main motivation was to provide a lightweight implementation of ASHA (the Asynchronous Successive Halving Algorithm) for SLURM clusters that is fully compatible with PyTorch Lightning's DDP.

It is heavily inspired by tools like Ray Tune and Optuna. However, on a SLURM cluster, these tools can be complicated to set up and introduce considerable overhead.

Slurm sweeps is simple, lightweight, and has few dependencies. It uses SLURM Job Steps to run the individual trials.

Installation

pip install slurm-sweeps

Dependencies

cloudpickle
numpy
pandas
pyyaml

Usage

You can run the following example on your laptop. By default, the maximum number of parallel trials equals the number of CPUs on your machine.

""" Content of test_ss.py """
from time import sleep
import slurm_sweeps as ss


# Define your train function
def train(cfg: dict):
    for epoch in range(cfg["epochs"]):
        sleep(0.5)
        loss = (cfg["parameter"] - 1) ** 2 * epoch
        # log your metrics
        ss.log({"loss": loss}, epoch)


# Define your experiment
experiment = ss.Experiment(
    train=train,
    cfg={
        "epochs": 10,
        "parameter": ss.Uniform(0, 2),
    },
    asha=ss.ASHA(metric="loss", mode="min"),
)


# Run your experiment
dataframe = experiment.run(n_trials=1000)

# Your results are stored in a pandas DataFrame
print(f"\nBest trial:\n{dataframe.sort_values('loss').iloc[0]}")

Or submit it to a SLURM cluster. Write a small SLURM script test_ss.slurm that runs the code above:

#!/bin/bash -l
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=18
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=1GB

python test_ss.py

By default, this runs $SLURM_NTASKS trials in parallel. In the case above: 2 nodes * 18 tasks per node = 36 trials.

Then submit it to the queue:

sbatch test_ss.slurm

See the tests folder for an advanced example of training a PyTorch model with Lightning's DDP.

API Documentation

`CLASS slurm_sweeps.Experiment`

def __init__(
    self,
    train: Callable,
    cfg: Dict,
    name: str = "MySweep",
    local_dir: Union[str, Path] = "./slurm_sweeps",
    backend: Optional[Backend] = None,
    asha: Optional[ASHA] = None,
    restore: bool = False,
    overwrite: bool = False,
)

Set up an HPO experiment.

Arguments:

train - A train function that takes as input the cfg dict.
cfg - A dict passed on to the train function. It must contain the search spaces via slurm_sweeps.Uniform, slurm_sweeps.Choice, etc.
name - The name of the experiment.
local_dir - Where to store and run the experiments. In this directory we will create the database slurm_sweeps.db and a folder with the experiment name.
backend - A backend to execute the trials. By default, we choose the SlurmBackend if Slurm is available, otherwise we choose the standard Backend that simply executes the trial in another process.
asha - An optional ASHA instance to cancel less promising trials.
restore - Restore an experiment with the same name?
overwrite - Overwrite an existing experiment with the same name?
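
To illustrate what it means for ASHA to cancel less promising trials, here is a minimal, generic sketch of an asynchronous successive halving stopping rule. It is not the implementation behind slurm_sweeps.ASHA; the class name AshaSketch and the parameters min_t, max_t and reduction_factor are assumptions chosen for illustration, and the metric is assumed to be minimized.

```python
from collections import defaultdict


class AshaSketch:
    """Toy asynchronous successive halving for a metric that is minimized."""

    def __init__(self, min_t: int = 1, max_t: int = 9, reduction_factor: int = 3):
        # Rungs are the iterations at which trials are compared: 1, 3, 9, ...
        self.rungs = []
        t = min_t
        while t <= max_t:
            self.rungs.append(t)
            t *= reduction_factor
        self.reduction_factor = reduction_factor
        self.results = defaultdict(list)  # rung -> metrics recorded so far

    def should_continue(self, iteration: int, metric: float) -> bool:
        """Record the metric at a rung and decide whether the trial goes on."""
        if iteration not in self.rungs:
            return True
        recorded = self.results[iteration]
        recorded.append(metric)
        # Keep the trial only if it ranks in the best 1/reduction_factor of
        # everything seen at this rung so far (no waiting for other trials).
        n_keep = max(1, len(recorded) // self.reduction_factor)
        cutoff = sorted(recorded)[n_keep - 1]
        return metric <= cutoff


asha = AshaSketch()
for loss in (0.5, 0.9, 0.1):
    print(loss, asha.should_continue(iteration=1, metric=loss))
```

Because each decision only uses the results recorded so far, a trial never has to wait for the others to reach the same rung, which is what makes the scheme asynchronous.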

`Experiment.run`

def run(
    self,
    n_trials: int = 1,
    max_concurrent_trials: Optional[int] = None,
    summary_interval_in_sec: float = 5.0,
    nr_of_rows_in_summary: int = 10,
    summarize_cfg_and_metrics: Union[bool, List[str]] = True,
) -> pd.DataFrame

Run the experiment.

Arguments:

n_trials - Number of trials to run. For grid searches this parameter is ignored.
max_concurrent_trials - The maximum number of trials running concurrently. By default, we will set this to the number of CPUs available, or to the total number of Slurm tasks divided by the number of Slurm tasks requested per trial.
summary_interval_in_sec - Print a summary of the experiment every x seconds.
nr_of_rows_in_summary - How many rows of the summary table should we print?
summarize_cfg_and_metrics - Should we include the cfg and the metrics in the summary table? You can also pass in a list of strings to only select a few cfg and metric keys.

Returns:

A DataFrame of the database.
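
The default for max_concurrent_trials described above comes down to a small piece of arithmetic. The sketch below is only an illustration of that rule: the helper default_max_concurrent_trials and its ntasks_per_trial parameter are hypothetical, and the library may determine the value differently in detail.

```python
import os


def default_max_concurrent_trials(env: dict, ntasks_per_trial: int = 1) -> int:
    """Hypothetical helper: total Slurm tasks divided by the tasks per trial."""
    if "SLURM_NTASKS" in env:
        return int(env["SLURM_NTASKS"]) // ntasks_per_trial
    # Outside of Slurm, fall back to the number of CPUs on this machine.
    return os.cpu_count() or 1


# A job with 2 nodes * 18 tasks per node provides 36 Slurm tasks.
slurm_env = {"SLURM_NTASKS": str(2 * 18)}
print(default_max_concurrent_trials(slurm_env))                      # 36
print(default_max_concurrent_trials(slurm_env, ntasks_per_trial=4))  # 9
```

With one task per trial, all 36 tasks run their own trial; a trial that requests 4 tasks leaves room for 9 concurrent trials.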

`CLASS slurm_sweeps.SlurmBackend`

def __init__(
    self,
    exclusive: bool = True,
    nodes: int = 1,
    ntasks: int = 1,
    args: str = ""
)

Execute the training runs on a Slurm cluster via srun.

Pass an instance of this class to your experiment.

Arguments:

exclusive - Add the --exclusive switch.
nodes - How many nodes do you request for your srun?
ntasks - How many tasks do you request for your srun?
args - Additional command line arguments for srun, formatted as a string.
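
To make the arguments above more concrete, the following sketch shows how they could map onto an srun call for a single trial. This is an assumption for illustration only: build_srun_command and trial.py are hypothetical names, and the command that SlurmBackend actually assembles may look different.

```python
import shlex


def build_srun_command(
    trial_command: list[str],
    exclusive: bool = True,
    nodes: int = 1,
    ntasks: int = 1,
    args: str = "",
) -> list[str]:
    """Hypothetical helper: compose an srun call for one trial (one job step)."""
    command = ["srun", f"--nodes={nodes}", f"--ntasks={ntasks}"]
    if exclusive:
        command.append("--exclusive")
    # Extra arguments arrive as a single string, so split them shell-style.
    command.extend(shlex.split(args))
    return command + trial_command


print(shlex.join(build_srun_command(["python", "trial.py"], args="--cpus-per-task=4")))
```

Keeping args as a plain string mirrors the signature documented above; splitting it with shlex preserves quoted values that contain spaces.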

Contact

David Carreto Fidalgo (david.carreto.fidalgo@mpcdf.mpg.de)

Release history

0.1.3: Nov 15, 2023 (this version)
0.1.2: Nov 15, 2023
0.1.1: Nov 14, 2023
0.1.0: Nov 14, 2023
