slurm-longrun

Slurm Longrun

Slurm Longrun is a Python package that wraps Slurm’s sbatch command to automatically resubmit jobs that time out, allowing you to run workloads that exceed a single‐job walltime without manual intervention. It supports optional terminal detachment (so your monitor survives after you log out), configurable retry limits, and built-in logging via Loguru.

This tool was developed as a project for the Large-Scale AI Engineering course on the CSCS Alps supercomputer.

Installation

Prerequisites

Python 3.10+
Slurm workload manager (sbatch, sacct, scontrol in your PATH)

Install from PyPI:

pip install slurm-longrun

Quickstart

Instead of calling sbatch directly, use the sbatch_longrun wrapper:

sbatch_longrun [OPTIONS] [SBATCH_ARGS…]

Example: if your workload needs more than 30 minutes, you can give each job a 30-minute walltime and let Longrun resubmit it on timeout:

sbatch_longrun --max-restarts 999 --time=00:30:00 --job-name=my_job my_script.sbatch
#sbatch_longrun <thiswrapperargs> <=========sbatch args===========> <===script.sh==>

This will:

Submit my_script.sbatch with a 30 min limit.
When it hits the 30 min walltime (TIMEOUT), automatically resubmit (opens log file in append mode).
Resubmit up to 999 times or until the job completes successfully.
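
To pick a sensible --max-restarts, a rough estimate of the number of submissions can help. The sketch below is back-of-the-envelope arithmetic, not part of slurm-longrun: the function name and the per-job overhead parameter (time lost to startup and state recovery before useful work begins) are assumptions made for illustration.

```python
import math


def submissions_needed(
    total_work_s: float, walltime_s: float, overhead_s: float = 0.0
) -> int:
    """Estimate how many jobs are needed to finish `total_work_s` seconds of work.

    Hypothetical helper: each job is assumed to lose `overhead_s` seconds to
    startup and state recovery, leaving `walltime_s - overhead_s` of useful time.
    """
    useful_s = walltime_s - overhead_s
    if useful_s <= 0:
        raise ValueError("walltime must exceed the per-job overhead")
    return max(1, math.ceil(total_work_s / useful_s))


if __name__ == "__main__":
    # 10 hours of work, 30 min walltime, 2 min lost per job to startup.
    n = submissions_needed(10 * 3600, 30 * 60, overhead_s=120)
    print(f"{n} submissions, i.e. {n - 1} restarts")
```

The number of restarts is one less than the number of submissions, so any --max-restarts at or above that value leaves enough headroom under these assumptions.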

Command-Line Interface

Usage

sbatch_longrun [OPTIONS] [SBATCH_ARGS…]

Options

--use-verbosity [DEFAULT|VERBOSE|SILENT]
Logging level (DEFAULT = INFO, VERBOSE = DEBUG, SILENT = WARNING).
--detached / --no-detached
Run the monitor loop in background (detached from your terminal).
--max-restarts INTEGER
Maximum number of resubmissions on JobStatus.should_resubmit. Default: 999.
-h, --help
Show help and exit.

All other flags are forwarded to sbatch; they must come after the wrapper flags.

Examples

Basic, retry up to 3 times, verbose logging:
```
sbatch_longrun --use-verbosity VERBOSE --max-restarts 3 \
  --time=02:00:00 --job-name=deep_train train.sbatch
```
--use-verbosity VERBOSE --max-restarts 3 are passed to the monitor process. --time=02:00:00 --job-name=deep_train are passed to sbatch.

Detach the monitor so it survives logout:

sbatch_longrun --detached \
  --time=01:00:00 --job-name=data_proc data_pipeline.sbatch
# → prints “Monitor running in background PID: ”

Example: Assignment 2

Assignment 2 trains an LLM for 1000 training steps. To showcase the resubmission feature, I set the walltime to 3 minutes. I also ask Slurm to send SIGTERM to the job 20 seconds before the walltime limit (--signal=SIGTERM@20) and use that signal to save the state of the training run.

Submission

sbatch_longrun --signal=SIGTERM@20 example/assignment2_example/run_job.sbatch

using the following file:

# example/assignment2_example/run_job.sbatch

#!/bin/bash
#SBATCH --account=a-large-sc
#SBATCH --job-name=sbatch_longrun_example_assignment2
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=1
#SBATCH --time=00:03:00
#SBATCH --output=run_example_assignment2.log
#SBATCH --error=run_example_assignment2.err
#SBATCH --partition=debug
#SBATCH --environment=/path/to/.../ngc_pt_jan.toml
#SBATCH --export=ALL

set -eo pipefail
echo "START TIME: $(date)"

srun bash -c "python $SLURM_SUBMIT_DIR/example/assignment2_example/assignment_2/train.py \
    --learning-rate 5e-5 \
    --training-steps 1000 \
    --batch-size 1 \
    --lr-warmup-steps 100"

echo "END TIME: $(date)"

Checkpointing in train.py:

############################## LONGRUN : SAVE & RECOVER STATE ##############################
import os
import signal
from typing import Dict, Optional, Tuple

import torch


def store_state(
    filepath: str, epoch: int, model: torch.nn.Module, strip_dp: bool = True
) -> None:
    """Save the last completed step and the model weights to disk."""
    # Unwrap DataParallel-style wrappers so the keys carry no "module." prefix.
    state_dict = (
        model.module.state_dict()
        if strip_dp and hasattr(model, "module")
        else model.state_dict()
    )
    torch.save({"epoch": epoch, "model_state_dict": state_dict}, filepath)


def recover_state(
    filepath: str, device: str = "cpu"
) -> Tuple[int, Optional[Dict[str, torch.Tensor]]]:
    """Return (last_step, model_state_dict), or (0, None) if no checkpoint exists."""
    if not os.path.exists(filepath):
        return 0, None
    ckpt = torch.load(filepath, map_location=device)
    return ckpt["epoch"], ckpt["model_state_dict"]


# Key the checkpoint on the first submission's job ID so every resubmission finds it.
# The file is written by torch.save, so it is not JSON despite the suffix.
_RUN_ID = os.environ.get("SLURM_LONGRUN_INITIAL_JOB_ID", os.environ.get("SLURM_JOB_ID", ""))
STATE_PATH = os.path.join(os.environ.get("SLURM_SUBMIT_DIR", "."), f"state-{_RUN_ID}.json")
############################## END LONGRUN : SAVE & RECOVER STATE ##########################

def train(args):
    ...
    with set_default_dtype(model_dtype):
        model = Transformer(model_config)
        ############################## LONGRUN : RECOVER STATE ##############################
        train_step, model_state_dict = recover_state(STATE_PATH, device)
        if model_state_dict is not None:
            model.load_state_dict(model_state_dict, strict=False)
            logger.info(f"Recovered model state from {STATE_PATH}")
        else:
            train_step = 0
            logger.info(f"Starting from scratch, no state found in {STATE_PATH}")
        del model_state_dict
        ############################## END LONGRUN : RECOVER STATE ##########################
        model = model.to(device)
    ############################ LONGRUN : SAVE STATE ON SIGTERM ########################
    def sigterm_handler(signum, frame):
        logger.info(f"[Received SIGTERM] : Saving state to {STATE_PATH}")
        store_state(STATE_PATH, train_step, model)
        logger.info(f"[Received SIGTERM] : Finished saving state.")

    signal.signal(
        signal.SIGTERM,
        sigterm_handler,
    )
    logger.info(f"Registered SIGTERM handler to save state to {STATE_PATH} on termination.")
    ############################ END LONGRUN : SAVE STATE ON SIGTERM ######################
    ...
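
The handler above saves directly from inside the signal handler. A common alternative, shown here only as a sketch and not as what train.py does, is to have the handler set a flag and let the training loop write the checkpoint at the next step boundary, so the saved state never comes from the middle of a step. The names CheckpointRequest and run_steps and the JSON state file are hypothetical stand-ins; a real loop would call store_state instead.

```python
import json
import signal
from typing import Callable, Optional


class CheckpointRequest:
    """Remember that a signal arrived; the training loop decides when to save."""

    def __init__(self, signum: int = signal.SIGTERM) -> None:
        self.requested = False
        self._signum = signum
        # signal.signal returns the previous handler so it can be restored later.
        self._previous = signal.signal(signum, self._on_signal)

    def _on_signal(self, signum, frame) -> None:
        # Keep the handler trivial: no I/O here, only a flag.
        self.requested = True

    def restore(self) -> None:
        # The previous handler is None if it was not installed from Python.
        if self._previous is not None:
            signal.signal(self._signum, self._previous)


def run_steps(
    total_steps: int,
    state_path: str,
    start_step: int = 0,
    after_step: Optional[Callable[[int], None]] = None,
) -> int:
    """Run placeholder steps and checkpoint at a step boundary when signalled."""
    request = CheckpointRequest()
    step = start_step
    try:
        while step < total_steps:
            step += 1  # stand-in for one optimizer step
            if after_step is not None:
                after_step(step)  # hypothetical hook, e.g. for logging
            if request.requested:
                with open(state_path, "w") as fh:
                    json.dump({"step": step}, fh)
                request.requested = False  # keep training after the save
    finally:
        request.restore()
    return step
```

Note that the loop keeps training after it saves. If the script exited cleanly at that point, Slurm would most likely record the job as COMPLETED rather than TIMEOUT, and Longrun would not resubmit it.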

Logs from monitoring the job (as we didn't run it in detached mode):

$ sbatch_longrun --signal=SIGTERM@20 example/assignment2_example/run_job.sbatch
2025-05-21 11:31:10 | SUCCESS  : Job submitted with ID: 454600
2025-05-21 11:31:10 | INFO     : Monitoring job 454600 (submission 1/999)
2025-05-21 11:35:51 | INFO     : Job 454600 reached final state: TIMEOUT
2025-05-21 11:35:52 | SUCCESS  : Resubmitted based on status=TIMEOUT job with ID: 454609
2025-05-21 11:35:52 | INFO     : Monitoring job 454609 (submission 2/999)
2025-05-21 11:40:02 | INFO     : Job 454609 reached final state: TIMEOUT
...
2025-05-21 11:52:04 | SUCCESS  : Resubmitted based on status=TIMEOUT job with ID: 454644
2025-05-21 11:52:04 | INFO     : Monitoring job 454644 (submission 6/999)
2025-05-21 11:55:04 | INFO     : Job 454644 reached final state: COMPLETED
2025-05-21 11:55:04 | SUCCESS  : Job completed successfully.

Log output from the job script:

2025-05-21 11:31:55,595 - root - INFO - Setting up DataLoaders...
2025-05-21 11:31:57,838 - root - INFO - Setting up Model...
2025-05-21 11:32:31,846 - root - INFO - Starting from scratch, no state found in /iopsstor/scratch/cscs/athillen/example_slurmlongrun/state-454600.json
2025-05-21 11:32:33,519 - root - INFO - Registered SIGTERM handler to save state to /iopsstor/scratch/cscs/athillen/example_slurmlongrun/state-454600.json on termination.
2025-05-21 11:32:33,521 - root - INFO - Starting training!
2025-05-21 11:32:34,982 - root - INFO - Step: 1 | Loss: 12.03 | Tokens per second: 2884.67 | Training tokens per second (%): 19.38 | MFU (%): 15.03 | TFLOPs: 148.68
...
2025-05-21 11:33:40,533 - root - INFO - Step: 120 | Loss: 7.89 | Tokens per second: 7542.89 | Training tokens per second (%): 26.43 | MFU (%): 39.31 | TFLOPs: 388.77
slurmstepd: error: *** STEP 454600.0 ON nid006459 CANCELLED AT 2025-05-21T11:33:42 ***
2025-05-21 11:33:43,079 - root - INFO - [Received SIGTERM] : Saving state to /iopsstor/scratch/cscs/athillen/example_slurmlongrun/state-454600.json
2025-05-21 11:33:54,977 - root - INFO - [Received SIGTERM] : Finished saving state.
2025-05-21 11:33:55,081 - root - INFO - Step: 125 | Loss: 8.12 | Tokens per second: 1411.76 | Training tokens per second (%): 24.73 | MFU (%): 7.36 | TFLOPs: 72.76
...
slurmstepd: error: *** JOB 454600 ON nid006459 CANCELLED AT 2025-05-21T11:34:48 DUE TO TIME LIMIT ***
srun: forcing job termination

2025-05-21 11:36:33,182 - root - INFO - Setting up DataLoaders...
2025-05-21 11:36:35,357 - root - INFO - Setting up Model...
2025-05-21 11:37:16,360 - root - INFO - Recovered model state from /iopsstor/scratch/cscs/athillen/example_slurmlongrun/state-454600.json
2025-05-21 11:37:16,626 - root - INFO - Registered SIGTERM handler to save state to /iopsstor/scratch/cscs/athillen/example_slurmlongrun/state-454600.json on termination.
2025-05-21 11:37:16,628 - root - INFO - Starting training!
2025-05-21 11:37:18,498 - root - INFO - Step: 225 | Loss: 7.56 | Tokens per second: 2238.88 | Training tokens per second (%): 19.38 | MFU (%): 11.67 | TFLOPs: 115.39
2025-05-21 11:37:21,168 - root - INFO - Step: 230 | Loss: 7.29 | Tokens per second: 7789.90 | Training tokens per second (%): 13.66 | MFU (%): 40.60 | TFLOPs: 401.50
2025-05-21 11:37:23,877 - root - INFO - Step: 235 | Loss: 7.63 | Tokens per second: 7677.60 | Training tokens per second (%): 22.45 | MFU (%): 40.01 | TFLOPs: 395.71
...
slurmstepd: error: *** JOB 454609 ON nid006455 CANCELLED AT 2025-05-21T11:39:18 DUE TO TIME LIMIT ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: forcing job termination
srun: got SIGCONT
2025-05-21 11:39:18,023 - root - INFO - [Received SIGTERM] : Saving state to /iopsstor/scratch/cscs/athillen/example_slurmlongrun/state-454600.json
....
....
2025-05-21 11:53:53,561 - root - INFO - Step: 995 | Loss: 4.38 | Tokens per second: 7647.01 | Training tokens per second (%): 10.83 | MFU (%): 39.85 | TFLOPs: 394.14
2025-05-21 11:53:56,307 - root - INFO - Step: 1000 | Loss: 4.74 | Tokens per second: 7571.65 | Training tokens per second (%): 16.73 | MFU (%): 39.46 | TFLOPs: 390.25
2025-05-21 11:53:56,307 - root - INFO - Training completed

How It Works

Submit
Calls sbatch with your arguments; parses the returned job ID.
Monitor
- Polls sacct + scontrol until the job reaches a terminal state.
- If JobStatus.should_resubmit and you haven’t exceeded --max-restarts, it immediately resubmits with --open-mode=append to preserve logs.
Detach (optional)
If --detached is passed, the process forks twice, detaches from the terminal (setsid), redirects stdio to /dev/null, and continues monitoring in background.
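
The submit and resubmit steps can be sketched as follows. This is an illustration under assumptions, not the package's actual code: the helper names parse_job_id and with_append_mode are hypothetical, and the parser relies only on sbatch's standard "Submitted batch job <id>" message.

```python
import re
from typing import List

_SUBMITTED = re.compile(r"Submitted batch job (\d+)")


def parse_job_id(sbatch_stdout: str) -> int:
    """Extract the job ID from sbatch's 'Submitted batch job <id>' message."""
    match = _SUBMITTED.search(sbatch_stdout)
    if match is None:
        raise ValueError(f"no job ID found in sbatch output: {sbatch_stdout!r}")
    return int(match.group(1))


def with_append_mode(sbatch_args: List[str]) -> List[str]:
    """Return the sbatch arguments to use for a resubmission.

    Adds --open-mode=append (unless an --open-mode flag is already present) so
    the resubmitted job appends to the existing log files instead of truncating
    them. The flag goes first, ahead of the script path, because sbatch treats
    everything after the script as arguments to the script itself.
    """
    if any(arg.startswith("--open-mode") for arg in sbatch_args):
        return list(sbatch_args)
    return ["--open-mode=append", *sbatch_args]


if __name__ == "__main__":
    print(parse_job_id("Submitted batch job 454600"))
    print(with_append_mode(["--time=00:30:00", "my_script.sbatch"]))
```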

JobStatus.should_resubmit holds for jobs that exit due to TIMEOUT, DEADLINE, PREEMPTED, NODE_FAIL, or REVOKED.
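
The resubmission rule and the monitor loop can be sketched as below. The JobStatus enum and should_resubmit mirror the names used above, but their real definitions are not shown in this README, so treat the code as an illustration under assumptions. submit and wait_for_final_state are hypothetical callables standing in for the sbatch call and the sacct/scontrol polling, and the package may count submissions against --max-restarts differently.

```python
from enum import Enum
from typing import Callable, List, Optional, Tuple


class JobStatus(Enum):
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"
    CANCELLED = "CANCELLED"
    TIMEOUT = "TIMEOUT"
    DEADLINE = "DEADLINE"
    PREEMPTED = "PREEMPTED"
    NODE_FAIL = "NODE_FAIL"
    REVOKED = "REVOKED"

    @property
    def should_resubmit(self) -> bool:
        # True only for the states listed in the README text above.
        return self.name in _RESUBMIT_STATES


_RESUBMIT_STATES = frozenset(
    {"TIMEOUT", "DEADLINE", "PREEMPTED", "NODE_FAIL", "REVOKED"}
)


def run_with_restarts(
    submit: Callable[[Optional[int]], int],
    wait_for_final_state: Callable[[int], JobStatus],
    max_restarts: int = 999,
) -> List[Tuple[int, JobStatus]]:
    """Submit a job and resubmit it while its final state is resubmittable.

    `submit(previous_job_id)` returns a new job ID (None marks the first
    submission); `wait_for_final_state(job_id)` blocks until the job finishes.
    Returns the (job_id, final_state) pair of every submission, in order.
    """
    history: List[Tuple[int, JobStatus]] = []
    job_id = submit(None)
    restarts = 0
    while True:
        status = wait_for_final_state(job_id)
        history.append((job_id, status))
        if not status.should_resubmit or restarts >= max_restarts:
            return history
        restarts += 1
        job_id = submit(job_id)
```

Passing the two callables in keeps the loop testable without a Slurm cluster: a test can supply canned job IDs and states instead of shelling out.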

Environment Variables

SLURM_LONGRUN_INITIAL_JOB_ID

Set internally to the first submission’s job ID.
You can read it in your job script (e.g., to name checkpoints).

Dependencies

click
loguru

These are installed automatically via pip.

Summary of CLI Options

| Option | Default | Description |
| --- | --- | --- |
| `--use-verbosity` | DEFAULT | Logging verbosity: DEFAULT (INFO), VERBOSE (DEBUG), SILENT (WARNING) |
| `--detached / --no-detached` | `--no-detached` | Detach the monitoring loop into a background process |
| `--max-restarts` | 999 | Maximum number of automatic resubmissions (on TIMEOUT, DEADLINE, PREEMPTED, NODE_FAIL, or REVOKED) |
| `[SBATCH_ARGS…]` | / | All subsequent flags are passed directly to `sbatch` |

Release history

This version

0.1.4

May 21, 2025

0.1.3

May 14, 2025

0.1.2

Apr 30, 2025

0.1.1

Apr 30, 2025

Download files

Source Distribution

slurm_longrun-0.1.4.tar.gz (9.0 kB view details)

Uploaded May 21, 2025 Source

Built Distribution

slurm_longrun-0.1.4-py3-none-any.whl (11.2 kB view details)

Uploaded May 21, 2025 Python 3

File details

Details for the file slurm_longrun-0.1.4.tar.gz.

File metadata

Download URL: slurm_longrun-0.1.4.tar.gz
Upload date: May 21, 2025
Size: 9.0 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for slurm_longrun-0.1.4.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `074b0b6c3ab6bb4a6345120072d48e4afcc0dfcaf3734acbb87931e3dc8b8cee` |
| MD5 | `967731bbc776190e4cc9f66fc234e5bf` |
| BLAKE2b-256 | `2e428ab046b4f3f3fda0d50a9a0019e1eafd719fedc8e1e7558f2617a102a6fc` |

Provenance

The following attestation bundles were made for slurm_longrun-0.1.4.tar.gz:

Publisher: pypi-publish.yml on alexthillen/slurm_longrun

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: slurm_longrun-0.1.4.tar.gz
- Subject digest: 074b0b6c3ab6bb4a6345120072d48e4afcc0dfcaf3734acbb87931e3dc8b8cee
- Sigstore transparency entry: 216365764
- Sigstore integration time: May 21, 2025
Source repository:
- Permalink: alexthillen/slurm_longrun@2cb9f98d1a819d70e46f0e2e40ecf8e6a629ceac
- Branch / Tag: refs/tags/v0.1.4
- Owner: https://github.com/alexthillen
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pypi-publish.yml@2cb9f98d1a819d70e46f0e2e40ecf8e6a629ceac
- Trigger Event: release

File details

Details for the file slurm_longrun-0.1.4-py3-none-any.whl.

File metadata

Download URL: slurm_longrun-0.1.4-py3-none-any.whl
Upload date: May 21, 2025
Size: 11.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for slurm_longrun-0.1.4-py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `695854217ed8e97794b3a4c5e2e16893d9ca4c8a0b44e4321403fc00df885367` |
| MD5 | `4e2261372e05f9d1b6d5512727e1a9a7` |
| BLAKE2b-256 | `fdaa5a6bc675bcc9e02f0afdda8d31b5c5d4172e08c8ced81f1be514162982bd` |

Provenance

The following attestation bundles were made for slurm_longrun-0.1.4-py3-none-any.whl:

Publisher: pypi-publish.yml on alexthillen/slurm_longrun

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: slurm_longrun-0.1.4-py3-none-any.whl
- Subject digest: 695854217ed8e97794b3a4c5e2e16893d9ca4c8a0b44e4321403fc00df885367
- Sigstore transparency entry: 216365768
- Sigstore integration time: May 21, 2025
Source repository:
- Permalink: alexthillen/slurm_longrun@2cb9f98d1a819d70e46f0e2e40ecf8e6a629ceac
- Branch / Tag: refs/tags/v0.1.4
- Owner: https://github.com/alexthillen
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pypi-publish.yml@2cb9f98d1a819d70e46f0e2e40ecf8e6a629ceac
- Trigger Event: release
