Slurm Longrun

Slurm Longrun is a Python package that wraps Slurm’s sbatch command to automatically resubmit jobs that time out, allowing you to run workloads that exceed a single‐job walltime without manual intervention. It supports optional terminal detachment (so your monitor survives after you log out), configurable retry limits, and built-in logging via Loguru.


Installation

Prerequisites

  • Python 3.10+
  • Slurm workload manager (sbatch, sacct, scontrol in your PATH)

Install from PyPI:

pip install slurm-longrun

Quickstart

Instead of calling sbatch directly, use the sbatch_longrun wrapper:

sbatch_longrun [OPTIONS] [SBATCH_ARGS…]

Example: your job needs more than 30 minutes in total, so you request a 30-minute walltime per submission and let Longrun resubmit it each time it times out:

sbatch_longrun --max-restarts 999 --time=00:30:00 --job-name=my_job my_script.sbatch
#sbatch_longrun <thiswrapperargs> <=========sbatch args===========> <===script.sh==>

This will:

  1. Submit my_script.sbatch with a 30 min limit.
  2. When the job hits the 30 min walltime (state TIMEOUT), resubmit it automatically, opening the log file in append mode.
  3. Resubmit up to 999 times or until the job completes successfully.
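The behavior listed above can be sketched as a small retry loop. This is a minimal illustration, not the package's actual implementation: `run_with_restarts`, `submit`, and `wait_for_final_state` are hypothetical names, and the two callables stand in for the real calls to sbatch and sacct.

```python
from typing import Callable


def run_with_restarts(
    submit: Callable[[bool], str],
    wait_for_final_state: Callable[[str], str],
    max_restarts: int = 99,
) -> tuple[str, int]:
    """Submit a job and resubmit it for as long as it ends in TIMEOUT.

    submit(is_restart) returns a job ID (hypothetical callable).
    wait_for_final_state(job_id) blocks until the job reaches a terminal
    state and returns that state as a string (hypothetical callable).
    Returns (final_state, restarts_used).
    """
    restarts = 0
    job_id = submit(False)  # first submission
    while True:
        state = wait_for_final_state(job_id)
        # Stop on any state other than TIMEOUT, or when the budget is spent.
        if state != "TIMEOUT" or restarts >= max_restarts:
            return state, restarts
        restarts += 1
        job_id = submit(True)  # resubmission, e.g. with --open-mode=append


if __name__ == "__main__":
    # Simulated run: two timeouts followed by a successful completion.
    outcomes = iter(["TIMEOUT", "TIMEOUT", "COMPLETED"])
    job_ids = iter(["101", "102", "103"])
    result = run_with_restarts(
        lambda is_restart: next(job_ids),
        lambda job_id: next(outcomes),
        max_restarts=999,
    )
    print(result)  # ('COMPLETED', 2)
```

Passing the submit and wait steps in as callables keeps the loop testable without a Slurm cluster.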

Command-Line Interface

Usage

sbatch_longrun [OPTIONS] [SBATCH_ARGS…]

Options

  • --use-verbosity [DEFAULT|VERBOSE|SILENT]
     Logging level (DEFAULT = INFO, VERBOSE = DEBUG, SILENT = WARNING).
  • --detached / --no-detached
     Run the monitor loop in background (detached from your terminal).
  • --max-restarts INTEGER
     Maximum number of resubmissions on TIMEOUT. Default: 99.
  • -h, --help
     Show help and exit.

All other flags are forwarded to sbatch. They must be provided after the wrapper flags.
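The ordering rule can be illustrated with a small splitter that consumes wrapper flags from the front of the command line and treats everything from the first unrecognized token onward as sbatch arguments. This is only a sketch under that assumption: the package itself depends on click, and `split_command_line` is a hypothetical helper, not part of its API. The help flags (-h, --help) are left out for brevity.

```python
# Wrapper flags taken from the Options list above.
WRAPPER_FLAGS_WITH_VALUE = {"--use-verbosity", "--max-restarts"}
WRAPPER_SWITCHES = {"--detached", "--no-detached"}


def split_command_line(argv: list[str]) -> tuple[list[str], list[str]]:
    """Split argv into (wrapper_args, sbatch_args).

    Scanning stops at the first token that is not a known wrapper flag;
    that token and everything after it is forwarded to sbatch unchanged.
    """
    wrapper_args: list[str] = []
    i = 0
    while i < len(argv):
        token = argv[i]
        name = token.split("=", 1)[0]
        if name in WRAPPER_SWITCHES:
            wrapper_args.append(token)
            i += 1
        elif name in WRAPPER_FLAGS_WITH_VALUE:
            if "=" in token:
                # Form: --max-restarts=3
                wrapper_args.append(token)
                i += 1
            else:
                # Form: --max-restarts 3 (flag plus separate value)
                wrapper_args.extend(argv[i:i + 2])
                i += 2
        else:
            break
    return wrapper_args, argv[i:]


if __name__ == "__main__":
    wrapper, sbatch = split_command_line(
        ["--max-restarts", "3", "--time=02:00:00", "train.sbatch"]
    )
    print(wrapper)  # ['--max-restarts', '3']
    print(sbatch)   # ['--time=02:00:00', 'train.sbatch']
```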

Examples

  1. Basic, retry up to 3 times, verbose logging:

    sbatch_longrun --use-verbosity VERBOSE --max-restarts 3 \
      --time=02:00:00 --job-name=deep_train train.sbatch
    

    Here, --use-verbosity VERBOSE --max-restarts 3 are consumed by the monitor process, while --time=02:00:00 --job-name=deep_train are passed to sbatch.

  2. Detach the monitor so it survives logout:

    sbatch_longrun --detached  \
      --time=01:00:00 --job-name=data_proc data_pipeline.sbatch
    # → prints “Monitor running in background PID: ”
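Detaching a process so that it survives logout is commonly done on POSIX systems with a double fork plus setsid. The sketch below shows that general technique under stated assumptions; `run_detached` is a hypothetical name, and the code is not taken from the package. Unlike a classic daemonizer, the calling process returns instead of exiting, which keeps the example easy to try.

```python
import os


def run_detached(task) -> None:
    """Run task() in a background process detached from the terminal.

    POSIX only. The caller returns as soon as the short-lived
    intermediate child has exited; task() runs in the grandchild.
    """
    pid = os.fork()
    if pid > 0:
        os.waitpid(pid, 0)  # reap the intermediate child, then return
        return

    # First child: start a new session, dropping the controlling terminal.
    os.setsid()
    if os.fork() > 0:
        os._exit(0)  # intermediate child exits immediately

    # Grandchild: not a session leader, so it cannot reacquire a terminal.
    devnull = os.open(os.devnull, os.O_RDWR)
    for fd in (0, 1, 2):  # stdin, stdout, stderr
        os.dup2(devnull, fd)
    if devnull > 2:
        os.close(devnull)
    try:
        task()
    finally:
        os._exit(0)  # never fall back into the caller's code path
```

Because stdio is redirected to /dev/null, a detached task has to report progress through a log file rather than the terminal.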
    

How It Works

  1. Submit
    Calls sbatch with your arguments; parses the returned job ID.
  2. Monitor
    • Polls sacct + scontrol until the job reaches a terminal state.
    • If TIMEOUT and you haven’t exceeded --max-restarts, it immediately resubmits with --open-mode=append to preserve logs.
  3. Detach (optional)
    If --detached is passed, the process forks twice, detaches from the terminal (setsid), redirects stdio to /dev/null, and continues monitoring in background.
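The Submit and Monitor steps depend on a few small pieces of text handling, sketched below. All function names are hypothetical and the code is an illustration, not the package's implementation. It assumes sbatch prints its usual "Submitted batch job <id>" line, that state strings come from sacct (which can append details such as "by <uid>" to CANCELLED), and that the listed states are the terminal ones of interest.

```python
import re

# Slurm job states treated as terminal in this sketch (an assumption;
# a site may care about a different subset).
TERMINAL_STATES = {
    "BOOT_FAIL", "CANCELLED", "COMPLETED", "DEADLINE", "FAILED",
    "NODE_FAIL", "OUT_OF_MEMORY", "PREEMPTED", "TIMEOUT",
}


def parse_job_id(sbatch_stdout: str) -> str:
    """Extract the job ID from sbatch output such as 'Submitted batch job 123'."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_stdout)
    if match is None:
        raise ValueError(f"no job ID found in: {sbatch_stdout!r}")
    return match.group(1)


def normalize_state(raw: str) -> str:
    """Reduce a raw state string (e.g. 'CANCELLED by 1000') to its first word."""
    parts = raw.split()
    # A trailing '+' marks a field that was truncated in the output.
    return parts[0].rstrip("+").upper() if parts else ""


def is_terminal(state: str) -> bool:
    """True once the job can no longer change state."""
    return state in TERMINAL_STATES


def with_append_mode(sbatch_args: list[str]) -> list[str]:
    """Return sbatch args for a resubmission, adding --open-mode=append.

    The flag is prepended so that it precedes the script name; sbatch
    treats anything after the script as arguments to the script itself.
    """
    if any(arg.startswith("--open-mode") for arg in sbatch_args):
        return list(sbatch_args)
    return ["--open-mode=append", *sbatch_args]
```

A polling loop would call `normalize_state` on each sacct result and sleep between checks until `is_terminal` returns True.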

Environment Variables

SLURM_LONGRUN_INITIAL_JOB_ID

  • Set internally to the first submission’s job ID.
  • You can read it in your job script (e.g., to name checkpoints).
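As one possible use, a Python program launched from the job script could derive a checkpoint directory that stays the same across resubmissions. This is a sketch under assumptions: `checkpoint_dir` and the `checkpoints/run_<id>` layout are hypothetical, and falling back to Slurm's own SLURM_JOB_ID when the Longrun variable is absent is a design choice of this example.

```python
import os
from pathlib import Path


def checkpoint_dir(environ=None, base: str = "checkpoints") -> Path:
    """Pick a checkpoint directory keyed on the first submission's job ID.

    Falls back to SLURM_JOB_ID (plain sbatch run) and then to 'local'
    (run outside Slurm) when SLURM_LONGRUN_INITIAL_JOB_ID is not set.
    """
    env = os.environ if environ is None else environ
    run_id = (
        env.get("SLURM_LONGRUN_INITIAL_JOB_ID")
        or env.get("SLURM_JOB_ID")
        or "local"
    )
    return Path(base) / f"run_{run_id}"


if __name__ == "__main__":
    target = checkpoint_dir()
    target.mkdir(parents=True, exist_ok=True)
    print(f"Checkpoints will be written to {target}")
```

Keying on the initial job ID means every resubmission looks in the same directory, so a restarted job can find the checkpoint written by its predecessor.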

Dependencies

  • click
  • loguru

These are installed automatically via pip.


Summary of CLI Options

Option                      Default        Description
--use-verbosity             DEFAULT        Logging verbosity: DEFAULT (INFO), VERBOSE (DEBUG), SILENT (WARNING)
--detached / --no-detached  --no-detached  Detach the monitoring loop into a background process
--max-restarts              99             Maximum number of automatic resubmissions on TIMEOUT
[SBATCH_ARGS…]              (none)         All subsequent flags, passed directly to sbatch
