slurmgrid

Manage large Slurm job arrays that exceed your cluster's submission limit.

If you need to run 50,000 small jobs but your cluster caps MaxArraySize at 10,000 (or limits total queued jobs), slurmgrid handles the tedious cycle of "submit a batch, wait, submit the next batch" automatically. It chunks your parameter manifest, submits array jobs via sbatch, monitors completion via sacct, retries failures, and persists state so you can resume if interrupted.

Installation

pip install slurmgrid

Or clone the repo and install in editable mode:

git clone https://github.com/jgaeb/slurmgrid.git
cd slurmgrid
pip install -e .
python -m slurmgrid --help

Quick start

  1. Create a manifest file (CSV or TSV) with one row per job:
alpha,beta,seed
0.1,1,42
0.1,2,42
0.5,1,42
0.5,2,42
...
  2. Run slurmgrid submit with your command template:
python -m slurmgrid submit \
  --manifest params.csv \
  --command "python train.py --alpha {alpha} --beta {beta} --seed {seed}" \
  --partition gpu \
  --time 01:00:00 \
  --mem 4G \
  --max-concurrent 5000

That's it. slurmgrid will:

  • Shuffle and split the manifest into chunks (default: 1/3 of MaxArraySize)
  • Submit each chunk with a single sbatch --array call, using Slurm's %throttle suffix to cap concurrency at --max-concurrent
  • Poll sacct every 30 seconds to track completion
  • Submit the next chunk when the current one finishes
  • Batch failed jobs into retry chunks (up to --max-retries, default 3)
  • Save state to disk after every poll so you can resume if interrupted

Usage

Submit a new run

python -m slurmgrid submit \
  --manifest params.csv \
  --command "python train.py --alpha {alpha} --beta {beta}" \
  --state-dir ./my_run \
  --partition gpu \
  --time 02:00:00 \
  --mem 8G \
  --cpus-per-task 4 \
  --max-concurrent 5000 \
  --max-retries 3 \
  --poll-interval 30 \
  --preamble "module load python/3.10 && conda activate myenv"

The --command template uses {column_name} placeholders that are resolved from the manifest columns. Any column in the manifest can be referenced.
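
For example, with the template above and a manifest row where alpha=0.5 and beta=2, the task's command expands to:

python train.py --alpha 0.5 --beta 2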

Use a config file

Instead of a long command line, you can store submit options in a YAML file:

# run.yaml
manifest: params.csv
command: python train.py --alpha {alpha} --beta {beta} --seed {seed}
state-dir: ./my_run
partition: gpu
time: 02:00:00
mem: 8G
max-concurrent: 5000
max-retries: 3

Then run:

python -m slurmgrid submit --config run.yaml

CLI flags take precedence over config file values, so you can override individual options ad hoc:

python -m slurmgrid submit --config run.yaml --partition debug --time 00:10:00

Run the monitor as a Slurm job (recommended for HPC)

On clusters where login-node processes can be killed, submit the monitor itself as a low-resource batch job. Set --max-runtime slightly below the job's wall time and pass --self-resubmit to chain automatically:

sbatch --partition=gpu --time=03:00:00 --mem=1G -c 1 \
  --wrap="python -m slurmgrid submit \
    --config run.yaml \
    --max-runtime 10000 \
    --self-resubmit"

When --max-runtime is reached, slurmgrid saves state and submits a new slurmgrid resume job before exiting, so monitoring continues unattended until the run is complete.

To find or kill a running monitor at any time:

cat ./my_run/monitor.lock   # prints hostname:pid
ssh <hostname> kill <pid>

Resume an interrupted run

If you lose your SSH session or Ctrl-C out, running Slurm jobs continue independently. Resume monitoring with:

python -m slurmgrid resume --state-dir ./my_run

Retry permanently failed tasks

If a run finishes with permanently failed tasks (e.g., jobs that timed out), you can reset them and retry with different Slurm parameters:

python -m slurmgrid resume --state-dir ./my_run \
  --reset-failures \
  --time 04:00:00 \
  --mem 16G

--reset-failures clears the permanently_failed flag on all failure records and bumps max_retries so the monitor's retry machinery picks them up. Any Slurm flags passed to resume override the frozen config for this session only — the original config.json is not modified. Overrides are recorded per-chunk in state.json for provenance.
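
As a purely illustrative sketch of the bookkeeping involved (the real state.json schema is internal and may differ), one failure record conceptually carries fields like those reported by the failures command:

{
  "row": 42,
  "exit_code": 1,
  "retries": 3,
  "permanently_failed": true
}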

Chain runs with --after-run

If stage 2 depends on stage 1, pass stage 1's state directory to stage 2's submit (or resume) with --after-run. Stage 2's monitor will block until stage 1 is done before submitting any jobs:

# Stage 1 runs in background (or as a Slurm job with --self-resubmit)
python -m slurmgrid submit --config stage1.yaml --state-dir ./stage1 &

# Stage 2 waits for stage 1 to finish before submitting
python -m slurmgrid submit --config stage2.yaml --state-dir ./stage2 \
  --after-run ./stage1

Restart a run from scratch

To re-run from scratch with the same state directory, pass --restart. The old state is backed up automatically before the new run begins:

python -m slurmgrid submit --config run.yaml --restart

The old directory is renamed to <state-dir>.bak.<YYYYMMDD_HHMMSS>. To delete it instead of backing it up:

python -m slurmgrid submit --config run.yaml --restart --no-backup

Check status

python -m slurmgrid status --state-dir ./my_run
==================================================
  Total jobs:            50000
  Completed:             35420  (70.8%)
  Active:                 4580  (12 failing)
  Pending:               10000
  Failed (retrying):         0
  Failed (final):            0
  Chunks: 35/50 completed, 5 active, 10 pending
==================================================

Inspect failing jobs

While a run is in progress (or after it finishes), list all currently failing tasks with their manifest parameters and log file paths:

python -m slurmgrid failures --state-dir ./my_run
============================================================
Row 42  exit=1  retries=1  permanent=False
  alpha=0.5  beta=2  seed=42
  OUT: ./my_run/logs/chunk_003/slurm-98765_8.out
  ERR: ./my_run/logs/chunk_003/slurm-98765_8.err
  --- last 5 lines of .err ---
  Traceback (most recent call last):
  ...

Useful flags:

  • --permanently-failed-only: show only tasks that have exhausted all retries
  • --tail N: show last N lines of each task's .err log (default: 5)
  • --paths-only: show log paths but suppress log content

Cancel all jobs

python -m slurmgrid cancel --state-dir ./my_run

Dry run

Generate all chunk files and sbatch scripts without actually submitting:

python -m slurmgrid submit --manifest params.csv --command "echo {x}" --dry-run

Inspect the generated scripts in ./sc_state/scripts/ to verify correctness.

How it works

  1. Chunking: The manifest is split into sub-manifests. Each chunk gets its own sbatch script that uses SLURM_ARRAY_TASK_ID to index into the sub-manifest and extract that task's parameters (see the sketch after this list).

  2. Shuffling: Manifest rows are shuffled before chunking (disable with --no-shuffle) so each chunk gets a representative mix of the parameter space and chunks take roughly the same wall time.

  3. Batch submission: Each chunk is submitted as a single sbatch --array call with a %throttle suffix to limit concurrency, which is orders of magnitude faster than submitting jobs individually.

  4. Monitoring: The tool polls sacct to track job status. Multiple chunks run concurrently: a new chunk is submitted whenever the number of remaining (incomplete) tasks across active chunks drops enough to fit another chunk within --max-concurrent. Use --serial-chunks to run one chunk at a time instead, which is useful when tasks compete for an external resource (e.g., API rate limits) beyond what Slurm's %throttle controls.

  5. Retries: When all regular chunks are done, failed tasks are batched into a single retry chunk and resubmitted, up to --max-retries per task.

  6. State persistence: All state is saved as JSON after every poll. Atomic writes (via temp file + rename) prevent corruption. You can resume at any time.
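
For concreteness, here is a minimal sketch of what a generated chunk script might look like under the Quick start settings above. The actual scripts and the .chunk file format are internal to slurmgrid and may differ; this only illustrates the SLURM_ARRAY_TASK_ID indexing and the log layout:

#!/bin/bash
#SBATCH --job-name=sc_chunk_000
#SBATCH --partition=gpu
#SBATCH --time=01:00:00
#SBATCH --mem=4G
#SBATCH --output=./sc_state/logs/chunk_000/slurm-%A_%a.out
#SBATCH --error=./sc_state/logs/chunk_000/slurm-%A_%a.err

# Look up this array task's row in the sub-manifest.
# (Illustrative only: treats the chunk as headerless, 0-indexed rows.)
row=$(sed -n "$((SLURM_ARRAY_TASK_ID + 1))p" ./sc_state/chunks/chunk_000.chunk)
IFS=',' read -r alpha beta seed <<< "$row"

python train.py --alpha "$alpha" --beta "$beta" --seed "$seed"

With the default chunk size on a cluster whose MaxArraySize is 10,000 (3,333 tasks per chunk) and --max-concurrent 5000, the whole chunk then goes up in a single call like:

sbatch --array=0-3332%5000 ./sc_state/scripts/chunk_000.sh

The atomic write in step 6 is the standard temp-file-plus-rename idiom, sketched here in shell (slurmgrid does the equivalent in Python):

tmp=$(mktemp ./sc_state/state.json.XXXXXX)   # temp file on the same filesystem
printf '%s\n' "$new_state_json" > "$tmp"     # write the full new state
mv "$tmp" ./sc_state/state.json              # rename is atomic on POSIX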

State directory layout

sc_state/
  config.json          # Frozen copy of the submission configuration
  state.json           # Chunk-level status and failure tracking
  monitor.lock         # hostname:pid of the running monitor (removed on clean exit)
  slurmgrid.log        # Tool's own log file
  chunks/
    chunk_000.chunk    # Sub-manifests (internal format)
    chunk_001.chunk
  scripts/
    chunk_000.sh       # Generated sbatch scripts
    chunk_001.sh
  logs/
    chunk_000/         # Slurm stdout/stderr per chunk
      slurm-12345_0.out
      slurm-12345_0.err

All options

Flag                Default      Description
--manifest          (required)   CSV/TSV manifest file
--command           (required)   Command template with {column} placeholders
--state-dir         ./sc_state   Directory for state, chunks, scripts, logs
--delimiter         auto-detect  Manifest delimiter (, for .csv, \t for .tsv)
--chunk-size        auto-detect  Jobs per array chunk (MaxArraySize / 3)
--max-concurrent    10000        Max simultaneously running tasks (Slurm %throttle)
--max-retries       3            Max retries per failed job
--poll-interval     30           Seconds between status checks
--max-runtime       unlimited    Max seconds to run before saving state and exiting
--dry-run           false        Generate scripts without submitting
--no-shuffle        false        Don't shuffle manifest rows before chunking
--partition                      Slurm partition
--time                           Wall time limit (e.g., 01:00:00)
--mem                            Memory per node (e.g., 4G)
--mem-per-cpu                    Memory per CPU
--cpus-per-task     1            CPUs per task
--gpus                           GPU specification
--gres                           Generic resource specification
--account                        Slurm account
--qos                            Quality of service
--constraint                     Node constraint
--exclude                        Nodes to exclude
--job-name-prefix   sc           Prefix for Slurm job names
--preamble                       Shell commands to run before the main command
--preamble-file                  File containing preamble commands
--extra-sbatch                   Extra #SBATCH flags (repeatable)
--after-run                      Path to a previous run's state directory; wait for that run to finish before submitting
--restart           false        Back up the existing state dir and start fresh
--no-backup         false        With --restart, delete the old state dir instead of backing it up
--headroom          auto         Task slots reserved for your other Slurm jobs; a new chunk is not submitted if it would push total active tasks above max-concurrent - headroom
--self-resubmit     false        On --max-runtime exit, automatically sbatch a new resume job
--serial-chunks     false        Submit one chunk at a time, waiting for full completion before the next
--config                         YAML config file; any option above can be set as a key
--reset-failures    false        Reset permanently failed tasks for retry (resume only)

Slurm flags (--time, --mem, --partition, etc.) can also be passed to resume to override the frozen config for that session. These overrides are transient and recorded per-chunk in state.json.

Requirements

  • Python 3.8+
  • Slurm with sbatch, sacct, squeue, scancel, scontrol available
  • Slurm accounting enabled (sacct must work)

License

MIT
