slurmgrid
Manage large Slurm job arrays that exceed your cluster's submission limit.
If you need to run 50,000 small jobs but your cluster caps MaxArraySize at
10,000 (or limits total queued jobs), slurmgrid handles the tedious cycle of
"submit a batch, wait, submit the next batch" automatically. It chunks your
parameter manifest, submits array jobs via sbatch, monitors completion via
sacct, retries failures, and persists state so you can resume if interrupted.
Installation
pip install slurmgrid
Or clone the repo and install in editable mode:
git clone https://github.com/jgaeb/slurmgrid.git
cd slurmgrid
pip install -e .
python -m slurmgrid --help
Quick start
- Create a manifest file (CSV or TSV) with one row per job:
alpha,beta,seed
0.1,1,42
0.1,2,42
0.5,1,42
0.5,2,42
...
- Run slurmgrid submit with your command template:
python -m slurmgrid submit \
--manifest params.csv \
--command "python train.py --alpha {alpha} --beta {beta} --seed {seed}" \
--partition gpu \
--time 01:00:00 \
--mem 4G \
--max-concurrent 5000
That's it. slurmgrid will:
- Shuffle and split the manifest into chunks (default chunk size: 1/3 of MaxArraySize; see the arithmetic sketch after this list)
- Submit each chunk as a single array job via sbatch, using Slurm's %throttle to limit concurrency to --max-concurrent
- Poll sacct every 30 seconds to track completion
- Submit the next chunk when the current one finishes
- Batch failed jobs into retry chunks (up to --max-retries, default 3)
- Save state to disk after every poll so you can resume if interrupted
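For intuition, the chunk arithmetic works out as follows (a back-of-the-envelope sketch, not slurmgrid's code; it assumes your cluster reports a MaxArraySize of 10,000):

import math

max_array_size = 10_000            # from: scontrol show config | grep MaxArraySize
n_jobs = 50_000                    # rows in params.csv

chunk_size = max_array_size // 3   # slurmgrid's default: a third of MaxArraySize
n_chunks = math.ceil(n_jobs / chunk_size)
print(chunk_size, n_chunks)        # 3333 jobs per chunk -> 16 chunks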
Usage
Submit a new run
python -m slurmgrid submit \
--manifest params.csv \
--command "python train.py --alpha {alpha} --beta {beta}" \
--state-dir ./my_run \
--partition gpu \
--time 02:00:00 \
--mem 8G \
--cpus-per-task 4 \
--max-concurrent 5000 \
--max-retries 3 \
--poll-interval 30 \
--preamble "module load python/3.10 && conda activate myenv"
The --command template uses {column_name} placeholders that are resolved
from the manifest columns. Any column in the manifest can be referenced.
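The placeholders behave like Python format fields filled from each manifest row; a minimal sketch of the substitution (illustrative, not slurmgrid's actual code):

import csv

template = "python train.py --alpha {alpha} --beta {beta} --seed {seed}"
with open("params.csv", newline="") as f:
    for row in csv.DictReader(f):      # e.g. {"alpha": "0.1", "beta": "1", "seed": "42"}
        print(template.format(**row))  # python train.py --alpha 0.1 --beta 1 --seed 42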
Use a config file
Instead of a long command line, you can store submit options in a YAML file:
# run.yaml
manifest: params.csv
command: python train.py --alpha {alpha} --beta {beta} --seed {seed}
state-dir: ./my_run
partition: gpu
time: 02:00:00
mem: 8G
max-concurrent: 5000
max-retries: 3
python -m slurmgrid submit --config run.yaml
CLI flags take precedence over config file values, so you can override individual options ad hoc:
python -m slurmgrid submit --config run.yaml --partition debug --time 00:10:00
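The precedence is an ordinary last-writer-wins merge; a sketch of the idea (not slurmgrid's loader; assumes PyYAML is installed):

import yaml

with open("run.yaml") as f:
    options = yaml.safe_load(f)                   # base values from the config file
cli = {"partition": "debug", "time": "00:10:00"}  # flags given on the command line
options.update(cli)                               # CLI flags win on conflict
print(options["partition"])                       # -> debug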
Run the monitor as a Slurm job (recommended for HPC)
On clusters where login node processes can be killed, submit the monitor
itself as a low-resource batch job. Use --max-runtime slightly under
the wall time and --self-resubmit to chain automatically:
sbatch --partition=gpu --time=03:00:00 --mem=1G -c 1 \
--wrap="python -m slurmgrid submit \
--config run.yaml \
--max-runtime 10000 \
--self-resubmit"
When --max-runtime is reached, slurmgrid saves state and submits a new
slurmgrid resume job before exiting, so monitoring continues unattended
until the run is complete.
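Conceptually, the exit-and-chain logic looks like this (a sketch with a hypothetical run_complete() stand-in for the real sacct polling, not slurmgrid's actual code):

import subprocess
import time

MAX_RUNTIME = 10_000    # seconds, mirrors --max-runtime
POLL_INTERVAL = 30      # seconds, mirrors --poll-interval

def run_complete():     # hypothetical stand-in for slurmgrid's sacct polling
    return False

start = time.monotonic()
while not run_complete():
    if time.monotonic() - start > MAX_RUNTIME:
        # State is already saved on disk; chain a fresh monitor before exiting.
        subprocess.run(
            ["sbatch", "--wrap",
             "python -m slurmgrid resume --state-dir ./my_run"],
            check=True,
        )
        break
    time.sleep(POLL_INTERVAL)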
To find or kill a running monitor at any time:
cat ./my_run/monitor.lock # prints hostname:pid
ssh <hostname> kill <pid>
Resume an interrupted run
If you lose your SSH session or Ctrl-C out, running Slurm jobs continue independently. Resume monitoring with:
python -m slurmgrid resume --state-dir ./my_run
Retry permanently failed tasks
If a run finishes with permanently failed tasks (e.g., jobs that timed out), you can reset them and retry with different Slurm parameters:
python -m slurmgrid resume --state-dir ./my_run \
--reset-failures \
--time 04:00:00 \
--mem 16G
--reset-failures clears the permanently_failed flag on all failure records
and bumps max_retries so the monitor's retry machinery picks them up. Any
Slurm flags passed to resume override the frozen config for this session
only — the original config.json is not modified. Overrides are recorded
per-chunk in state.json for provenance.
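To make that concrete, here is what a failure record might look like and what the reset amounts to (all field names besides permanently_failed are illustrative; state.json's real schema may differ):

# Hypothetical failure record; only permanently_failed is named in the docs.
record = {"row": 42, "exit_code": 1, "retries": 3, "permanently_failed": True}

# --reset-failures clears the flag and bumps the retry budget so the
# normal retry machinery picks the task up again.
record["permanently_failed"] = False
max_retries = record["retries"] + 1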
Chain runs with --after-run
If stage 2 depends on stage 1, pass stage 1's state directory to stage 2's
submit (or resume) with --after-run. Stage 2's monitor will block until
stage 1 is done before submitting any jobs:
# Stage 1 runs in background (or as a Slurm job with --self-resubmit)
python -m slurmgrid submit --config stage1.yaml --state-dir ./stage1 &
# Stage 2 waits for stage 1 to finish before submitting
python -m slurmgrid submit --config stage2.yaml --state-dir ./stage2 \
--after-run ./stage1
Restart a run from scratch
To re-run from scratch with the same state directory, pass --restart. The
old state is backed up automatically before the new run begins:
python -m slurmgrid submit --config run.yaml --restart
The old directory is renamed to <state-dir>.bak.<YYYYMMDD_HHMMSS>. To
delete it instead of backing it up:
python -m slurmgrid submit --config run.yaml --restart --no-backup
Check status
python -m slurmgrid status --state-dir ./my_run
==================================================
Total jobs: 50000
Completed: 35420 (70.8%)
Active: 4580 (12 failing)
Pending: 10000
Failed (retrying): 0
Failed (final): 0
Chunks: 35/50 completed, 5 active, 10 pending
==================================================
Inspect failing jobs
While a run is in progress (or after it finishes), list all currently failing tasks with their manifest parameters and log file paths:
python -m slurmgrid failures --state-dir ./my_run
============================================================
Row 42 exit=1 retries=1 permanent=False
alpha=0.5 beta=2 seed=42
OUT: ./my_run/logs/chunk_003/slurm-98765_8.out
ERR: ./my_run/logs/chunk_003/slurm-98765_8.err
--- last 5 lines of .err ---
Traceback (most recent call last):
...
Useful flags:
- --permanently-failed-only: show only tasks that have exhausted all retries
- --tail N: show the last N lines of each task's .err log (default: 5)
- --paths-only: show log paths but suppress log content
Cancel all jobs
python -m slurmgrid cancel --state-dir ./my_run
Dry run
Generate all chunk files and sbatch scripts without actually submitting:
python -m slurmgrid submit --manifest params.csv --command "echo {x}" --dry-run
Inspect the generated scripts in ./sc_state/scripts/ to verify correctness.
How it works
- Chunking: The manifest is split into sub-manifests. Each chunk gets its own sbatch script that uses SLURM_ARRAY_TASK_ID to index into the sub-manifest and extract the parameters for that task (see the sketch after this list).
- Shuffling: Manifest rows are shuffled before chunking (disable with --no-shuffle) so each chunk gets a representative mix of the parameter space and chunks take roughly the same wall time.
- Batch submission: Each chunk is submitted as a single sbatch --array call with a %throttle suffix to limit concurrency, which is orders of magnitude faster than submitting jobs individually.
- Monitoring: The tool polls sacct to track job status. Multiple chunks run concurrently: a new chunk is submitted whenever the number of remaining (incomplete) tasks across active chunks drops enough to fit another chunk within --max-concurrent. Use --serial-chunks to run one chunk at a time instead, which is useful when tasks compete for an external resource (e.g., API rate limits) beyond what Slurm's %throttle controls.
- Retries: When all regular chunks are done, failed tasks are batched into a single retry chunk and resubmitted, up to --max-retries per task.
- State persistence: All state is saved as JSON after every poll. Atomic writes (via temp file + rename) prevent corruption. You can resume at any time.
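At runtime, each array task conceptually does the following (a sketch: the .chunk format is internal, so the CSV-style layout and zero-based task indexing here are assumptions):

import csv
import os
import shlex
import subprocess

task_id = int(os.environ["SLURM_ARRAY_TASK_ID"])  # set by Slurm per array task

with open("sc_state/chunks/chunk_000.chunk", newline="") as f:
    rows = list(csv.DictReader(f))                # assumed CSV-like layout

params = rows[task_id]                            # this task's manifest row
cmd = "python train.py --alpha {alpha} --beta {beta} --seed {seed}".format(**params)
subprocess.run(shlex.split(cmd), check=True)      # nonzero exit marks the task failed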
State directory layout
sc_state/
  config.json        # Frozen copy of the submission configuration
  state.json         # Chunk-level status and failure tracking
  monitor.lock       # hostname:pid of the running monitor (removed on clean exit)
  slurmgrid.log      # Tool's own log file
  chunks/
    chunk_000.chunk  # Sub-manifests (internal format)
    chunk_001.chunk
  scripts/
    chunk_000.sh     # Generated sbatch scripts
    chunk_001.sh
  logs/
    chunk_000/       # Slurm stdout/stderr per chunk
      slurm-12345_0.out
      slurm-12345_0.err
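The atomic-write pattern behind state.json, sketched (the temp-file-plus-rename trick described above, not slurmgrid's actual function):

import json
import os
import tempfile

def atomic_write_json(path, obj):
    """Write obj to path so a crash can never leave a half-written file."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(obj, f, indent=2)
    os.replace(tmp, path)  # rename within a filesystem is atomic on POSIX

atomic_write_json("sc_state/state.json", {"chunks": {}, "failures": []})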
All options
| Flag | Default | Description |
|---|---|---|
| --manifest | (required) | CSV/TSV manifest file |
| --command | (required) | Command template with {column} placeholders |
| --state-dir | ./sc_state | Directory for state, chunks, scripts, logs |
| --delimiter | auto-detect | Manifest delimiter (, for .csv, \t for .tsv) |
| --chunk-size | auto-detect | Jobs per array chunk (default: MaxArraySize / 3) |
| --max-concurrent | 10000 | Max simultaneously running tasks (Slurm %throttle) |
| --max-retries | 3 | Max retries per failed job |
| --poll-interval | 30 | Seconds between status checks |
| --max-runtime | unlimited | Max seconds to run before saving state and exiting |
| --dry-run | false | Generate scripts without submitting |
| --no-shuffle | false | Don't shuffle manifest rows before chunking |
| --partition | | Slurm partition |
| --time | | Wall time limit (e.g., 01:00:00) |
| --mem | | Memory per node (e.g., 4G) |
| --mem-per-cpu | | Memory per CPU |
| --cpus-per-task | 1 | CPUs per task |
| --gpus | | GPU specification |
| --gres | | Generic resource specification |
| --account | | Slurm account |
| --qos | | Quality of service |
| --constraint | | Node constraint |
| --exclude | | Nodes to exclude |
| --job-name-prefix | sc | Prefix for Slurm job names |
| --preamble | | Shell commands run before the main command |
| --preamble-file | | File containing preamble commands |
| --extra-sbatch | | Extra #SBATCH flags (repeatable) |
| --after-run | | Wait for a previous run (path to its state directory) to finish before submitting |
| --restart | false | Back up the existing state dir and start fresh |
| --no-backup | false | With --restart, delete the old state dir instead of backing it up |
| --headroom | auto | Task slots reserved for your other Slurm jobs; a new chunk is not submitted if it would push total active tasks above max-concurrent minus headroom |
| --self-resubmit | false | On --max-runtime exit, automatically sbatch a new resume job |
| --serial-chunks | false | Submit one chunk at a time (wait for full completion before submitting the next) |
| --config | | YAML config file; any option above can be set as a key |
| --reset-failures | false | Reset permanently failed tasks for retry (resume only) |
Slurm flags (--time, --mem, --partition, etc.) can also be passed to
resume to override the frozen config for that session. These overrides are
transient and recorded per-chunk in state.json.
Requirements
- Python 3.8+
- Slurm with sbatch, sacct, squeue, scancel, scontrol available
- Slurm accounting enabled (sacct must work)
License
MIT
Download files
File details
Details for the file slurmgrid-0.2.0.tar.gz.
File metadata
- Download URL: slurmgrid-0.2.0.tar.gz
- Upload date:
- Size: 49.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 413fdb9ebe3b12913a782108649993fe665d52ac76addc0351cfe191456f42d4 |
| MD5 | d5e46aca7575a7b4650bba31f8bf176f |
| BLAKE2b-256 | a664a9ff4ab55ca8e06d28979e286f7048c325cf6b3258f01f481434692a0ad3 |
Provenance
The following attestation bundles were made for slurmgrid-0.2.0.tar.gz:
Publisher: publish.yml on jgaeb/slurmgrid
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: slurmgrid-0.2.0.tar.gz
- Subject digest: 413fdb9ebe3b12913a782108649993fe665d52ac76addc0351cfe191456f42d4
- Sigstore transparency entry: 1141600765
- Sigstore integration time:
- Permalink: jgaeb/slurmgrid@c1e2649177051474d247f2026f1c54b784bfa885
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/jgaeb
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@c1e2649177051474d247f2026f1c54b784bfa885
- Trigger Event: release
File details
Details for the file slurmgrid-0.2.0-py3-none-any.whl.
File metadata
- Download URL: slurmgrid-0.2.0-py3-none-any.whl
- Upload date:
- Size: 32.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 42ec97f74d12b54959235c0c64c28066485106bf786afaea086913c0bb5c543e |
| MD5 | 64fc63bc03e303287bd10729d2ea2259 |
| BLAKE2b-256 | 25214e840a0bef7e774a9fc7afbe8af53444572e0b23f95d9e605e613097f730 |
Provenance
The following attestation bundles were made for slurmgrid-0.2.0-py3-none-any.whl:
Publisher: publish.yml on jgaeb/slurmgrid
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: slurmgrid-0.2.0-py3-none-any.whl
- Subject digest: 42ec97f74d12b54959235c0c64c28066485106bf786afaea086913c0bb5c543e
- Sigstore transparency entry: 1141601493
- Sigstore integration time:
- Permalink: jgaeb/slurmgrid@c1e2649177051474d247f2026f1c54b784bfa885
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/jgaeb
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@c1e2649177051474d247f2026f1c54b784bfa885
- Trigger Event: release