slurmgrid

Manage large Slurm job arrays that exceed your cluster's submission limit.

If you need to run 50,000 small jobs but your cluster caps MaxArraySize at 10,000 (or limits total queued jobs), slurmgrid handles the tedious cycle of "submit a batch, wait, submit the next batch" automatically. It chunks your parameter manifest, submits array jobs via sbatch, monitors completion via sacct, retries failures, and persists state so you can resume if interrupted.

Installation

pip install slurmgrid

Or clone the repo and install in editable mode:

git clone https://github.com/jgaeb/slurmgrid.git
cd slurmgrid
pip install -e .
python -m slurmgrid --help

Quick start

  1. Create a manifest file (CSV or TSV) with one row per job:
alpha,beta,seed
0.1,1,42
0.1,2,42
0.5,1,42
0.5,2,42
...
  2. Run slurmgrid submit with your command template:
python -m slurmgrid submit \
  --manifest params.csv \
  --command "python train.py --alpha {alpha} --beta {beta} --seed {seed}" \
  --partition gpu \
  --time 01:00:00 \
  --mem 4G \
  --max-concurrent 5000

That's it. slurmgrid will:

  • Shuffle and split the manifest into chunks (default: 1/3 of MaxArraySize)
  • Submit each chunk with a single sbatch --array call, using Slurm's %throttle suffix to cap concurrency at --max-concurrent
  • Poll sacct every 30 seconds to track completion
  • Submit the next chunk when the current one finishes
  • Batch failed jobs into retry chunks (up to --max-retries, default 3)
  • Save state to disk after every poll so you can resume if interrupted

Usage

Submit a new run

python -m slurmgrid submit \
  --manifest params.csv \
  --command "python train.py --alpha {alpha} --beta {beta}" \
  --state-dir ./my_run \
  --partition gpu \
  --time 02:00:00 \
  --mem 8G \
  --cpus-per-task 4 \
  --max-concurrent 5000 \
  --max-retries 3 \
  --poll-interval 30 \
  --preamble "module load python/3.10 && conda activate myenv"

The --command template uses {column_name} placeholders that are resolved from the manifest columns. Any column in the manifest can be referenced.
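
For example, with the template above and a manifest row where alpha=0.5 and beta=2, the task's command expands to:

python train.py --alpha 0.5 --beta 2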

Use a config file

Instead of a long command line, you can store submit options in a YAML file:

# run.yaml
manifest: params.csv
command: python train.py --alpha {alpha} --beta {beta} --seed {seed}
state-dir: ./my_run
partition: gpu
time: 02:00:00
mem: 8G
max-concurrent: 5000
max-retries: 3

Then run:

python -m slurmgrid submit --config run.yaml

CLI flags take precedence over config file values, so you can override individual options ad hoc:

python -m slurmgrid submit --config run.yaml --partition debug --time 00:10:00

Run the monitor as a Slurm job (recommended for HPC)

On clusters where login-node processes can be killed, submit the monitor itself as a low-resource batch job. Set --max-runtime slightly below the job's wall time and pass --self-resubmit to chain automatically:

sbatch --partition=gpu --time=03:00:00 --mem=1G -c 1 \
  --wrap="python -m slurmgrid submit \
    --config run.yaml \
    --max-runtime 10000 \
    --self-resubmit"

When --max-runtime is reached, slurmgrid saves state and submits a new slurmgrid resume job before exiting, so monitoring continues unattended until the run is complete.

To find or kill a running monitor at any time:

cat ./my_run/monitor.lock   # prints hostname:pid
ssh <hostname> kill <pid>

Resume an interrupted run

If you lose your SSH session or Ctrl-C out, running Slurm jobs continue independently. Resume monitoring with:

python -m slurmgrid resume --state-dir ./my_run

Retry permanently failed tasks

If a run finishes with permanently failed tasks (e.g., jobs that timed out), you can reset them and retry with different Slurm parameters:

python -m slurmgrid resume --state-dir ./my_run \
  --reset-failures \
  --time 04:00:00 \
  --mem 16G

--reset-failures clears the permanently_failed flag on all failure records and bumps max_retries so the monitor's retry machinery picks them up. Any Slurm flags passed to resume override the frozen config for this session only — the original config.json is not modified. Overrides are recorded per-chunk in state.json for provenance.
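
As a purely illustrative sketch of the bookkeeping involved (the real state.json schema is internal and may differ), one failure record conceptually carries fields like those reported by the failures command:

{
  "row": 42,
  "exit_code": 1,
  "retries": 3,
  "permanently_failed": true
}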

Chain runs with --after-run

If stage 2 depends on stage 1, pass stage 1's state directory to stage 2's submit (or resume) with --after-run. Stage 2's monitor will block until stage 1 is done before submitting any jobs:

# Stage 1 runs in background (or as a Slurm job with --self-resubmit)
python -m slurmgrid submit --config stage1.yaml --state-dir ./stage1 &

# Stage 2 waits for stage 1 to finish before submitting
python -m slurmgrid submit --config stage2.yaml --state-dir ./stage2 \
  --after-run ./stage1

Restart a run from scratch

To re-run from scratch with the same state directory, pass --restart. The old state is backed up automatically before the new run begins:

python -m slurmgrid submit --config run.yaml --restart

The old directory is renamed to <state-dir>.bak.<YYYYMMDD_HHMMSS>. To delete it instead of backing it up:

python -m slurmgrid submit --config run.yaml --restart --no-backup

Check status

python -m slurmgrid status --state-dir ./my_run
==================================================
  Total jobs:            50000
  Completed:             35420  (70.8%)
  Active:                 4580  (12 failing)
  Pending:               10000
  Failed (retrying):         0
  Failed (final):            0
  Chunks: 35/50 completed, 5 active, 10 pending
==================================================

Inspect failing jobs

While a run is in progress (or after it finishes), list all currently failing tasks with their manifest parameters and log file paths:

python -m slurmgrid failures --state-dir ./my_run
============================================================
Row 42  exit=1  retries=1  permanent=False
  alpha=0.5  beta=2  seed=42
  OUT: ./my_run/logs/chunk_003/slurm-98765_8.out
  ERR: ./my_run/logs/chunk_003/slurm-98765_8.err
  --- last 5 lines of .err ---
  Traceback (most recent call last):
  ...

Useful flags:

  • --permanently-failed-only: show only tasks that have exhausted all retries
  • --tail N: show last N lines of each task's .err log (default: 5)
  • --paths-only: show log paths but suppress log content

Cancel all jobs

python -m slurmgrid cancel --state-dir ./my_run

Dry run

Generate all chunk files and sbatch scripts without actually submitting:

python -m slurmgrid submit --manifest params.csv --command "echo {x}" --dry-run

Inspect the generated scripts in ./sc_state/scripts/ to verify correctness.

How it works

  1. Chunking: The manifest is split into sub-manifests. Each chunk gets its own sbatch script that uses SLURM_ARRAY_TASK_ID to index into the sub-manifest and extract that task's parameters (see the sketch after this list).

  2. Shuffling: Manifest rows are shuffled before chunking (disable with --no-shuffle) so each chunk gets a representative mix of the parameter space and chunks take roughly the same wall time.

  3. Batch submission: Each chunk is submitted as a single sbatch --array call with a %throttle suffix to limit concurrency, which is orders of magnitude faster than submitting jobs individually.

  4. Monitoring: The tool polls sacct to track job status. Multiple chunks run concurrently: a new chunk is submitted whenever the number of remaining (incomplete) tasks across active chunks drops enough to fit another chunk within --max-concurrent. Use --serial-chunks to run one chunk at a time instead, which is useful when tasks compete for an external resource (e.g., API rate limits) beyond what Slurm's %throttle controls.

  5. Retries: When all regular chunks are done, failed tasks are batched into a single retry chunk and resubmitted, up to --max-retries per task.

  6. State persistence: All state is saved as JSON after every poll. Atomic writes (via temp file + rename) prevent corruption. You can resume at any time.
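
For concreteness, here is a minimal sketch of what a generated chunk script might look like under the Quick start settings above. The actual scripts and the .chunk file format are internal to slurmgrid and may differ; this only illustrates the SLURM_ARRAY_TASK_ID indexing and the log layout:

#!/bin/bash
#SBATCH --job-name=sc_chunk_000
#SBATCH --partition=gpu
#SBATCH --time=01:00:00
#SBATCH --mem=4G
#SBATCH --output=./sc_state/logs/chunk_000/slurm-%A_%a.out
#SBATCH --error=./sc_state/logs/chunk_000/slurm-%A_%a.err

# Look up this array task's row in the sub-manifest.
# (Illustrative only: treats the chunk as headerless, 0-indexed rows.)
row=$(sed -n "$((SLURM_ARRAY_TASK_ID + 1))p" ./sc_state/chunks/chunk_000.chunk)
IFS=',' read -r alpha beta seed <<< "$row"

python train.py --alpha "$alpha" --beta "$beta" --seed "$seed"

With the default chunk size on a cluster whose MaxArraySize is 10,000 (3,333 tasks per chunk) and --max-concurrent 5000, the whole chunk then goes up in a single call like:

sbatch --array=0-3332%5000 ./sc_state/scripts/chunk_000.sh

The atomic write in step 6 is the standard temp-file-plus-rename idiom, sketched here in shell (slurmgrid does the equivalent in Python):

tmp=$(mktemp ./sc_state/state.json.XXXXXX)   # temp file on the same filesystem
printf '%s\n' "$new_state_json" > "$tmp"     # write the full new state
mv "$tmp" ./sc_state/state.json              # rename is atomic on POSIX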

State directory layout

sc_state/
  config.json          # Frozen copy of the submission configuration
  state.json           # Chunk-level status and failure tracking
  monitor.lock         # hostname:pid of the running monitor (removed on clean exit)
  slurmgrid.log        # Tool's own log file
  chunks/
    chunk_000.chunk    # Sub-manifests (internal format)
    chunk_001.chunk
  scripts/
    chunk_000.sh       # Generated sbatch scripts
    chunk_001.sh
  logs/
    chunk_000/         # Slurm stdout/stderr per chunk
      slurm-12345_0.out
      slurm-12345_0.err

All options

Flag                Default      Description
--manifest          (required)   CSV/TSV manifest file
--command           (required)   Command template with {column} placeholders
--state-dir         ./sc_state   Directory for state, chunks, scripts, logs
--delimiter         auto-detect  Manifest delimiter (, for .csv, \t for .tsv)
--chunk-size        auto-detect  Jobs per array chunk (MaxArraySize / 3)
--max-concurrent    10000        Max simultaneously running tasks (Slurm %throttle)
--max-retries       3            Max retries per failed job
--poll-interval     30           Seconds between status checks
--max-runtime       unlimited    Max seconds to run before saving state and exiting
--dry-run           false        Generate scripts without submitting
--no-shuffle        false        Don't shuffle manifest rows before chunking
--partition                      Slurm partition
--time                           Wall time limit (e.g., 01:00:00)
--mem                            Memory per node (e.g., 4G)
--mem-per-cpu                    Memory per CPU
--cpus-per-task     1            CPUs per task
--gpus                           GPU specification
--gres                           Generic resource specification
--account                        Slurm account
--qos                            Quality of service
--constraint                     Node constraint
--exclude                        Nodes to exclude
--job-name-prefix   sc           Prefix for Slurm job names
--preamble                       Shell commands to run before the main command
--preamble-file                  File containing preamble commands
--extra-sbatch                   Extra #SBATCH flags (repeatable)
--after-run                      Path to a previous run's state directory; wait for that run to finish before submitting
--restart           false        Back up the existing state dir and start fresh
--no-backup         false        With --restart, delete the old state dir instead of backing it up
--headroom          auto         Task slots reserved for your other Slurm jobs; a new chunk is not submitted if it would push total active tasks above max-concurrent - headroom
--self-resubmit     false        On --max-runtime exit, automatically sbatch a new resume job
--serial-chunks     false        Submit one chunk at a time, waiting for full completion before the next
--config                         YAML config file; any option above can be set as a key
--reset-failures    false        Reset permanently failed tasks for retry (resume only)

Slurm flags (--time, --mem, --partition, etc.) can also be passed to resume to override the frozen config for that session. These overrides are transient and recorded per-chunk in state.json.

Requirements

  • Python 3.8+
  • Slurm with sbatch, sacct, squeue, scancel, scontrol available
  • Slurm accounting enabled (sacct must work)

License

MIT
