slurmgrid

Manage large Slurm job arrays that exceed your cluster's submission limit.

If you need to run 50,000 small jobs but your cluster caps MaxArraySize at 10,000 (or limits total queued jobs), slurmgrid handles the tedious cycle of "submit a batch, wait, submit the next batch" automatically. It chunks your parameter manifest, submits array jobs via sbatch, monitors completion via sacct, retries failures, and persists state so you can resume if interrupted.

Installation

pip install slurmgrid

Or clone the repo and install in editable mode:

git clone https://github.com/jgaeb/slurmgrid.git
cd slurmgrid
pip install -e .
python -m slurmgrid --help

Quick start

  1. Create a manifest file (CSV or TSV) with one row per job:
alpha,beta,seed
0.1,1,42
0.1,2,42
0.5,1,42
0.5,2,42
...
  2. Run slurmgrid submit with your command template:
python -m slurmgrid submit \
  --manifest params.csv \
  --command "python train.py --alpha {alpha} --beta {beta} --seed {seed}" \
  --partition gpu \
  --time 01:00:00 \
  --mem 4G \
  --max-concurrent 5000

That's it. slurmgrid will:

  • Shuffle and split the manifest into chunks (default chunk size: MaxArraySize / 3)
  • Submit each chunk as a single array job via sbatch, using Slurm's %throttle suffix to cap concurrency at --max-concurrent (see the example after this list)
  • Poll sacct every 30 seconds to track completion
  • Submit the next chunk when the current one finishes
  • Batch failed jobs into retry chunks (up to --max-retries, default 3)
  • Save state to disk after every poll so you can resume if interrupted
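
Under the hood, each chunk becomes a single array submission. For a 5,000-task chunk throttled to 5,000 concurrent tasks, the call slurmgrid issues is roughly equivalent to the following (the script name and array bounds here are illustrative):

sbatch --array=0-4999%5000 chunk_000.sh

To see your cluster's actual array cap, query the Slurm configuration:

scontrol show config | grep -i maxarraysize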

Usage

Submit a new run

python -m slurmgrid submit \
  --manifest params.csv \
  --command "python train.py --alpha {alpha} --beta {beta}" \
  --state-dir ./my_run \
  --partition gpu \
  --time 02:00:00 \
  --mem 8G \
  --cpus-per-task 4 \
  --max-concurrent 5000 \
  --max-retries 3 \
  --poll-interval 30 \
  --preamble "module load python/3.10 && conda activate myenv"

The --command template uses {column_name} placeholders that are resolved from the manifest columns. Any column in the manifest can be referenced.
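
For example, given the template above and a manifest row with alpha = 0.1 and beta = 2, the task for that row runs:

python train.py --alpha 0.1 --beta 2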

Use a config file

Instead of a long command line, you can store submit options in a YAML file:

# run.yaml
manifest: params.csv
command: python train.py --alpha {alpha} --beta {beta} --seed {seed}
state-dir: ./my_run
partition: gpu
time: 02:00:00
mem: 8G
max-concurrent: 5000
max-retries: 3

Then run:

python -m slurmgrid submit --config run.yaml

CLI flags take precedence over config file values, so you can override individual options ad hoc:

python -m slurmgrid submit --config run.yaml --partition debug --time 00:10:00

Resume an interrupted run

If you lose your SSH session or Ctrl-C out, running Slurm jobs continue independently. Resume monitoring with:

python -m slurmgrid resume --state-dir ./my_run
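
Before resuming, you can confirm that the run's array jobs are still queued or running with standard Slurm tools (the sc_ name filter assumes the default --job-name-prefix; exact job names are internal to slurmgrid):

squeue -u "$USER" --format="%.18i %.30j %.8T" | grep sc_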

Check status

python -m slurmgrid status --state-dir ./my_run

Example output:

==================================================
  Total jobs:            50000
  Completed:             35420  (70.8%)
  Active:                 4580
  Pending:               10000
  Failed (retrying):         0
  Failed (final):            0
  Chunks: 35/50 completed, 5 active, 10 pending
==================================================

Cancel all jobs

python -m slurmgrid cancel --state-dir ./my_run

Dry run

Generate all chunk files and sbatch scripts without actually submitting:

python -m slurmgrid submit --manifest params.csv --command "echo {x}" --dry-run

Inspect the generated scripts in ./sc_state/scripts/ to verify correctness.

How it works

  1. Chunking: The manifest is split into sub-manifests. Each chunk gets its own sbatch script that uses SLURM_ARRAY_TASK_ID to index into the sub-manifest and extract the parameters for that task (see the sketch after this list).

  2. Shuffling: Manifest rows are shuffled before chunking (disable with --no-shuffle) so each chunk gets a representative mix of the parameter space and chunks take roughly the same wall time.

  3. Batch submission: Each chunk is submitted as a single sbatch --array call with a %throttle suffix to limit concurrency, which is orders of magnitude faster than submitting jobs individually.

  4. Monitoring: The tool polls sacct to track job status. Multiple chunks run concurrently as capacity allows (up to --max-concurrent total tasks); new chunks are submitted as running ones complete.

  5. Retries: When all regular chunks are done, failed tasks are batched into a single retry chunk and resubmitted, up to --max-retries per task.

  6. State persistence: All state is saved as JSON after every poll. Atomic writes (via temp file + rename) prevent corruption. You can resume at any time. (A sketch of the write pattern appears below.)
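
To make steps 1 and 3 concrete, here is a rough sketch of what a generated chunk script could look like. The internal .chunk format and the exact extraction logic are slurmgrid implementation details; this sketch assumes a CSV-like sub-manifest with a header row:

#!/bin/bash
#SBATCH --job-name=sc_chunk_000
#SBATCH --partition=gpu
#SBATCH --time=01:00:00
#SBATCH --mem=4G
#SBATCH --array=0-4999%5000

# SLURM_ARRAY_TASK_ID picks one row of the sub-manifest;
# task 0 maps to line 2 because line 1 is the header.
row=$(sed -n "$((SLURM_ARRAY_TASK_ID + 2))p" sc_state/chunks/chunk_000.chunk)
IFS=',' read -r alpha beta seed <<< "$row"
python train.py --alpha "$alpha" --beta "$beta" --seed "$seed"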

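Step 6's atomic write is the standard temp-file-plus-rename pattern: rename(2) is atomic on POSIX filesystems, so a crash mid-write leaves the previous state.json intact. A minimal shell sketch (the variable name is illustrative; the tool's actual serialization code is internal):

# Serialize the new state to a temp file, then rename it into place.
printf '%s\n' "$new_state_json" > sc_state/state.json.tmp
mv -f sc_state/state.json.tmp sc_state/state.json
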
State directory layout

sc_state/
  config.json          # Frozen copy of the submission configuration
  state.json           # Chunk-level status and failure tracking
  slurmgrid.log        # Tool's own log file
  chunks/
    chunk_000.chunk    # Sub-manifests (internal format)
    chunk_001.chunk
  scripts/
    chunk_000.sh       # Generated sbatch scripts
    chunk_001.sh
  logs/
    chunk_000/         # Slurm stdout/stderr per chunk
      slurm-12345_0.out
      slurm-12345_0.err

All options

Flag                Default       Description
--manifest          (required)    CSV/TSV manifest file
--command           (required)    Command template with {column} placeholders
--state-dir         ./sc_state    Directory for state, chunks, scripts, logs
--delimiter         auto-detect   Manifest delimiter (, for .csv, \t for .tsv)
--chunk-size        auto-detect   Jobs per array chunk (default: MaxArraySize / 3)
--max-concurrent    10000         Max simultaneously running tasks (Slurm %throttle)
--max-retries       3             Max retries per failed job
--poll-interval     30            Seconds between status checks
--max-runtime       unlimited     Max seconds to run before saving state and exiting
--dry-run           false         Generate scripts without submitting
--no-shuffle        false         Don't shuffle manifest rows before chunking
--partition                       Slurm partition
--time                            Wall time limit (e.g., 01:00:00)
--mem                             Memory per node (e.g., 4G)
--mem-per-cpu                     Memory per CPU
--cpus-per-task     1             CPUs per task
--gpus                            GPU specification
--gres                            Generic resource specification
--account                         Slurm account
--qos                             Quality of service
--constraint                      Node constraint
--exclude                         Nodes to exclude
--job-name-prefix   sc            Prefix for Slurm job names
--preamble                        Shell commands to run before the main command
--preamble-file                   File containing preamble commands
--extra-sbatch                    Extra #SBATCH flags (repeatable)
--after-run                       Wait for a previous run to finish before submitting (path to its state directory)
--config                          YAML config file; any option above can be set as a key

Requirements

  • Python 3.8+
  • Slurm with sbatch, sacct, squeue, scancel, scontrol available
  • Slurm accounting enabled (sacct must work)
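
Two quick sanity checks before launching a large run (standard Slurm and POSIX commands; not part of slurmgrid itself):

# All required Slurm client commands should resolve on PATH
command -v sbatch sacct squeue scancel scontrol

# Accounting should be queryable: this lists your jobs since midnight
sacct -u "$USER" --starttime today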

License

MIT
