CLI tools for managing and generating SLURM jobs
A CLI toolkit for managing and generating SLURM jobs.
slurmkit provides tools for:
- Auto-discovering and tracking SLURM job status
- Generating job scripts from templates with parameter sweeps
- Organizing jobs into trackable collections
- Cross-cluster job synchronization
- Cleaning up failed jobs and W&B runs
Installation
Install via pip
pip install slurmkit
Install Latest from GitHub
pip install git+https://github.com/Awni00/slurmkit.git
# include all optional extras (ui + dev + docs)
pip install "slurmkit[all] @ git+https://github.com/Awni00/slurmkit.git"
Clone and Install (Recommended for Development)
git clone https://github.com/Awni00/slurmkit.git
cd slurmkit
pip install -e ".[all]"
Dependencies
Required:
- Python 3.8+
- PyYAML
- Jinja2
- pandas
- tabulate
- requests
Optional:
- wandb (for W&B cleanup features)
- rich (enhanced CLI UI; install with pip install "slurmkit[ui]")
- The all extra installs every optional group (ui, dev, docs)
Quick Start
1. Initialize Project
cd your-project
slurmkit init
This creates .slurmkit/config.yaml with your settings.
2. Check Job Status
slurmkit status my_experiment
3. Generate Jobs from Template
Create a template templates/train.job.j2:
#!/bin/bash
#SBATCH --job-name={{ job_name }}
#SBATCH --partition={{ slurm.partition }}
#SBATCH --time={{ slurm.time }}
#SBATCH --output={{ logs_dir }}/{{ job_name }}.%j.out
python train.py --lr {{ learning_rate }} --bs {{ batch_size }}
Create a job spec experiments/exp1/job_spec.yaml:
name: exp1
template: ../../templates/train.job.j2
job_subdir: exp1
parameters:
  mode: grid
  values:
    learning_rate: [0.001, 0.01, 0.1]
    batch_size: [32, 64]
    n_trials: [3]
  # Optional: derive effective params before filtering/rendering
  parse: params_logic.py:parse_params
  # Optional: exclude incompatible combinations
  filter: params_logic.py:include_params
slurm_args:
  defaults:
    partition: gpu
    time: "24:00:00"
job_name_pattern: "lr{{ learning_rate }}_bs{{ batch_size }}"
With this layout, slurmkit writes scripts to .jobs/exp1/job_scripts/ and expects logs in .jobs/exp1/logs/.
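The parse and filter entries in the spec point at Python callables in params_logic.py. This README does not show their signatures, so the sketch below is only an assumption: each callable receives one parameter dict, parse_params returns a (possibly extended) dict, and include_params returns a boolean. The derived lr_per_sample field and the exclusion rule are hypothetical illustrations.

```python
# params_logic.py: hypothetical sketch. The real callback signatures are not
# shown in this README and may differ.

def parse_params(params):
    """Return a copy of params with one derived value added (assumed contract)."""
    derived = dict(params)
    # Hypothetical derived field: learning rate divided by batch size.
    derived["lr_per_sample"] = params["learning_rate"] / params["batch_size"]
    return derived


def include_params(params):
    """Return True to keep a combination, False to drop it (assumed contract)."""
    # Hypothetical rule: skip the largest learning rate at the smallest batch size.
    return not (params["learning_rate"] >= 0.1 and params["batch_size"] <= 32)
```

Under these assumptions, the 3 x 2 x 1 grid above would shrink from six combinations to five, because the (0.1, 32) pair is excluded.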
Generate jobs:
slurmkit generate experiments/exp1/job_spec.yaml --into exp1
4. Submit Jobs
# Preview before actual submission
slurmkit submit exp1 --dry-run
# Submit to SLURM
slurmkit submit exp1
5. Monitor and Resubmit
# Preview active jobs that would be cancelled
slurmkit collections cancel exp1 --dry-run
# View collection status
slurmkit status exp1
slurmkit collections show exp1
# Rich UI (if installed)
slurmkit --ui rich collections analyze exp1
# Resubmit jobs by state filter
slurmkit resubmit exp1 --filter failed
# Resubmit only preempted jobs
slurmkit resubmit exp1 --filter preempted
# Resubmit a single tracked job by SLURM job ID (collection inferred)
slurmkit resubmit --job-id 123456 -y
# Override the autogenerated submission group label
slurmkit resubmit exp1 --filter failed --submission-group retry_after_fix
Testing and Showcase Workflows
A) Local Demo (No SLURM Required)
Use the bundled demo project for a deterministic feature showcase:
cd examples/demo_project
python -m venv .venv
source .venv/bin/activate
pip install -e ../..
./setup_dummy_jobs.py --include-non-terminal
Then run:
slurmkit collections list
slurmkit status fixtures/mixed_30
slurmkit collections show fixtures/mixed_30
slurmkit collections analyze fixtures/mixed_30
# Optional richer formatting (requires rich extra):
slurmkit --ui rich collections analyze fixtures/mixed_30
slurmkit notify test --dry-run
slurmkit notify collection-final --collection notifications/terminal_failed --job-id 991002 --no-refresh --dry-run
B) Real Cluster Workflow
slurmkit generate experiments/exp1/job_spec.yaml --into exp1
slurmkit submit exp1 --dry-run
slurmkit submit exp1
slurmkit status exp1
slurmkit collections refresh exp1
slurmkit collections cancel exp1 --dry-run
slurmkit collections show exp1
slurmkit collections analyze exp1
slurmkit resubmit exp1 --filter failed --dry-run
C) Feature Checklist
| Goal | Command | Success signal |
|---|---|---|
| Initialize config | slurmkit init | .slurmkit/config.yaml created |
| Generate scripts | slurmkit generate ... --into exp1 | Job scripts written and collection updated |
| Preview submission | slurmkit submit exp1 --dry-run | Candidate jobs listed with no submit |
| Inspect collection | slurmkit collections show exp1 | Summary + jobs table rendered |
| Analyze outcomes | slurmkit collections analyze exp1 | Parameter tables and risky/stable sections shown |
| Validate notifications | slurmkit notify test --dry-run | Route resolution and payload preview |
Commands
| Command | Description |
|---|---|
| slurmkit init | Initialize project configuration |
| slurmkit install-skill | Install the slurmkit Codex skill via npx skills |
| slurmkit migrate | Upgrade local config and collections to the current schema |
| slurmkit status <collection> | Show live status for a collection |
| slurmkit generate <spec> | Generate job scripts from a spec into a collection |
| slurmkit submit <collection> | Submit a collection |
| slurmkit resubmit [collection] [--job-id <id>] | Resubmit jobs in a collection by explicit state filter, or resubmit one tracked job by ID |
| slurmkit notify | Send job lifecycle notifications |
| slurmkit collections | List, inspect, analyze, refresh, cancel, and delete collections |
| slurmkit clean outputs | Clean failed job outputs |
| slurmkit clean wandb | Clean failed W&B runs |
| slurmkit sync | Sync job states across clusters |
Run slurmkit <command> --help for detailed usage.
Install the skill quickly:
slurmkit install-skill --yes
Configuration
Configuration is stored in .slurmkit/config.yaml:
jobs_dir: .jobs/
output_patterns:
  - "{job_name}.{job_id}.out"
  - "{job_name}.{job_id}.*.out"
  - "slurm-{job_id}.out"
slurm_defaults:
  partition: gpu
  time: "24:00:00"
  mem: "32G"
ui:
  mode: plain  # plain | rich | auto
  columns:
    collections_show:
      - job_name
      - job_id
      - state
      - runtime
      - attempt
      - submission_group
      - resubmissions
      - output_path
  collections_show:
    pager: less  # less | none
notifications:
  defaults:
    events: [job_failed]
    timeout_seconds: 5
    max_attempts: 3
    backoff_seconds: 0.5
    output_tail_lines: 40
  job:
    ai:
      enabled: false
      callback: null
  collection_final:
    attempt_mode: latest
    min_support: 3
    top_k: 10
    include_failed_output_tail_lines: 20
    ai:
      enabled: false
      callback: null
  routes:
    - name: team_slack
      type: slack
      url: "${SLACK_WEBHOOK_URL}"
      events: [job_failed, collection_failed]
    - name: team_email
      type: email
      to: ["ops@example.com", "ml@example.com"]
      from: "${SLURMKIT_EMAIL_FROM}"
      smtp_host: "${SMTP_HOST}"
      smtp_port: 587
      smtp_username: "${SMTP_USER}"
      smtp_password: "${SMTP_PASSWORD}"
      smtp_starttls: true
      smtp_ssl: false
      events: [job_failed, collection_failed]
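The output_patterns entries decide which log files belong to a job. As a rough illustration only, the sketch below assumes that {job_name} and {job_id} are filled in first and that * then acts as a glob wildcard; slurmkit's actual matching logic is not shown in this README, and find_outputs is a hypothetical helper.

```python
from fnmatch import fnmatchcase

OUTPUT_PATTERNS = [
    "{job_name}.{job_id}.out",
    "{job_name}.{job_id}.*.out",
    "slurm-{job_id}.out",
]


def find_outputs(filenames, job_name, job_id, patterns=OUTPUT_PATTERNS):
    """Return the filenames that match any pattern for the given job."""
    # Fill in the placeholders, leaving '*' in place as a glob wildcard.
    globs = [p.format(job_name=job_name, job_id=job_id) for p in patterns]
    return [f for f in filenames if any(fnmatchcase(f, g) for g in globs)]


files = ["train.123.out", "train.123.0.out", "slurm-123.out", "train.124.out"]
matches = find_outputs(files, "train", 123)
print(matches)
```

Here train.124.out is left out because its job ID differs, while the other three files each match one of the patterns.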
Environment Variables
| Variable | Description |
|---|---|
| SLURMKIT_CONFIG | Path to config file |
| SLURMKIT_JOBS_DIR | Jobs directory |
| SLURMKIT_WANDB_ENTITY | W&B entity |
| SLURMKIT_DRY_RUN | Enable dry-run mode |
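One way to picture how these variables might interact with config.yaml is sketched below. It rests on several assumptions not confirmed by this README: that environment variables take precedence over file values, that the default config path is .slurmkit/config.yaml, and that values such as 1, true, or yes enable dry-run mode. resolve_settings is a hypothetical helper, not part of slurmkit.

```python
import os


def resolve_settings(config, environ=None):
    """Overlay SLURMKIT_* variables on config values (assumed precedence)."""
    env = os.environ if environ is None else environ
    return {
        # Assumed default location, taken from the Configuration section.
        "config_path": env.get("SLURMKIT_CONFIG", ".slurmkit/config.yaml"),
        "jobs_dir": env.get("SLURMKIT_JOBS_DIR", config.get("jobs_dir", ".jobs/")),
        "wandb_entity": env.get("SLURMKIT_WANDB_ENTITY"),
        # Assumed truthy spellings; the accepted values are not documented here.
        "dry_run": env.get("SLURMKIT_DRY_RUN", "").lower() in {"1", "true", "yes"},
    }


settings = resolve_settings(
    {"jobs_dir": ".jobs/"},
    environ={"SLURMKIT_JOBS_DIR": "/scratch/jobs", "SLURMKIT_DRY_RUN": "1"},
)
print(settings["jobs_dir"], settings["dry_run"])
```

Passing environ explicitly keeps the sketch easy to test without touching the real process environment.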
Documentation
Full documentation is available at https://awni00.github.io/slurmkit/
- Getting Started
- Configuration
- Job Generation
- Collections
- Notifications
- Cross-Cluster Sync
- CLI Reference
Project Structure
your-project/
├── .slurmkit/
│   ├── config.yaml          # Project configuration
│   ├── collections/         # Collection YAML files
│   ├── sync/                # Cross-cluster sync files
│   └── backups/             # Migration backups (created on demand)
├── .jobs/
│   └── experiment1/
│       ├── job_scripts/     # Generated job scripts
│       └── logs/            # Job output files
└── templates/               # Jinja2 job templates
Features
Key features at a glance:
1) Job Creation
- Generate parameterized job scripts and attach them to a collection:
  slurmkit generate job_spec.yaml --into exp1
- Preview generation and submission safely:
  slurmkit generate ... --dry-run, slurmkit submit ... --dry-run
- Submit only unsubmitted collection jobs (default):
  slurmkit submit exp1 --filter unsubmitted
2) Collection Tracking and Analysis
- Inspect, analyze, and refresh collections:
  slurmkit status exp1, slurmkit collections show exp1, slurmkit collections refresh exp1
- Cancel active jobs across tracked attempts:
  slurmkit collections cancel exp1 --dry-run
- Analyze outcomes by parameter values:
  slurmkit collections analyze exp1 --top-k 10
- Resubmit filtered jobs with deterministic regeneration by default, including optional selection and parameter callbacks (e.g., checkpoint dir):
  slurmkit resubmit exp1 --filter failed --select-file callbacks.py --extra-params-file extra.py
3) Notifications and Cross-Cluster Sync
- Validate routes and send job notifications:
  slurmkit notify test, slurmkit notify job ...
- Send one final collection-level summary when a collection reaches terminal state:
  slurmkit notify collection-final ...
- Sync collection/job state across clusters via git-backed files:
  slurmkit sync --push
Job Collections
Track related jobs together:
# List collections
slurmkit collections list
# Show details
slurmkit status my_exp
slurmkit collections show my_exp --state failed
# Update states from SLURM
slurmkit collections refresh my_exp
# Preview which active jobs would be cancelled
slurmkit collections cancel my_exp --dry-run
Notifications
Send job lifecycle notifications to Slack, Discord, email, or generic webhooks:
# Validate route setup
slurmkit notify test
slurmkit notify test --route team_email --dry-run
# Typical end-of-job call from script (default: notify only on failure)
slurmkit notify job --job-id "$SLURM_JOB_ID" --exit-code "$rc"
# Collection-final summary notification (emits only when collection is terminal)
slurmkit notify collection-final --job-id "$SLURM_JOB_ID" --trigger-exit-code "$rc"
Collection-specific overrides are supported via a top-level notifications block in job_spec.yaml:
- If a collection is linked to a spec with notifications, those values override the global .slurmkit/config.yaml notifications.
- If no spec-level block exists (or spec loading fails), slurmkit falls back to the global config.
- Dicts deep-merge; lists replace (including notifications.routes).
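The merge rule above (dicts deep-merge, lists replace) can be sketched as a small recursive function. This is a minimal illustration of the stated rule, not slurmkit's actual implementation; merge_notifications and the exp1_slack route name are hypothetical.

```python
def merge_notifications(global_cfg, spec_cfg):
    """Merge spec_cfg over global_cfg: dicts merge key by key, other values replace."""
    merged = dict(global_cfg)
    for key, value in spec_cfg.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_notifications(merged[key], value)
        else:
            # Scalars and lists (e.g. routes) replace the global value wholesale.
            merged[key] = value
    return merged


global_cfg = {
    "defaults": {"events": ["job_failed"], "timeout_seconds": 5},
    "routes": [{"name": "team_slack"}, {"name": "team_email"}],
}
spec_cfg = {
    "defaults": {"timeout_seconds": 10},
    "routes": [{"name": "exp1_slack"}],
}
merged = merge_notifications(global_cfg, spec_cfg)
print(merged)
```

In this example the spec keeps the global events list, overrides timeout_seconds, and replaces both global routes with its single route.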
See docs/notifications.md and examples/demo_project/README.md for full examples.
Recommended trap snippet inside a job script:
rc=$?
slurmkit notify job --job-id "${SLURM_JOB_ID}" --exit-code "${rc}"
slurmkit notify collection-final --job-id "${SLURM_JOB_ID}" --trigger-exit-code "${rc}"
exit "${rc}"
Parameter Sweeps
Generate jobs from parameter grids:
parameters:
  mode: grid
  values:
    learning_rate: [0.001, 0.01, 0.1]
    batch_size: [32, 64, 128]
    model: [resnet18, resnet50]
Or explicit lists:
parameters:
  mode: list
  values:
    - {lr: 0.001, bs: 32}
    - {lr: 0.01, bs: 64}
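To see how many jobs a grid produces, it helps to expand one by hand. The sketch below assumes grid mode takes the Cartesian product of the listed values, which is the usual meaning of a parameter grid; expand_grid is a hypothetical helper and not slurmkit's own code.

```python
from itertools import product


def expand_grid(values):
    """Yield one dict per combination of the listed parameter values."""
    names = list(values)
    for combo in product(*(values[name] for name in names)):
        yield dict(zip(names, combo))


grid = {
    "learning_rate": [0.001, 0.01, 0.1],
    "batch_size": [32, 64, 128],
    "model": ["resnet18", "resnet50"],
}
jobs = list(expand_grid(grid))
print(len(jobs))  # 3 * 3 * 2 = 18 combinations
print(jobs[0])
```

In list mode no expansion is needed: each entry under values is already one complete parameter set, so the example above with two entries yields two jobs.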
Dynamic SLURM Arguments
Use Python functions for complex resource logic:
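# slurm_logic.py
def get_slurm_args(params, defaults):
    # Start from a copy so the shared defaults are never mutated.
    args = defaults.copy()
    # Larger models get more memory and a second GPU.
    if params.get('model') == 'resnet50':
        args['mem'] = '64G'
        args['gpus'] = 2
    return args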
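A quick way to check such a function is to call it directly with sample parameter sets. How slurmkit invokes it during generation is not shown in this README, so the calls below are only a standalone sketch; the defaults dict mirrors the slurm_defaults values from the Configuration section.

```python
def get_slurm_args(params, defaults):
    # Start from a copy so the shared defaults are never mutated.
    args = defaults.copy()
    if params.get("model") == "resnet50":
        args["mem"] = "64G"
        args["gpus"] = 2
    return args


defaults = {"partition": "gpu", "time": "24:00:00", "mem": "32G"}
small = get_slurm_args({"model": "resnet18"}, defaults)
large = get_slurm_args({"model": "resnet50"}, defaults)

print(small)  # same values as defaults
print(large)  # mem raised to 64G, gpus set to 2
```

Because the function copies defaults before modifying it, the resnet50 overrides do not leak into later jobs that share the same defaults dict.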
Cross-Cluster Sync
Share job status across clusters via git:
# On cluster A
slurmkit sync --push
# On cluster B
git pull
slurmkit collections show my_exp
Development
Setup
We recommend using uv to manage the development environment.
# Clone the repository
git clone https://github.com/Awni00/slurmkit.git
cd slurmkit
# Create a virtual environment and install dependencies in editable mode
uv venv
source .venv/bin/activate
uv pip install -e ".[dev]"
Running Tests
pytest
License
MIT License - see LICENSE for details.