slurmkit

CLI tools for managing and generating SLURM jobs

Maintainers

Awni00

Project description

A CLI toolkit for managing and generating SLURM jobs.

slurmkit provides tools for:

Auto-discovering and tracking SLURM job status
Generating job scripts from templates with parameter sweeps
Organizing jobs into trackable collections
Synchronizing job state across clusters
Cleaning up failed jobs and W&B runs

Installation

Install via pip

pip install slurmkit

Install Latest from GitHub

pip install git+https://github.com/Awni00/slurmkit.git
# include all optional extras (ui + dev + docs)
pip install "slurmkit[all] @ git+https://github.com/Awni00/slurmkit.git"

Clone and Install (Recommended for Development)

git clone https://github.com/Awni00/slurmkit.git
cd slurmkit
pip install -e ".[all]"

Dependencies

Required:

Python 3.8+
PyYAML
Jinja2
pandas
tabulate
requests

Optional:

wandb (for W&B cleanup features)
rich (enhanced CLI UI; install with pip install "slurmkit[ui]")
all (extra that installs every optional group: ui, dev, docs; install with pip install "slurmkit[all]")

Quick Start

1. Initialize Project

cd your-project
slurmkit init

This creates .slurm-kit/config.yaml with your settings.

2. Check Job Status

slurmkit status my_experiment

3. Generate Jobs from Template

Create a template templates/train.job.j2:

#!/bin/bash
#SBATCH --job-name={{ job_name }}
#SBATCH --partition={{ slurm.partition }}
#SBATCH --time={{ slurm.time }}
#SBATCH --output={{ logs_dir }}/{{ job_name }}.%j.out

python train.py --lr {{ learning_rate }} --bs {{ batch_size }}

Create a job spec experiments/exp1/job_spec.yaml:

name: exp1
template: ../../templates/train.job.j2
output_dir: job_scripts
logs_dir: logs

parameters:
  mode: grid
  values:
    learning_rate: [0.001, 0.01, 0.1]
    batch_size: [32, 64]
  # Optional: exclude incompatible combinations
  filter:
    file: params_filter.py
    function: include_params

slurm_args:
  defaults:
    partition: gpu
    time: "24:00:00"

job_name_pattern: "lr{{ learning_rate }}_bs{{ batch_size }}"
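
To make the relationship between the spec and the template concrete, the sketch below substitutes one parameter combination into the job name pattern and into two template lines. It uses a small regex-based stand-in rather than Jinja2 itself so that it runs with the standard library only. The `render` helper and the shape of the `context` dictionary (for instance, `slurm_args.defaults` being exposed as `slurm`) are assumptions for illustration, not slurmkit internals.

```python
import re

def render(template: str, context: dict) -> str:
    """Replace {{ name }} and {{ a.b }} placeholders (a tiny stand-in for Jinja2)."""
    def lookup(match):
        value = context
        for key in match.group(1).split("."):
            value = value[key]  # walk nested dictionaries for dotted names
        return str(value)
    return re.sub(r"\{\{\s*([\w.]+)\s*\}\}", lookup, template)

# One point of the grid from the job spec, plus the SLURM defaults (assumed layout).
context = {
    "learning_rate": 0.01,
    "batch_size": 64,
    "slurm": {"partition": "gpu", "time": "24:00:00"},
    "logs_dir": "logs",
}
context["job_name"] = render("lr{{ learning_rate }}_bs{{ batch_size }}", context)

print(context["job_name"])  # lr0.01_bs64
print(render("#SBATCH --partition={{ slurm.partition }}", context))
print(render("python train.py --lr {{ learning_rate }} --bs {{ batch_size }}", context))
```

Real Jinja2 templates support far more (filters, conditionals, loops); the point here is only which values end up where.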

Generate jobs:

slurmkit generate experiments/exp1/job_spec.yaml --collection exp1
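
The job spec above points at `params_filter.py` and a function named `include_params`, but the file itself is not shown. A minimal sketch follows, assuming the function receives one parameter combination as a dictionary and returns `True` to keep it. That signature and the rule used here are assumptions, not documented behavior.

```python
# params_filter.py (hypothetical contents)
from itertools import product

def include_params(params: dict) -> bool:
    """Keep a combination unless it pairs the largest learning rate with the smallest batch."""
    return not (params["learning_rate"] == 0.1 and params["batch_size"] == 32)

# Expand the grid from the spec to see which combinations survive the filter.
grid = {"learning_rate": [0.001, 0.01, 0.1], "batch_size": [32, 64]}
combos = [dict(zip(grid, values)) for values in product(*grid.values())]
kept = [c for c in combos if include_params(c)]
print(len(combos), len(kept))  # 6 5
```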

4. Submit Jobs

# Preview before actual submission
slurmkit submit --collection exp1 --dry-run

# Submit to SLURM
slurmkit submit --collection exp1

5. Monitor and Resubmit

# Update job states
slurmkit collection update exp1

# View collection status
slurmkit collection show exp1

# View latest effective attempts with primary/history context
slurmkit collection show exp1 --show-primary --show-history

# Rich UI (if installed)
slurmkit --ui rich collection analyze exp1

# Resubmit failed jobs
slurmkit resubmit --collection exp1 --filter failed

# Legacy behavior: reuse existing script file instead of regenerating
slurmkit resubmit --collection exp1 --filter failed --no-regenerate

# Group-aware retry
slurmkit resubmit --collection exp1 --filter failed --submission-group retry_after_fix

Testing and Showcase Workflows

A) Local Demo (No SLURM Required)

Use the bundled demo project for a deterministic feature showcase:

cd examples/demo_project
python -m venv .venv
source .venv/bin/activate
pip install -e ../..
./setup_dummy_jobs.py --include-non-terminal

Then run:

slurmkit collection list
slurmkit collection list --attempt-mode primary  # Optional override to primary submission states
slurmkit collection show demo_terminal_failed
slurmkit collection analyze demo_terminal_failed
# Optional richer formatting (requires rich extra):
slurmkit --ui rich collection analyze demo_terminal_failed
slurmkit notify test --dry-run
slurmkit notify collection-final --collection demo_terminal_failed --job-id 990002 --no-refresh --dry-run

B) Real Cluster Workflow

slurmkit generate experiments/exp1/job_spec.yaml --collection exp1
slurmkit submit --collection exp1 --dry-run
slurmkit submit --collection exp1
slurmkit status exp1
slurmkit collection update exp1
slurmkit collection show exp1
slurmkit collection analyze exp1 --attempt-mode latest
slurmkit collection groups exp1
slurmkit resubmit --collection exp1 --filter failed --dry-run

C) Feature Checklist

Goal	Command	Success signal
Initialize config	`slurmkit init`	`.slurm-kit/config.yaml` created
Generate scripts	`slurmkit generate ... --collection exp1`	Job scripts written and collection updated
Preview submission	`slurmkit submit --collection exp1 --dry-run`	Candidate jobs listed with no submit
Inspect collection	`slurmkit collection show exp1`	Summary + jobs table rendered
Analyze outcomes	`slurmkit collection analyze exp1`	Parameter tables and risky/stable sections shown
Validate notifications	`slurmkit notify test --dry-run`	Route resolution and payload preview

Commands

Command	Description
`slurmkit init`	Initialize project configuration
`slurmkit status <exp>`	Show job status for an experiment
`slurmkit find <job_id>`	Find the output file for a job ID
`slurmkit generate <spec>`	Generate job scripts from template
`slurmkit submit`	Submit job scripts
`slurmkit resubmit`	Resubmit failed jobs
`slurmkit notify`	Send job lifecycle notifications
`slurmkit collection`	Manage job collections
`slurmkit clean outputs`	Clean failed job outputs
`slurmkit clean wandb`	Clean failed W&B runs
`slurmkit sync`	Sync job states across clusters

Run slurmkit <command> --help for detailed usage.

Configuration

Configuration is stored in .slurm-kit/config.yaml:

jobs_dir: jobs/
collections_dir: .job-collections/
sync_dir: .slurm-kit/sync/

output_patterns:
  - "{job_name}.{job_id}.out"
  - "{job_name}.{job_id}.*.out"
  - "slurm-{job_id}.out"

slurm_defaults:
  partition: gpu
  time: "24:00:00"
  mem: "32G"

job_structure:
  scripts_subdir: job_scripts/
  logs_subdir: logs/

ui:
  mode: plain  # plain | rich | auto

notifications:
  defaults:
    events: [job_failed]
    timeout_seconds: 5
    max_attempts: 3
    backoff_seconds: 0.5
    output_tail_lines: 40
  collection_final:
    attempt_mode: latest
    min_support: 3
    top_k: 10
    include_failed_output_tail_lines: 20
    ai:
      enabled: false
      callback: null
  routes:
    - name: team_slack
      type: slack
      url: "${SLACK_WEBHOOK_URL}"
      events: [job_failed, collection_failed]
    - name: team_email
      type: email
      to: ["ops@example.com", "ml@example.com"]
      from: "${SLURMKIT_EMAIL_FROM}"
      smtp_host: "${SMTP_HOST}"
      smtp_port: 587
      smtp_username: "${SMTP_USER}"
      smtp_password: "${SMTP_PASSWORD}"
      smtp_starttls: true
      smtp_ssl: false
      events: [job_failed, collection_failed]
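
The route entries reference secrets as `${NAME}` placeholders. A plausible way for such placeholders to be resolved is from the process environment. The sketch below does that with `string.Template`; the `expand` helper and the example webhook URL are illustrative assumptions, not slurmkit code.

```python
import os
from string import Template

def expand(value, env=None):
    """Recursively substitute ${NAME} placeholders in strings, lists, and dicts."""
    env = os.environ if env is None else env
    if isinstance(value, str):
        # safe_substitute leaves unknown placeholders untouched instead of raising
        return Template(value).safe_substitute(env)
    if isinstance(value, list):
        return [expand(item, env) for item in value]
    if isinstance(value, dict):
        return {key: expand(item, env) for key, item in value.items()}
    return value  # numbers, booleans, and None pass through unchanged

route = {"name": "team_slack", "type": "slack",
         "url": "${SLACK_WEBHOOK_URL}", "events": ["job_failed"]}
fake_env = {"SLACK_WEBHOOK_URL": "https://hooks.example.invalid/T000/B000"}
print(expand(route, fake_env)["url"])
```

Keeping secrets in the environment rather than in `config.yaml` means the file can be committed without exposing credentials.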

Environment Variables

Variable	Description
`SLURMKIT_CONFIG`	Path to config file
`SLURMKIT_JOBS_DIR`	Jobs directory
`SLURMKIT_COLLECTIONS_DIR`	Collections directory
`SLURMKIT_WANDB_ENTITY`	W&B entity
`SLURMKIT_DRY_RUN`	Enable dry-run mode
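
The table lists the variables but not how they interact with `config.yaml`. A common convention is for environment variables to take precedence over file values. The sketch below assumes that precedence and uses a hypothetical `resolve_settings` helper, so treat it as an illustration and check the full documentation for the actual rules.

```python
import os

# Mapping from environment variable to the config key it is assumed to override.
ENV_OVERRIDES = {
    "SLURMKIT_JOBS_DIR": "jobs_dir",
    "SLURMKIT_COLLECTIONS_DIR": "collections_dir",
}

def resolve_settings(file_config: dict, env=None) -> dict:
    """Return a copy of file_config with any set environment overrides applied."""
    env = os.environ if env is None else env
    settings = dict(file_config)
    for var, key in ENV_OVERRIDES.items():
        if env.get(var):  # ignore unset or empty variables
            settings[key] = env[var]
    return settings

file_config = {"jobs_dir": "jobs/", "collections_dir": ".job-collections/"}
print(resolve_settings(file_config, {"SLURMKIT_JOBS_DIR": "/scratch/me/jobs"}))
```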

Documentation

Full documentation is available at https://awni00.github.io/slurmkit/

Project Structure

your-project/
├── .slurm-kit/
│   ├── config.yaml          # Project configuration
│   └── sync/                 # Cross-cluster sync files
├── .job-collections/         # Collection YAML files
├── jobs/
│   └── experiment1/
│       ├── job_scripts/      # Generated job scripts
│       └── logs/             # Job output files
└── templates/                # Jinja2 job templates
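
Given this layout, the `output_patterns` setting from the configuration suggests how a log file can be located from a job ID, which is what `slurmkit find <job_id>` reports. A minimal sketch, assuming the patterns are filled in with `str.format` and matched as shell-style globs; `find_output` is a hypothetical helper, not the tool's implementation.

```python
import fnmatch

OUTPUT_PATTERNS = [
    "{job_name}.{job_id}.out",
    "{job_name}.{job_id}.*.out",
    "slurm-{job_id}.out",
]

def find_output(filenames, job_id, job_name="*"):
    """Return the file names that match any configured pattern for this job."""
    matches = []
    for pattern in OUTPUT_PATTERNS:
        glob = pattern.format(job_name=job_name, job_id=job_id)
        for name in filenames:
            if fnmatch.fnmatchcase(name, glob) and name not in matches:
                matches.append(name)
    return matches

logs = ["lr0.01_bs64.990002.out", "lr0.1_bs32.990003.out", "slurm-990002.out"]
print(find_output(logs, 990002))
# ['lr0.01_bs64.990002.out', 'slurm-990002.out']
```

When the job name is unknown it defaults to the wildcard `*`, so a lookup by ID alone still works.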

Features

Key features at a glance:

1) Job Creation

Generate parameterized job scripts and attach them to a collection: slurmkit generate job_spec.yaml --collection exp1
Preview generation and submission safely: slurmkit generate ... --dry-run, slurmkit submit ... --dry-run
Submit only unsubmitted collection jobs (default): slurmkit submit --collection exp1 --filter unsubmitted

2) Collection Tracking and Analysis

Create, inspect, and refresh collections: slurmkit collection create exp1, slurmkit collection show exp1, slurmkit collection update exp1
Analyze outcomes by parameter values and latest attempts: slurmkit collection analyze exp1 --attempt-mode latest --top-k 10
Inspect resubmission waves and attempt history: slurmkit collection groups exp1, slurmkit collection show exp1 --show-history
Resubmit failed jobs. In collection mode, scripts are regenerated deterministically by default, and optional callbacks can select which jobs to resubmit and supply extra parameters (e.g., a checkpoint directory): slurmkit resubmit --collection exp1 --filter failed --select-file callbacks.py --extra-params-file extra.py

Collections created before regeneration metadata was introduced may require --no-regenerate when resubmitting.
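
As a rough illustration of what analyzing outcomes by parameter value involves, the sketch below groups jobs by each parameter value, computes a failure rate, drops values seen fewer than `min_support` times, and keeps the `top_k` highest rates. The record layout and the meaning given to `min_support` and `top_k` are assumptions based on the option names, not a description of slurmkit's algorithm.

```python
from collections import defaultdict

def failure_rates(jobs, min_support=3, top_k=10):
    """Rank (parameter, value) pairs by failure rate, ignoring rarely seen values."""
    counts = defaultdict(lambda: [0, 0])  # (param, value) -> [failed, total]
    for job in jobs:
        for param, value in job["params"].items():
            tally = counts[(param, value)]
            tally[0] += job["state"] == "FAILED"
            tally[1] += 1
    rows = [
        (param, value, failed / total, total)
        for (param, value), (failed, total) in counts.items()
        if total >= min_support
    ]
    rows.sort(key=lambda row: row[2], reverse=True)  # highest failure rate first
    return rows[:top_k]

# Made-up job records for illustration only.
jobs = [
    {"params": {"learning_rate": lr, "batch_size": bs}, "state": state}
    for lr, bs, state in [
        (0.1, 32, "FAILED"), (0.1, 64, "FAILED"), (0.1, 128, "COMPLETED"),
        (0.01, 32, "COMPLETED"), (0.01, 64, "COMPLETED"), (0.01, 128, "COMPLETED"),
    ]
]
for row in failure_rates(jobs, min_support=3, top_k=2):
    print(row)
```

In this toy data each batch size appears only twice, so those values fall below `min_support` and only the learning-rate rows are reported.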

3) Notifications and Cross-Cluster Sync

Validate routes and send job notifications: slurmkit notify test, slurmkit notify job ...
Send one final collection-level summary when a collection reaches terminal state: slurmkit notify collection-final ...
Sync collection/job state across clusters via git-backed files: slurmkit sync --push

Job Collections

Track related jobs together:

# Create collection
slurmkit collection create my_exp --description "Training sweep"

# List collections
slurmkit collection list
slurmkit collection list --attempt-mode primary  # Optional override to primary submission states

# Show details
slurmkit collection show my_exp --state failed
slurmkit collection show my_exp --attempt-mode latest --show-primary

# Update states from SLURM
slurmkit collection update my_exp

# Submission-group summary
slurmkit collection groups my_exp
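
The `--attempt-mode` option distinguishes a job's primary submission from its latest resubmission. One way to picture this, assuming each job keeps an ordered list of attempts (the data layout and the `effective_state` helper are hypothetical):

```python
def effective_state(attempts, attempt_mode="latest"):
    """Pick the state to report for a job from its ordered submission attempts."""
    if not attempts:
        return "UNSUBMITTED"  # placeholder label for jobs with no attempts (assumption)
    if attempt_mode == "primary":
        return attempts[0]["state"]   # first submission
    if attempt_mode == "latest":
        return attempts[-1]["state"]  # most recent resubmission
    raise ValueError(f"unknown attempt mode: {attempt_mode!r}")

job_attempts = [
    {"job_id": 990001, "state": "FAILED"},     # original submission
    {"job_id": 990002, "state": "COMPLETED"},  # resubmitted after a fix
]
print(effective_state(job_attempts, "primary"))  # FAILED
print(effective_state(job_attempts, "latest"))   # COMPLETED
```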

Notifications

Send job lifecycle notifications to Slack, Discord, email, or generic webhooks:

# Validate route setup
slurmkit notify test
slurmkit notify test --route team_email --dry-run

# Typical end-of-job call from script (default: notify only on failure)
slurmkit notify job --job-id "$SLURM_JOB_ID" --exit-code "$rc"

# Collection-final summary notification (emits only when collection is terminal)
slurmkit notify collection-final --job-id "$SLURM_JOB_ID"

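Recommended snippet for the end of a job script. Run it immediately after the main command (or from an EXIT trap) so that $? still holds that command's exit code: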

rc=$?
slurmkit notify job --job-id "${SLURM_JOB_ID}" --exit-code "${rc}"
slurmkit notify collection-final --job-id "${SLURM_JOB_ID}"
exit "${rc}"
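
The "notify only on failure" default and the per-route `events` lists in the configuration imply a simple selection step. The sketch below maps an exit code to an event and picks the routes subscribed to it. The helper names and the `fallback` route are hypothetical; only the `job_failed` and `collection_failed` event names come from the configuration shown earlier.

```python
def event_for_exit_code(exit_code: int):
    """Return 'job_failed' for a non-zero exit code, else None (nothing to send)."""
    return "job_failed" if exit_code != 0 else None

def routes_for_event(routes, event, default_events=("job_failed",)):
    """Select route names whose events list (or the defaults) includes the event."""
    if event is None:
        return []
    return [r["name"] for r in routes if event in r.get("events", default_events)]

routes = [
    {"name": "team_slack", "type": "slack", "events": ["job_failed", "collection_failed"]},
    {"name": "team_email", "type": "email", "events": ["collection_failed"]},
    {"name": "fallback", "type": "webhook"},  # hypothetical route with no events key
]
print(routes_for_event(routes, event_for_exit_code(1)))  # ['team_slack', 'fallback']
print(routes_for_event(routes, event_for_exit_code(0)))  # []
```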

Parameter Sweeps

Generate jobs from parameter grids:

parameters:
  mode: grid
  values:
    learning_rate: [0.001, 0.01, 0.1]
    batch_size: [32, 64, 128]
    model: [resnet18, resnet50]

Or explicit lists:

parameters:
  mode: list
  values:
    - {lr: 0.001, bs: 32}
    - {lr: 0.01, bs: 64}
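
The two modes differ only in how combinations are produced: `grid` takes the Cartesian product of the value lists, while `list` uses the entries as given. A minimal sketch of that expansion, with `expand_parameters` as an illustrative helper name rather than slurmkit's own function:

```python
from itertools import product

def expand_parameters(parameters: dict) -> list:
    """Turn a 'parameters' block into a list of per-job parameter dictionaries."""
    mode, values = parameters["mode"], parameters["values"]
    if mode == "grid":
        names = list(values)
        return [dict(zip(names, combo)) for combo in product(*(values[n] for n in names))]
    if mode == "list":
        return [dict(entry) for entry in values]
    raise ValueError(f"unsupported mode: {mode!r}")

grid = {"mode": "grid", "values": {
    "learning_rate": [0.001, 0.01, 0.1],
    "batch_size": [32, 64, 128],
    "model": ["resnet18", "resnet50"],
}}
explicit = {"mode": "list", "values": [{"lr": 0.001, "bs": 32}, {"lr": 0.01, "bs": 64}]}

print(len(expand_parameters(grid)))    # 18 (3 * 3 * 2)
print(expand_parameters(explicit)[1])  # {'lr': 0.01, 'bs': 64}
```

Grid sizes multiply quickly, which is where a filter or an explicit list becomes useful.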

Dynamic SLURM Arguments

Use Python functions for complex resource logic:

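# slurm_logic.py
def get_slurm_args(params, defaults):
    """Return the SLURM arguments for one job, starting from the provided defaults."""
    args = defaults.copy()  # copy so the shared defaults are not modified
    if params.get('model') == 'resnet50':
        # The larger model gets more memory and a second GPU
        args['mem'] = '64G'
        args['gpus'] = 2
    return args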
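
To see the effect of such a function, the sketch below calls it for two parameter sets and formats the result as `#SBATCH` lines. The function body is repeated so the example runs on its own. The `to_sbatch_lines` helper and the way slurmkit invokes the function are assumptions.

```python
def get_slurm_args(params, defaults):
    args = defaults.copy()
    if params.get('model') == 'resnet50':
        args['mem'] = '64G'
        args['gpus'] = 2
    return args

def to_sbatch_lines(args: dict) -> list:
    """Format a dictionary of SLURM options as '#SBATCH --key=value' lines."""
    return [f"#SBATCH --{key}={value}" for key, value in args.items()]

defaults = {"partition": "gpu", "time": "24:00:00", "mem": "32G"}
small = get_slurm_args({"model": "resnet18"}, defaults)
large = get_slurm_args({"model": "resnet50"}, defaults)

print(small["mem"], large["mem"])  # 32G 64G
print("\n".join(to_sbatch_lines(large)))
```

Because the function copies `defaults` before modifying it, the override for one job does not leak into the next.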

Cross-Cluster Sync

Share job status across clusters via git:

# On cluster A
slurmkit sync --push

# On cluster B
git pull
slurmkit collection show my_exp

Development

Setup

We recommend using uv to manage the development environment.

# Clone the repository
git clone https://github.com/Awni00/slurmkit.git
cd slurmkit

# Create a virtual environment and install dependencies in editable mode
uv venv
source .venv/bin/activate
uv pip install -e ".[dev]"

Running Tests

pytest

License

MIT License - see LICENSE for details.

Release history

Version	Release date
0.1.7	Apr 25, 2026
0.1.6	Apr 25, 2026
0.1.5	Apr 17, 2026
0.1.4	Apr 6, 2026
0.1.3	Apr 5, 2026
0.1.2	Apr 1, 2026
0.1.1	Mar 25, 2026
0.1.0	Mar 18, 2026
0.0.5	Mar 14, 2026
0.0.4	Feb 16, 2026
0.0.3	Feb 10, 2026
0.0.2 (this version)	Feb 9, 2026
0.0.1	Feb 8, 2026

Download files

Source Distribution

slurmkit-0.0.2.tar.gz (100.1 kB)

Uploaded Feb 9, 2026 Source

Built Distribution

slurmkit-0.0.2-py3-none-any.whl (80.0 kB)

Uploaded Feb 9, 2026 Python 3

File details

Details for the file slurmkit-0.0.2.tar.gz.

File metadata

Download URL: slurmkit-0.0.2.tar.gz
Upload date: Feb 9, 2026
Size: 100.1 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for slurmkit-0.0.2.tar.gz
Algorithm	Hash digest
SHA256	`2bd6550030a2312791342774211a26618f32b924935d912ab367d999527063fb`
MD5	`da046911fd203e4f5cfdb80a6d4dab08`
BLAKE2b-256	`af79379e8be8d1753a0cc1cd0500eb47543986941547f992b6bb44becf6d295b`

Provenance

The following attestation bundles were made for slurmkit-0.0.2.tar.gz:

Publisher: publish.yml on Awni00/slurmkit

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: slurmkit-0.0.2.tar.gz
- Subject digest: 2bd6550030a2312791342774211a26618f32b924935d912ab367d999527063fb
- Sigstore transparency entry: 934235614
- Sigstore integration time: Feb 9, 2026
Source repository:
- Permalink: Awni00/slurmkit@2a0bcf4c3da064d9c7f1675ec8ea7de6b49732be
- Branch / Tag: refs/tags/v0.0.2
- Owner: https://github.com/Awni00
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@2a0bcf4c3da064d9c7f1675ec8ea7de6b49732be
- Trigger Event: push

File details

Details for the file slurmkit-0.0.2-py3-none-any.whl.

File metadata

Download URL: slurmkit-0.0.2-py3-none-any.whl
Upload date: Feb 9, 2026
Size: 80.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for slurmkit-0.0.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e79726cefe6507a2b3259f734a3084efd51038969c51b1bb94edbcd4ad286c73`
MD5	`313d7bdbe9ed5ccc58577462757e8314`
BLAKE2b-256	`fdf5d2f524d7bee99584d25626a1217ec0a18922110bb8e19aed6309bcb75e36`

Provenance

The following attestation bundles were made for slurmkit-0.0.2-py3-none-any.whl:

Publisher: publish.yml on Awni00/slurmkit

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: slurmkit-0.0.2-py3-none-any.whl
- Subject digest: e79726cefe6507a2b3259f734a3084efd51038969c51b1bb94edbcd4ad286c73
- Sigstore transparency entry: 934235675
- Sigstore integration time: Feb 9, 2026
Source repository:
- Permalink: Awni00/slurmkit@2a0bcf4c3da064d9c7f1675ec8ea7de6b49732be
- Branch / Tag: refs/tags/v0.0.2
- Owner: https://github.com/Awni00
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@2a0bcf4c3da064d9c7f1675ec8ea7de6b49732be
- Trigger Event: push
