slurmkit


A CLI toolkit for managing and generating SLURM jobs.

slurmkit provides tools for:

  • Auto-discovering and tracking SLURM job status
  • Generating job scripts from templates with parameter sweeps
  • Organizing jobs into trackable collections
  • Cross-cluster job synchronization
  • Cleaning up failed jobs and W&B runs

Installation

Install via pip

pip install slurmkit

Install Latest from GitHub

pip install git+https://github.com/Awni00/slurmkit.git
# include all optional extras (ui + dev + docs)
pip install "slurmkit[all] @ git+https://github.com/Awni00/slurmkit.git"

Clone and Install (Recommended for Development)

git clone https://github.com/Awni00/slurmkit.git
cd slurmkit
pip install -e ".[all]"

Dependencies

Required:

  • Python 3.8+
  • PyYAML
  • Jinja2
  • pandas
  • tabulate
  • requests

Optional:

  • wandb (for W&B cleanup features)
  • rich (enhanced CLI UI; install with pip install "slurmkit[ui]")
  • all: meta-extra that installs every optional group (ui, dev, docs)

Quick Start

1. Initialize Project

cd your-project
slurmkit init

This creates .slurmkit/config.yaml with your settings.

2. Check Job Status

slurmkit status my_experiment

3. Generate Jobs from Template

Create a template templates/train.job.j2:

#!/bin/bash
#SBATCH --job-name={{ job_name }}
#SBATCH --partition={{ slurm.partition }}
#SBATCH --time={{ slurm.time }}
#SBATCH --output={{ logs_dir }}/{{ job_name }}.%j.out

python train.py --lr {{ learning_rate }} --bs {{ batch_size }}
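The template is plain Jinja2, so you can sanity-check a render outside slurmkit. A minimal sketch (the parameter values here are illustrative, and this uses Jinja2 directly rather than slurmkit's own rendering code):

```python
from jinja2 import Template

template_src = """#!/bin/bash
#SBATCH --job-name={{ job_name }}
#SBATCH --partition={{ slurm.partition }}
#SBATCH --time={{ slurm.time }}
#SBATCH --output={{ logs_dir }}/{{ job_name }}.%j.out

python train.py --lr {{ learning_rate }} --bs {{ batch_size }}
"""

# Render one concrete job script from the template above
script = Template(template_src).render(
    job_name="lr0.01_bs32",
    slurm={"partition": "gpu", "time": "24:00:00"},
    logs_dir=".jobs/exp1/logs",
    learning_rate=0.01,
    batch_size=32,
)
print(script)
```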

Create a job spec experiments/exp1/job_spec.yaml:

name: exp1
template: ../../templates/train.job.j2
job_subdir: exp1

parameters:
  mode: grid
  values:
    learning_rate: [0.001, 0.01, 0.1]
    batch_size: [32, 64]
    n_trials: [3]
  # Optional: derive effective params before filtering/rendering
  parse: params_logic.py:parse_params
  # Optional: exclude incompatible combinations
  filter: params_logic.py:include_params

slurm_args:
  defaults:
    partition: gpu
    time: "24:00:00"

job_name_pattern: "lr{{ learning_rate }}_bs{{ batch_size }}"

With this layout, slurmkit writes scripts to .jobs/exp1/job_scripts/ and expects logs in .jobs/exp1/logs/.
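The parse and filter hooks point at module:function callables. A hypothetical params_logic.py sketch (the file and function names come from the spec above, but the exact signatures are assumptions: a parameter dict in, a derived dict or a boolean out):

```python
# params_logic.py (illustrative; callback signatures are assumed)

def parse_params(params):
    """Derive effective parameters before filtering/rendering."""
    params = dict(params)
    # e.g. scale the learning rate linearly with batch size
    params["scaled_lr"] = params["learning_rate"] * params["batch_size"] / 32
    return params

def include_params(params):
    """Return True for parameter combinations that should be kept."""
    # e.g. skip the largest learning rate at the smallest batch size
    return not (params["learning_rate"] >= 0.1 and params["batch_size"] <= 32)
```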

Generate jobs:

slurmkit generate experiments/exp1/job_spec.yaml --into exp1

4. Submit Jobs

# Preview before actual submission
slurmkit submit exp1 --dry-run

# Submit to SLURM
slurmkit submit exp1

5. Monitor and Resubmit

# Preview active jobs that would be cancelled
slurmkit collections cancel exp1 --dry-run

# View collection status
slurmkit status exp1
slurmkit collections show exp1

# Rich UI (if installed)
slurmkit --ui rich collections analyze exp1

# Resubmit jobs by state filter
slurmkit resubmit exp1 --filter failed

# Resubmit only preempted jobs
slurmkit resubmit exp1 --filter preempted

# Resubmit a single tracked job by SLURM job ID (collection inferred)
slurmkit resubmit --job-id 123456 -y

# Override the autogenerated submission group label
slurmkit resubmit exp1 --filter failed --submission-group retry_after_fix

Testing and Showcase Workflows

A) Local Demo (No SLURM Required)

Use the bundled demo project for a deterministic feature showcase:

cd examples/demo_project
python -m venv .venv
source .venv/bin/activate
pip install -e ../..
./setup_dummy_jobs.py --include-non-terminal

Then run:

slurmkit collections list
slurmkit status fixtures/mixed_30
slurmkit collections show fixtures/mixed_30
slurmkit collections analyze fixtures/mixed_30
# Optional richer formatting (requires rich extra):
slurmkit --ui rich collections analyze fixtures/mixed_30
slurmkit notify test --dry-run
slurmkit notify collection-final --collection notifications/terminal_failed --job-id 991002 --no-refresh --dry-run

B) Real Cluster Workflow

slurmkit generate experiments/exp1/job_spec.yaml --into exp1
slurmkit submit exp1 --dry-run
slurmkit submit exp1
slurmkit status exp1
slurmkit collections refresh exp1
slurmkit collections cancel exp1 --dry-run
slurmkit collections show exp1
slurmkit collections analyze exp1
slurmkit resubmit exp1 --filter failed --dry-run

C) Feature Checklist

| Goal | Command | Success signal |
| --- | --- | --- |
| Initialize config | slurmkit init | .slurmkit/config.yaml created |
| Generate scripts | slurmkit generate ... --into exp1 | Job scripts written and collection updated |
| Preview submission | slurmkit submit exp1 --dry-run | Candidate jobs listed, nothing submitted |
| Inspect collection | slurmkit collections show exp1 | Summary and jobs table rendered |
| Analyze outcomes | slurmkit collections analyze exp1 | Parameter tables and risky/stable sections shown |
| Validate notifications | slurmkit notify test --dry-run | Route resolution and payload preview |

Commands

| Command | Description |
| --- | --- |
| slurmkit init | Initialize project configuration |
| slurmkit install-skill | Install the slurmkit Codex skill via npx skills |
| slurmkit migrate | Upgrade local config and collections to the current schema |
| slurmkit status <collection> | Show live status for a collection |
| slurmkit generate <spec> | Generate job scripts from a spec into a collection |
| slurmkit submit <collection> | Submit a collection |
| slurmkit resubmit [collection] [--job-id <id>] | Resubmit jobs in a collection by state filter, or a single tracked job |
| slurmkit notify | Send job lifecycle notifications |
| slurmkit collections | List, inspect, analyze, refresh, cancel, and delete collections |
| slurmkit clean outputs | Clean up outputs of failed jobs |
| slurmkit clean wandb | Clean up failed W&B runs |
| slurmkit sync | Sync job states across clusters |

Run slurmkit <command> --help for detailed usage.

Install the skill quickly:

slurmkit install-skill --yes

Configuration

Configuration is stored in .slurmkit/config.yaml:

jobs_dir: .jobs/

output_patterns:
  - "{job_name}.{job_id}.out"
  - "{job_name}.{job_id}.*.out"
  - "slurm-{job_id}.out"

slurm_defaults:
  partition: gpu
  time: "24:00:00"
  mem: "32G"

ui:
  mode: plain  # plain | rich | auto
  columns:
    collections_show:
      - job_name
      - job_id
      - state
      - runtime
      - attempt
      - submission_group
      - resubmissions
      - output_path
  collections_show:
    pager: less  # less | none

notifications:
  defaults:
    events: [job_failed]
    timeout_seconds: 5
    max_attempts: 3
    backoff_seconds: 0.5
    output_tail_lines: 40
  job:
    ai:
      enabled: false
      callback: null
  collection_final:
    attempt_mode: latest
    min_support: 3
    top_k: 10
    include_failed_output_tail_lines: 20
    ai:
      enabled: false
      callback: null
  routes:
    - name: team_slack
      type: slack
      url: "${SLACK_WEBHOOK_URL}"
      events: [job_failed, collection_failed]
    - name: team_email
      type: email
      to: ["ops@example.com", "ml@example.com"]
      from: "${SLURMKIT_EMAIL_FROM}"
      smtp_host: "${SMTP_HOST}"
      smtp_port: 587
      smtp_username: "${SMTP_USER}"
      smtp_password: "${SMTP_PASSWORD}"
      smtp_starttls: true
      smtp_ssl: false
      events: [job_failed, collection_failed]
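The output_patterns entries combine {job_name}/{job_id} placeholders with shell-style wildcards. A rough sketch of how such patterns can be matched against log filenames (this mirrors the configured semantics but is not slurmkit's internal matcher):

```python
import fnmatch

def find_output(filenames, job_name, job_id, patterns):
    """Return the first filename matching any configured output pattern."""
    for pattern in patterns:
        # Fill in the placeholders, then glob-match the remainder
        concrete = pattern.format(job_name=job_name, job_id=job_id)
        for name in filenames:
            if fnmatch.fnmatch(name, concrete):
                return name
    return None

patterns = [
    "{job_name}.{job_id}.out",
    "{job_name}.{job_id}.*.out",
    "slurm-{job_id}.out",
]
files = ["lr0.01_bs32.123456.node7.out", "slurm-999.out"]
print(find_output(files, "lr0.01_bs32", 123456, patterns))
```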

Environment Variables

| Variable | Description |
| --- | --- |
| SLURMKIT_CONFIG | Path to config file |
| SLURMKIT_JOBS_DIR | Jobs directory |
| SLURMKIT_WANDB_ENTITY | W&B entity |
| SLURMKIT_DRY_RUN | Enable dry-run mode |
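Environment variables typically take precedence over config-file values; a minimal sketch of that precedence (the helper is illustrative, and the exact precedence order in slurmkit is an assumption here):

```python
import os

def resolve_jobs_dir(config):
    """Prefer the environment variable, then the config value, then the default."""
    return os.environ.get("SLURMKIT_JOBS_DIR", config.get("jobs_dir", ".jobs/"))

os.environ["SLURMKIT_JOBS_DIR"] = "/scratch/jobs"
print(resolve_jobs_dir({"jobs_dir": ".jobs/"}))  # the env var wins
```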

Documentation

Full documentation is available at https://awni00.github.io/slurmkit/

Project Structure

your-project/
├── .slurmkit/
│   ├── config.yaml          # Project configuration
│   ├── collections/         # Collection YAML files
│   ├── sync/                # Cross-cluster sync files
│   └── backups/             # Migration backups (created on demand)
├── .jobs/
│   └── experiment1/
│       ├── job_scripts/      # Generated job scripts
│       └── logs/             # Job output files
└── templates/                # Jinja2 job templates

Features

Key features at a glance:

1) Job Creation

  • Generate parameterized job scripts and attach them to a collection: slurmkit generate job_spec.yaml --into exp1
  • Preview generation and submission safely: slurmkit generate ... --dry-run, slurmkit submit ... --dry-run
  • Submit only unsubmitted collection jobs (default): slurmkit submit exp1 --filter unsubmitted

2) Collection Tracking and Analysis

  • Inspect, analyze, and refresh collections: slurmkit status exp1, slurmkit collections show exp1, slurmkit collections refresh exp1
  • Cancel active jobs across tracked attempts: slurmkit collections cancel exp1 --dry-run
  • Analyze outcomes by parameter values: slurmkit collections analyze exp1 --top-k 10
  • Resubmit filtered jobs (deterministic regeneration by default), with optional selection and extra-parameter callbacks, e.g. to inject a checkpoint directory: slurmkit resubmit exp1 --filter failed --select-file callbacks.py --extra-params-file extra.py
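An extra-parameter callback file (extra.py above) might look like the following. This is a hypothetical sketch: the function name, its signature, and the job-record fields are assumptions for illustration, not slurmkit's documented API:

```python
# extra.py (hypothetical; the callback name and signature are assumed)

def extra_params(job):
    """Add a checkpoint directory so a resubmitted job can resume training."""
    return {
        "resume_from": f".jobs/exp1/checkpoints/{job['job_name']}",
    }

print(extra_params({"job_name": "lr0.01_bs32"}))
```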

3) Notifications and Cross-Cluster Sync

  • Validate routes and send job notifications: slurmkit notify test, slurmkit notify job ...
  • Send one final collection-level summary when a collection reaches terminal state: slurmkit notify collection-final ...
  • Sync collection/job state across clusters via git-backed files: slurmkit sync --push

Job Collections

Track related jobs together:

# List collections
slurmkit collections list

# Show details
slurmkit status my_exp
slurmkit collections show my_exp --state failed

# Update states from SLURM
slurmkit collections refresh my_exp

# Preview which active jobs would be cancelled
slurmkit collections cancel my_exp --dry-run

Notifications

Send job lifecycle notifications to Slack, Discord, email, or generic webhooks:

# Validate route setup
slurmkit notify test
slurmkit notify test --route team_email --dry-run

# Typical end-of-job call from script (default: notify only on failure)
slurmkit notify job --job-id "$SLURM_JOB_ID" --exit-code "$rc"

# Collection-final summary notification (emits only when collection is terminal)
slurmkit notify collection-final --job-id "$SLURM_JOB_ID" --trigger-exit-code "$rc"

Collection-specific overrides are supported via a top-level notifications block in job_spec.yaml:

  • If a collection is linked to a spec with notifications, those values override global .slurmkit/config.yaml notifications.
  • If no spec-level block exists (or spec loading fails), slurmkit falls back to global config.
  • Dicts deep-merge; lists replace (including notifications.routes).
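The stated merge rule can be sketched as follows (an illustration of the semantics above, not slurmkit's code):

```python
def merge_notifications(base, override):
    """Deep-merge nested dicts; lists (e.g. routes) and scalars are replaced."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_notifications(merged[key], value)
        else:
            merged[key] = value  # lists replace wholesale
    return merged

global_cfg = {
    "defaults": {"events": ["job_failed"], "timeout_seconds": 5},
    "routes": [{"name": "team_slack"}],
}
spec_cfg = {
    "defaults": {"timeout_seconds": 10},
    "routes": [{"name": "exp_slack"}],
}
merged = merge_notifications(global_cfg, spec_cfg)
print(merged)
```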

See docs/notifications.md and examples/demo_project/README.md for full examples.

Recommended trap snippet inside a job script:

# Run notifications when the script exits, capturing its exit code
_slurmkit_notify() {
  rc=$?
  slurmkit notify job --job-id "${SLURM_JOB_ID}" --exit-code "${rc}"
  slurmkit notify collection-final --job-id "${SLURM_JOB_ID}" --trigger-exit-code "${rc}"
  exit "${rc}"
}
trap _slurmkit_notify EXIT

Parameter Sweeps

Generate jobs from parameter grids:

parameters:
  mode: grid
  values:
    learning_rate: [0.001, 0.01, 0.1]
    batch_size: [32, 64, 128]
    model: [resnet18, resnet50]

Or explicit lists:

parameters:
  mode: list
  values:
    - {lr: 0.001, bs: 32}
    - {lr: 0.01, bs: 64}
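Grid mode takes the cross product of all value lists, while list mode uses the given rows verbatim. The grid expansion can be sketched with plain itertools (this mirrors the semantics, not slurmkit's internals):

```python
from itertools import product

values = {
    "learning_rate": [0.001, 0.01, 0.1],
    "batch_size": [32, 64, 128],
    "model": ["resnet18", "resnet50"],
}

# One dict per point in the cross product of all value lists
keys = list(values)
combos = [dict(zip(keys, combo)) for combo in product(*values.values())]
print(len(combos))  # 3 * 3 * 2 = 18 jobs
```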

Dynamic SLURM Arguments

Use Python functions for complex resource logic:

# slurm_logic.py
def get_slurm_args(params, defaults):
    args = defaults.copy()
    if params.get('model') == 'resnet50':
        args['mem'] = '64G'
        args['gpus'] = 2
    return args
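A quick standalone check of that hook, with defaults taken from the slurm_defaults example earlier (the function body is repeated here so the snippet runs on its own):

```python
# Repeating the hook from slurm_logic.py so this snippet is self-contained
def get_slurm_args(params, defaults):
    args = defaults.copy()
    if params.get('model') == 'resnet50':
        args['mem'] = '64G'
        args['gpus'] = 2
    return args

defaults = {"partition": "gpu", "time": "24:00:00", "mem": "32G"}
print(get_slurm_args({"model": "resnet50"}, defaults))
print(get_slurm_args({"model": "resnet18"}, defaults))
```

Note that the hook copies defaults rather than mutating it, so each parameter combination starts from the same baseline.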

Cross-Cluster Sync

Share job status across clusters via git:

# On cluster A
slurmkit sync --push

# On cluster B
git pull
slurmkit collections show my_exp

Development

Setup

We recommend using uv to manage the development environment.

# Clone the repository
git clone https://github.com/Awni00/slurmkit.git
cd slurmkit

# Create a virtual environment and install dependencies in editable mode
uv venv
source .venv/bin/activate
uv pip install -e ".[dev]"

Running Tests

pytest

License

MIT License - see LICENSE for details.
