Slurm job workflow management

srunx

A unified CLI, web dashboard, and Python API for SLURM job management.

Stop juggling sbatch scripts, squeue loops, and SSH sessions.

  • Submit & manage SLURM jobs from CLI, browser, or Python
  • Orchestrate multi-step workflows with YAML and dependency graphs
  • Monitor GPU availability and job states with Slack notifications
  • SSH remote — submit jobs, sync files, and browse remote clusters from your laptop
  • Container-native — Pyxis, Apptainer, and Singularity support built in

Installation

Requires Python 3.12+ and access to a SLURM cluster (local or via SSH).

uv add srunx

For the web dashboard:

uv add "srunx[web]"

Quick Start

# Submit a job
srunx submit python train.py --name training --gpus-per-node 2 --conda ml_env

# Check status and resources
srunx list --show-gpus
srunx resources

# Run a YAML workflow
srunx flow run workflow.yaml

CLI

Command            Description
srunx submit       Submit a SLURM job
srunx status       Check job status
srunx list         List jobs in queue
srunx cancel       Cancel a job
srunx logs         View / stream job logs
srunx resources    Display GPU availability
srunx monitor      Monitor jobs, resources, or cluster
srunx flow         Run / validate YAML workflows
srunx ssh          Remote SLURM operations over SSH
srunx history      Show job execution history
srunx report       Generate job execution report
srunx config       Manage configuration
srunx template     Manage job templates
srunx ui           Launch the web dashboard

Web Dashboard

A dashboard for visual cluster management. Connect to your SLURM cluster over SSH and manage jobs, workflows, and resources from a browser.

srunx ui                # -> http://127.0.0.1:8000
srunx ui --port 3000    # custom port

Jobs — Browse, search, filter, and cancel jobs.

Workflow DAG — Visualize job dependencies. Run workflows directly from the UI.

Resources — GPU and node availability per partition.

Explorer — Browse remote files via SSH mounts. Shell scripts can be submitted as sbatch jobs directly from the file tree.

Workflow Orchestration

Define pipelines in YAML with dependency graphs and Jinja2-parameterized variables:

name: experiment
args:
  model: "bert-base-uncased"
  output_dir: "/outputs/{{ model }}"

jobs:
  - name: preprocess
    command: ["python", "preprocess.py"]
    nodes: 1

  - name: train
    command: ["python", "train.py", "--model", "{{ model }}"]
    depends_on: [preprocess]
    gpus_per_node: 2
    conda: ml_env

  - name: evaluate
    command: ["python", "eval.py", "--output", "{{ output_dir }}"]
    depends_on: [train]

Jobs run as soon as their dependencies complete — independent branches execute in parallel automatically.
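The scheduling behaviour described above can be sketched with the standard library's graphlib module. This is only an illustration of dependency-ordered execution, not srunx's actual scheduler. The deps mapping mirrors the depends_on fields of the YAML above; the lint job is hypothetical and was added only to show an independent branch.

```python
from graphlib import TopologicalSorter

# Job name -> set of jobs it depends on (mirrors `depends_on` in the YAML above).
# `lint` is a hypothetical extra job with no dependencies, added to show parallelism.
deps = {
    "preprocess": set(),
    "train": {"preprocess"},
    "evaluate": {"train"},
    "lint": set(),
}


def execution_waves(graph):
    """Group jobs into waves; every job in a wave can run in parallel."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # jobs whose dependencies are all done
        waves.append(ready)
        sorter.done(*ready)  # mark the whole wave finished, unlocking dependents
    return waves


waves = execution_waves(deps)
print(waves)
```

Grouping into waves is a simplification: a scheduler that starts jobs "as soon as their dependencies complete" would call done() for each job individually as it finishes, rather than waiting for the whole wave.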

  • args with Jinja2 templates for reusable, parameterized pipelines
  • Retry support with configurable delay
  • Dry-run mode and partial execution (--from, --to, --job)
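To make the args mechanism concrete, the sketch below shows how the values in the workflow above could be resolved and substituted into a command. srunx uses Jinja2 for this; the code here is a small regular-expression stand-in so that it runs with the standard library alone, and it handles only plain {{ name }} placeholders. The helper names render and resolve_args are hypothetical, and resolving args in declaration order is an assumption.

```python
import re

# Matches a simple placeholder such as "{{ model }}" (no filters or expressions).
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(text, variables):
    """Replace each {{ name }} in `text` with variables[name]."""
    return PLACEHOLDER.sub(lambda m: str(variables[m.group(1)]), text)


def resolve_args(args):
    """Render args in order so later values may reference earlier ones."""
    resolved = {}
    for key, value in args.items():
        resolved[key] = render(value, resolved)
    return resolved


args = {"model": "bert-base-uncased", "output_dir": "/outputs/{{ model }}"}
variables = resolve_args(args)

# The `evaluate` job's command from the workflow above.
command = [render(part, variables) for part in ["python", "eval.py", "--output", "{{ output_dir }}"]]
print(command)
```

An unknown placeholder raises KeyError in this sketch, which surfaces typos in variable names early.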

Monitoring

# Monitor a job until completion
srunx monitor jobs 12345

# Wait for GPUs, then submit
srunx monitor resources --min-gpus 4
srunx submit python train.py --gpus-per-node 4

# Periodic cluster reports to Slack
srunx monitor cluster --schedule 1h --notify $SLACK_WEBHOOK
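The "wait for GPUs, then submit" pattern above amounts to a polling loop. The following is a minimal sketch of that idea, not the srunx implementation. count_free_gpus is a hypothetical callable that returns the number of idle GPUs; on a real cluster it might parse scheduler output.

```python
import time


def wait_for_gpus(count_free_gpus, min_gpus, interval=30.0, timeout=None, sleep=time.sleep):
    """Poll until at least `min_gpus` are free and return the observed count.

    `count_free_gpus` is a hypothetical probe supplied by the caller.
    """
    deadline = None if timeout is None else time.monotonic() + timeout
    while True:
        free = count_free_gpus()
        if free >= min_gpus:
            return free
        if deadline is not None and time.monotonic() >= deadline:
            raise TimeoutError(f"only {free} GPUs free, need {min_gpus}")
        sleep(interval)  # wait before probing again


# Usage with a fake probe that reports 0, 2, then 4 free GPUs.
readings = iter([0, 2, 4])
free = wait_for_gpus(lambda: next(readings), min_gpus=4, interval=0.0)
print(free)
```

Passing the probe and the sleep function as parameters keeps the loop testable without a cluster.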

Remote SSH

Keep your local editor workflow while running on the cluster:

# Submit to remote cluster
srunx ssh submit train.py --host dgx-server

# Manage connection profiles
srunx ssh profile add myserver --ssh-host dgx1

# Map local directories to remote and sync with rsync
srunx ssh profile mount add myserver workspace \
  --local ~/projects/ml-exp --remote /home/user/ml-exp
srunx ssh sync

  • SSH config hosts, saved profiles, and proxy jump support
  • Environment variable passthrough (--env KEY=VALUE, --env-local WANDB_API_KEY)
  • File sync via rsync — auto-detects profile from current directory
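The two environment flags listed above suggest two sources of variables: explicit KEY=VALUE pairs and names whose values are copied from the local shell. The sketch below illustrates that merge under those assumed semantics; build_remote_env is a hypothetical helper, and skipping names that are unset locally is a design choice of this example rather than documented srunx behaviour.

```python
import os


def build_remote_env(env_pairs, env_local_names, local_env=None):
    """Merge explicit KEY=VALUE pairs with variables copied from the local environment."""
    local_env = os.environ if local_env is None else local_env
    remote = {}
    for pair in env_pairs:
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected KEY=VALUE, got {pair!r}")
        remote[key] = value
    for name in env_local_names:
        if name in local_env:  # assumption: names unset locally are skipped
            remote[name] = local_env[name]
    return remote


remote_env = build_remote_env(
    ["EPOCHS=10"],
    ["WANDB_API_KEY", "MISSING_VAR"],
    local_env={"WANDB_API_KEY": "secret", "HOME": "/home/user"},
)
print(remote_env)
```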

Slack Notifications

srunx flow run workflow.yaml --slack
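For context, a Slack incoming webhook accepts a JSON body with a text field sent by HTTP POST. The sketch below builds such a request with the standard library; it shows the general mechanism, not how srunx formats its messages. The function name, the message text, and the webhook URL are placeholders.

```python
import json
import urllib.request


def build_slack_request(webhook_url, text):
    """Build (but do not send) a POST request for a Slack incoming webhook."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder URL; a real webhook URL comes from the Slack app configuration.
request = build_slack_request(
    "https://hooks.slack.com/services/XXX/YYY/ZZZ",
    "Workflow 'experiment' completed",
)
# urllib.request.urlopen(request) would send it; omitted here to avoid network access.
print(request.get_method(), json.loads(request.data))
```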

Python API

from srunx import Job, JobResource, JobEnvironment, Slurm

job = Job(
    name="training",
    command=["python", "train.py"],
    resources=JobResource(nodes=1, gpus_per_node=2, time_limit="4:00:00"),
    environment=JobEnvironment(conda="ml_env"),
)

client = Slurm()
completed = client.run(job)  # submit and wait for completion
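For orientation, a job like the one above corresponds roughly to a hand-written sbatch script. The sketch below renders such a script from the same fields using standard SLURM directives. It is an illustration only, not the template srunx generates: render_sbatch is a hypothetical helper, and conda activation is deliberately left out.

```python
import shlex


def render_sbatch(name, command, nodes=1, gpus_per_node=0, time_limit=None):
    """Render a minimal sbatch script using standard SLURM directives."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={name}",
        f"#SBATCH --nodes={nodes}",
    ]
    if gpus_per_node:
        lines.append(f"#SBATCH --gpus-per-node={gpus_per_node}")
    if time_limit:
        lines.append(f"#SBATCH --time={time_limit}")
    lines.append("")
    lines.append(shlex.join(command))  # quote arguments safely for the shell
    return "\n".join(lines) + "\n"


script = render_sbatch(
    "training",
    ["python", "train.py"],
    nodes=1,
    gpus_per_node=2,
    time_limit="4:00:00",
)
print(script)
```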

Why srunx?

Tools like submitit and simple-slurm handle job submission, and workflow engines like Snakemake or Nextflow handle pipelines. srunx covers both — plus monitoring, SSH remote access, a web dashboard, and container support — in a single, lightweight package. If you want one tool that covers the full SLURM workflow without heavyweight infrastructure, srunx is a good fit.

Documentation

Full documentation at ksterx.github.io/srunx.

Development

git clone https://github.com/ksterx/srunx.git
cd srunx
uv sync --dev
uv run pytest

License

Apache-2.0
