Slurm job workflow management

srunx

A modern Python library for SLURM workload manager integration with workflow orchestration capabilities.

Features

🧩 Workflow Orchestration: YAML-based workflow definitions with Prefect integration
⚡ Fine-Grained Parallel Execution: Jobs execute immediately when their specific dependencies complete, not entire workflow phases
🔗 Branched Dependency Control: Independent branches in dependency graphs run simultaneously without false dependencies
📝 Template System: Customizable Jinja2 templates for SLURM scripts
🛡️ Type Safe: Full type hints and mypy compatibility
🖥️ CLI Tools: Command-line interfaces for both job management and workflows
🚀 Simple Job Submission: Easy-to-use API for submitting SLURM jobs
⚙️ Flexible Configuration: Support for various environments (conda, venv, sqsh)
📋 Job Management: Submit, monitor, cancel, and list jobs

Installation

Using uv (Recommended)

uv add srunx

Using pip

pip install srunx

Development Installation

git clone https://github.com/ksterx/srunx.git
cd srunx
uv sync --dev

Quick Start

You can try the workflow example:

cd examples
srunx flow run sample_workflow.yaml

graph TD
    A["Job A"]
    B1["Job B1"]
    B2["Job B2"]
    C["Job C"]
    D["Job D"]

    A --> B1
    A --> C
    B1 --> B2
    B2 --> D
    C --> D

Jobs run precisely when they're ready, minimizing wasted compute hours. The workflow engine provides fine-grained dependency control: when Job A completes, B1 and C start immediately in parallel. As soon as B1 finishes, B2 starts regardless of C's status. Job D waits only for both B2 and C to complete, enabling maximum parallelization.
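The scheduling behavior described above can be illustrated with a small, self-contained simulation. This is only a sketch of the idea, not srunx's implementation: the `DEPENDENCIES` table, `ready_jobs`, and `simulate` are hypothetical names introduced here, and the graph is the one from the diagram.

```python
# Hypothetical sketch of fine-grained dependency resolution; not srunx's code.
# Maps each job to the set of jobs it depends on (the Quick Start graph).
DEPENDENCIES = {
    "A": set(),
    "B1": {"A"},
    "C": {"A"},
    "B2": {"B1"},
    "D": {"B2", "C"},
}


def ready_jobs(dependencies, completed, started):
    """Return jobs whose dependencies are all complete and that have not started."""
    return sorted(
        job
        for job, deps in dependencies.items()
        if job not in started and deps <= completed
    )


def simulate(dependencies, finish_order):
    """Replay completions in `finish_order` and record what each one unlocks."""
    completed, started, log = set(), set(), []
    first = ready_jobs(dependencies, completed, started)
    started.update(first)
    log.append(("start", first))
    for job in finish_order:
        completed.add(job)
        unlocked = ready_jobs(dependencies, completed, started)
        started.update(unlocked)
        log.append((job, unlocked))
    return log


# B1 finishes before C here, so B2 starts while C is still running.
for event, unlocked in simulate(DEPENDENCIES, ["A", "B1", "B2", "C"]):
    print(f"{event}: starts {unlocked}")
```

Each job is checked against its own dependency set rather than against a whole "phase", which is why B2 is unlocked by B1 alone and D is unlocked only once both B2 and C are done.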

Workflow Orchestration

Create a workflow YAML file:

# workflow.yaml
name: ml_pipeline
jobs:
  - name: preprocess
    command: ["python", "preprocess.py"]
    nodes: 1
    memory_per_node: "16GB"

  - name: train
    command: ["python", "train.py"]
    depends_on: [preprocess]
    nodes: 1
    gpus_per_node: 2
    memory_per_node: "32GB"
    time_limit: "8:00:00"
    conda: ml_env

  - name: evaluate
    command: ["python", "evaluate.py"]
    depends_on: [train]
    nodes: 1

  - name: notify
    command: ["python", "notify.py"]
    depends_on: [train, evaluate]

Execute the workflow:

# Run workflow
srunx flow run workflow.yaml

# Validate workflow without execution
srunx flow validate workflow.yaml

# Show execution plan
srunx flow run workflow.yaml --dry-run
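As a rough idea of what validating a workflow involves, the sketch below checks that every `depends_on` entry names an existing job and that the graph has no cycles. It is an assumption about the kind of checks a validator might perform, not a description of what `srunx flow validate` actually does. The `JOBS` dict is a hand-written mirror of the YAML above, and `check_workflow` is a hypothetical helper.

```python
from graphlib import CycleError, TopologicalSorter

# Plain-dict mirror of workflow.yaml: job name -> names it depends on.
JOBS = {
    "preprocess": [],
    "train": ["preprocess"],
    "evaluate": ["train"],
    "notify": ["train", "evaluate"],
}


def check_workflow(jobs):
    """Return one valid execution order, or raise ValueError on a bad graph."""
    for name, deps in jobs.items():
        unknown = [dep for dep in deps if dep not in jobs]
        if unknown:
            raise ValueError(f"{name} depends on unknown job(s): {unknown}")
    try:
        # TopologicalSorter expects a mapping of node -> predecessors.
        return list(TopologicalSorter(jobs).static_order())
    except CycleError as exc:
        # The second element of args holds the detected cycle.
        raise ValueError(f"circular dependency: {exc.args[1]}") from exc


print(check_workflow(JOBS))
```

For this workflow the order is fully determined, because each job depends on the one before it.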

Template Variables with Args

You can define reusable variables in the args section and use them throughout your workflow with Jinja2 templates:

# workflow.yaml
name: ml_experiment
args:
  experiment_name: "bert-fine-tuning-v2"
  dataset_path: "/data/nlp/imdb"
  model_checkpoint: "bert-base-uncased"
  output_dir: "/outputs/{{ experiment_name }}"
  batch_size: 32

jobs:
  - name: preprocess
    command: 
      - "python"
      - "preprocess.py"
      - "--dataset"
      - "{{ dataset_path }}"
      - "--output"
      - "{{ output_dir }}/preprocessed"
    resources:
      nodes: 1
      memory_per_node: "16GB"
    work_dir: "{{ output_dir }}"

  - name: train
    command:
      - "python"
      - "train.py"
      - "--model"
      - "{{ model_checkpoint }}"
      - "--data"
      - "{{ output_dir }}/preprocessed"
      - "--batch-size"
      - "{{ batch_size }}"
      - "--output"
      - "{{ output_dir }}/model"
    depends_on: [preprocess]
    resources:
      nodes: 2
      gpus_per_node: 1
    work_dir: "{{ output_dir }}"
    environment:
      conda: ml_env

  - name: evaluate
    command:
      - "python"
      - "evaluate.py"
      - "--model"
      - "{{ output_dir }}/model"
      - "--dataset"
      - "{{ dataset_path }}"
      - "--output"
      - "{{ output_dir }}/results"
    depends_on: [train]
    work_dir: "{{ output_dir }}"

Template variables can be used in:

Command arguments
work_dir paths
Any other string field in the job configuration

This approach provides:

Reusability: Define once, use everywhere
Maintainability: Easy to update experiment parameters
Consistency: Avoid typos and ensure consistent naming
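To make the substitution concrete, here is a minimal sketch of how `{{ name }}` placeholders could be expanded. It uses a small regular expression as a stand-in for Jinja2, so it handles only plain variable references, and it assumes that an arg may refer only to args defined before it. `render` and `resolve_args` are hypothetical helpers, not part of srunx.

```python
import re

# Matches {{ name }} with optional whitespace; a stand-in for Jinja2 syntax.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(text, variables):
    """Replace each {{ name }} in `text` with str(variables[name])."""
    return PLACEHOLDER.sub(lambda m: str(variables[m.group(1)]), text)


def resolve_args(args):
    """Render string args that reference earlier args, in definition order."""
    resolved = {}
    for key, value in args.items():
        resolved[key] = render(value, resolved) if isinstance(value, str) else value
    return resolved


args = resolve_args({
    "experiment_name": "bert-fine-tuning-v2",
    "output_dir": "/outputs/{{ experiment_name }}",
    "batch_size": 32,
})
command = ["python", "train.py", "--batch-size", "{{ batch_size }}",
           "--output", "{{ output_dir }}/model"]

print([render(part, args) for part in command])
```

Note that non-string values such as `batch_size` become strings once they are substituted into a command argument.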

Advanced Usage

Custom Templates

Create a custom SLURM template:

#!/bin/bash
#SBATCH --job-name={{ job_name }}
#SBATCH --nodes={{ nodes }}
{% if gpus_per_node > 0 -%}
#SBATCH --gpus-per-node={{ gpus_per_node }}
{% endif -%}
#SBATCH --time={{ time_limit }}
#SBATCH --output={{ log_dir }}/%x_%j.out

{{ environment_setup }}

srun {{ command }}

Use it with your job:

# Assumes `client = Slurm()` and an existing `job` (see Job Submission)
job = client.run(job, template_path="custom_template.slurm.jinja")
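To see roughly what the header of the custom template expands to, the sketch below reproduces its logic in plain Python, including the conditional GPU line. It is a stand-in for the Jinja2 rendering step, and `render_sbatch_header` is a hypothetical function, not an srunx API.

```python
def render_sbatch_header(job_name, nodes, time_limit, log_dir, gpus_per_node=0):
    """Build the #SBATCH header that the custom template above describes."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
    ]
    # Mirrors `{% if gpus_per_node > 0 %}`: omit the line for CPU-only jobs.
    if gpus_per_node > 0:
        lines.append(f"#SBATCH --gpus-per-node={gpus_per_node}")
    lines.append(f"#SBATCH --time={time_limit}")
    # %x and %j are expanded by Slurm to the job name and job ID.
    lines.append(f"#SBATCH --output={log_dir}/%x_%j.out")
    return "\n".join(lines)


print(render_sbatch_header("train", nodes=1, time_limit="8:00:00",
                           log_dir="logs", gpus_per_node=2))
```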

Environment Configuration

Conda Environment

from srunx import JobEnvironment

environment = JobEnvironment(
    conda="my_env",
    env_vars={"CUDA_VISIBLE_DEVICES": "0,1"},
)

Programmatic Workflow Execution

from srunx.workflows import WorkflowRunner

runner = WorkflowRunner.from_yaml("workflow.yaml")
results = runner.run()

print("Job IDs:")
for task_name, job_id in results.items():
    print(f"  {task_name}: {job_id}")

Job Submission

# Submit job without waiting
job = client.submit(job)

# Later, wait for completion
completed_job = client.monitor(job, poll_interval=30)
print(f"Job completed with status: {completed_job.status}")

# Submit and wait for completion
completed_job = client.run(job)
print(f"Job completed with status: {completed_job.status}")
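The difference between submitting and monitoring comes down to a polling loop. The sketch below shows the general pattern under stated assumptions: `wait_until_done` is a hypothetical helper, the status source is faked, and the set of terminal states is a subset of standard Slurm job states. It does not describe how `client.monitor` is implemented.

```python
import time

# A subset of Slurm job states after which a job will not change again.
TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED", "TIMEOUT"}


def wait_until_done(get_status, poll_interval=30, sleep=time.sleep):
    """Call `get_status` every `poll_interval` seconds until a terminal state."""
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        sleep(poll_interval)


# Fake status source and fake sleep so the example finishes instantly.
statuses = iter(["PENDING", "RUNNING", "RUNNING", "COMPLETED"])
waits = []
final = wait_until_done(lambda: next(statuses), poll_interval=30, sleep=waits.append)

print(final, waits)
```

Injecting `sleep` keeps the loop testable; in real use the default `time.sleep` applies.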

Slack Integration

from srunx.callbacks import SlackCallback
from srunx.workflows import WorkflowRunner

slack_callback = SlackCallback(webhook_url="your_webhook_url")
runner = WorkflowRunner.from_yaml("workflow.yaml", callbacks=[slack_callback])

Alternatively, use the CLI:

srunx flow run workflow.yaml --slack
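For readers unfamiliar with Slack incoming webhooks, a notification is an HTTP POST with a JSON body containing a `text` field. The sketch below builds such a request with the standard library without sending it. `build_slack_request` is a hypothetical helper, the URL is a placeholder, and this is not a description of how `SlackCallback` works internally.

```python
import json
import urllib.request


def build_slack_request(webhook_url, text):
    """Build (but do not send) a POST request for a Slack incoming webhook."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=payload,  # supplying data makes this a POST request
        headers={"Content-Type": "application/json"},
    )


request = build_slack_request(
    "https://example.invalid/webhook",  # placeholder, not a real webhook URL
    "Workflow ml_pipeline finished",
)
print(request.get_method(), request.data.decode("utf-8"))

# To actually send it: urllib.request.urlopen(request)
```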

API Reference

Core Classes

`Job`

Main job configuration class with resources and environment settings.

`JobResource`

Resource allocation specification (nodes, GPUs, memory, time).

`JobEnvironment`

Environment setup (conda, venv, sqsh, environment variables).

`Slurm`

Main interface for SLURM operations (submit, status, cancel, list).

`WorkflowRunner`

Workflow execution engine with YAML support.

CLI Commands

Main CLI (`srunx`)

submit - Submit SLURM jobs
status - Check job status
queue - List jobs
cancel - Cancel jobs

Workflow CLI (`srunx flow`)

Execute YAML-defined workflows
Validate workflow files
Show execution plans

Configuration

Environment Variables

SLURM_LOG_DIR: Default directory for SLURM logs (default: logs)
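A variable with a default like this is typically resolved as in the sketch below. `resolve_log_dir` is a hypothetical helper shown for illustration; how srunx itself reads the variable is an assumption.

```python
import os
from pathlib import Path


def resolve_log_dir(environ=os.environ):
    """Return SLURM_LOG_DIR if set, otherwise the documented default `logs`."""
    return Path(environ.get("SLURM_LOG_DIR", "logs"))


print(resolve_log_dir({}))                                 # falls back to the default
print(resolve_log_dir({"SLURM_LOG_DIR": "/scratch/logs"}))  # explicit override
```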

Template Locations

srunx includes built-in templates:

base.slurm.jinja: Basic job template
advanced.slurm.jinja: Full-featured template with all options

Development

Setup Development Environment

git clone https://github.com/ksterx/srunx.git
cd srunx
uv sync --dev

Run Tests

uv run pytest

Type Checking

uv run mypy .

Code Formatting

uv run ruff check .
uv run ruff format .

Examples

Parameterized Workflow with Args

Here's a complete example showing how to use args for a parameterized machine learning workflow:

# ml_experiment.yaml
name: bert_fine_tuning
args:
  experiment_id: "exp_20240816_001"
  model_name: "bert-base-uncased"
  dataset_path: "/data/glue/cola"
  learning_rate: 2e-5
  num_epochs: 3
  batch_size: 16
  max_seq_length: 128
  output_base: "/outputs/{{ experiment_id }}"

jobs:
  - name: setup_experiment
    command:
      - "mkdir"
      - "-p"
      - "{{ output_base }}"
      - "{{ output_base }}/logs"
      - "{{ output_base }}/checkpoints"
    resources:
      nodes: 1

  - name: preprocess_data
    command:
      - "python"
      - "preprocess.py"
      - "--dataset_path"
      - "{{ dataset_path }}"
      - "--model_name"
      - "{{ model_name }}"
      - "--max_seq_length"
      - "{{ max_seq_length }}"
      - "--output_dir"
      - "{{ output_base }}/preprocessed"
    depends_on: [setup_experiment]
    resources:
      nodes: 1
      memory_per_node: "32GB"
    work_dir: "{{ output_base }}"
    environment:
      conda: nlp_env

  - name: train_model
    command:
      - "python"
      - "train.py"
      - "--model_name"
      - "{{ model_name }}"
      - "--train_data"
      - "{{ output_base }}/preprocessed/train.json"
      - "--eval_data"
      - "{{ output_base }}/preprocessed/eval.json"
      - "--learning_rate"
      - "{{ learning_rate }}"
      - "--num_epochs"
      - "{{ num_epochs }}"
      - "--batch_size"
      - "{{ batch_size }}"
      - "--output_dir"
      - "{{ output_base }}/checkpoints"
    depends_on: [preprocess_data]
    resources:
      nodes: 1
      gpus_per_node: 1
      memory_per_node: "64GB"
      time_limit: "4:00:00"
    work_dir: "{{ output_base }}"
    environment:
      conda: nlp_env

  - name: evaluate_model
    command:
      - "python"
      - "evaluate.py"
      - "--model_path"
      - "{{ output_base }}/checkpoints"
      - "--test_data"
      - "{{ dataset_path }}/test.json"
      - "--output_file"
      - "{{ output_base }}/evaluation_results.json"
    depends_on: [train_model]
    resources:
      nodes: 1
      gpus_per_node: 1
    work_dir: "{{ output_base }}"
    environment:
      conda: nlp_env

  - name: generate_report
    command:
      - "python"
      - "generate_report.py"
      - "--experiment_id"
      - "{{ experiment_id }}"
      - "--results_file"
      - "{{ output_base }}/evaluation_results.json"
      - "--output_dir"
      - "{{ output_base }}/reports"
    depends_on: [evaluate_model]
    work_dir: "{{ output_base }}"

Run the workflow:

srunx flow run ml_experiment.yaml

This approach provides several benefits:

Easy experimentation: Change parameters in one place
Reproducible results: All parameters are documented in the YAML
Consistent paths: Template variables ensure path consistency
Output isolation: Each experiment writes to its own directory

Machine Learning Pipeline

# Complete ML pipeline example
from srunx import Job, JobResource, JobEnvironment, Slurm

def create_ml_job(script: str, **kwargs) -> Job:
    return Job(
        name=f"ml_{script.replace('.py', '')}",
        command=["python", script] + [f"--{k}={v}" for k, v in kwargs.items()],
        resources=JobResource(
            nodes=1,
            gpus_per_node=1,
            memory_per_node="32GB",
            time_limit="4:00:00"
        ),
        environment=JobEnvironment(conda="pytorch")
    )

client = Slurm()

# Submit preprocessing job without blocking
prep_job = create_ml_job("preprocess.py", data_path="/data", output_path="/processed")
prep_job = client.submit(prep_job)

# Wait for preprocessing to complete
client.monitor(prep_job)

# Submit training job
train_job = create_ml_job("train.py", data_path="/processed", model_path="/models")
train_job = client.submit(train_job)

print(f"Training job {train_job.job_id} submitted")
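The name and command that `create_ml_job` derives from its arguments can be checked without a cluster. The sketch below isolates just that logic; `job_name` and `build_command` are hypothetical helpers that copy the two expressions from the example above.

```python
def job_name(script):
    """Derive the job name the same way create_ml_job does."""
    return f"ml_{script.replace('.py', '')}"


def build_command(script, **kwargs):
    """Turn keyword arguments into --key=value flags, in the order given."""
    return ["python", script] + [f"--{k}={v}" for k, v in kwargs.items()]


print(job_name("train.py"))
print(build_command("train.py", data_path="/processed", model_path="/models"))
```

Keyword arguments keep their call order, so the flags appear in the order they were passed. Flag names keep their underscores, so the target script must accept `--data_path` rather than `--data-path`.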

Distributed Computing

# Multi-node distributed job
from srunx import Job, JobEnvironment, JobResource, Slurm

distributed_job = Job(
    name="distributed_training",
    command=[
        # 16 ranks = 4 nodes x 4 tasks per node
        "mpirun", "-np", "16",
        "python", "distributed_train.py"
    ],
    resources=JobResource(
        nodes=4,
        ntasks_per_node=4,
        cpus_per_task=8,
        gpus_per_node=2,
        memory_per_node="128GB",
        time_limit="12:00:00"
    ),
    environment=JobEnvironment(
        conda="distributed_ml"
    )
)

client = Slurm()
job = client.run(distributed_job)
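The `-np` value in the example has to match the resource request: 4 nodes with 4 tasks each gives 16 MPI ranks. A small sketch that derives the number instead of hard-coding it is shown below. `total_ranks` and `mpirun_command` are hypothetical helpers, not srunx functions.

```python
def total_ranks(nodes, ntasks_per_node):
    """Number of MPI ranks implied by the resource request."""
    return nodes * ntasks_per_node


def mpirun_command(nodes, ntasks_per_node, script):
    """Build an mpirun command whose -np matches the requested tasks."""
    ranks = total_ranks(nodes, ntasks_per_node)
    return ["mpirun", "-np", str(ranks), "python", script]


print(mpirun_command(nodes=4, ntasks_per_node=4, script="distributed_train.py"))
# Total CPU cores requested: nodes * ntasks_per_node * cpus_per_task
print(total_ranks(4, 4) * 8)
```

Deriving the rank count this way keeps the command and the `JobResource` values from drifting apart when the node count changes.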

Development Workflow

Fork the repository
Create a feature branch
Make your changes
Add tests
Run type checking and tests
Submit a pull request

License

This project is licensed under the Apache-2.0 License.

Support

🐞 Issues: GitHub Issues
💬 Discussions: GitHub Discussions

Acknowledgments

Built with Pydantic for data validation
Template rendering with Jinja2
Package management with uv

Project details

This version: 0.2.8, released Sep 4, 2025

