
Pythonic SDK for Slurm.

Python Slurm SDK

A developer-friendly SDK for defining, submitting, and managing Slurm jobs from Python, with native array-job support, a fluent dependency API, and container packaging.

Quick Start

Simple Task with Dependencies

from slurm import Cluster, task

@task(time="00:01:00", mem="1G")
def preprocess(dataset: str) -> str:
    return f"processed_{dataset}"

@task(time="00:05:00", mem="4G", cpus_per_task=4)
def train_model(data: str, lr: float) -> dict:
    return {"model": f"trained_on_{data}", "lr": lr, "accuracy": 0.95}

@task(time="00:01:00", mem="1G")
def evaluate(model: dict) -> str:
    return f"Model accuracy: {model['accuracy']}"

if __name__ == "__main__":
    # Local backend for offline development
    with Cluster(backend_type="local") as cluster:
        # Fluent dependency API with .after()
        prep_job = preprocess("dataset.csv")
        train_job = train_model.after(prep_job)(data=prep_job, lr=0.001)
        eval_job = evaluate.after(train_job)(model=train_job)

        eval_job.wait()
        print(eval_job.get_result())
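Stripped of the scheduler, the task bodies above compose like ordinary functions, and running them directly (decorators and cluster removed) shows the data each stage hands to the next. This is a purely illustrative sketch that does not touch the SDK; on the cluster, the `.after()` chain resolves the upstream job's result for you:

```python
# Plain-Python equivalents of the task bodies above (no @task decorator,
# no Cluster): useful for sanity-checking the data flow offline.

def preprocess(dataset: str) -> str:
    return f"processed_{dataset}"

def train_model(data: str, lr: float) -> dict:
    return {"model": f"trained_on_{data}", "lr": lr, "accuracy": 0.95}

def evaluate(model: dict) -> str:
    return f"Model accuracy: {model['accuracy']}"

# Chain the stages exactly as the .after() dependencies wire them up.
prep = preprocess("dataset.csv")
model = train_model(data=prep, lr=0.001)
report = evaluate(model=model)
print(report)  # Model accuracy: 0.95
```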

Workflow Orchestration

Workflows run on the cluster and orchestrate multiple tasks, enabling complex pipelines that survive preemption and scheduling delays:

from slurm import Cluster, task, workflow
from slurm.workflow import WorkflowContext

@task(time="00:05:00", mem="2G")
def train_epoch(epoch: int, data_path: str) -> dict:
    return {"epoch": epoch, "loss": 1.0 / (epoch + 1), "checkpoint": f"ckpt_{epoch}.pt"}

@task(time="00:02:00", mem="1G")
def evaluate(checkpoint: str) -> float:
    return 0.85 + (int(checkpoint.split("_")[1].split(".")[0]) * 0.02)

@workflow(time="01:00:00", mem="512M")
def training_pipeline(epochs: int, data_path: str, ctx: WorkflowContext) -> dict:
    """Multi-epoch training with evaluation after each epoch."""
    results = []

    for epoch in range(epochs):
        # Submit training job
        train_job = train_epoch(epoch=epoch, data_path=data_path)
        train_result = train_job.get_result()

        # Submit eval after training completes
        eval_job = evaluate.after(train_job)(checkpoint=train_result["checkpoint"])
        accuracy = eval_job.get_result()

        results.append({"epoch": epoch, "loss": train_result["loss"], "accuracy": accuracy})

    return {"results": results, "best_accuracy": max(r["accuracy"] for r in results)}

if __name__ == "__main__":
    cluster = Cluster(
        backend_type="ssh",
        hostname="login.hpc.example.com",
        username="myuser",
    )

    with cluster:
        # The workflow itself runs as a job on the cluster
        job = training_pipeline(epochs=3, data_path="/data/train")
        job.wait()
        print(job.get_result())
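For intuition, the workflow body above is an ordinary Python loop; tracing it offline (decorators and cluster removed) shows the values the pipeline aggregates. This sketch bypasses the SDK entirely, whereas on the cluster each iteration submits real jobs:

```python
# Offline trace of training_pipeline's loop, without @task/@workflow or a
# Cluster, to check the numbers the workflow would aggregate.

def train_epoch(epoch: int) -> dict:
    return {"epoch": epoch, "loss": 1.0 / (epoch + 1), "checkpoint": f"ckpt_{epoch}.pt"}

def evaluate(checkpoint: str) -> float:
    return 0.85 + (int(checkpoint.split("_")[1].split(".")[0]) * 0.02)

results = []
for epoch in range(3):
    r = train_epoch(epoch)
    acc = evaluate(r["checkpoint"])
    results.append({"epoch": epoch, "loss": r["loss"], "accuracy": acc})

best = max(r["accuracy"] for r in results)
print(best)  # epoch 2 scores highest, ~0.89
```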

Array Jobs for Parallel Processing

from slurm import Cluster, task

@task(time="00:02:00", mem="2G", cpus_per_task=2)
def process_chunk(chunk_id: int, start: int, end: int) -> dict:
    return {"chunk": chunk_id, "sum": sum(range(start, end))}

@task(time="00:01:00", mem="1G")
def aggregate(results: list) -> int:
    return sum(r["sum"] for r in results)

if __name__ == "__main__":
    with Cluster(backend_type="local") as cluster:
        # Map the task over items to create a native Slurm array job
        chunks = [
            {"chunk_id": i, "start": i * 1000, "end": (i + 1) * 1000}
            for i in range(10)
        ]
        array_job = process_chunk.map(chunks)

        # Aggregate results from array job
        final = aggregate.after(array_job)(results=array_job.get_results())
        final.wait()
        print(f"Total: {final.get_result()}")
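The ten chunks above partition `range(0, 10000)` without gaps or overlap, so the aggregated total must equal `sum(range(10000))`. A plain-Python check of that arithmetic, with no SDK involved:

```python
# Each chunk covers [i*1000, (i+1)*1000); summing the per-chunk sums
# must give the same result as summing the whole range at once.
chunks = [
    {"chunk_id": i, "start": i * 1000, "end": (i + 1) * 1000}
    for i in range(10)
]
chunk_sums = [sum(range(c["start"], c["end"])) for c in chunks]
total = sum(chunk_sums)
print(total)  # 49995000 == sum(range(10000))
```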

SSH Backend with Container Packaging

from slurm import Cluster, task

@task(
    time="00:10:00",
    mem="8G",
    packaging="container",
    packaging_platform="linux/amd64",
    packaging_push=True,
    packaging_registry="myregistry.io/myproject/",
)
def compute_intensive_task(n: int) -> float:
    import numpy as np
    return np.mean(np.random.random(n))

if __name__ == "__main__":
    cluster = Cluster(
        backend_type="ssh",
        hostname="login.hpc.example.com",
        username="myuser",
    )

    with cluster:
        job = compute_intensive_task(1_000_000)
        job.wait()
        print(job.get_result())
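For reference, the task body above just averages uniform random samples, so the result converges to 0.5 as n grows. A stdlib-only sketch of the same computation (using `random` instead of numpy, no container or cluster needed):

```python
import random

def mean_of_randoms(n: int) -> float:
    # Same computation as compute_intensive_task, but with the stdlib:
    # the mean of n uniform samples from [0, 1).
    samples = [random.random() for _ in range(n)]
    return sum(samples) / n

result = mean_of_randoms(100_000)
print(result)  # close to 0.5 for large n
```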

Documentation

  • Contributing Guide - Development setup, testing, and contribution guidelines
  • AGENTS.md - Guide for AI agents working with this codebase
  • Changelog - Release history and changes
  • examples/ - Complete working examples including parallelization patterns

Running on an ARM Mac (Apple Silicon)

You can build and publish x86_64 (`linux/amd64`) images from an ARM Mac by enabling QEMU emulation inside your container runtime. The steps below show the required one-time setup for Podman; Docker Desktop users can follow a similar flow with docker buildx (usually no additional configuration is needed).

  1. Install QEMU locally. This provides the binary emulators that Podman will load into its virtual machine:

    brew install qemu
    
  2. Make sure the Podman machine is created and running. If you have not initialised it yet:

    podman machine init --now
    
  3. Register the QEMU static binaries inside the Podman machine so it can run amd64 containers. This step reboots the VM and only needs to be performed once (repeat if you recreate the machine):

    podman machine ssh sudo rpm-ostree install qemu-user-static
    podman machine ssh sudo systemctl reboot
    
  4. After the machine restarts, confirm cross-building works:

    podman build --platform linux/amd64 -t test-amd64 .
    podman run --rm --platform linux/amd64 test-amd64 uname -m
    

    The final command should print x86_64 even though you are on an ARM host.

With QEMU registered, you can build the example container in this project using the provided Slurmfile (platform = "linux/amd64") and push it to your registry before submitting jobs to an x86 cluster.
