Skip to main content

A tool for launching and tracking Slurm jobs across many clusters in Python.

Project description

Slurmpilot 🚀

Effortlessly launch experiments on Slurm clusters from your local machine.

License: MIT Python version


Slurmpilot is a Python library designed to simplify launching experiments on Slurm clusters directly from your local machine. It automates code synchronization, job submission, and status tracking, letting you focus on your research.

🤔 Why Slurmpilot?

While tools like SkyPilot and Submitit are excellent, Slurmpilot offers a more flexible, multi-cluster experience tailored for academic research environments where Docker might not be available. We focus on sending source files directly, avoiding serialization issues and providing a seamless CLI for managing your experiments.

✨ Core Features

  • 💻 Remote Job Submission: Launch Slurm jobs on any cluster with SSH access from your local machine.
  • 🔁 Simplified Workflow: Automatically handles code synchronization, log management, and job status tracking.
  • 🌐 Multi-Cluster Support: Easily switch between different Slurm clusters.
  • 📝 Reproducibility: Keep track of your experiments with automatically generated metadata.
  • ⌨️ Command-Line Interface (CLI): Manage jobs, view logs, and check status with simple commands.

🚀 Getting Started

1. Installation

Clone and install in editable mode:

git clone https://github.com/geoalgo/slurmpilot.git
cd slurmpilot
pip install -e .

2. Configure Your First Cluster

In case you want to schedule job from your machine, you need to first configure ssh by creating a cluster config file at ~/slurmpilot/config/clusters/YOUR_CLUSTER.yaml:

host: your-cluster-hostname
user: your-username          # optional, defaults to current user
remote_path: ~/slurmpilot    # optional, where files are stored on the cluster
default_partition: gpu       # optional, used when partition is not specified
account: your-account        # optional, Slurm account to charge

Optionally create ~/slurmpilot/config/general.yaml for global settings:

local_path: ~/slurmpilot        # where job files are stored locally
default_cluster: YOUR_CLUSTER   # used when cluster is not specified

Verify your SSH connection:

sp test-ssh YOUR_CLUSTER

💡 Usage Examples

Schedule a Shell Script

from slurmpilot import SlurmPilot, JobCreationInfo, unify

# Initialize SlurmPilot for your cluster
slurm = SlurmPilot(clusters=["YOURCLUSTER"])

# Define the job
job_info = JobCreationInfo(
    cluster="YOURCLUSTER",
    partition="YOURPARTITION",
    jobname=unify("hello-cluster", method="coolname"),
    entrypoint="hellocluster_script.sh",
    src_dir="./",
    n_cpus=1,
    max_runtime_minutes=60,
)

# Launch the job
job_id = slurm.schedule_job(job_info)
print(f"Job {job_id} scheduled on {job_info.cluster}")

where YOURCLUSTER should be reachable with sp test-ssh YOUR_CLUSTER.

Local mode

Alternatively, you can use cluster="local" if you are running Slurmpilot directly from a Slurm login node:

slurm = SlurmPilot(clusters=["local"])

job_info = JobCreationInfo(
    cluster="local",
    partition="gpu",
    jobname=unify("my-job", method="date"),
    entrypoint="train.py",
    python_binary="python",
    n_cpus=4,
    n_gpus=1,
)

job_id = slurm.schedule_job(job_info)

Job files are still written to ~/slurmpilot/jobs/ locally; no SSH connection is opened.

Schedule a Python Script

job_info = JobCreationInfo(
    cluster="YOURCLUSTER",
    partition="YOURPARTITION",
    jobname=unify("python-job", method="date"),
    entrypoint="main.py",
    python_binary="~/miniconda3/bin/python",
    python_args="--data /path/to/data --epochs 10",
    n_cpus=2,
    n_gpus=1,
    mem=16000,
    env={"API_TOKEN": "your-token"},
)

job_id = slurm.schedule_job(job_info)

bash_setup_command lets you run commands before the entrypoint (e.g. activate conda, start a server):

job_info = JobCreationInfo(
    ...
    bash_setup_command="source ~/miniconda3/etc/profile.d/conda.sh && conda activate myenv",
)

Job Arrays

Pass a list to python_args to submit a job array — one task per element:

job_info = JobCreationInfo(
    ...
    python_args=[
        {"lr": 0.001, "batch": 32},
        {"lr": 0.01,  "batch": 16},
    ],
    n_concurrent_jobs=4,   # optional: limit to 4 running at once
)

When calling the entrypoint, each dict is converted to CLI arguments, e.g. --lr 0.001, for the corresponding array task.

Local Python Libraries

Ship additional local packages alongside your code with python_libraries:

job_info = JobCreationInfo(
    ...
    python_libraries=["./mylib"],   # library is copied to the job path and added to PYTHONPATH
)

Mock mode

Use cluster="mock" to run jobs as plain local processes — no Slurm installation required. MockSlurm intercepts sbatch, sacct, and scancel calls, running the generated bash script as a subprocess and using its PID as the job ID.

This is the recommended mode for unit tests:

slurm = SlurmPilot(clusters=["mock"])

job_info = JobCreationInfo(
    cluster="mock",
    jobname="test/my-job",
    entrypoint="run.sh",
    src_dir="./tests/fixtures",
)

job_id = slurm.schedule_job(job_info)
slurm.wait_completion(job_info.jobname, max_seconds=30)
stdout, stderr = slurm.log(job_info.jobname)

sacct returns RUNNING while the process is alive, then COMPLETED / FAILED / CANCELLED based on its exit code. No cluster config file is needed.

⌨️ Command-Line Interface (CLI)

All job commands accept an optional job name (defaults to the most recently submitted job).

Job commands

Command Description
sp log [JOBNAME] Print stdout/stderr of a job
sp status [JOBNAME] Print current Slurm state of a job
sp metadata [JOBNAME] Print job metadata (cluster, date, …)
sp path [JOBNAME] Show local and remote paths for a job
sp slurm-script [JOBNAME] Print the generated Slurm script
sp download [JOBNAME] Download the job folder from the cluster
sp stop [JOBNAME] Cancel a running job
sp queue-status [JOBNAME] Show queue position and priority of a pending job

Cluster commands

Command Description
sp list-jobs [N] [--clusters C …] Print a table of the N most recent jobs (default 10)
sp test-ssh CLUSTER … Test SSH connection to one or more clusters
sp stop-all [--clusters C …] Cancel all tracked jobs on cluster(s)

--collapse-job-array on list-jobs shows one row per job array instead of one per task.

sp queue-status runs two squeue calls on the remote cluster and reports the job's priority score, its rank among all PENDING jobs in the same partition, and the top priority score in that partition:

job       : my-experiment (id: 17026264)
partition : small-g
priority  : 5000  (top is 9999)
position  : 3 / 42 pending jobs

Returns a "not pending" message when the job has already started running or completed.

Launch command

sp launch builds and submits a job from a YAML config file and/or inline CLI flags. CLI flags always override YAML values.

# Inline flags only
sp launch --entrypoint main.py --cluster mycluster --partition gpu --n-gpus 1

# From a YAML config (src_dir defaults to the YAML file's directory)
sp launch --config job.yaml

# YAML with a one-off override
sp launch --config job.yaml --cluster local --partition debug

# Preview the generated sbatch script without submitting
sp launch --config job.yaml --dry-run

# Submit and block until the job finishes, then print logs
sp launch --config job.yaml --wait
sp launch --config job.yaml --wait --max-wait-seconds 3600

A minimal job.yaml:

cluster: mycluster
partition: gpu
entrypoint: train.py          # relative to the YAML file's directory

python_binary: python
python_args: "--epochs 10"
n_cpus: 4
n_gpus: 1
max_runtime_minutes: 120

For a job array, set python_args to a list:

python_args:
  - lr: 0.001
    batch: 32
  - lr: 0.01
    batch: 16

jobname is auto-generated from the entrypoint stem via coolname if not provided (e.g. train-charming-swift-otter-of-justice).

Example output from sp list-jobs 5:

job                                          jobid   cluster  creation             min    status      nodelist
-------------------------------------------  ------  -------  -------------------  -----  ----------  --------
job-2026-01-01                               42      mock     2026-01-01 00:00:00  5.0    ✅ COMPLETED  node01
python-job-2026-01-01-12-00-00               43      gpu-c    2026-01-01 12:00:00  12.3   🏃 RUNNING    gpu01

⚙️ Configuration

Config directory layout

~/slurmpilot/config/
  general.yaml            # optional global settings
  clusters/
    YOUR_CLUSTER.yaml     # one file per cluster

general.yaml

local_path: ~/slurmpilot      # where job files are stored locally
default_cluster: YOUR_CLUSTER

clusters/YOUR_CLUSTER.yaml

host: hostname-or-ip
user: your-username           # optional
remote_path: ~/slurmpilot     # optional
default_partition: gpu        # optional
account: slurm-account        # optional

🙌 Contributing

Contributions are welcome! If you have ideas for improvements or find a bug, please open an issue or submit a pull request.

To set up a development environment:

git clone https://github.com/geoalgo/slurmpilot.git
cd slurmpilot
pip install -e ".[dev]"

Run tests:

pytest

Run linting:

ruff check slurmpilot tst

FAQ

How does it work?

When scheduling a job, the files required to run it are copied to ~/slurmpilot/jobs/YOUR_JOB_NAME locally, then synced to the remote cluster. The following files are generated:

  • slurm_script.sh — sbatch script generated from your JobCreationInfo
  • metadata.json — job metadata (cluster, date, config)
  • jobid.json — Slurm job ID after successful submission
  • src/ — copy of your source files
  • logs/stdout, logs/stderr — job output (populated after the job runs)

The working directory on the remote node is ~/slurmpilot/jobs/YOUR_JOB_NAME.

Why SSH and not a cluster login node?

A typical workflow involves SSHing to a login node and calling sbatch there. Slurmpilot automates this so you can manage multiple clusters without ever leaving your local machine.

Why not Docker?

Docker is great for cloud tools (SkyPilot, SageMaker…) but is often unavailable on Slurm clusters due to root-privilege requirements. Slurmpilot sends source files directly, which is simpler and more portable.

What are the dependencies?

Only pyyaml is required at runtime. No pandas, no numpy.

slurmpilot
└── pyyaml

What about other tools?

SkyPilot is excellent for cloud providers. Submitit is great for single-cluster Python-native workflows. Slurmpilot targets the multi-cluster academic use case: sending raw source files, no serialization, CLI-first management across clusters.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

slurmpilot-0.2.0.tar.gz (30.6 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

slurmpilot-0.2.0-py3-none-any.whl (30.8 kB view details)

Uploaded Python 3

File details

Details for the file slurmpilot-0.2.0.tar.gz.

File metadata

  • Download URL: slurmpilot-0.2.0.tar.gz
  • Upload date:
  • Size: 30.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.9.26 {"installer":{"name":"uv","version":"0.9.26","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for slurmpilot-0.2.0.tar.gz
Algorithm Hash digest
SHA256 ff221ef287e0b23f4da99ee74b2305459b7ca5e929c991dd7325d6442f9b8163
MD5 46a900ae3ea754c14aacac4e4fd5bf5b
BLAKE2b-256 04328f11672bbf36686fa6baefe3db7cbe4c63cac0f915899719b82ee39b661b

See more details on using hashes here.

File details

Details for the file slurmpilot-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: slurmpilot-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 30.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.9.26 {"installer":{"name":"uv","version":"0.9.26","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for slurmpilot-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 e5a13e437ba706eead319cc4e422e2fa979c9e5f3ce994919b431a74410434fd
MD5 4f0ae3fde420ca718a531c16c68f077f
BLAKE2b-256 917987396526466fd126e62965b4416e9a4f06cf9bf483a4f6c9a8cca5036d08

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page