
Yeet code at Slurm clusters. A Modal-like abstraction for multi-cluster Slurm job submission.

Project description

yeet — Yeet Code at Slurm Clusters

A Modal-like abstraction for submitting jobs to multiple Slurm clusters over SSH.

Problem

Managing three or more Slurm clusters, each with its own filesystems, GPUs, partitions, and configuration, is painful. You want to write code and yeet it at a cluster without caring about sbatch scripts, SSH sessions, or rsync incantations.

Design Principles

  • Submit and forget: fire off a function or script, get results back later.
  • Resource-aware routing: say what you need (GPU type, memory), yeet picks the right cluster.
  • Explicit override: always allow forcing a specific cluster.
  • No serialization magic: send source code, not pickled objects. Avoids import/path hell.
  • uv-native: auto-sync pyproject.toml + uv.lock, run uv sync on remote before execution.
  • Volume abstraction: name cluster-local paths (datasets, checkpoints), reference by logical name.
  • Direct cluster-to-cluster sync: for large data, rsync directly between clusters when possible.

Architecture

Built on top of SlurmPilot for SSH, sbatch generation, and multi-cluster support. yeet adds:

  • Decorator + explicit submission APIs
  • Resource-aware cluster routing
  • Volume abstraction for data paths
  • Function source extraction (no pickle)
  • Auto uv environment sync
  • Cross-cluster volume sync with smart routing
  • Rich CLI with progress bars

Package Structure

yeet/
├── __init__.py          # Public API exports
├── config.py            # Cluster config loading (~/.yeet/clusters/*.yaml)
├── decorator.py         # @run decorator → RemoteFunction
├── job.py               # Job class: status, logs, download, cancel
├── router.py            # Match resource hints → best cluster
├── remote.py            # Wrapper around SlurmPilot for submission
├── serializer.py        # Function source extraction → remote script
├── sync.py              # rsync with progress (local↔remote and cluster↔cluster)
├── volume.py            # Volume path resolution
├── cli.py               # CLI commands
└── py.typed
tests/
├── test_config.py
├── test_serializer.py
├── test_router.py
└── test_volume.py
pyproject.toml

Cluster Configuration

Each cluster is defined in ~/.yeet/clusters/<name>.yaml:

name: sprint
host: sprint.uni.de
user: dariush

partitions:
  gpu:
    gpus: [a100]
    max_memory: 256G
    max_time: "72:00:00"
  cpu:
    gpus: []
    max_memory: 128G
    max_time: "168:00:00"

volumes:
  datasets: /scratch/dariush/datasets
  checkpoints: /scratch/dariush/checkpoints
  models: /scratch/dariush/models

remote_dir: /scratch/dariush/yeet_jobs
python: uv

setup_commands:
  - "module load cuda/12.1"

# Which other clusters this cluster can SSH to (aliases from its ~/.ssh/config)
reachable:
  cispa: cispa
  jureca: jureca
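A minimal sketch of how config.py might load these files, assuming the field names mirror the YAML above (the `ClusterConfig` dataclass and `load_clusters` helper shown here are illustrative, not the actual classes):

```python
from dataclasses import dataclass, field
from pathlib import Path

import yaml  # pyyaml, a declared dependency


@dataclass
class ClusterConfig:
    # Fields mirror the keys in ~/.yeet/clusters/<name>.yaml
    name: str
    host: str
    user: str
    partitions: dict
    volumes: dict = field(default_factory=dict)
    remote_dir: str = ""
    python: str = "uv"
    setup_commands: list = field(default_factory=list)
    reachable: dict = field(default_factory=dict)


def load_clusters(config_dir: Path = Path.home() / ".yeet" / "clusters") -> dict:
    """Load every <name>.yaml under the config dir into a name -> config registry."""
    clusters = {}
    for path in sorted(config_dir.glob("*.yaml")):
        data = yaml.safe_load(path.read_text())
        clusters[data["name"]] = ClusterConfig(**data)
    return clusters
```

Keeping one file per cluster means adding a cluster is just dropping a YAML file into the directory, with no central registry to edit.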

API

Decorator API (Modal-style)

from yeetjobs import run, Volume

@run(gpu="a100", memory="32G", time="4:00:00")
def train(lr: float = 0.001):
    import torch
    data = Volume("datasets") / "imagenet"
    out = Volume("checkpoints")
    # ... training ...
    torch.save(model, out / "model.pt")

job = train.submit(lr=0.0003)                    # auto-routes to cluster with A100s
job = train.submit(lr=0.0003, cluster="sprint")  # explicit cluster
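A minimal sketch of the decorator plumbing, assuming only what the example above shows (the real `RemoteFunction` also handles routing, code sync, and submission via SlurmPilot):

```python
import functools


class RemoteFunction:
    """Wraps a plain function together with its resource hints."""

    def __init__(self, fn, resources):
        self.fn = fn
        self.resources = resources  # e.g. {"gpu": "a100", "memory": "32G"}
        functools.update_wrapper(self, fn)  # keep __name__, __doc__, etc.

    def __call__(self, *args, **kwargs):
        # Calling the decorated function directly still runs it locally.
        return self.fn(*args, **kwargs)

    def submit(self, *args, cluster=None, **kwargs):
        # Real implementation: route hints to a cluster (unless `cluster`
        # forces one), sync code, generate a script, sbatch via SlurmPilot.
        ...


def run(**resources):
    def decorate(fn):
        return RemoteFunction(fn, resources)
    return decorate
```

Because `@run` returns an object rather than a closure, resource hints stay inspectable and `submit()` can live next to the function instead of in a separate registry.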

Explicit API (for scripts)

from yeetjobs import submit

job = submit(
    "train.py",
    args=["--lr", "0.001"],
    gpu="a100",
    sync_dir="./src",
    time="4:00:00",
)

Job Management

job.status()                                    # PENDING / RUNNING / COMPLETED / FAILED
job.logs()                                      # stdout + stderr
job.download("checkpoints", "*.pt", "./results/")  # rsync artifacts back
job.cancel()
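A blocking wait can be layered on top of status polling; `wait_for` is not part of the yeet API, and the set of terminal states below is an assumption:

```python
import time

# Assumed terminal Slurm-style states; PENDING/RUNNING keep polling.
TERMINAL = {"COMPLETED", "FAILED", "CANCELLED"}


def wait_for(job, poll_seconds=30):
    """Poll job.status() until the job reaches a terminal state, then return it."""
    while (state := job.status()) not in TERMINAL:
        time.sleep(poll_seconds)
    return state
```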

Volume Sync Between Clusters

from yeetjobs import sync

sync(from_cluster="sprint", to_cluster="cispa", volume="checkpoints", pattern="run_42/")

Sync logic:

  1. If source can reach destination → SSH into source, push via rsync
  2. If destination can reach source → SSH into destination, pull via rsync
  3. If neither → relay through local machine (download + upload)
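The three cases above amount to a routing decision over the `reachable` mapping from the cluster config. A sketch, with a stand-in `Cluster` type (`plan_sync` is illustrative, not the actual sync.py function):

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    name: str
    reachable: dict = field(default_factory=dict)  # cluster name -> SSH alias


def plan_sync(src: Cluster, dst: Cluster):
    """Decide which machine drives the rsync for a cluster-to-cluster sync."""
    if dst.name in src.reachable:
        return ("push", src.name)   # SSH into source, rsync to destination
    if src.name in dst.reachable:
        return ("pull", dst.name)   # SSH into destination, rsync from source
    return ("relay", "local")       # neither side reaches the other: download + upload
```

Preferring direct transfer keeps large checkpoints off the local machine's uplink; the relay path is only a fallback.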

CLI

yeet ls                                         # all jobs across all clusters
yeet status <job_id>                            # job status
yeet logs <job_id>                              # stdout/stderr
yeet cancel <job_id>                            # cancel job
yeet clusters                                   # show clusters + capabilities
yeet upload <local_path> <volume> --cluster X   # upload data to cluster
yeet download <job_id> <remote_path> <local>    # download artifacts
yeet sync --from X --to Y --volume V [--pattern P]

How It Works Under the Hood

  1. @run decorator → creates a RemoteFunction capturing resource hints
  2. .submit() → router checks hints against cluster configs, picks best match
  3. Code sync → rsyncs project dir to {remote_dir}/{job_name}/ on chosen cluster
  4. uv sync → rsyncs pyproject.toml + uv.lock, runs uv sync in sbatch preamble
  5. Script generation → extracts function source, writes wrapper .py with Volume resolution and argument injection
  6. Submission → SlurmPilot handles SSH → sbatch → returns job ID
  7. Monitoring → Job object wraps SlurmPilot's status/log retrieval over SSH
  8. Artifacts → job.download() rsyncs files back; yeet sync moves data between clusters
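Step 5 can be sketched with the standard-library inspect module; the decorator stripping and argument injection shown here are simplifications of what serializer.py actually does:

```python
import inspect
import textwrap


def make_remote_script(fn, kwargs):
    """Turn a local function into a standalone script that calls it with kwargs."""
    source = textwrap.dedent(inspect.getsource(fn))
    # Drop decorator lines so the generated script does not need yeet installed
    # on the remote side.
    body = "".join(line for line in source.splitlines(keepends=True)
                   if not line.lstrip().startswith("@"))
    return body + f"\n\nif __name__ == '__main__':\n    {fn.__name__}(**{kwargs!r})\n"
```

Shipping source instead of pickles is what makes the "no serialization magic" principle work: the remote side only needs a Python interpreter and the synced project environment.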

Implementation Order

 #  Step                                                          Complexity
 1  Project scaffolding (pyproject.toml, package structure)       Low
 2  Config system (YAML loading, validation, cluster registry)    Medium
 3  Volume (path-like object, runtime resolution)                 Low
 4  Sync — local↔remote (rsync wrapper with rich progress)        Medium
 5  Sync — cluster↔cluster (direct rsync via SSH, with fallback)  Medium
 6  Router (match gpu/memory/time hints to cluster+partition)     Medium
 7  Serializer (function source extraction → executable script)   Medium
 8  Remote (SlurmPilot wrapper: configure, submit, status, logs)  Medium
 9  Decorator API (@run → RemoteFunction → .submit())             Medium
10  Explicit submit API (submit script with args)                 Low
11  Job class (status, logs, download, cancel)                    Medium
12  CLI (click-based, all commands)                               Medium
13  Tests (config, serializer, router, volume, sync logic)        Medium

Dependencies

  • slurmpilot — SSH, sbatch generation, multi-cluster, job status
  • click — CLI framework
  • pyyaml — config parsing
  • rich — progress bars, nice terminal output

Not in v0.1

  • Multi-GPU / multi-node jobs
  • Auto-retry on preemption / checkpointing
  • Job arrays / hyperparameter sweeps
  • Web dashboard
  • Continuous sync / file watching
  • Async job waiting / callbacks



Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

yeetjobs-0.1.0.tar.gz (85.9 kB)


Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

yeetjobs-0.1.0-py3-none-any.whl (24.7 kB)


File details

Details for the file yeetjobs-0.1.0.tar.gz.

File metadata

  • Download URL: yeetjobs-0.1.0.tar.gz
  • Upload date:
  • Size: 85.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for yeetjobs-0.1.0.tar.gz:

Algorithm    Hash digest
SHA256       6b4b6b3700d0bcefdad19593c3c046e568c17c15f36e3b5a915352da7bc065af
MD5          d23116529893153e66038525fa8e6d94
BLAKE2b-256  e1913400e6a86927c702e099ef2b7aba56d3483cca6508b5a1dc9f0e9283d9a2


Provenance

The following attestation bundles were made for yeetjobs-0.1.0.tar.gz:

Publisher: publish.yml on dwahdany/yeet

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file yeetjobs-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: yeetjobs-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 24.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for yeetjobs-0.1.0-py3-none-any.whl:

Algorithm    Hash digest
SHA256       030e53c7c1ac1784521a50c1940beeb7c716b472daa93ec86748357ec5e2869e
MD5          1b86baad427f7964e68c3665036aa329
BLAKE2b-256  c97b70ac8ecd55c93623e85be3f04482fc5dc3bdb1ab2dcf4ebcb7f101fbd77f


Provenance

The following attestation bundles were made for yeetjobs-0.1.0-py3-none-any.whl:

Publisher: publish.yml on dwahdany/yeet

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
