Render + policy library for SLURM JobSpecs (Mila + DRAC clusters)

salvo

Policy and render library for SLURM. salvo turns a JobSpec into byte-stable sbatch text and decides what to do when a job hits OOM or gets preempted. It does not move code to the cluster, does not open SSH connections, and is not a runner. For transport, pair it with cluv or invoke sbatch directly.

Install

pip install pysalvo

Two CLI commands ship for diagnostics and one-shot rendering. Everything else is meant to be imported.

salvo doctor                                 # topology, ssh alias hygiene, manifest freshness
salvo render spec.yaml --cluster mila        # JobSpec YAML to sbatch text on stdout

Render

from salvo import JobSpec, render

spec = JobSpec(
    name="train",
    cmd=["python", "train.py"],
    gpus=1, cpus=8, mem="32G", time="2h",
    on_oom=["bump_mem(1.5x, max=128G)", "fail"],
)

sbatch_text = render(
    spec,
    cluster_id="mila",
    account="mila",
    partition="unkillable",
)  # str, byte-stable, no side effects

Same JobSpec, same cluster, same inputs, same bytes every run. Useful when you need to audit what was actually submitted six months later.
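One way to make use of byte-stability for audits is to store a digest of the rendered text next to the job ID and compare it later. The sketch below does not call salvo; `toy_render` is a hypothetical stand-in that shows the usual ingredients of deterministic output (sorted directives, fixed line endings, no timestamps), and `digest` is an illustrative helper.

```python
import hashlib
import shlex


def toy_render(directives: dict, cmd: list) -> str:
    """Hypothetical stand-in for a renderer, NOT salvo's implementation.

    Output depends only on the inputs: directives are emitted in sorted
    order, the newline is fixed, and nothing time- or host-dependent is added.
    """
    lines = ["#!/bin/bash"]
    for key in sorted(directives):
        lines.append(f"#SBATCH --{key}={directives[key]}")
    lines.append(shlex.join(cmd))
    return "\n".join(lines) + "\n"


def digest(text: str) -> str:
    """SHA-256 of the rendered script, suitable for an audit log."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Same logical inputs supplied in a different order give identical bytes.
first = toy_render(
    {"mem": "32G", "cpus-per-task": "8", "time": "02:00:00"},
    ["python", "train.py"],
)
second = toy_render(
    {"time": "02:00:00", "mem": "32G", "cpus-per-task": "8"},
    ["python", "train.py"],
)
assert digest(first) == digest(second)
```

With a real renderer, the same pattern applies: hash the returned string at submit time, and re-render plus re-hash when auditing.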

If you leave account and partition off, salvo picks them via salvo.dispatch by shelling out to squeue for live capacity. That path only works on a SLURM login node; library callers (cluv, xgenius) should pass both explicitly to keep render() pure.
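A caller that wants to stay off the dispatch path can guard for it before rendering. The following is a minimal sketch, not part of salvo: `routing_kwargs` is a hypothetical helper, and using the presence of `squeue` on `PATH` as a proxy for "running on a login node" is an assumption.

```python
import shutil


def routing_kwargs(account=None, partition=None, which=shutil.which) -> dict:
    """Return keyword arguments to forward to a render call.

    Hypothetical helper: if both values are given they are passed through,
    which keeps rendering free of subprocess calls. If either is missing,
    auto-selection is only plausible where the squeue binary exists.
    """
    if account and partition:
        return {"account": account, "partition": partition}
    if which("squeue") is None:
        raise RuntimeError(
            "account/partition not given and squeue is not on PATH; "
            "pass both explicitly when running off-cluster"
        )
    return {}  # let the library pick, since squeue is available


explicit = routing_kwargs("mila", "unkillable")
```

The `which` parameter exists only so the check can be exercised in tests without a SLURM installation.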

OOM policy

from salvo.policy import parse, apply_oom, OomContext

steps = parse(["bump_mem(1.5x, max=128G)", "escalate_partition", "fail"])

# prev_spec: the JobSpec of the job that was just OOM-killed (defined elsewhere)
new_spec, action = apply_oom(prev_spec, OomContext(kind="cpu", max_rss_mb=33_500))
# new_spec is a fresh JobSpec with bumped mem, or None if the policy says fail

The DSL is intentionally small. Steps run in order until one applies:

  • bump_mem(<factor>x, max=<size>) — multiplicative bump, capped
  • escalate_partition — clear partition so the next render picks a larger tier
  • fail — terminal
  • bump_gpus(...), callback(...) — parse today, execute in a later release
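The "first step that applies wins" rule can be illustrated with a small standalone interpreter. This is a sketch under stated assumptions, not salvo's code: `ToySpec`, `to_mb`, and `apply_first` are hypothetical names, and it assumes that `bump_mem` stops applying once memory has reached the cap and that `escalate_partition` stops applying once the partition is already cleared.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, replace

_UNITS_MB = {"M": 1, "G": 1024, "T": 1024 * 1024}
_BUMP = re.compile(r"bump_mem\((\d+(?:\.\d+)?)x,\s*max=(\d+[MGT])\)")


def to_mb(size: str) -> int:
    """Convert a size such as '32G' to mebibytes."""
    match = re.fullmatch(r"(\d+)([MGT])", size)
    if match is None:
        raise ValueError(f"bad size: {size!r}")
    return int(match.group(1)) * _UNITS_MB[match.group(2)]


@dataclass(frozen=True)
class ToySpec:
    """Hypothetical, minimal stand-in for a job spec."""
    mem_mb: int
    partition: str | None


def apply_first(steps: list[str], spec: ToySpec) -> tuple[ToySpec | None, str]:
    """Apply the first step that can still change something.

    Returns (new_spec, step_text); new_spec is None when the outcome is fail.
    """
    for step in steps:
        bump = _BUMP.fullmatch(step)
        if bump:
            factor, cap = float(bump.group(1)), to_mb(bump.group(2))
            if spec.mem_mb < cap:  # assumption: at the cap, fall through
                return replace(spec, mem_mb=min(int(spec.mem_mb * factor), cap)), step
        elif step == "escalate_partition":
            if spec.partition is not None:  # assumption: already cleared, fall through
                return replace(spec, partition=None), step
        elif step == "fail":
            return None, step
        else:
            raise ValueError(f"unknown step: {step!r}")
    return None, "fail"  # ran out of steps


steps = ["bump_mem(1.5x, max=128G)", "escalate_partition", "fail"]
spec: ToySpec | None = ToySpec(mem_mb=to_mb("32G"), partition="unkillable")
trail = []
while spec is not None:
    spec, action = apply_first(steps, spec)
    trail.append((action, spec.mem_mb if spec else None))
```

Starting from 32G, this toy policy bumps to 48G, 72G, 108G, then 128G (capped from 162G), clears the partition on the next OOM, and fails on the one after that.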

Cluster topology

Five presets ship in salvo/topology/presets/: mila, rorqual, narval, beluga, cedar. More DRAC clusters land as YAMLs are contributed. Each YAML lists accounts, partitions, and capacity rules.

from salvo.topology import load_preset, list_presets

list_presets()                      # ['beluga', 'cedar', 'mila', 'narval', 'rorqual']
cluster = load_preset("mila")       # ClusterTopology

Contributing a new cluster is one YAML file. See CONTRIBUTING.md.
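To see why a single file can be enough, here is a minimal sketch of directory-based preset discovery. It is an assumption about how such a lookup could work, not salvo's implementation; `list_yaml_presets` is a hypothetical name and the `.yaml` suffix is assumed.

```python
import tempfile
from pathlib import Path


def list_yaml_presets(directory: Path) -> list:
    """Return preset names (file stems) for every *.yaml file, sorted."""
    return sorted(path.stem for path in directory.glob("*.yaml"))


# Demonstrate with a throwaway directory: adding one file adds one preset.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("mila", "narval", "cedar"):
        (root / f"{name}.yaml").write_text("# placeholder\n")
    before = list_yaml_presets(root)
    (root / "newcluster.yaml").write_text("# placeholder\n")
    after = list_yaml_presets(root)
```

Under this scheme no registry needs editing when a cluster is added; the file name is the preset ID.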

Decorator (optional, Python-native)

If you'd rather write a function than a YAML spec:

from salvo import cluster

@cluster.submit(gpus=1, cpus=8, mem="32G", time="2h",
                on_oom=["bump_mem(1.5x, max=128G)", "fail"])
def train(seed: int):
    ...

handle = train.submit(seed=42)

train.submit(...) runs the local-sbatch convenience path: it renders, calls sbatch, and returns a handle. Skip the decorator if you're embedding salvo inside another tool's submit flow; just call render() and let that tool do the submission.
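For readers curious how a decorator can expose a `.submit` method while leaving the function callable as usual, the sketch below shows the general pattern. It is not salvo's implementation: `submit_with`, the script layout, and the injectable `runner` are all hypothetical. `sbatch_runner` relies on the documented behaviour that sbatch reads the script from standard input when no file name is given.

```python
import functools
import shlex
import subprocess


def sbatch_runner(script: str) -> str:
    """Pipe a script to sbatch on stdin and return its stdout."""
    result = subprocess.run(
        ["sbatch"], input=script, text=True, capture_output=True, check=True
    )
    return result.stdout.strip()


def submit_with(runner=sbatch_runner, **resources):
    """Hypothetical decorator factory that attaches a .submit method."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            return fn(*args, **kwargs)  # plain local call is unchanged

        def submit(**call_kwargs):
            # Toy header lines for illustration, not real sbatch flag names.
            header = [f"# toy-resource {k}={v}" for k, v in sorted(resources.items())]
            args = " ".join(
                f"--{k}={shlex.quote(str(v))}" for k, v in sorted(call_kwargs.items())
            )
            script = "\n".join(["#!/bin/bash", *header, f"run {fn.__name__} {args}"]) + "\n"
            return runner(script)

        wrapper.submit = submit
        return wrapper
    return decorate


# A fake runner that echoes the script makes the flow testable off-cluster.
@submit_with(runner=lambda script: script, gpus=1, mem="32G")
def train(seed: int):
    return seed * 2


rendered = train.submit(seed=42)
```

Injecting the runner is the same separation the paragraph above recommends: the rendering step stays testable, and the host tool decides how submission happens.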

Use with cluv

cluv handles SSH, code sync, and the sbatch call. salvo handles the policy. The natural composition is cluv importing salvo's policy library when a user opts in:

# in your project's pyproject.toml
[tool.cluv.retry]
on_oom = ["bump_mem(1.5x, max=128G)", "fail"]
max_hops = 5

This is a proposal, not a shipped feature in cluv. An upstream issue is in draft.

Use with anything that calls sbatch

salvo has no SSH or async dependencies. It is a pydantic + stdlib library. Any tool that runs sbatch can import salvo.render, salvo.policy.apply_oom, and salvo.topology.load_preset independently. Two runnable examples under examples/ show the wiring end-to-end: cluv_integration.py (policy + render) and xgenius_integration.py (policy-only, when the host tool keeps its own renderer).

License

MIT. See LICENSE.
