Render + policy library for SLURM JobSpecs (Mila + DRAC clusters)
salvo
Policy and render library for SLURM. salvo turns a JobSpec into byte-stable sbatch text and decides what to do when a job hits OOM or gets preempted. It does not move code to the cluster, does not open SSH connections, and is not a runner. For transport, pair it with cluv or invoke sbatch directly.
Install
pip install pysalvo
Two CLI commands ship for diagnostics and one-shot rendering. Everything else is meant to be imported.
salvo doctor # topology, ssh alias hygiene, manifest freshness
salvo render spec.yaml --cluster mila # JobSpec YAML to sbatch text on stdout
Render
from salvo import JobSpec, render
spec = JobSpec(
    name="train",
    cmd=["python", "train.py"],
    gpus=1, cpus=8, mem="32G", time="2h",
    on_oom=["bump_mem(1.5x, max=128G)", "fail"],
)
sbatch_text = render(
    spec,
    cluster_id="mila",
    account="mila",
    partition="unkillable",
)  # str, byte-stable, no side effects
Same JobSpec, same cluster, same inputs, same bytes every run. Useful when you need to audit what was actually submitted six months later.
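Here is a minimal sketch of what byte-stable output buys an audit trail. `toy_render` and `fingerprint` are hypothetical names, not salvo's `render()`. The toy writes `#SBATCH` lines in sorted key order so identical inputs always produce identical bytes, and a SHA-256 of the text gives you a fingerprint you could log next to the job ID at submit time.

```python
import hashlib
import shlex


def toy_render(name: str, cmd: list[str], resources: dict[str, str]) -> str:
    """Hypothetical renderer: sorted #SBATCH lines, so output depends only on
    the inputs and never on dict insertion order."""
    lines = ["#!/bin/bash", f"#SBATCH --job-name={name}"]
    for key in sorted(resources):
        lines.append(f"#SBATCH --{key}={resources[key]}")
    lines.append(shlex.join(cmd))  # shell-quoted command line
    return "\n".join(lines) + "\n"


def fingerprint(sbatch_text: str) -> str:
    """SHA-256 of the exact bytes; store it alongside the job ID."""
    return hashlib.sha256(sbatch_text.encode("utf-8")).hexdigest()


# Same inputs in a different dict order still yield the same bytes.
a = toy_render("train", ["python", "train.py"],
               {"mem": "32G", "gres": "gpu:1", "cpus-per-task": "8"})
b = toy_render("train", ["python", "train.py"],
               {"cpus-per-task": "8", "gres": "gpu:1", "mem": "32G"})
```

Six months later, re-rendering the stored spec and comparing fingerprints tells you whether the recorded script is what the spec actually produces.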
If you leave account and partition off, salvo picks them via salvo.dispatch by shelling out to squeue for live capacity. That path only works on a SLURM login node; library callers (cluv, xgenius) should pass both explicitly to keep render() pure.
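A caller-side guard like the one below keeps the pure path explicit. This is a sketch, not part of salvo: `explicit_placement` and `can_auto_pick` are hypothetical helpers. The guard refuses to proceed without both values and, in its error message, notes whether `squeue` is even on PATH (via the standard library's `shutil.which`), which is roughly the test for "am I on a login node".

```python
import shutil
from typing import Optional


def can_auto_pick() -> bool:
    """True only where squeue is on PATH, roughly 'on a SLURM login node'."""
    return shutil.which("squeue") is not None


def explicit_placement(account: Optional[str],
                       partition: Optional[str]) -> tuple[str, str]:
    """Hypothetical helper for library callers: insist on both values so the
    render stays pure instead of falling back to live squeue queries."""
    missing = [name for name, value in (("account", account), ("partition", partition))
               if not value]
    if missing:
        where = "" if can_auto_pick() else " (and squeue is not on PATH here)"
        raise ValueError(f"missing {', '.join(missing)}{where}")
    return account, partition
```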
OOM policy
from salvo.policy import parse, apply_oom, OomContext
steps = parse(["bump_mem(1.5x, max=128G)", "escalate_partition", "fail"])
# prev_spec: the JobSpec whose last run hit OOM
new_spec, action = apply_oom(prev_spec, OomContext(kind="cpu", max_rss_mb=33_500))
# new_spec is a fresh JobSpec with bumped mem, or None if the policy says fail
The DSL is intentionally small. Steps run in order until one applies:
- bump_mem(<factor>x, max=<size>): multiplicative memory bump, capped at max
- escalate_partition: clear the partition so the next render picks a larger tier
- fail: terminal step
- bump_gpus(...), callback(...): accepted by the parser today; execution lands in a later release
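To make "run in order until one applies" concrete, here is a toy evaluator. It is not salvo.policy: the regex grammar, the binary (MiB) size units, and the rules that bump_mem applies only while the request is below its cap and escalate_partition applies only once are all assumptions for this sketch. bump_gpus and callback are rejected here. Walking the policy used above from 32G gives 32 x 1.5 = 48G, then 72G, then 108G, then 162G capped to 128G; after that the bump no longer applies and the chain falls through to escalate_partition and finally fail.

```python
import re
from typing import Optional

# Toy grammar for a single step shape; salvo's real parser may differ.
_BUMP = re.compile(r"bump_mem\(\s*([\d.]+)x\s*,\s*max=(\d+)([MG])\s*\)")
_MIB = {"M": 1, "G": 1024}  # binary units, an assumption


def to_mib(size: str) -> int:
    return int(size[:-1]) * _MIB[size[-1]]


def apply_first(steps: list[str], mem_mib: int, partition: Optional[str]):
    """Return (action, mem_mib, partition) for the first step that applies."""
    for step in steps:
        bump = _BUMP.fullmatch(step)
        if bump:
            factor = float(bump.group(1))
            cap = to_mib(bump.group(2) + bump.group(3))
            if mem_mib < cap:  # assumed: a bump applies only below its cap
                return "bump_mem", min(int(mem_mib * factor), cap), partition
        elif step == "escalate_partition":
            if partition is not None:  # assumed: applies once, then falls through
                return "escalate_partition", mem_mib, None
        elif step == "fail":
            return "fail", mem_mib, partition
        else:
            raise ValueError(f"step not handled by this toy: {step}")
    return "fail", mem_mib, partition


# Simulate repeated OOMs until the policy gives up.
steps = ["bump_mem(1.5x, max=128G)", "escalate_partition", "fail"]
mem, partition = to_mib("32G"), "unkillable"
history = []
while True:
    action, mem, partition = apply_first(steps, mem, partition)
    history.append((action, mem // 1024))  # record memory in GiB
    if action == "fail":
        break
# history: bump to 48, 72, 108, then 128 (capped), escalate, fail
```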
Cluster topology
Five presets ship in salvo/topology/presets/: mila, rorqual, narval, beluga, cedar. More DRAC clusters land as YAMLs are contributed. Each YAML lists accounts, partitions, and capacity rules.
from salvo.topology import load_preset, list_presets
list_presets() # ['beluga', 'cedar', 'mila', 'narval', 'rorqual']
cluster = load_preset("mila") # ClusterTopology
Contributing a new cluster is one YAML file. See CONTRIBUTING.md.
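The shape of a preset can be pictured as plain dataclasses. The sketch below is hypothetical: the field names, the partition names small/medium/large, their limits, and the integer `tier` ordering are invented for illustration and are not salvo's ClusterTopology schema (see the real YAMLs in salvo/topology/presets/). It also shows one way "pick a larger tier" could work after escalate_partition clears the partition: choose the lowest tier that fits, strictly above the one that failed.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Partition:
    name: str
    max_gpus: int
    max_mem_gb: int
    tier: int  # lower = smaller/cheaper (assumed ordering)


@dataclass
class Topology:
    cluster_id: str
    accounts: list[str]
    partitions: list[Partition] = field(default_factory=list)

    def smallest_fit(self, gpus: int, mem_gb: int,
                     above_tier: int = -1) -> Optional[Partition]:
        """Lowest-tier partition covering the request, strictly above
        `above_tier` so an escalation never lands on the same tier."""
        fits = [p for p in self.partitions
                if p.tier > above_tier and p.max_gpus >= gpus and p.max_mem_gb >= mem_gb]
        return min(fits, key=lambda p: p.tier, default=None)


# Made-up example cluster, not a real preset.
topo = Topology("example", ["acct-a"], [
    Partition("small", max_gpus=1, max_mem_gb=48, tier=0),
    Partition("medium", max_gpus=4, max_mem_gb=128, tier=1),
    Partition("large", max_gpus=8, max_mem_gb=512, tier=2),
])
```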
Decorator (optional, Python-native)
If you'd rather write a function than a YAML spec:
from salvo import cluster
@cluster.submit(gpus=1, cpus=8, mem="32G", time="2h",
                on_oom=["bump_mem(1.5x, max=128G)", "fail"])
def train(seed: int):
    ...
handle = train.submit(seed=42)
train.submit(...) runs the local-sbatch convenience path: it renders, calls sbatch, and returns a handle. Skip the decorator if you're embedding salvo inside another tool's submit flow; just call render() and let that tool do the submission.
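The decorator itself is ordinary Python. Below is a hypothetical `toy_submit` decorator factory, not salvo's `cluster.submit`: it remembers the resource kwargs, leaves direct calls to the function untouched, and attaches a `.submit(**kwargs)` that returns a plain record. A real implementation would render and call sbatch at that point; the sketch stops short of that so it has no side effects.

```python
import functools
from typing import Any, Callable


def toy_submit(**resources: Any) -> Callable:
    """Hypothetical decorator factory: remember resources, add .submit()."""
    def decorate(fn: Callable) -> Callable:
        @functools.wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            return fn(*args, **kwargs)  # calling directly still runs locally

        def submit(**kwargs: Any) -> dict:
            # Where the real path would render + sbatch, return a record instead.
            return {"name": fn.__name__,
                    "resources": dict(resources),
                    "kwargs": dict(kwargs)}

        wrapper.submit = submit
        return wrapper
    return decorate


@toy_submit(gpus=1, cpus=8, mem="32G", on_oom=["bump_mem(1.5x, max=128G)", "fail"])
def train(seed: int) -> int:
    return seed * 2


record = train.submit(seed=42)
```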
Use with cluv
cluv handles SSH, code sync, and the sbatch call. salvo handles the policy. The natural composition is cluv importing salvo's policy library when a user opts in:
# in your project's pyproject.toml
[tool.cluv.retry]
on_oom = ["bump_mem(1.5x, max=128G)", "fail"]
max_hops = 5
This is a proposal, not a shipped feature in cluv. An upstream issue is in draft.
Use with anything that calls sbatch
salvo has no SSH or async dependencies. It is a pydantic + stdlib library. Any tool that runs sbatch can import salvo.render, salvo.policy.apply_oom, and salvo.topology.load_preset independently. Two runnable examples under examples/ show the wiring end-to-end: cluv_integration.py (policy + render) and xgenius_integration.py (policy-only, when the host tool keeps its own renderer).
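For a tool that shells out itself, the glue around salvo is mostly parsing. The sketch below is hypothetical glue, not shipped in salvo. It pipes rendered text to `sbatch --parsable` on standard input (sbatch reads the script from stdin when no file is given, and `--parsable` prints `jobid` or `jobid;cluster`), and maps a final sacct `State` value to the two events salvo's policies react to. `OUT_OF_MEMORY` and `PREEMPTED` are SLURM job states; how and when you poll sacct is up to the host tool.

```python
import subprocess
from typing import Optional


def parse_parsable(stdout: str) -> str:
    """`sbatch --parsable` prints 'jobid' or 'jobid;cluster'; keep the id."""
    return stdout.strip().split(";", 1)[0]


def submit(sbatch_text: str) -> str:
    """Pipe the rendered script to sbatch on stdin (needs sbatch on PATH)."""
    result = subprocess.run(["sbatch", "--parsable"], input=sbatch_text,
                            capture_output=True, text=True, check=True)
    return parse_parsable(result.stdout)


def classify(state: str) -> Optional[str]:
    """Map a sacct State to a policy trigger: 'oom', 'preempted', or None.
    Only the first word counts, since sacct may print e.g. 'CANCELLED by 1000'."""
    words = state.split()
    base = words[0] if words else ""
    if base == "OUT_OF_MEMORY":
        return "oom"
    if base == "PREEMPTED":
        return "preempted"
    return None
```

On "oom", the host tool would build an OomContext and call salvo.policy.apply_oom; on None, there is nothing for the policy to decide.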
License
MIT. See LICENSE.
Download files
File details
Details for the file pysalvo-0.2.0.tar.gz.
File metadata
- Download URL: pysalvo-0.2.0.tar.gz
- Upload date:
- Size: 87.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 233f6e4fc3081a3985c0a8780963310065323156dc3ff0f76f99c48b21191d18 |
| MD5 | dc98df20a67332551fcb4bcabba49770 |
| BLAKE2b-256 | c6023e2522e27fd32cb772c7c795b45151c542a9dfcb9f9b64ca4a198ae64fd7 |
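To check a downloaded sdist against the SHA256 digest listed above, the standard library is enough. `sha256_of` and `verify` are just helper names for this sketch; the expected value is copied from the table. pip can enforce the same check at install time with a `--hash=sha256:...` entry in a requirements file plus `--require-hashes`.

```python
import hashlib
import hmac
from pathlib import Path

EXPECTED_SDIST_SHA256 = "233f6e4fc3081a3985c0a8780963310065323156dc3ff0f76f99c48b21191d18"


def sha256_of(path: Path, chunk_size: int = 1 << 16) -> str:
    """Stream the file in chunks so large archives need little memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: Path, expected: str = EXPECTED_SDIST_SHA256) -> bool:
    """Constant-time comparison of the computed and expected hex digests."""
    return hmac.compare_digest(sha256_of(path), expected.lower())


# Usage: verify(Path("pysalvo-0.2.0.tar.gz"))
```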
Provenance
The following attestation bundles were made for pysalvo-0.2.0.tar.gz:
Publisher: release.yml on wietzesuijker/salvo
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pysalvo-0.2.0.tar.gz
- Subject digest: 233f6e4fc3081a3985c0a8780963310065323156dc3ff0f76f99c48b21191d18
- Sigstore transparency entry: 1594074800
- Sigstore integration time:
- Permalink: wietzesuijker/salvo@8f11dcf20bf59ccdca290451b67bce6a8563c3bd
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/wietzesuijker
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@8f11dcf20bf59ccdca290451b67bce6a8563c3bd
- Trigger Event: push
File details
Details for the file pysalvo-0.2.0-py3-none-any.whl.
File metadata
- Download URL: pysalvo-0.2.0-py3-none-any.whl
- Upload date:
- Size: 32.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 90f4ed70898ed71be5ae6e8a49a68077896b4b6597ba115b44842e801056798f |
| MD5 | 0e58701bdcb8caa5b54120c7b25502a3 |
| BLAKE2b-256 | 96c8e5c6e05827c8ce6b338fc5735b56d42be7636f79ab91f58093b70919eaaa |
Provenance
The following attestation bundles were made for pysalvo-0.2.0-py3-none-any.whl:
Publisher: release.yml on wietzesuijker/salvo
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pysalvo-0.2.0-py3-none-any.whl
- Subject digest: 90f4ed70898ed71be5ae6e8a49a68077896b4b6597ba115b44842e801056798f
- Sigstore transparency entry: 1594074863
- Sigstore integration time:
- Permalink: wietzesuijker/salvo@8f11dcf20bf59ccdca290451b67bce6a8563c3bd
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/wietzesuijker
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@8f11dcf20bf59ccdca290451b67bce6a8563c3bd
- Trigger Event: push