Live, process-isolated node-local hardware telemetry for active Slurm jobs
slurmwatch
Live, per-process CPU / memory / GPU telemetry for a running Slurm job — with a plain-language efficiency verdict.
Features
- Efficiency verdict: grades CPU, memory, and GPU (GOOD/UNDERUSED/IDLE/WARNING) so you know when you're wasting cores or GPUs.
- Per-process GPU attribution: NVML sees only your PIDs, so a neighbor's job never inflates your numbers.
- Honest memory: working set (RSS minus reclaimable cache) with a configurable OOM guard.
- Works anywhere: full live telemetry on the node; auto-falls back to Slurm accounting (`sstat`) from a login node.
- Zero config: `slurmwatch <jobid>` auto-discovers jobs, cgroup v1/v2, and whether it's on the node. No flags to memorize.
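To make the "honest memory" idea concrete, here is a minimal sketch of a working-set calculation. It assumes a cgroup v2 `memory.stat` layout and treats `inactive_file` as the reclaimable cache, which is a common convention but not necessarily the formula slurmwatch uses. The function names `parse_memory_stat` and `working_set_bytes` are hypothetical.

```python
def parse_memory_stat(text: str) -> dict[str, int]:
    """Parse the 'key value' lines of a cgroup memory.stat file."""
    stats: dict[str, int] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def working_set_bytes(usage_bytes: int, memory_stat_text: str) -> int:
    """Usage minus reclaimable page cache (assumed here: inactive_file)."""
    stats = parse_memory_stat(memory_stat_text)
    reclaimable = stats.get("inactive_file", 0)
    # Clamp at zero in case the two counters were read at different instants.
    return max(usage_bytes - reclaimable, 0)


# Made-up numbers for illustration only.
sample_stat = "anon 800\nfile 300\ninactive_file 200\nactive_file 100\n"
print(working_set_bytes(1100, sample_stat))  # 1100 - 200 = 900
```

Subtracting only the cache the kernel can drop cheaply keeps a job that streams large files from looking as if it were about to hit its memory limit.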
Install
pip install slurmwatch
Requires Python 3.10+ and Linux with cgroup v1 or v2. One install works across a mixed cluster: GPU monitoring (NVIDIA, via pynvml) auto-activates on GPU nodes and is silently skipped on CPU-only nodes. Works with pipx / uv too — e.g. uv tool install slurmwatch.
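The "auto-activates on GPU nodes" behavior can be pictured with a small sketch. This is not slurmwatch's code; it only shows one plausible way to probe for NVML with pynvml and fall back quietly when the module, the driver, or the GPUs are missing. The function name `gpu_monitoring_available` is hypothetical.

```python
def gpu_monitoring_available() -> bool:
    """Return True only if pynvml imports and NVML reports at least one GPU."""
    try:
        import pynvml  # may be absent in some environments
    except ImportError:
        return False
    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError:
        return False  # no NVIDIA driver library on this node
    try:
        return pynvml.nvmlDeviceGetCount() > 0
    except pynvml.NVMLError:
        return False
    finally:
        pynvml.nvmlShutdown()


print("GPU monitoring:", "on" if gpu_monitoring_available() else "skipped")
```

Probing at startup rather than at install time is what lets a single install serve both CPU-only and GPU nodes.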
Usage
slurmwatch # auto-discover and attach to your running job
slurmwatch 12345 # attach to a job (array: 12345_3, het: 12345+1)
slurmwatch --demo # try the live TUI right now — no Slurm needed
slurmwatch 12345 --once --json # one machine-readable snapshot, then exit
slurmwatch 12345 --log run.jsonl # headless logging (JSON Lines or CSV)
For full live telemetry, run on the node executing the job: `srun --jobid 12345 --overlap slurmwatch 12345`. From a login node you get an `sstat` summary (peak memory, CPU time, and allocation) instead. GPU utilization isn't available remotely, since Slurm tracks GPU count, not per-device utilization.
TUI keys: c/m/g/v focus a panel, arrows/PgUp/PgDn scroll, q quits.
Exit codes: 0 success · 1 runtime failure · 2 bad config. Errors go to stderr, so piped --once/--log output stays clean.
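Because errors go to stderr and the exit code carries the outcome, a script can treat stdout as pure data. The sketch below shows that pattern with `subprocess`. To stay runnable without Slurm it uses a stand-in command; in practice the command would be something like `["slurmwatch", "12345", "--once", "--json"]`. The helper name `run_snapshot` is hypothetical.

```python
import subprocess
import sys

# Meanings taken from the exit-code list above.
EXIT_MEANINGS = {0: "success", 1: "runtime failure", 2: "bad config"}


def run_snapshot(cmd: list[str]) -> tuple[str, str, str]:
    """Run a command and return (exit-code meaning, stdout, stderr)."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    meaning = EXIT_MEANINGS.get(
        proc.returncode, f"unexpected exit code {proc.returncode}"
    )
    return meaning, proc.stdout, proc.stderr


# Stand-in that mimics a "bad config" exit: data on stdout, error on stderr.
fake_cmd = [
    sys.executable,
    "-c",
    "import sys; print('{}'); sys.stderr.write('oops'); sys.exit(2)",
]
meaning, out, err = run_snapshot(fake_cmd)
print(meaning)
```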
See slurmwatch --help for the full flag list. Behavior is also tunable via SLURMWATCH_* environment variables — e.g. SLURMWATCH_OOM_WARN, SLURMWATCH_GPU_IDLE_PCT, SLURMWATCH_POLL_INTERVAL (plus ASCII mode and more).
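When launching slurmwatch from another program, the environment variables can be set per invocation instead of globally. The following is a sketch under assumptions: the variable names come from the list above, but the value formats (plain numbers here) and the helper name `slurmwatch_env` are guesses, so check `slurmwatch --help` for the accepted values.

```python
import os


def slurmwatch_env(**overrides: object) -> dict[str, str]:
    """Copy the current environment and add SLURMWATCH_* overrides."""
    env = dict(os.environ)
    for name, value in overrides.items():
        env[f"SLURMWATCH_{name.upper()}"] = str(value)
    return env


# Assumed value formats, for illustration only.
env = slurmwatch_env(poll_interval=2, gpu_idle_pct=5)
print(env["SLURMWATCH_POLL_INTERVAL"], env["SLURMWATCH_GPU_IDLE_PCT"])
```

The resulting dictionary can be passed as `env=env` to `subprocess.run`, which leaves the parent process's environment untouched.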
Library
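import asyncio

from slurmwatch import TelemetryCollector, resolve_job_context


async def sample(job_id: str):
    # Resolve the job's context, then attach a collector to it.
    collector = TelemetryCollector(resolve_job_context(job_id))
    await collector.start()
    try:
        # Take one snapshot and print it as JSON.
        print((await collector.next_snapshot()).to_json())
    finally:
        # Always stop the collector, even if sampling fails.
        await collector.stop()


asyncio.run(sample("12345"))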
Limitations
- NVIDIA-only GPU support (no AMD/ROCm).
- Single-node view — multi-node jobs show data for the node you're on.
- Live GPU utilization and working-set memory require running on the job's node.
License
MIT
Download files
File details
Details for the file slurmwatch-0.1.1.tar.gz.
File metadata
- Download URL: slurmwatch-0.1.1.tar.gz
- Upload date:
- Size: 2.5 MB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a8fa25ad0ae640554c6b3ac9dde1731838fcd7495d27c76fb6c197543e7dd94d |
| MD5 | d65041b6b2b0d4415f4f9db8942fac45 |
| BLAKE2b-256 | 38b65ad8926d68904e538eccdccc26a5e175c76880080cf957e283c9a6a3a142 |
Provenance
The following attestation bundles were made for slurmwatch-0.1.1.tar.gz:
Publisher: release.yml on PursuitOfDataScience/slurmwatch
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: slurmwatch-0.1.1.tar.gz
  - Subject digest: a8fa25ad0ae640554c6b3ac9dde1731838fcd7495d27c76fb6c197543e7dd94d
- Sigstore transparency entry: 2064442508
- Sigstore integration time:
- Permalink: PursuitOfDataScience/slurmwatch@aeae54a27227192abb6fa8896a6c2f0c22a2977e
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/PursuitOfDataScience
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@aeae54a27227192abb6fa8896a6c2f0c22a2977e
- Trigger Event: push
File details
Details for the file slurmwatch-0.1.1-py3-none-any.whl.
File metadata
- Download URL: slurmwatch-0.1.1-py3-none-any.whl
- Upload date:
- Size: 34.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 5bbf30b5a33df2a88be0d426deeab357985c2ba66e17ed3aea95c2e5eadd6278 |
| MD5 | 081013e4e32c84870ddcf67f292c0e1b |
| BLAKE2b-256 | 3ff89551a5ce7c9c3c6b751db03736e8d583851f0858a291f84c87db2b28bacb |
Provenance
The following attestation bundles were made for slurmwatch-0.1.1-py3-none-any.whl:
Publisher: release.yml on PursuitOfDataScience/slurmwatch
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: slurmwatch-0.1.1-py3-none-any.whl
  - Subject digest: 5bbf30b5a33df2a88be0d426deeab357985c2ba66e17ed3aea95c2e5eadd6278
- Sigstore transparency entry: 2064442509
- Sigstore integration time:
- Permalink: PursuitOfDataScience/slurmwatch@aeae54a27227192abb6fa8896a6c2f0c22a2977e
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/PursuitOfDataScience
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@aeae54a27227192abb6fa8896a6c2f0c22a2977e
- Trigger Event: push