Live, process-isolated node-local hardware telemetry for active Slurm jobs

slurmwatch

Live, per-process CPU / memory / GPU telemetry for a running Slurm job — with a plain-language efficiency verdict.

[Screenshot: slurmwatch live TUI dashboard showing per-process CPU, memory, and GPU telemetry for a Slurm job. Memory climbs from safe into the OOM-guard WARNING and CRITICAL bands while the allocation-efficiency verdict flags an idle GPU (1 of 2 active).]

Features

  • Efficiency verdict — grades CPU, memory, and GPU (GOOD / UNDERUSED / IDLE / WARNING) so you know when you're wasting cores or GPUs.
  • Per-process GPU attribution — NVML sees only your PIDs, so a neighbor's job never inflates your numbers.
  • Honest memory — working set (RSS minus reclaimable cache) with a configurable OOM guard.
  • Works anywhere — full live telemetry on the node; auto-falls back to Slurm accounting (sstat) from a login node.
  • Zero config: slurmwatch <jobid> auto-discovers jobs, detects cgroup v1/v2, and works out whether it's running on the job's node. No flags to memorize.
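
To make the "honest memory" idea concrete, here is a minimal sketch of a working-set calculation with guard bands. It is not slurmwatch's actual implementation. It assumes cgroup v2 inputs (the value of memory.current and the text of memory.stat) and subtracts inactive_file as the reclaimable cache, a convention also used by tools such as cAdvisor. The function names and the 0.80 / 0.95 thresholds are hypothetical, chosen only for illustration.

```python
def parse_memory_stat(text: str) -> dict[str, int]:
    """Parse cgroup memory.stat content ("key value" per line) into a dict."""
    stats: dict[str, int] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def working_set_bytes(memory_current: int, memory_stat_text: str) -> int:
    """Usage minus reclaimable page cache (assumed here: inactive_file)."""
    inactive_file = parse_memory_stat(memory_stat_text).get("inactive_file", 0)
    return max(memory_current - inactive_file, 0)


def oom_band(working_set: int, limit: int,
             warn: float = 0.80, critical: float = 0.95) -> str:
    """Classify memory pressure; the thresholds are illustrative assumptions."""
    if limit <= 0:
        return "UNKNOWN"  # no usable limit to compare against
    ratio = working_set / limit
    if ratio >= critical:
        return "CRITICAL"
    if ratio >= warn:
        return "WARNING"
    return "SAFE"


if __name__ == "__main__":
    stat_text = "anon 600\ninactive_file 300\n"
    ws = working_set_bytes(1000, stat_text)
    print(ws, oom_band(ws, limit=1000))
```

Subtracting reclaimable cache matters because the kernel can drop that cache under pressure, so counting it would overstate how close the job is to its limit.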

Install

pip install slurmwatch

Requires Python 3.10+ and Linux with cgroup v1 or v2. One install works across a mixed cluster: GPU monitoring (NVIDIA, via pynvml) auto-activates on GPU nodes and is silently skipped on CPU-only nodes. Works with pipx / uv too — e.g. uv tool install slurmwatch.

Usage

slurmwatch                       # auto-discover and attach to your running job
slurmwatch 12345                 # attach to a job (array: 12345_3, het: 12345+1)
slurmwatch --demo                # try the live TUI right now — no Slurm needed
slurmwatch 12345 --once --json   # one machine-readable snapshot, then exit
slurmwatch 12345 --log run.jsonl # headless logging (JSON Lines or CSV)

For full live telemetry, run on the node executing the job: srun --jobid 12345 --overlap slurmwatch 12345. From a login node you get an sstat summary (peak memory + CPU time + allocation) instead — GPU utilization isn't available remotely, since Slurm tracks GPU count, not per-device util.
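
A log written with --log run.jsonl can be post-processed with a few lines of Python. The sketch below assumes only that the file is JSON Lines (one JSON object per line). It makes no assumption about field names, since the snapshot schema is not described here. The helper name read_snapshots is hypothetical.

```python
import json
from pathlib import Path
from typing import Iterator


def read_snapshots(path: str) -> Iterator[dict]:
    """Yield one dict per valid JSON Lines record in the file."""
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip a partially written line (e.g. job still running)
            if isinstance(record, dict):
                yield record


if __name__ == "__main__":
    snapshots = list(read_snapshots("run.jsonl"))
    print(f"{len(snapshots)} snapshots loaded")
```

Skipping undecodable lines is a deliberate choice for logs that may still be appended to while you read them.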

TUI keys: c/m/g/v focus a panel, arrows/PgUp/PgDn scroll, q quits.

Exit codes: 0 success · 1 runtime failure · 2 bad config. Errors go to stderr, so piped --once/--log output stays clean.
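
Because errors go to stderr and the exit code carries the outcome, a wrapper script can branch on the code without parsing any output. A minimal sketch, using only the three exit codes listed above; the helper name run_and_classify is hypothetical:

```python
import subprocess
import sys

# Meanings taken from the exit-code list above.
EXIT_MEANINGS = {0: "success", 1: "runtime failure", 2: "bad config"}


def run_and_classify(cmd: list[str]) -> tuple[str, str]:
    """Run a command; return (meaning of its exit code, captured stdout)."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    meaning = EXIT_MEANINGS.get(
        proc.returncode, f"unexpected exit code {proc.returncode}"
    )
    if proc.returncode != 0:
        # Relay the tool's own error text; stdout stays untouched.
        print(proc.stderr, file=sys.stderr, end="")
    return meaning, proc.stdout


if __name__ == "__main__":
    meaning, output = run_and_classify(["slurmwatch", "12345", "--once", "--json"])
    print(meaning)
```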

See slurmwatch --help for the full flag list. Behavior is also tunable via SLURMWATCH_* environment variables — e.g. SLURMWATCH_OOM_WARN, SLURMWATCH_GPU_IDLE_PCT, SLURMWATCH_POLL_INTERVAL (plus ASCII mode and more).
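
When launching slurmwatch from a script, the environment variables can be set for that one invocation instead of being exported globally. The sketch below is an illustration only: the helper slurmwatch_env is hypothetical, and the example value "5" is an assumption, since the accepted formats and units are not stated here (check slurmwatch --help).

```python
import os
import subprocess
from typing import Mapping, Optional


def slurmwatch_env(overrides: Mapping[str, str],
                   base: Optional[Mapping[str, str]] = None) -> dict[str, str]:
    """Return a copy of the environment with SLURMWATCH_* overrides applied."""
    env = dict(os.environ if base is None else base)
    for name, value in overrides.items():
        if not name.startswith("SLURMWATCH_"):
            raise ValueError(f"not a SLURMWATCH_* variable: {name}")
        env[name] = str(value)
    return env


if __name__ == "__main__":
    # "5" is a placeholder value; the real unit and format are not documented here.
    env = slurmwatch_env({"SLURMWATCH_POLL_INTERVAL": "5"})
    subprocess.run(["slurmwatch", "12345", "--once", "--json"], env=env)
```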

Library

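import asyncio
from slurmwatch import TelemetryCollector, resolve_job_context

async def sample(job_id: str):
    # Resolve the job ID into a job context and build a collector for it.
    collector = TelemetryCollector(resolve_job_context(job_id))
    await collector.start()
    try:
        # Wait for the next snapshot and print it as JSON.
        print((await collector.next_snapshot()).to_json())
    finally:
        # Always stop the collector, even if sampling fails.
        await collector.stop()

asyncio.run(sample("12345"))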
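
The same start / next_snapshot / stop pattern extends to taking several snapshots in a row. The helper below is a sketch, not part of slurmwatch: take_snapshots is a hypothetical name, it accepts any object exposing those three coroutine methods, and it assumes to_json() returns a string, as the print call above suggests.

```python
import asyncio


async def take_snapshots(collector, count: int) -> list[str]:
    """Start the collector, gather `count` JSON snapshots, then stop it."""
    await collector.start()
    lines: list[str] = []
    try:
        for _ in range(count):
            snapshot = await collector.next_snapshot()
            lines.append(snapshot.to_json())
    finally:
        # Stop the collector even if a snapshot raises.
        await collector.stop()
    return lines
```

With slurmwatch installed, this would be called as asyncio.run(take_snapshots(TelemetryCollector(resolve_job_context("12345")), 5)). Taking the collector as a parameter keeps the helper easy to exercise with a stub when no Slurm job is available.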

Limitations

  • NVIDIA-only GPU support (no AMD/ROCm).
  • Single-node view — multi-node jobs show data for the node you're on.
  • Live GPU utilization and working-set memory require running on the job's node.

License

MIT
