smtop
Live Slurm GPU and job monitor derived from slmtop, installed as a separate
package and command so the original package remains untouched.
Run smtop in a terminal to refresh continuously. Press q to quit.
Useful options:
- smtop --once: print one snapshot.
- smtop --free: only show nodes with free GPUs.
- smtop -d 5: refresh every five seconds.
- smtop -n 10: refresh ten times and exit.
- smtop --gpu-metrics local: show live nvidia-smi util, memory, power, temp, and fan data for the current node.
- smtop --gpu-metrics ssh: query Slurm GPU nodes with ssh <node> nvidia-smi ....
- smtop --gpu-interval 10: sample GPU telemetry every ten seconds while still refreshing Slurm state at the main delay.
- smtop --no-ssh-unlock: disable helper jobs and use plain ssh telemetry only.
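The GPU telemetry options rely on nvidia-smi output. As a rough illustration of what such sampling could look like, the sketch below builds an nvidia-smi query and parses its CSV output into one record per GPU. The field selection, the helper names nvidia_smi_argv and parse_gpu_csv, and the sample text are assumptions for illustration, not smtop's actual implementation.

```python
import csv
import io

# Real nvidia-smi query fields; which ones smtop requests is an assumption.
QUERY_FIELDS = [
    "utilization.gpu", "memory.used", "memory.total",
    "power.draw", "temperature.gpu", "fan.speed",
]


def nvidia_smi_argv():
    """Build an nvidia-smi command line that emits one CSV row per GPU."""
    return [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]


def parse_gpu_csv(text):
    """Parse nvidia-smi CSV output into a list of dicts, one per GPU."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not record:
            continue  # skip blank lines
        values = {}
        for name, raw in zip(QUERY_FIELDS, record):
            try:
                values[name] = float(raw)
            except ValueError:
                values[name] = None  # non-numeric, e.g. "[N/A]" fan speed
        rows.append(values)
    return rows


# Hypothetical two-GPU sample in the format requested above.
sample = (
    "35, 10240, 81920, 215.30, 61, [N/A]\n"
    "90, 70000, 81920, 390.10, 74, [N/A]\n"
)
gpus = parse_gpu_csv(sample)
```

For the ssh mode, the same argument list could be appended to an ssh <node> command; the parsing step would stay the same.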
The main table is node-level: each row aggregates all GPUs on one Slurm node. GPU utilization is averaged, GPU memory and power are summed, and temperature is the hottest GPU on that node.
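The aggregation rule above can be sketched in a few lines. This is a minimal illustration, assuming each GPU sample is a dict with the hypothetical keys util_pct, mem_used_mib, power_w, and temp_c; smtop's internal data structures may differ.

```python
from statistics import mean


def aggregate_node(gpus):
    """Collapse per-GPU samples into one node-level row.

    Utilization is averaged, memory and power are summed, and the
    temperature is the hottest GPU, as described for the node table.
    """
    if not gpus:
        raise ValueError("need at least one GPU sample")
    return {
        "util_pct": mean(g["util_pct"] for g in gpus),
        "mem_used_mib": sum(g["mem_used_mib"] for g in gpus),
        "power_w": sum(g["power_w"] for g in gpus),
        "temp_c": max(g["temp_c"] for g in gpus),
    }


# Hypothetical samples for a two-GPU node.
node_row = aggregate_node([
    {"util_pct": 40, "mem_used_mib": 10000, "power_w": 200.0, "temp_c": 55},
    {"util_pct": 80, "mem_used_mib": 30000, "power_w": 300.0, "temp_c": 71},
])
```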
The live curses UI uses an nvitop-style layout with boxed node telemetry,
block bars for memory and utilization, cluster resource bars, and a boxed job
queue.
Interactive controls:
- Up/down arrows select node or job rows; Tab switches between the node and job panels.
- Mouse clicks select visible node or job rows; mouse wheel moves selection within the panel under the cursor.
- Click outside selectable rows, or press c, to clear the current selection.
- Press Enter on a selected GPU node to open nvitop on that node over ssh; press q in nvitop to return to smtop.
- Press k on your own selected job to open a confirmation dialog for scancel.
By default, smtop uses ssh telemetry and, for nodes that reject ssh, submits
CPU-only helper jobs to establish persistent SSH master connections. Helper jobs
request one CPU, 100M memory, no GPU, and sleep for 15 minutes by default. They
are submitted in parallel. For unlocked nodes, smtop starts a persistent SSH
telemetry channel while the helper job is active, then cancels the helper job
after that node returns its first successful GPU telemetry sample. Use
--unlock-hold-jobs if your cluster needs the helper jobs to stay alive for
every telemetry refresh, or --unlock-keep-jobs to leave helper jobs running
after smtop exits. To avoid repeated helper churn on nodes that keep rejecting
or dropping telemetry, smtop submits at most three helper jobs per node per
run by default; change this with --unlock-max-attempts.
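To make the helper-job parameters concrete, here is a hedged sketch of how such a job and the per-node attempt cap could be expressed. The sbatch flags used are standard Slurm options, but the exact command smtop submits is not documented here; helper_sbatch_argv and AttemptBudget are hypothetical names.

```python
from collections import Counter

MAX_ATTEMPTS = 3  # documented default for --unlock-max-attempts


def helper_sbatch_argv(node, sleep_seconds=15 * 60):
    """Build an sbatch command for a CPU-only helper job pinned to one node.

    No GPU is requested, so none is allocated unless the cluster adds one
    by default (an assumption about the site configuration).
    """
    return [
        "sbatch",
        "--nodelist=" + node,
        "--cpus-per-task=1",
        "--mem=100M",
        "--wrap=sleep %d" % sleep_seconds,
    ]


class AttemptBudget:
    """Track helper submissions per node and refuse once the cap is hit."""

    def __init__(self, max_attempts=MAX_ATTEMPTS):
        self.max_attempts = max_attempts
        self.used = Counter()

    def try_acquire(self, node):
        if self.used[node] >= self.max_attempts:
            return False  # cap reached: do not submit another helper
        self.used[node] += 1
        return True


budget = AttemptBudget()
attempts = [budget.try_acquire("gpu01") for _ in range(4)]
```

In this sketch the fourth attempt for the same node is refused, while other nodes keep their own independent budget.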
Nodes that Slurm reports as unavailable, such as DOWN or NODE_FAIL, are not
unlocked or sampled; the node table reports the Slurm state instead.
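A state check of this kind might look like the sketch below. Only DOWN and NODE_FAIL are named in the text, so the set is limited to those two; the suffix handling reflects how sinfo can decorate state names (for example DOWN* or IDLE+DRAIN) and is an assumption about the input format, as are the function names.

```python
import re

# Only these two states are named in the description; others may apply.
UNAVAILABLE_STATES = {"DOWN", "NODE_FAIL"}


def base_state(state):
    """Strip sinfo-style decorations such as '*' or '+DRAIN' from a state."""
    head = state.split("+", 1)[0]
    return re.sub(r"[^A-Za-z_]+$", "", head).upper()


def should_sample(state):
    """Return True when the node is worth unlocking or sampling."""
    return base_state(state) not in UNAVAILABLE_STATES
```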
Unlocking retries SSH master setup for up to 60 seconds by default. The curses UI
keeps accepting q and r while telemetry and unlock work runs in the
background. If a node still cannot be sampled, the ERR column prefers unlock
diagnostics such as unlock submit denied, unlock PD Resources,
unlock timeout, unlock metric denied, or unlock nvidia missing over the
original access-denied message.
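The retry window and the ERR preference can be illustrated with a small sketch. The 60-second default comes from the text; the two-second interval, the injectable clock and sleep parameters, and the names retry_until and err_column are assumptions made so the logic can be shown and tested without real waiting.

```python
import time


def retry_until(attempt, timeout=60.0, interval=2.0,
                clock=time.monotonic, sleep=time.sleep):
    """Call attempt() until it returns True or the timeout would be exceeded."""
    deadline = clock() + timeout
    while True:
        if attempt():
            return True
        if clock() + interval > deadline:
            return False  # the next try would land past the deadline
        sleep(interval)


def err_column(original_error, unlock_diagnostic=None):
    """Prefer a specific unlock diagnostic over the original ssh error."""
    return unlock_diagnostic or original_error
```

A caller could, for instance, map a False result from retry_until to the unlock timeout diagnostic and pass it to err_column.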
When Enter opens nvitop for a GPU node, smtop temporarily suspends its curses
screen and starts ssh -tt <node> nvitop. Existing smtop SSH masters and
persistent telemetry channels stay open; they are only cleaned up when smtop
itself exits. Use --nvitop-command to override the remote command.
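The suspend-and-resume pattern described above is a common curses idiom. The sketch below shows one way to do it with the standard library; it assumes a Unix-like terminal, and nvitop_ssh_argv and run_outside_curses are hypothetical names rather than smtop's own functions.

```python
import curses
import subprocess


def nvitop_ssh_argv(node, remote_command="nvitop"):
    """Build the ssh command; -tt forces a remote pseudo-terminal."""
    return ["ssh", "-tt", node, remote_command]


def run_outside_curses(stdscr, argv):
    """Temporarily leave curses mode, run a full-screen child, then return."""
    curses.def_prog_mode()  # remember the current curses terminal modes
    curses.endwin()         # hand the terminal back in normal mode
    try:
        return subprocess.call(argv)
    finally:
        curses.reset_prog_mode()  # restore the saved curses modes
        stdscr.refresh()          # repaint the suspended screen
```

Passing a different remote_command mirrors what --nvitop-command is described as doing, for example nvitop_ssh_argv("gpu01", "nvidia-smi").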
File details
Details for the file smtop-0.10.7.tar.gz.
File metadata
- Download URL: smtop-0.10.7.tar.gz
- Upload date:
- Size: 31.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.20
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 00609270c156ea8ecc692ea4327d0526f07b91567e0ac60e24064acc199a6aab |
| MD5 | 9c3eaeb014627b74d12ab0ddb2504f02 |
| BLAKE2b-256 | 7645c0342d07c1b8c037702c4286e7015bc5110f36d162923027261d88fefc74 |
File details
Details for the file smtop-0.10.7-py3-none-any.whl.
File metadata
- Download URL: smtop-0.10.7-py3-none-any.whl
- Upload date:
- Size: 32.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.20
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 48d0ddd57ef4d7471c65a690dfed6ca33b8eaf34b87ed842ac87b068e3e651aa |
| MD5 | 9f4ac77f536479dedce883c999c2b188 |
| BLAKE2b-256 | bcfa990253e3f39cff8d20fd819f8dcf7f356e1cd1ef5f8ee40808e4599bdda5 |