Live Slurm GPU and job monitor.

smtop

smtop is a live Slurm GPU and job monitor derived from slmtop. It is installed as a separate package and command, so the original slmtop package remains untouched.

Run smtop in a terminal to refresh continuously. Press q to quit.

Useful options:

  • smtop --once: print one snapshot.
  • smtop --free: only show nodes with free GPUs.
  • smtop -d 5: refresh every five seconds.
  • smtop -n 10: refresh ten times and exit.
  • smtop --gpu-metrics local: show live nvidia-smi util, memory, power, temp, and fan data for the current node.
  • smtop --gpu-metrics ssh: query Slurm GPU nodes with ssh <node> nvidia-smi ....
  • smtop --gpu-interval 10: sample GPU telemetry every ten seconds while still refreshing Slurm state at the main delay.
  • smtop --no-ssh-unlock: disable helper jobs and use plain ssh telemetry only.
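
Both --gpu-metrics modes listed above come down to running nvidia-smi, either on the current node or through ssh. The following is a minimal sketch of that idea, not smtop's actual implementation: the function names, the selection of query fields, and the parsing rules are assumptions made for illustration. Only the --query-gpu and --format options are standard nvidia-smi options.

```python
import shlex

# Fields requested from nvidia-smi. This selection is an assumption chosen to
# match the metrics named in the option list (util, memory, power, temp, fan).
QUERY_FIELDS = [
    "utilization.gpu",
    "memory.used",
    "memory.total",
    "power.draw",
    "temperature.gpu",
    "fan.speed",
]


def build_query_command(node=None):
    """Return an argv list; wrap the query in ssh when a node name is given."""
    smi = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    if node is None:
        return smi
    # ssh takes the remote command as a single string argument here.
    return ["ssh", node, shlex.join(smi)]


def parse_query_output(text):
    """Parse CSV rows into one dict per GPU; non-numeric values become None."""
    gpus = []
    for line in text.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        if len(values) != len(QUERY_FIELDS):
            continue  # skip rows that do not match the requested fields
        row = {}
        for name, value in zip(QUERY_FIELDS, values):
            try:
                row[name] = float(value)
            except ValueError:
                row[name] = None  # for example "[N/A]" on GPUs without a fan
        gpus.append(row)
    return gpus
```

The output of `build_query_command("some-node")` could be passed to `subprocess.run(..., capture_output=True, text=True)` and its stdout handed to `parse_query_output`; the node name is a placeholder.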

The main table is node-level: each row aggregates all GPUs on one Slurm node. GPU utilization is averaged, GPU memory and power are summed, and temperature is the hottest GPU on that node.
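
The aggregation rules above can be sketched as follows. This is an illustrative sketch only: the GpuSample type, its field names, and the units are assumptions, not smtop's internal data model.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class GpuSample:
    """One GPU's telemetry (hypothetical structure for illustration)."""
    util_pct: float
    mem_used_mib: float
    power_w: float
    temp_c: float


def aggregate_node(samples):
    """Collapse per-GPU samples into one node-level row.

    Utilization is averaged, memory and power are summed, and the
    temperature is that of the hottest GPU.
    """
    if not samples:
        return None  # nothing sampled for this node
    return {
        "util_pct": mean(s.util_pct for s in samples),
        "mem_used_mib": sum(s.mem_used_mib for s in samples),
        "power_w": sum(s.power_w for s in samples),
        "temp_c": max(s.temp_c for s in samples),
    }
```

Averaging utilization keeps the value on a 0 to 100 scale regardless of GPU count, while summed memory and power reflect the node's total load.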

The live curses UI uses an nvitop-style layout with boxed node telemetry, block bars for memory and utilization, cluster resource bars, and a boxed job queue.

Interactive controls:

  • Up/down arrows select node or job rows; Tab switches between the node and job panels.
  • Mouse clicks select visible node or job rows; mouse wheel moves selection within the panel under the cursor.
  • Click outside selectable rows, or press c, to clear the current selection.
  • Press Enter on a selected GPU node to open nvitop on that node over ssh; press q in nvitop to return to smtop.
  • Press k on your own selected job to open a confirmation dialog for scancel.

By default, smtop uses ssh telemetry and, for nodes that reject ssh, submits CPU-only helper jobs to establish persistent SSH master connections. Helper jobs are submitted in parallel; each requests one CPU, 100M of memory, and no GPU, and sleeps for 15 minutes by default. For unlocked nodes, smtop starts a persistent SSH telemetry channel while the helper job is active, then cancels the helper job after that node returns its first successful GPU telemetry sample.

Use --unlock-hold-jobs if your cluster needs the helper jobs to stay alive for every telemetry refresh, or --unlock-keep-jobs to leave helper jobs running after smtop exits. To avoid repeated helper churn on nodes that keep rejecting or dropping telemetry, smtop submits at most three helper jobs per node per run by default; change this with --unlock-max-attempts.
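
As a rough sketch of the helper-job behavior described above, the code below builds an sbatch command with the stated resource requests and enforces a per-node attempt cap. How smtop actually submits its helper jobs is not documented here, so the function and class names and the exact choice of sbatch flags are assumptions; the flags themselves (--nodelist, --cpus-per-task, --mem, --wrap) are standard sbatch options.

```python
from collections import Counter


def build_helper_command(node, sleep_seconds=15 * 60):
    """Build an sbatch argv for a CPU-only helper job pinned to one node.

    No GPU is requested, so the job should only consume one CPU and a
    small amount of memory while it sleeps.
    """
    return [
        "sbatch",
        "--nodelist=" + node,
        "--cpus-per-task=1",
        "--mem=100M",
        "--wrap=sleep %d" % sleep_seconds,
    ]


class UnlockBudget:
    """Track helper submissions so each node gets at most max_attempts per run."""

    def __init__(self, max_attempts=3):
        self.max_attempts = max_attempts
        self._attempts = Counter()

    def try_acquire(self, node):
        """Return True and record an attempt if the node still has budget."""
        if self._attempts[node] >= self.max_attempts:
            return False
        self._attempts[node] += 1
        return True
```

A caller would check `budget.try_acquire(node)` before each submission and skip the node once it returns False, which is one way to avoid the repeated helper churn mentioned above.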

Nodes that Slurm reports as unavailable, such as DOWN or NODE_FAIL, are not unlocked or sampled; the node table reports the Slurm state instead.
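
A filter of this kind might look like the sketch below. The set of states is limited to the two named above and is illustrative only; the function name and the normalization rules are assumptions. The normalization accounts for the fact that sinfo can append flag characters to a state (for example `*` for a node that is not responding) and that compound states can be joined with `+`.

```python
import re

# Only the two states named in the text; a real tool may treat more states
# as unavailable.
UNAVAILABLE_STATES = {"DOWN", "NODE_FAIL"}


def is_unavailable(state):
    """Return True if any component of a Slurm state string is unavailable."""
    for part in state.upper().split("+"):
        # Drop trailing flag characters such as "*" before comparing.
        base = re.sub(r"[^A-Z_]+$", "", part)
        if base in UNAVAILABLE_STATES:
            return True
    return False
```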

Unlocking retries SSH master setup for up to 60 seconds by default. The curses UI keeps accepting q and r while telemetry and unlock work runs in the background. If a node still cannot be sampled, the ERR column prefers unlock diagnostics, such as "unlock submit denied", "unlock PD Resources", "unlock timeout", "unlock metric denied", or "unlock nvidia missing", over the original access-denied message.
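
The retry window and the ERR preference can be illustrated with a small sketch. The function names, the two-second retry interval, and the injectable clock are assumptions for illustration; only the 60-second default and the preference for unlock diagnostics come from the description above.

```python
import time


def retry_until(attempt, timeout=60.0, interval=2.0,
                clock=time.monotonic, sleep=time.sleep):
    """Call attempt() until it returns True or the timeout has elapsed."""
    deadline = clock() + timeout
    while True:
        if attempt():
            return True
        if clock() >= deadline:
            return False  # give up; the caller can report a timeout
        sleep(interval)


def pick_error(unlock_diagnostic, original_error):
    """Prefer the unlock diagnostic when one exists, else the original error."""
    return unlock_diagnostic or original_error
```

Passing `clock` and `sleep` as parameters keeps the loop testable without real waiting. In a curses program such a loop would run in a background thread so that key handling stays responsive.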

When Enter opens nvitop for a GPU node, smtop temporarily suspends its curses screen and starts ssh -tt <node> nvitop. Existing smtop SSH masters and persistent telemetry channels stay open; they are only cleaned up when smtop itself exits. Use --nvitop-command to override the remote command.
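
Suspending a curses screen around an external command is a common pattern, sketched below with the standard curses and subprocess modules. This is not smtop's code: the function names are hypothetical, and splitting the remote command with shlex is an assumption about how an override such as --nvitop-command might be handled.

```python
import curses
import shlex
import subprocess


def build_nvitop_command(node, remote_command="nvitop"):
    """Build the ssh argv; -tt forces a remote TTY for the full-screen UI."""
    return ["ssh", "-tt", node] + shlex.split(remote_command)


def run_remote_viewer(stdscr, node, remote_command="nvitop"):
    """Leave curses mode, run the remote viewer, then restore the screen."""
    curses.def_prog_mode()  # remember the current curses terminal settings
    curses.endwin()         # hand the terminal back to normal mode
    try:
        return subprocess.call(build_nvitop_command(node, remote_command))
    finally:
        curses.reset_prog_mode()  # restore the saved curses settings
        stdscr.refresh()          # redraw the suspended screen
```

The try/finally ensures the curses screen is restored even if the ssh command fails to start.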

Source Distribution

smtop-0.10.7.tar.gz (31.4 kB)

Uploaded Source

Built Distribution

smtop-0.10.7-py3-none-any.whl (32.3 kB)

Uploaded Python 3

File details

Details for the file smtop-0.10.7.tar.gz.

File metadata

  • Download URL: smtop-0.10.7.tar.gz
  • Size: 31.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.20

File hashes

Hashes for smtop-0.10.7.tar.gz
Algorithm Hash digest
SHA256 00609270c156ea8ecc692ea4327d0526f07b91567e0ac60e24064acc199a6aab
MD5 9c3eaeb014627b74d12ab0ddb2504f02
BLAKE2b-256 7645c0342d07c1b8c037702c4286e7015bc5110f36d162923027261d88fefc74

File details

Details for the file smtop-0.10.7-py3-none-any.whl.

File metadata

  • Download URL: smtop-0.10.7-py3-none-any.whl
  • Size: 32.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.20

File hashes

Hashes for smtop-0.10.7-py3-none-any.whl
Algorithm Hash digest
SHA256 48d0ddd57ef4d7471c65a690dfed6ca33b8eaf34b87ed842ac87b068e3e651aa
MD5 9f4ac77f536479dedce883c999c2b188
BLAKE2b-256 bcfa990253e3f39cff8d20fd819f8dcf7f356e1cd1ef5f8ee40808e4599bdda5
