Live Slurm GPU and job monitor.

smtop

smtop is a live Slurm GPU and job monitor derived from slmtop. It is installed as a separate package and command, so the original slmtop package remains untouched.

Run smtop in a terminal to refresh continuously. Press q to quit.

Useful options:

  • smtop --once: print one snapshot.
  • smtop --free: only show nodes with free GPUs.
  • smtop -d 5: refresh every five seconds.
  • smtop -n 10: refresh ten times and exit.
  • smtop --gpu-metrics local: show live nvidia-smi util, memory, power, temp, and fan data for the current node.
  • smtop --gpu-metrics ssh: query Slurm GPU nodes by running ssh <node> nvidia-smi ... on each node.
  • smtop --gpu-interval 10: sample GPU telemetry every ten seconds while still refreshing Slurm state at the main delay.
  • smtop --no-ssh-unlock: disable helper jobs and use plain ssh telemetry only.
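
As an illustration of the kind of data the --gpu-metrics modes collect, the sketch below builds an nvidia-smi query for utilization, memory, power, temperature, and fan speed, then parses its CSV output. It is not smtop's actual code: the function names and the exact field selection are assumptions made for this example, and only the nvidia-smi query options themselves are standard.

```python
from typing import Optional

# nvidia-smi query fields; this particular selection is an assumption
# that mirrors the metrics listed in the options above.
QUERY_FIELDS = [
    "utilization.gpu", "memory.used", "memory.total",
    "power.draw", "temperature.gpu", "fan.speed",
]


def nvidia_smi_argv() -> list[str]:
    """Return an argv that prints one CSV row per GPU, without header or units."""
    return [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]


def _to_float(cell: str) -> Optional[float]:
    # Values such as "[N/A]" (e.g. fan speed on passively cooled GPUs)
    # are not numeric; report them as None.
    try:
        return float(cell)
    except ValueError:
        return None


def parse_gpu_csv(text: str) -> list[dict[str, Optional[float]]]:
    """Parse nvidia-smi CSV output into one dict per GPU."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        cells = [c.strip() for c in line.split(",")]
        if len(cells) != len(QUERY_FIELDS):
            continue  # skip malformed lines instead of guessing
        rows.append({name: _to_float(c) for name, c in zip(QUERY_FIELDS, cells)})
    return rows
```

For local metrics the argv could be passed to subprocess.run directly; for ssh metrics it would be appended after ssh <node>.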

The main table is node-level: each row aggregates all GPUs on one Slurm node. GPU utilization is averaged across the node's GPUs, GPU memory and power are summed, and the temperature shown is that of the hottest GPU on the node.
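
The aggregation rule can be sketched in a few lines. This is a minimal illustration of the rule as described, not smtop's implementation; the GpuSample and NodeRow types and their field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class GpuSample:
    """One GPU's telemetry (hypothetical field names)."""
    util_pct: float
    mem_used_mib: float
    power_w: float
    temp_c: float


@dataclass
class NodeRow:
    """One row of the node-level table (hypothetical field names)."""
    gpus: int
    util_pct: float      # mean over GPUs
    mem_used_mib: float  # sum over GPUs
    power_w: float       # sum over GPUs
    temp_c: float        # hottest GPU


def aggregate_node(samples: list[GpuSample]) -> NodeRow:
    if not samples:
        raise ValueError("need at least one GPU sample")
    n = len(samples)
    return NodeRow(
        gpus=n,
        util_pct=sum(s.util_pct for s in samples) / n,
        mem_used_mib=sum(s.mem_used_mib for s in samples),
        power_w=sum(s.power_w for s in samples),
        temp_c=max(s.temp_c for s in samples),
    )
```

For example, a node with one GPU at 100% utilization and 70 °C and another idle at 55 °C would show 50% utilization and 70 °C.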

The live curses UI uses an nvitop-style layout with boxed node telemetry, block bars for memory and utilization, cluster resource bars, and a boxed job queue.

Interactive controls:

  • Up/down arrows select node or job rows; Tab switches between the node and job panels.
  • Mouse clicks select visible node or job rows; mouse wheel moves selection within the panel under the cursor.
  • Click outside selectable rows, or press c, to clear the current selection.
  • Press Enter on a selected GPU node to open nvitop on that node over ssh; press q in nvitop to return to smtop.
  • Press k on your own selected job to open a confirmation dialog for scancel.

By default, smtop uses ssh telemetry and, for nodes that reject ssh, submits CPU-only helper jobs to establish persistent SSH master connections. Helper jobs request one CPU, 100M memory, no GPU, and sleep for 15 minutes by default. They are submitted in parallel.

For unlocked nodes, smtop starts a persistent SSH telemetry channel while the helper job is active, then cancels the helper job after that node returns its first successful GPU telemetry sample. Use --unlock-hold-jobs if your cluster needs the helper jobs to stay alive for every telemetry refresh, or --unlock-keep-jobs to leave helper jobs running after smtop exits.

To avoid repeated helper churn on nodes that keep rejecting or dropping telemetry, smtop submits at most three helper jobs per node per run by default; change this with --unlock-max-attempts.
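
A helper job with these defaults might be submitted roughly as in the sketch below. The exact sbatch command line smtop uses is not documented here, so the argv is an assumption built from standard sbatch options; helper_sbatch_argv and AttemptBudget are hypothetical names for this example.

```python
from collections import Counter


def helper_sbatch_argv(node: str, sleep_seconds: int = 15 * 60) -> list[str]:
    """Build an sbatch argv for a CPU-only placeholder job pinned to one node.

    Assumption: these are standard sbatch options chosen to match the
    documented defaults, not necessarily the flags smtop itself passes.
    """
    return [
        "sbatch",
        "--parsable",               # print only the job id
        f"--nodelist={node}",       # run on the node to be unlocked
        "--ntasks=1",
        "--cpus-per-task=1",        # one CPU
        "--mem=100M",               # 100M memory
        # no GPU is requested: no --gres or --gpus option is given
        f"--wrap=sleep {sleep_seconds}",
    ]


class AttemptBudget:
    """Cap the number of helper submissions per node within one run."""

    def __init__(self, max_attempts: int = 3) -> None:
        self.max_attempts = max_attempts
        self._used: Counter = Counter()

    def try_acquire(self, node: str) -> bool:
        """Return True and count the attempt if the node still has budget."""
        if self._used[node] >= self.max_attempts:
            return False
        self._used[node] += 1
        return True
```

A caller would check try_acquire(node) before each submission and skip nodes that have used up their budget, which corresponds to the role of --unlock-max-attempts.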

Nodes that Slurm reports as unavailable, such as DOWN or NODE_FAIL, are not unlocked or sampled; the node table reports the Slurm state instead.
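
A filter of this kind can be sketched as follows. The set of unavailable states below contains only the two examples named above and is an assumption; the real list smtop uses may be longer. The helper names are hypothetical. The parsing accounts for the way Slurm tools decorate states, such as a trailing * for non-responding nodes or +-joined flags like DOWN+DRAIN.

```python
import re

# Assumption: only the two states named in the text; the actual set may differ.
UNAVAILABLE_STATES = frozenset({"DOWN", "NODE_FAIL"})


def state_tokens(state: str) -> set[str]:
    """Split a Slurm node state such as 'down*' or 'DOWN+DRAIN' into bare tokens."""
    parts = state.upper().split("+")
    # Drop marker characters such as '*' that Slurm appends to a state.
    return {re.sub(r"[^A-Z_]", "", part) for part in parts} - {""}


def should_sample(state: str) -> bool:
    """Return False for nodes whose state marks them as unavailable."""
    return UNAVAILABLE_STATES.isdisjoint(state_tokens(state))
```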

Unlocking retries SSH master setup for up to 60 seconds by default. The curses UI keeps accepting q and r while telemetry and unlock work runs in the background. If a node still cannot be sampled, the ERR column prefers unlock diagnostics such as unlock submit denied, unlock PD Resources, unlock timeout, unlock metric denied, or unlock nvidia missing over the original access-denied message.
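
The retry-with-deadline behavior can be illustrated with a small generic loop. This is a sketch under assumptions: the pause between attempts is a made-up value, retry_until is a hypothetical name, and attempt stands in for whatever smtop does to set up an SSH master. The clock and sleep functions are injectable so the loop can be exercised without waiting.

```python
import time
from typing import Callable


def retry_until(
    attempt: Callable[[], bool],
    timeout: float = 60.0,   # matches the documented 60-second default
    pause: float = 2.0,      # assumption: delay between attempts
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call attempt() until it succeeds or the timeout has elapsed."""
    deadline = clock() + timeout
    while True:
        if attempt():
            return True
        remaining = deadline - clock()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        sleep(min(pause, remaining))
```

Running such a loop in a background thread, rather than in the UI loop, is one way a curses interface can keep responding to keys while unlock work is in progress.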

When Enter opens nvitop for a GPU node, smtop temporarily suspends its curses screen and starts ssh -tt <node> bash -lc '<nvitop command>'. At startup, smtop resolves the default nvitop executable and terminfo paths from the local environment and stores those values for later Enter launches, which works well on clusters with shared home directories or shared conda paths. A small transition screen hides the shell while ssh and remote nvitop initialize. Existing smtop SSH masters and persistent telemetry channels stay open; they are only cleaned up when smtop itself exits. Use --nvitop-command to override the remote command or --nvitop-term to override the exported terminal type.
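
The launch command described above can be assembled as in this sketch. It only illustrates the quoting involved; nvitop_ssh_argv is a hypothetical helper, and the way the terminal type is exported is an assumption about what --nvitop-term might translate to.

```python
import shlex
from typing import Optional


def nvitop_ssh_argv(
    node: str,
    nvitop_command: str = "nvitop",
    term: Optional[str] = None,
) -> list[str]:
    """Build an argv equivalent to: ssh -tt <node> bash -lc '<nvitop command>'."""
    remote = nvitop_command
    if term:
        # Assumption: the terminal type is exported before nvitop starts.
        remote = f"export TERM={shlex.quote(term)}; {remote}"
    # ssh joins the trailing arguments with spaces and hands the result to the
    # remote shell, so the bash -lc payload must be quoted as a single word.
    return ["ssh", "-tt", node, "bash", "-lc", shlex.quote(remote)]
```

A curses program would typically call curses.endwin() before running this command with subprocess, then refresh its screen once the remote nvitop exits.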
