Live Slurm GPU and job monitor.

smtop

smtop is a live Slurm GPU and job monitor derived from slmtop. It is installed as a separate package and command, so the original slmtop package remains untouched.

Run smtop in a terminal for a continuously refreshing view. Press q to quit.

Useful options:

  • smtop --once: print one snapshot.
  • smtop --free: only show nodes with free GPUs.
  • smtop -d 5: refresh every five seconds.
  • smtop -n 10: refresh ten times and exit.
  • smtop --gpu-metrics local: show live nvidia-smi util, memory, power, temp, and fan data for the current node.
  • smtop --gpu-metrics ssh: query Slurm GPU nodes with ssh <node> nvidia-smi ....
  • smtop --gpu-interval 10: sample GPU telemetry every ten seconds while still refreshing Slurm state at the main delay.
  • smtop --no-ssh-unlock: disable helper jobs and use plain ssh telemetry only.
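The relationship between -d and --gpu-interval can be pictured as two timers sharing one loop. The following is a minimal sketch, not smtop's actual scheduler: the function name plan_ticks and its parameters are hypothetical, and it assumes that a GPU sample is taken on any refresh where at least the GPU interval has elapsed since the previous sample.

```python
def plan_ticks(delay, gpu_interval, refreshes):
    """Return (time, sample_gpu) pairs for a fixed number of refreshes.

    Slurm state is refreshed on every tick. GPU telemetry is sampled only
    when at least gpu_interval seconds have passed since the last sample.
    """
    plan = []
    last_gpu = None  # time of the most recent GPU sample, None before the first
    for i in range(refreshes):
        now = i * delay
        sample_gpu = last_gpu is None or now - last_gpu >= gpu_interval
        if sample_gpu:
            last_gpu = now
        plan.append((now, sample_gpu))
    return plan


# Roughly what "smtop -d 5 -n 5 --gpu-interval 10" would imply under this assumption.
print(plan_ticks(delay=5, gpu_interval=10, refreshes=5))
# [(0, True), (5, False), (10, True), (15, False), (20, True)]
```

Under this reading, a GPU interval shorter than the main delay has no extra effect, because sampling can only happen on a refresh tick.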

The main table is node-level: each row aggregates all GPUs on one Slurm node. GPU utilization is averaged, GPU memory and power are summed, and temperature is the hottest GPU on that node.
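The aggregation rules above can be sketched in a few lines. This is an illustration only: the GpuSample type, its field names, and the units are assumptions made for the example, not smtop's internal data model.

```python
from dataclasses import dataclass


@dataclass
class GpuSample:
    """One GPU's telemetry (hypothetical field names and units)."""
    util_pct: float
    mem_used_mib: float
    power_w: float
    temp_c: float


def aggregate_node(gpus):
    """Collapse per-GPU samples into one node-level row.

    Utilization is averaged, memory and power are summed, and the
    temperature is that of the hottest GPU.
    """
    if not gpus:
        return None  # no telemetry for this node
    return {
        "util_pct": sum(g.util_pct for g in gpus) / len(gpus),
        "mem_used_mib": sum(g.mem_used_mib for g in gpus),
        "power_w": sum(g.power_w for g in gpus),
        "temp_c": max(g.temp_c for g in gpus),
    }


row = aggregate_node([
    GpuSample(util_pct=90.0, mem_used_mib=40000.0, power_w=300.0, temp_c=71.0),
    GpuSample(util_pct=10.0, mem_used_mib=2000.0, power_w=80.0, temp_c=45.0),
])
print(row)
# {'util_pct': 50.0, 'mem_used_mib': 42000.0, 'power_w': 380.0, 'temp_c': 71.0}
```

Note that an averaged utilization of 50% can hide one saturated GPU next to an idle one, which is why the row reports the maximum temperature rather than the mean.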

The live curses UI uses an nvitop-style layout with boxed node telemetry, block bars for memory and utilization, cluster resource bars, and a boxed job queue.

Interactive controls:

  • Up/down arrows select node or job rows; Tab switches between the node and job panels.
  • Mouse clicks select visible node or job rows; mouse wheel moves selection within the panel under the cursor.
  • Click outside selectable rows, or press c, to clear the current selection.
  • Press k on your own selected job to open a confirmation dialog for scancel.

By default, smtop uses ssh telemetry and, for nodes that reject ssh, submits CPU-only helper jobs to establish persistent SSH master connections. Helper jobs request one CPU, 100M memory, no GPU, and sleep for 15 minutes by default. They are submitted in parallel.

For unlocked nodes, smtop starts a persistent SSH telemetry channel while the helper job is active, then cancels the helper job after that node returns its first successful GPU telemetry sample. Use --unlock-hold-jobs if your cluster needs the helper jobs to stay alive for every telemetry refresh, or --unlock-keep-jobs to leave helper jobs running after smtop exits.

To avoid repeated helper churn on nodes that keep rejecting or dropping telemetry, smtop submits at most three helper jobs per node per run by default; change this with --unlock-max-attempts.
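The per-node limit on helper jobs amounts to a small attempt budget. A minimal sketch, assuming a hypothetical UnlockBudget class (the name and interface are illustrative, not taken from smtop's source); the default of three attempts mirrors the documented default for --unlock-max-attempts.

```python
from collections import Counter


class UnlockBudget:
    """Track helper-job submissions per node for one run (hypothetical helper)."""

    def __init__(self, max_attempts=3):
        self.max_attempts = max_attempts
        self.attempts = Counter()  # node name -> helper jobs submitted so far

    def try_consume(self, node):
        """Record one submission for node; return False once the budget is spent."""
        if self.attempts[node] >= self.max_attempts:
            return False
        self.attempts[node] += 1
        return True


budget = UnlockBudget()
print([budget.try_consume("gpu01") for _ in range(4)])
# [True, True, True, False]
print(budget.try_consume("gpu02"))
# True
```

Each node has its own counter, so one node that keeps dropping telemetry does not use up the attempts available to the others.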

Nodes that Slurm reports as unavailable, such as DOWN or NODE_FAIL, are not unlocked or sampled; the node table reports the Slurm state instead.
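A sketch of such a filter follows. Only DOWN and NODE_FAIL come from the text above; the set may be incomplete. The function names are hypothetical, and the handling of compound states (for example DOWN+DRAIN) and trailing flag characters (for example down*) is an assumption about how Slurm tools commonly print node states, not a description of smtop's parsing.

```python
# States named in the documentation; a real deployment may need more.
UNAVAILABLE_STATES = {"DOWN", "NODE_FAIL"}

# Trailing characters that Slurm tools may append to a state (assumed list).
_FLAG_SUFFIXES = "*~#!%$@^-"


def state_parts(state):
    """Split a state such as 'DOWN+DRAIN' or 'down*' into normalized parts."""
    return {part.rstrip(_FLAG_SUFFIXES) for part in state.upper().split("+")}


def should_sample(state):
    """Return True if the node may be unlocked and sampled."""
    return UNAVAILABLE_STATES.isdisjoint(state_parts(state))


for s in ["IDLE", "MIXED+DRAIN", "DOWN", "down*", "NODE_FAIL"]:
    print(s, should_sample(s))
# IDLE True
# MIXED+DRAIN True
# DOWN False
# down* False
# NODE_FAIL False
```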

Unlocking retries SSH master setup for up to 60 seconds by default. The curses UI keeps accepting q and r while telemetry and unlock work runs in the background. If a node still cannot be sampled, the ERR column prefers unlock diagnostics such as unlock submit denied, unlock PD Resources, unlock timeout, unlock metric denied, or unlock nvidia missing over the original access-denied message.
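The retry window and the ERR preference can be sketched as two small functions. Both names are hypothetical, and the two-second pause between attempts is an assumption (the text only states the 60-second limit). The clock and sleep parameters are injectable so the loop can be exercised without waiting.

```python
import time


def retry_until(attempt, timeout=60.0, pause=2.0,
                clock=time.monotonic, sleep=time.sleep):
    """Call attempt() until it returns True or the timeout window closes.

    Returns True on success, False if another pause would pass the deadline.
    """
    deadline = clock() + timeout
    while True:
        if attempt():
            return True
        if clock() + pause > deadline:
            return False
        sleep(pause)


def err_text(access_error, unlock_error):
    """Prefer the unlock diagnostic over the original access error."""
    return unlock_error or access_error or ""


print(retry_until(lambda: True))
# True
print(err_text("access denied", "unlock timeout"))
# unlock timeout
print(err_text("access denied", None))
# access denied
```

In a curses program, a loop like this would run in a background thread or process so that key handling stays responsive, which matches the behavior described above.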
