
nvnodetop: NVIDIA Node Cluster Top


An nvtop-inspired, real-time GPU monitor for SLURM HPC clusters.
Monitor every GPU across all your running jobs, with utilisation bars, sparkline history, power draw, ECC errors and per-process detail, all from a single terminal window.



Features

  • 🖥️ Multi-node, multi-job: cycles through every node assigned to your running SLURM jobs
  • 📊 Rich GPU metrics: utilisation, memory (used/total), temperature, power draw/limit, SM & memory clock speeds
  • 📈 Sparkline history: rolling utilisation history plotted in-line with Unicode block characters
  • ⚡ Asynchronous polling: each node is polled in a dedicated background subprocess; the UI never blocks waiting for SSH
  • 🚨 Alert flags: thermal throttle (!THERM), power brake (!PWR), and uncorrected ECC errors are highlighted inline
  • 👤 Process table: per-GPU process list showing PID, username, command and GPU memory (toggle with p)
  • 📐 Responsive layout: bar widths adapt dynamically to the terminal width
  • ♻️ Graceful cleanup: all background SSH pollers and the temporary cache directory are cleaned up on exit

Requirements

Requirement               Notes
Bash ≥ 4.0                Required for associative arrays (declare -A)
SLURM (squeue, scontrol)  Must be available on the login node
SSH key auth              Passwordless SSH to compute nodes (e.g. via SLURM cluster config)
nvidia-smi                Must be installed on each compute node
python3                   Required on compute nodes for process-name/username resolution
tput / stty               Standard terminal utilities, available on virtually all Linux systems

Note: nvnodetop only needs to be installed on your login node or local machine. The compute-node side runs a one-liner nvidia-smi query and a tiny inline Python3 snippet over SSH; no remote installation is needed.
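
The per-node query is conceptually similar to the following; the field list here is illustrative, not the script's verbatim command:

# Illustrative nvidia-smi query (the actual fields used by nvnodetop may differ)
nvidia-smi --query-gpu=index,name,temperature.gpu,utilization.gpu,memory.used,memory.total,power.draw,power.limit,clocks.sm,clocks.mem --format=csv,noheader,nounits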


Installation

Via pip (recommended)

pip install nvnodetop

This places the nvnodetop command on your PATH.

Via pipx (isolated)

pipx installs the tool into an isolated environment and exposes the command globally, which makes it ideal for shared HPC environments.

pipx install nvnodetop

Manual install

# Clone
git clone https://github.com/whats2000/nvnodetop.git
cd nvnodetop

# Make executable and install to a directory on your PATH
chmod +x nvnodetop.sh
mkdir -p ~/.local/bin
cp nvnodetop.sh ~/.local/bin/nvnodetop

Usage

nvnodetop [FETCH_INTERVAL [DISPLAY_INTERVAL]]

Simply run nvnodetop from any terminal on your HPC login node:

nvnodetop          # defaults: poll every 3 s, refresh UI every 1 s
nvnodetop 5        # poll every 5 s, refresh UI every 1 s
nvnodetop 5 2      # poll every 5 s, refresh UI every 2 s

Arguments

Argument          Default  Description
FETCH_INTERVAL    3        Seconds between GPU data polls per node (SSH calls)
DISPLAY_INTERVAL  1        UI refresh rate in seconds

Key Bindings

Key        Action
↑ / k / K  Previous job
↓ / j / J  Next job
→ / > / .  Next node within current job
← / < / ,  Previous node within current job
p / P      Toggle process table
q / Q      Quit

Display Layout

  Job 12345678 my_train_job  job [1/2] ↑↓jobs  node [1/3] <>nodes  p procs  q quit  poll:3s disp:1s
  gpu-node-01                                                                         [stale 12s]
  ────────────────────────────────────────────────────────────────────────────────────────────────────────
  GPU  Name                Temp  Utilization         %  Memory      Used/Tot MiB  Power     SM/MemMHz  Util History  Flags
    0  NVIDIA A100-SXM4    52°C  ████████████░░░░  78%  █████████░  38012/40960   312/400W  1410/1593  ▄▄▅▆▇▇██▇▆▅▆
    1  NVIDIA A100-SXM4    48°C  ████████░░░░░░░░  50%  █████░░░░░  22016/40960   201/400W  1350/1500  ▃▄▄▅▅▄▅▅▄▅▅▄
  ╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌
  SUM  (2 GPUs)                 ██████████░░░░░░  64%  ███████░░░  60028/81920   513/800W

GPU Row

Field            Description
GPU              GPU index (0-based)
Name             GPU model name (truncated to 18 chars)
Temp             Core temperature in °C (cyan)
Utilisation bar  Coloured fill bar: green < 60%, yellow < 85%, red ≥ 85%
%                Numeric GPU compute utilisation
Memory bar       Same colour coding, based on memory percentage
Used/Tot MiB     Absolute memory consumption
Power            Current draw / TDP limit in watts
SM/MemMHz        Streaming Multiprocessor and memory clock speeds
Util History     Rolling sparkline of the last 20 utilisation samples
Flags            !THERM (thermal throttle), !PWR (power brake), ECC:N (ECC errors)
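
For intuition, mapping utilisation samples onto the eight Unicode block heights can be sketched in a few lines of Bash. This is a standalone illustration with a hypothetical spark function, not nvnodetop's actual code:

#!/usr/bin/env bash
# Sketch only: map 0-100 utilisation samples to eighth-block characters.
spark() {
  local blocks=(▁ ▂ ▃ ▄ ▅ ▆ ▇ █) out="" u
  for u in "$@"; do
    out+="${blocks[u * 7 / 100]}"   # scale 0..100 onto index 0..7
  done
  printf '%s\n' "$out"
}

spark 10 25 40 55 70 85 100 90 60   # prints ▁▂▃▄▅▆█▇▅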

Summary Row

Shows the average utilisation across all GPUs on the node (e.g. (78% + 50%) / 2 = 64% in the layout above), the total memory, and the total power draw.

Process Table

Toggle with p. Columns: GPU index, PID, username, command name (basename), GPU memory in MiB.


How It Works

Login Node                           Compute Nodes
┌─────────────────────┐              ┌──────────────────────┐
│ nvnodetop           │   SSH poll   │ nvidia-smi query     │
│  ├─ squeue --me ────┼─────────────►│ python3 proc resolve │
│  ├─ background      │◄─────────────┤ stdout → cache file  │
│  │   fetcher/node   │   CSV data   └──────────────────────┘
│  └─ UI render loop  │
└─────────────────────┘
  1. Job discovery: squeue --me --states=R is called every 30 seconds to find your running jobs and their assigned nodes.
  2. Background pollers: one _node_fetcher_loop subprocess is spawned per unique node. Each loop SSHs into the node, runs nvidia-smi for GPU metrics and an inline Python3 snippet for process info, then writes the result atomically to a temp file (using tmp + mv; see the sketch after this list).
  3. UI render loop: the main process reads the latest cached file, updates the sparkline history arrays (which must live in the main shell for persistence), renders the frame, then calls read_key with a timeout equal to DISPLAY_INTERVAL.
  4. Cleanup: a trap cleanup INT TERM EXIT ensures all background pollers are killed and the cache directory is removed on exit.
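
Steps 2 and 4 follow a common Bash pattern: background loops that write snapshots atomically, plus a trap for teardown. The sketch below is a minimal, self-contained illustration of that pattern; the names poll_node, CACHE_DIR and NODES are hypothetical, and the real script differs in detail:

#!/usr/bin/env bash
# Minimal sketch of the poller/cleanup pattern (not nvnodetop's source).

CACHE_DIR=$(mktemp -d)
FETCH_INTERVAL=${1:-3}
NODES=(gpu-node-01 gpu-node-02)   # nvnodetop derives these from squeue
PIDS=()

poll_node() {                     # one background loop per node
  local node=$1 out="$CACHE_DIR/$node.csv"
  while :; do
    # Write to a temp file, then mv atomically, so the UI loop never
    # reads a half-written snapshot.
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "$node" \
         nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total \
                    --format=csv,noheader,nounits >"$out.tmp" 2>/dev/null; then
      mv "$out.tmp" "$out"
    fi
    sleep "$FETCH_INTERVAL"
  done
}

cleanup() {                       # kill every poller, drop the cache dir
  local pid
  for pid in "${PIDS[@]}"; do kill "$pid" 2>/dev/null; done
  rm -rf "$CACHE_DIR"
}
trap cleanup INT TERM EXIT

for n in "${NODES[@]}"; do
  poll_node "$n" &
  PIDS+=("$!")
done

wait   # a real UI would render frames and read keys here instead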

Configuration

A small set of constants at the top of the script can be tweaked directly:

Variable               Default  Description
NODE_REFRESH_INTERVAL  30       Seconds between squeue calls to discover node changes
HISTORY_LEN            20       Number of historical samples kept per GPU for the sparkline

Overriding these via environment variables at invocation time is planned as a future enhancement; for now, edit the script directly.
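
Editing them means changing the assignments near the top of nvnodetop.sh, which look roughly like this (formatting is illustrative):

NODE_REFRESH_INTERVAL=30   # seconds between squeue calls
HISTORY_LEN=20             # sparkline samples kept per GPU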


Troubleshooting

No running SLURM jobs found

The script only displays nodes for your running jobs (squeue --me --states=R). Make sure you have at least one job in the R (Running) state.
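
You can check with:

squeue --me --states=R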

GPU data shows Waiting for first data…

On first launch, the background SSH poller needs one full FETCH_INTERVAL cycle to collect data. Wait a few seconds.

[stale Xs] warning

The cached data is more than 3× FETCH_INTERVAL old, which usually means the SSH connection to that node is slow or timing out. Check your SSH connectivity to the compute node.

SSH connection refused / hang

Ensure passwordless SSH (BatchMode=yes) is configured for the compute nodes. The script uses ConnectTimeout=5 to avoid hanging.
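
You can reproduce the script's connection settings by hand (replace <compute-node> with a node from one of your jobs):

ssh -o BatchMode=yes -o ConnectTimeout=5 <compute-node> nvidia-smi -L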

declare -A / mapfile errors

Your Bash version is older than 4.0. Update Bash:

# Check your current version
bash --version

# On macOS (system bash is 3.2)
brew install bash

Contributing

Contributions, bug reports and feature requests are welcome!

  1. Fork the repository
  2. Create a feature branch: git checkout -b feat/my-feature
  3. Commit your changes with a descriptive message
  4. Open a Pull Request

Please ensure your changes are tested against a real SLURM cluster or a mocked environment before submitting.


License

This project is licensed under the MIT License; see the LICENSE file for details.

