nvnodetop – NVIDIA Node Cluster Top
An nvtop-inspired, real-time GPU monitor for SLURM HPC clusters.
Monitor every GPU across all your running jobs – with utilisation bars, sparkline history, power draw, ECC errors and per-process detail – all from a single terminal window.
Table of Contents
- Features
- Requirements
- Installation
- Usage
- Display Layout
- How It Works
- Configuration
- Troubleshooting
- Contributing
- License
Features
- Multi-node, multi-job – cycles through every node assigned to your running SLURM jobs
- Rich GPU metrics – utilisation, memory (used/total), temperature, power draw/limit, SM & memory clock speeds
- Sparkline history – rolling utilisation history plotted in-line with Unicode block characters
- Asynchronous polling – each node is polled in a dedicated background subprocess; the UI never blocks waiting for SSH
- Alert flags – thermal throttle (!THERM), power brake (!PWR), and uncorrected ECC errors are highlighted inline
- Process table – per-GPU process list showing PID, username, command and GPU memory (toggle with p)
- Responsive layout – bar widths adapt dynamically to the terminal width
- Graceful cleanup – all background SSH pollers and the temporary cache directory are cleaned up on exit
Requirements
| Requirement | Notes |
|---|---|
| Bash ≥ 4.0 | Required for associative arrays (declare -A) |
| SLURM (squeue, scontrol) | Must be available on the login node |
| SSH key auth | Passwordless SSH to compute nodes (e.g. via SLURM cluster config) |
| nvidia-smi | Must be installed on each compute node |
| python3 | Required on compute nodes for process-name/username resolution |
| tput / stty | Standard terminal utilities, available on virtually all Linux systems |
Note: nvnodetop only needs to be installed on your login node or local machine. The compute-node side runs a one-liner nvidia-smi query and a tiny inline Python3 snippet over SSH; no remote installation is needed.
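To make the "one-liner over SSH" claim concrete, here is a sketch of what such a remote poll looks like. The exact query fields and the poll_cmd helper are illustrative assumptions, not taken from the script; the point is that the compute node needs nothing beyond nvidia-smi itself.

```shell
# Hypothetical dry-run helper: print the ssh command that would poll one node.
# QUERY_FIELDS is an assumed field list, not the script's actual query.
QUERY_FIELDS="index,name,temperature.gpu,utilization.gpu,memory.used,memory.total,power.draw,power.limit"

poll_cmd() {
  node="$1"
  printf 'ssh -o BatchMode=yes -o ConnectTimeout=5 %s nvidia-smi --query-gpu=%s --format=csv,noheader,nounits\n' \
    "$node" "$QUERY_FIELDS"
}
```

Running `poll_cmd gpu-node-01` prints the full command, which you can paste into a terminal to verify that a node answers before launching the monitor.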
Installation
Via pip (recommended)
pip install nvnodetop
This places the nvnodetop command on your PATH.
Via pipx (isolated)
pipx installs the tool into an isolated environment and exposes the command globally โ ideal for system-wide HPC environments.
pipx install nvnodetop
Manual install
# Clone
git clone https://github.com/whats2000/nvnodetop.git
cd nvnodetop
# Make executable and add to PATH
chmod +x nvnodetop.sh
cp nvnodetop.sh ~/.local/bin/nvnodetop
Usage
nvnodetop [FETCH_INTERVAL [DISPLAY_INTERVAL]]
Simply run nvnodetop from any terminal on your HPC login node:
nvnodetop # defaults: poll every 3 s, refresh UI every 1 s
nvnodetop 5 # poll every 5 s, refresh UI every 1 s
nvnodetop 5 2 # poll every 5 s, refresh UI every 2 s
Arguments
| Argument | Default | Description |
|---|---|---|
| FETCH_INTERVAL | 3 | Seconds between GPU data polls per node (SSH calls) |
| DISPLAY_INTERVAL | 1 | UI refresh rate in seconds |
Key Bindings
| Key | Action |
|---|---|
| ↑ / k / K | Previous job |
| ↓ / j / J | Next job |
| → / > / . | Next node within current job |
| ← / < / , | Previous node within current job |
| p / P | Toggle process table |
| q / Q | Quit |
Display Layout
Job 12345678 my_train_job   job [1/2] ↑↓ jobs   node [1/3] <> nodes   p procs   q quit   poll:3s disp:1s
gpu-node-01  [stale 12s]
────────────────────────────────────────────────────────────────────────────────
GPU Name               Temp  Utilization       %    Memory Used/Tot MiB      Power     SM/MemMHz  Util History  Flags
0   NVIDIA A100-SXM4   52°C  ████████████░░░░  78%  █████████░  38012/40960  312/400W  1410/1593  ▃▅▆▇█▇▆▅
1   NVIDIA A100-SXM4   48°C  ████████░░░░░░░░  50%  █████░░░░░  22016/40960  201/400W  1350/1500  ▂▃▄▅▄▃▂▁
────────────────────────────────────────────────────────────────────────────────
SUM (2 GPUs)                 ██████████░░░░░░  64%  ███████░░░  60028/81920  513/800W
GPU Row
| Field | Description |
|---|---|
| GPU | GPU index (0-based) |
| Name | GPU model name (truncated to 18 chars) |
| Temp | Core temperature in °C (cyan) |
| Utilisation bar | Coloured fill bar – green < 60 %, yellow < 85 %, red ≥ 85 % |
| % | Numeric GPU compute utilisation |
| Memory bar | Same colour coding, based on memory percentage |
| Used/Tot MiB | Absolute memory consumption |
| Power | Current draw / TDP limit in Watts |
| SM/MemMHz | Streaming Multiprocessor and memory clock speeds |
| Util History | Rolling sparkline of the last 20 utilisation samples |
| Flags | !THERM (thermal throttle), !PWR (power brake), ECC:N (ECC errors) |
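The colour thresholds in the table above boil down to a two-cutoff comparison. As a minimal sketch (the function name bar_color is illustrative, not from the script):

```shell
# Map a utilisation/memory percentage to the bar colour described above:
# green below 60 %, yellow from 60 % to 84 %, red at 85 % and above.
bar_color() {
  pct="$1"
  if [ "$pct" -ge 85 ]; then
    echo red
  elif [ "$pct" -ge 60 ]; then
    echo yellow
  else
    echo green
  fi
}
```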
Summary Row
Shows the average utilisation, total memory across all GPUs on the node, and total power draw.
Process Table
Toggle with p. Columns: GPU index, PID, username, command name (basename), GPU memory in MiB.
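The Util History column renders utilisation samples with the eight Unicode block characters ▁▂▃▄▅▆▇█. A minimal sketch of that mapping (the spark function is illustrative, not the script's actual implementation), assuming values in 0–100 so each step covers 12.5 %:

```shell
# Render a list of 0-100 utilisation samples as a Unicode sparkline.
spark() {
  out=""
  for v in "$@"; do
    # Integer-scale the sample to one of 8 block heights (0..7).
    case $(( v * 7 / 100 )) in
      0) out="$out▁" ;;
      1) out="$out▂" ;;
      2) out="$out▃" ;;
      3) out="$out▄" ;;
      4) out="$out▅" ;;
      5) out="$out▆" ;;
      6) out="$out▇" ;;
      *) out="$out█" ;;
    esac
  done
  printf '%s\n' "$out"
}
```

For example, `spark 0 50 100` prints ▁▄█.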
How It Works
 Login Node                           Compute Nodes
┌──────────────────────┐             ┌───────────────────────┐
│ nvnodetop            │  SSH poll   │ nvidia-smi query      │
│ ├─ squeue --me       ├────────────►│ python3 proc resolve  │
│ ├─ background        │◄────────────┤ stdout → cache file   │
│ │   fetcher/node     │  CSV data   └───────────────────────┘
│ └─ UI render loop    │
└──────────────────────┘
- Job discovery – squeue --me --states=R is called every 30 seconds to find your running jobs and their assigned nodes.
- Background pollers – One _node_fetcher_loop subprocess is spawned per unique node. Each loop SSHs into the node, runs nvidia-smi for GPU metrics and an inline Python3 snippet for process info, then writes the result atomically to a temp file (using tmp+mv).
- UI render loop – The main process reads the latest cached file, updates sparkline history arrays (which must live in the main shell for persistence), renders the frame, then calls read_key with a timeout equal to DISPLAY_INTERVAL.
- Cleanup – A trap cleanup INT TERM EXIT ensures all background pollers are killed and the cache directory is removed on exit.
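The tmp+mv pattern and the exit trap described above can be sketched as follows. Paths and the write_cache helper are illustrative assumptions; the key property is that mv renames atomically on the same filesystem, so the UI loop never reads a half-written file.

```shell
# Atomic cache write plus cleanup-on-exit, as described above.
CACHE_DIR="$(mktemp -d)"
trap 'rm -rf "$CACHE_DIR"' EXIT   # cache dir always removed on exit

write_cache() {
  node="$1"; data="$2"
  tmp="$CACHE_DIR/$node.tmp"
  printf '%s\n' "$data" > "$tmp"        # write to a temp file first...
  mv "$tmp" "$CACHE_DIR/$node.csv"      # ...then rename atomically
}
```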
Configuration
A small set of constants at the top of the script can be tweaked directly:
| Variable | Default | Description |
|---|---|---|
| NODE_REFRESH_INTERVAL | 30 | Seconds between squeue calls to discover node changes |
| HISTORY_LEN | 20 | Number of historical samples kept per GPU for the sparkline |
These can also be overridden at invocation time via environment variables (future enhancement).
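Environment-variable overrides are listed as a future enhancement; the usual shell pattern for them (shown here as a sketch, not current behaviour of the script) is a parameter-expansion default:

```shell
# Keep the script's default unless the caller exported a value,
# e.g. NODE_REFRESH_INTERVAL=60 nvnodetop
NODE_REFRESH_INTERVAL="${NODE_REFRESH_INTERVAL:-30}"
HISTORY_LEN="${HISTORY_LEN:-20}"
```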
Troubleshooting
No running SLURM jobs found
The script only displays nodes for your running jobs (squeue --me --states=R). Make sure you have at least one job in the R (Running) state.
GPU data shows Waiting for first data…
On first launch, the background SSH poller needs one full FETCH_INTERVAL cycle to collect data. Wait a few seconds.
[stale Xs] warning
The cached data is more than 3× FETCH_INTERVAL old, which usually means the SSH connection to that node is slow or timing out. Check your SSH connectivity to the compute node.
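The staleness test implied above can be sketched as an mtime comparison against the cache file. The is_stale helper is illustrative, not the script's actual code; it handles both GNU and BSD stat:

```shell
# True if the cache file is older than 3x the fetch interval (seconds).
is_stale() {
  file="$1"; fetch_interval="$2"
  now=$(date +%s)
  # GNU stat uses -c %Y; BSD/macOS stat uses -f %m.
  mtime=$(stat -c %Y "$file" 2>/dev/null || stat -f %m "$file")
  [ $(( now - mtime )) -gt $(( 3 * fetch_interval )) ]
}
```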
SSH connection refused / hang
Ensure passwordless SSH (BatchMode=yes) is configured for the compute nodes. The script uses ConnectTimeout=5 to avoid hanging.
declare -A / mapfile errors
Your Bash version is older than 4.0. Update Bash:
# On macOS (system bash is 3.2)
brew install bash
Contributing
Contributions, bug reports and feature requests are welcome!
- Fork the repository
- Create a feature branch: git checkout -b feat/my-feature
- Commit your changes with a descriptive message
- Open a Pull Request
Please ensure your changes are tested against a real SLURM cluster or a mocked environment before submitting.
License
This project is licensed under the MIT License; see the LICENSE file for details.