Real-time GPU monitoring dashboard for DCGM Prometheus metrics

KempnerPulse

nvidia-smi says 100% GPU utilization, but are your tensor cores even active? KempnerPulse shows what's actually happening.

Real-time GPU monitoring dashboard for DCGM Prometheus metrics. A single-file Rich-based TUI that streams dcgm-exporter /metrics and renders four interactive views in the terminal.

Features

Fleet View: All GPUs at a glance, with utilization, memory, power, temperature, PCIe/NVLink bandwidth, and sparkline bars.
Focus View: Deep dive into one GPU with per-metric sparkline history.
Plot View: Stacked line charts across all GPUs.
Job View: Running GPU compute processes with per-GPU metrics.
Real Utilization: Weighted composite metric from SM active, tensor pipe, DRAM active, and GR engine counters (customizable weights with presets for AI/ML, HPC, and memory-bound workloads).
Workload Classification: 12-category status based on NVIDIA DCGM profiling metric guidance (idle, tensor-heavy compute, memory-bound, I/O, etc.).
Health Monitoring: Temperature, PCIe replay errors, and ECC errors with color-coded alerts.
SLURM/CUDA Aware: Automatically detects CUDA_VISIBLE_DEVICES, SLURM_JOB_GPUS, etc. to show only your allocated GPUs.
Zero Dependencies beyond Python 3.9+ and rich.

Screenshots

Fleet View

All GPUs at a glance with utilization bars, memory, power, temperature, and bandwidth.

Focus View

Deep dive into a single GPU with per-metric sparkline history.

Plot View

Stacked line charts across all GPUs.

Job View

Running GPU compute processes with per-GPU metrics.

Requirements

Linux with NVIDIA GPUs
dcgm-exporter running and exposing /metrics (default: http://localhost:9400/metrics)
Python >= 3.9
nvidia-smi on the PATH (for hardware queries and process listing)

Installation

Install locally (recommended until published on PyPI):

pip install .

Or run directly (installs only the rich dependency):

pip install rich
python3 kempner_pulse.py

Quick Start

# Default: connect to localhost:9400/metrics, show SLURM/CUDA-visible GPUs
kempnerpulse

# Explicit source and GPU selection
kempnerpulse --source http://gpu-node:9400/metrics --gpus 0,1,2,3

# Show all GPUs on the node
kempnerpulse --show-all

# Start in focus view for GPU 0
kempnerpulse --focus-gpu 0

# Use HPC weight preset
kempnerpulse --hpc-weights

# Custom weights (SM, Tensor, DRAM, GR; normalized automatically)
kempnerpulse --weights 0.40,0.30,0.20,0.10

Interactive Commands

Command	Action
`:focus <id>`	Enter focused view for a specific GPU
`:plot`	Enter plot view (line charts)
`:job`	Enter job view (running GPU processes)
`:q`	Return to fleet view (or exit if in fleet)
`:exit`	Exit the dashboard
`Ctrl+C`	Exit the dashboard
`Esc`	Cancel an unfinished `:` command

CLI Reference

Flag	Type	Default	Description
`--version`			Show version and exit.
`--source URL`	string	`http://localhost:9400/metrics`	dcgm-exporter `/metrics` endpoint or a local text file.
`--poll SECS`	float	`1.0`	Dashboard redraw interval in seconds (does not change DCGM sampling rate).
`--history N`	int	`120`	Number of samples kept for sparkline history.
`--focus-gpu ID`	string		Start in Focus View for the given GPU id (e.g. `0`).
`--once`	flag		Render a single snapshot and exit instead of running live.
`--gpus IDS`	string		Explicit GPU ids or ranges (`0,1` or `0-3`). Overrides SLURM/CUDA env vars.
`--show-all`	flag		Ignore SLURM/CUDA visibility env vars; show every GPU in the source.
`--weights W`	4 floats	`0.35,0.35,0.20,0.10`	Comma-separated Real Util weights: SM,TENSOR,DRAM,GR. Auto-normalized.
`--ai-weights`	preset		AI/LLM training preset `(0.35, 0.35, 0.20, 0.10)`. This is the default.
`--hpc-weights`	preset		HPC / mixed CUDA preset `(0.45, 0.15, 0.25, 0.15)`.
`--mem-weights`	preset		Memory-bound / bandwidth-heavy preset `(0.35, 0.10, 0.40, 0.15)`.

GPU Visibility Selection

The dashboard picks the first available source in this order:

--gpus flag
CUDA_VISIBLE_DEVICES env var
NVIDIA_VISIBLE_DEVICES env var
SLURM_STEP_GPUS env var
SLURM_JOB_GPUS env var

If none are set, all GPUs on the node are shown. Use --show-all to explicitly override all env vars. All GPU selections are filtered against GPUs accessible to the current process (as reported by nvidia-smi), which respects cgroup and container restrictions.
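
The precedence described above can be sketched as a small resolver. This is an illustration under stated assumptions, not the project's code: `select_gpus` and `ENV_ORDER` are hypothetical names, env values are treated as plain comma-separated indices (UUIDs and special values such as `all` are not handled), and `accessible` stands in for the list of GPUs that nvidia-smi reports for the current process.

```python
from typing import Mapping, Optional

# Environment variables in the documented precedence order.
ENV_ORDER = (
    "CUDA_VISIBLE_DEVICES",
    "NVIDIA_VISIBLE_DEVICES",
    "SLURM_STEP_GPUS",
    "SLURM_JOB_GPUS",
)


def _split(value: str) -> list[str]:
    """Split a comma-separated id list, dropping empty items."""
    return [p.strip() for p in value.split(",") if p.strip()]


def select_gpus(
    gpus_flag: Optional[str],
    show_all: bool,
    env: Mapping[str, str],
    accessible: list[str],
) -> list[str]:
    """Return the GPU ids to display, in the order nvidia-smi listed them."""
    requested: Optional[list[str]] = None
    if gpus_flag:
        # An explicit --gpus value wins over every env var.
        requested = _split(gpus_flag)
    elif not show_all:
        # First env var that is set and non-empty wins.
        for name in ENV_ORDER:
            value = env.get(name, "").strip()
            if value:
                requested = _split(value)
                break
    if requested is None:
        return list(accessible)  # nothing selected: show everything accessible
    # Every selection is filtered against what the process can actually see.
    return [g for g in accessible if g in requested]


env = {"SLURM_JOB_GPUS": "0,1,2,3", "CUDA_VISIBLE_DEVICES": "2,3"}
accessible = ["0", "1", "2", "3"]
print(select_gpus(None, False, env, accessible))  # ['2', '3']
print(select_gpus("1", False, env, accessible))   # ['1']
print(select_gpus(None, True, env, accessible))   # ['0', '1', '2', '3']
```

In real use, `env` would be `os.environ`. How the tool behaves when `--gpus` and `--show-all` are passed together is not documented; the sketch lets `--gpus` win.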

Weight Presets

Preset	Flag	SM	Tensor	DRAM	GR	Best For
AI/ML (default)	`--ai-weights`	0.35	0.35	0.20	0.10	DL training, LLM inference, transformers
HPC	`--hpc-weights`	0.45	0.15	0.25	0.15	Scientific computing, mixed CUDA
Memory-bound	`--mem-weights`	0.35	0.10	0.40	0.15	Bandwidth-heavy workloads, stencil codes

Custom: --weights 0.40,0.30,0.20,0.10 (values are normalized automatically).
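
As a rough illustration of what "normalized automatically" could mean, the sketch below divides each weight by the sum so the four values add up to 1. The preset tuples come from the table above; the function name `parse_weights`, the `PRESETS` keys, and the validation rules (exactly four non-negative values, positive sum) are assumptions.

```python
# Preset values from the table above, in SM, TENSOR, DRAM, GR order.
# The dictionary keys are illustrative names, not CLI values.
PRESETS = {
    "ai": (0.35, 0.35, 0.20, 0.10),
    "hpc": (0.45, 0.15, 0.25, 0.15),
    "mem": (0.35, 0.10, 0.40, 0.15),
}


def parse_weights(spec: str) -> tuple[float, float, float, float]:
    """Parse "SM,TENSOR,DRAM,GR" and scale the values so they sum to 1."""
    parts = [float(p) for p in spec.split(",")]
    if len(parts) != 4:
        raise ValueError("expected four comma-separated weights: SM,TENSOR,DRAM,GR")
    if any(w < 0 for w in parts):
        raise ValueError("weights must be non-negative")
    total = sum(parts)
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    sm, tensor, dram, gr = (w / total for w in parts)
    return sm, tensor, dram, gr


print(parse_weights("4,3,2,1"))  # (0.4, 0.3, 0.2, 0.1)
```

With this scheme, `--weights 4,3,2,1` and `--weights 0.40,0.30,0.20,0.10` would be equivalent.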

How It Works

KempnerPulse reads Prometheus text-format metrics from dcgm-exporter via HTTP (or a local file). It computes a Real Utilization score as a weighted combination of four DCGM profiling counters:

Real Util = clamp(0, 100,
              W_sm    × SM_ACTIVE
            + W_tensor × TENSOR_ACTIVE
            + W_dram   × DRAM_ACTIVE
            + W_gr     × GR_ENGINE_ACTIVE)

This gives a more accurate picture of GPU utilization than nvidia-smi's GPU-Util alone, which only reports the percentage of time during which at least one kernel was executing, regardless of how much of the GPU that kernel used.
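
The formula translates directly into a few lines of Python. The sketch below assumes the four inputs are already expressed as percentages in the range 0 to 100 (DCGM reports its profiling fields as ratios between 0 and 1, so a scaling step is presumably applied first) and that the weights have been normalized. `real_util` is a hypothetical function name.

```python
def real_util(
    sm: float,
    tensor: float,
    dram: float,
    gr: float,
    weights: tuple[float, float, float, float] = (0.35, 0.35, 0.20, 0.10),
) -> float:
    """Weighted Real Util score, clamped to the range [0, 100].

    Assumption: all four inputs are percentages (0-100), not 0-1 ratios.
    """
    w_sm, w_tensor, w_dram, w_gr = weights
    score = w_sm * sm + w_tensor * tensor + w_dram * dram + w_gr * gr
    return max(0.0, min(100.0, score))


# A GPU whose graphics engine is always busy but whose tensor pipes are idle.
busy_but_shallow = real_util(sm=30, tensor=0, dram=10, gr=100)
print(round(busy_but_shallow, 1))  # 22.5
```

With the default weights, this made-up sample scores about 22.5 even though the engine is active 100% of the time, which is the kind of gap the composite metric is meant to expose.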

Workload Classification

Each GPU is classified into one of 12 categories every refresh cycle, based on thresholds from NVIDIA's DCGM profiling metric guidance. Categories are evaluated in order and the first matching rule wins.

Status	Thresholds	Rationale
idle	Real Util < 5 %, GR < 5 %, DRAM < 5 %, no I/O	Nothing running.
tensor-heavy compute	Tensor ≥ 50 % and SM ≥ 60 %	DL training / large-scale inference.
tensor compute	Tensor ≥ 15 % and SM ≥ 40 %	Mixed-precision, moderate tensor use.
FP64 / HPC compute	FP64 ≥ 20 % and SM ≥ 50 %	Scientific double-precision workload.
I/O or data-loading	(Memcpy ≥ 40 % or PCIe ≥ 1 GB/s) and SM < 30 %	Heavy transfer; SMs idle.
memory-bound	DRAM ≥ 50 % and SM < 50 %	Bandwidth limited.
compute-heavy	SM ≥ 80 %	Effective SM use (NVIDIA: ≥ 80 % needed).
compute-active	SM ≥ 50 %	Moderate compute, no tensor dominance.
memory-active	DRAM ≥ 40 %	Significant DRAM traffic.
busy, low SM use	GR ≥ 40 % and SM < 25 %	Overhead / sync / small kernels.
low utilization	GR < 15 %, SM < 15 %, DRAM < 15 %	Barely active.
mixed / moderate	(fallthrough)	No single dominant pattern.

Full details, bottleneck color key, and NVIDIA reference points: docs/classification.md
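
The first-match evaluation can be sketched as an ordered list of predicates. This is an illustrative reading of the table, not the project's implementation: the `Sample` fields and the `classify` function are hypothetical, the I/O row is read as "(Memcpy or PCIe) and SM < 30 %", and "no I/O" in the idle row is assumed to mean that neither I/O threshold is reached.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One refresh of per-GPU metrics; percentages unless noted (hypothetical)."""
    real_util: float
    sm: float
    tensor: float
    dram: float
    gr: float
    fp64: float = 0.0
    memcpy: float = 0.0
    pcie_gbps: float = 0.0  # PCIe throughput in GB/s


def heavy_io(s: Sample) -> bool:
    return s.memcpy >= 40 or s.pcie_gbps >= 1.0


# Order matters: the first rule whose predicate holds decides the label.
RULES = [
    ("idle", lambda s: s.real_util < 5 and s.gr < 5 and s.dram < 5 and not heavy_io(s)),
    ("tensor-heavy compute", lambda s: s.tensor >= 50 and s.sm >= 60),
    ("tensor compute", lambda s: s.tensor >= 15 and s.sm >= 40),
    ("FP64 / HPC compute", lambda s: s.fp64 >= 20 and s.sm >= 50),
    ("I/O or data-loading", lambda s: heavy_io(s) and s.sm < 30),
    ("memory-bound", lambda s: s.dram >= 50 and s.sm < 50),
    ("compute-heavy", lambda s: s.sm >= 80),
    ("compute-active", lambda s: s.sm >= 50),
    ("memory-active", lambda s: s.dram >= 40),
    ("busy, low SM use", lambda s: s.gr >= 40 and s.sm < 25),
    ("low utilization", lambda s: s.gr < 15 and s.sm < 15 and s.dram < 15),
]


def classify(s: Sample) -> str:
    for label, matches in RULES:
        if matches(s):
            return label
    return "mixed / moderate"  # fallthrough


print(classify(Sample(real_util=80, sm=85, tensor=60, dram=40, gr=95)))
# tensor-heavy compute
print(classify(Sample(real_util=22.5, sm=10, tensor=0, dram=10, gr=100)))
# busy, low SM use
print(classify(Sample(real_util=30, sm=30, tensor=5, dram=30, gr=30)))
# mixed / moderate
```

Because the rules run in order, a GPU with SM at 85 % and Tensor at 60 % is labeled tensor-heavy compute rather than compute-heavy, even though it satisfies both rows.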

Health Monitoring

Status	Condition	Meaning
OK	(none of the below)	Normal operation.
WARN	PCIe replay rate > 0/s	PCIe link retransmissions occurring.
HOT	GPU or memory temp ≥ warning threshold	Approaching thermal throttling.
CRIT	Row-remap failure > 0 or uncorrectable remapped rows > 0	Hardware memory errors. Remove from production.

Temperature warning thresholds are per-model (A100: 93 °C, H100/H200: 95 °C, RTX 6000: 92 °C, default: 93 °C). Full threshold table: docs/classification.md
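
A minimal sketch of the health check follows, using the thresholds quoted above. Several details are assumptions: the most severe status wins when more than one condition holds (CRIT, then HOT, then WARN), the same threshold applies to GPU and memory temperature, and the model is matched by substring against the GPU's reported name. `health_status` and `temp_threshold` are hypothetical names.

```python
# Warning thresholds in degrees Celsius, as listed in the text above.
TEMP_WARN_C = {"A100": 93.0, "H100": 95.0, "H200": 95.0, "RTX 6000": 92.0}
DEFAULT_TEMP_WARN_C = 93.0


def temp_threshold(model_name: str) -> float:
    """Pick the per-model threshold by substring match (assumed strategy)."""
    for key, limit in TEMP_WARN_C.items():
        if key in model_name:
            return limit
    return DEFAULT_TEMP_WARN_C


def health_status(
    model_name: str,
    gpu_temp_c: float,
    mem_temp_c: float,
    pcie_replay_rate: float,          # replays per second
    row_remap_failure: int,
    uncorrectable_remapped_rows: int,
) -> str:
    # Assumption: conditions are checked from most to least severe.
    if row_remap_failure > 0 or uncorrectable_remapped_rows > 0:
        return "CRIT"
    if max(gpu_temp_c, mem_temp_c) >= temp_threshold(model_name):
        return "HOT"
    if pcie_replay_rate > 0:
        return "WARN"
    return "OK"


print(health_status("NVIDIA H100 80GB HBM3", 70, 96, 0.0, 0, 0))   # HOT
print(health_status("NVIDIA A100-SXM4-80GB", 60, 65, 0.5, 0, 0))   # WARN
print(health_status("NVIDIA A100-SXM4-80GB", 60, 65, 0.0, 0, 1))   # CRIT
```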

DCGM Metrics

KempnerPulse consumes ~30 DCGM fields covering profiling counters, memory, temperature, power, clocks, PCIe, NVLink, and error counters. The complete list with descriptions and NVIDIA doc links: docs/metrics.md
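
For orientation, the sketch below groups Prometheus text-format samples by their `gpu` label. It is a simplified parser written for illustration, not the one KempnerPulse ships: `parse_metrics` is a hypothetical name, the regular expressions cover only the common `NAME{label="value",...} VALUE` shape, and the sample payload is made up (the field names are real DCGM identifiers, the values and UUIDs are not).

```python
import re
from collections import defaultdict

# NAME, optional {labels}, then the sample value; a trailing timestamp is ignored.
LINE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text: str) -> dict[str, dict[str, float]]:
    """Return {gpu_id: {metric_name: value}} for samples carrying a gpu label."""
    per_gpu: dict[str, dict[str, float]] = defaultdict(dict)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and # HELP / # TYPE comments
        match = LINE_RE.match(line)
        if not match:
            continue
        name, raw_labels, raw_value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        gpu = labels.get("gpu")
        if gpu is None:
            continue  # not a per-GPU sample
        try:
            per_gpu[gpu][name] = float(raw_value)
        except ValueError:
            continue  # malformed value: drop the sample
    return dict(per_gpu)


# Made-up payload in the shape dcgm-exporter produces.
SAMPLE = '''\
# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0",UUID="GPU-aaaa"} 41
DCGM_FI_DEV_GPU_TEMP{gpu="1",UUID="GPU-bbbb"} 57
DCGM_FI_PROF_SM_ACTIVE{gpu="0",UUID="GPU-aaaa"} 0.82
'''

metrics = parse_metrics(SAMPLE)
print(metrics["0"])  # {'DCGM_FI_DEV_GPU_TEMP': 41.0, 'DCGM_FI_PROF_SM_ACTIVE': 0.82}
print(metrics["1"])  # {'DCGM_FI_DEV_GPU_TEMP': 57.0}
```

The text passed to `parse_metrics` could come from `urllib.request.urlopen(url).read().decode()` or from a local file, which matches the two source types that `--source` accepts.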

Performance Overhead

KempnerPulse introduces minimal runtime overhead, using approximately 8.2% of a single CPU core on an AMD EPYC 9374F processor, with negligible memory usage (below the reporting resolution of top).

License

MIT. See LICENSE for details.

Release history

Version	Released
0.4.1	Apr 15, 2026
0.4.0	Apr 13, 2026
0.3.0	Apr 10, 2026
0.2.1	Apr 5, 2026
0.2.0	Apr 4, 2026
0.1.0 (this version)	Apr 1, 2026

Download files

Source Distribution

kempnerpulse-0.1.0.tar.gz (33.2 kB), uploaded Apr 1, 2026

Built Distribution

kempnerpulse-0.1.0-py3-none-any.whl (30.4 kB), uploaded Apr 1, 2026, Python 3

File details

Details for the file kempnerpulse-0.1.0.tar.gz.

File metadata

Download URL: kempnerpulse-0.1.0.tar.gz
Upload date: Apr 1, 2026
Size: 33.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for kempnerpulse-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`ef7ecbc7e345a476cc4a8c8bd8fc3a2818211101a1d77a126a9200accd009a43`
MD5	`451b08d48cfdd113d9d7d6281527552a`
BLAKE2b-256	`4abff5fd7858afeaf66891699810c993f945bcaff459f31280b1b45a7914fb11`

File details

Details for the file kempnerpulse-0.1.0-py3-none-any.whl.

File metadata

Download URL: kempnerpulse-0.1.0-py3-none-any.whl
Upload date: Apr 1, 2026
Size: 30.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for kempnerpulse-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`db0b80766ff12f5ea834fe7720ab14968b9f7c5213c956c76a38a00965f066eb`
MD5	`f685105229d4c3f806c16fddadccbefb`
BLAKE2b-256	`ed5cb12c63bc9b4f21d9a2e5d042e912929b0645c0fdfd9a06e7b4feffe92c31`
