GPU, storage, thermal & infrastructure diagnostics toolkit
Brokkr Diagnostics
Comprehensive hardware diagnostics for NVIDIA GPU nodes. Detects GPU health issues, PCIe degradation, NVLink errors, Fabric Manager problems, ECC/memory faults, kernel errors, storage/NVMe health, thermal/IPMI data, and InfiniBand status — with a unified health check that auto-detects machine topology and produces a single HEALTHY / DEGRADED / UNHEALTHY verdict.
Built for data center operators, HPC administrators, and GPU cluster managers.
Installation
pip install brokkr-diagnostics
Quick Start
# Machine health check — single pass/fail verdict
brokkr-diagnostics --run health
# Interactive REPL with tab completion
brokkr-diagnostics --interactive
# Run a specific diagnostic
brokkr-diagnostics --run gpu
# Run all diagnostics
brokkr-diagnostics --run all
# Run and send all diagnostics to Brokkr
brokkr-diagnostics --send
Health Check
The health command auto-detects your machine topology (PCIe-only, NVLink, or NVSwitch) and runs targeted checks to produce a single verdict:
brokkr-diagnostics --run health
What it checks:
| Category | Checks | When |
|---|---|---|
| gpu_presence | All GPUs enumerable via NVML | Always |
| driver | Driver loaded, persistence mode | Always |
| memory_health | ECC errors, row remapping, pending repairs | Always |
| pcie | Link speed/width at max, zero fatal errors | Always |
| thermal_power | Temperature headroom, no HW throttling, violation time | Always |
| kernel_errors | No XID/SXid, OOM kills, GPU fallen off bus in dmesg | Always |
| p2p_communication | GPU-to-GPU memory copy works for all pairs | Multi-GPU |
| nvlink | All links active, zero CRC/replay errors | NVLink systems |
| fabric_manager | FM running, version matches driver, NVSwitch count | NVSwitch systems |
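The "When" column can be read as a small selection rule. The sketch below is one way to express it and is not taken from the Brokkr source; the function name and the topology strings `"pcie"` and `"nvlink"` are assumptions (only `"nvswitch"` appears in the sample output further down).

```python
# Category -> condition, copied from the "When" column of the table above.
CATEGORY_CONDITIONS = {
    "gpu_presence": "always",
    "driver": "always",
    "memory_health": "always",
    "pcie": "always",
    "thermal_power": "always",
    "kernel_errors": "always",
    "p2p_communication": "multi_gpu",
    "nvlink": "nvlink",
    "fabric_manager": "nvswitch",
}


def select_categories(gpu_count: int, topology_type: str) -> list[str]:
    """Return the check categories that apply to a machine.

    topology_type is assumed to be one of "pcie", "nvlink", "nvswitch".
    """
    applies = {
        "always": True,
        "multi_gpu": gpu_count > 1,
        # NVSwitch machines are treated as NVLink machines too, matching the
        # sample output where an nvswitch node also reports an nvlink category.
        "nvlink": topology_type in ("nvlink", "nvswitch"),
        "nvswitch": topology_type == "nvswitch",
    }
    return [name for name, cond in CATEGORY_CONDITIONS.items() if applies[cond]]
```

With these assumptions, a single-GPU PCIe machine gets the six "Always" categories and an 8-GPU NVSwitch machine gets all nine.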
Output:
- HEALTHY: all checks pass
- DEGRADED: warnings present (e.g., thermal headroom low, correctable errors)
- UNHEALTHY: failures detected (e.g., PCIe x2 instead of x16, FM version mismatch, P2P broken)
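The three verdicts suggest a worst-status-wins rule. A minimal sketch of that rule follows; the `"pass"` status string appears in the sample output, while `"warn"` and `"fail"` are assumed names and may not match what the tool actually emits.

```python
def overall_verdict(category_statuses: list[str]) -> str:
    """Collapse per-category statuses into a single verdict.

    Assumes statuses are "pass", "warn", or "fail"; only "pass" is
    confirmed by the README's sample output.
    """
    if any(status == "fail" for status in category_statuses):
        return "UNHEALTHY"   # any failure dominates
    if any(status == "warn" for status in category_statuses):
        return "DEGRADED"    # warnings without failures
    return "HEALTHY"         # everything passed
```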
Known issues detection: When P2P failures are detected, the tool cross-references the GPU architecture against a known-issues database and suggests specific fixes (e.g., Blackwell UVM HMM workaround).
Commands
| Command | Aliases | Description |
|---|---|---|
| health | status, check | Unified machine health check (auto-detects topology) |
| driver | | NVIDIA driver version & installation checks |
| gpu | | GPU hardware: ECC, retired pages, remapped rows, thermals, power, clocks, PCIe link, utilization, violations, BAR1, architecture |
| nvlink | | NVLink status, error counters, topology, P2P status matrix, throughput measurement |
| lspci | pcie | PCIe topology, link status, AER error counters, ACS & IOMMU |
| kernel | logs | Kernel log analysis: XID/SXid, storage errors, IOMMU faults, OOM, hung tasks |
| services | | NVIDIA systemd service status |
| system | proc | System info: /proc files, NUMA topology |
| storage | nvme, disk | NVMe SMART, error logs, RAID, LVM, filesystem read-only detection |
| thermal | ipmi, sensors | CPU/board thermals (lm-sensors) & IPMI/BMC sensors + System Event Log |
| ib | | InfiniBand device/port status, error & traffic counters |
| cuda | cuda-tests | CUDA memory allocation & P2P bandwidth tests |
| fabric | fm, nvswitch | Fabric Manager health, NVSwitch enumeration, version validation |
| ecc | ras, memory-health | Deep ECC/RAS: per-location errors, row remapper histogram, RMA recommendation |
| gpu_debug | debug, hw-debug | Register-level GPU debug via gpu-admin-tools (requires root) |
| reset | | GPU reset sequence (requires root) |
Usage Examples
Machine Health Check
brokkr-diagnostics --run health
{
  "overall_status": "HEALTHY",
  "topology": {
    "gpu_count": 8,
    "gpu_name": "NVIDIA A100-SXM4-40GB",
    "topology_type": "nvswitch",
    "nvlink_per_gpu": 12,
    "nvswitch_count": 6
  },
  "categories": [
    {"category": "gpu_presence", "status": "pass", "checks": [...]},
    {"category": "memory_health", "status": "pass", "checks": [...]},
    {"category": "p2p_communication", "status": "pass", "checks": [...]},
    {"category": "nvlink", "status": "pass", "checks": [...]},
    {"category": "fabric_manager", "status": "pass", "checks": [...]}
  ],
  "issue_count": 0,
  "issues": []
}
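A report shaped like the one above is easy to consume from a wrapper script. The sketch below assumes the report is available as valid JSON text (the `[...]` elisions in the sample are not valid JSON) and that the field names match the sample. The exit-code mapping is a convention invented for this example, not something the tool is documented to do.

```python
import json


def summarize_health(report_text: str) -> tuple[int, str]:
    """Turn a health report into (exit_code, one-line summary)."""
    report = json.loads(report_text)
    status = report["overall_status"]

    # Collect every category that did not report "pass".
    not_passing = [
        c["category"]
        for c in report.get("categories", [])
        if c.get("status") != "pass"
    ]

    # Hypothetical mapping for a wrapper script: 0 healthy, 1 degraded, 2 otherwise.
    exit_code = {"HEALTHY": 0, "DEGRADED": 1}.get(status, 2)

    summary = f"{status}: {report.get('issue_count', 0)} issue(s)"
    if not_passing:
        summary += " in " + ", ".join(not_passing)
    return exit_code, summary
```

A scheduler prolog or monitoring agent could call `summarize_health` on the captured output and drain the node when the exit code is non-zero.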
GPU Hardware State
brokkr-diagnostics --run gpu
Per-GPU: ECC errors, retired pages, remapped rows, memory usage, power (draw/limits/default/violations), clock speeds with full 9-bit throttle reason decode, PCIe link gen/width with error counters, temperature with shutdown/slowdown thresholds, BAR1 memory, GPU utilization, performance state, architecture, MIG mode, accounting mode.
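For readers unfamiliar with the throttle-reason bitmask, here is a sketch of a nine-bit decode. The bit values are those of NVML's `nvmlClocksThrottleReason*` constants; that Brokkr decodes exactly these nine bits, and the short names used here, are assumptions.

```python
# Bit values from NVML's nvmlClocksThrottleReason* constants.
THROTTLE_REASONS = {
    0x001: "GpuIdle",
    0x002: "ApplicationsClocksSetting",
    0x004: "SwPowerCap",
    0x008: "HwSlowdown",
    0x010: "SyncBoost",
    0x020: "SwThermalSlowdown",
    0x040: "HwThermalSlowdown",
    0x080: "HwPowerBrakeSlowdown",
    0x100: "DisplayClockSetting",
}

# Reasons that indicate hardware-enforced throttling.
HW_REASONS = {"HwSlowdown", "HwThermalSlowdown", "HwPowerBrakeSlowdown"}


def decode_throttle_reasons(mask: int) -> list[str]:
    """Return the names of all throttle reasons set in the bitmask."""
    return [name for bit, name in THROTTLE_REASONS.items() if mask & bit]


def has_hw_throttling(mask: int) -> bool:
    """True if any hardware slowdown bit is set."""
    return any(name in HW_REASONS for name in decode_throttle_reasons(mask))
```

With `nvidia-ml-py`, the mask would typically come from `pynvml.nvmlDeviceGetCurrentClocksThrottleReasons(handle)`.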
Deep ECC/RAS Diagnostics
brokkr-diagnostics --run ecc
Per-memory-location ECC counters (DRAM vs SRAM), row remapper spare-capacity histogram (Ampere+), pending repairs, InfoROM validation, and automated RMA recommendation.
NVLink Diagnostics
brokkr-diagnostics --run nvlink
Per-GPU NVLink link state, version, remote device type, error counters. P2P status matrix with NVLink/atomics capability per GPU pair. NVLink throughput delta measurement. Per-link version validation and NVSwitch link count consistency check.
Fabric Manager & NVSwitch
brokkr-diagnostics --run fabric
NVSwitch device enumeration (count, BIOS version, UUID), Fabric Manager service health, driver/FM version match validation, per-GPU fabric state, FM log error extraction, NVSwitch count validation against known platform configurations.
Register-Level GPU Debug
sudo brokkr-diagnostics --run gpu_debug
Uses nvidia-gpu-admin-tools for hardware register access (BAR0 MMIO). Detects broken/unresponsive GPUs, security faults, and NVLink training states (ACTIVE/TRAIN/FAULT/RCVY/SHUTDOWN) that are invisible to NVML. Reports physical module ID for SXM GPUs.
Kernel Log Analysis
brokkr-diagnostics --run kernel
Scans dmesg, journalctl, and log files for errors across domains: NVIDIA XID errors (enriched with severity and recommended action from a 40+ code lookup table), NVSwitch SXid errors, storage/NVMe I/O errors, IOMMU page faults, OOM kills, hung tasks, and soft/hard lockups.
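To illustrate the XID part of this scan, the sketch below extracts XID events from kernel log text. It assumes the usual `NVRM: Xid (PCI:<address>): <code>, ...` line format written by the NVIDIA driver. The three-entry lookup table is only a stand-in for the tool's own 40+ code table, which is not reproduced here, and the function name is hypothetical.

```python
import re

# Matches e.g. "NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, ..."
XID_RE = re.compile(r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<code>\d+)")

# Illustrative subset of well-known XID codes.
XID_NAMES = {
    31: "GPU memory page fault",
    48: "Double-bit ECC error",
    79: "GPU has fallen off the bus",
}


def find_xids(log_text: str) -> list[dict]:
    """Return one dict per XID line found in the log text."""
    events = []
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match:
            code = int(match.group("code"))
            events.append({
                "pci": match.group("pci"),
                "xid": code,
                "name": XID_NAMES.get(code, "unknown"),
            })
    return events
```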
PCIe Diagnostics
brokkr-diagnostics --run lspci
PCIe device enumeration, link speed/width degradation detection, AER error counters (correctable, fatal, nonfatal) from sysfs and lspci, ACS status on upstream bridges (P2P blocking), IOMMU status and GPU group mapping.
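Link degradation can be checked by comparing the current and maximum link attributes that Linux exposes under `/sys/bus/pci/devices/<address>/`. The sketch below shows that comparison; it is not Brokkr's implementation, the function names are hypothetical, and it does not handle a speed reported as `Unknown`.

```python
from pathlib import Path


def _gts(speed: str) -> float:
    """Parse a sysfs speed string such as '16.0 GT/s PCIe' into GT/s."""
    return float(speed.split()[0])


def link_status(cur_speed: str, max_speed: str, cur_width: str, max_width: str) -> dict:
    """Compare current link speed/width against the device maximum."""
    return {
        "speed_degraded": _gts(cur_speed) < _gts(max_speed),
        "width_degraded": int(cur_width) < int(max_width),
    }


def read_link_status(device_dir: str) -> dict:
    """Read the four link attributes from a PCI device's sysfs directory."""
    d = Path(device_dir)
    names = ("current_link_speed", "max_link_speed",
             "current_link_width", "max_link_width")
    return link_status(*[(d / name).read_text().strip() for name in names])
```

Note that GPUs commonly lower their link speed while idle to save power, so a reduced speed is only meaningful under load; a reduced width (for example x2 on an x16 device) is the stronger signal.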
Storage & NVMe Health
brokkr-diagnostics --run storage
NVMe SMART data (critical warnings, media errors, spare capacity, wear), error logs, MD RAID status, LVM layout, read-only filesystem detection.
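Read-only filesystem detection can be done by scanning the mount table. The sketch below parses text in `/proc/mounts` format; restricting the check to a few on-disk filesystem types is an assumption made here to avoid flagging mounts that are read-only by design (such as squashfs), and may differ from what the tool does.

```python
def readonly_mounts(mounts_text: str, fstypes=("ext4", "xfs", "btrfs")) -> list[str]:
    """Return mount points of the given filesystem types mounted read-only.

    Each line of /proc/mounts has the form:
    device mountpoint fstype options dump pass
    """
    hits = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip malformed or empty lines
        _device, mountpoint, fstype, options = fields[:4]
        # Compare whole options so "errors=remount-ro" is not mistaken for "ro".
        if fstype in fstypes and "ro" in options.split(","):
            hits.append(mountpoint)
    return hits
```

On a live system the input would be `open("/proc/mounts").read()`. A data filesystem that appears here unexpectedly often means the kernel remounted it after I/O errors.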
Thermal & IPMI
brokkr-diagnostics --run thermal
CPU/board thermals via lm-sensors, IPMI BMC sensor readings with threshold status, IPMI System Event Log for hardware fault history.
Sending Diagnostics to Brokkr
The --send (-s) flag runs diagnostics and POSTs the results to the Brokkr API for remote monitoring.
# Send all diagnostics
brokkr-diagnostics --send
# Send a specific diagnostic
brokkr-diagnostics --send gpu
Authentication
Credentials are resolved in this order:
- Creds file: /var/lib/brokkr/phone-home-creds.json (keys: cipher, signature)
- Fallback script: parsed from /var/lib/cloud/scripts/per-boot/90-phone-home.sh
- Environment variables: BROKKR_AUTH_CIPHER and BROKKR_AUTH_SIGNATURE
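The resolution order above amounts to a first-match chain. The sketch below illustrates the pattern and is not the tool's code: the function names are hypothetical, and the fallback-script loader is left as a stub because the layout of that script is not documented here.

```python
import json
import os
from pathlib import Path

CREDS_FILE = "/var/lib/brokkr/phone-home-creds.json"


def from_creds_file(path: str = CREDS_FILE):
    """Load (cipher, signature) from the JSON creds file, or None."""
    try:
        data = json.loads(Path(path).read_text())
    except (OSError, ValueError):
        return None  # missing, unreadable, or not valid JSON
    if isinstance(data, dict) and data.get("cipher") and data.get("signature"):
        return data["cipher"], data["signature"]
    return None


def from_fallback_script():
    """Hypothetical stub: the format of 90-phone-home.sh is not documented."""
    return None


def from_env(env=None):
    """Load (cipher, signature) from environment variables, or None."""
    env = os.environ if env is None else env
    cipher = env.get("BROKKR_AUTH_CIPHER")
    signature = env.get("BROKKR_AUTH_SIGNATURE")
    return (cipher, signature) if cipher and signature else None


def resolve_credentials(sources):
    """Return the first non-None result from the ordered list of loaders."""
    for source in sources:
        creds = source()
        if creds is not None:
            return creds
    return None
```

Calling `resolve_credentials([from_creds_file, from_fallback_script, from_env])` follows the documented order; the creds file wins when it exists and contains both keys.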
Configuration
| Environment Variable | Default | Description |
|---|---|---|
| BROKKR_API_URL | https://brokkr.hydrahost.com/api/v1 | API base URL |
| BROKKR_AUTH_CIPHER | (from creds file) | Authentication cipher |
| BROKKR_AUTH_SIGNATURE | (from creds file) | Authentication signature |
| BROKKR_TIMEOUT | 30 | HTTP request timeout in seconds |
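As an illustration of how these variables and defaults fit together, the sketch below reads the two non-secret settings from the environment. The `BrokkrConfig` class and `load_config` function are hypothetical names for this example; only the variable names and defaults come from the table.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class BrokkrConfig:
    api_url: str
    timeout: float  # seconds


def load_config(env=None) -> BrokkrConfig:
    """Build a config from environment variables, using the documented defaults."""
    env = os.environ if env is None else env
    return BrokkrConfig(
        api_url=env.get("BROKKR_API_URL", "https://brokkr.hydrahost.com/api/v1"),
        timeout=float(env.get("BROKKR_TIMEOUT", "30")),
    )
```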
Optional System Tools
Missing tools are detected and the checks that depend on them are skipped; the toolkit does not crash when a tool is absent.
| Tool | Package | Used by |
|---|---|---|
| smartctl | smartmontools | NVMe SMART health |
| nvme | nvme-cli | NVMe error & smart logs |
| ipmitool | ipmitool | IPMI sensors & System Event Log |
| sensors | lm-sensors | CPU/board thermals |
| lspci | pciutils | PCIe topology & AER counters |
| mdadm | mdadm | MD RAID status |
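To see in advance which optional tools are absent on a node, a lookup on `PATH` is enough. The sketch below uses the tool and package names from the table; the function name is hypothetical, and this is not necessarily how the toolkit performs its own detection.

```python
import shutil

# Tool -> package, copied from the table above.
OPTIONAL_TOOLS = {
    "smartctl": "smartmontools",
    "nvme": "nvme-cli",
    "ipmitool": "ipmitool",
    "sensors": "lm-sensors",
    "lspci": "pciutils",
    "mdadm": "mdadm",
}


def missing_tools(which=shutil.which) -> dict[str, str]:
    """Return {tool: package} for every optional tool not found on PATH.

    The `which` parameter exists so the lookup can be replaced in tests.
    """
    return {tool: pkg for tool, pkg in OPTIONAL_TOOLS.items() if which(tool) is None}
```

Printing the values of `missing_tools()` gives the list of packages to install for full coverage.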
Dependencies
| Package | Purpose |
|---|---|
| nvidia-ml-py | NVML Python bindings (pynvml) for GPU metrics |
| nvidia-gpu-admin-tools | Register-level GPU access: broken GPU detection, NVLink training state |
| termcolor | Colored console output |
| prompt_toolkit | Interactive CLI |
| aiohttp | HTTP client for Brokkr API |
Requirements
- Python >= 3.10
- Linux
- NVIDIA GPU with drivers loaded
- Root/sudo for: gpu_debug, reset, IPMI, dmesg
License
MIT