GPU, storage, thermal & infrastructure diagnostics toolkit
Brokkr Diagnostics
Comprehensive hardware diagnostics for NVIDIA GPU nodes. Collects GPU state, PCIe health, NVLink topology, kernel errors, storage/NVMe health, thermal/IPMI data, and InfiniBand status, with structured JSON output, async execution, and graceful degradation when a data source is unavailable.
Built for data center operators, HPC administrators, and GPU cluster managers.
Installation
pip install brokkr-diagnostics
Quick Start
# Interactive REPL with tab completion
brokkr-diagnostics --interactive
# Run a specific diagnostic
brokkr-diagnostics --run gpu
# Run all diagnostics
brokkr-diagnostics --run all
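For scripted use, the CLI can be wrapped from Python. The sketch below assumes the report is printed to stdout as a single JSON document; the README promises structured JSON output but does not say where it is written, so treat that as an assumption. `run_diagnostic` and its `runner` parameter are hypothetical names introduced for this example.

```python
import json
import subprocess
from typing import Callable, Optional, Sequence


def run_diagnostic(
    name: str,
    runner: Optional[Callable[[Sequence[str]], str]] = None,
) -> dict:
    """Run `brokkr-diagnostics --run <name>` and parse its stdout as JSON.

    Assumption: the tool prints one JSON document to stdout.
    """
    cmd = ["brokkr-diagnostics", "--run", name]
    if runner is None:
        # Default runner: invoke the real CLI and capture its stdout.
        def runner(argv: Sequence[str]) -> str:
            return subprocess.run(
                list(argv), capture_output=True, text=True, check=True
            ).stdout
    return json.loads(runner(cmd))
```

Passing a custom `runner` makes the wrapper testable on machines without GPUs, and leaves room to route the command through SSH or a job scheduler.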
Commands
| Command | Aliases | Description |
|---|---|---|
| driver | | NVIDIA driver version & installation checks |
| gpu | | GPU hardware: ECC, retired pages, remapped rows, thermals, power, clocks, PCIe link, processes |
| nvlink | | NVLink status, error counters & topology |
| lspci | pcie | PCIe topology, link status, AER error counters, ACS & IOMMU |
| kernel | logs | Kernel log analysis: XID/SXid, storage errors, IOMMU faults, OOM, hung tasks |
| services | | NVIDIA systemd service status |
| system | proc | System info: /proc files, NUMA topology |
| storage | nvme, disk | NVMe SMART, error logs, RAID, LVM, filesystem read-only detection |
| thermal | ipmi, sensors | CPU/board thermals (lm-sensors) & IPMI/BMC sensors + System Event Log |
| ib | | InfiniBand device/port status, error & traffic counters |
| cuda | cuda-tests | CUDA memory allocation & P2P bandwidth tests |
| reset | | GPU reset sequence (requires root) |
Usage Examples
GPU Hardware State
brokkr-diagnostics --run gpu
Returns per-GPU: ECC error counts (volatile + aggregate), retired pages (SBE/DBE/pending), remapped rows (correctable/uncorrectable/pending/failure), memory usage, power draw vs limits, clock speeds, PCIe link gen/width, temperature (GPU + HBM), running processes.
Kernel Log Analysis
brokkr-diagnostics --run kernel
Scans dmesg, journalctl, and log files for errors across domains: NVIDIA XID/SXid errors, NVSwitch failures, storage/NVMe I/O errors, IOMMU page faults, OOM kills, hung tasks, and soft/hard lockups. Each error is categorized by domain and severity.
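The categorization step can be pictured as a small table of regular expressions applied to each log line. This is a minimal sketch, not the tool's actual rule set: the patterns, domain names, and severity labels below are illustrative assumptions, modeled on common Linux kernel and NVIDIA driver message formats.

```python
import re
from typing import Optional

# Illustrative rules only: (domain, severity, pattern). These are assumptions,
# not the rules brokkr-diagnostics ships with.
PATTERNS = [
    ("gpu", "critical", re.compile(r"NVRM: Xid \(PCI:[0-9a-fA-F:.]+\): (\d+)")),
    ("memory", "critical", re.compile(r"Out of memory: Killed process \d+")),
    ("scheduler", "warning", re.compile(r"task .+ blocked for more than \d+ seconds")),
    ("scheduler", "critical", re.compile(r"soft lockup - CPU#\d+ stuck")),
    ("storage", "warning", re.compile(r"nvme\d+: I/O \d+ QID \d+ timeout")),
]


def classify(line: str) -> Optional[dict]:
    """Return a categorized event for the first matching rule, else None."""
    for domain, severity, pattern in PATTERNS:
        match = pattern.search(line)
        if match:
            event = {"domain": domain, "severity": severity, "message": line.strip()}
            if domain == "gpu":
                # The XID code identifies the GPU fault class.
                event["xid"] = int(match.group(1))
            return event
    return None
```

Feeding it the output of `dmesg` line by line and dropping the `None` results yields a list of events that can be grouped by domain.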
PCIe Diagnostics
brokkr-diagnostics --run lspci
Enumerates NVIDIA PCIe devices, checks link speed/width degradation, reads AER error counters (correctable, fatal, nonfatal) from sysfs, and checks ACS/IOMMU status for P2P readiness.
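The link and AER checks map onto standard Linux sysfs attributes under `/sys/bus/pci/devices/<BDF>/`. The sketch below shows one way to read them; the function names and the returned dictionary layout are hypothetical, and the tool's own implementation may differ.

```python
from pathlib import Path
from typing import Optional


def _read(device_dir: Path, name: str) -> Optional[str]:
    """Read one sysfs attribute, returning None if it is missing or unreadable."""
    try:
        return (device_dir / name).read_text().strip()
    except OSError:
        return None


def _leading_number(value: Optional[str]) -> Optional[float]:
    """Parse the leading number of values such as '16.0 GT/s PCIe' or '16'."""
    try:
        return float(value.split()[0]) if value else None
    except (ValueError, IndexError):
        return None


def link_status(device_dir: Path) -> dict:
    """Compare the negotiated PCIe link against the device maximum."""
    cur_speed = _leading_number(_read(device_dir, "current_link_speed"))
    max_speed = _leading_number(_read(device_dir, "max_link_speed"))
    cur_width = _leading_number(_read(device_dir, "current_link_width"))
    max_width = _leading_number(_read(device_dir, "max_link_width"))
    speed_down = None not in (cur_speed, max_speed) and cur_speed < max_speed
    width_down = None not in (cur_width, max_width) and cur_width < max_width
    return {
        "current_speed_gts": cur_speed,
        "max_speed_gts": max_speed,
        "current_width": cur_width,
        "max_width": max_width,
        "degraded": speed_down or width_down,
    }


def aer_counters(device_dir: Path, kind: str) -> dict:
    """Parse aer_dev_<kind>, where kind is correctable, fatal, or nonfatal."""
    counters = {}
    for line in (_read(device_dir, f"aer_dev_{kind}") or "").splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters
```

A reduced link speed is not always a fault: GPUs may drop to a lower PCIe generation while idle to save power, so a speed check is most meaningful under load. A reduced width is a stronger signal.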
Storage & NVMe Health
brokkr-diagnostics --run storage
Collects NVMe SMART data (critical warnings, media errors, spare capacity, wear), NVMe error logs, MD RAID array status, LVM layout, and detects read-only filesystem remounts.
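Read-only detection can be approximated by scanning `/proc/mounts` for disk-backed filesystems mounted with the `ro` option. The sketch below is an assumption about the approach, not the tool's code; `read_only_mounts` and its default filesystem list are hypothetical.

```python
def read_only_mounts(mounts_text: str, fstypes=("ext4", "xfs", "btrfs")) -> list:
    """Return mount points of the given filesystem types that are mounted read-only.

    `mounts_text` is the content of /proc/mounts; each line has the form
    "<device> <mountpoint> <fstype> <options> <dump> <pass>".
    """
    hits = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        _device, mountpoint, fstype, options = fields[:4]
        # Match the exact "ro" flag, not substrings such as "errors=remount-ro".
        if fstype in fstypes and "ro" in options.split(","):
            hits.append(mountpoint)
    return hits
```

In practice the input would come from `open("/proc/mounts").read()`. This flags every read-only mount, including intentional ones, so telling an error-triggered remount apart still requires comparing against the configured mount options or the kernel log.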
Thermal & IPMI
brokkr-diagnostics --run thermal
Reads CPU/board thermals via lm-sensors and collects IPMI BMC sensor readings (fans, temps, voltages, PSUs) with threshold status. Parses the IPMI System Event Log for hardware fault history.
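`ipmitool sensor` prints one pipe-delimited row per sensor (name, reading, unit, status, then thresholds). A minimal parsing sketch follows; the function names and output layout are hypothetical, and the set of alarm status codes (`nc`, `cr`, `nr`) is an assumption based on typical ipmitool output.

```python
# Status codes treated as threshold violations (assumed: non-critical,
# critical, non-recoverable).
ALARM_STATUSES = {"nc", "cr", "nr"}


def parse_ipmi_sensors(text: str) -> list:
    """Parse `ipmitool sensor` output into a list of reading dictionaries."""
    readings = []
    for line in text.splitlines():
        cols = [c.strip() for c in line.split("|")]
        if len(cols) < 4:
            continue
        name, raw, unit, status = cols[:4]
        try:
            value = float(raw)
        except ValueError:
            value = None  # "na" or a discrete hex value such as "0x0"
        readings.append(
            {
                "name": name,
                "value": value,
                "unit": unit,
                "status": status,
                "alarm": status in ALARM_STATUSES,
            }
        )
    return readings


def alarms(readings: list) -> list:
    """Return the names of sensors whose status indicates a threshold violation."""
    return [r["name"] for r in readings if r["alarm"]]
```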
NVLink Topology
brokkr-diagnostics --run nvlink
Per-GPU NVLink link state, version, remote device type, and error counters (CRC flit, CRC data, replay, recovery). Includes GPU-pair topology mapping.
NVIDIA Driver
brokkr-diagnostics --run driver
Driver version, VBIOS versions, kernel module info (vermagic, parameters), /proc/driver/nvidia state, DKMS logs, installation logs.
InfiniBand
brokkr-diagnostics --run ib
IB device enumeration with firmware/hardware info, per-port state and rate, error counters, and traffic stats from sysfs.
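On Linux these values live under `/sys/class/infiniband/<device>/ports/<n>/`, where `state` reads like `4: ACTIVE`, `rate` like `200 Gb/sec (4X HDR)`, and `counters/` holds one integer per file. The sketch below reads a single port directory; `port_summary` is a hypothetical helper, not part of the package's documented API.

```python
from pathlib import Path
from typing import Optional


def _read_text(path: Path) -> Optional[str]:
    """Read a sysfs file, returning None when it is absent or unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def port_summary(port_dir: Path) -> dict:
    """Summarize one InfiniBand port directory from sysfs."""
    state = _read_text(port_dir / "state")
    # "4: ACTIVE" -> "ACTIVE"; leave other formats untouched.
    state_name = state.split(":", 1)[1].strip() if state and ":" in state else state
    counters = {}
    counters_dir = port_dir / "counters"
    if counters_dir.is_dir():
        for entry in sorted(counters_dir.iterdir()):
            value = _read_text(entry)
            if value is not None and value.isdigit():
                counters[entry.name] = int(value)
    return {
        "state": state_name,
        "rate": _read_text(port_dir / "rate"),
        "counters": counters,
    }
```

Calling it for each directory matched by `Path("/sys/class/infiniband").glob("*/ports/*")` gives a per-port view across all devices.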
GPU Reset
sudo brokkr-diagnostics --run reset
Kills GPU processes, stops NVIDIA services, unloads kernel modules, performs PCIe bus reset, then reloads everything.
Optional System Tools
These tools enhance output when available. Missing tools are detected and skipped; a missing tool never causes a crash.
| Tool | Package | Used by |
|---|---|---|
| smartctl | smartmontools | NVMe SMART health |
| nvme | nvme-cli | NVMe error & smart logs |
| ipmitool | ipmitool | IPMI sensors & System Event Log |
| sensors | lm-sensors | CPU/board thermals |
| lspci | pciutils | PCIe topology & AER counters |
| mdadm | mdadm | MD RAID status |
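To check ahead of time which of these tools a node has, a small probe with the standard library's `shutil.which` is enough. This is a sketch of the detect-and-skip idea, not the package's own detection code; `OPTIONAL_TOOLS` and `detect_tools` are hypothetical names, with the tool-to-package mapping taken from the table above.

```python
import shutil

# Tool name -> package that provides it.
OPTIONAL_TOOLS = {
    "smartctl": "smartmontools",
    "nvme": "nvme-cli",
    "ipmitool": "ipmitool",
    "sensors": "lm-sensors",
    "lspci": "pciutils",
    "mdadm": "mdadm",
}


def detect_tools(which=shutil.which):
    """Split the optional tools into those found on PATH and those missing.

    Returns (available, missing): available maps tool -> resolved path,
    missing maps tool -> package name to install.
    """
    available, missing = {}, {}
    for tool, package in OPTIONAL_TOOLS.items():
        path = which(tool)
        if path:
            available[tool] = path
        else:
            missing[tool] = package
    return available, missing
```

The `missing` mapping doubles as an install hint, since its values are the package names to pass to the system package manager.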
Requirements
- Python >= 3.10
- Linux
- NVIDIA GPU with drivers loaded (for GPU diagnostics)
- Root/sudo for some operations (GPU reset, IPMI, dmesg)
License
MIT