GPU, storage, thermal & infrastructure diagnostics toolkit
Brokkr Diagnostics
Comprehensive hardware diagnostics for NVIDIA GPU nodes. Collects GPU state, PCIe health, NVLink topology, kernel errors, storage/NVMe health, thermal/IPMI data, and InfiniBand status. Output is structured JSON, execution is asynchronous, and failures degrade gracefully.
Built for data center operators, HPC administrators, and GPU cluster managers.
Installation
pip install brokkr-diagnostics
Quick Start
# Interactive REPL with tab completion
brokkr-diagnostics --interactive
# Run a specific diagnostic
brokkr-diagnostics --run gpu
# Run all diagnostics
brokkr-diagnostics --run all
# Run and send all diagnostics to Brokkr
brokkr-diagnostics --send
# Run and send a specific diagnostic to Brokkr
brokkr-diagnostics --send gpu
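For scripting, the CLI can be driven from Python. The following is a minimal sketch, assuming that `--run <name>` writes a single JSON document to stdout; the description mentions structured JSON output, but the output stream and schema are not specified here. `run_diagnostic` and its `runner` parameter are hypothetical names used only for illustration.

```python
import json
import subprocess
from typing import Any, Callable


def run_diagnostic(
    name: str,
    runner: Callable[..., subprocess.CompletedProcess] = subprocess.run,
    timeout: float = 300.0,
) -> Any:
    """Run `brokkr-diagnostics --run <name>` and decode its stdout as JSON.

    Assumption: the command prints exactly one JSON document on stdout.
    """
    proc = runner(
        ["brokkr-diagnostics", "--run", name],
        capture_output=True,
        text=True,
        timeout=timeout,
        check=False,
    )
    if proc.returncode != 0:
        raise RuntimeError(
            f"diagnostic {name!r} exited with {proc.returncode}: {proc.stderr.strip()}"
        )
    return json.loads(proc.stdout)
```

Calling `run_diagnostic("gpu")` would then return the decoded result. The `runner` argument exists so the function can be exercised without the tool installed.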
Commands
| Command | Aliases | Description |
|---|---|---|
| `driver` | | NVIDIA driver version & installation checks |
| `gpu` | | GPU hardware: ECC, retired pages, remapped rows, thermals, power, clocks, PCIe link, processes |
| `nvlink` | | NVLink status, error counters & topology |
| `lspci` | `pcie` | PCIe topology, link status, AER error counters, ACS & IOMMU |
| `kernel` | `logs` | Kernel log analysis: XID/SXid, storage errors, IOMMU faults, OOM, hung tasks |
| `services` | | NVIDIA systemd service status |
| `system` | `proc` | System info: /proc files, NUMA topology |
| `storage` | `nvme`, `disk` | NVMe SMART, error logs, RAID, LVM, filesystem read-only detection |
| `thermal` | `ipmi`, `sensors` | CPU/board thermals (lm-sensors) & IPMI/BMC sensors + System Event Log |
| `ib` | | InfiniBand device/port status, error & traffic counters |
| `cuda` | `cuda-tests` | CUDA memory allocation & P2P bandwidth tests |
| `reset` | | GPU reset sequence (requires root) |
Usage Examples
GPU Hardware State
brokkr-diagnostics --run gpu
Returns per-GPU: ECC error counts (volatile + aggregate), retired pages (SBE/DBE/pending), remapped rows (correctable/uncorrectable/pending/failure), memory usage, power draw vs limits, clock speeds, PCIe link gen/width, temperature (GPU + HBM), running processes.
Kernel Log Analysis
brokkr-diagnostics --run kernel
Scans dmesg, journalctl, and log files for errors across domains: NVIDIA XID/SXid errors, NVSwitch failures, storage/NVMe I/O errors, IOMMU page faults, OOM kills, hung tasks, and soft/hard lockups. Each error is categorized by domain and severity.
PCIe Diagnostics
brokkr-diagnostics --run lspci
Enumerates NVIDIA PCIe devices, checks link speed/width degradation, reads AER error counters (correctable, fatal, nonfatal) from sysfs, and checks ACS/IOMMU status for P2P readiness.
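As an illustration of the sysfs side, the Linux kernel exposes per-device AER counters in `aer_dev_correctable`, `aer_dev_fatal`, and `aer_dev_nonfatal` under `/sys/bus/pci/devices/<BDF>/`, each as `NAME COUNT` lines. The sketch below parses those files; the function names are hypothetical, and the tool's own implementation may differ.

```python
from pathlib import Path

AER_FILES = ("aer_dev_correctable", "aer_dev_fatal", "aer_dev_nonfatal")


def parse_aer_counters(text: str) -> dict[str, int]:
    """Parse `NAME COUNT` lines (e.g. "BadTLP 3") into a dict of integers."""
    counters: dict[str, int] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def read_aer(device_dir: Path) -> dict[str, dict[str, int]]:
    """Read whichever AER counter files exist for one PCI device directory.

    Devices without AER support simply have no such files, so the result
    may be empty.
    """
    result: dict[str, dict[str, int]] = {}
    for name in AER_FILES:
        path = device_dir / name
        if path.is_file():
            result[name] = parse_aer_counters(path.read_text())
    return result
```

For example, `read_aer(Path("/sys/bus/pci/devices/0000:3b:00.0"))`, where the bus address is a placeholder.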
Storage & NVMe Health
brokkr-diagnostics --run storage
Collects NVMe SMART data (critical warnings, media errors, spare capacity, wear), NVMe error logs, MD RAID array status, LVM layout, and detects read-only filesystem remounts.
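One common way to spot a filesystem that has gone read-only is to scan `/proc/mounts` for block-device mounts carrying the `ro` option. The sketch below shows that idea under stated assumptions: it cannot tell an error-triggered remount from an intentional read-only mount, and the exclusion list of read-only-by-design filesystem types is a guess. It is not a description of the tool's actual detection logic.

```python
# Filesystem types that are read-only by nature and would otherwise be noise.
READ_ONLY_BY_DESIGN = {"squashfs", "iso9660"}


def read_only_mounts(mounts_text: str) -> list[tuple[str, str, str]]:
    """Return (device, mountpoint, fstype) for /dev-backed mounts with the `ro` option.

    Expects text in /proc/mounts format:
    device mountpoint fstype options dump pass
    """
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        device, mountpoint, fstype, options = fields[:4]
        if not device.startswith("/dev/") or fstype in READ_ONLY_BY_DESIGN:
            continue
        if "ro" in options.split(","):
            found.append((device, mountpoint, fstype))
    return found
```

Typical use would be `read_only_mounts(open("/proc/mounts").read())`.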
Thermal & IPMI
brokkr-diagnostics --run thermal
Reads CPU/board thermals via lm-sensors and collects IPMI BMC sensor readings (fans, temps, voltages, PSUs) with threshold status. Parses the IPMI System Event Log for hardware fault history.
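To illustrate threshold status handling, the sketch below parses text in the three-column `name | reading | status` layout that `ipmitool sdr` commonly prints, and flags sensors whose status is neither `ok` nor `ns` (no reading). The column layout is an assumption about the local `ipmitool` build, and the helper names are hypothetical.

```python
def parse_sdr(output: str) -> list[dict[str, str]]:
    """Parse `name | reading | status` rows into dicts; other lines are ignored."""
    sensors = []
    for line in output.splitlines():
        cells = [cell.strip() for cell in line.split("|")]
        if len(cells) == 3 and cells[0]:
            sensors.append({"name": cells[0], "reading": cells[1], "status": cells[2]})
    return sensors


def abnormal(sensors: list[dict[str, str]]) -> list[dict[str, str]]:
    """Keep sensors whose status is not `ok` and not `ns` (no reading)."""
    return [s for s in sensors if s["status"] not in ("ok", "ns")]
```

Input would typically come from `sudo ipmitool sdr`, since BMC access usually needs root.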
NVLink Topology
brokkr-diagnostics --run nvlink
Per-GPU NVLink link state, version, remote device type, and error counters (CRC flit, CRC data, replay, recovery). Includes GPU-pair topology mapping.
NVIDIA Driver
brokkr-diagnostics --run driver
Driver version, VBIOS versions, kernel module info (vermagic, parameters), /proc/driver/nvidia state, DKMS logs, installation logs.
InfiniBand
brokkr-diagnostics --run ib
IB device enumeration with firmware/hardware info, per-port state and rate, error counters, and traffic stats from sysfs.
GPU Reset
sudo brokkr-diagnostics --run reset
Kills GPU processes, stops NVIDIA services, unloads kernel modules, performs PCIe bus reset, then reloads everything.
Sending Diagnostics to Brokkr
The --send (-s) flag runs diagnostics and POSTs the results to the Brokkr API. This is used for remote monitoring and automated health reporting.
# Send all diagnostics
brokkr-diagnostics --send
# Send a specific diagnostic
brokkr-diagnostics --send gpu
When sending all diagnostics, commands that require root or are synchronous-only (e.g., reset) are automatically skipped.
Authentication
Credentials are resolved in this order:
- Creds file: `/var/lib/brokkr/phone-home-creds.json` (keys: `cipher`, `signature`)
- Fallback script: parsed from `/var/lib/cloud/scripts/per-boot/90-phone-home.sh` if the creds file is missing or incomplete
- Environment variables: `BROKKR_AUTH_CIPHER` and `BROKKR_AUTH_SIGNATURE` override any file-based values
Configuration
| Environment Variable | Default | Description |
|---|---|---|
| `BROKKR_API_URL` | `https://brokkr.hydrahost.com/api/v1` | API base URL |
| `BROKKR_AUTH_CIPHER` | (from creds file) | Authentication cipher |
| `BROKKR_AUTH_SIGNATURE` | (from creds file) | Authentication signature |
| `BROKKR_TIMEOUT` | `30` | HTTP request timeout in seconds |
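A wrapper script that needs the same settings could read them like this. The defaults mirror the table above; the `Settings` class, the `load_settings` helper, and the choice to fall back to the default on an invalid or non-positive timeout are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Mapping

DEFAULT_API_URL = "https://brokkr.hydrahost.com/api/v1"
DEFAULT_TIMEOUT = 30.0


@dataclass(frozen=True)
class Settings:
    api_url: str
    timeout: float


def load_settings(env: Mapping[str, str]) -> Settings:
    """Build settings from an environment mapping, applying documented defaults."""
    api_url = env.get("BROKKR_API_URL", DEFAULT_API_URL).rstrip("/")
    raw_timeout = env.get("BROKKR_TIMEOUT")
    try:
        timeout = float(raw_timeout) if raw_timeout is not None else DEFAULT_TIMEOUT
    except ValueError:
        timeout = DEFAULT_TIMEOUT  # assumption: ignore unparsable values
    if timeout <= 0:
        timeout = DEFAULT_TIMEOUT  # assumption: reject non-positive values
    return Settings(api_url=api_url, timeout=timeout)
```

Passing `os.environ` gives the live configuration; passing a plain dict makes the function easy to test.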
Optional System Tools
These tools enrich the output when available. Missing tools are detected and skipped; their absence never causes a crash.
| Tool | Package | Used by |
|---|---|---|
| `smartctl` | `smartmontools` | NVMe SMART health |
| `nvme` | `nvme-cli` | NVMe error & SMART logs |
| `ipmitool` | `ipmitool` | IPMI sensors & System Event Log |
| `sensors` | `lm-sensors` | CPU/board thermals |
| `lspci` | `pciutils` | PCIe topology & AER counters |
| `mdadm` | `mdadm` | MD RAID status |
Requirements
- Python >= 3.10
- Linux
- NVIDIA GPU with drivers loaded (for GPU diagnostics)
- Root/sudo for some operations (GPU reset, IPMI, dmesg)
License
MIT