GPU, storage, thermal & infrastructure diagnostics toolkit

Brokkr Diagnostics

Comprehensive hardware diagnostics for NVIDIA GPU nodes. Collects GPU state, PCIe health, NVLink topology, kernel errors, storage/NVMe health, thermal/IPMI data, and InfiniBand status. Output is structured JSON; diagnostics run asynchronously and degrade gracefully.

Built for data center operators, HPC administrators, and GPU cluster managers.

Installation

pip install brokkr-diagnostics

Quick Start

# Interactive REPL with tab completion
brokkr-diagnostics --interactive

# Run a specific diagnostic
brokkr-diagnostics --run gpu

# Run all diagnostics
brokkr-diagnostics --run all
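
The description mentions async execution with graceful degradation. The sketch below illustrates that general pattern only; it is not taken from the brokkr-diagnostics source. The collector functions, the `run_all` helper, and the report layout are all hypothetical.

```python
import asyncio
import json


async def collect_gpu():
    # Hypothetical collector: stands in for a real GPU query.
    await asyncio.sleep(0)
    return {"gpu_count": 8}


async def collect_ib():
    # Hypothetical collector that fails, e.g. no InfiniBand hardware present.
    raise FileNotFoundError("/sys/class/infiniband not found")


async def run_all(collectors):
    """Run collectors concurrently; record failures instead of raising."""
    names = list(collectors)
    results = await asyncio.gather(
        *(collectors[name]() for name in names), return_exceptions=True
    )
    report = {}
    for name, result in zip(names, results):
        if isinstance(result, Exception):
            report[name] = {"status": "error", "error": str(result)}
        else:
            report[name] = {"status": "ok", "data": result}
    return report


report = asyncio.run(run_all({"gpu": collect_gpu, "ib": collect_ib}))
print(json.dumps(report, indent=2))
```

With `return_exceptions=True`, one failing collector does not cancel the others, so the report still carries every result that could be gathered.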

Commands

| Command  | Aliases       | Description |
|----------|---------------|-------------|
| driver   |               | NVIDIA driver version & installation checks |
| gpu      |               | GPU hardware: ECC, retired pages, remapped rows, thermals, power, clocks, PCIe link, processes |
| nvlink   |               | NVLink status, error counters & topology |
| lspci    | pcie          | PCIe topology, link status, AER error counters, ACS & IOMMU |
| kernel   | logs          | Kernel log analysis: XID/SXid, storage errors, IOMMU faults, OOM, hung tasks |
| services |               | NVIDIA systemd service status |
| system   | proc          | System info: /proc files, NUMA topology |
| storage  | nvme, disk    | NVMe SMART, error logs, RAID, LVM, filesystem read-only detection |
| thermal  | ipmi, sensors | CPU/board thermals (lm-sensors) & IPMI/BMC sensors + System Event Log |
| ib       |               | InfiniBand device/port status, error & traffic counters |
| cuda     | cuda-tests    | CUDA memory allocation & P2P bandwidth tests |
| reset    |               | GPU reset sequence (requires root) |

Usage Examples

GPU Hardware State

brokkr-diagnostics --run gpu

Returns per-GPU: ECC error counts (volatile + aggregate), retired pages (SBE/DBE/pending), remapped rows (correctable/uncorrectable/pending/failure), memory usage, power draw vs limits, clock speeds, PCIe link gen/width, temperature (GPU + HBM), running processes.
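
For comparison, a few of these per-GPU values can be read by hand from `nvidia-smi`. The sketch below parses the CSV that `nvidia-smi --query-gpu=... --format=csv,noheader,nounits` prints. It is an illustration only: whether brokkr-diagnostics uses `nvidia-smi`, NVML, or something else is not stated here, and `parse_query_gpu` and the sample text are hypothetical.

```python
import csv
import io

# Query fields as accepted by `nvidia-smi --query-gpu=...`.
FIELDS = [
    "index", "temperature.gpu", "power.draw", "power.limit",
    "pcie.link.gen.current", "pcie.link.width.current",
]


def parse_query_gpu(csv_text, fields=FIELDS):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    rows = []
    for values in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if not values:
            continue
        row = {}
        for name, raw in zip(fields, values):
            try:
                row[name] = float(raw.strip())
            except ValueError:
                row[name] = None  # unsupported fields print e.g. "[N/A]"
        rows.append(row)
    return rows


# Made-up sample output for two GPUs.
SAMPLE = "0, 45, 72.31, 400.00, 4, 16\n1, 47, [N/A], 400.00, 4, 8\n"
gpus = parse_query_gpu(SAMPLE)
print(gpus[1])
```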

Kernel Log Analysis

brokkr-diagnostics --run kernel

Scans dmesg, journalctl, and log files for errors across domains: NVIDIA XID/SXid errors, NVSwitch failures, storage/NVMe I/O errors, IOMMU page faults, OOM kills, hung tasks, and soft/hard lockups. Each error is categorized by domain and severity.
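
Categorizing log lines by domain can be sketched with a small table of regular expressions. The patterns below match well-known kernel and NVIDIA driver message shapes, but the domain names, severity labels, and the `classify` helper are assumptions for illustration, not the toolkit's actual rules.

```python
import re

# (domain, severity, pattern); the severity labels are illustrative.
PATTERNS = [
    ("nvidia_xid", "error",
     re.compile(r"NVRM: Xid \((?P<device>[^)]+)\): (?P<code>\d+)")),
    ("oom", "error",
     re.compile(r"Out of memory: Killed process (?P<pid>\d+)")),
    ("hung_task", "warning",
     re.compile(r"INFO: task (?P<task>\S+) blocked for more than (?P<secs>\d+) seconds")),
    ("lockup", "error",
     re.compile(r"soft lockup - CPU#(?P<cpu>\d+)")),
]


def classify(line):
    """Return a dict describing the first matching pattern, or None."""
    for domain, severity, pattern in PATTERNS:
        match = pattern.search(line)
        if match:
            return {"domain": domain, "severity": severity, **match.groupdict()}
    return None


# Made-up log lines in the usual dmesg shape.
xid = classify("[12345.678] NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus.")
oom = classify("[1.0] Out of memory: Killed process 4242 (python) total-vm:1024kB")
hung = classify("[2.0] INFO: task kworker/u8:2:311 blocked for more than 120 seconds.")
print(xid)
```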

PCIe Diagnostics

brokkr-diagnostics --run lspci

Enumerates NVIDIA PCIe devices, checks link speed/width degradation, reads AER error counters (correctable, fatal, nonfatal) from sysfs, and checks ACS/IOMMU status for P2P readiness.
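
The kernel exposes AER counters as `aer_dev_correctable`, `aer_dev_fatal`, and `aer_dev_nonfatal` under `/sys/bus/pci/devices/<BDF>/`, and link state as `current_link_speed`, `max_link_speed`, `current_link_width`, and `max_link_width`. A minimal sketch of parsing those files follows; the helper names and sample values are hypothetical, and this is not necessarily how the toolkit does it.

```python
def parse_aer_counters(text):
    """Parse an aer_dev_* sysfs file ("Name count" per line) into a dict."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def link_degraded(current_speed, max_speed, current_width, max_width):
    """True if the link runs below its maximum speed or width.

    Speeds are sysfs strings such as "16.0 GT/s PCIe"; widths are lane counts.
    """
    def gts(value):
        return float(value.split()[0])

    return gts(current_speed) < gts(max_speed) or int(current_width) < int(max_width)


# Made-up contents of aer_dev_correctable.
SAMPLE = "RxErr 0\nBadTLP 3\nBadDLLP 0\nRollover 0\nTimeout 0\nTOTAL_ERR_COR 3\n"
aer = parse_aer_counters(SAMPLE)
print(aer["TOTAL_ERR_COR"], link_degraded("8.0 GT/s PCIe", "16.0 GT/s PCIe", "16", "16"))
```

A lower current speed is not always a fault: GPUs commonly drop the link generation at idle to save power, so a speed check is best read under load.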

Storage & NVMe Health

brokkr-diagnostics --run storage

Collects NVMe SMART data (critical warnings, media errors, spare capacity, wear), NVMe error logs, MD RAID array status, LVM layout, and detects read-only filesystem remounts.
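
One common way to spot read-only filesystems is to scan `/proc/mounts` for the `ro` option on disk-backed filesystem types. The sketch below assumes that approach; the `find_readonly_mounts` helper and the set of filesystem types are hypothetical choices, not the toolkit's documented behavior.

```python
# Filesystem types treated as "real" disks; an illustrative choice.
DISK_FS_TYPES = {"ext4", "xfs", "btrfs"}


def find_readonly_mounts(mounts_text, fs_types=DISK_FS_TYPES):
    """Return (device, mountpoint, fstype) for mounts carrying the `ro` option."""
    hits = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        device, mountpoint, fstype, options = fields[:4]
        # Compare whole options so "errors=remount-ro" does not count as "ro".
        if fstype in fs_types and "ro" in options.split(","):
            hits.append((device, mountpoint, fstype))
    return hits


# Made-up /proc/mounts contents.
SAMPLE = (
    "/dev/nvme0n1p2 / ext4 rw,relatime,errors=remount-ro 0 0\n"
    "/dev/nvme1n1 /data xfs ro,relatime 0 0\n"
    "sysfs /sys sysfs rw,nosuid 0 0\n"
    "/dev/loop0 /snap/core squashfs ro,nodev 0 0\n"
)
readonly = find_readonly_mounts(SAMPLE)
print(readonly)
```

This check alone cannot tell a deliberate read-only mount from an error-triggered remount; correlating it with kernel log entries would be needed for that.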

Thermal & IPMI

brokkr-diagnostics --run thermal

Reads CPU/board thermals via lm-sensors and collects IPMI BMC sensor readings (fans, temps, voltages, PSUs) with threshold status. Parses the IPMI System Event Log for hardware fault history.
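
As a rough illustration of threshold status, the sketch below parses the pipe-separated table printed by `ipmitool sensor` (name, reading, unit, status, then thresholds) and lists sensors whose status is neither `ok` nor `na`. The column layout is an assumption about typical `ipmitool` output, and the helper names and sample rows are hypothetical.

```python
def parse_ipmi_sensors(text):
    """Parse `ipmitool sensor` style rows into dicts (first four columns only)."""
    sensors = []
    for line in text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 4:
            continue
        name, raw, unit, status = fields[:4]
        try:
            reading = float(raw)
        except ValueError:
            reading = None  # "na" or a discrete value such as "0x0"
        sensors.append({"name": name, "reading": reading, "unit": unit, "status": status})
    return sensors


def abnormal(sensors):
    """Names of analog sensors whose status is neither "ok" nor "na"."""
    return [
        s["name"] for s in sensors
        if s["unit"] != "discrete" and s["status"] not in ("ok", "na")
    ]


# Made-up sample rows.
SAMPLE = """\
CPU1 Temp        | 45.000     | degrees C  | ok    | na        | na        | na        | 95.000    | 100.000   | na
FAN3             | 300.000    | RPM        | cr    | 300.000   | 500.000   | 700.000   | 25300.000 | 25400.000 | 25500.000
FAN5             | na         | RPM        | na    | na        | na        | na        | na        | na        | na
"""
sensors = parse_ipmi_sensors(SAMPLE)
print(abnormal(sensors))
```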

NVLink Topology

brokkr-diagnostics --run nvlink

Per-GPU NVLink link state, version, remote device type, and error counters (CRC flit, CRC data, replay, recovery). Includes GPU-pair topology mapping.

NVIDIA Driver

brokkr-diagnostics --run driver

Driver version, VBIOS versions, kernel module info (vermagic, parameters), /proc/driver/nvidia state, DKMS logs, installation logs.

InfiniBand

brokkr-diagnostics --run ib

IB device enumeration with firmware/hardware info, per-port state and rate, error counters, and traffic stats from sysfs.
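
On Linux, per-port InfiniBand counters live as one-integer files under `/sys/class/infiniband/<device>/ports/<port>/counters/`. A minimal sketch of reading such a directory follows; `read_counters` is a hypothetical helper, and the demo uses a temporary directory in place of real hardware.

```python
import tempfile
from pathlib import Path


def read_counters(counters_dir):
    """Read every file in a counters directory as an int; None if unreadable."""
    try:
        entries = sorted(Path(counters_dir).iterdir())
    except OSError:
        return {}  # directory missing: no InfiniBand port at this path
    counters = {}
    for entry in entries:
        try:
            counters[entry.name] = int(entry.read_text().strip())
        except (OSError, ValueError):
            counters[entry.name] = None
    return counters


# Demo against a fake counters directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "symbol_error").write_text("0\n")
    (Path(tmp) / "link_downed").write_text("2\n")
    demo = read_counters(tmp)
print(demo)
```

On a real node the argument would be a path such as `/sys/class/infiniband/<device>/ports/1/counters`, with the device name taken from the directory listing.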

GPU Reset

sudo brokkr-diagnostics --run reset

Kills GPU processes, stops NVIDIA services, unloads kernel modules, performs PCIe bus reset, then reloads everything.

Optional System Tools

These tools enhance the output when available. Missing tools are detected and skipped; their absence never causes a crash.

| Tool     | Package       | Used by |
|----------|---------------|---------|
| smartctl | smartmontools | NVMe SMART health |
| nvme     | nvme-cli      | NVMe error & smart logs |
| ipmitool | ipmitool      | IPMI sensors & System Event Log |
| sensors  | lm-sensors    | CPU/board thermals |
| lspci    | pciutils      | PCIe topology & AER counters |
| mdadm    | mdadm         | MD RAID status |
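
To check ahead of time which of these tools are on `PATH`, something like the following works. It is a standalone sketch using `shutil.which`; `detect_tools` is a hypothetical helper and does not reflect how the toolkit performs its own detection.

```python
import shutil

OPTIONAL_TOOLS = ["smartctl", "nvme", "ipmitool", "sensors", "lspci", "mdadm"]


def detect_tools(names=OPTIONAL_TOOLS, which=shutil.which):
    """Map each tool name to its resolved path, or None if not on PATH."""
    return {name: which(name) for name in names}


available = detect_tools()
missing = sorted(name for name, path in available.items() if path is None)
print("missing optional tools:", missing)
```

The `which` parameter exists only so the lookup can be swapped out in tests.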

Requirements

  • Python >= 3.10
  • Linux
  • NVIDIA GPU with drivers loaded (for GPU diagnostics)
  • Root/sudo for some operations (GPU reset, IPMI, dmesg)

License

MIT
