GPU, storage, thermal & infrastructure diagnostics toolkit

Brokkr Diagnostics

Comprehensive hardware diagnostics for NVIDIA GPU nodes. Collects GPU state, PCIe health, NVLink topology, kernel errors, storage/NVMe health, thermal/IPMI data, and InfiniBand status. Output is structured JSON, execution is asynchronous, and the toolkit degrades gracefully when an optional tool is missing.
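
As an illustration of asynchronous execution with graceful degradation, here is a minimal sketch. It is not the project's actual code: the collector functions, the `run_all` helper, and the `status`/`reason` report keys are all hypothetical names chosen for this example.

```python
import asyncio
import json


async def collect_gpu():
    # Stand-in for a real collector; returns a small result dict.
    return {"gpu_count": 8}


async def collect_ipmi():
    # Simulates a collector whose backing tool is not installed.
    raise FileNotFoundError("ipmitool not found")


async def run_all(collectors):
    """Run every collector concurrently; a failure marks only that entry."""
    names = list(collectors)
    results = await asyncio.gather(
        *(collectors[name]() for name in names),
        return_exceptions=True,  # exceptions come back as values, not raised
    )
    report = {}
    for name, result in zip(names, results):
        if isinstance(result, Exception):
            report[name] = {"status": "skipped", "reason": str(result)}
        else:
            report[name] = {"status": "ok", "data": result}
    return report


if __name__ == "__main__":
    report = asyncio.run(run_all({"gpu": collect_gpu, "ipmi": collect_ipmi}))
    print(json.dumps(report, indent=2))
```

With `return_exceptions=True`, one failing collector does not cancel the others, so the report always has an entry per collector.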

Built for data center operators, HPC administrators, and GPU cluster managers.

Installation

pip install brokkr-diagnostics

Quick Start

# Interactive REPL with tab completion
brokkr-diagnostics --interactive

# Run a specific diagnostic
brokkr-diagnostics --run gpu

# Run all diagnostics
brokkr-diagnostics --run all

# Run and send all diagnostics to Brokkr
brokkr-diagnostics --send

# Run and send a specific diagnostic to Brokkr
brokkr-diagnostics --send gpu

Commands

Command    Aliases         Description
driver     -               NVIDIA driver version & installation checks
gpu        -               GPU hardware: ECC, retired pages, remapped rows, thermals, power, clocks, PCIe link, processes
nvlink     -               NVLink status, error counters & topology
lspci      pcie            PCIe topology, link status, AER error counters, ACS & IOMMU
kernel     logs            Kernel log analysis: XID/SXid, storage errors, IOMMU faults, OOM, hung tasks
services   -               NVIDIA systemd service status
system     proc            System info: /proc files, NUMA topology
storage    nvme, disk      NVMe SMART, error logs, RAID, LVM, filesystem read-only detection
thermal    ipmi, sensors   CPU/board thermals (lm-sensors) & IPMI/BMC sensors + System Event Log
ib         -               InfiniBand device/port status, error & traffic counters
cuda       cuda-tests      CUDA memory allocation & P2P bandwidth tests
reset      -               GPU reset sequence (requires root)

Usage Examples

GPU Hardware State

brokkr-diagnostics --run gpu

Returns per-GPU: ECC error counts (volatile + aggregate), retired pages (SBE/DBE/pending), remapped rows (correctable/uncorrectable/pending/failure), memory usage, power draw vs limits, clock speeds, PCIe link gen/width, temperature (GPU + HBM), running processes.
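
To make the "power draw vs limits" comparison concrete, the sketch below parses the CSV that `nvidia-smi --query-gpu=index,temperature.gpu,power.draw,power.limit --format=csv,noheader,nounits` prints. This is an assumption-laden illustration: brokkr-diagnostics may gather these values another way (for example through NVML), and the function and key names here are hypothetical.

```python
def _num(field):
    """Convert a CSV field to float; non-numeric values such as [N/A] become None."""
    try:
        return float(field)
    except ValueError:
        return None


def parse_gpu_csv(text):
    """Parse 'index, temperature, power draw, power limit' rows, one GPU per line."""
    gpus = []
    for line in text.strip().splitlines():
        index, temp, draw, limit = (f.strip() for f in line.split(","))
        draw_w, limit_w = _num(draw), _num(limit)
        gpus.append({
            "index": int(index),
            "temperature_c": _num(temp),
            "power_draw_w": draw_w,
            "power_limit_w": limit_w,
            # Fraction of the power limit in use; None when either value is missing.
            "power_fraction": draw_w / limit_w if draw_w is not None and limit_w else None,
        })
    return gpus


if __name__ == "__main__":
    sample = "0, 45, 250.00, 400.00\n1, 50, [N/A], 400.00\n"
    for gpu in parse_gpu_csv(sample):
        print(gpu)
```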

Kernel Log Analysis

brokkr-diagnostics --run kernel

Scans dmesg, journalctl, and log files for errors across domains: NVIDIA XID/SXid errors, NVSwitch failures, storage/NVMe I/O errors, IOMMU page faults, OOM kills, hung tasks, and soft/hard lockups. Each error is categorized by domain and severity.
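
A rough sketch of categorizing kernel log lines by domain follows. The regular expressions match message formats the Linux kernel and NVIDIA driver commonly emit, but the pattern list, domain names, and `classify` function are assumptions for illustration, not the toolkit's actual rules, and severity scoring is left out.

```python
import re

# (domain, pattern) pairs, checked in order; the first match wins.
PATTERNS = [
    ("gpu_xid", re.compile(r"NVRM: Xid \(PCI:(?P<bdf>[0-9a-fA-F:.]+)\): (?P<code>\d+)")),
    ("oom", re.compile(r"Out of memory: Killed process (?P<pid>\d+)")),
    ("hung_task", re.compile(r"task (?P<task>\S+) blocked for more than (?P<secs>\d+) seconds")),
    ("soft_lockup", re.compile(r"soft lockup - CPU#(?P<cpu>\d+)")),
]


def classify(line):
    """Return a dict with the domain and captured fields, or None if nothing matches."""
    for domain, pattern in PATTERNS:
        match = pattern.search(line)
        if match:
            return {"domain": domain, **match.groupdict()}
    return None


if __name__ == "__main__":
    lines = [
        "[ 123.4] NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus.",
        "[ 200.0] INFO: task python:4242 blocked for more than 120 seconds.",
        "[ 201.0] usb 1-1: new high-speed USB device",
    ]
    for line in lines:
        print(classify(line))
```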

PCIe Diagnostics

brokkr-diagnostics --run lspci

Enumerates NVIDIA PCIe devices, checks link speed/width degradation, reads AER error counters (correctable, fatal, nonfatal) from sysfs, and checks ACS/IOMMU status for P2P readiness.
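
The link and AER checks can be sketched against the standard Linux sysfs attributes (`current_link_speed`, `max_link_speed`, `current_link_width`, `max_link_width`, and the `aer_dev_*` files under `/sys/bus/pci/devices/<address>/`). The function names and returned keys are hypothetical, and the sketch does not handle a link speed reported as "Unknown".

```python
from pathlib import Path


def parse_aer_counters(text):
    """Parse 'NAME COUNT' lines as found in aer_dev_correctable and similar files."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def link_status(device_dir):
    """Compare current and maximum PCIe link speed/width for one device directory."""
    dev = Path(device_dir)

    def read(name):
        return (dev / name).read_text().strip()

    status = {
        # Speed files look like '16.0 GT/s PCIe'; keep only the leading number.
        "speed_now": float(read("current_link_speed").split()[0]),
        "speed_max": float(read("max_link_speed").split()[0]),
        "width_now": int(read("current_link_width")),
        "width_max": int(read("max_link_width")),
    }
    status["degraded"] = (
        status["speed_now"] < status["speed_max"]
        or status["width_now"] < status["width_max"]
    )
    return status
```

A call such as `link_status("/sys/bus/pci/devices/0000:3b:00.0")` would inspect one device (the address is only an example). GPUs often lower their link speed while idle to save power, so a reduced speed is best confirmed under load; a reduced width is the stronger signal.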

Storage & NVMe Health

brokkr-diagnostics --run storage

Collects NVMe SMART data (critical warnings, media errors, spare capacity, wear), NVMe error logs, MD RAID array status, LVM layout, and detects read-only filesystem remounts.
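
One plausible way to spot read-only remounts is to scan `/proc/mounts` for block-backed filesystems mounted with the `ro` option. The sketch below assumes that approach; the filesystem-type filter and function name are hypothetical, and a mount that is intentionally read-only would also be listed.

```python
# Filesystem types treated as "should normally be writable" in this sketch.
BLOCK_FS_TYPES = frozenset({"ext4", "xfs", "btrfs"})


def read_only_mounts(mounts_text, fs_types=BLOCK_FS_TYPES):
    """Return mounts of the given types whose option list contains 'ro'."""
    hits = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip malformed lines
        device, mountpoint, fstype, options = fields[:4]
        if fstype in fs_types and "ro" in options.split(","):
            hits.append({"device": device, "mountpoint": mountpoint, "fstype": fstype})
    return hits


if __name__ == "__main__":
    with open("/proc/mounts") as handle:
        for hit in read_only_mounts(handle.read()):
            print(hit)
```

Filtering by type keeps image-style mounts such as squashfs, which are always read-only, out of the result.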

Thermal & IPMI

brokkr-diagnostics --run thermal

Reads CPU/board thermals via lm-sensors and collects IPMI BMC sensor readings (fans, temps, voltages, PSUs) with threshold status. Parses the IPMI System Event Log for hardware fault history.
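
For a sense of what "sensor readings with threshold status" involves, this sketch parses the pipe-delimited rows printed by `ipmitool sensor` (name, value, unit, status, then thresholds). Whether brokkr-diagnostics uses this exact subcommand is an assumption, and the function names are hypothetical.

```python
def parse_ipmi_sensors(text):
    """Parse 'name | value | unit | status | ...' rows into dicts."""
    sensors = []
    for line in text.splitlines():
        cols = [c.strip() for c in line.split("|")]
        if len(cols) < 4:
            continue  # not a sensor row
        name, value, unit, status = cols[:4]
        try:
            reading = float(value)
        except ValueError:
            reading = None  # 'na' or a discrete value such as 0x0
        sensors.append({"name": name, "reading": reading, "unit": unit, "status": status})
    return sensors


def abnormal(sensors):
    """Sensors with a numeric reading whose status is anything other than 'ok'."""
    return [s for s in sensors if s["reading"] is not None and s["status"] != "ok"]


if __name__ == "__main__":
    sample = (
        "CPU1 Temp | 45.000 | degrees C | ok | na | 5.000 | 10.000 | 85.000 | 90.000 | 95.000\n"
        "FAN3 | 300.000 | RPM | cr | 300.000 | 500.000 | 700.000 | na | na | na\n"
    )
    print(abnormal(parse_ipmi_sensors(sample)))
```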

NVLink Topology

brokkr-diagnostics --run nvlink

Per-GPU NVLink link state, version, remote device type, and error counters (CRC flit, CRC data, replay, recovery). Includes GPU-pair topology mapping.

NVIDIA Driver

brokkr-diagnostics --run driver

Driver version, VBIOS versions, kernel module info (vermagic, parameters), /proc/driver/nvidia state, DKMS logs, installation logs.

InfiniBand

brokkr-diagnostics --run ib

IB device enumeration with firmware/hardware info, per-port state and rate, error counters, and traffic stats from sysfs.
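
As a sketch of reading per-port data from sysfs, the code below walks one port directory of the form `/sys/class/infiniband/<device>/ports/<port>/`, which on Linux contains `state`, `rate`, and a `counters/` directory with one integer per file. The `read_port` function and the short list of error counters are hypothetical choices for this example.

```python
from pathlib import Path

# A few of the standard error counters; the full set is larger.
ERROR_COUNTERS = ("symbol_error", "link_downed", "port_rcv_errors")


def read_port(port_dir):
    """Collect state, rate, and counters for a single InfiniBand port."""
    port = Path(port_dir)
    counters = {}
    for entry in sorted((port / "counters").iterdir()):
        try:
            counters[entry.name] = int(entry.read_text().strip())
        except (OSError, ValueError):
            continue  # unreadable or non-numeric counter: skip it
    return {
        "state": (port / "state").read_text().strip(),
        "rate": (port / "rate").read_text().strip(),
        "counters": counters,
        # Only the error counters that are actually non-zero.
        "errors": {k: v for k, v in counters.items() if k in ERROR_COUNTERS and v > 0},
    }
```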

GPU Reset

sudo brokkr-diagnostics --run reset

Kills GPU processes, stops NVIDIA services, unloads kernel modules, performs PCIe bus reset, then reloads everything.

Sending Diagnostics to Brokkr

The --send (-s) flag runs diagnostics and POSTs the results to the Brokkr API. This is used for remote monitoring and automated health reporting.

# Send all diagnostics
brokkr-diagnostics --send

# Send a specific diagnostic
brokkr-diagnostics --send gpu

When sending all diagnostics, commands that require root or are synchronous-only (e.g., reset) are automatically skipped.

Authentication

Credentials are resolved in this order:

  1. Creds file: /var/lib/brokkr/phone-home-creds.json (keys: cipher, signature)
  2. Fallback script: parsed from /var/lib/cloud/scripts/per-boot/90-phone-home.sh if the creds file is missing or incomplete
  3. Environment variables: BROKKR_AUTH_CIPHER and BROKKR_AUTH_SIGNATURE override any file-based values
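
The resolution order above can be sketched as follows. This is an illustration under stated assumptions, not the package's code: `resolve_credentials` is a hypothetical name, the fallback script's format is not documented here so it is represented by a caller-supplied `fallback` callable, and the sketch assumes the fallback only fills keys the creds file lacks.

```python
import json
import os
from pathlib import Path

KEYS = ("cipher", "signature")
ENV_NAMES = {"cipher": "BROKKR_AUTH_CIPHER", "signature": "BROKKR_AUTH_SIGNATURE"}


def resolve_credentials(creds_path, env=None, fallback=None):
    """Resolve credentials: creds file, then fallback, then env overrides."""
    env = os.environ if env is None else env
    creds = {}

    # 1. Creds file, if present and valid JSON.
    path = Path(creds_path)
    if path.is_file():
        try:
            data = json.loads(path.read_text())
        except (OSError, json.JSONDecodeError):
            data = {}
        if isinstance(data, dict):
            creds = {k: data[k] for k in KEYS if data.get(k)}

    # 2. Fallback (hypothetical callable) only when the file was missing or incomplete.
    if len(creds) < len(KEYS) and fallback is not None:
        for key, value in fallback().items():
            if key in KEYS:
                creds.setdefault(key, value)

    # 3. Environment variables override any file-based values.
    for key, name in ENV_NAMES.items():
        if env.get(name):
            creds[key] = env[name]
    return creds
```

Passing `env` explicitly keeps the function easy to test without touching the real process environment.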

Configuration

Environment Variable     Default                               Description
BROKKR_API_URL           https://brokkr.hydrahost.com/api/v1   API base URL
BROKKR_AUTH_CIPHER       (from creds file)                     Authentication cipher
BROKKR_AUTH_SIGNATURE    (from creds file)                     Authentication signature
BROKKR_TIMEOUT           30                                    HTTP request timeout in seconds
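
A small sketch of how these variables and defaults might be read is shown below. The `load_config` function and its returned keys are hypothetical; only the variable names and default values come from the table above.

```python
import os


def load_config(env=None):
    """Read Brokkr settings from the environment, applying documented defaults."""
    env = os.environ if env is None else env
    return {
        # Trailing slash removed so paths can be appended consistently.
        "api_url": env.get("BROKKR_API_URL", "https://brokkr.hydrahost.com/api/v1").rstrip("/"),
        # Timeout in seconds; raises ValueError if the variable is not numeric.
        "timeout": float(env.get("BROKKR_TIMEOUT", "30")),
    }


if __name__ == "__main__":
    print(load_config())
```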

Optional System Tools

These tools enhance the output when available. Missing tools are detected and skipped; the toolkit does not crash when one is absent.

Tool       Package         Used by
smartctl   smartmontools   NVMe SMART health
nvme       nvme-cli        NVMe error & smart logs
ipmitool   ipmitool        IPMI sensors & System Event Log
sensors    lm-sensors      CPU/board thermals
lspci      pciutils        PCIe topology & AER counters
mdadm      mdadm           MD RAID status
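
Detecting optional tools before use can be as simple as a PATH lookup. The sketch below uses the tool and package names from the table; the function names are hypothetical and the toolkit's own detection may differ.

```python
import shutil

# Tool name -> package that provides it, taken from the table above.
OPTIONAL_TOOLS = {
    "smartctl": "smartmontools",
    "nvme": "nvme-cli",
    "ipmitool": "ipmitool",
    "sensors": "lm-sensors",
    "lspci": "pciutils",
    "mdadm": "mdadm",
}


def detect_tools(tools=OPTIONAL_TOOLS):
    """Map each tool name to True if an executable with that name is on PATH."""
    return {name: shutil.which(name) is not None for name in tools}


def missing_packages(tools=OPTIONAL_TOOLS):
    """Sorted list of packages to install for the tools that were not found."""
    found = detect_tools(tools)
    return sorted({pkg for name, pkg in tools.items() if not found[name]})


if __name__ == "__main__":
    print(detect_tools())
    print("Consider installing:", ", ".join(missing_packages()) or "nothing")
```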

Requirements

  • Python >= 3.10
  • Linux
  • NVIDIA GPU with drivers loaded (for GPU diagnostics)
  • Root/sudo for some operations (GPU reset, IPMI, dmesg)

License

MIT
