GPU, storage, thermal & infrastructure diagnostics toolkit

Brokkr Diagnostics

Comprehensive hardware diagnostics for NVIDIA GPU nodes. Collects GPU state, PCIe health, NVLink topology, kernel errors, storage/NVMe health, thermal/IPMI data, and InfiniBand status. Output is structured JSON, diagnostics run asynchronously, and the toolkit degrades gracefully when a data source is unavailable.

Built for data center operators, HPC administrators, and GPU cluster managers.

Installation

pip install brokkr-diagnostics

Quick Start

# Interactive REPL with tab completion
brokkr-diagnostics --interactive

# Run a specific diagnostic
brokkr-diagnostics --run gpu

# Run all diagnostics
brokkr-diagnostics --run all

# Run and send all diagnostics to Brokkr
brokkr-diagnostics --send

# Run and send a specific diagnostic to Brokkr
brokkr-diagnostics --send gpu
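
For scripted use, the CLI can be wrapped from Python. The sketch below assumes that `--run` prints a single JSON object on stdout, which the description implies but does not spell out. `build_command`, `parse_report`, and `run_diagnostic` are hypothetical helper names, not part of the package.

```python
import json
import subprocess


def build_command(name: str) -> list[str]:
    """Build the argv for one diagnostic, e.g. 'gpu' or 'all'."""
    return ["brokkr-diagnostics", "--run", name]


def parse_report(stdout: str) -> dict:
    """Parse CLI output, assuming it is one JSON object (an assumption)."""
    report = json.loads(stdout)
    if not isinstance(report, dict):
        raise ValueError("expected a JSON object at the top level")
    return report


def run_diagnostic(name: str, timeout: float = 300.0) -> dict:
    """Run the CLI and return the parsed report; raises on a non-zero exit."""
    proc = subprocess.run(
        build_command(name),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,
    )
    return parse_report(proc.stdout)
```

If the real output mixes log lines with JSON, `parse_report` would need to be adapted to extract the JSON portion first.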

Commands

Command  | Aliases       | Description
---------|---------------|------------
driver   | -             | NVIDIA driver version & installation checks
gpu      | -             | GPU hardware: ECC, retired pages, remapped rows, thermals, power, clocks, PCIe link, processes
nvlink   | -             | NVLink status, error counters & topology
lspci    | pcie          | PCIe topology, link status, AER error counters, ACS & IOMMU
kernel   | logs          | Kernel log analysis: XID/SXid, storage errors, IOMMU faults, OOM, hung tasks
services | -             | NVIDIA systemd service status
system   | proc          | System info: /proc files, NUMA topology
storage  | nvme, disk    | NVMe SMART, error logs, RAID, LVM, filesystem read-only detection, disk usage warnings
thermal  | ipmi, sensors | CPU/board thermals (lm-sensors) & IPMI/BMC sensors + System Event Log
ib       | -             | InfiniBand device/port status, error & traffic counters
cuda     | cuda-tests    | CUDA memory allocation & P2P bandwidth tests
reset    | -             | GPU reset sequence (requires root)

Usage Examples

GPU Hardware State

brokkr-diagnostics --run gpu

Returns per-GPU: ECC error counts (volatile + aggregate), retired pages (SBE/DBE/pending), remapped rows (correctable/uncorrectable/pending/failure), memory usage, power draw vs limits, clock speeds, PCIe link gen/width, temperature (GPU + HBM), running processes.

Kernel Log Analysis

brokkr-diagnostics --run kernel

Scans dmesg, journalctl, and log files for errors across domains: NVIDIA XID/SXid errors, NVSwitch failures, storage/NVMe I/O errors, IOMMU page faults, OOM kills, hung tasks, and soft/hard lockups. Each error is categorized by domain and severity.
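
As a rough illustration of the XID part of this scan, the sketch below pulls the PCI address and XID code out of kernel log lines. The regular expression targets the common `NVRM: Xid (PCI:...): <code>,` line shape; the exact message text varies by driver version, and `find_xids` is a hypothetical name, not the toolkit's own parser.

```python
import re

# Common NVRM line shape; the trailing message differs between driver versions.
XID_RE = re.compile(r"NVRM: Xid \(PCI:(?P<bdf>[0-9a-fA-F:.]+)\): (?P<xid>\d+),")


def find_xids(lines):
    """Return (pci_address, xid_code) for each line that contains an Xid event."""
    events = []
    for line in lines:
        match = XID_RE.search(line)
        if match:
            events.append((match.group("bdf"), int(match.group("xid"))))
    return events
```

The same function can be fed the output of `dmesg` or `journalctl -k` split into lines.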

PCIe Diagnostics

brokkr-diagnostics --run lspci

Enumerates NVIDIA PCIe devices, checks link speed/width degradation, reads AER error counters (correctable, fatal, nonfatal) from sysfs, and checks ACS/IOMMU status for P2P readiness.
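
The Linux kernel exposes AER counters per device in the sysfs files `aer_dev_correctable`, `aer_dev_nonfatal`, and `aer_dev_fatal`, each holding `<name> <count>` lines. A minimal sketch of reading them follows; the function names are hypothetical, and whether the toolkit parses the files this way is an assumption.

```python
from pathlib import Path

AER_FILES = ("aer_dev_correctable", "aer_dev_nonfatal", "aer_dev_fatal")


def parse_aer(text: str) -> dict[str, int]:
    """Parse '<counter name> <count>' lines into a dict, skipping malformed lines."""
    counters = {}
    for line in text.splitlines():
        parts = line.rsplit(None, 1)
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def read_aer(bdf: str, sysfs_root: str = "/sys/bus/pci/devices") -> dict[str, dict[str, int]]:
    """Read all three AER files for one device; a missing file yields an empty dict."""
    result = {}
    for name in AER_FILES:
        path = Path(sysfs_root) / bdf / name
        try:
            result[name] = parse_aer(path.read_text())
        except OSError:
            # File absent (AER unsupported) or unreadable: degrade to no data.
            result[name] = {}
    return result
```

`bdf` is the full PCI address as it appears under sysfs, for example `0000:3b:00.0`.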

Storage & NVMe Health

brokkr-diagnostics --run storage

Collects NVMe SMART data (critical warnings, media errors, spare capacity, wear), NVMe error logs, MD RAID array status, LVM layout, detects read-only filesystem remounts, and warns when any filesystem reaches 90%+ disk usage.
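
The 90% disk usage warning can be pictured as a simple threshold check. The sketch below uses `shutil.disk_usage` from the standard library and computes usage as used/total; the toolkit's actual formula and the helper names here are assumptions.

```python
import shutil


def usage_fraction(total: int, used: int) -> float:
    """Fraction of the filesystem in use; 0.0 for a zero-sized filesystem."""
    return used / total if total > 0 else 0.0


def over_threshold(mountpoints, threshold=0.90, disk_usage=shutil.disk_usage):
    """Return (mountpoint, percent_used) for each mount at or above the threshold."""
    warnings = []
    for mountpoint in mountpoints:
        usage = disk_usage(mountpoint)
        fraction = usage_fraction(usage.total, usage.used)
        if fraction >= threshold:
            warnings.append((mountpoint, round(fraction * 100, 1)))
    return warnings
```

The `disk_usage` parameter exists only so the check can be exercised without touching real filesystems.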

Thermal & IPMI

brokkr-diagnostics --run thermal

Reads CPU/board thermals via lm-sensors and collects IPMI BMC sensor readings (fans, temps, voltages, PSUs) with threshold status. Parses the IPMI System Event Log for hardware fault history.

NVLink Topology

brokkr-diagnostics --run nvlink

Per-GPU NVLink link state, version, remote device type, and error counters (CRC flit, CRC data, replay, recovery). Includes GPU-pair topology mapping.

NVIDIA Driver

brokkr-diagnostics --run driver

Driver version, VBIOS versions, kernel module info (vermagic, parameters), /proc/driver/nvidia state, DKMS logs, installation logs.

InfiniBand

brokkr-diagnostics --run ib

IB device enumeration with firmware/hardware info, per-port state and rate, error counters, and traffic stats from sysfs.
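
On Linux, per-port InfiniBand state and rate are exposed as text files under `/sys/class/infiniband/<device>/ports/<port>/`, with contents such as `4: ACTIVE` and `100 Gb/sec (4X EDR)`. The sketch below parses those two formats; that the toolkit reads exactly these files is an assumption, and the function names are hypothetical.

```python
import re

STATE_RE = re.compile(r"^\s*(\d+):\s*(\S+)")
RATE_RE = re.compile(r"^\s*([\d.]+)\s*Gb/sec")


def parse_port_state(text: str):
    """Parse a sysfs 'state' value like '4: ACTIVE' into (4, 'ACTIVE'), or None."""
    match = STATE_RE.match(text)
    return (int(match.group(1)), match.group(2)) if match else None


def parse_port_rate_gbps(text: str):
    """Parse a sysfs 'rate' value like '100 Gb/sec (4X EDR)' into 100.0, or None."""
    match = RATE_RE.match(text)
    return float(match.group(1)) if match else None
```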

GPU Reset

sudo brokkr-diagnostics --run reset

Kills GPU processes, stops NVIDIA services, unloads kernel modules, performs a PCIe bus reset, then reloads the modules and restarts the services.

Sending Diagnostics to Brokkr

The --send (-s) flag runs diagnostics and POSTs the results to the Brokkr API. This is used for remote monitoring and automated health reporting.

# Send all diagnostics
brokkr-diagnostics --send

# Send a specific diagnostic
brokkr-diagnostics --send gpu

When sending all diagnostics, commands that require root or are synchronous-only (e.g., reset) are automatically skipped.

Authentication

Credentials are resolved in this order:

  1. Creds file: /var/lib/brokkr/phone-home-creds.json (keys: cipher, signature)
  2. Fallback script: parsed from /var/lib/cloud/scripts/per-boot/90-phone-home.sh if the creds file is missing or incomplete
  3. Environment variables: BROKKR_AUTH_CIPHER and BROKKR_AUTH_SIGNATURE override any file-based values
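
The resolution order can be sketched as follows. This is an illustration only: `resolve_credentials` is a hypothetical name, and the fallback-script step is left out because the script's format is not described here.

```python
import json
import os
from pathlib import Path

CREDS_PATH = "/var/lib/brokkr/phone-home-creds.json"


def resolve_credentials(creds_path=CREDS_PATH, env=None):
    """Return {'cipher': ..., 'signature': ...}; a value is None if unresolved."""
    env = os.environ if env is None else env
    creds = {"cipher": None, "signature": None}

    # Step 1: creds file, tolerating a missing or malformed file.
    try:
        data = json.loads(Path(creds_path).read_text())
        if isinstance(data, dict):
            for key in creds:
                creds[key] = data.get(key) or None
    except (OSError, ValueError):
        pass

    # Step 2 (fallback script) is omitted: its format is not documented here.

    # Step 3: environment variables override file-based values when set.
    for key, var in (("cipher", "BROKKR_AUTH_CIPHER"),
                     ("signature", "BROKKR_AUTH_SIGNATURE")):
        if env.get(var):
            creds[key] = env[var]
    return creds
```

Passing `env` explicitly makes the override behaviour easy to check without changing the process environment.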

Configuration

Environment Variable  | Default                             | Description
----------------------|-------------------------------------|------------
BROKKR_API_URL        | https://brokkr.hydrahost.com/api/v1 | API base URL
BROKKR_AUTH_CIPHER    | (from creds file)                   | Authentication cipher
BROKKR_AUTH_SIGNATURE | (from creds file)                   | Authentication signature
BROKKR_TIMEOUT        | 30                                  | HTTP request timeout in seconds
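
A minimal sketch of how these two non-secret settings and their defaults might be read. `SendConfig` and `load_config` are hypothetical names, and falling back to the default on an unparsable timeout is an assumption about behaviour, not documented fact.

```python
import os
from dataclasses import dataclass

DEFAULT_API_URL = "https://brokkr.hydrahost.com/api/v1"
DEFAULT_TIMEOUT = 30.0


@dataclass(frozen=True)
class SendConfig:
    api_url: str
    timeout: float


def load_config(env=None) -> SendConfig:
    """Read BROKKR_API_URL and BROKKR_TIMEOUT, applying the documented defaults."""
    env = os.environ if env is None else env
    api_url = env.get("BROKKR_API_URL", DEFAULT_API_URL).rstrip("/")
    raw_timeout = env.get("BROKKR_TIMEOUT", "")
    try:
        timeout = float(raw_timeout) if raw_timeout else DEFAULT_TIMEOUT
    except ValueError:
        # Assumed behaviour: ignore an invalid value and keep the default.
        timeout = DEFAULT_TIMEOUT
    return SendConfig(api_url=api_url, timeout=timeout)
```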

Optional System Tools

These tools enhance the output when available. Missing tools are detected and the corresponding checks are skipped; a missing tool never causes a crash.

Tool     | Package       | Used by
---------|---------------|--------
smartctl | smartmontools | NVMe SMART health
nvme     | nvme-cli      | NVMe error & smart logs
ipmitool | ipmitool      | IPMI sensors & System Event Log
sensors  | lm-sensors    | CPU/board thermals
lspci    | pciutils      | PCIe topology & AER counters
mdadm    | mdadm         | MD RAID status
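
To check ahead of time which of these tools a node has, a lookup on PATH with `shutil.which` is enough. The sketch below is a standalone pre-flight check, not the toolkit's own detection code; the function names are hypothetical.

```python
import shutil

# Tool name -> package that provides it, as listed in the table above.
OPTIONAL_TOOLS = {
    "smartctl": "smartmontools",
    "nvme": "nvme-cli",
    "ipmitool": "ipmitool",
    "sensors": "lm-sensors",
    "lspci": "pciutils",
    "mdadm": "mdadm",
}


def detect_tools(which=shutil.which) -> dict[str, bool]:
    """Map each optional tool to True if it is found on PATH."""
    return {tool: which(tool) is not None for tool in OPTIONAL_TOOLS}


def missing_packages(which=shutil.which) -> list[str]:
    """Sorted list of packages whose tool is not on PATH."""
    return sorted(pkg for tool, pkg in OPTIONAL_TOOLS.items() if which(tool) is None)
```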

Requirements

  • Python >= 3.10
  • Linux
  • NVIDIA GPU with drivers loaded (for GPU diagnostics)
  • Root/sudo for some operations (GPU reset, IPMI, dmesg)

License

MIT
