GPU, storage, thermal & infrastructure diagnostics toolkit

Brokkr Diagnostics

Comprehensive hardware diagnostics for NVIDIA GPU nodes. Detects GPU health issues, PCIe degradation, NVLink errors, Fabric Manager problems, ECC/memory faults, kernel errors, storage/NVMe health, thermal/IPMI data, and InfiniBand status — with a unified health check that auto-detects machine topology and produces a single HEALTHY / DEGRADED / UNHEALTHY verdict.

Built for data center operators, HPC administrators, and GPU cluster managers.

Installation

pip install brokkr-diagnostics

Quick Start

# Machine health check — single pass/fail verdict
brokkr-diagnostics --run health

# Interactive REPL with tab completion
brokkr-diagnostics --interactive

# Run a specific diagnostic
brokkr-diagnostics --run gpu

# Run all diagnostics
brokkr-diagnostics --run all

# Run and send all diagnostics to Brokkr
brokkr-diagnostics --send

Health Check

The health command auto-detects your machine topology (PCIe-only, NVLink, or NVSwitch) and runs targeted checks to produce a single verdict:

brokkr-diagnostics --run health
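
As a rough illustration of the topology auto-detection, the classification can be thought of as a function of the NVLink and NVSwitch counts on the node. This is a minimal sketch, not the tool's actual logic: `classify_topology` is a hypothetical name, and the labels `"pcie"` and `"nvlink"` are assumptions (only `"nvswitch"` appears in this README's sample output).

```python
def classify_topology(nvlink_per_gpu: int, nvswitch_count: int) -> str:
    """Classify a node as PCIe-only, NVLink, or NVSwitch (illustrative only)."""
    if nvswitch_count > 0:
        # NVSwitch systems also have NVLink, so check switches first.
        return "nvswitch"
    if nvlink_per_gpu > 0:
        return "nvlink"
    return "pcie"


print(classify_topology(nvlink_per_gpu=12, nvswitch_count=6))
```

Checking for switches before links matters because an NVSwitch node would otherwise be classified as a plain NVLink node and skip the Fabric Manager checks.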

What it checks:

Category | Checks | When
--- | --- | ---
gpu_presence | All GPUs enumerable via NVML | Always
driver | Driver loaded, persistence mode | Always
memory_health | ECC errors, row remapping, pending repairs | Always
pcie | Link speed/width at max, zero fatal errors | Always
thermal_power | Temperature headroom, no HW throttling, violation time | Always
kernel_errors | No XID/SXid, OOM kills, GPU fallen off bus in dmesg | Always
p2p_communication | GPU-to-GPU memory copy works for all pairs | Multi-GPU
nvlink | All links active, zero CRC/replay errors | NVLink systems
fabric_manager | FM running, version matches driver, NVSwitch count | NVSwitch systems

Output:

  • HEALTHY — all checks pass
  • DEGRADED — warnings present (e.g., thermal headroom low, correctable errors)
  • UNHEALTHY — failures detected (e.g., PCIe x2 instead of x16, FM version mismatch, P2P broken)
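
The three verdicts above behave like a worst-status-wins reduction over the per-category results. The sketch below shows that idea only. The status strings `"warn"` and `"fail"` are assumptions (the sample output in this README only shows `"pass"`), and `overall_verdict` is a hypothetical helper, not part of the package.

```python
def overall_verdict(category_statuses) -> str:
    """Reduce per-category statuses to one verdict; the worst status wins."""
    statuses = set(category_statuses)
    if "fail" in statuses:  # assumed status string
        return "UNHEALTHY"
    if "warn" in statuses:  # assumed status string
        return "DEGRADED"
    return "HEALTHY"


print(overall_verdict(["pass", "warn", "pass"]))
```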

Known issues detection: When P2P failures are detected, the tool cross-references the GPU architecture against a known-issues database and suggests specific fixes (e.g., Blackwell UVM HMM workaround).

Commands

Command | Aliases | Description
--- | --- | ---
health | status, check | Unified machine health check (auto-detects topology)
driver | | NVIDIA driver version & installation checks
gpu | | GPU hardware: ECC, retired pages, remapped rows, thermals, power, clocks, PCIe link, utilization, violations, BAR1, architecture
nvlink | | NVLink status, error counters, topology, P2P status matrix, throughput measurement
lspci | pcie | PCIe topology, link status, AER error counters, ACS & IOMMU
kernel | logs | Kernel log analysis: XID/SXid, storage errors, IOMMU faults, OOM, hung tasks
services | | NVIDIA systemd service status
system | proc | System info: /proc files, NUMA topology
storage | nvme, disk | NVMe SMART, error logs, RAID, LVM, filesystem read-only detection
thermal | ipmi, sensors | CPU/board thermals (lm-sensors) & IPMI/BMC sensors + System Event Log
ib | | InfiniBand device/port status, error & traffic counters
cuda | cuda-tests | CUDA memory allocation & P2P bandwidth tests
fabric | fm, nvswitch | Fabric Manager health, NVSwitch enumeration, version validation
ecc | ras, memory-health | Deep ECC/RAS: per-location errors, row remapper histogram, RMA recommendation
gpu_debug | debug, hw-debug | Register-level GPU debug via gpu-admin-tools (requires root)
reset | | GPU reset sequence (requires root)

Usage Examples

Machine Health Check

brokkr-diagnostics --run health
{
  "overall_status": "HEALTHY",
  "topology": {
    "gpu_count": 8,
    "gpu_name": "NVIDIA A100-SXM4-40GB",
    "topology_type": "nvswitch",
    "nvlink_per_gpu": 12,
    "nvswitch_count": 6
  },
  "categories": [
    {"category": "gpu_presence", "status": "pass", "checks": [...]},
    {"category": "memory_health", "status": "pass", "checks": [...]},
    {"category": "p2p_communication", "status": "pass", "checks": [...]},
    {"category": "nvlink", "status": "pass", "checks": [...]},
    {"category": "fabric_manager", "status": "pass", "checks": [...]}
  ],
  "issue_count": 0,
  "issues": []
}
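
A report shaped like the one above is straightforward to post-process. The following sketch assumes the JSON has already been captured as text (how it is captured is not specified here) and lists the categories that did not pass. `failing_categories` is a hypothetical helper, and the `"fail"` status in the sample data is an assumption.

```python
import json


def failing_categories(report: dict) -> list[str]:
    """Return the names of categories whose status is anything but 'pass'."""
    return [
        entry["category"]
        for entry in report.get("categories", [])
        if entry.get("status") != "pass"
    ]


# Hand-written sample in the same shape as the output above.
sample = json.loads("""
{
  "overall_status": "UNHEALTHY",
  "categories": [
    {"category": "gpu_presence", "status": "pass", "checks": []},
    {"category": "pcie", "status": "fail", "checks": []}
  ],
  "issue_count": 1,
  "issues": []
}
""")

print(failing_categories(sample))
```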

GPU Hardware State

brokkr-diagnostics --run gpu

Per-GPU: ECC errors, retired pages, remapped rows, memory usage, power (draw/limits/default/violations), clock speeds with full 9-bit throttle reason decode, PCIe link gen/width with error counters, temperature with shutdown/slowdown thresholds, BAR1 memory, GPU utilization, performance state, architecture, MIG mode, accounting mode.
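
To make the throttle reason decode concrete: NVML reports clock throttle reasons as a bitmask, and decoding it means testing each bit. The sketch below is not the tool's code. The bit values follow the `nvmlClocksThrottleReason*` constants as commonly defined in NVML, so verify them against your installed `pynvml`, and the lowercase names are this example's own labels.

```python
# Bit values assumed to match NVML's nvmlClocksThrottleReason* constants.
THROTTLE_REASONS = {
    0x001: "gpu_idle",
    0x002: "applications_clocks_setting",
    0x004: "sw_power_cap",
    0x008: "hw_slowdown",
    0x010: "sync_boost",
    0x020: "sw_thermal_slowdown",
    0x040: "hw_thermal_slowdown",
    0x080: "hw_power_brake_slowdown",
    0x100: "display_clock_setting",
}


def decode_throttle_reasons(mask: int) -> list[str]:
    """Return the label of every throttle reason bit set in the mask."""
    return [name for bit, name in THROTTLE_REASONS.items() if mask & bit]


print(decode_throttle_reasons(0x048))
```

On a live system the mask would typically come from `pynvml.nvmlDeviceGetCurrentClocksThrottleReasons(handle)`.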

Deep ECC/RAS Diagnostics

brokkr-diagnostics --run ecc

Per-memory-location ECC counters (DRAM vs SRAM), row remapper spare-capacity histogram (Ampere+), pending repairs, InfoROM validation, and automated RMA recommendation.

NVLink Diagnostics

brokkr-diagnostics --run nvlink

Per-GPU NVLink link state, version, remote device type, error counters. P2P status matrix with NVLink/atomics capability per GPU pair. NVLink throughput delta measurement. Per-link version validation and NVSwitch link count consistency check.

Fabric Manager & NVSwitch

brokkr-diagnostics --run fabric

NVSwitch device enumeration (count, BIOS version, UUID), Fabric Manager service health, driver/FM version match validation, per-GPU fabric state, FM log error extraction, NVSwitch count validation against known platform configurations.

Register-Level GPU Debug

sudo brokkr-diagnostics --run gpu_debug

Uses nvidia-gpu-admin-tools for hardware register access (BAR0 MMIO). Detects broken/unresponsive GPUs, security faults, and NVLink training states (ACTIVE/TRAIN/FAULT/RCVY/SHUTDOWN) that are invisible to NVML. Reports physical module ID for SXM GPUs.

Kernel Log Analysis

brokkr-diagnostics --run kernel

Scans dmesg, journalctl, and log files for errors across domains: NVIDIA XID errors (enriched with severity and recommended action from a 40+ code lookup table), NVSwitch SXid errors, storage/NVMe I/O errors, IOMMU page faults, OOM kills, hung tasks, and soft/hard lockups.
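
For a sense of what the XID scan involves, here is a minimal sketch that extracts the PCI address and XID code from kernel log text. It assumes the usual `NVRM: Xid (PCI:...): <code>, ...` line format written by the NVIDIA driver. The regular expression and `find_xids` are this example's own, and the severity lookup table the tool uses is not reproduced.

```python
import re

# Assumed line shape: "NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, ..."
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+)")


def find_xids(log_text: str) -> list[tuple[str, int]]:
    """Return (pci_address, xid_code) for every XID line found."""
    return [(m.group(1), int(m.group(2))) for m in XID_RE.finditer(log_text)]


sample_log = (
    "[ 1234.567890] NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, "
    "GPU has fallen off the bus.\n"
    "[ 1240.000000] usb 1-1: new high-speed USB device\n"
)
print(find_xids(sample_log))
```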

PCIe Diagnostics

brokkr-diagnostics --run lspci

PCIe device enumeration, link speed/width degradation detection, AER error counters (correctable, fatal, nonfatal) from sysfs and lspci, ACS status on upstream bridges (P2P blocking), IOMMU status and GPU group mapping.
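
Link degradation can be spotted by comparing the current and maximum link attributes that Linux exposes per PCI device in sysfs. The sketch below shows that comparison under stated assumptions. `link_degraded` is a hypothetical helper and not the tool's implementation. Note that some GPUs lower their link speed when idle, so a speed mismatch alone is not always a fault.

```python
from pathlib import Path


def _read(path: Path) -> str:
    return path.read_text().strip()


def _speed_gts(text: str) -> float:
    # sysfs reports strings such as "16.0 GT/s PCIe"; keep the leading number.
    return float(text.split()[0])


def link_degraded(device_dir: Path) -> bool:
    """True if the link runs narrower or slower than the device maximum."""
    cur_width = int(_read(device_dir / "current_link_width"))
    max_width = int(_read(device_dir / "max_link_width"))
    cur_speed = _speed_gts(_read(device_dir / "current_link_speed"))
    max_speed = _speed_gts(_read(device_dir / "max_link_speed"))
    return cur_width < max_width or cur_speed < max_speed


# Example call (the device address is a placeholder):
# link_degraded(Path("/sys/bus/pci/devices/0000:3b:00.0"))
```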

Storage & NVMe Health

brokkr-diagnostics --run storage

NVMe SMART data (critical warnings, media errors, spare capacity, wear), error logs, MD RAID status, LVM layout, read-only filesystem detection.
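
Read-only filesystem detection can be approximated by scanning mount options for `ro`. The following sketch parses text in the `/proc/mounts` format. The filesystem type filter is this example's own assumption, added to skip mounts that are read-only by design, and `readonly_mounts` is a hypothetical helper.

```python
def readonly_mounts(mounts_text: str, fstypes=("ext4", "xfs", "btrfs")) -> list[str]:
    """Return mount points of the given filesystem types mounted read-only."""
    hits = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        _device, mount_point, fstype, options = fields[:4]
        if fstype in fstypes and "ro" in options.split(","):
            hits.append(mount_point)
    return hits


sample_mounts = (
    "/dev/nvme0n1p2 / ext4 rw,relatime 0 0\n"
    "/dev/nvme1n1 /data xfs ro,relatime 0 0\n"
    "proc /proc proc rw,nosuid 0 0\n"
)
print(readonly_mounts(sample_mounts))
```

On a live system the input would be the contents of `/proc/mounts`. A data filesystem that has flipped to read-only often points at an underlying I/O error.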

Thermal & IPMI

brokkr-diagnostics --run thermal

CPU/board thermals via lm-sensors, IPMI BMC sensor readings with threshold status, IPMI System Event Log for hardware fault history.
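
As an illustration of reading threshold status from IPMI, the sketch below filters sensor rows that are outside their normal range. It assumes the pipe-separated layout that `ipmitool sensor` typically prints (name, value, unit, status, then thresholds) and the status abbreviations `nc`, `cr`, and `nr`. Both are assumptions to check against your BMC's output, and `sensors_in_alert` is a hypothetical helper.

```python
# Assumed abbreviations: non-critical, critical, non-recoverable.
ALERT_STATES = {"nc", "cr", "nr"}


def sensors_in_alert(sensor_output: str) -> list[tuple[str, str, str]]:
    """Return (name, value, status) for sensors in an alert state."""
    alerts = []
    for line in sensor_output.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 4:
            continue  # not a sensor row
        name, value, _unit, status = fields[:4]
        if status in ALERT_STATES:
            alerts.append((name, value, status))
    return alerts


sample_output = (
    "CPU1 Temp   | 45.000 | degrees C | ok | 5.000 | 5.000 | 10.000 | 90.000 | 95.000 | 100.000\n"
    "Inlet Temp  | 48.000 | degrees C | nc | 5.000 | 5.000 | 10.000 | 45.000 | 50.000 | 55.000\n"
)
print(sensors_in_alert(sample_output))
```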

Sending Diagnostics to Brokkr

The --send (-s) flag runs diagnostics and POSTs the results to the Brokkr API for remote monitoring.

# Send all diagnostics
brokkr-diagnostics --send

# Send a specific diagnostic
brokkr-diagnostics --send gpu

Authentication

Credentials are resolved in this order:

  1. Creds file: /var/lib/brokkr/phone-home-creds.json (keys: cipher, signature)
  2. Fallback script: parsed from /var/lib/cloud/scripts/per-boot/90-phone-home.sh
  3. Environment variables: BROKKR_AUTH_CIPHER and BROKKR_AUTH_SIGNATURE
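
The resolution order above amounts to a first-match-wins chain. The sketch below illustrates it under stated assumptions. `resolve_credentials` is a hypothetical function, and because the format of the fallback script is not documented here, step 2 is represented by an optional caller-supplied `script_parser` hook rather than a real parser.

```python
import json
import os
from pathlib import Path

CREDS_FILE = Path("/var/lib/brokkr/phone-home-creds.json")


def resolve_credentials(creds_file=CREDS_FILE, script_parser=None, env=None):
    """Return (cipher, signature) from the first source that has both, else None."""
    env = os.environ if env is None else env

    # 1. Creds file with "cipher" and "signature" keys.
    try:
        data = json.loads(Path(creds_file).read_text())
    except (OSError, ValueError):
        data = None
    if isinstance(data, dict) and data.get("cipher") and data.get("signature"):
        return data["cipher"], data["signature"]

    # 2. Fallback script (hypothetical hook; real parsing is not shown).
    if script_parser is not None:
        parsed = script_parser()
        if parsed:
            return parsed

    # 3. Environment variables.
    cipher = env.get("BROKKR_AUTH_CIPHER")
    signature = env.get("BROKKR_AUTH_SIGNATURE")
    if cipher and signature:
        return cipher, signature
    return None
```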

Configuration

Environment Variable | Default | Description
--- | --- | ---
BROKKR_API_URL | https://brokkr.hydrahost.com/api/v1 | API base URL
BROKKR_AUTH_CIPHER | (from creds file) | Authentication cipher
BROKKR_AUTH_SIGNATURE | (from creds file) | Authentication signature
BROKKR_TIMEOUT | 30 | HTTP request timeout in seconds
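
The defaults in the table can be read as ordinary environment lookups with fallbacks. This is a sketch of that pattern only. `load_config` and the dictionary keys are hypothetical, and the package may parse these values differently.

```python
import os

DEFAULT_API_URL = "https://brokkr.hydrahost.com/api/v1"
DEFAULT_TIMEOUT_SECONDS = "30"


def load_config(env=None) -> dict:
    """Read the API URL and timeout from the environment, applying defaults."""
    env = os.environ if env is None else env
    return {
        "api_url": env.get("BROKKR_API_URL", DEFAULT_API_URL),
        "timeout": float(env.get("BROKKR_TIMEOUT", DEFAULT_TIMEOUT_SECONDS)),
    }


print(load_config({"BROKKR_TIMEOUT": "10"}))
```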

Optional System Tools

Missing tools are detected and the checks that depend on them are skipped; a missing tool never causes a crash.

Tool | Package | Used by
--- | --- | ---
smartctl | smartmontools | NVMe SMART health
nvme | nvme-cli | NVMe error & smart logs
ipmitool | ipmitool | IPMI sensors & System Event Log
sensors | lm-sensors | CPU/board thermals
lspci | pciutils | PCIe topology & AER counters
mdadm | mdadm | MD RAID status
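
To see in advance which optional tools are absent on a node, a lookup on `PATH` is enough. The sketch below uses the standard library's `shutil.which` with the tool and package names from the table. `missing_tools` is a hypothetical helper, not the package's own detection code.

```python
import shutil

# Executable name -> package that provides it (from the table above).
OPTIONAL_TOOLS = {
    "smartctl": "smartmontools",
    "nvme": "nvme-cli",
    "ipmitool": "ipmitool",
    "sensors": "lm-sensors",
    "lspci": "pciutils",
    "mdadm": "mdadm",
}


def missing_tools(tools=None) -> dict:
    """Return {tool: package} for every tool not found on PATH."""
    tools = OPTIONAL_TOOLS if tools is None else tools
    return {name: pkg for name, pkg in tools.items() if shutil.which(name) is None}


for tool, package in missing_tools().items():
    print(f"{tool} not found; install {package} to enable the related checks")
```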

Dependencies

Package | Purpose
--- | ---
nvidia-ml-py | NVML Python bindings (pynvml) for GPU metrics
nvidia-gpu-admin-tools | Register-level GPU access: broken GPU detection, NVLink training state
termcolor | Colored console output
prompt_toolkit | Interactive CLI
aiohttp | HTTP client for Brokkr API

Requirements

  • Python >= 3.10
  • Linux
  • NVIDIA GPU with drivers loaded
  • Root/sudo for: gpu_debug, reset, IPMI, dmesg

License

MIT
