
NVIDIA GPU and InfiniBand diagnostics toolkit


Brokkr Diagnostics

Brokkr Diagnostics is a diagnostic toolkit for NVIDIA GPU and InfiniBand infrastructure. It is designed for data center operators, HPC administrators, and GPU cluster managers who need deep visibility into hardware health and configuration.

Features

  • NVIDIA Driver Diagnostics - Version checks, module loading, ABI compatibility, installation logs
  • GPU Hardware Analysis - GPU identification, memory, clocks, power states, thermal monitoring
  • CUDA Environment - CUDA runtime, compute capability, library versions
  • PCIe Topology - Bus enumeration, link speed/width validation, topology mapping
  • Kernel Monitoring - systemd services, kernel logs, error detection, dmesg analysis
  • System Information - NUMA topology, /proc filesystem analysis, memory configuration
  • InfiniBand Diagnostics - IB device enumeration, port status, link state
  • GPU Reset - Safe GPU reset functionality for stuck or hung devices
  • Confidential Computing - Enable/disable NVIDIA Confidential Compute mode (requires root)

Installation

pip install brokkr-diagnostics

Usage

Interactive Mode

Launch the interactive menu to explore diagnostics:

brokkr-diagnostics

Command-Line Mode

Run specific diagnostics non-interactively:

# Check NVIDIA driver installation and versions
brokkr-diagnostics --run driver

# Analyze GPU hardware state
brokkr-diagnostics --run gpu

# Inspect PCIe topology
brokkr-diagnostics --run lspci

# Check systemd services
brokkr-diagnostics --run services

# Scan kernel logs for errors
brokkr-diagnostics --run kernel

# View system information (NUMA, /proc)
brokkr-diagnostics --run system

# InfiniBand diagnostics
brokkr-diagnostics --run ib

# CUDA environment checks
brokkr-diagnostics --run cuda

# Run all diagnostics
brokkr-diagnostics --run all

# Reset all GPUs (may require root/sudo; see Requirements)
brokkr-diagnostics --run reset

# Enable Confidential Compute mode (requires root)
sudo brokkr-diagnostics --run "cc on"

# Disable Confidential Compute mode (requires root)
sudo brokkr-diagnostics --run "cc off"
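
For scripted use, the commands above can be wrapped from Python. This is a minimal sketch that assumes only the `--run <module>` syntax shown above. The helper names (`build_command`, `run_module`) and the 300-second timeout are hypothetical, and since the tool's exit codes and output format are not documented here, the wrapper simply returns whatever the process reports.

```python
import subprocess
from typing import List

# Module names taken from the command list above.
MODULES = ["driver", "gpu", "lspci", "services", "kernel", "system", "ib", "cuda"]
# Commands that the list above marks as requiring root.
ROOT_COMMANDS = {"cc on", "cc off"}


def build_command(module: str) -> List[str]:
    """Build the argv for one non-interactive run (hypothetical helper)."""
    argv = ["brokkr-diagnostics", "--run", module]
    return ["sudo"] + argv if module in ROOT_COMMANDS else argv


def run_module(module: str, timeout: float = 300.0) -> subprocess.CompletedProcess:
    """Run one module and capture its output as text (timeout is an assumption)."""
    return subprocess.run(
        build_command(module), capture_output=True, text=True, timeout=timeout
    )
```

Destructive commands such as `reset` are deliberately left out of `MODULES` so that a loop over the list stays read-only.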

Diagnostic Modules

Driver Diagnostics (driver)

Validates NVIDIA driver installation and configuration:

  • nvidia-smi version and availability
  • Driver version from /proc/driver/nvidia and nvidia-smi
  • Kernel module version magic (vermagic) for ABI compatibility
  • Installation/uninstallation logs
  • DKMS compilation logs
  • Loaded modules status (nvidia, nvidia_drm, nvidia_modeset, nvidia_uvm, nvidia_peermem)
  • Module parameters and configuration
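
Two of the checks above can be approximated with plain text parsing. The sketch below is not the toolkit's implementation. It assumes the usual layout of `/proc/driver/nvidia/version` (an `NVRM version:` line containing the driver version) and of `/proc/modules` (module name as the first field). The function names are hypothetical.

```python
import re
from typing import Dict, Iterable, Optional

# Module names from the list above.
EXPECTED_MODULES = ("nvidia", "nvidia_drm", "nvidia_modeset", "nvidia_uvm", "nvidia_peermem")


def parse_driver_version(proc_version_text: str) -> Optional[str]:
    """Extract the driver version from /proc/driver/nvidia/version text."""
    for line in proc_version_text.splitlines():
        if line.startswith("NVRM version:"):
            # First dotted number on the line, e.g. 535.129.03.
            match = re.search(r"\b(\d+\.\d+(?:\.\d+)?)\b", line)
            return match.group(1) if match else None
    return None


def loaded_modules(
    proc_modules_text: str, expected: Iterable[str] = EXPECTED_MODULES
) -> Dict[str, bool]:
    """Map each expected module name to whether it appears in /proc/modules text."""
    present = {line.split()[0] for line in proc_modules_text.splitlines() if line.strip()}
    return {name: name in present for name in expected}
```

Both functions take text rather than paths, so they can be exercised against captured files from another machine.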

GPU Hardware (gpu)

Comprehensive GPU hardware state analysis:

  • GPU enumeration and identification
  • Memory capacity and utilization
  • Clock speeds (graphics, memory, SM)
  • Power state and consumption
  • Temperature monitoring
  • Fan speed
  • PCIe generation and link width
  • Compute mode
  • Persistence mode
  • ECC status and errors
  • Performance state (P-state)
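
Several of these values are also available from `nvidia-smi` in CSV form. The following sketch shows one way to collect them. It is an illustration, not the toolkit's method: the field selection is an assumption, and `parse_gpu_csv` and `query_gpus` are hypothetical names.

```python
import csv
import io
import subprocess
from typing import Dict, List

# A subset of nvidia-smi --query-gpu fields matching the list above.
QUERY_FIELDS = [
    "index", "name", "temperature.gpu", "power.draw",
    "memory.used", "memory.total", "pstate",
]


def parse_gpu_csv(text: str) -> List[Dict[str, str]]:
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [
        dict(zip(QUERY_FIELDS, (cell.strip() for cell in row)))
        for row in reader
        if row  # skip blank lines
    ]


def query_gpus() -> List[Dict[str, str]]:
    """Run nvidia-smi and return the parsed rows (raises if nvidia-smi fails)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=" + ",".join(QUERY_FIELDS),
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_csv(out)
```

Values are kept as strings because `nvidia-smi` can report non-numeric placeholders for fields a given GPU does not support.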

PCIe Diagnostics (lspci)

PCIe bus topology and link analysis:

  • Device enumeration on PCIe bus
  • Link speed and width (current vs. maximum)
  • Bus topology mapping
  • Device class identification
  • Vendor and device IDs
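
The "current vs. maximum" link comparison can be illustrated with the PCI attributes Linux exposes in sysfs. This sketch reads those files directly instead of parsing `lspci` output, which may differ from how the toolkit does it. The function names are hypothetical.

```python
from pathlib import Path
from typing import Dict, Union


def gts(speed: str) -> float:
    """Parse the leading number from a sysfs speed string such as '16.0 GT/s PCIe'."""
    return float(speed.split()[0])


def compare_link(cur_speed: str, max_speed: str, cur_width: int, max_width: int) -> Dict[str, bool]:
    """Report whether the current link speed and width reach the device maximum."""
    return {
        "speed_at_max": gts(cur_speed) >= gts(max_speed),
        "width_at_max": cur_width >= max_width,
    }


def read_link(bdf: str, root: Union[str, Path] = "/sys/bus/pci/devices") -> Dict[str, bool]:
    """Read the four link attributes for one device, e.g. bdf='0000:3b:00.0'."""
    dev = Path(root) / bdf

    def read(name: str) -> str:
        return (dev / name).read_text().strip()

    return compare_link(
        read("current_link_speed"), read("max_link_speed"),
        int(read("current_link_width")), int(read("max_link_width")),
    )
```

A reduced speed is not always a fault: GPUs commonly lower their link speed while idle to save power, so a width mismatch is usually the stronger signal.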

Service Status (services)

systemd service monitoring:

  • NVIDIA persistence daemon
  • Fabric manager
  • GPU driver services
  • Service health and uptime
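
A comparable check can be scripted with `systemctl is-active`. In the sketch below, the unit names are assumptions based on NVIDIA's usual packaging, and the toolkit's own list may differ. The `run` parameter exists only so the function can be tested without systemd.

```python
import subprocess
from typing import Callable, Dict, Iterable

# Assumed unit names; adjust to what is installed on the host.
DEFAULT_UNITS = ("nvidia-persistenced.service", "nvidia-fabricmanager.service")


def service_states(
    units: Iterable[str] = DEFAULT_UNITS, run: Callable = subprocess.run
) -> Dict[str, str]:
    """Return the `systemctl is-active` state string for each unit."""
    states = {}
    for unit in units:
        # is-active exits non-zero for inactive or failed units,
        # so check=True is deliberately not used here.
        result = run(["systemctl", "is-active", unit], capture_output=True, text=True)
        states[unit] = result.stdout.strip() or "unknown"
    return states
```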

Kernel Logs (kernel)

Kernel message analysis:

  • Recent kernel logs (dmesg)
  • NVIDIA-related messages
  • Error detection and filtering
  • Hardware error events
  • PCIe link errors
  • GPU initialization logs
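
As an illustration of NVIDIA-specific filtering, the sketch below extracts Xid events, the error reports the NVIDIA driver writes to the kernel log. The patterns the toolkit actually uses are not documented here, so the regular expression and the names `XidEvent` and `find_xid_events` are assumptions.

```python
import re
from typing import List, NamedTuple

# Matches lines such as:
#   NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus.
XID_RE = re.compile(r"NVRM: Xid \((PCI:[0-9a-fA-F:.]+)\): (\d+)")


class XidEvent(NamedTuple):
    pci: str   # PCI address as printed by the driver
    code: int  # Xid number
    line: str  # full log line for context


def find_xid_events(log_text: str) -> List[XidEvent]:
    """Return every Xid event found in dmesg-style text, in log order."""
    events = []
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match:
            events.append(XidEvent(match.group(1), int(match.group(2)), line.strip()))
    return events
```

The input can come from `dmesg` or `journalctl -k`, both of which may need elevated privileges on hardened systems.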

System Information (system)

Linux system configuration:

  • NUMA topology
  • CPU information from /proc/cpuinfo
  • Memory configuration from /proc/meminfo
  • Kernel version from /proc/version
  • System uptime
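
The `/proc` and NUMA sources above are plain text and directories, so they are straightforward to read. This is a minimal sketch under that assumption. The function names are hypothetical, and the NUMA lookup relies on the standard `/sys/devices/system/node/nodeN` layout.

```python
import re
from pathlib import Path
from typing import Dict, List, Union


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo text; values with a kB unit are converted to bytes."""
    values = {}
    for line in text.splitlines():
        match = re.match(r"(\S+):\s+(\d+)(?:\s+(kB))?", line)
        if match:
            number = int(match.group(2))
            values[match.group(1)] = number * 1024 if match.group(3) else number
    return values


def numa_nodes(root: Union[str, Path] = "/sys/devices/system/node") -> List[int]:
    """Return the NUMA node ids present under the sysfs node directory."""
    return sorted(int(p.name[4:]) for p in Path(root).glob("node[0-9]*"))
```

Unitless entries such as `HugePages_Total` are counts, which is why they are returned unchanged.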

InfiniBand (ib)

InfiniBand network diagnostics:

  • IB device enumeration
  • Port status and state
  • Link layer information
  • Physical state
  • Rate and link speed
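
The port attributes listed above correspond to files under `/sys/class/infiniband/<device>/ports/<n>/`. The sketch below reads them directly. It illustrates the data source, not the toolkit's implementation, and the function names are hypothetical.

```python
from pathlib import Path
from typing import Dict, Tuple, Union

PORT_ATTRS = ("state", "phys_state", "rate", "link_layer")


def parse_state(text: str) -> Tuple[int, str]:
    """Split a sysfs value such as '4: ACTIVE' into (4, 'ACTIVE')."""
    number, _, name = text.strip().partition(":")
    return int(number), name.strip()


def port_is_up(state: str, phys_state: str) -> bool:
    """A port is usable when it is logically ACTIVE and physically LinkUp."""
    return parse_state(state)[1] == "ACTIVE" and parse_state(phys_state)[1] == "LinkUp"


def read_ports(root: Union[str, Path] = "/sys/class/infiniband") -> Dict[str, Dict[str, str]]:
    """Return the raw attribute strings for every port, keyed as 'device/port'."""
    ports = {}
    for port_dir in sorted(Path(root).glob("*/ports/*")):
        key = f"{port_dir.parent.parent.name}/{port_dir.name}"
        ports[key] = {name: (port_dir / name).read_text().strip() for name in PORT_ATTRS}
    return ports
```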

CUDA Diagnostics (cuda)

CUDA environment validation:

  • CUDA runtime version
  • Compute capability per GPU
  • CUDA library paths
  • nvcc compiler availability
  • CUDA sample compilation tests
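
The compiler availability check can be sketched as a `PATH` lookup followed by parsing `nvcc --version`. The names below are hypothetical, and the pattern assumes the usual `release X.Y` wording in nvcc's version banner.

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_nvcc_release(version_text: str) -> Optional[str]:
    """Extract 'X.Y' from a line such as 'Cuda compilation tools, release 12.2, V12.2.140'."""
    match = re.search(r"release (\d+\.\d+)", version_text)
    return match.group(1) if match else None


def nvcc_release() -> Optional[str]:
    """Return the CUDA toolkit release reported by nvcc, or None if nvcc is not on PATH."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return None
    out = subprocess.run([nvcc, "--version"], capture_output=True, text=True).stdout
    return parse_nvcc_release(out)
```

A missing `nvcc` does not mean CUDA is unusable: runtime-only installations ship the libraries without the compiler.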

Requirements

  • Python >= 3.8
  • Linux operating system (Ubuntu, CentOS, RHEL, etc.)
  • NVIDIA GPU (for GPU-related diagnostics)
  • Root/sudo access (for some operations like GPU reset and Confidential Compute)

Dependencies

  • numpy>=1.24.4
  • nvidia-gpu-admin-tools>=2025.11.21
  • termcolor>=2.4.0
  • prompt_toolkit>=3.0.0

Use Cases

Data Center Operations

  • Pre-deployment hardware validation
  • Post-maintenance verification
  • Troubleshooting GPU issues
  • Configuration auditing

HPC Clusters

  • Node health checks
  • GPU allocation validation
  • Performance baseline verification
  • InfiniBand network validation

ML/AI Infrastructure

  • Training environment validation
  • Multi-GPU setup verification
  • CUDA environment checks
  • Memory and thermal monitoring

CI/CD Pipelines

  • Automated hardware testing
  • Regression detection
  • Configuration compliance

License

MIT License

Author

Brokkr by Hydra Host
Contact: fabricio.policarpo@hydrahost.com

Contributing

Issues and pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
