NVIDIA GPU and InfiniBand diagnostics toolkit
Brokkr Diagnostics
A comprehensive diagnostic toolkit for NVIDIA GPU and InfiniBand infrastructure. Designed for data center operators, HPC administrators, and GPU cluster managers who need deep visibility into hardware health and configuration.
Features
- NVIDIA Driver Diagnostics - Version checks, module loading, ABI compatibility, installation logs
- GPU Hardware Analysis - GPU identification, memory, clocks, power states, thermal monitoring
- CUDA Environment - CUDA runtime, compute capability, library versions
- PCIe Topology - Bus enumeration, link speed/width validation, topology mapping
- Kernel Monitoring - systemd services, kernel logs, error detection, dmesg analysis
- System Information - NUMA topology, /proc filesystem analysis, memory configuration
- InfiniBand Diagnostics - IB device enumeration, port status, link state
- GPU Reset - Safe GPU reset functionality for stuck or hung devices
- Confidential Computing - Enable/disable NVIDIA Confidential Compute mode (requires root)
Installation
pip install brokkr-diagnostics
Usage
Interactive Mode
Launch the interactive menu to explore diagnostics:
brokkr-diagnostics
Command-Line Mode
Run specific diagnostics non-interactively:
# Check NVIDIA driver installation and versions
brokkr-diagnostics --run driver
# Analyze GPU hardware state
brokkr-diagnostics --run gpu
# Inspect PCIe topology
brokkr-diagnostics --run lspci
# Check systemd services
brokkr-diagnostics --run services
# Scan kernel logs for errors
brokkr-diagnostics --run kernel
# View system information (NUMA, /proc)
brokkr-diagnostics --run system
# InfiniBand diagnostics
brokkr-diagnostics --run ib
# CUDA environment checks
brokkr-diagnostics --run cuda
# Run all diagnostics
brokkr-diagnostics --run all
# Reset all GPUs (requires appropriate permissions)
brokkr-diagnostics --run reset
# Enable Confidential Compute mode (requires root)
sudo brokkr-diagnostics --run "cc on"
# Disable Confidential Compute mode (requires root)
sudo brokkr-diagnostics --run "cc off"
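The commands above can also be driven from a script, for example to run a fixed set of modules on every node. The following is a minimal sketch, not part of the toolkit: `build_command` and `run_modules` are hypothetical helper names, and because exit-code semantics are not documented here, the sketch only records each return code without interpreting it.

```python
import subprocess
from typing import Callable, Dict, Iterable, List


def build_command(module: str) -> List[str]:
    """Return the argv for one non-interactive run, e.g. --run gpu."""
    return ["brokkr-diagnostics", "--run", module]


def run_modules(
    modules: Iterable[str],
    runner: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> Dict[str, int]:
    """Run each module in order and map its name to the process return code.

    `runner` is injectable so the function can be exercised without the
    CLI or any GPU hardware being present.
    """
    results: Dict[str, int] = {}
    for module in modules:
        proc = runner(build_command(module), capture_output=True, text=True)
        results[module] = proc.returncode
    return results
```

On a machine with the package installed, `run_modules(["driver", "gpu", "ib"])` would invoke the three commands shown above one after another.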
Diagnostic Modules
Driver Diagnostics (driver)
Validates NVIDIA driver installation and configuration:
- nvidia-smi version and availability
- Driver version from /proc/driver/nvidia and nvidia-smi
- Kernel module version magic (vermagic) for ABI compatibility
- Installation/uninstallation logs
- DKMS compilation logs
- Loaded modules status (nvidia, nvidia_drm, nvidia_modeset, nvidia_uvm, nvidia_peermem)
- Module parameters and configuration
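To illustrate the vermagic check, the sketch below compares the kernel release recorded in a module's `vermagic` field against the running kernel. It is an assumption about how such a check could work, not the toolkit's code; `parse_vermagic` and `vermagic_matches` are hypothetical names, and the input is expected to be the text printed by `modinfo nvidia`.

```python
import re
from typing import Optional


def parse_vermagic(modinfo_text: str) -> Optional[str]:
    """Return the kernel release from the 'vermagic:' line, or None if absent."""
    match = re.search(r"^vermagic:\s+(\S+)", modinfo_text, flags=re.MULTILINE)
    return match.group(1) if match else None


def vermagic_matches(modinfo_text: str, kernel_release: str) -> bool:
    """True when the module was built for the given kernel release."""
    return parse_vermagic(modinfo_text) == kernel_release
```

In practice `kernel_release` would come from `platform.release()` (the same value as `uname -r`). A mismatch usually means the module was built against a different kernel than the one currently running.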
GPU Hardware (gpu)
Comprehensive GPU hardware state analysis:
- GPU enumeration and identification
- Memory capacity and utilization
- Clock speeds (graphics, memory, SM)
- Power state and consumption
- Temperature monitoring
- Fan speed
- PCIe generation and link width
- Compute mode
- Persistence mode
- ECC status and errors
- Performance state (P-state)
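Several of the items above can be read from `nvidia-smi` in CSV form. The sketch below is illustrative only and makes no claim about how the toolkit collects this data; `QUERY_FIELDS`, `build_query_command`, and `parse_query_output` are hypothetical names, and the field selection is an arbitrary subset.

```python
import csv
import io
from typing import Dict, List

# A subset of nvidia-smi --query-gpu fields; chosen for illustration.
QUERY_FIELDS = [
    "index",
    "name",
    "temperature.gpu",
    "memory.used",
    "memory.total",
    "pstate",
]


def build_query_command() -> List[str]:
    """Return an nvidia-smi argv that prints one CSV row per GPU."""
    return [
        "nvidia-smi",
        f"--query-gpu={','.join(QUERY_FIELDS)}",
        "--format=csv,noheader,nounits",
    ]


def parse_query_output(text: str) -> List[Dict[str, str]]:
    """Turn the CSV rows into dictionaries keyed by field name."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [
        dict(zip(QUERY_FIELDS, (cell.strip() for cell in row)))
        for row in reader
        if row
    ]
```

Values are kept as strings because some fields can report `[N/A]` on hardware that does not support them.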
PCIe Diagnostics (lspci)
PCIe bus topology and link analysis:
- Device enumeration on PCIe bus
- Link speed and width (current vs. maximum)
- Bus topology mapping
- Device class identification
- Vendor and device IDs
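The "current vs. maximum" comparison can be sketched by reading the `LnkCap` (capability) and `LnkSta` (status) lines that `lspci -vv` prints for a device. This is an assumed approach, not the toolkit's implementation; `parse_link` and `is_degraded` are hypothetical names, and the input is expected to be the block for a single device.

```python
import re
from typing import Dict, Tuple

# Matches "LnkCap:" and "LnkSta:" only; "LnkCap2:" and "LnkSta2:" are skipped
# because the colon must follow the name directly.
_LINK_RE = re.compile(r"Lnk(Cap|Sta):.*?Speed\s+([\d.]+)GT/s.*?Width\s+x(\d+)")


def parse_link(lspci_text: str) -> Dict[str, Tuple[float, int]]:
    """Return {'cap': (GT/s, width), 'sta': (GT/s, width)} for one device."""
    link: Dict[str, Tuple[float, int]] = {}
    for line in lspci_text.splitlines():
        match = _LINK_RE.search(line)
        if match:
            kind = "cap" if match.group(1) == "Cap" else "sta"
            link[kind] = (float(match.group(2)), int(match.group(3)))
    return link


def is_degraded(link: Dict[str, Tuple[float, int]]) -> bool:
    """True when the current speed or width is below the advertised maximum."""
    if "cap" not in link or "sta" not in link:
        return False
    (cap_speed, cap_width), (sta_speed, sta_width) = link["cap"], link["sta"]
    return sta_speed < cap_speed or sta_width < cap_width
```

A lower current speed is not always a fault, since GPUs may reduce link speed while idle, so a result like this is best confirmed under load.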
Service Status (services)
systemd service monitoring:
- NVIDIA persistence daemon
- Fabric manager
- GPU driver services
- Service health and uptime
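A service check of this kind can be approximated with `systemctl is-active`. The sketch below is an assumption, not the toolkit's code: `UNITS` and `unit_states` are hypothetical names, and the unit names are the ones NVIDIA's packages commonly install, which may differ on a given distribution.

```python
import subprocess
from typing import Callable, Dict, Iterable

# Assumed unit names; adjust to what is installed on the host.
UNITS = ["nvidia-persistenced", "nvidia-fabricmanager"]


def unit_states(
    units: Iterable[str],
    runner: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> Dict[str, str]:
    """Map each unit to the state printed by `systemctl is-active`.

    Typical values are 'active', 'inactive', and 'failed'. An empty
    reply is reported as 'unknown'.
    """
    states: Dict[str, str] = {}
    for unit in units:
        proc = runner(
            ["systemctl", "is-active", unit], capture_output=True, text=True
        )
        states[unit] = proc.stdout.strip() or "unknown"
    return states
```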
Kernel Logs (kernel)
Kernel message analysis:
- Recent kernel logs (dmesg)
- NVIDIA-related messages
- Error detection and filtering
- Hardware error events
- PCIe link errors
- GPU initialization logs
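One common form of NVIDIA error detection in kernel logs is looking for Xid events, which the driver reports on `NVRM: Xid` lines. The sketch below shows that idea under the assumption that the log text has already been captured (for example from `dmesg`); `XID_RE` and `find_xid_events` are hypothetical names, and the toolkit may filter differently.

```python
import re
from typing import List, Tuple

# Captures the PCI address and the numeric Xid code from lines such as
# "NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus."
XID_RE = re.compile(r"NVRM: Xid \((PCI:[0-9a-fA-F:.]+)\): (\d+)")


def find_xid_events(log_text: str) -> List[Tuple[str, int]]:
    """Return (pci_address, xid_code) for every Xid line, in log order."""
    events: List[Tuple[str, int]] = []
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match:
            events.append((match.group(1), int(match.group(2))))
    return events
```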
System Information (system)
Linux system configuration:
- NUMA topology
- CPU information from /proc/cpuinfo
- Memory configuration from /proc/meminfo
- Kernel version from /proc/version
- System uptime
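As an example of /proc analysis, the sketch below parses the text of `/proc/meminfo` into a dictionary. It is illustrative only; `parse_meminfo` is a hypothetical name. Most values are in kB, but a few entries such as `HugePages_Total` are plain counts, so the unit is deliberately not attached.

```python
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, int]:
    """Map each /proc/meminfo key to its leading integer value."""
    values: Dict[str, int] = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Key: value" line
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values
```

On a Linux host this could be called as `parse_meminfo(open("/proc/meminfo").read())`.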
InfiniBand (ib)
InfiniBand network diagnostics:
- IB device enumeration
- Port status and state
- Link layer information
- Physical state
- Rate and link speed
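On Linux, the port attributes listed above are exposed under `/sys/class/infiniband/<device>/ports/<port>/`. The sketch below reads them directly from sysfs; it is an assumed approach rather than the toolkit's own, and `read_port_status` and `port_is_active` are hypothetical names.

```python
from pathlib import Path
from typing import Dict


def read_port_status(root: str = "/sys/class/infiniband") -> Dict[str, Dict[str, str]]:
    """Return {'<device>/<port>': {'state', 'phys_state', 'rate'}} from sysfs.

    An empty dict is returned when the directory does not exist, for
    example on a host without InfiniBand drivers loaded.
    """
    status: Dict[str, Dict[str, str]] = {}
    base = Path(root)
    if not base.is_dir():
        return status
    for port_dir in sorted(base.glob("*/ports/*")):
        key = f"{port_dir.parent.parent.name}/{port_dir.name}"
        entry: Dict[str, str] = {}
        for attr in ("state", "phys_state", "rate"):
            attr_file = port_dir / attr
            entry[attr] = attr_file.read_text().strip() if attr_file.is_file() else ""
        status[key] = entry
    return status


def port_is_active(entry: Dict[str, str]) -> bool:
    """True when the logical state reads like '4: ACTIVE'."""
    return entry.get("state", "").endswith("ACTIVE")
```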
CUDA Diagnostics (cuda)
CUDA environment validation:
- CUDA runtime version
- Compute capability per GPU
- CUDA library paths
- nvcc compiler availability
- CUDA sample compilation tests
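Checking for the compiler and its release can be sketched as follows. This is an assumption about one way to do it, not the toolkit's implementation; `nvcc_available` and `parse_nvcc_release` are hypothetical names, and the parser expects the text printed by `nvcc --version`.

```python
import re
import shutil
from typing import Optional, Tuple


def nvcc_available() -> bool:
    """True when an nvcc executable is found on PATH."""
    return shutil.which("nvcc") is not None


def parse_nvcc_release(version_text: str) -> Optional[Tuple[int, int]]:
    """Extract (major, minor) from a line like '... release 12.4, V12.4.131'."""
    match = re.search(r"release (\d+)\.(\d+)", version_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

Note that the toolkit version reported by nvcc can differ from the highest CUDA version the installed driver supports, so the two are worth comparing separately.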
Requirements
- Python >= 3.8
- Linux operating system (Ubuntu, CentOS, RHEL, etc.)
- NVIDIA GPU (for GPU-related diagnostics)
- Root/sudo access (for some operations like GPU reset and Confidential Compute)
Dependencies
- numpy>=1.24.4
- nvidia-gpu-admin-tools>=2025.11.21
- termcolor>=2.4.0
- prompt_toolkit>=3.0.0
Use Cases
Data Center Operations
- Pre-deployment hardware validation
- Post-maintenance verification
- Troubleshooting GPU issues
- Configuration auditing
HPC Clusters
- Node health checks
- GPU allocation validation
- Performance baseline verification
- InfiniBand network validation
ML/AI Infrastructure
- Training environment validation
- Multi-GPU setup verification
- CUDA environment checks
- Memory and thermal monitoring
CI/CD Pipelines
- Automated hardware testing
- Regression detection
- Configuration compliance
License
MIT License
Author
Brokkr by Hydra Host
Contact: fabricio.policarpo@hydrahost.com
Contributing
Issues and pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
File details
Details for the file brokkr_diagnostics-0.3.9.tar.gz.
File metadata
- Download URL: brokkr_diagnostics-0.3.9.tar.gz
- Upload date:
- Size: 38.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.9.24 {"installer":{"name":"uv","version":"0.9.24","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f60b23487acaf4e5446ad40c336d77e37652bab15e8ce45d87ca6990a36114d3 |
| MD5 | 88a85b4ccc7bb45f944194e5a64e178e |
| BLAKE2b-256 | 0382395056a537ec5798fe96a3cd8914d01a20c14b2c9e3cc20a5c48181cec7d |
File details
Details for the file brokkr_diagnostics-0.3.9-py3-none-any.whl.
File metadata
- Download URL: brokkr_diagnostics-0.3.9-py3-none-any.whl
- Upload date:
- Size: 51.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.9.24 {"installer":{"name":"uv","version":"0.9.24","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 401cc7883cab19f9e6c7f52b519f48d1e5f9ad8568e9ac71631f8c77c0fb9490 |
| MD5 | 9d139e90b815d5818877328cc6b654c7 |
| BLAKE2b-256 | bbb2f9439b620b2a2a4384a3e775cac1dbfaf538dc98cb788cdaccef703a91cf |