A fast, asynchronous GPU monitoring tool for multiple machines through SSH
SSH GPU Monitor 🖥️
A fast, asynchronous GPU monitoring tool that provides real-time status of NVIDIA GPUs across multiple machines through SSH, with support for jump hosts and per-machine credentials.
✨ Features
- Real-time Monitoring: Live updates of GPU status across multiple machines
- Asynchronous Operation: Fast, non-blocking checks using `asyncio` and `asyncssh`
- Jump Host Support: Access machines behind a bastion/jump host
- Rich Display: Beautiful terminal UI using the `rich` library
- Flexible Configuration:
  - YAML-based configuration
  - Per-machine SSH credentials
  - Pattern-based target generation
- Robust Error Handling: Graceful handling of network issues and timeouts
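The asynchronous checks and timeout handling listed above can be illustrated with a minimal sketch. It does not use the project's code or `asyncssh`: `check_host` is a hypothetical stand-in that sleeps instead of opening an SSH connection, and `check_all` is an assumed name. It only shows the general pattern of running all checks concurrently with `asyncio.gather` while bounding each one with `asyncio.wait_for`.

```python
import asyncio


async def check_host(host: str, delay: float) -> str:
    """Hypothetical stand-in for an SSH query; sleeps instead of connecting."""
    await asyncio.sleep(delay)
    return f"{host}: ok"


async def check_all(hosts: dict[str, float], timeout: float) -> dict[str, str]:
    """Query every host concurrently; a slow host becomes a status string."""

    async def guarded(host: str, delay: float) -> str:
        try:
            return await asyncio.wait_for(check_host(host, delay), timeout)
        except asyncio.TimeoutError:
            return f"{host}: timed out"

    # gather preserves input order, so results line up with the host names.
    results = await asyncio.gather(*(guarded(h, d) for h, d in hosts.items()))
    return dict(zip(hosts, results))


if __name__ == "__main__":
    # The second host is deliberately slower than the timeout.
    statuses = asyncio.run(
        check_all({"gpu-server1": 0.01, "gpu-server2": 1.0}, timeout=0.1)
    )
    for line in statuses.values():
        print(line)
```

Because each check is wrapped individually, one unreachable machine delays the overall result by at most the timeout and does not affect the others.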
🚀 Quick Start
- Install the package or clone the repository:
pip install ssh-gpu-monitor
- Create a basic configuration file (`config/config.yaml`):
ssh:
  username: "your_username"
  key_path: "~/.ssh/id_rsa"
  jump_host: "jump.example.com"
  timeout: 10
targets:
  individual:
    - "gpu-server1"
    - "gpu-server2"
display:
  refresh_rate: 5
- Run the monitor:
python main.py
📖 Configuration
Basic Structure
ssh:
  username: "default_user"       # Default username
  key_path: "~/.ssh/id_rsa"      # Default SSH key
  jump_host: "jump.example.com"
  timeout: 10                    # seconds
targets:
  # Individual machines
  individual:
    - host: "gpu-server1"
      username: "different_user"      # Optional override
      key_path: "~/.ssh/special_key"  # Optional override
    - "gpu-server2"                   # Uses default credentials
  # Pattern-based groups
  patterns:
    - prefix: "gpu"
      start: 1
      end: 30
      format: "{prefix}{number:02}"  # Results in gpu01, gpu02, etc.
      username: "gpu_user"           # Optional override
      key_path: "~/.ssh/gpu_key"     # Optional override
display:
  refresh_rate: 5  # seconds
debug:
  enabled: false
  log_dir: "logs"
  log_file: "gpu_checker.log"
  log_max_size: 1048576  # 1MB
  log_backup_count: 3
Command Line Options
Override any configuration option via command line:
# Enable debug logging
python main.py --debug.enabled
# Override SSH settings
python main.py --ssh.username=other_user --ssh.key_path=~/.ssh/other_key
# Check specific targets
python main.py --targets gpu01 gpu02 special-server
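The dotted option names mirror the nesting of the YAML file. The README does not show how the tool parses them, so the following is only a sketch of one way such overrides could be mapped onto a nested configuration. `parse_overrides` and `apply_override` are hypothetical names, and in this simplified version every value given with `=` stays a string.

```python
from typing import Any


def parse_overrides(args: list[str]) -> dict[str, Any]:
    """Turn ["--ssh.username=bob", "--debug.enabled"] into {dotted_key: value}."""
    overrides: dict[str, Any] = {}
    for arg in args:
        if not arg.startswith("--"):
            continue  # positional values (e.g. target names) are ignored here
        key, sep, value = arg[2:].partition("=")
        overrides[key] = value if sep else True  # a bare flag means "enabled"
    return overrides


def apply_override(config: dict[str, Any], dotted_key: str, value: Any) -> None:
    """Set config["a"]["b"] = value for dotted_key "a.b", creating levels as needed."""
    *parents, leaf = dotted_key.split(".")
    node = config
    for name in parents:
        node = node.setdefault(name, {})
    node[leaf] = value


config = {"ssh": {"username": "default_user", "timeout": 10}, "debug": {"enabled": False}}
for key, value in parse_overrides(["--ssh.username=other_user", "--debug.enabled"]).items():
    apply_override(config, key, value)
print(config)
```

A real implementation would also need to convert values to the types the configuration expects (for example `timeout` to an integer) and to collect the list that follows `--targets`.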
🔧 Advanced Usage
Custom Target Patterns
Generate targets using patterns:
patterns:
  - prefix: "compute"
    start: 1
    end: 100
    format: "{prefix}-{number:03d}"  # compute-001, compute-002, etc.
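The `format` value uses Python's format-string syntax, so a pattern entry can be expanded with `str.format`. The sketch below shows the idea; `expand_pattern` is a hypothetical helper, not the project's function, and it assumes that `end` is inclusive, which the README does not state.

```python
def expand_pattern(pattern: dict) -> list[str]:
    """Expand one pattern entry into host names.

    Assumption: `end` is inclusive; the README does not say either way.
    """
    fmt = pattern.get("format", "{prefix}{number}")  # assumed fallback format
    return [
        fmt.format(prefix=pattern["prefix"], number=n)
        for n in range(pattern["start"], pattern["end"] + 1)
    ]


hosts = expand_pattern(
    {"prefix": "compute", "start": 1, "end": 100, "format": "{prefix}-{number:03d}"}
)
print(hosts[0], hosts[-1], len(hosts))  # compute-001 compute-100 100
```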
Per-Machine Credentials
Specify different credentials for specific machines:
individual:
  - host: "special-gpu"
    username: "admin"
    key_path: "~/.ssh/admin_key"
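An `individual` entry can be either a bare host name or a mapping with overrides. As a rough illustration of how the two forms might be merged with the defaults from the `ssh` section, consider the sketch below. `resolve_target` is a hypothetical helper, and the precedence rule (per-machine values win) is inferred from the "Optional override" comments in the configuration.

```python
def resolve_target(entry, ssh_defaults: dict) -> dict:
    """Merge one `individual` entry with the default SSH settings."""
    if isinstance(entry, str):
        entry = {"host": entry}  # bare host name: no overrides
    resolved = {
        "username": ssh_defaults["username"],
        "key_path": ssh_defaults["key_path"],
    }
    resolved.update(entry)  # per-machine keys win over the defaults
    return resolved


defaults = {"username": "default_user", "key_path": "~/.ssh/id_rsa"}
print(resolve_target("gpu-server2", defaults))
print(
    resolve_target(
        {"host": "special-gpu", "username": "admin", "key_path": "~/.ssh/admin_key"},
        defaults,
    )
)
```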
Debug Logging
Enable detailed logging for troubleshooting:
debug:
  enabled: true
  log_dir: "logs"
  log_file: "debug.log"
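The `log_max_size` and `log_backup_count` keys suggest size-based log rotation. The following sketch shows how a `debug` section like this could be wired to the standard library's `RotatingFileHandler`. It is an assumption about the approach, not the project's code; `build_logger` and the logger name are hypothetical, and the fallback values are taken from the configuration example earlier in this document.

```python
import logging
from logging.handlers import RotatingFileHandler
from pathlib import Path


def build_logger(debug_cfg: dict) -> logging.Logger:
    """Create a logger from a `debug` config section (hypothetical helper)."""
    logger = logging.getLogger("gpu_monitor_sketch")  # hypothetical logger name
    for old in list(logger.handlers):  # avoid duplicate handlers on repeat calls
        logger.removeHandler(old)
        old.close()
    if not debug_cfg.get("enabled", False):
        logger.addHandler(logging.NullHandler())  # discard records when disabled
        return logger
    log_dir = Path(debug_cfg.get("log_dir", "logs"))
    log_dir.mkdir(parents=True, exist_ok=True)
    handler = RotatingFileHandler(
        log_dir / debug_cfg.get("log_file", "gpu_checker.log"),
        maxBytes=debug_cfg.get("log_max_size", 1048576),
        backupCount=debug_cfg.get("log_backup_count", 3),
    )
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG)
    return logger
```

With `maxBytes` set, the handler starts a new file once the current one would exceed that size and keeps at most `backupCount` older files.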
🤝 Contributing
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
📝 License
This project is licensed under the MIT License - see the LICENSE file for details.
🙏 Acknowledgments
Original Contributors
Originally created as "some awful, brittle code to check GPU status of multiple machines at a given host address through an SSH jumpnode."
Special thanks to:
- @harrygcoppock and @minut1bc for their PRs on v1
- gpuobserver for earlier code concepts
- Stack Overflow answer for SSH connection handling insights
Libraries
- Rich for the beautiful terminal interface
- asyncssh for async SSH support
- PyYAML for configuration management
⚠️ Known Issues
- SSH connections might time out on very slow networks
- Some older NVIDIA drivers might return incompatible XML formats
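The XML issue can be softened by parsing defensively. The sketch below assumes the XML comes from `nvidia-smi -q -x` and reads a few commonly present elements, falling back to placeholder values when an element is missing. `SAMPLE` is a hand-written, heavily abbreviated illustration rather than real driver output, and `parse_gpus` is a hypothetical helper, not the project's parser.

```python
import xml.etree.ElementTree as ET

# Illustrative, abbreviated sample; real output contains many more elements.
SAMPLE = """<nvidia_smi_log>
  <gpu id="00000000:01:00.0">
    <product_name>Example GPU</product_name>
    <fb_memory_usage><total>24576 MiB</total><used>1024 MiB</used></fb_memory_usage>
    <utilization><gpu_util>37 %</gpu_util></utilization>
  </gpu>
  <gpu id="00000000:02:00.0">
    <product_name>Older GPU</product_name>
  </gpu>
</nvidia_smi_log>"""


def parse_gpus(xml_text: str) -> list[dict]:
    """Extract per-GPU fields, tolerating missing elements and malformed XML."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return []  # malformed output: report no GPUs instead of crashing
    gpus = []
    for gpu in root.iter("gpu"):
        gpus.append(
            {
                "id": gpu.get("id", "unknown"),
                "name": gpu.findtext("product_name", default="unknown"),
                "memory_used": gpu.findtext("fb_memory_usage/used", default="N/A"),
                "memory_total": gpu.findtext("fb_memory_usage/total", default="N/A"),
                "utilization": gpu.findtext("utilization/gpu_util", default="N/A"),
            }
        )
    return gpus


for info in parse_gpus(SAMPLE):
    print(info)
```

The second GPU in the sample lacks memory and utilization data, so those fields come back as `"N/A"` rather than raising an error.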
📊 Roadmap
- Add support for AMD GPUs
- Implement process name filtering
- Add web interface
- Support for custom SSH config files