A fast, asynchronous GPU monitoring tool for multiple machines through SSH
SSH GPU Monitor 🖥️
A fast, asynchronous GPU monitoring tool that provides real-time status of NVIDIA GPUs across multiple machines through SSH, with support for jump hosts and per-machine credentials.
✨ Features
- Real-time Monitoring: Live updates of GPU status across multiple machines
- Asynchronous Operation: Fast, non-blocking checks using asyncio and asyncssh
- Jump Host Support: Access machines behind a bastion/jump host
- Rich Display: Beautiful terminal UI using the rich library
- Flexible Configuration:
  - YAML-based configuration
  - Per-machine SSH credentials
  - Pattern-based target generation
- Robust Error Handling: Graceful handling of network issues and timeouts
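To illustrate the asynchronous, timeout-tolerant checking described in the feature list, here is a minimal sketch using only asyncio. It is not the tool's actual code: `fake_gpu_check`, `check_with_timeout`, and `check_all` are hypothetical names, and the SSH round trip is replaced by a sleep so the example runs without any remote machines.

```python
import asyncio


async def fake_gpu_check(host: str, delay: float) -> str:
    """Hypothetical stand-in for an SSH round trip; sleeps instead of connecting."""
    await asyncio.sleep(delay)
    return f"{host}: ok"


async def check_with_timeout(host: str, delay: float, timeout: float) -> str:
    """Run one check and turn a timeout into a status string instead of an exception."""
    try:
        return await asyncio.wait_for(fake_gpu_check(host, delay), timeout)
    except asyncio.TimeoutError:
        return f"{host}: timed out"


async def check_all(delays: dict[str, float], timeout: float) -> list[str]:
    """Check every host concurrently; gather returns results in input order."""
    tasks = [check_with_timeout(host, delay, timeout) for host, delay in delays.items()]
    return await asyncio.gather(*tasks)


# One fast host and one host slower than the timeout.
results = asyncio.run(
    check_all({"gpu-server1": 0.01, "gpu-server2": 0.5}, timeout=0.1)
)
print(results)
```

Because each check catches its own timeout, one slow or unreachable host does not delay or break the results for the others.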
🚀 Installation & Usage
Install from PyPI
pip install ssh-gpu-monitor
Run the Monitor
After installation, you can run the monitor in several ways:
# Run using the command-line tool
ssh-gpu-monitor
# Or run as a Python module
python -m ssh_gpu_monitor
# Use a custom config file
ssh-gpu-monitor --config /path/to/your/config.yaml
# Get the default config path
ssh-gpu-monitor --get_config_path
Configuration
- Get the default config path:
ssh-gpu-monitor --get_config_path
- Either:
  - Copy the default config to your preferred location and use --config to specify it
  - Modify the default config directly
Example config file:
ssh:
  username: "your_username"
  key_path: "~/.ssh/id_rsa"
  jump_host: "jump.example.com"
  timeout: 10
targets:
  individual:
    - "gpu-server1"
    - "gpu-server2"
display:
  refresh_rate: 5
📖 Configuration
Basic Structure
ssh:
  username: "default_user" # Default username
  key_path: "~/.ssh/id_rsa" # Default SSH key
  jump_host: "jump.example.com"
  timeout: 10 # seconds
targets:
  # Individual machines
  individual:
    - host: "gpu-server1"
      username: "different_user" # Optional override
      key_path: "~/.ssh/special_key" # Optional override
    - "gpu-server2" # Uses default credentials
  # Pattern-based groups
  patterns:
    - prefix: "gpu"
      start: 1
      end: 30
      format: "{prefix}{number:02}" # Results in gpu01, gpu02, etc.
      username: "gpu_user" # Optional override
      key_path: "~/.ssh/gpu_key" # Optional override
display:
  refresh_rate: 5 # seconds
debug:
  enabled: false
  log_dir: "logs"
  log_file: "gpu_checker.log"
  log_max_size: 1048576 # 1MB
  log_backup_count: 3
Command Line Options
Override any configuration option via command line:
# Enable debug logging
ssh-gpu-monitor --debug.enabled
# Override SSH settings
ssh-gpu-monitor --ssh.username=other_user --ssh.key_path=~/.ssh/other_key
# Check specific targets
ssh-gpu-monitor --targets gpu01 gpu02 special-server
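The dotted flags above (such as --ssh.username and --debug.enabled) map naturally onto the nested YAML structure. The following is a rough sketch of how such an override could be applied to a loaded config dictionary. `apply_override` is a hypothetical helper written for illustration, not the tool's own parser.

```python
def apply_override(config: dict, dotted_key: str, value) -> None:
    """Set config["a"]["b"] = value for a dotted key "a.b", creating levels as needed."""
    *parents, leaf = dotted_key.split(".")
    node = config
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value


# Defaults as they might look after loading the YAML file.
config = {
    "ssh": {"username": "default_user", "key_path": "~/.ssh/id_rsa"},
    "debug": {"enabled": False},
}

# Equivalent of: --ssh.username=other_user --debug.enabled
apply_override(config, "ssh.username", "other_user")
apply_override(config, "debug.enabled", True)
print(config)
```

Keys that are not overridden, such as ssh.key_path here, keep the value from the config file.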
🔧 Advanced Usage
Custom Target Patterns
Generate targets using patterns:
patterns:
  - prefix: "compute"
    start: 1
    end: 100
    format: "{prefix}-{number:03d}" # compute-001, compute-002, etc.
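The format strings in these patterns follow Python's str.format syntax, so the expansion can be sketched in a few lines. `expand_pattern` is a hypothetical helper, and treating `end` as inclusive is an assumption; the README does not say whether the last number is included.

```python
def expand_pattern(pattern: dict) -> list[str]:
    """Expand one pattern entry into a list of hostnames."""
    fmt = pattern["format"]
    return [
        fmt.format(prefix=pattern["prefix"], number=n)
        # Assumption: "end" is inclusive, hence the + 1.
        for n in range(pattern["start"], pattern["end"] + 1)
    ]


compute = expand_pattern(
    {"prefix": "compute", "start": 1, "end": 100, "format": "{prefix}-{number:03d}"}
)
gpu = expand_pattern(
    {"prefix": "gpu", "start": 1, "end": 30, "format": "{prefix}{number:02}"}
)
print(compute[:2], compute[-1])  # ['compute-001', 'compute-002'] compute-100
print(gpu[:2], gpu[-1])          # ['gpu01', 'gpu02'] gpu30
```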
Per-Machine Credentials
Specify different credentials for specific machines:
individual:
  - host: "special-gpu"
    username: "admin"
    key_path: "~/.ssh/admin_key"
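An entry under individual can be either a bare hostname or a mapping with optional overrides. One plausible way to resolve both forms against the defaults in the ssh section is sketched below. `resolve_target` is a hypothetical name, and the fallback logic is an assumption about how the overrides behave.

```python
def resolve_target(entry, ssh_defaults: dict) -> dict:
    """Return host, username, and key_path for one target entry."""
    if isinstance(entry, str):
        # A bare hostname uses the default credentials.
        entry = {"host": entry}
    return {
        "host": entry["host"],
        "username": entry.get("username", ssh_defaults["username"]),
        "key_path": entry.get("key_path", ssh_defaults["key_path"]),
    }


ssh_defaults = {"username": "default_user", "key_path": "~/.ssh/id_rsa"}
special = resolve_target(
    {"host": "special-gpu", "username": "admin", "key_path": "~/.ssh/admin_key"},
    ssh_defaults,
)
plain = resolve_target("gpu-server2", ssh_defaults)
print(special)
print(plain)
```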
Debug Logging
Enable detailed logging for troubleshooting:
debug:
  enabled: true
  log_dir: "logs"
  log_file: "debug.log"
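The log_max_size and log_backup_count options listed under Basic Structure suggest size-based log rotation. A minimal sketch of wiring the debug section to the standard library's RotatingFileHandler follows. That the tool uses this handler is an assumption, and `build_logger` and the logger name are hypothetical.

```python
import logging
import tempfile
from logging.handlers import RotatingFileHandler
from pathlib import Path


def build_logger(debug_cfg: dict, base_dir: Path) -> logging.Logger:
    """Create a logger from a debug config section (hypothetical helper)."""
    logger = logging.getLogger("gpu_monitor_sketch")
    logger.handlers.clear()
    if not debug_cfg.get("enabled", False):
        # Debugging disabled: swallow all records.
        logger.addHandler(logging.NullHandler())
        return logger
    log_dir = base_dir / debug_cfg["log_dir"]
    log_dir.mkdir(parents=True, exist_ok=True)
    handler = RotatingFileHandler(
        log_dir / debug_cfg["log_file"],
        maxBytes=debug_cfg.get("log_max_size", 1048576),  # rotate at about 1 MB
        backupCount=debug_cfg.get("log_backup_count", 3),  # keep 3 old files
    )
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.setLevel(logging.DEBUG)
    logger.addHandler(handler)
    return logger


# Write into a temporary directory so the example leaves no files behind in the cwd.
base = Path(tempfile.mkdtemp())
log = build_logger({"enabled": True, "log_dir": "logs", "log_file": "debug.log"}, base)
log.debug("connecting to gpu-server1")
print((base / "logs" / "debug.log").read_text())
```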
🤝 Contributing
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
📝 License
This project is licensed under the MIT License - see the LICENSE file for details.
🙏 Acknowledgments
Original Contributors
Originally created as "some awful, brittle code to check GPU status of multiple machines at a given host address through an SSH jumpnode."
Special thanks to:
- @harrygcoppock and @minut1bc for their PRs on v1
- gpuobserver for earlier code concepts
- Stack Overflow answer for SSH connection handling insights
Libraries
- Rich for the beautiful terminal interface
- asyncssh for async SSH support
- PyYAML for configuration management
🔍 Similar Projects
⚠️ Known Issues
- SSH connections might time out on very slow networks
- Some older NVIDIA drivers might return incompatible XML formats
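The XML issue can be softened by parsing defensively. The sketch below assumes the XML comes from `nvidia-smi -q -x` and that fields such as product_name, fb_memory_usage, and utilization are present on current drivers; the sample document is a trimmed, made-up illustration, and `parse_gpus` is a hypothetical helper rather than the tool's parser.

```python
import xml.etree.ElementTree as ET

# Trimmed, illustrative sample; real output contains many more fields.
SAMPLE_XML = """<nvidia_smi_log>
  <gpu id="00000000:01:00.0">
    <product_name>Example GPU</product_name>
    <fb_memory_usage><total>24576 MiB</total><used>1024 MiB</used></fb_memory_usage>
    <utilization><gpu_util>37 %</gpu_util></utilization>
  </gpu>
  <gpu id="00000000:02:00.0">
    <product_name>Older GPU</product_name>
  </gpu>
</nvidia_smi_log>"""


def parse_gpus(xml_text: str) -> list[dict]:
    """Extract per-GPU fields, substituting placeholders for missing elements."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        # Malformed output: report no GPUs instead of crashing.
        return []
    return [
        {
            "name": gpu.findtext("product_name", default="unknown"),
            "memory_used": gpu.findtext("fb_memory_usage/used", default="N/A"),
            "memory_total": gpu.findtext("fb_memory_usage/total", default="N/A"),
            "utilization": gpu.findtext("utilization/gpu_util", default="N/A"),
        }
        for gpu in root.iter("gpu")
    ]


gpus = parse_gpus(SAMPLE_XML)
print(gpus)
print(parse_gpus("<not-xml"))
```

Missing elements fall back to placeholders, so a driver that omits a field still produces a row for that GPU.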
📊 Roadmap
- Add support for AMD GPUs
- Implement process name filtering
- Add web interface
- Support for custom SSH config files