A fast, asynchronous GPU monitoring tool for multiple machines through SSH

SSH GPU Monitor 🖥️

Python 3.8+ | License: MIT

A fast, asynchronous GPU monitoring tool that provides real-time status of NVIDIA GPUs across multiple machines through SSH, with support for jump hosts and per-machine credentials.

Example Output

✨ Features

  • Real-time Monitoring: Live updates of GPU status across multiple machines
  • Asynchronous Operation: Fast, non-blocking checks using asyncio and asyncssh
  • Jump Host Support: Access machines behind a bastion/jump host
  • Rich Display: Beautiful terminal UI using the rich library
  • Flexible Configuration:
    • YAML-based configuration
    • Per-machine SSH credentials
    • Pattern-based target generation
  • Robust Error Handling: Graceful handling of network issues and timeouts
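
The non-blocking checks and timeout handling listed above can be illustrated with plain asyncio. This is only a sketch of the general pattern, not the tool's actual code: `check_host` is a hypothetical stand-in that sleeps instead of opening an SSH connection, and the function names are invented for the example.

```python
import asyncio


async def check_host(host: str, delay: float) -> str:
    """Hypothetical stand-in for an SSH GPU query; sleeps instead of connecting."""
    await asyncio.sleep(delay)
    return f"{host}: ok"


async def check_with_timeout(host: str, delay: float, timeout: float) -> str:
    """Run one check and turn a timeout into a status string instead of an exception."""
    try:
        return await asyncio.wait_for(check_host(host, delay), timeout=timeout)
    except asyncio.TimeoutError:
        return f"{host}: timed out"


async def check_all(delays: dict, timeout: float) -> list:
    """Start every check concurrently; gather returns results in input order."""
    tasks = [check_with_timeout(host, delay, timeout) for host, delay in delays.items()]
    return await asyncio.gather(*tasks)


if __name__ == "__main__":
    # The second host is simulated as slower than the timeout.
    simulated = {"gpu-server1": 0.01, "gpu-server2": 0.5}
    for line in asyncio.run(check_all(simulated, timeout=0.1)):
        print(line)
```

Because each check handles its own timeout, one slow machine does not delay or fail the results for the others.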

🚀 Installation & Usage

Install from PyPI

pip install ssh-gpu-monitor

Run the Monitor

After installation, you can run the monitor in several ways:

# Run using the command-line tool
ssh-gpu-monitor

# Or run as a Python module
python -m ssh_gpu_monitor

# Use a custom config file
ssh-gpu-monitor --config /path/to/your/config.yaml

# Get the default config path
ssh-gpu-monitor --get_config_path

Configuration

  1. Get the default config path:
ssh-gpu-monitor --get_config_path
  2. Either:
    • Copy the default config to your preferred location and use --config to specify it
    • Modify the default config directly

Example config file:

ssh:
  username: "your_username"
  key_path: "~/.ssh/id_rsa"
  jump_host: "jump.example.com"
  timeout: 10

targets:
  individual:
    - "gpu-server1"
    - "gpu-server2"

display:
  refresh_rate: 5

📖 Configuration

Basic Structure

ssh:
  username: "default_user"  # Default username
  key_path: "~/.ssh/id_rsa"  # Default SSH key
  jump_host: "jump.example.com"
  timeout: 10  # seconds

targets:
  # Individual machines
  individual:
    - host: "gpu-server1"
      username: "different_user"  # Optional override
      key_path: "~/.ssh/special_key"  # Optional override
    - "gpu-server2"  # Uses default credentials
  
  # Pattern-based groups
  patterns:
    - prefix: "gpu"
      start: 1
      end: 30
      format: "{prefix}{number:02}"  # Results in gpu01, gpu02, etc.
      username: "gpu_user"  # Optional override
      key_path: "~/.ssh/gpu_key"  # Optional override

display:
  refresh_rate: 5  # seconds

debug:
  enabled: false
  log_dir: "logs"
  log_file: "gpu_checker.log"
  log_max_size: 1048576  # 1MB
  log_backup_count: 3

Command Line Options

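Any configuration option can be overridden on the command line, with nested keys written in dotted form (section.key). The examples below invoke main.py directly, as when running from a source checkout: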

# Enable debug logging
python main.py --debug.enabled

# Override SSH settings
python main.py --ssh.username=other_user --ssh.key_path=~/.ssh/other_key

# Check specific targets
python main.py --targets gpu01 gpu02 special-server
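
For readers curious how dotted options such as --ssh.username could map onto the nested YAML structure, here is a minimal sketch. The helpers `parse_overrides` and `apply_override` are hypothetical names invented for this example, and treating a bare flag as boolean true is an assumption; the project's real argument handling may differ, and list-valued options such as --targets are not covered.

```python
from typing import Any, Dict, List, Tuple


def parse_overrides(argv: List[str]) -> List[Tuple[str, Any]]:
    """Turn ["--ssh.username=bob", "--debug.enabled"] into (key, value) pairs."""
    pairs = []
    for arg in argv:
        if not arg.startswith("--"):
            continue  # positional values are ignored in this sketch
        key, sep, value = arg[2:].partition("=")
        # Assumption: a bare flag such as --debug.enabled means boolean True.
        pairs.append((key, value if sep else True))
    return pairs


def apply_override(config: Dict[str, Any], dotted_key: str, value: Any) -> None:
    """Set config["a"]["b"] = value for dotted_key "a.b", creating levels as needed."""
    *parents, leaf = dotted_key.split(".")
    node = config
    for name in parents:
        node = node.setdefault(name, {})
    node[leaf] = value


config = {"ssh": {"username": "default_user", "timeout": 10}, "debug": {"enabled": False}}
for key, value in parse_overrides(["--ssh.username=other_user", "--debug.enabled"]):
    apply_override(config, key, value)
print(config)
```

Only the keys named on the command line change; other values in the same section, such as ssh.timeout here, keep their configured values.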

🔧 Advanced Usage

Custom Target Patterns

Generate targets using patterns:

patterns:
  - prefix: "compute"
    start: 1
    end: 100
    format: "{prefix}-{number:03d}"  # compute-001, compute-002, etc.
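
The format strings follow Python's str.format syntax, so the expansion can be sketched in a few lines. The function name `expand_pattern` is hypothetical, and treating `end` as inclusive is an assumption; the README does not state how the tool handles the upper bound.

```python
def expand_pattern(pattern: dict) -> list:
    """Expand one pattern entry into hostnames (assumes `end` is inclusive)."""
    return [
        pattern["format"].format(prefix=pattern["prefix"], number=n)
        for n in range(pattern["start"], pattern["end"] + 1)
    ]


hosts = expand_pattern(
    {"prefix": "compute", "start": 1, "end": 100, "format": "{prefix}-{number:03d}"}
)
print(hosts[0], hosts[-1], len(hosts))  # compute-001 compute-100 100
```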

Per-Machine Credentials

Specify different credentials for specific machines:

individual:
  - host: "special-gpu"
    username: "admin"
    key_path: "~/.ssh/admin_key"
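
Since an entry under individual can be either a bare hostname or a mapping with overrides, the fallback to the defaults in the ssh section might look like the sketch below. `resolve_target` is a hypothetical helper written for illustration, not a function from the package.

```python
def resolve_target(entry, defaults: dict) -> dict:
    """Return host, username and key_path for one `individual` entry.

    A bare string uses the defaults; a mapping may override username or key_path.
    """
    if isinstance(entry, str):
        entry = {"host": entry}
    return {
        "host": entry["host"],
        "username": entry.get("username", defaults["username"]),
        "key_path": entry.get("key_path", defaults["key_path"]),
    }


defaults = {"username": "default_user", "key_path": "~/.ssh/id_rsa"}
individual = [
    {"host": "special-gpu", "username": "admin", "key_path": "~/.ssh/admin_key"},
    "gpu-server2",  # falls back to the defaults
]
resolved = [resolve_target(entry, defaults) for entry in individual]
for target in resolved:
    print(target)
```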

Debug Logging

Enable detailed logging for troubleshooting:

debug:
  enabled: true
  log_dir: "logs"
  log_file: "debug.log"
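
The debug keys, together with log_max_size and log_backup_count from the basic structure, line up naturally with Python's standard RotatingFileHandler. The sketch below shows one way such settings could be wired up; it assumes this mapping and uses a hypothetical `build_debug_logger` function and logger name, so it should not be read as the package's actual logging setup.

```python
import logging
import os
import tempfile
from logging.handlers import RotatingFileHandler


def build_debug_logger(debug_cfg: dict) -> logging.Logger:
    """Build a logger from a `debug` config section (assumed key meanings)."""
    logger = logging.getLogger("ssh_gpu_monitor_sketch")  # hypothetical logger name
    logger.handlers.clear()
    if not debug_cfg.get("enabled", False):
        logger.addHandler(logging.NullHandler())
        return logger
    os.makedirs(debug_cfg["log_dir"], exist_ok=True)
    handler = RotatingFileHandler(
        os.path.join(debug_cfg["log_dir"], debug_cfg["log_file"]),
        maxBytes=debug_cfg.get("log_max_size", 1048576),  # rotate after ~1 MB
        backupCount=debug_cfg.get("log_backup_count", 3),  # keep 3 old files
    )
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.setLevel(logging.DEBUG)
    logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    # Write into a temporary directory so the example leaves nothing behind.
    with tempfile.TemporaryDirectory() as tmp:
        log = build_debug_logger({"enabled": True, "log_dir": tmp, "log_file": "debug.log"})
        log.debug("connection attempt to gpu-server1")
        for h in log.handlers:
            h.close()
```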

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

Original Contributors

Originally created as "some awful, brittle code to check GPU status of multiple machines at a given host address through an SSH jumpnode."

Libraries

  • Rich for the beautiful terminal interface
  • asyncssh for async SSH support
  • PyYAML for configuration management

⚠️ Known Issues

  • SSH connections might time out on very slow networks
  • Some older NVIDIA drivers might return incompatible XML formats
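
The XML issue can be made more concrete with a defensive parsing sketch. It assumes the XML comes from `nvidia-smi -q -x` and uses element names commonly seen in that output (product_name, fb_memory_usage, utilization); the README does not show the tool's actual parser, so `parse_gpus` and the sample document are illustrative only.

```python
import xml.etree.ElementTree as ET


def parse_gpus(xml_text: str) -> list:
    """Extract a few fields per GPU, tolerating missing elements."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return []  # malformed output: report no GPUs rather than crash
    gpus = []
    for gpu in root.findall("gpu"):
        gpus.append({
            "name": gpu.findtext("product_name", default="unknown"),
            "memory_used": gpu.findtext("fb_memory_usage/used", default="N/A"),
            "memory_total": gpu.findtext("fb_memory_usage/total", default="N/A"),
            "utilization": gpu.findtext("utilization/gpu_util", default="N/A"),
        })
    return gpus


# Made-up sample that omits the <utilization> element, as an older driver might.
SAMPLE = """<nvidia_smi_log>
  <gpu id="00000000:01:00.0">
    <product_name>Example GPU</product_name>
    <fb_memory_usage><total>24576 MiB</total><used>1024 MiB</used></fb_memory_usage>
  </gpu>
</nvidia_smi_log>"""

print(parse_gpus(SAMPLE))
```

Supplying a default for each field means an unexpected schema degrades to "N/A" values in the display instead of an exception.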

📊 Roadmap

  • Add support for AMD GPUs
  • Implement process name filtering
  • Add web interface
  • Support for custom SSH config files
