
GPU Cluster Monitor

A CLI dashboard to monitor GPU utilization, temperature, memory, and power usage on remote hosts via SSH. It provides a live-updating table view, summarizing GPU status across multiple machines defined in a cluster configuration file.

Features

  • Live monitoring of multiple GPUs across multiple hosts.
  • Color-coded thresholds for critical and warning states (utilization, temperature).
  • Displays GPU ID, name, utilization, memory (used/total), temperature, and power draw/limit.
  • Supports SSH connection via system ssh command, leveraging ~/.ssh/config for host specifics (including ProxyCommand).
  • Configurable refresh interval.
  • Host summary table for a quick overview.
  • Problematic GPUs table highlighting GPUs with errors or high temperatures.
  • Optional detailed table for all GPUs.
  • Natural sorting for hostnames (e.g., h1, h2, h10).

Prerequisites

  • Python 3.10+
  • OpenSSH client installed and configured (i.e., ssh command works and can connect to target hosts, potentially using ~/.ssh/config).
  • nvidia-smi installed on all remote GPU hosts.
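A quick way to sanity-check all three prerequisites together is to run nvidia-smi over SSH from the machine that will run the monitor (gpu-node-01 below is a placeholder for one of your hosts):

    # Should print one line per GPU, e.g. "GPU 0: NVIDIA A100-SXM4-40GB (UUID: ...)"
    ssh gpu-node-01 nvidia-smi -L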

Installation

It is highly recommended to install gpu-cluster-monitor in a virtual environment.

  1. Create and activate a virtual environment (recommended):

    python -m venv .venv
    source .venv/bin/activate  # On Windows use `.venv\Scripts\activate`
    

    Important: Ensure the virtual environment is activated in your shell before running pip install in the next step.

  2. Install gpu-cluster-monitor from PyPI:

    pip install gpu-cluster-monitor
    
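To verify the installation, ask the CLI for its help text:

    gpu-monitor --help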

Configuration

gpu-cluster-monitor uses two main configuration files, typically located in ~/.gpu-cluster-monitor/ (this directory is created on first run, or when settings are initialized, if it doesn't already exist):

  1. clusters.yaml: Defines the clusters and hosts to monitor.
  2. settings.yaml: Configures application behavior like refresh intervals, display emojis, and warning thresholds.

1. Cluster Configuration (clusters.yaml)

You need to manually create or edit ~/.gpu-cluster-monitor/clusters.yaml. Here's an example structure:

clusters:
  - name: "my_main_cluster"  # Internal name, used for selecting with `gpu-monitor monitor <name>`
    display_name: "My Awesome GPU Cluster" # Display name for the dashboard
    ssh_user: "default_user_for_cluster" # Optional: Default SSH user for all hosts in this cluster
    ssh_key: "~/.ssh/id_rsa_cluster"    # Optional: Default SSH key for all hosts in this cluster
    ssh_port: 2222                     # Optional: Default SSH port
    hosts:
      - server1.example.com
      - server2
      - name: gpu-node-01 # Can be a simple string or a dict for host-specific overrides
        ssh_user: "specific_user" # Host-specific SSH user
      - name: gpu-node-02
        ssh_key: "~/.ssh/id_rsa_node02"
      # Add more hosts as needed
  
  - name: "another_cluster"
    display_name: "Secondary GPU Farm"
    hosts:
      - worker1
      - worker2

  • clusters: A list of cluster definitions.
    • name: An internal identifier for the cluster. If you want to monitor only this specific cluster, you'll use this name with the monitor command (e.g., gpu-monitor monitor my_main_cluster).
    • display_name: A user-friendly name shown in the dashboard title for this cluster.
    • ssh_user (optional): Default SSH username for all hosts in this cluster. Can be overridden per host.
    • ssh_key (optional): Path to the default SSH private key for this cluster. Can be overridden per host.
    • ssh_port (optional): Default SSH port for this cluster. Defaults to 22 if not specified. Can be overridden per host.
    • hosts: A list of host entries. Each entry can be:
      • A simple string (hostname or IP).
      • A dictionary with a name key (hostname or IP) and optional ssh_user, ssh_key, ssh_port to override cluster defaults or system SSH config for that specific host.

Your system's ~/.ssh/config will still be respected for connection details if not specified in clusters.yaml (e.g., for ProxyCommand, User if not set in clusters.yaml, IdentityFile if not set).
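For example, a host reached through a bastion might be configured entirely in ~/.ssh/config (host names and key paths below are illustrative), and gpu-cluster-monitor will pick these settings up automatically because it shells out to the system ssh command:

    Host gpu-node-*
        User ml-user
        IdentityFile ~/.ssh/id_rsa_cluster
        ProxyJump bastion.example.com   # or an equivalent ProxyCommand directive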

2. Application Settings (settings.yaml)

This file controls various aspects of the monitor's appearance and behavior. To create a default settings.yaml file, run:

gpu-monitor settings init

This will create ~/.gpu-cluster-monitor/settings.yaml with default values. You can then edit this file to customize:

  • Refresh intervals
  • SSH and nvidia-smi command timeouts
  • Utilization and temperature thresholds for warnings and critical alerts
  • Emojis used for status indicators

If settings.yaml is not present, the application will use built-in default values.
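The exact keys are easiest to discover by running gpu-monitor settings init and inspecting the generated file; as a rough, hypothetical sketch of the kinds of options involved (key names here are illustrative, not authoritative):

    refresh_interval: 5          # seconds between dashboard refreshes
    ssh_timeout: 10              # seconds before an SSH attempt is abandoned
    thresholds:
      utilization_warning: 80    # percent
      temperature_critical: 85   # degrees Celsius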

Usage

After installation and configuration, you can run the monitor using the gpu-monitor command.

Commands:

  • gpu-monitor monitor [cluster_name] [options]

    • Monitors the specified cluster by its name from clusters.yaml.
    • If cluster_name is omitted, all clusters defined in clusters.yaml are monitored.
    • --interval SECONDS: Refresh interval (overrides settings.yaml or default).
    • --show-all-gpus: Show detailed GPU table in addition to summaries.
    • --config PATH_TO_CLUSTERS.YAML: Path to the cluster configuration YAML file. Default: ~/.gpu-cluster-monitor/clusters.yaml.
    • --config-dir DIRECTORY: Path to the configuration directory where clusters.yaml and settings.yaml are located. Default: ~/.gpu-cluster-monitor.
    • --ssh-debug: Enable detailed SSH command debugging output.
  • gpu-monitor settings init

    • Creates a default settings.yaml file in the configuration directory (~/.gpu-cluster-monitor/settings.yaml) if one doesn't already exist.

Example: Monitoring clusters

  1. Ensure ~/.gpu-cluster-monitor/clusters.yaml is configured with your cluster(s).
  2. (Optional) Initialize and customize settings.yaml:
    gpu-monitor settings init
    # Now edit ~/.gpu-cluster-monitor/settings.yaml if desired
    
  3. Monitor all clusters defined in clusters.yaml:
    gpu-monitor monitor
    
  4. Monitor a specific cluster named my_main_cluster (assuming it's defined in clusters.yaml):
    gpu-monitor monitor my_main_cluster
    
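The options can be combined. For example, to monitor a single cluster from a non-default configuration directory (~/alt-config below is a placeholder), refreshing every 5 seconds and including the detailed per-GPU table:

    gpu-monitor monitor my_main_cluster --interval 5 --show-all-gpus --config-dir ~/alt-config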

Troubleshooting

  • Permission Denied: Ensure your SSH keys are set up correctly, your SSH agent is running with the right keys, or your ~/.ssh/config has the correct User and IdentityFile for the target hosts. Host-specific or cluster-specific ssh_user and ssh_key in clusters.yaml can also be used.
  • Could not resolve hostname: Check that the hostname is correct and resolvable from the machine running the monitor.
  • Connection timed out: Verify network connectivity to the host and that the SSH port (usually 22, unless overridden) is open. Check ProxyCommand settings in ~/.ssh/config if you use a bastion/jump host.
  • nvidia-smi not found on host: Ensure nvidia-smi is installed and in the PATH for the SSH user on the remote machine.
  • 'ssh' command not found locally: Make sure the OpenSSH client is installed on the machine where you are running gpu-monitor.
  • YAML parsing errors: Carefully check the syntax of your clusters.yaml or settings.yaml files. Online YAML validators can be helpful.
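Since the monitor shells out to the system ssh command, connection problems can usually be reproduced by hand. For example (gpu-node-01 is a placeholder):

    # Fail fast instead of prompting for a password, and list the remote GPUs
    ssh -o BatchMode=yes -o ConnectTimeout=5 gpu-node-01 nvidia-smi -L
    # Add -v for verbose SSH output, or re-run the monitor with --ssh-debug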

Contributing & Development

Contributions are welcome! Please feel free to submit a Pull Request or open an Issue.

Setting up for Development

  1. Clone the repository:

    git clone https://github.com/AgrawalAmey/gpu-cluster-monitor.git
    cd gpu-cluster-monitor
    
  2. Create and activate a virtual environment (recommended):

    python -m venv .venv
    source .venv/bin/activate  # On Windows use `.venv\Scripts\activate`
    

    Important: Ensure the virtual environment is activated in your shell before proceeding to the next steps involving make or pip install -e.

  3. Install the package in editable mode with development dependencies. The Makefile simplifies this; ensure you have make installed and your virtual environment is active.

    make install 
    

    This target installs the package in editable mode and development tools like build and twine. All subsequent make targets (like build, lint, publish) also assume the virtual environment is active.

    If you don't have make or prefer manual steps (ensure venv is active):

    pip install -e ".[dev]" # Installs in editable mode with dev dependencies
    pip install --upgrade build twine # Installs packaging tools
    

    (Ensure pyproject.toml has a [project.optional-dependencies] table for dev if using the .[dev] syntax).
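A minimal sketch of such a table (the actual dependency list in this project's pyproject.toml may differ):

    [project.optional-dependencies]
    dev = ["ruff", "build", "twine"]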

Running from Source (for development)

If you have cloned the repository, activated your virtual environment, and installed dependencies in editable mode, you can invoke the CLI directly:

gpu-monitor --help

Alternatively, to run the module directly without relying on the entry point (useful for some debugging scenarios):

python -m gpu_cluster_monitor.main monitor <cluster_config_name> [options]
# Example for adding a cluster using local config files:
# python -m gpu_cluster_monitor.main add-cluster dev_cluster --config-dir ./clusters_config

Note: When running with python -m, if you want to use the local clusters_config files from the project root for testing, you'll need to pass --config-dir ./clusters_config, as the default configuration directory is still ~/.gpu-cluster-monitor/.

Makefile for Development

A Makefile is provided to simplify common development tasks. Important: Before running targets like install, build, lint, publish_test, or publish, ensure you have activated your virtual environment (e.g., source .venv/bin/activate). The make venv target only creates the environment.

Common Makefile Targets:

  • make venv: Creates a Python virtual environment in .venv/.
  • make install: Installs the package in editable mode and development dependencies. Assumes virtual environment is active.
  • make build: Builds the package (sdist and wheel) into the dist/ directory. Assumes virtual environment is active.
  • make clean: Removes build artifacts and __pycache__ directories.
  • make publish_test: Uploads the package to TestPyPI from the dist/ directory. Assumes virtual environment is active.
  • make publish: Uploads the package to PyPI from the dist/ directory. Assumes virtual environment is active.
  • make lint: Runs linters and formatters (e.g., Ruff). Assumes virtual environment is active.
  • make format: Runs formatters (e.g., Ruff). Assumes virtual environment is active.

Typical Development Workflow:

  1. make venv (first time, or if .venv is deleted)
  2. source .venv/bin/activate (or your shell's equivalent) - Crucial step!
  3. make install (to set up editable install and dev tools)
  4. (Make your code changes)
  5. (Optionally, run make lint or make format)
  6. make build
  7. make publish_test (to test packaging and upload to TestPyPI)
  8. make publish (to release to PyPI)
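Put together, a typical session from a fresh checkout might look like this:

    make venv
    source .venv/bin/activate
    make install
    # ... edit code ...
    make lint
    make build
    make publish_test   # then `make publish` for a real release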

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
