GPU Cluster Monitor
A CLI dashboard to monitor GPU utilization, temperature, memory, and power usage on remote hosts via SSH. It provides a live-updating table view, summarizing GPU status across multiple machines defined in a cluster configuration file.
Features
- Live monitoring of multiple GPUs across multiple hosts.
- Color-coded thresholds for critical and warning states (utilization, temperature).
- Displays GPU ID, name, utilization, memory (used/total), temperature, and power draw/limit.
- Supports SSH connections via the system `ssh` command, leveraging `~/.ssh/config` for host specifics (including `ProxyCommand`).
- Configurable refresh interval.
- Host summary table for a quick overview.
- Problematic GPUs table highlighting GPUs with errors or high temperatures.
- Optional detailed table for all GPUs.
- Natural sorting for hostnames (e.g., h1, h2, h10).
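The natural hostname sort mentioned above can be sketched in a few lines. This is a generic recipe for the technique, not necessarily the project's own implementation:

```python
import re

def natural_key(hostname: str):
    # Split the name into text and digit runs so "h10" sorts after "h2"
    # instead of between "h1" and "h2" (plain lexicographic order).
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", hostname)]

print(sorted(["h10", "h1", "h2"], key=natural_key))  # ['h1', 'h2', 'h10']
```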
Prerequisites
- Python 3.10+
- OpenSSH client installed and configured (i.e., the `ssh` command works and can connect to target hosts, potentially using `~/.ssh/config`).
- `nvidia-smi` installed on all remote GPU hosts.
Installation
It is highly recommended to install gpu-cluster-monitor in a virtual environment.
1. Create and activate a virtual environment (recommended):

   ```
   python -m venv .venv
   source .venv/bin/activate  # On Windows use `.venv\Scripts\activate`
   ```

   Important: Ensure the virtual environment is activated in your shell before installing.

2. Install `gpu-cluster-monitor` from PyPI:

   ```
   pip install gpu-cluster-monitor
   ```
Configuration
gpu-cluster-monitor uses two main configuration files, typically located in `~/.gpu-cluster-monitor/` (this path is created if it doesn't exist upon first run or when initializing settings):

- `clusters.yaml`: Defines the clusters and hosts to monitor.
- `settings.yaml`: Configures application behavior such as refresh intervals, display emojis, and warning thresholds.
1. Cluster Configuration (clusters.yaml)
You need to manually create or edit `~/.gpu-cluster-monitor/clusters.yaml`. Here's an example structure:
```yaml
clusters:
  - name: "my_main_cluster"                # Internal name, used for selecting with `gpu-monitor monitor <name>`
    display_name: "My Awesome GPU Cluster" # Display name for the dashboard
    ssh_user: "default_user_for_cluster"   # Optional: default SSH user for all hosts in this cluster
    ssh_key: "~/.ssh/id_rsa_cluster"       # Optional: default SSH key for all hosts in this cluster
    ssh_port: 2222                         # Optional: default SSH port
    hosts:
      - server1.example.com
      - server2
      - name: gpu-node-01                  # Can be a simple string or a dict for host-specific overrides
        ssh_user: "specific_user"          # Host-specific SSH user
      - name: gpu-node-02
        ssh_key: "~/.ssh/id_rsa_node02"
      # Add more hosts as needed
  - name: "another_cluster"
    display_name: "Secondary GPU Farm"
    hosts:
      - worker1
      - worker2
```
- `clusters`: A list of cluster definitions.
- `name`: An internal identifier for the cluster. To monitor only this specific cluster, use this name with the `monitor` command (e.g., `gpu-monitor monitor my_main_cluster`).
- `display_name`: A user-friendly name shown in the dashboard title for this cluster.
- `ssh_user` (optional): Default SSH username for all hosts in this cluster. Can be overridden per host.
- `ssh_key` (optional): Path to the default SSH private key for this cluster. Can be overridden per host.
- `ssh_port` (optional): Default SSH port for this cluster. Defaults to 22 if not specified. Can be overridden per host.
- `hosts`: A list of host entries. Each entry can be:
  - A simple string (hostname or IP).
  - A dictionary with a `name` key (hostname or IP) and optional `ssh_user`, `ssh_key`, `ssh_port` to override cluster defaults or system SSH config for that specific host.
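The override precedence described above (host entry beats cluster default, which beats the built-in port 22) can be sketched as follows. `resolve_host` is a hypothetical helper for illustration, not part of the tool's API:

```python
def resolve_host(cluster: dict, entry) -> dict:
    """Merge cluster-level SSH defaults into a host entry.

    `entry` is either a bare hostname string or a dict with a `name` key,
    matching the two clusters.yaml shapes shown above.
    """
    host = {"name": entry} if isinstance(entry, str) else dict(entry)
    for key in ("ssh_user", "ssh_key", "ssh_port"):
        if key not in host and key in cluster:
            host[key] = cluster[key]   # fall back to the cluster default
    host.setdefault("ssh_port", 22)    # final fallback: standard SSH port
    return host

# A bare string picks up all cluster defaults:
print(resolve_host({"ssh_user": "alice", "ssh_port": 2222}, "server1"))
# {'name': 'server1', 'ssh_user': 'alice', 'ssh_port': 2222}
```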
Your system's `~/.ssh/config` will still be respected for connection details not specified in `clusters.yaml` (e.g., `ProxyCommand`, `User` if not set in `clusters.yaml`, `IdentityFile` if not set).
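One way this fall-through works in practice: only explicitly configured options are passed as `ssh` flags, so anything left unset is resolved by `~/.ssh/config`. A minimal sketch (the `nvidia-smi` query fields are standard, but the helper itself is illustrative, not the tool's actual code):

```python
NVIDIA_SMI_QUERY = (
    "nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used,"
    "memory.total,temperature.gpu,power.draw,power.limit "
    "--format=csv,noheader,nounits"
)

def build_ssh_command(host, user=None, key=None, port=None):
    # Flags are added only when set; omitted values fall through to
    # ~/.ssh/config (User, IdentityFile, Port, ProxyCommand, ...).
    cmd = ["ssh", "-o", "BatchMode=yes"]
    if port:
        cmd += ["-p", str(port)]
    if key:
        cmd += ["-i", key]
    cmd.append(f"{user}@{host}" if user else host)
    cmd.append(NVIDIA_SMI_QUERY)
    return cmd

print(" ".join(build_ssh_command("gpu-node-01", user="alice", port=2222)))
```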
2. Application Settings (settings.yaml)
This file controls various aspects of the monitor's appearance and behavior. To create a default `settings.yaml` file, run:

```
gpu-monitor settings init
```
This will create `~/.gpu-cluster-monitor/settings.yaml` with default values. You can then edit this file to customize:

- Refresh intervals
- SSH and `nvidia-smi` command timeouts
- Utilization and temperature thresholds for warnings and critical alerts
- Emojis used for status indicators
If `settings.yaml` is not present, the application will use built-in default values.
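For orientation, a settings.yaml covering the categories above might look like the sketch below. The key names and values here are hypothetical; run `gpu-monitor settings init` and inspect the generated file for the authoritative defaults.

```yaml
# Illustrative only -- key names and defaults may differ in your version.
refresh_interval: 5        # seconds between dashboard refreshes
ssh_timeout: 10            # seconds before an SSH attempt is abandoned
nvidia_smi_timeout: 10     # seconds to wait for nvidia-smi output
thresholds:
  utilization_warning: 80  # percent
  temperature_warning: 75  # degrees Celsius
  temperature_critical: 85
emojis:
  ok: "✅"
  warning: "⚠️"
  critical: "🔥"
```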
Usage
After installation and configuration, you can run the monitor using the `gpu-monitor` command.
Commands:
- `gpu-monitor monitor [cluster_name] [options]`
  - Monitors the specified cluster by its `name` from `clusters.yaml`.
  - If `cluster_name` is omitted, all clusters defined in `clusters.yaml` are monitored.
  - `--interval SECONDS`: Refresh interval (overrides `settings.yaml` or the default).
  - `--show-all-gpus`: Show the detailed GPU table in addition to the summaries.
  - `--config PATH_TO_CLUSTERS.YAML`: Path to the cluster configuration YAML file. Default: `~/.gpu-cluster-monitor/clusters.yaml`.
  - `--config-dir DIRECTORY`: Path to the configuration directory where `clusters.yaml` and `settings.yaml` are located. Default: `~/.gpu-cluster-monitor`.
  - `--ssh-debug`: Enable detailed SSH command debugging output.
- `gpu-monitor settings init`
  - Creates a default `settings.yaml` file in the configuration directory (`~/.gpu-cluster-monitor/settings.yaml`) if one doesn't already exist.
Example: Monitoring clusters
1. Ensure `~/.gpu-cluster-monitor/clusters.yaml` is configured with your cluster(s).
2. (Optional) Initialize and customize `settings.yaml`:

   ```
   gpu-monitor settings init
   # Now edit ~/.gpu-cluster-monitor/settings.yaml if desired
   ```

3. Monitor all clusters defined in `clusters.yaml`:

   ```
   gpu-monitor monitor
   ```

4. Monitor a specific cluster named `my_main_cluster` (assuming it's defined in `clusters.yaml`):

   ```
   gpu-monitor monitor my_main_cluster
   ```
Troubleshooting
- Permission denied: Ensure your SSH keys are set up correctly, your SSH agent is running with the right keys, or your `~/.ssh/config` has the correct `User` and `IdentityFile` for the target hosts. Host-specific or cluster-specific `ssh_user` and `ssh_key` in `clusters.yaml` can also be used.
- Could not resolve hostname: Check that the hostname is correct and resolvable from the machine running the monitor.
- Connection timed out: Verify network connectivity to the host and that the SSH port (usually 22, unless overridden) is open. Check `ProxyCommand` settings in `~/.ssh/config` if you use a bastion/jump host.
- `nvidia-smi` not found on host: Ensure `nvidia-smi` is installed and on the `PATH` for the SSH user on the remote machine.
- `ssh` command not found locally: Make sure the OpenSSH client is installed on the machine where you are running `gpu-monitor`.
- YAML parsing errors: Carefully check the syntax of your `clusters.yaml` and `settings.yaml` files. Online YAML validators can be helpful.
Contributing & Development
Contributions are welcome! Please feel free to submit a Pull Request or open an Issue.
Setting up for Development
1. Clone the repository:

   ```
   git clone https://github.com/AgrawalAmey/gpu-cluster-monitor.git
   cd gpu-cluster-monitor
   ```

2. Create and activate a virtual environment (recommended):

   ```
   python -m venv .venv
   source .venv/bin/activate  # On Windows use `.venv\Scripts\activate`
   ```

   Important: Ensure the virtual environment is activated in your shell before running the `make` or `pip install -e` steps below.

3. Install the package in editable mode with development dependencies. The `Makefile` simplifies this; ensure you have `make` installed and your virtual environment active:

   ```
   make install
   ```

   This target installs the package in editable mode along with development tools like `build` and `twine`. All subsequent `make` targets (such as `build`, `lint`, and `publish`) also assume the virtual environment is active.

   If you don't have `make` or prefer manual steps (ensure the venv is active):

   ```
   pip install -e ".[dev]"            # Editable mode with dev dependencies
   pip install --upgrade build twine  # Packaging tools
   ```

   (Ensure `pyproject.toml` has a `[project.optional-dependencies]` table for `dev` if using the `.[dev]` syntax.)
Running from Source (for development)
If you have cloned the repository, activated your virtual environment, and installed dependencies in editable mode, you can invoke the CLI directly:

```
gpu-monitor --help
```
Alternatively, to run the module directly without relying on the entry point (useful for some debugging scenarios):

```
python -m gpu_cluster_monitor.main monitor <cluster_config_name> [options]

# Example for adding a cluster using local config files:
# python -m gpu_cluster_monitor.main add-cluster dev_cluster --config-dir ./clusters_config
```

Note: When running with `python -m`, if you want to use local `clusters_config` files from the project root for testing, you'll need to specify `--config-dir ./clusters_config`, since the default configuration directory is still used otherwise.
Makefile for Development
A Makefile is provided to simplify common development tasks.
Important: Before running targets like `install`, `build`, `lint`, `publish_test`, or `publish`, ensure you have activated your virtual environment (e.g., `source .venv/bin/activate`). The `make venv` target only creates the environment.
Common Makefile Targets:
- `make venv`: Creates a Python virtual environment in `.venv/`.
- `make install`: Installs the package in editable mode plus development dependencies. Assumes the virtual environment is active.
- `make build`: Builds the package (sdist and wheel) into the `dist/` directory. Assumes the virtual environment is active.
- `make clean`: Removes build artifacts and `__pycache__` directories.
- `make publish_test`: Uploads the package to TestPyPI from the `dist/` directory. Assumes the virtual environment is active.
- `make publish`: Uploads the package to PyPI from the `dist/` directory. Assumes the virtual environment is active.
- `make lint`: Runs linters (e.g., Ruff). Assumes the virtual environment is active.
- `make format`: Runs formatters (e.g., Ruff). Assumes the virtual environment is active.
Typical Development Workflow:
1. `make venv` (first time, or if `.venv` is deleted)
2. `source .venv/bin/activate` (or your shell's equivalent). Crucial step!
3. `make install` (to set up the editable install and dev tools)
4. (Make your code changes)
5. (Optionally, run `make lint` or `make format`)
6. `make build`
7. `make publish_test` (to test packaging and upload to TestPyPI)
8. `make publish` (to release to PyPI)
License
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.