
GPU Cluster Monitor

A CLI dashboard to monitor GPU utilization, temperature, memory, and power usage on remote hosts via SSH. It provides a live-updating table view, summarizing GPU status across multiple machines defined in a cluster configuration file.

Features

  • Live monitoring of multiple GPUs across multiple hosts.
  • Color-coded thresholds for critical and warning states (utilization, temperature).
  • Displays GPU ID, name, utilization, memory (used/total), temperature, and power draw/limit.
  • Supports SSH connection via system ssh command, leveraging ~/.ssh/config for host specifics (including ProxyCommand).
  • Configurable refresh interval.
  • Host summary table for a quick overview.
  • Problematic GPUs table highlighting GPUs with errors or high temperatures.
  • Optional detailed table for all GPUs.
  • Natural sorting for hostnames (e.g., h1, h2, h10).

Prerequisites

  • Python 3.10+
  • OpenSSH client installed and configured (i.e., ssh command works and can connect to target hosts, potentially using ~/.ssh/config).
  • nvidia-smi installed on all remote GPU hosts.
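A quick way to sanity-check all three prerequisites together is to run nvidia-smi over SSH from the machine that will run the monitor (gpu-node-01 below is a placeholder for one of your hosts):

    # Should print one line per GPU, e.g. "GPU 0: NVIDIA A100-SXM4-40GB (UUID: ...)"
    ssh gpu-node-01 nvidia-smi -L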

Installation

It is highly recommended to install gpu-cluster-monitor in a virtual environment.

  1. Create and activate a virtual environment (recommended):

    python -m venv .venv
    source .venv/bin/activate  # On Windows use `.venv\Scripts\activate`
    

    Important: Ensure the virtual environment is activated in your shell before running pip install in the next step.

  2. Install gpu-cluster-monitor from PyPI:

    pip install gpu-cluster-monitor
    
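To verify the installation, ask the CLI for its help text:

    gpu-monitor --help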

Configuration

gpu-cluster-monitor uses two main configuration files, typically located in ~/.gpu-cluster-monitor/ (this directory is created on first run, or when settings are initialized, if it doesn't already exist):

  1. clusters.yaml: Defines the clusters and hosts to monitor.
  2. settings.yaml: Configures application behavior like refresh intervals, display emojis, and warning thresholds.

1. Cluster Configuration (clusters.yaml)

You need to manually create or edit ~/.gpu-cluster-monitor/clusters.yaml. Here's an example structure:

clusters:
  - name: "my_main_cluster"  # Internal name, used for selecting with `gpu-monitor monitor <name>`
    display_name: "My Awesome GPU Cluster" # Display name for the dashboard
    ssh_user: "default_user_for_cluster" # Optional: Default SSH user for all hosts in this cluster
    ssh_key: "~/.ssh/id_rsa_cluster"    # Optional: Default SSH key for all hosts in this cluster
    ssh_port: 2222                     # Optional: Default SSH port
    hosts:
      - server1.example.com
      - server2
      - name: gpu-node-01 # Can be a simple string or a dict for host-specific overrides
        ssh_user: "specific_user" # Host-specific SSH user
      - name: gpu-node-02
        ssh_key: "~/.ssh/id_rsa_node02"
      # Add more hosts as needed
  
  - name: "another_cluster"
    display_name: "Secondary GPU Farm"
    hosts:
      - worker1
      - worker2

  • clusters: A list of cluster definitions.
    • name: An internal identifier for the cluster. If you want to monitor only this specific cluster, you'll use this name with the monitor command (e.g., gpu-monitor monitor my_main_cluster).
    • display_name: A user-friendly name shown in the dashboard title for this cluster.
    • ssh_user (optional): Default SSH username for all hosts in this cluster. Can be overridden per host.
    • ssh_key (optional): Path to the default SSH private key for this cluster. Can be overridden per host.
    • ssh_port (optional): Default SSH port for this cluster. Defaults to 22 if not specified. Can be overridden per host.
    • hosts: A list of host entries. Each entry can be:
      • A simple string (hostname or IP).
      • A dictionary with a name key (hostname or IP) and optional ssh_user, ssh_key, ssh_port to override cluster defaults or system SSH config for that specific host.

Your system's ~/.ssh/config will still be respected for connection details if not specified in clusters.yaml (e.g., for ProxyCommand, User if not set in clusters.yaml, IdentityFile if not set).
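For example, a host reached through a bastion might be configured entirely in ~/.ssh/config (host names and key paths below are illustrative), and gpu-cluster-monitor will pick these settings up automatically because it shells out to the system ssh command:

    Host gpu-node-*
        User ml-user
        IdentityFile ~/.ssh/id_rsa_cluster
        ProxyJump bastion.example.com   # or an equivalent ProxyCommand directive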

2. Application Settings (settings.yaml)

This file controls various aspects of the monitor's appearance and behavior. To create a default settings.yaml file, run:

gpu-monitor settings init

This will create ~/.gpu-cluster-monitor/settings.yaml with default values. You can then edit this file to customize:

  • Refresh intervals
  • SSH and nvidia-smi command timeouts
  • Utilization and temperature thresholds for warnings and critical alerts
  • Emojis used for status indicators

If settings.yaml is not present, the application will use built-in default values.
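The exact keys are easiest to discover by running gpu-monitor settings init and inspecting the generated file; as a rough, hypothetical sketch of the kinds of options involved (key names here are illustrative, not authoritative):

    refresh_interval: 5          # seconds between dashboard refreshes
    ssh_timeout: 10              # seconds before an SSH attempt is abandoned
    thresholds:
      utilization_warning: 80    # percent
      temperature_critical: 85   # degrees Celsius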

Usage

After installation and configuration, you can run the monitor using the gpu-monitor command.

Commands:

  • gpu-monitor monitor [cluster_name] [options]

    • Monitors the specified cluster by its name from clusters.yaml.
    • If cluster_name is omitted, all clusters defined in clusters.yaml are monitored.
    • --interval SECONDS: Refresh interval (overrides settings.yaml or default).
    • --show-all-gpus: Show detailed GPU table in addition to summaries.
    • --config PATH_TO_CLUSTERS.YAML: Path to the cluster configuration YAML file. Default: ~/.gpu-cluster-monitor/clusters.yaml.
    • --config-dir DIRECTORY: Path to the configuration directory where clusters.yaml and settings.yaml are located. Default: ~/.gpu-cluster-monitor.
    • --ssh-debug: Enable detailed SSH command debugging output.
  • gpu-monitor settings init

    • Creates a default settings.yaml file in the configuration directory (~/.gpu-cluster-monitor/settings.yaml) if one doesn't already exist.

Example: Monitoring clusters

  1. Ensure ~/.gpu-cluster-monitor/clusters.yaml is configured with your cluster(s).
  2. (Optional) Initialize and customize settings.yaml:
    gpu-monitor settings init
    # Now edit ~/.gpu-cluster-monitor/settings.yaml if desired
    
  3. Monitor all clusters defined in clusters.yaml:
    gpu-monitor monitor
    
  4. Monitor a specific cluster named my_main_cluster (assuming it's defined in clusters.yaml):
    gpu-monitor monitor my_main_cluster
    
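The options can be combined. For example, to monitor a single cluster from a non-default configuration directory (~/alt-config below is a placeholder), refreshing every 5 seconds and including the detailed per-GPU table:

    gpu-monitor monitor my_main_cluster --interval 5 --show-all-gpus --config-dir ~/alt-config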

Troubleshooting

  • Permission Denied: Ensure your SSH keys are set up correctly, your SSH agent is running with the right keys, or your ~/.ssh/config has the correct User and IdentityFile for the target hosts. Host-specific or cluster-specific ssh_user and ssh_key in clusters.yaml can also be used.
  • Could not resolve hostname: Check that the hostname is correct and resolvable from the machine running the monitor.
  • Connection timed out: Verify network connectivity to the host and that the SSH port (usually 22, unless overridden) is open. Check ProxyCommand settings in ~/.ssh/config if you use a bastion/jump host.
  • nvidia-smi not found on host: Ensure nvidia-smi is installed and in the PATH for the SSH user on the remote machine.
  • 'ssh' command not found locally: Make sure the OpenSSH client is installed on the machine where you are running gpu-monitor.
  • YAML parsing errors: Carefully check the syntax of your clusters.yaml or settings.yaml files. Online YAML validators can be helpful.
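Since the monitor shells out to the system ssh command, connection problems can usually be reproduced by hand. For example (gpu-node-01 is a placeholder):

    # Fail fast instead of prompting for a password, and list the remote GPUs
    ssh -o BatchMode=yes -o ConnectTimeout=5 gpu-node-01 nvidia-smi -L
    # Add -v for verbose SSH output, or re-run the monitor with --ssh-debug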

Contributing & Development

Contributions are welcome! Please feel free to submit a Pull Request or open an Issue.

Setting up for Development

  1. Clone the repository:

    git clone https://github.com/AgrawalAmey/gpu-cluster-monitor.git
    cd gpu-cluster-monitor
    
  2. Create and activate a virtual environment (recommended):

    python -m venv .venv
    source .venv/bin/activate  # On Windows use `.venv\Scripts\activate`
    

    Important: Ensure the virtual environment is activated in your shell before proceeding to the next steps involving make or pip install -e.

  3. Install the package in editable mode with development dependencies. The Makefile simplifies this; ensure you have make installed and your virtual environment is active.

    make install 
    

    This target installs the package in editable mode and development tools like build and twine. All subsequent make targets (like build, lint, publish) also assume the virtual environment is active.

    If you don't have make or prefer manual steps (ensure venv is active):

    pip install -e ".[dev]" # Installs in editable mode with dev dependencies
    pip install --upgrade build twine # Installs packaging tools
    

    (Ensure pyproject.toml has a [project.optional-dependencies] table for dev if using the .[dev] syntax).
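A minimal sketch of such a table (the actual dependency list in this project's pyproject.toml may differ):

    [project.optional-dependencies]
    dev = ["ruff", "build", "twine"]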

Running from Source (for development)

If you have cloned the repository, activated your virtual environment, and installed dependencies in editable mode, you can invoke the CLI directly:

gpu-monitor --help

Alternatively, to run the module directly without relying on the entry point (useful for some debugging scenarios):

python -m gpu_cluster_monitor.main monitor <cluster_config_name> [options]
# Example for adding a cluster using local config files:
# python -m gpu_cluster_monitor.main add-cluster dev_cluster --config-dir ./clusters_config

Note: When running with python -m, if you want to use the local clusters_config files from the project root for testing, you'll need to pass --config-dir ./clusters_config, as the default configuration directory is still ~/.gpu-cluster-monitor/.

Makefile for Development

A Makefile is provided to simplify common development tasks. Important: Before running targets like install, build, lint, publish_test, or publish, ensure you have activated your virtual environment (e.g., source .venv/bin/activate). The make venv target only creates the environment.

Common Makefile Targets:

  • make venv: Creates a Python virtual environment in .venv/.
  • make install: Installs the package in editable mode and development dependencies. Assumes virtual environment is active.
  • make build: Builds the package (sdist and wheel) into the dist/ directory. Assumes virtual environment is active.
  • make clean: Removes build artifacts and __pycache__ directories.
  • make publish_test: Uploads the package to TestPyPI from the dist/ directory. Assumes virtual environment is active.
  • make publish: Uploads the package to PyPI from the dist/ directory. Assumes virtual environment is active.
  • make lint: Runs linters and formatters (e.g., Ruff). Assumes virtual environment is active.
  • make format: Runs formatters (e.g., Ruff). Assumes virtual environment is active.

Typical Development Workflow:

  1. make venv (first time, or if .venv is deleted)
  2. source .venv/bin/activate (or your shell's equivalent) - Crucial step!
  3. make install (to set up editable install and dev tools)
  4. (Make your code changes)
  5. (Optionally, run make lint or make format)
  6. make build
  7. make publish_test (to test packaging and upload to TestPyPI)
  8. make publish (to release to PyPI)
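Put together, a typical session from a fresh checkout might look like this:

    make venv
    source .venv/bin/activate
    make install
    # ... edit code ...
    make lint
    make build
    make publish_test   # then `make publish` for a real release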

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
