Lightweight NVIDIA GPU monitor — 20 notification channels (Slack, Discord, Telegram, ntfy, Teams, PagerDuty, Zulip, OpenClaw, and more), Prometheus/InfluxDB/Datadog metrics, crash/ECC detection, Kubernetes, GitHub Pages dashboard

GPU Monitor

Stop losing GPU-hours to silent crashes. gpu-monitor runs in the background and alerts you on Slack, Discord, Telegram, or any of 17 other channels the moment your training job crashes, your GPU goes idle, or your machine overheats, whether you are asleep, traveling, or working on something else.

pip install gpu-watchdog
export SLACK_WEBHOOK_URL="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
gpu-monitor

That's it. You're protected.

If gpu-monitor saved your training run, please star it — it helps other researchers find the tool.

What Happens When...
Quick Start
Example Output
Why gpu-monitor?
Features
Supported Notification Channels
Environment Variables
- General
- Per-channel variables
Prometheus Metrics
Alertmanager Webhook Receiver
Kubernetes
GitHub Pages Dashboard
Multi-Machine Setup
Setting Up Specific Channels
Who Uses gpu-monitor?
Author

What Happens When...

Real scenarios gpu-monitor handles automatically:

Your training job crashes at 3 AM:

gpu-cluster-1 | GPUs went idle — processes exited: 12345, 12346, 12347 | avg 1% | 38°C | mem 2G/320G (1%)

You wake up to this Slack message and can restart immediately, instead of discovering 8 lost hours in the morning.

A GPU overheats during a long run:

gpu-cluster-1 | GPU 2 temperature CRITICAL: 94°C (limit 92°C) | util 88% | fan 98%

You get paged before hardware damage or throttling ruins your results.

Memory is quietly leaking across epochs:

gpu-cluster-1 | GPU 0 memory leak detected: 18G → 31G (+72%) over 10min | process python3[alice]

Caught before you OOM-crash at epoch 47.
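A minimal sketch of how a growth check like this could work, assuming the monitor keeps timestamped memory samples per GPU and compares the oldest sample in the window with the newest. The function names and the sample layout are hypothetical; only the 30% threshold and 10-minute window mirror the documented MEMLEAK_THRESHOLD and MEMLEAK_MINUTES defaults. The real check also considers process changes, which this sketch ignores.

```python
from typing import List, Optional, Tuple

# (timestamp_seconds, memory_used_gib) samples for one GPU, oldest first.
Sample = Tuple[float, float]

def memory_growth_percent(samples: List[Sample], window_minutes: float = 10) -> Optional[float]:
    """Return % growth between the oldest in-window sample and the newest one."""
    if len(samples) < 2:
        return None
    newest_t, newest_mem = samples[-1]
    cutoff = newest_t - window_minutes * 60
    in_window = [s for s in samples if s[0] >= cutoff]
    oldest_mem = in_window[0][1]
    if len(in_window) < 2 or oldest_mem <= 0:
        return None
    return (newest_mem - oldest_mem) / oldest_mem * 100

def is_leak(samples: List[Sample], threshold_percent: float = 30,
            window_minutes: float = 10) -> bool:
    """True when growth inside the window reaches the threshold."""
    growth = memory_growth_percent(samples, window_minutes)
    return growth is not None and growth >= threshold_percent

# The scenario above: 18G -> 31G over 10 minutes.
samples = [(0, 18.0), (300, 24.0), (600, 31.0)]
print(round(memory_growth_percent(samples)))  # 72
print(is_leak(samples))                       # True
```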

One GPU goes idle while others are busy (hung worker):

gpu-cluster-1 | GPU 3 idle (2%) while others active (87-91%) — possible hung worker

ECC errors silently corrupting your gradients:

gpu-cluster-1 | GPU 1 uncorrected ECC errors: +3 since last check | retire this GPU before it corrupts results

Quick Start

Step 1 — Install:

# Option A: pip (recommended)
pip install gpu-watchdog

# Option B: single file, zero install
curl -O https://raw.githubusercontent.com/reacher-z/gpu-monitor/main/gpu_monitor.py

Step 2 — Pick your notification channel:

# Slack
export SLACK_WEBHOOK_URL="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"

# Discord
export DISCORD_WEBHOOK_URL="https://discord.com/api/webhooks/YOUR/WEBHOOK"

# Telegram
export TELEGRAM_BOT_TOKEN="your-bot-token"
export TELEGRAM_CHAT_ID="your-chat-id"

# ntfy — zero signup, push to your phone right now
export NTFY_URL="https://ntfy.sh/my-gpu-cluster-abc123"

Step 3 — Run:

gpu-monitor
# or: python gpu_monitor.py

Set multiple env vars to fan out to multiple channels simultaneously.

Useful CLI flags:

gpu-monitor --once          # check once, print status, exit
gpu-monitor --json          # current GPU stats as JSON (pipe to jq, scripts, etc.)
gpu-monitor --watch 2       # live color terminal table, 2-second refresh
gpu-monitor --channels      # show which notification channels are currently configured
gpu-monitor --test-notify   # send a test alert to all configured channels
gpu-monitor --web 8080      # dashboard + Prometheus /metrics at :8080
gpu-monitor --version       # print version and exit

Run as a persistent background service (systemd):

curl -O https://raw.githubusercontent.com/reacher-z/gpu-monitor/main/gpu-monitor.service
# Edit the Environment= lines with your credentials, then:
sudo cp gpu-monitor.service /etc/systemd/system/gpu-monitor@$USER.service
sudo systemctl daemon-reload
sudo systemctl enable --now gpu-monitor@$USER
sudo journalctl -u gpu-monitor@$USER -f   # follow logs

Full monitoring stack (Prometheus + Grafana + Alertmanager):

cp .env.example .env && $EDITOR .env   # add your notification credentials
docker compose -f docker-compose.monitoring.yml up -d
# Grafana at http://localhost:3000  (admin/admin)
# Import grafana/dashboard.json for the pre-built GPU dashboard

Kubernetes — monitor every GPU node automatically:

# Edit kubernetes/secret.yaml with your credentials
kubectl apply -k kubernetes/

Example Output

--watch live terminal view (runs in your terminal like htop for GPUs):

gpu-cluster-1          2026-03-07 14:32
GPU  Name                 Util   Mem         Temp   Power   Procs
  0  NVIDIA A100-SXM4-80  87%    18G/80G     72°C   312W    python3[alice]
  1  NVIDIA A100-SXM4-80  91%    22G/80G     75°C   318W    torchrun[bob]
  2  NVIDIA A100-SXM4-80  83%    18G/80G     69°C   305W    python3[carol]
  3  NVIDIA A100-SXM4-80  88%    21G/80G     71°C   310W    torchrun[bob]

--once quick status check:

gpu-cluster-1 | 2026-03-07 14:32 | avg 87% | 72°C | 1820W | mem 188G/320G (59%) | up 6h12m
[87% 91% 83% 88% 92% 79% 85% 90%]
GPU0: python3(18G)[alice] | GPU1: torchrun(22G)[bob] | GPU3: python3(18G)[carol]

Slack/Discord alert — all GPUs went idle (crash detected):

gpu-cluster-1 | GPUs went idle — processes exited: 12345, 12346, 12347 | avg 1% | 38°C | mem 2G/320G (1%)

Slack/Discord alert — extended idle:

gpu-cluster-1 | 2026-03-07 15:01 | avg 2% | 38°C | idle 8min
All GPUs idle for 8 minutes. Last active: training job (alice)

--test-notify output:

Test notification sent to: Slack, Discord, ntfy
Not configured:           Telegram, Email, SMS, iMessage, WeCom, Feishu, DingTalk, Bark,
                          Teams, Pushover, Gotify, Mattermost, Google Chat, Zulip, OpenClaw

Why gpu-monitor?

gpu-monitor fills a gap that existing tools leave open: unattended background monitoring with multi-channel alerts. gpustat and nvitop are excellent for interactive inspection; gpu-monitor is what runs while you're not watching.

Feature	gpu-monitor	gpustat	nvitop	wandb
Background alerts	✅	❌	❌	❌
Multi-channel notifications	✅ 20 built-in + 80 via Apprise	❌	❌	Slack only
Zero dependencies	✅ stdlib only	❌	❌	❌
Single file deploy	✅	❌	❌	❌
Crash detection	✅	❌	❌	❌
Temperature alerting	✅	❌	❌	❌
Memory leak detection	✅	❌	❌	❌
ECC error detection	✅	❌	❌	❌
Power throttle alert	✅	❌	❌	❌
Prometheus `/metrics`	✅ 11 metrics	❌	✅	❌
InfluxDB / Datadog / OTLP	✅	❌	❌	❌
Alertmanager receiver	✅	❌	❌	❌
Live terminal view	✅ `--watch`	✅	✅	❌
Kubernetes DaemonSet	✅	❌	❌	❌
Multi-machine dashboard	✅ GitHub Pages (free)	❌	❌	✅ paid

Features

Alerting — know before things go wrong

Crash detection — GPUs suddenly go idle while processes were running → instant alert
Idle alert — all GPUs below 10% utilization for 5 min → alert
Partial idle — some GPUs idle while others are busy (hung worker) → warning
Recovery notification — GPUs become active again after an idle period → notify
Temperature alerting — configurable GPU_TEMP_WARN / GPU_TEMP_CRIT thresholds, no Prometheus required
Power throttle alert — fires when power draw hits 95% of TDP limit
ECC error detection — alert on uncorrected volatile ECC errors (A100/H100/V100); prevents silent training corruption
Memory leak detection — alert when GPU memory grows unexpectedly without process changes
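The idle alert, cooldown, and recovery behavior listed above can be pictured as a small state machine. This is a sketch under assumptions, not gpu-monitor's actual code: the `IdleTracker` class is hypothetical, and only its default values mirror the documented IDLE_THRESHOLD, IDLE_MINUTES, and ALERT_COOLDOWN settings.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class IdleTracker:
    """Hypothetical tracker: alerts after idle_minutes, then at most once per cooldown."""
    idle_threshold: float = 10   # percent, mirrors IDLE_THRESHOLD
    idle_minutes: float = 5      # mirrors IDLE_MINUTES
    alert_cooldown: float = 30   # minutes, mirrors ALERT_COOLDOWN
    idle_since: Optional[float] = None
    last_alert: Optional[float] = None

    def update(self, now: float, utilizations: List[float]) -> Optional[str]:
        """Feed one check (time in seconds); return "idle", "recovered", or None."""
        all_idle = all(u < self.idle_threshold for u in utilizations)
        if not all_idle:
            was_alerting = self.last_alert is not None
            self.idle_since = None
            self.last_alert = None
            return "recovered" if was_alerting else None
        if self.idle_since is None:
            self.idle_since = now
        idle_for = now - self.idle_since
        cooled = (self.last_alert is None
                  or now - self.last_alert >= self.alert_cooldown * 60)
        if idle_for >= self.idle_minutes * 60 and cooled:
            self.last_alert = now
            return "idle"
        return None

tracker = IdleTracker()
events = [tracker.update(t * 60, [2, 1, 3, 2]) for t in range(8)]  # one check per minute
print(events)  # [None, None, None, None, None, 'idle', None, None]
print(tracker.update(8 * 60, [87, 91, 83, 88]))  # recovered
```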

Visibility — always know what your GPUs are doing

Periodic status — active: every 10 min, idle: every 30 min
Startup notification — know when the monitor comes online
GPU processes — shows which processes are using each GPU with username
Power draw — watts per GPU in status messages
Per-machine color — auto-assigned color bar in Slack/Discord for multi-machine setups
Uptime tracking — shows up 2h30m or idle 15min in status
--watch — live ANSI color terminal table (lightweight nvtop alternative)
--json — machine-readable output: gpu-monitor --json | jq '.gpus[].util'
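For scripts that prefer Python over jq, the same field can be read with the standard library. This is a sketch: the sample document is hypothetical, and the only part of the schema taken from the text above is the `gpus[].util` path from the jq example.

```python
import json
from typing import List

def gpu_utilizations(raw_json: str) -> List[float]:
    """Extract per-GPU utilization from a `gpu-monitor --json` style document."""
    doc = json.loads(raw_json)
    return [gpu["util"] for gpu in doc.get("gpus", [])]

# Hypothetical sample; real output may carry more fields per GPU.
sample = '{"gpus": [{"util": 87}, {"util": 91}, {"util": 83}, {"util": 88}]}'
utils = gpu_utilizations(sample)
print(utils)                    # [87, 91, 83, 88]
print(sum(utils) / len(utils))  # 87.25
```

In practice the input string would come from something like `subprocess.run(["gpu-monitor", "--json"], capture_output=True, text=True).stdout`.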

Observability integrations

Prometheus /metrics — 11 metrics when WEB_PORT is set; Grafana-ready
InfluxDB export — line protocol to InfluxDB v1/v2 (INFLUXDB_URL)
Datadog export — DogStatsD gauges (DATADOG_STATSD_HOST)
OpenTelemetry OTLP — export to any OTel-compatible backend (OTEL_EXPORTER_OTLP_ENDPOINT)
Alertmanager receiver — route any Prometheus alert to all 20 channels via POST /webhook
ALERT_WEBHOOK_URL — POST JSON to any HTTP endpoint on every alert (CI/CD, custom integrations)
Web dashboard sparklines — --web PORT shows per-GPU utilization history over time
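As an illustration of the ALERT_WEBHOOK_URL integration above, here is a minimal receiver built on the standard library. It is a sketch under assumptions: the alert payload schema is not documented here, so the handler just parses whatever JSON arrives, and `AlertHandler`, `received`, and `make_server` are hypothetical names.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

received = []  # parsed alert bodies, kept in memory for illustration

class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        try:
            received.append(json.loads(body))
            self.send_response(204)  # accepted, nothing to return
        except ValueError:
            self.send_response(400)  # body was not valid JSON
        self.end_headers()

    def log_message(self, fmt, *args):
        pass  # keep the example quiet

def make_server(port: int = 0) -> HTTPServer:
    """Bind to localhost; port 0 lets the OS pick a free port."""
    return HTTPServer(("127.0.0.1", port), AlertHandler)
```

To try it, call `make_server(9000).serve_forever()` in one terminal and point ALERT_WEBHOOK_URL at `http://127.0.0.1:9000/` in another.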

Deployment

20 notification channels — Slack, Discord, Telegram, Email, SMS, iMessage, WeCom, Feishu, DingTalk, Bark, Rocket.Chat, ntfy, Gotify, Pushover, Mattermost, Teams, Google Chat, Zulip, OpenClaw, PagerDuty (+ 80+ more via Apprise)
--test-notify — verify all configured channels with one command
Kubernetes DaemonSet — deploy to every GPU node with kubectl apply -k kubernetes/
GitHub Pages dashboard — multi-machine status page, no extra server needed
Watchdog — auto-restart on crash
Log rotation — 5 MB × 3 backups
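The rotation policy above maps directly onto Python's `RotatingFileHandler`. Whether gpu-monitor uses this handler internally is an assumption, as is reading "5 MB" as 5 × 1024 × 1024 bytes; the logger name and `make_logger` helper are hypothetical.

```python
import logging
from logging.handlers import RotatingFileHandler

def make_logger(log_file: str) -> logging.Logger:
    """Rotate at 5 MB and keep 3 old files, matching the numbers above."""
    handler = RotatingFileHandler(log_file, maxBytes=5 * 1024 * 1024, backupCount=3)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger = logging.getLogger("gpu-monitor-example")
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    return logger
```

With this setup the directory holds at most four files: the active log plus `.1`, `.2`, and `.3` backups.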

Supported Notification Channels

20 channels built in. Configure any combination — only channels with credentials set are used.

Channel	Env var(s) needed
Slack	`SLACK_WEBHOOK_URL`
Discord	`DISCORD_WEBHOOK_URL`
Telegram	`TELEGRAM_BOT_TOKEN` + `TELEGRAM_CHAT_ID`
Email (SMTP)	`EMAIL_SMTP_HOST`, `EMAIL_USER`, `EMAIL_PASS`, `EMAIL_TO`
SMS (Twilio)	`TWILIO_ACCOUNT_SID`, `TWILIO_AUTH_TOKEN`, `TWILIO_FROM`, `TWILIO_TO`
iMessage	`IMESSAGE_TO` (macOS only)
WeCom (企业微信)	`WECOM_WEBHOOK_URL`
Feishu (飞书)	`FEISHU_WEBHOOK_URL`
DingTalk (钉钉)	`DINGTALK_WEBHOOK_URL`
Bark	`BARK_URL` (self-hosted or api.day.app)
ntfy	`NTFY_URL` (+ optional `NTFY_TOKEN`)
Gotify	`GOTIFY_URL` + `GOTIFY_TOKEN`
Pushover	`PUSHOVER_TOKEN` + `PUSHOVER_USER`
Rocket.Chat	`ROCKETCHAT_WEBHOOK_URL`
Google Chat	`GOOGLE_CHAT_WEBHOOK_URL`
Zulip	`ZULIP_SITE` + `ZULIP_EMAIL` + `ZULIP_API_KEY`
Mattermost	`MATTERMOST_WEBHOOK_URL`
Microsoft Teams	`TEAMS_WEBHOOK_URL`
OpenClaw	`OPENCLAW_WEBHOOK_URL` — routes to WhatsApp, Signal, LINE, Matrix, Zalo, 20+ more
PagerDuty	`PAGERDUTY_INTEGRATION_KEY` (Events API v2)
Apprise (80+ more)	`APPRISE_URLS` — requires `pip install apprise`
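The rule that only channels with credentials set are used can be sketched as a lookup over the table above. This is illustrative rather than the tool's own logic: `REQUIRED_VARS` covers only a subset of channels, and `configured_channels` is a hypothetical helper.

```python
import os
from typing import Dict, List, Mapping

# Required variables per channel, copied from the table above (subset shown).
REQUIRED_VARS: Dict[str, List[str]] = {
    "Slack": ["SLACK_WEBHOOK_URL"],
    "Discord": ["DISCORD_WEBHOOK_URL"],
    "Telegram": ["TELEGRAM_BOT_TOKEN", "TELEGRAM_CHAT_ID"],
    "ntfy": ["NTFY_URL"],
    "Gotify": ["GOTIFY_URL", "GOTIFY_TOKEN"],
    "Pushover": ["PUSHOVER_TOKEN", "PUSHOVER_USER"],
    "PagerDuty": ["PAGERDUTY_INTEGRATION_KEY"],
}

def configured_channels(env: Mapping[str, str] = os.environ) -> List[str]:
    """A channel counts as configured only when every required variable is non-empty."""
    return [name for name, needed in REQUIRED_VARS.items()
            if all(env.get(var) for var in needed)]

env = {
    "SLACK_WEBHOOK_URL": "https://hooks.slack.com/services/YOUR/WEBHOOK/URL",
    "TELEGRAM_BOT_TOKEN": "your-bot-token",  # chat ID missing, so Telegram is skipped
}
print(configured_channels(env))  # ['Slack']
```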

Environment Variables

General

Variable	Default	Description
`CHECK_INTERVAL`	`60`	Seconds between GPU checks
`IDLE_THRESHOLD`	`10`	Alert when utilization drops below this %
`IDLE_MINUTES`	`5`	Minutes idle before the first alert fires
`ALERT_COOLDOWN`	`30`	Minutes between repeated alerts
`STATUS_ACTIVE`	`10`	Periodic status interval when active (minutes)
`STATUS_IDLE`	`30`	Periodic status interval when idle (minutes)
`MACHINE_COLOR`	auto	Hex color for Slack/Discord messages
`LOG_FILE`	—	Log file path (enables rotation)
`WEB_PORT`	—	Enables local dashboard + `/metrics` on this port
`MEMLEAK_THRESHOLD`	`30`	GPU memory growth % to trigger a leak alert
`MEMLEAK_MINUTES`	`10`	Window (minutes) for memory leak detection
`GPU_TEMP_WARN`	`85`	°C threshold for high-temperature warning alert
`GPU_TEMP_CRIT`	`92`	°C threshold for critical temperature alert
`ALERT_WEBHOOK_URL`	—	HTTP endpoint to POST JSON on every alert
`INFLUXDB_URL`	—	InfluxDB server URL (e.g. `http://influxdb:8086`)
`INFLUXDB_TOKEN`	—	API token (v2) or `user:password` (v1)
`INFLUXDB_BUCKET`	`gpu_metrics`	InfluxDB v2 bucket or v1 `db/rp`
`INFLUXDB_ORG`	—	InfluxDB v2 organization name
`DATADOG_STATSD_HOST`	—	Hostname of Datadog agent (enables DogStatsD export)
`DATADOG_STATSD_PORT`	`8125`	DogStatsD port
`OTEL_EXPORTER_OTLP_ENDPOINT`	—	OTel Collector URL (e.g. `http://otel-collector:4318`)
`OTEL_SERVICE_NAME`	`gpu-monitor`	Service name for OTLP resource attributes
`OTEL_EXPORTER_OTLP_HEADERS`	—	Extra headers as `key=val,key2=val2`
`APPRISE_URLS`	—	Space/comma-separated Apprise URLs (`pip install apprise` required)
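To make the defaults concrete, the sketch below reads a handful of these variables into a typed object. The `Settings` class and `load_settings` function are hypothetical; the variable names and default values are the ones in the table.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

@dataclass(frozen=True)
class Settings:
    check_interval: int      # seconds
    idle_threshold: float    # percent
    idle_minutes: float
    alert_cooldown: float    # minutes
    gpu_temp_warn: float     # degrees C
    gpu_temp_crit: float     # degrees C
    web_port: Optional[int]  # None leaves the dashboard and /metrics disabled

def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read a subset of the variables above, falling back to the documented defaults."""
    port = env.get("WEB_PORT")
    return Settings(
        check_interval=int(env.get("CHECK_INTERVAL", "60")),
        idle_threshold=float(env.get("IDLE_THRESHOLD", "10")),
        idle_minutes=float(env.get("IDLE_MINUTES", "5")),
        alert_cooldown=float(env.get("ALERT_COOLDOWN", "30")),
        gpu_temp_warn=float(env.get("GPU_TEMP_WARN", "85")),
        gpu_temp_crit=float(env.get("GPU_TEMP_CRIT", "92")),
        web_port=int(port) if port else None,
    )

settings = load_settings({"CHECK_INTERVAL": "30", "WEB_PORT": "8080"})
print(settings.check_interval, settings.gpu_temp_crit, settings.web_port)  # 30 92.0 8080
```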

Per-channel variables

Slack

Variable	Description
`SLACK_WEBHOOK_URL`	Slack incoming webhook URL

Discord

Variable	Description
`DISCORD_WEBHOOK_URL`	Discord webhook URL

Telegram

Variable	Description
`TELEGRAM_BOT_TOKEN`	Bot token from @BotFather
`TELEGRAM_CHAT_ID`	Target chat/group/channel ID

Email (SMTP)

Variable	Default	Description
`EMAIL_SMTP_HOST`	—	SMTP server hostname
`EMAIL_SMTP_PORT`	`587`	SMTP port (STARTTLS)
`EMAIL_USER`	—	Login username
`EMAIL_PASS`	—	Login password or app password
`EMAIL_TO`	—	Recipient(s), comma-separated

SMS (Twilio)

Variable	Description
`TWILIO_ACCOUNT_SID`	Twilio account SID
`TWILIO_AUTH_TOKEN`	Twilio auth token
`TWILIO_FROM`	Twilio phone number (E.164 format)
`TWILIO_TO`	Recipient number(s), comma-separated

iMessage (macOS only)

Variable	Description
`IMESSAGE_TO`	Recipient phone/email, comma-separated

WeCom (企业微信)

Variable	Description
`WECOM_WEBHOOK_URL`	WeCom group bot webhook URL

Feishu (飞书 / Lark)

Variable	Description
`FEISHU_WEBHOOK_URL`	Feishu bot webhook URL

DingTalk (钉钉)

Variable	Description
`DINGTALK_WEBHOOK_URL`	DingTalk group robot webhook URL

Bark (iOS push)

Variable	Description
`BARK_URL`	Bark server URL, e.g. `https://api.day.app/YOUR_KEY`

ntfy

Variable	Description
`NTFY_URL`	ntfy topic URL, e.g. `https://ntfy.sh/my-gpu-alerts`
`NTFY_TOKEN`	Auth token (optional, for protected topics)

Gotify

Variable	Description
`GOTIFY_URL`	Gotify server URL, e.g. `http://gotify.example.com`
`GOTIFY_TOKEN`	App token from Gotify dashboard

Pushover

Variable	Description
`PUSHOVER_TOKEN`	App API token from pushover.net
`PUSHOVER_USER`	Your user/group key

Rocket.Chat

Variable	Description
`ROCKETCHAT_WEBHOOK_URL`	Incoming webhook URL (Administration → Integrations → Incoming WebHook)

Google Chat

Variable	Description
`GOOGLE_CHAT_WEBHOOK_URL`	Google Chat space webhook URL (Space → Manage webhooks)

Zulip

Variable	Default	Description
`ZULIP_SITE`	—	Your Zulip server URL, e.g. `https://yourorg.zulipchat.com`
`ZULIP_EMAIL`	—	Bot email address
`ZULIP_API_KEY`	—	Bot API key
`ZULIP_STREAM`	`general`	Stream to post to
`ZULIP_TOPIC`	`GPU Monitor`	Topic/thread name

Mattermost

Variable	Description
`MATTERMOST_WEBHOOK_URL`	Incoming webhook URL (Main Menu → Integrations → Incoming Webhooks)

Microsoft Teams

Variable	Description
`TEAMS_WEBHOOK_URL`	Teams incoming webhook URL (channel → ... → Connectors → Incoming Webhook)

OpenClaw

Variable	Description
`OPENCLAW_WEBHOOK_URL`	Your OpenClaw webhook URL, e.g. `http://your-host:18789/hooks/wake`
`OPENCLAW_WEBHOOK_SECRET`	Bearer token (from OpenClaw settings), if auth is enabled

PagerDuty

Variable	Description
`PAGERDUTY_INTEGRATION_KEY`	32-character Events API v2 integration key from PagerDuty

Create an integration in PagerDuty: Service → Integrations → Add integration → Events API v2. Copy the integration key.
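For reference, an Events API v2 trigger is a JSON POST to PagerDuty's enqueue endpoint with the integration key as `routing_key`. The sketch below only builds the request so the shape is visible; `build_trigger_event` is a hypothetical helper, and how gpu-monitor fills the summary and source fields is an assumption.

```python
import json
import urllib.request

EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"

def build_trigger_event(integration_key: str, summary: str, source: str,
                        severity: str = "critical") -> urllib.request.Request:
    """Build (but do not send) an Events API v2 trigger request."""
    if severity not in {"critical", "error", "warning", "info"}:
        raise ValueError(f"unsupported severity: {severity}")
    body = {
        "routing_key": integration_key,
        "event_action": "trigger",
        "payload": {"summary": summary, "source": source, "severity": severity},
    }
    return urllib.request.Request(
        EVENTS_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Placeholder key; a real integration key is 32 characters long.
req = build_trigger_event("0" * 32, "GPU 2 temperature CRITICAL: 94C (limit 92C)",
                          "gpu-cluster-1")
print(req.full_url)  # https://events.pagerduty.com/v2/enqueue
```

Passing the request to `urllib.request.urlopen` sends it once a real key is in place.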

Prometheus Metrics

Enable with WEB_PORT:

export WEB_PORT=8080
gpu-monitor
# Metrics at http://localhost:8080/metrics
# Dashboard at http://localhost:8080/

11 exposed metrics, all labeled with gpu index and host:

gpu_utilization_percent, gpu_memory_used_mib, gpu_memory_total_mib, gpu_memory_utilization_percent, gpu_temperature_celsius, gpu_power_watts, gpu_power_limit_watts, gpu_clock_sm_mhz, gpu_fan_speed_percent, gpu_ecc_errors_uncorrected, gpu_process_count
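If you want to read these metrics without running Prometheus, the text exposition format is simple enough to parse by hand. The sketch below handles plain samples only (no escaped quotes inside label values), and the label names `gpu` and `host` in the sample are assumptions based on the description above.

```python
import re
from typing import Dict, List, Tuple

LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)'
)
LABEL = re.compile(r'(\w+)="([^"]*)"')

def parse_metrics(text: str) -> List[Tuple[str, Dict[str, str], float]]:
    """Parse simple Prometheus text-format samples; comments and blank lines are skipped."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        m = LINE.match(line)
        if m:
            labels = dict(LABEL.findall(m.group("labels") or ""))
            samples.append((m.group("name"), labels, float(m.group("value"))))
    return samples

# Hypothetical scrape output for two GPUs.
sample = '''
# TYPE gpu_temperature_celsius gauge
gpu_temperature_celsius{gpu="0",host="gpu-cluster-1"} 72
gpu_temperature_celsius{gpu="1",host="gpu-cluster-1"} 75
'''
hottest = max(parse_metrics(sample), key=lambda s: s[2])
print(hottest[1]["gpu"], hottest[2])  # 1 75.0
```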

Add to prometheus.yml:

scrape_configs:
  - job_name: gpu
    static_configs:
      - targets: ['your-server:8080']

Pre-built Grafana dashboard is at grafana/dashboard.json — import via Dashboards → Import → Upload JSON. Includes utilization, memory, temperature, and power panels with host and GPU variable filters.

Prometheus alerting rules ship in grafana/alerts.yml. Reference the file from prometheus.yml using the path where you place it, for example:

rule_files:
  - rules/gpu-monitor-alerts.yml

Alert	Condition	Severity
`GPUAllIdle`	avg util < 10% for 5m	warning
`GPUHighTemperature`	temp > 85°C for 2m	warning
`GPUCriticalTemperature`	temp > 92°C for 1m	critical
`GPUMemoryHigh`	mem util > 90% for 5m	warning
`GPUMemoryFull`	mem util > 98% for 2m	critical
`GPUMonitorDown`	no metrics for 3m	critical

Alertmanager Webhook Receiver

When WEB_PORT is set, gpu-monitor also acts as an Alertmanager webhook receiver, forwarding any Prometheus alert (GPU or otherwise) to every notification channel you have configured.

Configure in Alertmanager:

receivers:
  - name: gpu-monitor
    webhook_configs:
      - url: http://your-server:8080/webhook
        send_resolved: true

Alerts arrive with severity-appropriate formatting (fire icon for critical, warning icon for warning). Resolved alerts are announced separately.
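Alertmanager's webhook body is a JSON document with an `alerts` array, where each alert carries `status`, `labels`, and `annotations`. The sketch below shows one way such a payload could be turned into message lines; the `format_alerts` helper and its text tags are hypothetical and stand in for the icons gpu-monitor actually uses.

```python
from typing import Any, Dict, List

def format_alerts(payload: Dict[str, Any]) -> List[str]:
    """Turn an Alertmanager webhook payload into one text line per alert."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        resolved = alert.get("status") == "resolved"
        tag = "RESOLVED" if resolved else labels.get("severity", "unknown").upper()
        line = f"[{tag}] {labels.get('alertname', 'alert')}"
        summary = annotations.get("summary") or annotations.get("description")
        if summary:
            line += f": {summary}"
        lines.append(line)
    return lines

payload = {
    "version": "4",
    "status": "firing",
    "alerts": [
        {"status": "firing",
         "labels": {"alertname": "GPUCriticalTemperature", "severity": "critical"},
         "annotations": {"summary": "GPU 2 at 94C"}},
        {"status": "resolved",
         "labels": {"alertname": "GPUAllIdle", "severity": "warning"},
         "annotations": {}},
    ],
}
for line in format_alerts(payload):
    print(line)
# [CRITICAL] GPUCriticalTemperature: GPU 2 at 94C
# [RESOLVED] GPUAllIdle
```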

A pre-configured grafana/alertmanager.yml is included that routes all Prometheus alerts through gpu-monitor's webhook receiver automatically.

Kubernetes

Deploy as a DaemonSet to monitor every GPU node:

# Edit kubernetes/secret.yaml with your notification channel credentials
kubectl apply -k kubernetes/

The DaemonSet:

Schedules on nodes labeled nvidia.com/gpu: "true"
Exposes /metrics on port 8080 with Prometheus scraping annotations
Uses spec.nodeName as hostname for per-node identification in alerts
Reads credentials from a gpu-monitor-secrets Secret

For Prometheus pod auto-discovery:

# In prometheus.yml:
- job_name: gpu-monitor
  kubernetes_sd_configs:
    - role: pod
      namespaces:
        names: [gpu-monitor]
  relabel_configs:
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
      action: keep
      regex: "true"
    # Rewrite the scrape address to <pod host>:<port from the annotation>
    - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
      action: replace
      target_label: __address__
      regex: ([^:]+)(?::\d+)?;(\d+)
      replacement: $1:$2

GitHub Pages Dashboard

Near-real-time GPU dashboard hosted on GitHub Pages, with no extra server needed.

Setup:

Enable GitHub Pages in your repo: Settings → Pages → Source: main branch, /docs folder
Create a fine-grained personal access token with Contents: read and write on that repo
Set env vars on each machine:

export GITHUB_PAGES_TOKEN=github_pat_xxxxxxxxxxxxxxxxxxxx
export GITHUB_PAGES_REPO=your-username/your-repo
gpu-monitor

The monitor pushes docs/data/{hostname}.json every check interval. The dashboard at https://your-username.github.io/your-repo/ auto-refreshes every 30 seconds.

Multi-machine: each machine pushes its own file. The dashboard shows all machines side-by-side with online/stale/offline badges.

Variable	Description
`GITHUB_PAGES_TOKEN`	Fine-grained token with Contents read+write
`GITHUB_PAGES_REPO`	Repo to push stats to, e.g. `owner/repo`
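One way to push a per-host stats file is GitHub's repository contents API, which takes a PUT with base64-encoded content. Whether gpu-monitor uses this exact endpoint is an assumption; the sketch only builds the request, and `build_stats_push` plus the `avg_util` field are hypothetical.

```python
import base64
import json
import urllib.request
from typing import Any, Dict, Optional

def build_stats_push(repo: str, token: str, hostname: str, stats: Dict[str, Any],
                     sha: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) a PUT that writes docs/data/{hostname}.json."""
    path = f"docs/data/{hostname}.json"
    body = {
        "message": f"Update GPU stats for {hostname}",
        "content": base64.b64encode(json.dumps(stats).encode("utf-8")).decode("ascii"),
    }
    if sha:
        body["sha"] = sha  # the API requires the current blob SHA when updating a file
    return urllib.request.Request(
        f"https://api.github.com/repos/{repo}/contents/{path}",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
        method="PUT",
    )

req = build_stats_push("your-username/your-repo", "github_pat_example",
                       "gpu-cluster-1", {"avg_util": 87})
print(req.get_method(), req.full_url)
```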

Multi-Machine Setup

Deploy to each machine — each gets an auto-assigned color in Slack/Discord and appears on the GitHub Pages dashboard. All report to the same webhook/channel with their hostname clearly labeled in every message.

Setting Up Specific Channels

Setting Up Telegram

Message @BotFather → /newbot
Copy the token → TELEGRAM_BOT_TOKEN
Send a message to your bot, then visit https://api.telegram.org/bot<TOKEN>/getUpdates to find your TELEGRAM_CHAT_ID
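The getUpdates response is JSON, and the chat ID sits under `result[].message.chat.id` (or `channel_post` for channels). A small sketch for pulling it out, using a hand-written sample response; `chat_ids` is a hypothetical helper name.

```python
from typing import Any, Dict, List

def chat_ids(get_updates_response: Dict[str, Any]) -> List[int]:
    """Collect distinct chat IDs from a Bot API getUpdates response, in order seen."""
    seen: List[int] = []
    for update in get_updates_response.get("result", []):
        message = update.get("message") or update.get("channel_post") or {}
        chat_id = message.get("chat", {}).get("id")
        if chat_id is not None and chat_id not in seen:
            seen.append(chat_id)
    return seen

# Hand-written sample: one private chat and one channel post.
sample = {"ok": True, "result": [
    {"update_id": 1, "message": {"message_id": 1, "text": "hi",
                                 "chat": {"id": 123456789, "type": "private"}}},
    {"update_id": 2, "channel_post": {"message_id": 5, "text": "hi",
                                      "chat": {"id": -1001234567890, "type": "channel"}}},
]}
print(chat_ids(sample))  # [123456789, -1001234567890]
```

An empty `result` list usually means the bot has not received a message yet; send one and reload the URL.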

Setting Up Chinese Notification Channels

WeCom (企业微信)

Open WeCom → Group Chat → Add Group Robot
Copy the webhook URL → WECOM_WEBHOOK_URL

Feishu (飞书 / Lark)

Open Feishu group → Settings → Bots → Add Bot → Custom Bot
Copy the webhook URL → FEISHU_WEBHOOK_URL

DingTalk (钉钉)

Open DingTalk group → Group Settings → Bots → Add Robot → Custom
Set a keyword (e.g. GPU) in security settings
Copy the webhook URL → DINGTALK_WEBHOOK_URL

Bark (iOS)

Install Bark from the App Store
Copy your device URL → BARK_URL (e.g. https://api.day.app/YOUR_DEVICE_KEY)

Setting Up ntfy

ntfy is a zero-signup push notification service. Subscribe via the ntfy app (Android/iOS), web UI, or any HTTP client.

# No account needed — just pick any topic name
export NTFY_URL="https://ntfy.sh/my-gpu-cluster-abc123"
gpu-monitor

Subscribe to the same topic in the ntfy app on your phone to receive alerts instantly. For private topics, generate a token at ntfy.sh/app and set NTFY_TOKEN.

Self-hosted: replace https://ntfy.sh/ with your own server URL.
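Publishing to ntfy is a plain HTTP POST to the topic URL with the message as the body, which makes it easy to test a topic by hand. The sketch below only builds the request; `build_ntfy_request` is a hypothetical helper, and the title text is an assumption rather than what gpu-monitor sends.

```python
import urllib.request
from typing import Optional

def build_ntfy_request(topic_url: str, message: str, title: str = "GPU Monitor",
                       token: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) an ntfy publish request."""
    headers = {"Title": title}  # keep header values ASCII
    if token:
        headers["Authorization"] = f"Bearer {token}"  # only needed for protected topics
    return urllib.request.Request(
        topic_url,
        data=message.encode("utf-8"),
        headers=headers,
        method="POST",
    )

req = build_ntfy_request("https://ntfy.sh/my-gpu-cluster-abc123",
                         "gpu-cluster-1 | test alert")
print(req.get_method(), req.full_url)  # POST https://ntfy.sh/my-gpu-cluster-abc123
```

Passing the request to `urllib.request.urlopen` delivers it to every subscriber of the topic.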

Setting Up Apprise (80+ Extra Services)

Apprise is an optional dependency that adds 80+ additional services — AWS SNS, Pushbullet, Home Assistant, Matrix, SparkPost, and more — through URL-based configuration.

pip install apprise
export APPRISE_URLS="slack://TokenA/TokenB/TokenC/#channel tgram://bot_token/chat_id"
gpu-monitor

The core gpu-monitor has zero dependencies — Apprise is only activated when installed and APPRISE_URLS is set.

See the full list of URL formats in the Apprise wiki.

Setting Up OpenClaw

OpenClaw is a self-hosted notification router that delivers to 20+ chat platforms — WhatsApp, Teams, Signal, LINE, Mattermost, Matrix, Zalo, and more.

Install and start OpenClaw (see openclaw.ai)
In OpenClaw settings, enable the webhook gateway and copy the URL
Configure:

export OPENCLAW_WEBHOOK_URL="http://your-openclaw-host:18789/hooks/wake"
export OPENCLAW_WEBHOOK_SECRET="your-bearer-token"  # optional, if auth enabled
gpu-monitor

Who Uses gpu-monitor?

gpu-monitor is used by ML researchers, PhD students, and infrastructure engineers who run long training jobs and can't watch their machines around the clock.

Common setups:

Single researcher monitoring a local workstation with Telegram alerts
Lab with 4–8 GPU nodes, all reporting to a shared Slack channel with per-machine colors
Cloud cluster on Kubernetes, with PagerDuty integration for on-call rotation
Self-hosted Prometheus + Grafana stack with Alertmanager routing through gpu-monitor's webhook receiver

Have a setup you're proud of? Open an issue with the showcase label and share it — setups get featured here.

Author

Built and maintained by reacher-z.

If this tool saved your GPU-hours or helped you catch a crash before it ruined a training run, consider giving it a star on GitHub — it helps other researchers and engineers discover the project.

Bugs, feature requests, and channel integrations: open an issue or submit a PR. Contributions are welcome.

Release history

1.0.0

Mar 8, 2026

This version

0.3.0

Mar 8, 2026

Download files

Source Distributions

No source distribution files available for this release.

Built Distribution

gpuwatch-0.3.0-py3-none-any.whl (33.8 kB view details)

Uploaded Mar 8, 2026 Python 3

File details

Details for the file gpuwatch-0.3.0-py3-none-any.whl.

File metadata

Download URL: gpuwatch-0.3.0-py3-none-any.whl
Upload date: Mar 8, 2026
Size: 33.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for gpuwatch-0.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e23c33ef189b886c5a9258f011c59d3e6ab438aeca3c1f7c966af1c5f5c2994b`
MD5	`82b7226a8028f7b6a3bedd38e823b61f`
BLAKE2b-256	`b4f65d31066c61ad8d301bcb46bf0126e9b3a4408b16f8f131cb75c9f66b0013`
