CLI tool for GPU/Slurm job notifications with automatic log and artifact delivery
GPUAlert
A CLI for long-running GPU and Slurm jobs that emails you when they finish — with the full stdout/stderr logs and any output artifacts attached.
pip install gpualert
gpualert config --init
gpualert run -- python train.py
Why
You've kicked off training, it'll take twelve hours, and you want to know whether it crashed at hour two or finished cleanly at hour eleven. SSH'ing back in to find out is a tax. GPUAlert wraps the job, writes structured logs to disk, classifies common failure modes (CUDA OOM, NCCL, NaN loss, OOMKiller, etc.), and emails you the result with logs attached.
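The failure classification described above can be pictured as a pattern scan over the captured log text. The sketch below is illustrative only: the labels, the regular expressions, and the `classify_failure` name are assumptions, not GPUAlert's actual rules.

```python
import re

# Hypothetical patterns for illustration; GPUAlert's real rules may differ.
FAILURE_PATTERNS = [
    ("cuda_oom", re.compile(r"CUDA out of memory", re.IGNORECASE)),
    ("nccl", re.compile(r"NCCL (error|warn|timeout)", re.IGNORECASE)),
    ("nan_loss", re.compile(r"loss\s*[:=]\s*nan\b", re.IGNORECASE)),
    ("oom_killer", re.compile(r"oom-kill|Out of memory: Killed process", re.IGNORECASE)),
]


def classify_failure(log_text: str) -> str:
    """Return the label of the first pattern found in the log, or "unknown"."""
    for label, pattern in FAILURE_PATTERNS:
        if pattern.search(log_text):
            return label
    return "unknown"
```

Ordering matters in a scheme like this: the first match wins, so more specific patterns should come first.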
Features
- Wraps any command and emails on completion: success, failure, timeout, or Ctrl+C.
- Polls Slurm jobs via sacct so you can monitor jobs you already submitted with sbatch.
- Writes log files to disk before the process starts, so they exist even on segfault.
- Always attaches logs to failure emails. Non-negotiable.
- Auto-detects ML metrics in successful runs (accuracy, loss, F1, mAP, ...) and surfaces them in the email body.
- Scans the working directory for output artifacts after the job ends; budgets the email and zips the overflow.
- --dry-run prints the email it would send without touching SMTP, which is useful for debugging.
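To make the metric auto-detection concrete, here is a minimal sketch of how values such as `loss: 0.31` could be pulled out of log text. The regular expression, the `extract_metrics` name, and the "last value wins" policy are assumptions for illustration; the source does not describe how GPUAlert parses metrics.

```python
import re

# Metric names taken from the feature list; the matching rules are assumed.
METRIC_NAMES = ("accuracy", "loss", "F1", "mAP")

_METRIC_RE = re.compile(
    r"\b(" + "|".join(METRIC_NAMES) + r")\b\s*[:=]\s*"
    r"([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)",
    re.IGNORECASE,
)


def extract_metrics(log_text: str) -> dict:
    """Return the last reported value for each recognised metric name."""
    metrics = {}
    for name, value in _METRIC_RE.findall(log_text):
        # Later occurrences overwrite earlier ones, so the final epoch wins.
        metrics[name.lower()] = float(value)
    return metrics
```

Keys are lower-cased, so `F1` is reported as `f1` and `mAP` as `map` in this sketch.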
Quick start
Install and configure:
pip install gpualert
gpualert config --init # interactive SMTP wizard
gpualert test-email # verify it actually works
For Gmail, generate an App Password at https://myaccount.google.com/apppasswords (requires 2FA on the account). Paste it at the password prompt.
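If you want to check an App Password outside GPUAlert, the Python standard library is enough. The sketch below builds a message with optional attachments and sends it over STARTTLS, which is what Gmail expects on port 587. The function names are hypothetical and this is not GPUAlert's own sending code.

```python
import smtplib
from email.message import EmailMessage
from pathlib import Path


def build_message(sender, recipients, subject, body, attachments=()):
    """Build a plain-text email and attach each file as a binary part."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg["Subject"] = subject
    msg.set_content(body)
    for path in attachments:
        path = Path(path)
        msg.add_attachment(
            path.read_bytes(),
            maintype="application",
            subtype="octet-stream",
            filename=path.name,
        )
    return msg


def send_message(msg, server, port, username, password):
    """Send over SMTP with STARTTLS (e.g. smtp.gmail.com on port 587)."""
    with smtplib.SMTP(server, port, timeout=30) as smtp:
        smtp.starttls()
        smtp.login(username, password)
        smtp.send_message(msg)
```

Building the message separately from sending it also makes a dry run straightforward: print `msg.as_string()` instead of calling `send_message`.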
Wrap a local job:
gpualert run -- python train.py --epochs 50
gpualert run --timeout 7200 -- bash train.sh
gpualert run --dry-run -- python smoke.py
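Conceptually, wrapping a job means opening the log files first, then launching the child with its output redirected into them, and finally mapping the outcome to a status. A minimal sketch of that idea follows; the file names `stdout.log` and `stderr.log`, the status strings, and the `run_with_logs` helper are assumptions, and the timeout is assumed to be in seconds.

```python
import subprocess
from pathlib import Path


def run_with_logs(command, log_dir, timeout=None):
    """Run `command`, streaming stdout/stderr to files; return (status, returncode)."""
    log_dir = Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)
    # Open the log files before the child starts so they exist
    # even if the process dies immediately.
    with open(log_dir / "stdout.log", "wb") as out, \
            open(log_dir / "stderr.log", "wb") as err:
        try:
            proc = subprocess.run(command, stdout=out, stderr=err, timeout=timeout)
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child before raising TimeoutExpired.
            return "timeout", None
    status = "success" if proc.returncode == 0 else "failure"
    return status, proc.returncode
```

For example, `run_with_logs(["python", "train.py"], "logs/run1", timeout=7200)` would mirror the second command above under these assumptions.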
Monitor a Slurm job you've already submitted:
gpualert slurm 12345
gpualert slurm 12345 --interval 30 --timeout 86400
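Polling a job through sacct amounts to asking for its state at a fixed interval until it reaches a terminal state or the monitoring timeout expires. The sketch below shows that loop; the helper names, the `UNKNOWN` and `MONITOR_TIMEOUT` return values, and the default interval and timeout (copied from the example command, not from GPUAlert's defaults) are assumptions.

```python
import subprocess
import time

# Slurm job states after which the job will not run any further.
TERMINAL_STATES = {
    "COMPLETED", "FAILED", "CANCELLED", "TIMEOUT", "OUT_OF_MEMORY",
    "NODE_FAIL", "BOOT_FAIL", "DEADLINE", "PREEMPTED",
}


def query_state(job_id):
    """Ask sacct for the job's state (allocation line only)."""
    result = subprocess.run(
        ["sacct", "-j", str(job_id), "-X", "--noheader",
         "--parsable2", "--format=State"],
        capture_output=True, text=True, check=True,
    )
    lines = result.stdout.strip().splitlines()
    # "CANCELLED by 1234" -> "CANCELLED"; no output means no accounting record yet.
    return lines[0].split()[0] if lines else "UNKNOWN"


def wait_for_job(job_id, interval=30, timeout=86400,
                 query=query_state, sleep=time.sleep, clock=time.monotonic):
    """Poll until the job reaches a terminal state or the timeout passes."""
    deadline = clock() + timeout
    while True:
        state = query(job_id)
        if state in TERMINAL_STATES:
            return state
        if clock() >= deadline:
            return "MONITOR_TIMEOUT"
        sleep(interval)
```

Passing `query`, `sleep`, and `clock` as parameters keeps the loop testable on a machine without Slurm installed.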
List recent log directories:
gpualert logs --last 20
Configuration
Stored at ~/.gpualert/config.toml (mode 600), created on first run.
[smtp]
server = "smtp.gmail.com"
port = 587
use_tls = true
username = "you@gmail.com"
password = "your-app-password"
[email]
to_addresses = ["you@gmail.com"]
attach_logs_on_success = true
[artifacts]
patterns = ["*.csv", "*.png", "*.json", "*.log", "*.npz"]
max_single_file_mb = 25
max_total_mb = 45
Full reference: docs/configuration.md.
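To illustrate how the [artifacts] settings might interact, the sketch below picks files matching the patterns and splits them into direct attachments and overflow using the two size limits. The selection policy (smallest files first, top-level directory only) and the `plan_attachments` name are assumptions; the source only says that the email is budgeted and the overflow is zipped.

```python
from fnmatch import fnmatch
from pathlib import Path

MB = 1024 * 1024


def plan_attachments(root, patterns, max_single_file_mb=25, max_total_mb=45):
    """Split matching files into (attach, overflow) under the size budget."""
    # Assumption: only the top level of the working directory is scanned.
    candidates = sorted(
        (p for p in Path(root).iterdir()
         if p.is_file() and any(fnmatch(p.name, pat) for pat in patterns)),
        key=lambda p: (p.stat().st_size, p.name),
    )
    attach, overflow, used = [], [], 0
    for path in candidates:
        size = path.stat().st_size
        fits_single = size <= max_single_file_mb * MB
        fits_total = used + size <= max_total_mb * MB
        if fits_single and fits_total:
            attach.append(path)
            used += size
        else:
            # In GPUAlert these would presumably end up in the zip archive.
            overflow.append(path)
    return attach, overflow
```

Sorting by size favours attaching many small files (metrics, plots) over a single large one, which is one reasonable policy among several.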
Documentation
Requirements
- Python 3.10+
- Linux or macOS
- An SMTP account you can authenticate to
License
MIT. See LICENSE.
File details
Details for the file gpualert-0.1.0.tar.gz.
File metadata
- Download URL: gpualert-0.1.0.tar.gz
- Upload date:
- Size: 35.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 341bb48faad64247731701f38fe748548d96a455f26ade45170fcbab7fb887d1 |
| MD5 | 17742533acdc8a9cf6abea53ba6cec18 |
| BLAKE2b-256 | 427e0db22d0efd9361c87f6409b1f65d9614405a130a4d3c137bcef7beb72064 |
File details
Details for the file gpualert-0.1.0-py3-none-any.whl.
File metadata
- Download URL: gpualert-0.1.0-py3-none-any.whl
- Upload date:
- Size: 27.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 19e3deef1611b302cf7ecb52b9e9fcb087f89e53a141fc14f69e9f65b91bb921 |
| MD5 | fe57644c8ee933cfa9a9b2d7201b1c7c |
| BLAKE2b-256 | 169c481827297b02911a7ec74629aa654f6256d5d3c03fc7ad6b3d4b3e5782ff |