CLI tool for GPU/Slurm job notifications with automatic log and artifact delivery

GPUAlert

A CLI for long-running GPU and Slurm jobs that emails you when they finish — with the full stdout/stderr logs and any output artifacts attached.

pip install gpualert
gpualert config --init
gpualert run -- python train.py

Why

You've kicked off training, it'll take twelve hours, and you want to know whether it crashed at hour two or finished cleanly at hour eleven. SSH'ing back in to find out is a tax. GPUAlert wraps the job, writes structured logs to disk, classifies common failure modes (CUDA OOM, NCCL, NaN loss, OOMKiller, etc.), and emails you the result with logs attached.
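
As a rough illustration of what classifying failure modes from logs can look like, here is a minimal sketch. The pattern list, the labels, and the `classify_failure` function are hypothetical and are not taken from GPUAlert's source; the tool's real rules may differ.

```python
import re

# Hypothetical patterns for illustration only; not GPUAlert's actual rules.
FAILURE_PATTERNS = [
    ("cuda_oom", re.compile(r"CUDA out of memory", re.IGNORECASE)),
    ("nccl", re.compile(r"\bNCCL\b.*(error|timeout|failure)", re.IGNORECASE)),
    ("nan_loss", re.compile(r"\bloss\b[^\n]*\bnan\b", re.IGNORECASE)),
    ("oom_killer", re.compile(r"oom-kill|Out of memory: Killed process", re.IGNORECASE)),
]


def classify_failure(log_text: str, exit_code: int) -> str:
    """Return the label of the first matching pattern, else a fallback."""
    for label, pattern in FAILURE_PATTERNS:
        if pattern.search(log_text):
            return label
    # 137 = 128 + 9 (SIGKILL). The kernel OOM killer is a common cause,
    # but any SIGKILL produces the same exit code.
    if exit_code == 137:
        return "sigkill"
    return "unknown"
```

Matching in a fixed order keeps the result deterministic when a log contains more than one symptom.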

Features

  • Wraps any command and emails on completion: success, failure, timeout, or Ctrl+C.
  • Polls Slurm jobs via sacct so you can monitor jobs you already submitted with sbatch.
  • Writes log files to disk before the process starts, so they exist even on segfault.
  • Always attaches logs to failure emails. Non-negotiable.
  • Auto-detects ML metrics in successful runs (accuracy, loss, F1, mAP, ...) and surfaces them in the email body.
  • Scans the working directory for output artifacts after the job ends, attaches them within a size budget (see max_single_file_mb and max_total_mb under Configuration), and zips the overflow.
  • --dry-run prints the email it would send without touching SMTP — useful for debugging.
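
The metric auto-detection mentioned in the list above could be approximated with a regular expression over the captured stdout. This is only a sketch: the metric names, the `name: value` / `name=value` format, and the `extract_last_metrics` helper are assumptions, not GPUAlert's actual parser.

```python
import re

# Assumed metric names and "name: value" / "name=value" formats.
METRIC_RE = re.compile(
    r"\b(?P<name>accuracy|loss|f1|map)\b\s*[:=]\s*"
    r"(?P<value>[-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)",
    re.IGNORECASE,
)


def extract_last_metrics(log_text: str) -> dict[str, float]:
    """Keep the last reported value for each metric name found in the log."""
    metrics: dict[str, float] = {}
    for match in METRIC_RE.finditer(log_text):
        metrics[match.group("name").lower()] = float(match.group("value"))
    return metrics


log = "epoch 1 loss: 0.90 accuracy=0.61\nepoch 2 loss: 0.42 accuracy=0.88 mAP: 0.5\n"
print(extract_last_metrics(log))
```

Later matches overwrite earlier ones, so the result reflects the final epoch rather than the best one.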

Quick start

Install and configure:

pip install gpualert
gpualert config --init     # interactive SMTP wizard
gpualert test-email        # verify it actually works

For Gmail, generate an App Password at https://myaccount.google.com/apppasswords (requires 2FA on the account). Paste it at the password prompt.
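
For readers curious what sending a notification over SMTP with STARTTLS on port 587 involves, here is a minimal sketch using Python's standard library. The function names, the subject format, and the `dry_run` behavior are hypothetical; they only mirror what the README describes and are not GPUAlert's internals.

```python
import smtplib
import ssl
from email.message import EmailMessage


def build_message(subject, body, sender, recipients, attachments=()):
    """Build an email; attachments is an iterable of (filename, bytes)."""
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg.set_content(body)
    for filename, data in attachments:
        msg.add_attachment(
            data, maintype="application", subtype="octet-stream", filename=filename
        )
    return msg


def deliver(msg, server, port, username, password, dry_run=False):
    """Send msg over STARTTLS, or just print it when dry_run is set."""
    if dry_run:
        # No network access: show what would have been sent.
        print("Subject:", msg["Subject"])
        print(msg.get_body(preferencelist=("plain",)).get_content())
        return False
    with smtplib.SMTP(server, port, timeout=30) as smtp:
        smtp.starttls(context=ssl.create_default_context())
        smtp.login(username, password)  # for Gmail, the App Password
        smtp.send_message(msg)
    return True
```

Keeping message construction separate from delivery is what makes a dry-run mode cheap: the same message object is either printed or sent.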

Wrap a local job:

gpualert run -- python train.py --epochs 50
gpualert run --timeout 7200 -- bash train.sh
gpualert run --dry-run -- python smoke.py
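
Conceptually, wrapping a job means creating the log files first, then starting the child process with its output redirected into them. The sketch below shows that idea with `subprocess`; `run_with_logs`, the log file names, and the status labels are assumptions made for illustration and do not describe GPUAlert's real code.

```python
import subprocess
from pathlib import Path


def run_with_logs(cmd, log_dir, timeout=None):
    """Run cmd with stdout/stderr redirected to files created up front."""
    log_dir = Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)
    out_path = log_dir / "stdout.log"
    err_path = log_dir / "stderr.log"
    # Open the log files before the child starts, so they exist on disk
    # even if the child dies immediately.
    with open(out_path, "wb") as out, open(err_path, "wb") as err:
        proc = subprocess.Popen(cmd, stdout=out, stderr=err)
        try:
            code = proc.wait(timeout=timeout)
            status = "success" if code == 0 else "failure"
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait()
            code, status = proc.returncode, "timeout"
        except KeyboardInterrupt:
            proc.terminate()
            proc.wait()
            code, status = proc.returncode, "interrupted"
    return status, code, out_path, err_path
```

Because the child writes straight to the files, nothing is buffered in the wrapper, and a crash of either process leaves whatever was written so far on disk.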

Monitor a Slurm job you've already submitted:

gpualert slurm 12345
gpualert slurm 12345 --interval 30 --timeout 86400

List recent log directories:

gpualert logs --last 20

Configuration

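Configuration is stored at ~/.gpualert/config.toml with file mode 600 (readable and writable only by your user, since the file holds the SMTP password). The file is created on first run.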

[smtp]
server = "smtp.gmail.com"
port = 587
use_tls = true
username = "you@gmail.com"
password = "your-app-password"

[email]
to_addresses = ["you@gmail.com"]
attach_logs_on_success = true

[artifacts]
patterns = ["*.csv", "*.png", "*.json", "*.log", "*.npz"]
max_single_file_mb = 25
max_total_mb = 45

Full reference: docs/configuration.md.
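
To make the [artifacts] limits concrete, here is one way the pattern matching and size budget could interact. This is a sketch under stated assumptions: the smallest-first ordering, the `plan_attachments` and `matches` helpers, and the treatment of overflow as a plain list are all hypothetical, and GPUAlert's real selection logic may differ.

```python
import fnmatch

MB = 1024 * 1024


def matches(name, patterns):
    """True if the file name matches any glob pattern from [artifacts]."""
    return any(fnmatch.fnmatch(name, pattern) for pattern in patterns)


def plan_attachments(files, max_single_file_mb=25, max_total_mb=45):
    """Split (name, size_bytes) pairs into files to attach and overflow.

    Assumption: smaller files are attached first so that as many
    artifacts as possible fit under the total budget.
    """
    attach, overflow, used = [], [], 0
    for name, size in sorted(files, key=lambda item: item[1]):
        fits_single = size <= max_single_file_mb * MB
        fits_total = used + size <= max_total_mb * MB
        if fits_single and fits_total:
            attach.append(name)
            used += size
        else:
            overflow.append(name)
    return attach, overflow


patterns = ["*.csv", "*.png", "*.json", "*.log", "*.npz"]
found = [("a.csv", 1 * MB), ("b.png", 30 * MB), ("c.npz", 20 * MB),
         ("d.json", 25 * MB), ("model.pt", 5 * MB)]
candidates = [f for f in found if matches(f[0], patterns)]
print(plan_attachments(candidates))
```

In this example `b.png` exceeds the per-file limit and `d.json` would push the total past 45 MB, so both end up in the overflow list; `model.pt` is never considered because it matches no pattern.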

Requirements

  • Python 3.10+
  • Linux or macOS
  • An SMTP account you can authenticate to

License

MIT. See LICENSE.
