CLI tool for GPU/Slurm job notifications with automatic log and artifact delivery

GPUAlert

A CLI for long-running GPU and Slurm jobs that emails you when they finish — with the full stdout/stderr logs and any output artifacts attached.

pip install gpualert
gpualert config --init
gpualert run -- python train.py

Why

You've kicked off training, it'll take twelve hours, and you want to know whether it crashed at hour two or finished cleanly at hour eleven. SSH'ing back in to find out is a tax. GPUAlert wraps the job, writes structured logs to disk, classifies common failure modes (CUDA OOM, NCCL errors, NaN loss, the Linux OOM killer, etc.), and emails you the result with logs attached.
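To make the classification step concrete, here is a minimal sketch of how a finished job's log text and exit code could be mapped to a coarse label. The pattern list, the labels, and the `classify_failure` name are assumptions for illustration, not GPUAlert's actual rules.

```python
import re

# Hypothetical patterns for illustration; GPUAlert's real rules may differ.
FAILURE_PATTERNS = [
    ("cuda_oom", re.compile(r"CUDA out of memory", re.IGNORECASE)),
    ("nccl", re.compile(r"NCCL (error|WARN)|ncclSystemError|ncclUnhandledCudaError")),
    ("nan_loss", re.compile(r"\bloss\b[^\n]*\bnan\b", re.IGNORECASE)),
    ("oom_killer", re.compile(r"Out of memory: Killed process|oom-kill", re.IGNORECASE)),
]


def classify_failure(log_text: str, exit_code: int) -> str:
    """Return a coarse label for a finished job based on its logs and exit code."""
    if exit_code == 0:
        return "success"
    for label, pattern in FAILURE_PATTERNS:
        if pattern.search(log_text):
            return label
    # 137 is 128 + SIGKILL(9) as reported by a shell; subprocess reports -9.
    # The kernel OOM killer is a common cause of SIGKILL, but not the only one.
    if exit_code in (137, -9):
        return "killed_sigkill"
    return "unknown"
```

Checking the log text before the exit code lets a specific message (such as a CUDA OOM traceback) win over the generic "killed" label.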

Features

Wraps any command and emails on completion: success, failure, timeout, or Ctrl+C.
Polls Slurm jobs via sacct so you can monitor jobs you already submitted with sbatch.
Writes log files to disk before the process starts, so they exist even on segfault.
Always attaches logs to failure emails. Non-negotiable.
Auto-detects ML metrics in successful runs (accuracy, loss, F1, mAP, ...) and surfaces them in the email body.
Scans the working directory for output artifacts after the job ends, attaches the ones that fit within the email size budget, and zips the overflow.
--dry-run prints the email it would send without touching SMTP — useful for debugging.
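As a rough illustration of the "attach logs, or just print the email on --dry-run" behaviour listed above, the sketch below builds a message with the standard library and either previews it or sends it over SMTP with STARTTLS. The function names (`build_email`, `render_preview`, `send`) and the subject format are hypothetical; the `smtp_cfg` keys mirror the `[smtp]` table in the configuration file.

```python
import smtplib
from email.message import EmailMessage
from pathlib import Path


def build_email(subject, body, sender, recipients, attachments=()):
    """Build a plain-text email with each file in `attachments` attached."""
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg.set_content(body)
    for path in attachments:
        path = Path(path)
        msg.add_attachment(
            path.read_bytes(),
            maintype="application",
            subtype="octet-stream",
            filename=path.name,
        )
    return msg


def render_preview(msg):
    """Return a human-readable summary of the message without sending it."""
    names = [part.get_filename() for part in msg.iter_attachments()]
    body = msg.get_body(preferencelist=("plain",)).get_content()
    return (
        f"To: {msg['To']}\nSubject: {msg['Subject']}\n"
        f"Attachments: {', '.join(names) or 'none'}\n\n{body}"
    )


def send(msg, smtp_cfg, dry_run=False):
    """Send the message, or only print it when dry_run is set. Returns True if sent."""
    if dry_run:
        print(render_preview(msg))
        return False
    with smtplib.SMTP(smtp_cfg["server"], smtp_cfg["port"]) as smtp:
        if smtp_cfg.get("use_tls", True):
            smtp.starttls()
        smtp.login(smtp_cfg["username"], smtp_cfg["password"])
        smtp.send_message(msg)
    return True
```

Keeping message construction separate from delivery is what makes a dry run cheap: the same `EmailMessage` is built either way, and only the last step differs.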

Quick start

Install and configure:

pip install gpualert
gpualert config --init     # interactive SMTP wizard
gpualert test-email        # verify it actually works

For Gmail, generate an App Password at https://myaccount.google.com/apppasswords (requires 2FA on the account). Paste it at the password prompt.

Wrap a local job:

gpualert run -- python train.py --epochs 50
gpualert run --timeout 7200 -- bash train.sh
gpualert run --dry-run -- python smoke.py
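For readers curious what wrapping a command involves, the following is a minimal sketch, not GPUAlert's implementation: it opens the log files before the child process starts (so they exist even if the child dies immediately), enforces an optional timeout, and reports one of the four outcomes. The `run_wrapped` name, the log file names, and the returned dictionary are assumptions.

```python
import subprocess
import sys
import time
from pathlib import Path


def run_wrapped(cmd, log_dir, timeout=None):
    """Run `cmd`, streaming stdout/stderr to files, and report how it ended."""
    log_dir = Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)
    out_path = log_dir / "stdout.log"
    err_path = log_dir / "stderr.log"
    started = time.monotonic()
    # The log files are created here, before the child process exists.
    with open(out_path, "wb") as out, open(err_path, "wb") as err:
        proc = subprocess.Popen(cmd, stdout=out, stderr=err)
        try:
            code = proc.wait(timeout=timeout)
            status = "success" if code == 0 else "failure"
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait()
            code, status = proc.returncode, "timeout"
        except KeyboardInterrupt:
            proc.terminate()
            proc.wait()
            code, status = proc.returncode, "interrupted"
    return {
        "status": status,
        "exit_code": code,
        "seconds": time.monotonic() - started,
        "stdout": out_path,
        "stderr": err_path,
    }


if __name__ == "__main__":
    result = run_wrapped([sys.executable, "-c", "print('hello')"], "demo_logs", timeout=60)
    print(result["status"], result["exit_code"])
```

Because the child writes directly to the open file descriptors, output already written survives a segfault or a SIGKILL of the child.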

Monitor a Slurm job you've already submitted:

gpualert slurm 12345
gpualert slurm 12345 --interval 30 --timeout 86400
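Polling with sacct can be sketched roughly as follows. This assumes `sacct` is on the PATH and uses its standard `-j`, `-X`, `-n`, `-P`, and `--format` options; the set of terminal states, the function names, and the default interval and timeout (taken from the example above, not from GPUAlert's documented defaults) are assumptions.

```python
import subprocess
import time

# Slurm states treated as final in this sketch; the tool's own list may differ.
TERMINAL_STATES = {
    "COMPLETED", "FAILED", "CANCELLED", "TIMEOUT",
    "OUT_OF_MEMORY", "NODE_FAIL", "BOOT_FAIL", "DEADLINE",
}


def parse_state(sacct_output):
    """Extract the job state from `sacct -X -n -P --format=JobID,State` output."""
    for line in sacct_output.splitlines():
        fields = line.strip().split("|")
        if len(fields) >= 2 and fields[1]:
            # sacct can report "CANCELLED by <uid>"; keep only the state word.
            return fields[1].split()[0]
    return None


def query_state(job_id):
    """Ask sacct for the current state of one job allocation."""
    result = subprocess.run(
        ["sacct", "-j", str(job_id), "-X", "-n", "-P", "--format=JobID,State"],
        capture_output=True, text=True, check=True,
    )
    return parse_state(result.stdout)


def wait_for_job(job_id, interval=30, timeout=86400, query=query_state, sleep=time.sleep):
    """Poll until the job reaches a terminal state; return None on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        state = query(job_id)
        if state in TERMINAL_STATES:
            return state
        sleep(interval)
    return None
```

Passing `query` and `sleep` as parameters keeps the loop testable on a machine without Slurm installed.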

List recent log directories:

gpualert logs --last 20

Configuration

The configuration is stored at ~/.gpualert/config.toml (file mode 600) and is created on first run.

[smtp]
server = "smtp.gmail.com"
port = 587
use_tls = true
username = "you@gmail.com"
password = "your-app-password"

[email]
to_addresses = ["you@gmail.com"]
attach_logs_on_success = true

[artifacts]
patterns = ["*.csv", "*.png", "*.json", "*.log", "*.npz"]
max_single_file_mb = 25
max_total_mb = 45

Full reference: docs/configuration.md.
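One plausible reading of the `[artifacts]` limits is sketched below: match files against the patterns, attach the smallest ones first until `max_total_mb` is reached, and send anything too large for `max_single_file_mb` or the remaining budget to a zip archive. The selection order and the function names are assumptions; the real tool may choose differently.

```python
import zipfile
from fnmatch import fnmatch
from pathlib import Path

MB = 1024 * 1024


def scan_artifacts(workdir, patterns):
    """Return (path, size_in_bytes) for every file under workdir matching a pattern."""
    return sorted(
        (p, p.stat().st_size)
        for p in Path(workdir).rglob("*")
        if p.is_file() and any(fnmatch(p.name, pat) for pat in patterns)
    )


def plan_attachments(files, max_single_file_mb=25, max_total_mb=45):
    """Split (path, size) pairs into files to attach and files that overflow."""
    attach, overflow, used = [], [], 0
    # Smallest first, so as many files as possible fit in the budget.
    for path, size in sorted(files, key=lambda item: item[1]):
        fits_single = size <= max_single_file_mb * MB
        fits_total = used + size <= max_total_mb * MB
        if fits_single and fits_total:
            attach.append(path)
            used += size
        else:
            overflow.append(path)
    return attach, overflow


def zip_overflow(paths, dest):
    """Write the overflow files into one compressed archive at dest."""
    with zipfile.ZipFile(dest, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in paths:
            zf.write(path)
    return Path(dest)
```

With the limits shown in the config above, four files of 10, 20, 20, and 30 MB would yield two attachments (10 and 20 MB) and two overflow files: one 20 MB file that would push the total past 45 MB, and the 30 MB file that exceeds the single-file limit.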

Documentation

Requirements

Python 3.10+
Linux or macOS
An SMTP account you can authenticate to

License

MIT. See LICENSE.

Release history

0.1.1 (May 29, 2026)
0.1.0 (May 26, 2026, this version)

Download files

Source Distribution

gpualert-0.1.0.tar.gz (35.6 kB)

Uploaded May 26, 2026 Source

Built Distribution

gpualert-0.1.0-py3-none-any.whl (27.8 kB)

Uploaded May 26, 2026 Python 3

File details

Details for the file gpualert-0.1.0.tar.gz.

File metadata

Download URL: gpualert-0.1.0.tar.gz
Upload date: May 26, 2026
Size: 35.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for gpualert-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`341bb48faad64247731701f38fe748548d96a455f26ade45170fcbab7fb887d1`
MD5	`17742533acdc8a9cf6abea53ba6cec18`
BLAKE2b-256	`427e0db22d0efd9361c87f6409b1f65d9614405a130a4d3c137bcef7beb72064`

File details

Details for the file gpualert-0.1.0-py3-none-any.whl.

File metadata

Download URL: gpualert-0.1.0-py3-none-any.whl
Upload date: May 26, 2026
Size: 27.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for gpualert-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`19e3deef1611b302cf7ecb52b9e9fcb087f89e53a141fc14f69e9f65b91bb921`
MD5	`fe57644c8ee933cfa9a9b2d7201b1c7c`
BLAKE2b-256	`169c481827297b02911a7ec74629aa654f6256d5d3c03fc7ad6b3d4b3e5782ff`
