Real-time TUI for monitoring cloud GPU training instances
Project description
vigil
Real-time TUI for monitoring cloud GPU training instances.
TL;DR —
pip install vigil-gpu && vigil. A single terminal command gives you live SSH log streaming from every cloud GPU instance, regex-parsed metrics with sparklines, NaN/stall/plateau alerts to desktop or Slack/Discord, nvidia-smi on demand, persistent logs that survive crashes, and a setup wizard that auto-detects your framework. All from a keyboard-driven TUI.
The Problem
Cloud GPU consoles (Vast.ai, RunPod, Lambda, etc.) provide delayed, truncated output and offer no log persistence when instances crash or are preempted. If you are running multiple GPU training jobs across rented instances, you need a tool that streams logs in real time, survives disconnections, and alerts you before a stalled run burns through your budget.
Supported Providers
- Vast.ai — full support (auto-discovery, instance management, cost tracking)
- RunPod — full support (auto-discovery via GraphQL, instance management, cost tracking)
Features
Live Monitoring
- Auto-discovers all running instances via provider APIs
- Streams logs in real time over SSH with automatic reconnection and exponential backoff
- Dynamic grid layout that adapts columns to instance count (1/2/3 columns, configurable max)
- Per-instance panels showing GPU type, count, label, and hourly cost
- Aggregate cost display across all instances in the subtitle bar
- Focus mode: select a panel to expand it to full width
- Pause/resume log rendering and toggle auto-scroll per panel
Metric Parsing
- Configurable regex patterns with named capture groups
- 10 built-in patterns: loss, step, epoch, learning rate, reward, accuracy, RSSM, VC, TOM, actor
- Inline sparkline charts (Unicode block elements) showing metric trends over time
- Color-coded direction indicators (green/red based on whether up or down is desirable)
- NaN/Inf detection with immediate alerting
- Loss plateau detection using rolling standard deviation
- Keyword pre-filtering for efficient pattern matching at scale
Framework Presets
- HuggingFace Transformers: eval_loss, train_loss, learning_rate, eval_accuracy, grad_norm, samples_sec
- PyTorch Lightning: val_loss, train_loss, val_acc, train_acc
- DreamerV3 / RL: ep_return, episode, fps, policy_loss, value_loss, model_loss
- Auto-detection from running instance logs (samples 5 seconds of output to identify framework)
- Apply via setup wizard or configure manually
Alerts
- Stall detection when no output is received for a configurable duration
- NaN/Inf alerts with visual panel highlighting (red border)
- Loss plateau alerts using sliding-window standard deviation (configurable window and threshold)
- Desktop notifications on macOS (osascript) and Linux (notify-send)
- Webhook integration with Slack (Block Kit), Discord (embeds), and raw JSON formats
- Per-alert-type notification control (NaN, plateau, stall independently togglable)
- Alerts auto-clear when values return to normal
- All notifications are best-effort and non-blocking (never disrupt the TUI)
Instance Management
- Instance manager view with GPU details, costs, connection status, and SSH commands
- Live nvidia-smi output over SSH (with manual refresh)
- Change the log command per instance at runtime (persisted to config)
- Stop/destroy instances directly from the TUI (with confirmation dialog showing instance details)
- Force SSH reconnect with manual trigger
- Configurable SSH username (global and per-instance)
- SSH keepalive and connection timeout settings
Log Persistence
- All logs written to
~/.vigil/logs/with per-instance directories - Timestamped raw log files with atomic
latest.logsymlink - Metrics JSONL sidecar files for post-hoc analysis
- Historical log browser with session list and search
- Crash-resilient: logs are buffered and flushed periodically (every 10s) and on disconnect
- Configurable log retention by age (days) and total size (MB)
- Automatic hourly cleanup of expired log sessions
Search
- Inline log filtering with
/(regex-supported, debounced) - Global cross-instance search with
G(searches all in-memory log buffers, up to 100 results per instance) - Historical log search in the log viewer
- Graceful regex fallback to substring matching on invalid patterns
Metrics Comparison
- Full-screen metrics table comparing all instances side by side
- Dynamic columns based on collected metrics across all instances
- Sparklines and direction indicators per metric per instance
- Auto-refreshes every 2 seconds with manual refresh option
Setup Wizard
- Guided 5-step onboarding wizard for first-time setup
- Step 1: API key input with live validation (tests connectivity and shows instance count)
- Step 2: SSH key selection with auto-detection of common key paths
- Step 3: Framework preset selection with auto-detection from running instances
- Step 4: Alert and notification configuration (desktop, stall threshold, webhooks)
- Step 5: Summary and confirmation before launch
- Progress bar and back/skip/next navigation
- Accessible anytime with
S(Shift+S)
Contextual Hints
- Non-intrusive hint bar with sequenced tips for new users
- Covers: focusing panels, searching logs, viewing help, comparing metrics
- Auto-dismisses after 30 seconds, respects minimum instance requirements
- Tracks completion in persistent state; reset with
--reset-hints
Quick Start
pip install vigil-gpu
vigil
If this is your first run, the setup wizard will guide you through configuration. Otherwise, provide your API key via any of these methods (checked in order):
- CLI flag:
vigil --api-key YOUR_KEY - Environment variable:
VIGIL_API_KEY=YOUR_KEY vigil(also acceptsVAST_API_KEY/RUNPOD_API_KEY) - Config file: Set
api_keyin~/.config/vigil/config.yaml - Vast CLI default: Place your key in
~/.vast_api_key
For RunPod: set RUNPOD_API_KEY or pass vigil --provider runpod --api-key YOUR_KEY
CLI Options
vigil [OPTIONS]
--config, -c PATH Config file path (default: ~/.config/vigil/config.yaml)
--api-key KEY Provider API key (overrides config/env)
--provider NAME GPU provider (default: vast) [vast, runpod]
--demo Run with simulated instances (no API key or SSH needed)
--reset-hints Reset onboarding hints so they appear again
Configuration
Configuration lives at ~/.config/vigil/config.yaml. All fields are optional with sensible defaults. See config.example.yaml for the full reference.
Key Settings
# GPU cloud provider
provider: vast
# Provider API key (prefer env var for security)
api_key: null
# Seconds between API polls for instance changes
poll_interval: 30
# Local directory for persistent logs
log_dir: "~/.vigil/logs"
# SSH private key path
ssh_key_path: "~/.ssh/id_ed25519"
# Minutes of silence before stall alert fires
stall_threshold_minutes: 5
# Desktop notifications for NaN/plateau/stall alerts
desktop_notifications: true
# Webhook URL for alerts (receives JSON POST)
alert_webhook_url: null
SSH Settings
ssh_username: "root"
ssh_login_timeout: 10 # seconds
ssh_keepalive_interval: 15 # seconds (0 to disable)
reconnect_backoff_max: 60 # max seconds between reconnect attempts
Log Retention
log_retention_days: 0 # delete logs older than N days (0 = keep forever)
log_max_size_mb: 0 # max total log storage in MB (0 = unlimited)
Layout & Buffers
max_grid_columns: 3 # max columns in the instance grid
log_buffer_lines: 5000 # in-memory log buffer per instance
log_display_lines: 2000 # max lines rendered in the panel
highlight_logs: false # Rich syntax highlighting on log lines
sparkline_history: 60 # metric readings kept for sparkline rendering
Custom Metric Patterns
Metrics are extracted from log lines using regex patterns with named capture groups. The group name becomes the metric key displayed in the panel.
metric_patterns:
- 'loss[:\s=]+(?P<loss>[\d.]+)'
- 'step[:\s=]+(?P<step>\d+(?:,\d{3})*)'
- 'epoch[:\s=]+(?P<epoch>[\d./]+)'
- '(?<![a-z])lr[:\s=]+(?P<lr>[\d.e\-]+)'
- 'reward[:\s=]+(?P<reward>[\d.\-e]+)'
- 'accuracy[:\s=]+(?P<accuracy>[\d.]+)'
These are a subset of the 10 built-in defaults. Setting metric_patterns in your config replaces all defaults. To add patterns without replacing, use extra_metric_patterns:
extra_metric_patterns:
- 'val_loss[:\s=]+(?P<val_loss>[\d.]+)'
- 'perplexity[:\s=]+(?P<ppl>[\d.]+)'
Metric Semantics
Control which metrics get green/red coloring and which are treated as counters:
decrease_good: [loss, val_loss, rssm, vc, tom, ppl]
increase_good: [reward, accuracy, val_accuracy]
counters: [step, epoch]
plateau_metrics: [loss, val_loss]
Plateau Detection
plateau_window: 8 # number of readings to check
plateau_threshold: 0.0001 # stdev below this = plateau
plateau_metrics: [loss] # which metrics to watch
Webhook Alerts
Supports raw JSON, Slack Block Kit, and Discord embed formats:
alert_webhook_url: "https://hooks.slack.com/services/..."
alert_webhook_format: "slack" # "raw", "slack", or "discord"
notifications:
nan: true # NaN/Inf detected in metrics
plateau: true # Loss plateau detected
stall: true # No output for configured minutes
desktop: true # Master switch for OS desktop notifications
webhook: true # Master switch for webhook alerts
Per-Instance Overrides
instances:
12345:
log_command: "tail -f /workspace/training.log"
stall_threshold_minutes: 10
ssh_username: "trainer"
67890:
log_command: "docker logs -f training_container"
Keyboard Shortcuts
Navigation
| Key | Action |
|---|---|
1-9 |
Focus/unfocus instance panel by number |
Tab |
Cycle to next panel |
Escape |
Back / unfocus panel / close search |
? |
Toggle help overlay |
q |
Quit |
Views
| Key | Action |
|---|---|
i |
Instance manager (details, SSH commands) |
l |
Historical log viewer (with search) |
m |
Metrics comparison table (auto-refreshes) |
G |
Global search across all instances |
n |
nvidia-smi output for focused instance |
S |
Open setup wizard |
Instance Actions (focus a panel first)
| Key | Action |
|---|---|
c |
Change log command for focused instance |
r |
Force SSH reconnect |
D |
Stop/destroy instance (requires confirmation) |
Log Control (focus a panel first)
| Key | Action |
|---|---|
/ |
Search / filter logs (regex supported) |
p |
Pause / resume log rendering |
f |
Toggle auto-scroll follow mode |
How It Works
┌─────────────────────────────────────────────────────┐
│ Provider API (Vast.ai, ...) │
│ (instance discovery loop) │
└──────────────────────┬──────────────────────────────┘
│ poll every N seconds
▼
┌─────────────────────────────────────────────────────┐
│ Discovery / Reconciler │
│ adds new instances, removes terminated ones │
└──────────────────────┬──────────────────────────────┘
│
┌────────────┼────────────┐
▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│ SSH Log │ │ SSH Log │ │ SSH Log │ ... per instance
│ Stream │ │ Stream │ │ Stream │
└────┬────┘ └────┬────┘ └────┬────┘
│ │ │
▼ ▼ ▼
┌──────────────────────────────────────────────────┐
│ Metric Parser │
│ regex extraction → metric state + history │
└──────────────────────┬───────────────────────────┘
│ │ │
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────────┐
│ Log │ │ Alert │ │ TUI Panels │
│ Storage │ │ Engine │ │ (Textual) │
│ (disk) │ │ │ │ │
└──────────┘ └────┬─────┘ └──────────────┘
│
┌─────────┼──────────┐
▼ ▼ ▼
┌────────┐ ┌───────┐ ┌────────┐
│Desktop │ │Webhook│ │ Visual │
│Notif. │ │(Slack/│ │ Alert │
│ │ │Discord│ │(border)│
└────────┘ └───────┘ └────────┘
Vigil runs a discovery loop that polls your GPU provider's API at a configurable interval to find running instances. For each discovered instance, it opens a persistent SSH connection and executes a log command (a smart heuristic that finds training process stdout, recent log files, or provider-specific logs). Each line is fed through a regex-based metric parser, written to local storage (raw log + metrics JSONL sidecar), and rendered in a Textual TUI panel with sparklines and color-coded metric indicators. A background alert engine monitors each stream for stalls, NaN values, and plateaus. When instances are added or removed, the grid layout and SSH connections are reconciled automatically.
All I/O is async. SSH connections use keepalive heartbeats and reconnect with exponential backoff. Log files are buffered and flushed periodically. Notifications and webhooks are fire-and-forget to never block the UI.
Requirements
- Python 3.10+
- SSH key registered with your GPU provider (default:
~/.ssh/id_ed25519) - Dependencies: textual, asyncssh, httpx, pyyaml
Development
git clone https://github.com/vigil-gpu/vigil.git
cd vigil
python -m venv .venv && source .venv/bin/activate
pip install -e .
vigil --config config.example.yaml
Contributing
Contributions are welcome. Please see CONTRIBUTING.md for guidelines.
License
MIT. See LICENSE.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file vigil_gpu-0.2.0.tar.gz.
File metadata
- Download URL: vigil_gpu-0.2.0.tar.gz
- Upload date:
- Size: 2.5 MB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
33a04d0bf49fbf855a75036c90604fc0693d9b270cbe65d0fb3d661eb3331ded
|
|
| MD5 |
d33fe4c0e52c92103c648009cece3fcb
|
|
| BLAKE2b-256 |
54b4ffad5b18bd8795e5659c2833ab3ae83f764fd3a69d91943f83731a47ad32
|
Provenance
The following attestation bundles were made for vigil_gpu-0.2.0.tar.gz:
Publisher:
publish.yml on helpmatteo/vigil
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
vigil_gpu-0.2.0.tar.gz -
Subject digest:
33a04d0bf49fbf855a75036c90604fc0693d9b270cbe65d0fb3d661eb3331ded - Sigstore transparency entry: 1207240571
- Sigstore integration time:
-
Permalink:
helpmatteo/vigil@f447ff06d60c28bfd7982795579381669ad2ffab -
Branch / Tag:
refs/tags/v0.2.0 - Owner: https://github.com/helpmatteo
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@f447ff06d60c28bfd7982795579381669ad2ffab -
Trigger Event:
push
-
Statement type:
File details
Details for the file vigil_gpu-0.2.0-py3-none-any.whl.
File metadata
- Download URL: vigil_gpu-0.2.0-py3-none-any.whl
- Upload date:
- Size: 59.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
202a177e23639a1b505ee7385452f7f6b1b600f8b7f895715b944558673a7ada
|
|
| MD5 |
a7f0547df218d79f34fbb20f0e3ccf23
|
|
| BLAKE2b-256 |
a6341f2fe72d545883c2fa746bc560a3f8145b3f187840789212bb32064c80d2
|
Provenance
The following attestation bundles were made for vigil_gpu-0.2.0-py3-none-any.whl:
Publisher:
publish.yml on helpmatteo/vigil
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
vigil_gpu-0.2.0-py3-none-any.whl -
Subject digest:
202a177e23639a1b505ee7385452f7f6b1b600f8b7f895715b944558673a7ada - Sigstore transparency entry: 1207240671
- Sigstore integration time:
-
Permalink:
helpmatteo/vigil@f447ff06d60c28bfd7982795579381669ad2ffab -
Branch / Tag:
refs/tags/v0.2.0 - Owner: https://github.com/helpmatteo
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@f447ff06d60c28bfd7982795579381669ad2ffab -
Trigger Event:
push
-
Statement type: