Real-time TUI for monitoring cloud GPU training instances

Maintainers

helpmatteo

vigil

Real-time TUI for monitoring cloud GPU training instances.

TL;DR — pip install vigil-gpu && vigil. A single terminal command gives you live SSH log streaming from every cloud GPU instance, regex-parsed metrics with sparklines, NaN/stall/plateau alerts to desktop or Slack/Discord, nvidia-smi on demand, persistent logs that survive crashes, and a setup wizard that auto-detects your framework. All from a keyboard-driven TUI.

The Problem

Cloud GPU consoles (Vast.ai, RunPod, Lambda, etc.) provide delayed, truncated output and offer no log persistence when instances crash or are preempted. If you are running multiple GPU training jobs across rented instances, you need a tool that streams logs in real time, survives disconnections, and alerts you before a stalled run burns through your budget.

Supported Providers

Vast.ai — full support (auto-discovery, instance management, cost tracking)
RunPod — full support (auto-discovery via GraphQL, instance management, cost tracking)

Features

Live Monitoring

Auto-discovers all running instances via provider APIs
Streams logs in real time over SSH with automatic reconnection and exponential backoff
Dynamic grid layout that adapts columns to instance count (1/2/3 columns, configurable max)
Per-instance panels showing GPU type, count, label, and hourly cost
Aggregate cost display across all instances in the subtitle bar
Focus mode: select a panel to expand it to full width
Pause/resume log rendering and toggle auto-scroll per panel

Metric Parsing

Configurable regex patterns with named capture groups
10 built-in patterns: loss, step, epoch, learning rate, reward, accuracy, RSSM, VC, TOM, actor
Inline sparkline charts (Unicode block elements) showing metric trends over time
Color-coded direction indicators (green/red based on whether up or down is desirable)
NaN/Inf detection with immediate alerting
Loss plateau detection using rolling standard deviation
Keyword pre-filtering for efficient pattern matching at scale
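
As an illustration of the sparkline idea above, here is a minimal sketch that maps a metric history onto the eight Unicode block elements. The function name and the scaling rule (min-max over the visible history, flat series drawn at the lowest level) are assumptions made for this example, not vigil's actual rendering code.

```python
BLOCKS = "▁▂▃▄▅▆▇█"  # Unicode block elements, lowest to highest


def sparkline(values: list[float]) -> str:
    """Map each value onto one of eight block characters (hypothetical helper)."""
    if not values:
        return ""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A flat series has no range to scale; draw it at the lowest level.
        return BLOCKS[0] * len(values)
    top = len(BLOCKS) - 1
    # Scale into 0..7 and round to the nearest block.
    return "".join(BLOCKS[int((v - lo) * top / (hi - lo) + 0.5)] for v in values)


print(sparkline([2.1, 1.7, 1.2, 0.9, 0.7, 0.6, 0.6]))
```

Scaling against the visible window keeps small late-stage improvements visible, at the cost of the chart rescaling as old readings drop out.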

Framework Presets

HuggingFace Transformers: eval_loss, train_loss, learning_rate, eval_accuracy, grad_norm, samples_sec
PyTorch Lightning: val_loss, train_loss, val_acc, train_acc
DreamerV3 / RL: ep_return, episode, fps, policy_loss, value_loss, model_loss
Auto-detection from running instance logs (samples 5 seconds of output to identify framework)
Apply via setup wizard or configure manually

Alerts

Stall detection when no output is received for a configurable duration
NaN/Inf alerts with visual panel highlighting (red border)
Loss plateau alerts using sliding-window standard deviation (configurable window and threshold)
Desktop notifications on macOS (osascript) and Linux (notify-send)
Webhook integration with Slack (Block Kit), Discord (embeds), and raw JSON formats
Per-alert-type notification control (NaN, plateau, stall independently togglable)
Alerts auto-clear when values return to normal
All notifications are best-effort and non-blocking (never disrupt the TUI)
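
Stall detection and auto-clearing can be pictured as a timestamp comparison. The sketch below is an assumption about the general technique, not vigil's alert engine: the class name and the injectable `clock` parameter are hypothetical, and the default of 5 minutes mirrors `stall_threshold_minutes`.

```python
import time


class StallDetector:
    """Flags a stream as stalled when no line has arrived for threshold_minutes."""

    def __init__(self, threshold_minutes: float = 5, clock=time.monotonic):
        self.threshold_s = threshold_minutes * 60
        self.clock = clock              # injectable so the logic can be tested
        self.last_output = clock()

    def on_line(self) -> None:
        # Any new output resets the timer, which also clears an active stall.
        self.last_output = self.clock()

    def is_stalled(self) -> bool:
        return self.clock() - self.last_output >= self.threshold_s


detector = StallDetector(threshold_minutes=5)
detector.on_line()
print(detector.is_stalled())
```

A monotonic clock is used so that wall-clock adjustments on the monitoring machine cannot trigger or hide a stall.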

Instance Management

Instance manager view with GPU details, costs, connection status, and SSH commands
Live nvidia-smi output over SSH (with manual refresh)
Change the log command per instance at runtime (persisted to config)
Stop/destroy instances directly from the TUI (with confirmation dialog showing instance details)
Force SSH reconnect with manual trigger
Configurable SSH username (global and per-instance)
SSH keepalive and connection timeout settings

Log Persistence

All logs written to ~/.vigil/logs/ with per-instance directories
Timestamped raw log files with atomic latest.log symlink
Metrics JSONL sidecar files for post-hoc analysis
Historical log browser with session list and search
Crash-resilient: logs are buffered and flushed periodically (every 10s) and on disconnect
Configurable log retention by age (days) and total size (MB)
Automatic hourly cleanup of expired log sessions
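
The "atomic latest.log symlink" item is commonly implemented by creating the new link under a temporary name and renaming it over the old one. The sketch below shows that general POSIX technique; the function name, temporary-name scheme, and session file names are assumptions for illustration, not vigil's code.

```python
import os
import tempfile


def point_latest_at(log_dir: str, target_name: str, link_name: str = "latest.log") -> None:
    """Repoint log_dir/link_name at target_name with no moment where the link is missing."""
    link_path = os.path.join(log_dir, link_name)
    tmp_path = f"{link_path}.tmp.{os.getpid()}"
    os.symlink(target_name, tmp_path)  # relative link, built under a temporary name
    os.replace(tmp_path, link_path)    # rename() swaps it in atomically on POSIX


# Usage with hypothetical session file names.
with tempfile.TemporaryDirectory() as log_dir:
    point_latest_at(log_dir, "2026-04-01T10-00-00.log")
    point_latest_at(log_dir, "2026-04-01T12-30-00.log")
    print(os.readlink(os.path.join(log_dir, "latest.log")))
```

Removing the old link and then creating a new one would leave a short window in which a reader such as `tail -F latest.log` sees no file; the rename avoids that.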

Search

Inline log filtering with / (regex-supported, debounced)
Global cross-instance search with G (searches all in-memory log buffers, up to 100 results per instance)
Historical log search in the log viewer
Graceful regex fallback to substring matching on invalid patterns
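
The regex-to-substring fallback can be sketched with the standard `re` module. `build_matcher` is a hypothetical name, and case-sensitive matching is an assumption; the source does not say how vigil treats case.

```python
import re
from typing import Callable


def build_matcher(query: str) -> Callable[[str], bool]:
    """Return a line predicate: regex search if the query compiles, else substring."""
    try:
        pattern = re.compile(query)
    except re.error:
        # Invalid pattern (e.g. an unclosed "["): treat the query as literal text.
        return lambda line: query in line
    return lambda line: pattern.search(line) is not None


matches = build_matcher("loss[")  # not a valid regex, so substring matching is used
print(matches("val loss[0] = 0.31"))
```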

Metrics Comparison

Full-screen metrics table comparing all instances side by side
Dynamic columns based on collected metrics across all instances
Sparklines and direction indicators per metric per instance
Auto-refreshes every 2 seconds with manual refresh option

Setup Wizard

Guided 5-step onboarding wizard for first-time setup
Step 1: API key input with live validation (tests connectivity and shows instance count)
Step 2: SSH key selection with auto-detection of common key paths
Step 3: Framework preset selection with auto-detection from running instances
Step 4: Alert and notification configuration (desktop, stall threshold, webhooks)
Step 5: Summary and confirmation before launch
Progress bar and back/skip/next navigation
Accessible anytime with S (Shift+S)

Contextual Hints

Non-intrusive hint bar with sequenced tips for new users
Covers: focusing panels, searching logs, viewing help, comparing metrics
Auto-dismisses after 30 seconds, respects minimum instance requirements
Tracks completion in persistent state; reset with --reset-hints

Quick Start

pip install vigil-gpu
vigil

On first run, the setup wizard guides you through configuration. Otherwise, provide your API key by any of these methods (checked in order):

CLI flag: vigil --api-key YOUR_KEY
Environment variable: VIGIL_API_KEY=YOUR_KEY vigil (also accepts VAST_API_KEY / RUNPOD_API_KEY)
Config file: Set api_key in ~/.config/vigil/config.yaml
Vast CLI default: Place your key in ~/.vast_api_key

For RunPod: set RUNPOD_API_KEY or pass vigil --provider runpod --api-key YOUR_KEY
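
The lookup order above amounts to a first-match-wins chain. The sketch below illustrates it with a hypothetical `resolve_api_key` function; the relative order of the three environment variables is an assumption, since the list only says all three are accepted.

```python
import os
from pathlib import Path


def resolve_api_key(cli_key, config, env=os.environ,
                    vast_key_file=Path("~/.vast_api_key")):
    """Return the first key found: CLI flag, env var, config file, Vast CLI file."""
    if cli_key:
        return cli_key
    # Assumed order among the accepted environment variables.
    for name in ("VIGIL_API_KEY", "VAST_API_KEY", "RUNPOD_API_KEY"):
        if env.get(name):
            return env[name]
    if config.get("api_key"):
        return config["api_key"]
    path = Path(vast_key_file).expanduser()
    if path.is_file():
        return path.read_text().strip() or None
    return None


print(resolve_api_key("from-cli", {"api_key": "from-config"}, env={}))
```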

CLI Options

vigil [OPTIONS]

  --config, -c PATH    Config file path (default: ~/.config/vigil/config.yaml)
  --api-key KEY        Provider API key (overrides config/env)
  --provider NAME      GPU provider (default: vast) [vast, runpod]
  --demo               Run with simulated instances (no API key or SSH needed)
  --reset-hints        Reset onboarding hints so they appear again

Configuration

Configuration lives at ~/.config/vigil/config.yaml. All fields are optional with sensible defaults. See config.example.yaml for the full reference.

Key Settings

# GPU cloud provider
provider: vast

# Provider API key (prefer env var for security)
api_key: null

# Seconds between API polls for instance changes
poll_interval: 30

# Local directory for persistent logs
log_dir: "~/.vigil/logs"

# SSH private key path
ssh_key_path: "~/.ssh/id_ed25519"

# Minutes of silence before stall alert fires
stall_threshold_minutes: 5

# Desktop notifications for NaN/plateau/stall alerts
desktop_notifications: true

# Webhook URL for alerts (receives JSON POST)
alert_webhook_url: null

SSH Settings

ssh_username: "root"
ssh_login_timeout: 10        # seconds
ssh_keepalive_interval: 15   # seconds (0 to disable)
reconnect_backoff_max: 60    # max seconds between reconnect attempts
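
To show how `reconnect_backoff_max` bounds the retry schedule, here is a minimal sketch of capped exponential backoff. The 1-second base delay, the doubling factor, and the absence of jitter are assumptions; only the 60-second cap comes from the setting above.

```python
import itertools


def backoff_delays(base: float = 1.0, maximum: float = 60.0):
    """Yield reconnect delays: base, 2*base, 4*base, ... capped at maximum."""
    delay = base
    while True:
        yield min(delay, maximum)
        delay = min(delay * 2, maximum)


print(list(itertools.islice(backoff_delays(), 8)))
# → [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```

After a successful reconnect, a caller would typically start a fresh generator so the next outage begins again at the base delay.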

Log Retention

log_retention_days: 0   # delete logs older than N days (0 = keep forever)
log_max_size_mb: 0      # max total log storage in MB (0 = unlimited)
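
The two retention limits can be combined as "drop everything past the age limit, then drop the oldest sessions until the total fits". The sketch below assumes that policy, along with 1 MB = 1024 × 1024 bytes; the `Session` type and function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Session:
    path: str
    mtime: float      # last modification time, seconds since epoch
    size_bytes: int


def sessions_to_delete(sessions, now, retention_days=0, max_size_mb=0):
    """Pick sessions to remove; 0 disables a limit, as in the config above."""
    doomed = []
    kept = sorted(sessions, key=lambda s: s.mtime, reverse=True)  # newest first
    if retention_days > 0:
        cutoff = now - retention_days * 86400
        doomed += [s for s in kept if s.mtime < cutoff]
        kept = [s for s in kept if s.mtime >= cutoff]
    if max_size_mb > 0:
        budget = max_size_mb * 1024 * 1024
        total = sum(s.size_bytes for s in kept)
        while kept and total > budget:
            oldest = kept.pop()  # list is newest-first, so the tail is the oldest
            total -= oldest.size_bytes
            doomed.append(oldest)
    return doomed
```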

Layout & Buffers

max_grid_columns: 3       # max columns in the instance grid
log_buffer_lines: 5000    # in-memory log buffer per instance
log_display_lines: 2000   # max lines rendered in the panel
highlight_logs: false     # Rich syntax highlighting on log lines
sparkline_history: 60     # metric readings kept for sparkline rendering
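
The relationship between `log_buffer_lines` and `log_display_lines` can be pictured as a bounded buffer with a smaller rendered tail. This is a sketch of the idea using `collections.deque`; the `LogBuffer` class is hypothetical and not vigil's internal data structure.

```python
from collections import deque
from itertools import islice


class LogBuffer:
    """Keeps the newest buffer_lines lines; renders only the newest display_lines."""

    def __init__(self, buffer_lines: int = 5000, display_lines: int = 2000):
        self.lines = deque(maxlen=buffer_lines)  # oldest lines fall off automatically
        self.display_lines = display_lines

    def append(self, line: str) -> None:
        self.lines.append(line)

    def visible(self) -> list[str]:
        start = max(0, len(self.lines) - self.display_lines)
        return list(islice(self.lines, start, None))


buf = LogBuffer(buffer_lines=5, display_lines=2)
for i in range(8):
    buf.append(f"line {i}")
print(buf.visible())
```

Keeping more lines in memory than are rendered lets search cover older output without the panel paying to draw it.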

Custom Metric Patterns

Metrics are extracted from log lines using regex patterns with named capture groups. The group name becomes the metric key displayed in the panel.

metric_patterns:
  - 'loss[:\s=]+(?P<loss>[\d.]+)'
  - 'step[:\s=]+(?P<step>\d+(?:,\d{3})*)'
  - 'epoch[:\s=]+(?P<epoch>[\d./]+)'
  - '(?<![a-z])lr[:\s=]+(?P<lr>[\d.e\-]+)'
  - 'reward[:\s=]+(?P<reward>[\d.\-e]+)'
  - 'accuracy[:\s=]+(?P<accuracy>[\d.]+)'

These are a subset of the 10 built-in defaults. Setting metric_patterns in your config replaces all defaults. To add patterns while keeping the defaults, use extra_metric_patterns:

extra_metric_patterns:
  - 'val_loss[:\s=]+(?P<val_loss>[\d.]+)'
  - 'perplexity[:\s=]+(?P<ppl>[\d.]+)'
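
To see how named capture groups turn a log line into metric values, the sketch below applies three of the patterns shown above with Python's `re` module. The `parse_metrics` function, the conversion to `float`, and case-sensitive matching are assumptions made for this example.

```python
import re

# Three of the patterns listed above, compiled once.
PATTERNS = [re.compile(p) for p in (
    r"loss[:\s=]+(?P<loss>[\d.]+)",
    r"step[:\s=]+(?P<step>\d+(?:,\d{3})*)",
    r"(?<![a-z])lr[:\s=]+(?P<lr>[\d.e\-]+)",
)]


def parse_metrics(line: str) -> dict[str, float]:
    """Return {group name: value} for every pattern that matches the line."""
    metrics = {}
    for pattern in PATTERNS:
        match = pattern.search(line)
        if match is None:
            continue
        for name, raw in match.groupdict().items():
            if raw is None:
                continue
            try:
                metrics[name] = float(raw.replace(",", ""))  # "1,200" -> 1200.0
            except ValueError:
                pass  # text such as "1.2.3" fits the character class but is not a number
    return metrics


print(parse_metrics("step: 1,200 | loss=0.532 | lr: 3e-4"))
```

Note that the `loss` pattern also matches inside a longer name such as `val_loss`, so a more specific pattern for that key may need a lookbehind like the one used for `lr`.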

Metric Semantics

Control which metrics get green/red coloring and which are treated as counters:

decrease_good: [loss, val_loss, rssm, vc, tom, ppl]
increase_good: [reward, accuracy, val_accuracy]
counters: [step, epoch]
plateau_metrics: [loss, val_loss]

Plateau Detection

plateau_window: 8          # number of readings to check
plateau_threshold: 0.0001  # stdev below this = plateau
plateau_metrics: [loss]    # which metrics to watch
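
A sliding-window standard deviation check along these lines can be sketched with the standard library. Whether vigil uses the population or the sample standard deviation is not stated; this example assumes the population form (`statistics.pstdev`), and the class name is hypothetical.

```python
from collections import deque
from statistics import pstdev


class PlateauDetector:
    """Reports a plateau when the last `window` readings barely vary."""

    def __init__(self, window: int = 8, threshold: float = 0.0001):
        self.readings = deque(maxlen=window)
        self.threshold = threshold

    def update(self, value: float) -> bool:
        self.readings.append(value)
        if len(self.readings) < self.readings.maxlen:
            return False  # not enough history yet
        return pstdev(self.readings) < self.threshold


detector = PlateauDetector(window=4, threshold=0.01)
for loss in [1.0, 0.8, 0.6, 0.5, 0.5, 0.5, 0.5, 0.2]:
    print(loss, detector.update(loss))
```

Because the window slides, the result returns to `False` as soon as a reading moves again, which matches the auto-clearing behavior described under Alerts.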

Webhook Alerts

Supports raw JSON, Slack Block Kit, and Discord embed formats:

alert_webhook_url: "https://hooks.slack.com/services/..."
alert_webhook_format: "slack"  # "raw", "slack", or "discord"

notifications:
  nan: true        # NaN/Inf detected in metrics
  plateau: true    # Loss plateau detected
  stall: true      # No output for configured minutes
  desktop: true    # Master switch for OS desktop notifications
  webhook: true    # Master switch for webhook alerts
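
The three formats differ only in the JSON body that gets POSTed. The sketch below builds one payload per format: the Slack and Discord shapes follow those services' documented Block Kit and embed structures, while the field names of the "raw" payload and the `build_payload` function itself are assumptions, since vigil's actual raw schema is not shown here.

```python
import json


def build_payload(fmt: str, title: str, message: str) -> dict:
    """Build a webhook body for "slack", "discord", or "raw"."""
    if fmt == "slack":
        return {
            "text": f"{title}: {message}",  # plain-text fallback for notifications
            "blocks": [
                {"type": "header", "text": {"type": "plain_text", "text": title}},
                {"type": "section", "text": {"type": "mrkdwn", "text": message}},
            ],
        }
    if fmt == "discord":
        # Embed color is an integer RGB value; red is an arbitrary choice here.
        return {"embeds": [{"title": title, "description": message, "color": 0xE74C3C}]}
    if fmt == "raw":
        return {"title": title, "message": message}  # hypothetical field names
    raise ValueError(f"unknown webhook format: {fmt!r}")


print(json.dumps(build_payload("discord", "NaN detected", "instance 12345: loss is NaN")))
```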

Per-Instance Overrides

instances:
  12345:
    log_command: "tail -f /workspace/training.log"
    stall_threshold_minutes: 10
    ssh_username: "trainer"
  67890:
    log_command: "docker logs -f training_container"
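
Per-instance overrides can be thought of as a two-level lookup: the instance entry first, then the global setting. The sketch below mirrors the YAML above as a Python dict; the `effective` helper is hypothetical, and integer instance IDs are an assumption.

```python
CONFIG = {
    "stall_threshold_minutes": 5,
    "ssh_username": "root",
    "instances": {
        12345: {
            "log_command": "tail -f /workspace/training.log",
            "stall_threshold_minutes": 10,
            "ssh_username": "trainer",
        },
        67890: {"log_command": "docker logs -f training_container"},
    },
}


def effective(config: dict, instance_id, key: str, default=None):
    """Return the per-instance value for key if set, else the global one."""
    override = config.get("instances", {}).get(instance_id, {})
    if key in override:
        return override[key]
    return config.get(key, default)


print(effective(CONFIG, 12345, "ssh_username"))
print(effective(CONFIG, 67890, "ssh_username"))
```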

Keyboard Shortcuts

Navigation

Key	Action
`1`-`9`	Focus/unfocus instance panel by number
`Tab`	Cycle to next panel
`Escape`	Back / unfocus panel / close search
`?`	Toggle help overlay
`q`	Quit

Views

Key	Action
`i`	Instance manager (details, SSH commands)
`l`	Historical log viewer (with search)
`m`	Metrics comparison table (auto-refreshes)
`G`	Global search across all instances
`n`	nvidia-smi output for focused instance
`S`	Open setup wizard

Instance Actions (focus a panel first)

Key	Action
`c`	Change log command for focused instance
`r`	Force SSH reconnect
`D`	Stop/destroy instance (requires confirmation)

Log Control (focus a panel first)

Key	Action
`/`	Search / filter logs (regex supported)
`p`	Pause / resume log rendering
`f`	Toggle auto-scroll follow mode

How It Works

┌─────────────────────────────────────────────────────┐
│              Provider API (Vast.ai, ...)            │
│            (instance discovery loop)                │
└──────────────────────┬──────────────────────────────┘
                       │ poll every N seconds
                       ▼
┌─────────────────────────────────────────────────────┐
│              Discovery / Reconciler                 │
│     adds new instances, removes terminated ones     │
└──────────────────────┬──────────────────────────────┘
                       │
          ┌────────────┼────────────┐
          ▼            ▼            ▼
   ┌─────────┐  ┌─────────┐  ┌─────────┐
   │ SSH Log │  │ SSH Log │  │ SSH Log │  ... per instance
   │ Stream  │  │ Stream  │  │ Stream  │
   └────┬────┘  └────┬────┘  └────┬────┘
        │             │            │
        ▼             ▼            ▼
   ┌──────────────────────────────────────────────────┐
   │              Metric Parser                       │
   │   regex extraction → metric state + history      │
   └──────────────────────┬───────────────────────────┘
        │                 │                │
        ▼                 ▼                ▼
   ┌──────────┐    ┌──────────┐    ┌──────────────┐
   │ Log      │    │ Alert    │    │ TUI Panels   │
   │ Storage  │    │ Engine   │    │ (Textual)    │
   │ (disk)   │    │          │    │              │
   └──────────┘    └────┬─────┘    └──────────────┘
                        │
              ┌─────────┼──────────┐
              ▼         ▼          ▼
         ┌────────┐ ┌───────┐ ┌────────┐
         │Desktop │ │Webhook│ │ Visual │
         │Notif.  │ │(Slack/│ │ Alert  │
         │        │ │Discord│ │(border)│
         └────────┘ └───────┘ └────────┘

Vigil runs a discovery loop that polls your GPU provider's API at a configurable interval (poll_interval) to find running instances. For each discovered instance, it opens a persistent SSH connection and runs a log command; unless you override it, this is a heuristic that looks for the training process's stdout, recent log files, or provider-specific logs. Each line is fed through a regex-based metric parser, written to local storage (raw log plus metrics JSONL sidecar), and rendered in a Textual TUI panel with sparklines and color-coded metric indicators. A background alert engine monitors each stream for stalls, NaN values, and plateaus. When instances are added or removed, the grid layout and SSH connections are reconciled automatically.

All I/O is async. SSH connections use keepalive heartbeats and reconnect with exponential backoff. Log files are buffered and flushed periodically. Notifications and webhooks are fire-and-forget, so they never block the UI.
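
The reconciliation step in the diagram reduces to two set differences between the instances currently being streamed and those the provider reports. This is a minimal sketch of that idea; the `reconcile` function and the example IDs are hypothetical.

```python
def reconcile(active: set, discovered: set) -> tuple[set, set]:
    """Compare tracked instance IDs with the latest API poll.

    Returns (to_start, to_stop): IDs that need a new SSH stream and panel,
    and IDs whose stream and panel should be torn down.
    """
    to_start = discovered - active
    to_stop = active - discovered
    return to_start, to_stop


active = {12345, 67890}
discovered = {67890, 24680}  # 12345 was terminated, 24680 is new
to_start, to_stop = reconcile(active, discovered)
print(sorted(to_start), sorted(to_stop))
```

Instances present in both sets are left untouched, so an unchanged poll result causes no reconnects.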

Requirements

Python 3.10+
SSH key registered with your GPU provider (default: ~/.ssh/id_ed25519)
Dependencies: textual, asyncssh, httpx, pyyaml

Development

git clone https://github.com/helpmatteo/vigil.git
cd vigil
python -m venv .venv && source .venv/bin/activate
pip install -e .
vigil --config config.example.yaml

Contributing

Contributions are welcome. Please see CONTRIBUTING.md for guidelines.

License

MIT. See LICENSE.

Release history

0.2.0 (this version): Apr 1, 2026
0.1.0: Apr 1, 2026

Download files

Source Distribution

vigil_gpu-0.2.0.tar.gz (2.5 MB), uploaded Apr 1, 2026, Source

Built Distribution

vigil_gpu-0.2.0-py3-none-any.whl (59.5 kB), uploaded Apr 1, 2026, Python 3

File details

Details for the file vigil_gpu-0.2.0.tar.gz.

File metadata

Download URL: vigil_gpu-0.2.0.tar.gz
Upload date: Apr 1, 2026
Size: 2.5 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for vigil_gpu-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`33a04d0bf49fbf855a75036c90604fc0693d9b270cbe65d0fb3d661eb3331ded`
MD5	`d33fe4c0e52c92103c648009cece3fcb`
BLAKE2b-256	`54b4ffad5b18bd8795e5659c2833ab3ae83f764fd3a69d91943f83731a47ad32`

Provenance

The following attestation bundles were made for vigil_gpu-0.2.0.tar.gz:

Publisher: publish.yml on helpmatteo/vigil

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: vigil_gpu-0.2.0.tar.gz
- Subject digest: 33a04d0bf49fbf855a75036c90604fc0693d9b270cbe65d0fb3d661eb3331ded
- Sigstore transparency entry: 1207240571
- Sigstore integration time: Apr 1, 2026
Source repository:
- Permalink: helpmatteo/vigil@f447ff06d60c28bfd7982795579381669ad2ffab
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/helpmatteo
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@f447ff06d60c28bfd7982795579381669ad2ffab
- Trigger Event: push

File details

Details for the file vigil_gpu-0.2.0-py3-none-any.whl.

File metadata

Download URL: vigil_gpu-0.2.0-py3-none-any.whl
Upload date: Apr 1, 2026
Size: 59.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for vigil_gpu-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`202a177e23639a1b505ee7385452f7f6b1b600f8b7f895715b944558673a7ada`
MD5	`a7f0547df218d79f34fbb20f0e3ccf23`
BLAKE2b-256	`a6341f2fe72d545883c2fa746bc560a3f8145b3f187840789212bb32064c80d2`

Provenance

The following attestation bundles were made for vigil_gpu-0.2.0-py3-none-any.whl:

Publisher: publish.yml on helpmatteo/vigil

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: vigil_gpu-0.2.0-py3-none-any.whl
- Subject digest: 202a177e23639a1b505ee7385452f7f6b1b600f8b7f895715b944558673a7ada
- Sigstore transparency entry: 1207240671
- Sigstore integration time: Apr 1, 2026
Source repository:
- Permalink: helpmatteo/vigil@f447ff06d60c28bfd7982795579381669ad2ffab
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/helpmatteo
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@f447ff06d60c28bfd7982795579381669ad2ffab
- Trigger Event: push
