Skip to main content

Always-on Windows hardware watchdog for GPU rigs: kills runaway workloads (AI training, renders, sims) before thermal or memory runaway damages your machine. Monitors NVIDIA GPU/VRAM temps, RAM, pagefile, CPU temp and PSU rails.

Project description

AT-Field — Windows GPU watchdog that kills runaway workloads before they cook your rig

CI Release PyPI License: MIT Platform: Windows 10/11 Python 3.10+

Always-on Windows watchdog for your GPU rig. It monitors NVIDIA GPU + VRAM temperatures, GPU/VRAM/RAM usage, pagefile/commit charge, CPU package temp, and PSU rail voltages — and when a workload runs away, it kills the offending process tree before thermal or memory runaway crashes (or damages) your machine. Runs as a Windows service. Sustained-window thresholds ignore harmless load spikes.

Built for AI rigs — useful for any heavy GPU/CPU workload. Overnight model training is the headline case (and where the process-tree kill shines), but the same protection covers 3D/video renders, scientific/CUDA compute, overclock stability testing, or simply keeping an expensive GPU alive and your machine responsive. The watchdog runs headless; an optional tray dashboard gives you live sparklines and a pause toggle.

In Neon Genesis Evangelion, an AT Field is an absolute defensive barrier that prevents catastrophic damage to a high-power system. Here it's a backronym — Absolute Thermal-and-memory Field — a process-aware Windows service that intercepts the jobs trying to melt your rig.

Status: v0.4.2 (pre-release). pip install atfield for the headless watchdog, or grab the one-click installer for the watchdog + tray dashboard in a single .exe. See CHANGELOG.md and docs/sensors.md for the full layered sensor strategy.

Two pieces, one tool

┌────────────────────────────────┐          ┌─────────────────────────────────┐
│  AT-Field Watchdog Service     │          │  AT-Field Tray + Dashboard      │
│  Python, NSSM, LocalSystem     │ ◄──HTTP──│  Tauri (Rust + React),          │
│  Always-on, no UI              │  loopback │  user-mode, optional            │
│  src/atfield/                  │   :8765   │  at-field-tray/                 │
└────────────────────────────────┘          └─────────────────────────────────┘

The service (this repo, src/atfield/) is the engine — it does all the sensing, deciding, and killing. It runs as a Windows Service so it stays alive without a logged-in user.

The tray app (at-field-tray/) is a thin lens over the service. It puts a status dot in your system tray (green / yellow / red / gray), pops open a dashboard with live signal sparklines and recent events, and gives you a one-click pause toggle. It's optional — the service works fine headless.

Project goals

These are the north stars every design decision is measured against:

  1. One-command install. pip install atfield && atf install should be the entire setup. No manual NSSM download, no manual sensor-daemon install, no editing service definitions in services.msc.
  2. Works on most setups out of the box. Any Windows 10/11 box with stock Python 3.10+ and an NVIDIA GPU should get useful protection immediately. Single-GPU, multi-GPU, AMD-CPU, Intel-CPU — capability is detected at startup; rules whose sensors aren't available are auto-disabled with a clear log message rather than failing the service.
  3. Zero config for the common case. The shipped defaults protect a typical AI rig without the user opening config.toml. Configuration exists for power users, not as a prerequisite. Adding new hardware support must not require existing users to migrate their config.

Why?

If any of these are your search query, you're in the right place:

  • "earlyoom for Windows" / "Windows OOM killer for runaway processes"
  • "automatic process killer when GPU VRAM temperature too high"
  • "NVIDIA RTX VRAM thermal protection sustained threshold"
  • "kill python process when RAM exceeds threshold Windows service"
  • "PyTorch / accelerate / torchrun killed my machine again"
  • "how do I protect dual-GPU AI rig from runaway training jobs"
  • "shut down Blender / render when GPU overheats Windows"
  • "GPU thermal protection auto-kill for rendering / mining / overclock"
  • "warn me before a long job crashes my workstation overnight"

Existing tools either monitor without acting (System-Resource-Monitor), throttle a single VRAM signal (VRAM-Guard, VRAM Shield), or are manual primitives (killall). AT-Field combines multi-signal sustained triggers, process-tree-aware Python targeting, and true Windows Service deployment.

What it does

  • Watches every signal that matters: per-GPU core + memory junction temp, VRAM used %, system RAM %, pagefile / commit charge, CPU package temp. v0.3 adds PSU rail voltages (+12V / +5V / +3.3V / VCore) when LibreHardwareMonitor enumerates them — useful for catching voltage sag patterns that correlate with NVIDIA TDR / Kernel-Power 41 events on high-transient cards.
  • Sustained-window logic, not instantaneous: a 2-second spike during model warmup never triggers. An actual sustained problem triggers within ~15 seconds.
  • Process-tree-aware kill: when a runaway job is detected, walks the parent chain past known AI launchers (torchrun, accelerate, deepspeed, mpiexec, ray, jupyter) to find the dispatcher, then terminates the whole tree. Self-healing workers can't respawn.
  • Rolling forensic buffer (v0.3): every per-tick sample is flushed to forensics.jsonl every 5 seconds and rotated on service start, so the next BSOD / power loss / hard reboot doesn't take pre-crash signal history with it. atf forensics --include-prev --since 10m reads back the last few minutes regardless of whether the watchdog survived. Append-only JSONL — the only format guaranteed partially readable after a power loss.
  • Layered sensor coverage (v0.3): NVML for NVIDIA, ROCm-SMI for AMD, psutil for system signals, and bundled LibreHardwareMonitor for VRAM-junction temp + CPU temp + voltages. LHM is read headlessly via its library — a tiny bundled atfield-sensors.exe loads LibreHardwareMonitorLib.dll and streams readings to the service — so there's no web server, port, URL ACL, or GUI to break. Full confidence matrix per signal × hardware × OS combo in docs/sensors.md.
  • Runs as a Windows Service (via NSSM) under LocalSystem — works without an interactive login, survives reboots, no tray icon required.
  • Capability-negotiated: at startup, every collector probes its source. Anything missing (no LHM? AMD GPU? CPU sensor doesn't expose package temp?) downgrades to "rule disabled, with a clear log line" — never to "watchdog crashed" or "rule silently never fires".
  • Audit trail: every action lands in %ProgramData%\ATField\events.jsonl with full process tree, signal values that triggered the rule, and rule name.
  • Safe by default: malformed config → observe-only mode (kills demoted to log entries), never kills the service's own PID, never_kill_names filter for explorer.exe, services.exe, etc.
  • Cheap to leave running: measured ~0.1–0.3% of one CPU core, ~55–85 MB RAM (bounded — fixed-size history rings), and ~40–90 MB/day of auto-rotating forensic log (hard-capped 250 MB). GPU telemetry is read-only NVML — the same counters nvidia-smi reads, so it doesn't slow your training. Electricity works out to **$1–4/year**. Full data-backed breakdown with reproducible benchmarks in docs/footprint.md.

Install

Recommended — pip (no installer, no SmartScreen prompt)

If you have Python 3.10+ (most AI/dev boxes do), this is the fastest, friction-free path — no unsigned-installer SmartScreen warning, just a package install:

pip install atfield        # or: pipx install atfield
atf install                # run once from an elevated PowerShell — registers the service
atf status                 # confirm it's armed

atf install downloads NSSM into %ProgramData%\ATField\, registers the LocalSystem auto-start Windows service, builds the headless sensor helper, drops a starter config.toml, and starts the service. Existing config is preserved on reinstall. Want VRAM-junction / CPU-package temps too? Run atf install-lhm first to fetch the LibreHardwareMonitor DLLs.

This installs the headless watchdog + CLI — the engine that does all the work. For the live tray dashboard, use the one-click installer below.

Full experience — one-click installer (watchdog + tray dashboard)

  1. Download AT-Field_x64-setup.exe from the latest release.
  2. Double-click it. Because the binary isn't code-signed yet, Windows SmartScreen may say "Windows protected your PC" — click More info → Run anyway. (Signing is on the v1.0 roadmap; the pip route above avoids this entirely.)
  3. Accept the single UAC prompt. The installer needs admin once to register the watchdog as a Windows service; that one consent covers the whole setup.
  4. Done. In its post-install step the installer registers the AT-Field Watchdog service (auto-start, LocalSystem), bundles the sensor helper + LibreHardwareMonitor DLLs, and drops a starter config.toml. A tray icon appears; click it for the dashboard.

No Python, no PATH, no separate elevated PowerShell. Step-by-step install, verification, and troubleshooting for a fresh machine live in docs/install.md.

LibreHardwareMonitor — bundled, read headlessly (v0.3+)

The board-level sensors (VRAM temp, CPU package temp, PSU rail voltages) come from LibreHardwareMonitor, but AT-Field reads it through its library, not its GUI web server. The bundled installer ships LibreHardwareMonitorLib.dll plus a tiny atfield-sensors.exe helper that loads the DLL and streams readings to the service over stdout — there's no web server, port, URL ACL, or GUI to misconfigure or break. (This replaced the fragile HTTP transport that a v0.9.4 → v0.9.6 LHM bump silently broke in v0.2.)

If you installed via pip install atfield instead of the bundled .exe, run atf install-lhm to fetch the LHM DLLs into %ProgramData%\ATField\; atf install then builds atfield-sensors.exe next to them using the in-box .NET Framework compiler (no extra toolchain needed). Set ATFIELD_SENSOR_EXE to override the helper location.

For all the gory details — confidence matrix per signal × hardware combo, and why we read LibreHardwareMonitor through its library instead of its web server — see docs/sensors.md.

Usage

atf status          # service health, working signal map, disabled rules
atf inputs          # one-shot probe + sample dump (use this to verify setup)
atf tail            # follow events.jsonl
atf pause 30m       # suspend kill actions for 30 minutes
atf unpause
atf test-kill <PID> # dry-run the kill walk-up against a real PID
atf run             # foreground run (for debugging without NSSM)
atf forensics       # read forensics.jsonl (rolling crash buffer); --include-prev pulls last run
atf install-lhm     # download LibreHardwareMonitor v0.9.6 into %ProgramData%\ATField\
atf uninstall

atf inputs is the verification tool — run it after installing to confirm which sensors are visible and which rules will be active:

$ atf inputs
system: OK
  reason: psutil + Win32 GlobalMemoryStatusEx OK
    system.commit_percent = 11.681 percent
    system.ram_used_percent = 18.600 percent

nvml: OK
  reason: NVML driver 596.36, 2 GPU(s): NVIDIA GeForce RTX 5090, NVIDIA GeForce RTX 5090
    gpu.0.core_temp_c = 38.000 celsius
    gpu.0.vram_used_percent = 8.891 percent
    gpu.1.core_temp_c = 35.000 celsius
    ...

lhmlib: OK
  reason: atfield-sensors.exe streaming 6 sensor(s) via LibreHardwareMonitorLib
    gpu.0.mem_junction_temp_c = 52.000 celsius
    system.cpu_package_temp_c = 41.000 celsius

(If lhmlib shows unavailable, the helper binary is missing — build it with scripts\build_helper.ps1 or set ATFIELD_SENSOR_EXE. The NVML and system collectors still protect you in the meantime.)

Configuration

Defaults are conservative and protect a typical AI rig with no edits. To customize, see scripts/config.example.toml. The installer drops this in %ProgramData%\ATField\config.toml if no config exists.

A rule has the form:

[[rules]]
name              = "gpu-core-hot"
signal            = "gpu.*.core_temp_c"   # glob expands per detected GPU
threshold         = 83
window_s          = 30
min_fraction_over = 0.67                  # 67% of last 30s over threshold
action            = "kill"                # log | throttle | kill

Rules referencing signals no probed collector provides are auto-disabled with a startup log line — adding new hardware support never breaks existing configs.

Status

Pre-release v0.4.2. End-to-end verified on the development rig (2× RTX 5090) and clean-installed on additional machines. CI runs the test suite on Windows + Linux × Python 3.10/3.11/3.12 plus a wheel install + CLI smoke test.

Primary development rig:

  • CPU: AMD Ryzen 9 9950X3D
  • GPU: 2× NVIDIA RTX 5090 (32 GB GDDR7)
  • RAM: 128 GB DDR5-5600
  • OS: Windows 11

Should work on any Windows 10/11 box with NVIDIA driver ≥ 535. AMD-GPU support is best-effort via LibreHardwareMonitor; a dedicated AMD/ROCm collector is a v0.2 candidate (PRs welcome — the Collector protocol in src/atfield/collectors/__init__.py is the public extension point).

Architecture

┌──────────────────── Windows Service (NSSM-wrapped) ──────────────────┐
│                                                                       │
│   ┌─────────────┐   ┌────────────────────────┐   ┌─────────────────┐ │
│   │ Collectors  │──►│ SignalStore + windows  │──►│ PolicyEngine    │ │
│   │ • system    │   │ - per-rule sliding     │   │ - glob expand   │ │
│   │ • nvml      │   │ - latest-sample        │   │ - rule eval     │ │
│   │ • lhm       │   │   liveness check       │   │ - cooldowns     │ │
│   │ • plugins   │   └────────────────────────┘   └────────┬────────┘ │
│   └─────────────┘                                          │          │
│                                                            ▼          │
│                                              ┌───────────────────┐    │
│                                              │ Actuator          │    │
│                                              │ - launcher walk-up│    │
│                                              │ - tree kill       │    │
│                                              │ - self-protection │    │
│                                              └─────────┬─────────┘    │
│                                                        ▼              │
│                                              %ProgramData%\ATField\   │
│                                              ├── config.toml          │
│                                              ├── watchdog.log         │
│                                              ├── events.jsonl         │
│                                              └── heartbeat.txt        │
└───────────────────────────────────────────────────────────────────────┘

License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

atfield-0.4.4.tar.gz (1.0 MB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

atfield-0.4.4-py3-none-any.whl (130.6 kB view details)

Uploaded Python 3

File details

Details for the file atfield-0.4.4.tar.gz.

File metadata

  • Download URL: atfield-0.4.4.tar.gz
  • Upload date:
  • Size: 1.0 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for atfield-0.4.4.tar.gz
Algorithm Hash digest
SHA256 06f516c139f4fb1b1dbbaebe09e9ad4a4e314712b85247e64d62ea3798ccbdff
MD5 4e49f76710803c595c66c0a18f9bc655
BLAKE2b-256 a5de917f25c47069ef72c9bfd86521e9ab7d13315f4fecc4bcc2b7d27d192e84

See more details on using hashes here.

Provenance

The following attestation bundles were made for atfield-0.4.4.tar.gz:

Publisher: release.yml on alonsorobots/at-field

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file atfield-0.4.4-py3-none-any.whl.

File metadata

  • Download URL: atfield-0.4.4-py3-none-any.whl
  • Upload date:
  • Size: 130.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for atfield-0.4.4-py3-none-any.whl
Algorithm Hash digest
SHA256 5a05d93cf25bf339e70c72a8f3e2f614717fc1e598e968f057100fa498498621
MD5 81bcf026e16b4e4261956cbbc40c9044
BLAKE2b-256 b36741d6f52b98c4cd78d07ae7e7ca8695a0f066d48b6ab02f44b117ed9c4937

See more details on using hashes here.

Provenance

The following attestation bundles were made for atfield-0.4.4-py3-none-any.whl:

Publisher: release.yml on alonsorobots/at-field

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page