Always-on Windows hardware watchdog for GPU rigs: kills runaway workloads (AI training, renders, sims) before thermal or memory runaway damages your machine. Monitors NVIDIA GPU/VRAM temps, RAM, pagefile, CPU temp and PSU rails.
Project description
AT-Field — Windows GPU watchdog that kills runaway workloads before they cook your rig
Always-on Windows watchdog for your GPU rig. It monitors NVIDIA GPU + VRAM temperatures, GPU/VRAM/RAM usage, pagefile/commit charge, CPU package temp, and PSU rail voltages — and when a workload runs away, it kills the offending process tree before thermal or memory runaway crashes (or damages) your machine. Runs as a Windows service. Sustained-window thresholds ignore harmless load spikes.
Built for AI rigs — useful for any heavy GPU/CPU workload. Overnight model training is the headline case (and where the process-tree kill shines), but the same protection covers 3D/video renders, scientific/CUDA compute, overclock stability testing, or simply keeping an expensive GPU alive and your machine responsive. The watchdog runs headless; an optional tray dashboard gives you live sparklines and a pause toggle.
In Neon Genesis Evangelion, an AT Field is an absolute defensive barrier that prevents catastrophic damage to a high-power system. Here it's a backronym — Absolute Thermal-and-memory Field — a process-aware Windows service that intercepts the jobs trying to melt your rig.
Status: v0.4.2 (pre-release).
pip install atfieldfor the headless watchdog, or grab the one-click installer for the watchdog + tray dashboard in a single.exe. See CHANGELOG.md and docs/sensors.md for the full layered sensor strategy.
Two pieces, one tool
┌────────────────────────────────┐ ┌─────────────────────────────────┐
│ AT-Field Watchdog Service │ │ AT-Field Tray + Dashboard │
│ Python, NSSM, LocalSystem │ ◄──HTTP──│ Tauri (Rust + React), │
│ Always-on, no UI │ loopback │ user-mode, optional │
│ src/atfield/ │ :8765 │ at-field-tray/ │
└────────────────────────────────┘ └─────────────────────────────────┘
The service (this repo, src/atfield/) is the engine — it does all the
sensing, deciding, and killing. It runs as a Windows Service so it stays alive
without a logged-in user.
The tray app (at-field-tray/) is a thin lens over the service. It puts a status dot in your system tray (green / yellow / red / gray), pops open a dashboard with live signal sparklines and recent events, and gives you a one-click pause toggle. It's optional — the service works fine headless.
Project goals
These are the north stars every design decision is measured against:
- One-command install.
pip install atfield && atf installshould be the entire setup. No manual NSSM download, no manual sensor-daemon install, no editing service definitions inservices.msc. - Works on most setups out of the box. Any Windows 10/11 box with stock Python 3.10+ and an NVIDIA GPU should get useful protection immediately. Single-GPU, multi-GPU, AMD-CPU, Intel-CPU — capability is detected at startup; rules whose sensors aren't available are auto-disabled with a clear log message rather than failing the service.
- Zero config for the common case. The shipped defaults protect a typical AI rig without the user opening
config.toml. Configuration exists for power users, not as a prerequisite. Adding new hardware support must not require existing users to migrate their config.
Why?
If any of these are your search query, you're in the right place:
- "earlyoom for Windows" / "Windows OOM killer for runaway processes"
- "automatic process killer when GPU VRAM temperature too high"
- "NVIDIA RTX VRAM thermal protection sustained threshold"
- "kill python process when RAM exceeds threshold Windows service"
- "PyTorch / accelerate / torchrun killed my machine again"
- "how do I protect dual-GPU AI rig from runaway training jobs"
- "shut down Blender / render when GPU overheats Windows"
- "GPU thermal protection auto-kill for rendering / mining / overclock"
- "warn me before a long job crashes my workstation overnight"
Existing tools either monitor without acting (System-Resource-Monitor), throttle a single VRAM signal (VRAM-Guard, VRAM Shield), or are manual primitives (killall). AT-Field combines multi-signal sustained triggers, process-tree-aware Python targeting, and true Windows Service deployment.
What it does
- Watches every signal that matters: per-GPU core + memory junction temp, VRAM used %, system RAM %, pagefile / commit charge, CPU package temp. v0.3 adds PSU rail voltages (+12V / +5V / +3.3V / VCore) when LibreHardwareMonitor enumerates them — useful for catching voltage sag patterns that correlate with NVIDIA TDR / Kernel-Power 41 events on high-transient cards.
- Sustained-window logic, not instantaneous: a 2-second spike during model warmup never triggers. An actual sustained problem triggers within ~15 seconds.
- Process-tree-aware kill: when a runaway job is detected, walks the parent chain past known AI launchers (
torchrun,accelerate,deepspeed,mpiexec,ray,jupyter) to find the dispatcher, then terminates the whole tree. Self-healing workers can't respawn. - Rolling forensic buffer (v0.3): every per-tick sample is flushed to
forensics.jsonlevery 5 seconds and rotated on service start, so the next BSOD / power loss / hard reboot doesn't take pre-crash signal history with it.atf forensics --include-prev --since 10mreads back the last few minutes regardless of whether the watchdog survived. Append-only JSONL — the only format guaranteed partially readable after a power loss. - Layered sensor coverage (v0.3): NVML for NVIDIA, ROCm-SMI for AMD, psutil for system signals, and bundled LibreHardwareMonitor for VRAM-junction temp + CPU temp + voltages. LHM is read headlessly via its library — a tiny bundled
atfield-sensors.exeloadsLibreHardwareMonitorLib.dlland streams readings to the service — so there's no web server, port, URL ACL, or GUI to break. Full confidence matrix per signal × hardware × OS combo in docs/sensors.md. - Runs as a Windows Service (via NSSM) under
LocalSystem— works without an interactive login, survives reboots, no tray icon required. - Capability-negotiated: at startup, every collector probes its source. Anything missing (no LHM? AMD GPU? CPU sensor doesn't expose package temp?) downgrades to "rule disabled, with a clear log line" — never to "watchdog crashed" or "rule silently never fires".
- Audit trail: every action lands in
%ProgramData%\ATField\events.jsonlwith full process tree, signal values that triggered the rule, and rule name. - Safe by default: malformed config → observe-only mode (kills demoted to log entries), never kills the service's own PID,
never_kill_namesfilter forexplorer.exe,services.exe, etc. - Cheap to leave running: measured ~0.1–0.3% of one CPU core, ~55–85 MB RAM (bounded — fixed-size history rings), and ~40–90 MB/day of auto-rotating forensic log (hard-capped
250 MB). GPU telemetry is read-only NVML — the same counters$1–4/year**. Full data-backed breakdown with reproducible benchmarks in docs/footprint.md.nvidia-smireads, so it doesn't slow your training. Electricity works out to **
Install
Recommended — pip (no installer, no SmartScreen prompt)
If you have Python 3.10+ (most AI/dev boxes do), this is the fastest, friction-free path — no unsigned-installer SmartScreen warning, just a package install:
pip install atfield # or: pipx install atfield
atf install # run once from an elevated PowerShell — registers the service
atf status # confirm it's armed
atf install downloads NSSM into %ProgramData%\ATField\, registers the
LocalSystem auto-start Windows service, builds the headless sensor helper, drops
a starter config.toml, and starts the service. Existing config is preserved on
reinstall. Want VRAM-junction / CPU-package temps too? Run atf install-lhm first
to fetch the LibreHardwareMonitor DLLs.
This installs the headless watchdog + CLI — the engine that does all the work. For the live tray dashboard, use the one-click installer below.
Full experience — one-click installer (watchdog + tray dashboard)
- Download
AT-Field_x64-setup.exefrom the latest release. - Double-click it. Because the binary isn't code-signed yet, Windows SmartScreen may say "Windows protected your PC" — click More info → Run anyway. (Signing is on the v1.0 roadmap; the pip route above avoids this entirely.)
- Accept the single UAC prompt. The installer needs admin once to register the watchdog as a Windows service; that one consent covers the whole setup.
- Done. In its post-install step the installer registers the
AT-Field Watchdogservice (auto-start,LocalSystem), bundles the sensor helper + LibreHardwareMonitor DLLs, and drops a starterconfig.toml. A tray icon appears; click it for the dashboard.
No Python, no PATH, no separate elevated PowerShell. Step-by-step install, verification, and troubleshooting for a fresh machine live in docs/install.md.
LibreHardwareMonitor — bundled, read headlessly (v0.3+)
The board-level sensors (VRAM temp, CPU package temp, PSU rail voltages) come from LibreHardwareMonitor, but AT-Field reads it through its library, not its GUI web server. The bundled installer ships LibreHardwareMonitorLib.dll plus a tiny atfield-sensors.exe helper that loads the DLL and streams readings to the service over stdout — there's no web server, port, URL ACL, or GUI to misconfigure or break. (This replaced the fragile HTTP transport that a v0.9.4 → v0.9.6 LHM bump silently broke in v0.2.)
If you installed via pip install atfield instead of the bundled .exe, run atf install-lhm to fetch the LHM DLLs into %ProgramData%\ATField\; atf install then builds atfield-sensors.exe next to them using the in-box .NET Framework compiler (no extra toolchain needed). Set ATFIELD_SENSOR_EXE to override the helper location.
For all the gory details — confidence matrix per signal × hardware combo, and why we read LibreHardwareMonitor through its library instead of its web server — see docs/sensors.md.
Usage
atf status # service health, working signal map, disabled rules
atf inputs # one-shot probe + sample dump (use this to verify setup)
atf tail # follow events.jsonl
atf pause 30m # suspend kill actions for 30 minutes
atf unpause
atf test-kill <PID> # dry-run the kill walk-up against a real PID
atf run # foreground run (for debugging without NSSM)
atf forensics # read forensics.jsonl (rolling crash buffer); --include-prev pulls last run
atf install-lhm # download LibreHardwareMonitor v0.9.6 into %ProgramData%\ATField\
atf uninstall
atf inputs is the verification tool — run it after installing to confirm which sensors are visible and which rules will be active:
$ atf inputs
system: OK
reason: psutil + Win32 GlobalMemoryStatusEx OK
system.commit_percent = 11.681 percent
system.ram_used_percent = 18.600 percent
nvml: OK
reason: NVML driver 596.36, 2 GPU(s): NVIDIA GeForce RTX 5090, NVIDIA GeForce RTX 5090
gpu.0.core_temp_c = 38.000 celsius
gpu.0.vram_used_percent = 8.891 percent
gpu.1.core_temp_c = 35.000 celsius
...
lhmlib: OK
reason: atfield-sensors.exe streaming 6 sensor(s) via LibreHardwareMonitorLib
gpu.0.mem_junction_temp_c = 52.000 celsius
system.cpu_package_temp_c = 41.000 celsius
(If lhmlib shows unavailable, the helper binary is missing — build it with
scripts\build_helper.ps1 or set ATFIELD_SENSOR_EXE. The NVML and system
collectors still protect you in the meantime.)
Configuration
Defaults are conservative and protect a typical AI rig with no edits. To customize, see scripts/config.example.toml. The installer drops this in %ProgramData%\ATField\config.toml if no config exists.
A rule has the form:
[[rules]]
name = "gpu-core-hot"
signal = "gpu.*.core_temp_c" # glob expands per detected GPU
threshold = 83
window_s = 30
min_fraction_over = 0.67 # 67% of last 30s over threshold
action = "kill" # log | throttle | kill
Rules referencing signals no probed collector provides are auto-disabled with a startup log line — adding new hardware support never breaks existing configs.
Status
Pre-release v0.4.2. End-to-end verified on the development rig (2× RTX 5090) and clean-installed on additional machines. CI runs the test suite on Windows + Linux × Python 3.10/3.11/3.12 plus a wheel install + CLI smoke test.
Primary development rig:
- CPU: AMD Ryzen 9 9950X3D
- GPU: 2× NVIDIA RTX 5090 (32 GB GDDR7)
- RAM: 128 GB DDR5-5600
- OS: Windows 11
Should work on any Windows 10/11 box with NVIDIA driver ≥ 535. AMD-GPU support is best-effort via LibreHardwareMonitor; a dedicated AMD/ROCm collector is a v0.2 candidate (PRs welcome — the Collector protocol in src/atfield/collectors/__init__.py is the public extension point).
Architecture
┌──────────────────── Windows Service (NSSM-wrapped) ──────────────────┐
│ │
│ ┌─────────────┐ ┌────────────────────────┐ ┌─────────────────┐ │
│ │ Collectors │──►│ SignalStore + windows │──►│ PolicyEngine │ │
│ │ • system │ │ - per-rule sliding │ │ - glob expand │ │
│ │ • nvml │ │ - latest-sample │ │ - rule eval │ │
│ │ • lhm │ │ liveness check │ │ - cooldowns │ │
│ │ • plugins │ └────────────────────────┘ └────────┬────────┘ │
│ └─────────────┘ │ │
│ ▼ │
│ ┌───────────────────┐ │
│ │ Actuator │ │
│ │ - launcher walk-up│ │
│ │ - tree kill │ │
│ │ - self-protection │ │
│ └─────────┬─────────┘ │
│ ▼ │
│ %ProgramData%\ATField\ │
│ ├── config.toml │
│ ├── watchdog.log │
│ ├── events.jsonl │
│ └── heartbeat.txt │
└───────────────────────────────────────────────────────────────────────┘
License
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file atfield-0.4.3.tar.gz.
File metadata
- Download URL: atfield-0.4.3.tar.gz
- Upload date:
- Size: 1.0 MB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
6ba7933025aca349ae02a3a558af3611be72f7c4e7679867b387f7279f3e64eb
|
|
| MD5 |
205b8867156c5ed40ce1fe4db4c71605
|
|
| BLAKE2b-256 |
7e3f24dc551dc87f4ca06af576fa266b4704f60135fc91d0371ea00e7a15085f
|
Provenance
The following attestation bundles were made for atfield-0.4.3.tar.gz:
Publisher:
release.yml on alonsorobots/at-field
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
atfield-0.4.3.tar.gz -
Subject digest:
6ba7933025aca349ae02a3a558af3611be72f7c4e7679867b387f7279f3e64eb - Sigstore transparency entry: 1981084543
- Sigstore integration time:
-
Permalink:
alonsorobots/at-field@29e03800fe990e5416ab57a50cb5105e1d425e7c -
Branch / Tag:
refs/tags/v0.4.3 - Owner: https://github.com/alonsorobots
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
release.yml@29e03800fe990e5416ab57a50cb5105e1d425e7c -
Trigger Event:
push
-
Statement type:
File details
Details for the file atfield-0.4.3-py3-none-any.whl.
File metadata
- Download URL: atfield-0.4.3-py3-none-any.whl
- Upload date:
- Size: 130.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
4e45ac1a5882585bfa0be6ca737a2f4e48abd318a7d673967734da19ec435548
|
|
| MD5 |
bf0de63512c29e2705dc3d44b5291324
|
|
| BLAKE2b-256 |
26e090a2487b2f4ff216a21c2aa1c4efcdced1da4f74197315246e939130a912
|
Provenance
The following attestation bundles were made for atfield-0.4.3-py3-none-any.whl:
Publisher:
release.yml on alonsorobots/at-field
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
atfield-0.4.3-py3-none-any.whl -
Subject digest:
4e45ac1a5882585bfa0be6ca737a2f4e48abd318a7d673967734da19ec435548 - Sigstore transparency entry: 1981084646
- Sigstore integration time:
-
Permalink:
alonsorobots/at-field@29e03800fe990e5416ab57a50cb5105e1d425e7c -
Branch / Tag:
refs/tags/v0.4.3 - Owner: https://github.com/alonsorobots
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
release.yml@29e03800fe990e5416ab57a50cb5105e1d425e7c -
Trigger Event:
push
-
Statement type: