Fleet Watch
Process governance for AI workloads on a single machine.
The Problem
You're running MLX, Ollama, vLLM, Candle/Cake, experiment runners, and AI coding agents on the same machine. They don't know about each other. Port 8899 gets stolen by a canary model. A 7B model quietly allocates 11 GB of Metal buffers on an 8 GB machine, swapping to SSD and running 65x slower than expected. Two Codex sessions write to the same repo. Health endpoints say "ok" while GPU memory is exhausted.
Fleet Watch prevents these collisions by maintaining a shared registry of what's running, what resources are claimed, and what's available — including pre-flight working set estimation that catches memory overcommit before it becomes a six-hour debug session.
Install
pipx install fleet-watch
Or from source:
pipx install ~/path/to/fleet-watch/
How It Works
Fleet Watch auto-discovers running AI processes (MLX servers, Ollama, vLLM, Candle/Cake, etc.) by scanning `lsof` and `ps` output. It registers them in a local SQLite database with their port claims, GPU memory estimates, and repo locks. Any tool — human or AI — can call `fleet guard --json` before taking resource actions.
You don't register anything manually. Run `fleet discover` or let the launchd agent do it every 60 seconds.
The GPU memory guard estimates the total working set (weights + KV cache + activations + framework overhead) and compares it against physical RAM. A Candle-based 7B model with Q4_K_M quantization needs ~10 GB of working set due to buffer pool overhead — Fleet Watch catches that on an 8 GB machine before you start the process.
Quick Start
# See what's running
fleet status
# Auto-discover and register all AI processes
fleet discover
# Pre-flight: will this model fit?
fleet guard --gpu 4096 --framework candle --model "qwen2.5-7B-Q4_K_M" --json
# Check port and repo availability
fleet guard --port 8899 --repo ~/projects/my-app --json
# System health: memory pressure, sessions, GPU memory watch
fleet health
# See the audit trail
fleet history
# Generate state report
fleet report
Example: 7B model on 8 GB Apple Silicon
$ fleet guard --gpu 4096 --framework candle --model "qwen2.5-7B-Q4_K_M"
DENY
GPU 4096MB: working set 7049MB exceeds physical RAM (8192MB) minus reserve (6144MB available)
Breakdown: weights 3337MB + kv_cache 1792MB + activations 64MB x 2.0x (candle)
Physical RAM available after reserve: 6144MB
Suggestion: Use q2_k quantization (~1668 MB weights, ~5380 MB working set)
GPU budget available: 113664MB (0MB allocated)
The same model on a 128 GB machine:
$ fleet guard --gpu 4096 --framework candle --model "qwen2.5-7B-Q4_K_M"
ALLOW
GPU 4096MB: available (113664MB free)
Working set: 7049MB (weights 3337 + kv 1792 + act 64) x 2.0x
Physical RAM available after reserve: 114688MB
GPU budget available: 113664MB (0MB allocated)
Always-On Mode (macOS)
fleet install-launchd
Fleet Watch will auto-discover processes and monitor GPU memory pressure every 60 seconds.
AI Session Integration
Add this to your AI tool's system prompt or config:
Before binding a port, starting a model server, or writing to a repo: run
`fleet guard --json` with the relevant `--port`, `--repo`, `--gpu`, `--framework`, and `--model` flags. If `"allowed": false`, do not proceed. Use `~/.fleet-watch/state.json` only as a fallback artifact when the CLI is unavailable.
JSON Contract
fleet guard --json
Top-level keys:
- `allowed` — boolean allow/deny decision
- `request` — what the caller asked to use
- `checks` — per-resource decision objects
- `state` — current machine summary
request contains:
- `port` — requested port or `null`
- `repo_dir` — absolute repo path or `null`
- `gpu_mb` — requested GPU claim or `null`
- `framework` — inference framework hint or `null`
- `model` — model name/path hint or `null`
checks.port contains:
`allowed`, `reason`, `holder`, `suggested_ports`
checks.repo contains:
`allowed`, `reason`, `holder`
checks.gpu contains:
- `allowed` — boolean
- `reason` — human-readable or `"working_set_exceeds_physical_ram"`
- `detail` — expanded explanation when denied
- `requested_mb` — requested GPU claim
- `available_mb` — currently available budget
- `suggested_max_mb` — maximum claim that fits
- `working_set` — (present when framework/model provided) object with:
  - `weights_mb`, `kv_cache_mb`, `activations_mb` — component breakdown
  - `overhead_multiplier` — framework-specific pool overhead (e.g. 2.0x for Candle)
  - `total_mb` — estimated total working set
  - `framework`, `model_size`, `quantization` — detected parameters
  - `physical_ram_mb`, `available_after_reserve_mb` — machine context
  - `fits` — boolean: does the working set fit in available RAM?
  - `grounded` — boolean: were framework and model size detected from real input?
  - `source` — `"explicit"`, `"command"`, `"fallback_default"`, or `"insufficient_input"`
  - `suggestion` — (when `fits` is false) actionable alternative
state contains:
- `process_count`, `occupied_ports`, `safe_ports`, `locked_repos`
- `gpu_budget` — `total_mb`, `reserve_mb`, `allocated_mb`, `available_mb`
- `external_resources`
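An agent consuming this contract only needs the top-level `allowed` flag plus the per-check reasons. A minimal Python sketch under that assumption — the `run_guard` and `denial_reasons` helpers are illustrative, built only from the flags and keys documented above:

```python
import json
import subprocess

def run_guard(port=None, repo=None, gpu_mb=None, framework=None, model=None):
    """Call `fleet guard --json` with the documented flags; return the parsed decision."""
    cmd = ["fleet", "guard", "--json"]
    if port is not None:
        cmd += ["--port", str(port)]
    if repo is not None:
        cmd += ["--repo", repo]
    if gpu_mb is not None:
        cmd += ["--gpu", str(gpu_mb)]
    if framework:
        cmd += ["--framework", framework]
    if model:
        cmd += ["--model", model]
    return json.loads(subprocess.run(cmd, capture_output=True, text=True).stdout)

def denial_reasons(decision):
    """Collect the `reason` from every per-resource check that was denied."""
    return [c.get("reason", "denied")
            for c in decision.get("checks", {}).values()
            if not c.get("allowed", True)]
```

If `denial_reasons` returns anything, the agent should stop rather than proceed.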
fleet health --json
- `memory` — RAM snapshot: `total_mb`, `pressure_pct`, `pageouts`, `swapins`, etc.
- `sessions` — discovered CLI sessions with RSS, CPU, classification
- `idle` — workload processes at near-zero CPU
- `gpu_memory_monitor` — runtime pressure data: `pageout_rate`, `gpu_process_footprints`, `alerts`
~/.fleet-watch/state.json
Top-level keys:
- `agent_interface`, `generated_utc`
- `processes`, `external_resources`, `process_count`
- `gpu_budget`, `ports_claimed`, `preferred_ports`, `safe_ports`, `repos_locked`
- `session_leases`, `process_classifications`, `stale_processes`, `recent_events`
- `conflicts_prevented_24h`, `system_memory`, `sessions`, `idle_processes`
- `gpu_memory_monitor` — latest discovery-cycle snapshot of pageout rate, per-process footprints, and active alerts
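When the CLI is unavailable, the fallback is to read this file directly, as the session-integration prompt suggests. A minimal sketch of that fallback — the `port_is_free` helper is illustrative and uses only the documented `ports_claimed` key:

```python
import json
from pathlib import Path

STATE_PATH = Path.home() / ".fleet-watch" / "state.json"

def load_state(path=STATE_PATH):
    """Return the parsed state snapshot, or None if the file does not exist."""
    try:
        return json.loads(Path(path).read_text())
    except FileNotFoundError:
        return None

def port_is_free(state, port):
    """Conservative fallback check: a port is free only if it is not claimed."""
    return port not in state.get("ports_claimed", [])
```

Remember this snapshot is only as fresh as the last discovery cycle; prefer `fleet guard --json` when the CLI is on the PATH.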
Commands
Core
| Command | What It Does |
|---|---|
| `fleet status` | Show active processes, GPU budget, claimed ports |
| `fleet status --json` | Machine-readable process and budget state |
| `fleet guard --json` | Canonical pre-flight contract for agents |
| `fleet guard --gpu MB --framework FW --model MODEL` | Working set estimation + allow/deny |
| `fleet check --port N --repo PATH --gpu MB` | Honest availability probe (exit 0/1) |
| `fleet discover` | Scan and register running AI processes |
| `fleet report` | Write STATE_REPORT.md + state.json |
Observability
| Command | What It Does |
|---|---|
| `fleet health` | RAM pressure, sessions, idle processes, GPU memory watch |
| `fleet health --json` | Machine-readable health and GPU monitor snapshot |
| `fleet changelog` | Rolling state changelog |
| `fleet changelog --json` | Raw changelog entries |
| `fleet history` | Hash-chained event audit trail |
| `fleet stale` | List heartbeat-stale processes with evidence |
| `fleet reconcile` | Non-destructive ownership diagnosis |
| `fleet reconcile --json` | Machine-readable ownership diagnosis |
Session Lifecycle
| Command | What It Does |
|---|---|
| `fleet session start` | Open or refresh a session lease |
| `fleet session heartbeat` | Refresh session lease heartbeat |
| `fleet session ensure` | Idempotent session management with retry |
| `fleet session close` | Close a session lease |
Process Management
| Command | What It Does |
|---|---|
| `fleet register` | Manually register a process |
| `fleet heartbeat --pid N` | Refresh heartbeat for a registered process |
| `fleet release --pid N` | Release all claims for a PID |
| `fleet reap` | Dry-run: show orphan-confirmed processes |
| `fleet reap --confirm` | Kill and release orphan-confirmed processes |
| `fleet reap-sessions` | Kill detached hot sessions (dry-run by default) |
| `fleet runaway` | Detect runaway high-CPU processes |
| `fleet runaway --kill` | Kill flagged runaway processes |
| `fleet preempt` | Take a port from a lower-priority holder |
| `fleet clean` | Remove entries for dead PIDs |
| `fleet install-launchd` | Install/update a launchd agent |
| `fleet watch` | Continuous discovery loop (foreground) |
Thunder (Remote GPU)
| Command | What It Does |
|---|---|
| `fleet thunder sync` | Ingest Thunder instances into Fleet Watch |
| `fleet thunder claim` | Attach ownership to a Thunder instance |
| `fleet thunder heartbeat` | Refresh Thunder resource heartbeat |
| `fleet thunder release` | Remove a Thunder instance |
GPU Working Set Estimation
Fleet Watch estimates total GPU working set per framework:
| Framework | Overhead Multiplier | Why |
|---|---|---|
| Candle/Cake | 2.0x | Retains intermediate buffers until command buffer completion |
| vLLM | 1.4x | Paged attention overhead |
| MLX | 1.3x | Aggressive buffer reuse |
| Ollama/llama.cpp | 1.1x | Tight memory management |
Working set = weights + (KV cache + activations) x overhead multiplier
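This formula reproduces the numbers in the 7B example above: weights 3337 MB plus (1792 + 64) MB scaled by Candle's 2.0x gives the reported 7049 MB working set. A minimal sketch of the calculation — the function name is illustrative, the multipliers come from the table above:

```python
# Framework overhead multipliers from the table above.
OVERHEAD = {"candle": 2.0, "vllm": 1.4, "mlx": 1.3, "ollama": 1.1}

def working_set_mb(weights_mb, kv_cache_mb, activations_mb, framework):
    """Working set = weights + (KV cache + activations) x overhead multiplier."""
    return weights_mb + (kv_cache_mb + activations_mb) * OVERHEAD[framework]

# qwen2.5-7B-Q4_K_M under Candle, as in the DENY example:
# 3337 + (1792 + 64) * 2.0 = 7049 MB — too large for the 6144 MB
# available after reserve on an 8 GB machine.
print(working_set_mb(3337, 1792, 64, "candle"))  # 7049.0
```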
Override multipliers via ~/.fleet-watch/config.json:
{
"gpu_estimator": {
"framework_overhead": {
"candle": 2.5
}
}
}
Configuration
Fleet Watch writes a default config on first run at ~/.fleet-watch/config.json.
{
"gpu_total_mb": 131072,
"gpu_reserve_mb": 16384,
"preferred_ports": [8000, 8001, 8080, 8100, 8888, 8899, 11434],
"patterns": [
{
"name_template": "My Server",
"process_match": "my_server.*serve",
"workstream": "my-project",
"priority": 3,
"gpu_mb_default": 4096
}
],
"session_patterns": [
{"name": "Claude Code", "kind": "claude-code", "process_match": "/claude\\b.*--"},
{"name": "Codex", "kind": "codex", "process_match": "/codex\\b"}
],
"idle_patterns": ["reranker", "socat.*TCP-LISTEN", "mlx_lm.*server"],
"idle_cpu_threshold": 1.0,
"pressure_thresholds": {"elevated": 70, "critical": 85}
}
Event Audit Trail
Every registration, release, conflict, and cleanup is logged with a SHA-256 hash chain. Verify integrity:
from fleet_watch import events, registry
conn = registry.connect()
valid, count = events.verify_chain(conn)
print(f"Chain valid: {valid}, events: {count}")
Ownership Model
Fleet Watch uses session leases to track who owns what. Process classification requires three independent signals before marking a process as safe to reap:
- Heartbeat expired — not seen by discovery in >180 seconds
- Session lease missing or closed — no active owner
- Parent chain detached — parent PID is dead or PID 1
All three must be true for `orphan_confirmed`. Use `fleet reconcile` to inspect, `fleet reap --confirm` to act.
States: live > disconnected > stale_candidate > orphan_confirmed > exited
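The three-signal rule can be sketched as follows — field and function names are illustrative; only the 180-second heartbeat threshold and the state names come from the docs:

```python
import time

HEARTBEAT_TTL_S = 180  # "not seen by discovery in >180 seconds"

def classify(last_seen_s, lease_active, parent_alive, now=None):
    """Mark a process orphan_confirmed only when all three signals agree."""
    now = time.time() if now is None else now
    heartbeat_expired = (now - last_seen_s) > HEARTBEAT_TTL_S
    # Signals: expired heartbeat, missing/closed lease, detached parent chain.
    if heartbeat_expired and not lease_active and not parent_alive:
        return "orphan_confirmed"
    if heartbeat_expired:
        return "stale_candidate"  # some signals disagree: do not reap
    return "live" if lease_active else "disconnected"
```

Requiring all three signals means a briefly unreachable but owned process degrades only to `stale_candidate` and is never killed by `fleet reap --confirm`.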
Design Principles
- Advisory, not mandatory. If Fleet Watch crashes, all processes continue normally.
- One honest verb per action. `guard` decides, `check` probes, `discover` observes.
- Single machine. No distributed consensus. SQLite is sufficient.
- Observe first. Default to alerting, not killing.
Limitations
- Discovery is heuristic and pattern-based. Unknown workloads are invisible until registered.
- GPU working set estimation uses architecture tables and framework multipliers, not kernel-level Metal accounting.
- Auto-discovery uses macOS `lsof`. On Linux, use manual registration via `fleet register`.
- Fleet Watch is advisory for human use. For AI agent sessions, a PreToolUse hook can make it fail-closed.
- Single-machine by design. No distributed coordination.
License
MIT