Self-healing supervisor for long-running processes — watch a command, catch the failure, fix it, leave a paper trail.
autosentry
autosentry supervises a long-running command — an ML training run, a data
pipeline, a service that's expected to stay up — watches its log stream for
known failure modes and anomalies, applies deterministic recovery rules when
it knows the answer, and escalates to a Claude Code shell when it doesn't.
Every incident is written into .autosentry/incidents/ as a folder
containing the exploded source around the failure, a stack trace, snapshots
of the configs that were in effect, and the fix that was applied.
It was generalized from a domain-specific monitor (rad_monitor.py) that
auto-healed a multi-stage ML pipeline through weeks of repeated failures.
The shape — file-based state, file-based outbox, synchronous main loop —
is deliberately simple so an operator can cat, grep, and kill -9
their way out of any problem the supervisor can't.
Contents
- Why autosentry
- Install
- Quick start
- How it works
- Anatomy of an incident
- Configuration reference
- Detectors
- Recovery rules and Claude fallback
- Fix branches and outcome verification
- Notifications
- Launching from your AI editor
- Update
- Status & roadmap
- Comparison
- FAQ
- Contributing
- License
Why autosentry
The main feature is the agent. autosentry's job is to put a
capable coding agent (Claude Code by default) in the chair when your
long-running process breaks, with enough structured context for it to
fix the actual bug — not just restart the process and hope. YAML
rules exist as a cheap fast lane for the small set of known
transients where a kill -HUP or an env-var nudge will do; for
everything else, the agent takes over by default.
Long-running jobs fail in three flavors:
- Known transient failures — NCCL hiccups, connection resets, OOMs that would clear with a smaller batch. The rule healer handles these in one shot.
- Anomalies — training stalls, loss spikes, throughput drops. Some match a rule; most need diagnosis. The agent takes the ones rules can't cover.
- Novel failures — code bugs, config mistakes, library regressions.
The agent reads the exploded source, the snapshotted configs, and
the stack trace, then proposes a patch on an isolated
autosentry/fix-* branch. autosentry watches the fix for the verification window; if the same detector re-fires, the fix is reverted and the next attempt gets fresh context. Outcomes (kept/regressed) are tracked in the attempts ledger.
autosentry's default posture is to escalate to the agent quickly. Two
unverified rule-driven restarts and Claude takes over. A rule-based
fix that regresses inside the verify window pivots the next attempt
straight to the agent regardless of count — rules already failed on
that detector, so cycling them again is wasted budget. Both thresholds
are configurable (healing.escalate_to_claude_after,
healing.escalate_on_rule_regression); the defaults are tuned for
"agent first, rules as accelerator."
Install
One-line (recommended)
curl -fsSL https://raw.githubusercontent.com/ulmentflam/autosentry/main/install.sh | sh
The installer detects uv, pipx, or pip (in that order) and uses the
best one available. Pin a specific version with AUTOSENTRY_VERSION=0.2.0.
Homebrew (macOS / Linux)
brew install ulmentflam/tap/autosentry
That one-liner taps ulmentflam/homebrew-tap and installs autosentry into its
own virtualenv. Already tapped? brew install autosentry. Upgrade with
brew upgrade autosentry. Each release auto-syncs the formula, so the tap
tracks the latest version.
From PyPI
uv add autosentry # uv
pipx install autosentry # pipx (isolated)
pip install autosentry # plain pip
From source
git clone https://github.com/ulmentflam/autosentry.git
cd autosentry
make install
macOS / iCloud Drive caveat
If your clone lives under ~/Library/Mobile Documents/, iCloud sets
UF_HIDDEN on _*.pth files inside any venv and Python's site.py
then skips them, breaking editable installs. The Makefile detects
this and points the venv at ~/.cache/autosentry-venv automatically.
Override with make install VENV=/path/to/venv.
Let an AI agent do it
If you're already in a Claude Code / Cursor / Codex / Aider / OpenCode /
Windsurf / Zed / Continue / Gemini session, paste the prompt block below
and let the agent install and configure autosentry for the repo you're
sitting in. It's a short, declarative brief — the agent runs the right
commands for your stack, asks before destructive actions, and leaves
the repo in a state where autosentry run works on the next try.
Agent install brief — copy/paste into your session
Install and set up autosentry in this repo.
Follow this order; stop and ask me before doing anything that would
overwrite an existing file or change tracked code.
1. Verify autosentry isn't already installed. If it isn't, install it
with the one-liner from the README:
curl -fsSL https://raw.githubusercontent.com/ulmentflam/autosentry/main/install.sh | sh
Then confirm with `autosentry --version`.
2. Run `autosentry init --non-interactive` to scaffold the .autosentry/
tree (the config lives at `.autosentry/autosentry.yaml`). (Use
`--upgrade --force` if a config already exists and looks pre-0.6.1, or
to migrate a legacy root-level `autosentry.yaml` into `.autosentry/`.)
3. Inspect this repo to figure out:
- the right `process.command` (the thing I want supervised — read
pyproject.toml / package.json / Cargo.toml / go.mod / the
Makefile / scripts/ to guess; ASK ME before settling on it)
- which files belong in `config_snapshots` (env files, run configs,
pipeline definitions)
- a starting set of detectors and rules tailored to my stack
(OOM / NCCL / connection-reset patterns for ML; HTTP 5xx /
connection-refused for web; stall regex matching whatever
progress format my process emits)
Edit `.autosentry/autosentry.yaml` in place.
4. Install the /autosentry slash command for me with
`autosentry skills install --tool all`. This drops AGENTS.md plus the
per-tool wrappers so future sessions get the playbook automatically.
5. Run `autosentry doctor`. If anything is red, fix it. If it's all
green or only warnings, summarize the warnings.
6. Tell me the exact command to start the monitor in the background
(the `nohup autosentry run …` one-liner), but DO NOT run it
yourself. I'll start it.
Be terse. One or two sentences per step. Point me at
`.autosentry/autosentry.yaml` and `.autosentry/program.md` for context —
don't re-narrate the docs.
Quick start
pip install autosentry # or: uv add autosentry / pipx install autosentry
cd my-project
autosentry init # interactive: detects your stack, asks for process.command
autosentry doctor # verifies the env is healthy
autosentry run # starts monitoring
autosentry init is interactive from a real terminal — it detects
whether your repo is python/node/go/rust, suggests a starter
process.command, offers to snapshot config files, and (if you say
yes) installs the /autosentry slash command into whichever AI editors
it can find. From scratch to a running monitor is roughly five
minutes; most of that is reading the YAML it wrote.
A healthy autosentry run opens with a starting line naming your
supervisor and command, hands control to the tick loop, and from then
on only emits log lines on detections, state changes, and verification
outcomes. Silent is healthy. Sanity-check from another shell with
autosentry status (live pid + restarts counter), autosentry watch for the rich TUI, or autosentry doctor if anything looks
off.
From inside an AI editor
If you're already in a Claude Code / Cursor / Codex / Aider / OpenCode / Windsurf / Zed / Continue / Gemini session, you have two options:
autosentry init --for-agent # writes .autosentry/AGENT_NOTES.md, a cheat
# sheet the agent reads instead of paraphrasing docs
autosentry onboard --for-agent # phase-aware plain-text playbook, no scaffolding
Or paste the agent install brief into your
session and let it drive the whole sequence — install → init → detector
proposals → autosentry skills install → autosentry doctor — asking
before any destructive change.
After it's running
tail -F .autosentry/logs/autosentry.log # structured log
autosentry watch # rich TUI: state, incidents, log tail
autosentry web # browse incidents in your browser
autosentry status # one-shot state dump
autosentry incidents list # CLI incident browser
autosentry incidents show 2026-05-26T14-32-10Z-error-traceback
Bidirectional Slack (separate shell — the monitor stays offline-safe):
SLACK_BOT_TOKEN=xoxb-… autosentry dispatcher run --channel C0A4UK987ND
How it works
┌─────────────────────────────────────────────────────────────────┐
│ autosentry monitor │
│ start ──► tick loop: read log lines → run detectors → fire │
│ healers → apply action → write incident → notify │
└─────────────────────────────────────────────────────────────────┘
│ │ │ │
▼ ▼ ▼ ▼
Supervisor Detectors Healers Notifiers
local / slurm / pattern / rules.yaml → log /
docker / attach traceback / Claude (sub- slack outbox /
stall / process or discord outbox /
exit_code interactive) webhook
│ │
▼ ▼
┌──────────────────────────┐ ┌──────────────────────┐
│ Incident store │ │ Slack dispatcher │
│ .autosentry/incidents/ │ │ outbox → Slack │
│ <ts>-<kind>/ │ │ Slack thread → inbox │
│ report.md │ │ (abort/pause/set…) │
│ trace.txt │ └──────────────────────┘
│ frames/*.md │
│ configs/* │ ┌──────────────────────┐
│ state.json │ ◄──│ autosentry watch │
│ fix/ │ ◄──│ autosentry web │
└──────────────────────────┘ └──────────────────────┘
Layers are deliberately small and pluggable:
| layer | role | built-in implementations |
|---|---|---|
| Supervisor | start, observe, restart the process | local, slurm, docker, attach |
| Detector | watch the log stream and process state for anomalies | pattern, traceback, stall, exit_code |
| Healer | decide what to do about a detection | YAML rules → Claude CLI fallback |
| Incident store | persist a forensic record of what happened + the fix | folder-per-incident with index.jsonl |
| Notifier | broadcast events | log, slack_outbox, discord_outbox, webhook |
| Dispatcher | bidirectional Slack bridge (outbound + thread inbound) | stdout, webhook, slack_api |
| Visualization | operator surfaces | autosentry watch, autosentry web |
Read it left-to-right as a pipeline. The supervisor owns the
process and hands the monitor a log-line queue. Detectors each get
every line plus a periodic tick; the first one to fire produces a
Detection. A healer consumes that detection — first the
deterministic rule engine, then Claude if no rule matches (or if
escalation is active). The healer returns an action; the monitor
applies it (restart, env tweak, abort, custom command), captures the
attempt's outcome on an isolated fix
branch, and asks the
incident store to commit a forensic folder. Notifiers broadcast
the event as a side effect. None of these layers know about the others'
guts — they share Detection, HealerOutcome, and Incident and
nothing else.
The monitor's main loop is a single, synchronous Python thread that
pulls log lines off the supervisor's queue. No async, no callbacks
across processes. If you can read
monitor.py, you can debug anything
autosentry does.
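The loop described above can be sketched in a few lines. This is a minimal illustration, not the contents of monitor.py: `Detection` here is a stand-in dataclass, and `make_pattern_detector`, `tick`, and `rule_then_agent` are hypothetical names chosen for the example.

```python
import queue
import re
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Detection:
    detector: str
    line: str


def make_pattern_detector(name: str, regex: str) -> Callable[[str], Optional[Detection]]:
    """Build a detector that fires when a log line matches the regex."""
    compiled = re.compile(regex)

    def detect(line: str) -> Optional[Detection]:
        return Detection(name, line) if compiled.search(line) else None

    return detect


def tick(lines: "queue.Queue[str]", detectors: list, heal: Callable[[Detection], str]) -> List[str]:
    """Drain the log-line queue once and return the action chosen per detection."""
    actions = []
    while True:
        try:
            line = lines.get_nowait()
        except queue.Empty:
            return actions
        for detect in detectors:
            detection = detect(line)
            if detection is not None:
                actions.append(heal(detection))
                break  # the first detector to fire wins for this line


def rule_then_agent(detection: Detection) -> str:
    # Hypothetical healer: one rule for "oom", everything else goes to the agent.
    return "restart_with_env" if detection.detector == "oom" else "escalate_to_agent"
```

Everything runs on the calling thread, so a breakpoint inside `tick` sees the whole pipeline in one stack.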
Anatomy of an incident
A .autosentry/incidents/<ts>-<kind>/ folder looks like this:
2026-05-26T14-32-10Z-error-traceback/
├── report.md ← human-readable, the marquee artifact
├── trace.txt ← raw stack trace
├── log_excerpt.txt ← ±200 lines around the failure
├── frames/
│ ├── 01-train.py.md ← exploded source for frame 1 (function + ±10 lines)
│ ├── 02-loader.py.md
│ └── 03-_torch_dist.py.md ← library frame, source explode skipped
├── configs/
│ ├── run.yaml ← snapshot of each declared config file
│ └── .env
├── state.json ← monitor state at the moment of the incident
├── rule_match.json ← which YAML rule fired (or "claude")
└── fix/
├── action.json ← {"kind":"restart_with_env","env":{"BATCH_SIZE":"4"}}
├── diff.patch ← if Claude edited files, the diff lives here
└── claude_response.md ← Claude's diagnosis text (if invoked)
A real report.md for an OOM:
# Incident — 2026-05-26 14:32:10 UTC — error / traceback
**Process:** local · `python train.py --config configs/run.yaml`
**PID:** 41822 · **Restart #:** 2/5
**Detector:** `traceback` (Python)
**Resolution:** rule `oom` → restart_with_env (BATCH_SIZE=4)
---
## Source — frame 1
`src/train.py:142` in `TrainLoop.step()`
```python
class TrainLoop:
    def step(self, batch):
        self.optimizer.zero_grad()
>>>     logits = self.model(batch["input_ids"])   # line 142
        loss = self.criterion(logits, batch["labels"])
        loss.backward()
```
## Stack trace
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.34 GiB
## Fix applied
Rule oom matched. Restarted with BATCH_SIZE=4 (was 8).
Anomalies (stall, loss spike, etc.) get the same shape — the trace section
is replaced by a recent-metrics block and the configs section is expanded
into a "decisions that might be relevant" view.
---
## Configuration reference
After `autosentry init` you have a `.autosentry/autosentry.yaml` with every
option commented in place. It lives inside `.autosentry/` alongside the
runtime state, and `init` drops a `.autosentry/.gitignore` so the whole
tree — config included — stays out of git by default (delete that file to
track the config). Relative paths in the config resolve against the
**project root** — the directory that contains `.autosentry/` — not the
config file's own directory. A pre-0.8 root-level `autosentry.yaml` is
still loaded as a fallback; `autosentry init --upgrade` migrates it.
The top-level shape:
| key | type | default | what it does |
|--------------------|---------------|----------------------------------|--------------|
| `process.kind` | enum | `local` | `local`, `slurm`, `docker`, or `attach` (tail an existing PID/log) |
| `process.command` | list[str] | — | argv passed to the process. No shell interpretation. |
| `process.cwd` | str | `.` | working dir, relative to the project root (the dir containing `.autosentry/`) |
| `process.env` | dict[str,str] | `{}` | env vars; values can interpolate `$VAR` / `${VAR}` |
| `process.restart_policy.max_restarts` | int | `10` | when exceeded, monitor gives up |
| `process.restart_policy.cooldown_seconds` | int | `60` | wait before restart |
| `monitor.poll_interval_seconds` | int | `30` | tick rate for status checks and tick-driven detectors |
| `monitor.log_dir` | str | `.autosentry/logs` | structured log + supervised process log live here |
| `monitor.log_excerpt_lines` | int | `200` | lines per incident `log_excerpt.txt` |
| `config_snapshots` | list[str] | `[]` | files copied verbatim into every incident folder |
| `source_explode.context_lines` | int | `10` | lines around hot line when AST framing fails |
| `source_explode.languages` | list | `[python, javascript, typescript, go, rust, java]` | tree-sitter grammars to load |
| `source_explode.skip_paths` | list | site-packages / node_modules / etc. | frames in these paths emit "library" stubs |
| `detectors` | list | see below | what to watch for |
| `rules` | list | `[]` | YAML rule engine; first match wins |
| `healing.claude.enabled` | bool/str| `auto` | `true`/`false`/`auto`; `auto` enables when skill or CLI is present |
| `healing.claude.mode` | enum | `auto` | `auto`/`subprocess`/`interactive`; see [healer modes](#healer-modes) |
| `healing.claude.command` | list | `["claude", "--print"]` | how to invoke Claude in subprocess mode |
| `healing.claude.timeout_seconds` | int | `600` | Claude's budget per incident |
| `healing.claude.request_path` | str | `.autosentry/recovery_request.md`| interactive handshake: file the monitor writes |
| `healing.claude.response_path` | str | `.autosentry/recovery_response.md`| interactive handshake: file the subagent writes |
| `healing.escalate_to_claude_after` | int | `max_restarts // 5` (≥ 1) | force-escalate to the agent after N unverified rule restarts (default = 2 with `max_restarts=10`) |
| `healing.escalate_on_rule_regression` | bool | `true` | if a rule-based fix regresses, force the agent on the next attempt for that detector |
| `healing.verify_window_seconds` | int | `600` | window in which a re-fire counts as a regression |
| `healing.budget.max_attempts_per_detector_per_hour` | int | `5` | per-detector heal-attempt rate cap |
| `notifiers` | list | `[{kind: log}]` | event sinks |
| `state_path` | str | `.autosentry/state.json` | persistent state location |
| `incidents_dir` | str | `.autosentry/incidents` | where incident folders go |
---
## Detectors
| kind | what it does |
|---------------|-------------------------------------------------------------------------------------------------------|
| `pattern` | Fires when a log line matches a regex. Cheapest, most common. |
| `traceback` | Picks up multi-line stack traces from **Python**, **Node/JS**, **Go**, **Rust**, **Java**. |
| `stall` | With `metric_regex`: progress value doesn't advance for N seconds → anomaly. Without: no log output for N seconds → anomaly. |
| `exit_code` | Process exits non-zero (or zero too, if you flip `nonzero_only: false`). |
Example detector block:
```yaml
detectors:
  - kind: pattern
    name: oom
    regex: "(OutOfMemoryError|CUDA out of memory)"
  - kind: pattern
    name: nccl
    regex: "NCCL.*(error|timeout)"
  - kind: traceback
  - kind: stall
    name: training_stall
    metric_regex: "step (\\d+)/"
    no_progress_seconds: 1800
  - kind: exit_code
```
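To make the two stall modes concrete, here is a rough sketch of how a stall detector could track progress. It is an illustration under assumptions, not autosentry's implementation: the class and method names are hypothetical, and it assumes the first capture group of `metric_regex` is an integer.

```python
import re
from typing import Optional


class StallDetector:
    """Fires when nothing has advanced for no_progress_seconds."""

    def __init__(self, no_progress_seconds: float,
                 metric_regex: Optional[str] = None, started_at: float = 0.0):
        self.no_progress_seconds = no_progress_seconds
        self.metric = re.compile(metric_regex) if metric_regex else None
        self.last_value: Optional[int] = None
        self.last_progress_at = started_at

    def on_line(self, line: str, now: float) -> None:
        if self.metric is None:
            # No metric configured: any log output counts as progress.
            self.last_progress_at = now
            return
        match = self.metric.search(line)
        if match:
            value = int(match.group(1))  # assumption: integer progress value
            if self.last_value is None or value > self.last_value:
                self.last_value = value
                self.last_progress_at = now

    def on_tick(self, now: float) -> bool:
        """Return True when the stall threshold has been reached."""
        return now - self.last_progress_at >= self.no_progress_seconds
```

With `metric_regex` set, a process that keeps logging the same step number still counts as stalled, which is the case plain log-silence detection would miss.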
Recovery rules and Claude fallback
The agent is the main healer. Rules are a cheap fast lane for the
small set of known transients where a deterministic action is
known-good — restart on OOM, set NCCL_P2P_DISABLE=1 and restart on
NCCL hiccups, drop the batch size on CUDA out of memory. Anything
that isn't one of those falls through to the agent immediately. Rules
that do match but produce a fix that regresses inside the verify
window pivot the next attempt to the agent automatically (see
Healer-aware restart budget).
Rules are tried top-down; the first one whose match clause is
satisfied by a detection wins. The Claude healer runs in one of two
modes — picked automatically based on what's installed (see
mode resolution below).
Healer modes
- subprocess: spawn claude --print as a headless process, pipe the prompt in, capture stdout. Works without an open Claude Code session; needs the claude CLI on PATH. Use in CI, k8s, on air-gapped boxes.
- interactive: write a recovery request file with YAML frontmatter (incident id, detector, recommended subagent type), then block waiting for a response. The /autosentry slash command running in your open Claude Code session sees the request, spawns a subagent via the Task tool with the incident's full context, and writes the response (typically via autosentry healer respond). Keeps your main session clean; the subagent owns the diagnosis.
The file handshake (interactive mode)
Two files, one direction each, polled by mtime. No sockets, no daemon.
| file | written by | read by |
|---|---|---|
| .autosentry/recovery_request.md | monitor (the blocking healer) | /autosentry skill in Claude |
| .autosentry/recovery_response.md | a Task-tool subagent (autosentry healer respond) | monitor (mtime-gated wait) |
Paths are configurable (healing.claude.request_path /
.response_path). The healer captures a baseline mtime before
writing the request so a stale response file from a previous run is
ignored. The wait is bounded by healing.claude.timeout_seconds.
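The baseline-then-wait pattern can be sketched as follows. This is a simplified illustration of the handshake as described, not the healer's actual code; `request_and_wait` and its parameters are hypothetical names.

```python
import time
from pathlib import Path
from typing import Optional


def _mtime(path: Path) -> float:
    """Return the file's mtime, or 0.0 if it does not exist yet."""
    try:
        return path.stat().st_mtime
    except FileNotFoundError:
        return 0.0


def request_and_wait(request: Path, response: Path, body: str,
                     timeout_seconds: float, poll_seconds: float = 0.5) -> Optional[str]:
    """Write the request, then wait for a response newer than the baseline."""
    baseline = _mtime(response)  # captured BEFORE the request is written
    request.write_text(body)
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        if _mtime(response) > baseline:  # a stale file never passes this check
            return response.read_text()
        time.sleep(poll_seconds)
    return None  # timed out; the caller decides what to do next
```

A responder that writes to a temporary file and renames it into place avoids the reader ever seeing a half-written response.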
healing:
  claude:
    enabled: auto            # auto-detect; never red-light the doctor
    mode: auto               # interactive if skill installed, else subprocess
    subagents:
      default:
        type: general-purpose
        description: "Diagnose an autosentry incident"
      training_stall:
        type: Plan           # specialize per detector
        description: "Diagnose a stalled training loop"
Mode resolution (when mode: auto):
| /autosentry skill installed | claude on PATH | resolved mode |
|---|---|---|
| yes | — | interactive |
| no | yes | subprocess |
| no | no | disabled (rule-only — no red doctor row) |
autosentry doctor reports the resolved mode so you can see which one
will actually run.
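The resolution table reduces to a small decision function. The sketch below is illustrative only: `resolve_mode` and `probe` are hypothetical names, and using the repo-local Claude Code wrapper path as the "skill installed" signal is an assumption made for the example.

```python
import shutil
from pathlib import Path


def resolve_mode(skill_installed: bool, claude_on_path: bool) -> str:
    """Mirror the mode-resolution table for mode: auto."""
    if skill_installed:
        return "interactive"
    if claude_on_path:
        return "subprocess"
    return "disabled"  # rule-only operation


def probe(project_root: Path) -> str:
    # Assumption: the presence of the Claude Code wrapper file stands in
    # for "skill installed"; the real check may look elsewhere.
    skill = (project_root / ".claude" / "commands" / "autosentry.md").is_file()
    return resolve_mode(skill, shutil.which("claude") is not None)
```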
Subagents
In interactive mode the healer doesn't talk to Claude directly — it
prompts the /autosentry skill (running in the user's open session) to
spawn a Task-tool subagent of the type declared in the request
frontmatter. That subagent reads the incident folder, edits the repo
if needed, and writes the response file with one Bash call:
autosentry healer respond \
--action restart_with_env \
--set BATCH_SIZE=4 \
--diagnosis "OOM at step 8450; halving batch."
Per-detector subagent routing (healing.claude.subagents) lets each
failure mode get the right kind of investigator without inflating the
operator's main conversation.
Rule-only operation is a first-class mode, not a degraded one — set
healing.claude.enabled: false and autosentry skips both subprocess
and interactive paths.
rules:
  - name: oom_halve_batch
    match: { detector: oom }
    action:
      kind: restart_with_env
      set:
        BATCH_SIZE: half        # halves the prior overlay
      notify: true
  - name: transient_restart
    match: { detector: nccl }
    action: { kind: restart, notify: true }
  - name: stall_restart
    match: { detector: training_stall }
    action: { kind: restart, notify: true }
Supported actions: restart, restart_with_env (with half/double/literal
values in set:), pause, abort, custom_command.
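As a rough sketch of first-match-wins and the half/double/literal values, assuming rules are already loaded into plain dicts shaped like the YAML above. The function names are hypothetical, the values are assumed to be integers, and the floor of 1 when halving is an assumption of this example rather than documented behavior.

```python
from typing import Optional


def first_matching_rule(rules: list, detector: str) -> Optional[dict]:
    """Return the first rule whose match clause names this detector."""
    for rule in rules:
        if rule["match"].get("detector") == detector:
            return rule
    return None


def apply_set(env: dict, updates: dict) -> dict:
    """Resolve half/double/literal values against the current env overlay."""
    result = dict(env)
    for key, value in updates.items():
        if value == "half":
            result[key] = str(max(1, int(env[key]) // 2))  # assumed floor of 1
        elif value == "double":
            result[key] = str(int(env[key]) * 2)
        else:
            result[key] = str(value)  # literal value, stored as a string
    return result
```

Because `half` works on the prior overlay, repeated OOM matches keep shrinking the value: 8, then 4, then 2.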
When Claude is invoked, it reads:
- the recovery prompt template at .autosentry/prompts/recovery.md,
- the current state.json,
- the last incident report (so it has the exploded source frames),
- snapshots of every file listed in config_snapshots.
It is expected to (a) optionally edit files in place — those edits are
captured into fix/diff.patch — and (b) end its response with an ACTION:
block:
ACTION: restart_with_env
SET: BATCH_SIZE=4
If Claude says ACTION: abort, the monitor stops and waits for a human.
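A parser for that trailing block might look like the sketch below. It is an illustration under assumptions: `parse_action` is a hypothetical name, the last ACTION: line is taken as authoritative, and the returned shape simply mirrors the action.json example shown earlier.

```python
def parse_action(response: str) -> dict:
    """Extract the final ACTION: block (and its SET: lines) from a response."""
    action = None
    env = {}
    for raw in response.splitlines():
        line = raw.strip()
        if line.startswith("ACTION:"):
            action = line[len("ACTION:"):].strip()
            env = {}  # a later ACTION: line replaces any earlier one
        elif line.startswith("SET:") and action is not None:
            key, sep, value = line[len("SET:"):].strip().partition("=")
            if sep:
                env[key.strip()] = value.strip()
    if action is None:
        raise ValueError("no ACTION: line found in response")
    return {"kind": action, "env": env}
```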
Fix branches and outcome verification
A healer's fix is only as good as the next few minutes of runtime. To keep regressions out of your working tree, every Claude-driven fix runs on its own branch and isn't kept unless it survives a verification window. The pattern is borrowed from autoresearch.
- When the Claude healer fires, autosentry creates autosentry/fix-<incident-id> off the current HEAD.
- Claude's edits land on that branch. The supervisor is restarted.
- The monitor watches for the same detector to re-fire within healing.verify_window_seconds (default 600s).
- No recurrence → the attempt is marked kept in attempts.tsv. With healing.git.auto_merge: true the branch fast-forwards into your working branch and is deleted; otherwise the branch is left for you to merge by hand.
- Recurrence inside the window → the attempt is marked regressed. The working tree is restored, you're returned to your original branch, and the fix branch stays put as a forensic artifact.
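The kept/regressed decision in the steps above comes down to one time-window check. A minimal sketch, assuming timestamps in seconds and treating both window edges as inclusive (an assumption of this example); `classify_attempt` is a hypothetical name.

```python
from typing import Iterable


def classify_attempt(applied_at: float, refires: Iterable[float], now: float,
                     window_seconds: float = 600.0) -> str:
    """Classify a fix attempt from the same detector's re-fire timestamps."""
    window_end = applied_at + window_seconds
    if any(applied_at <= t <= window_end for t in refires):
        return "regressed"  # the same detector fired again inside the window
    return "kept" if now >= window_end else "pending"
```

A re-fire after the window closes does not retroactively regress the earlier attempt; it starts a new incident.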
Every attempt is recorded in .autosentry/attempts.tsv — flat
tab-separated, append-only, grep-friendly. Browse it with
autosentry analyze:
$ autosentry analyze --since 24h
attempts — 14 total (last 24h)
kept=9 pending=1 regressed=3 crashed=1
top failing detectors
┃ detector ┃ attempts ┃
┃ training_stall ┃ 6 ┃
┃ oom ┃ 4 ┃
┃ nccl ┃ 3 ┃
per-rule success
┃ source ┃ total ┃ kept ┃ regressed ┃ success ┃
┃ oom_halve_batch ┃ 4 ┃ 3 ┃ 1 ┃ 75% ┃
┃ stall_restart ┃ 6 ┃ 3 ┃ 2 ┃ 60% ┃
┃ claude ┃ 3 ┃ 3 ┃ 0 ┃ 100% ┃
Healer-aware restart budget
The restart counter is outcome-aware, not a dumb tally. Three pieces, all in service of the same posture: get the agent on it before rules burn the budget.
- Kept fixes reset the counter. When a verification window closes with no recurrence, state.restarts drops back to 0. A run that survives a real failure mid-week doesn't burn its restart budget on the next, unrelated incident.
- Force-escalate after two unverified restarts. Once state.restarts hits healing.escalate_to_claude_after (default: max(1, max_restarts // 5), so 2 with the default max_restarts=10), the next detection skips the rule healer and goes straight to the agent. Rules clearly aren't sticking; bring in the heavier diagnosis.
- Rule regression auto-pivots to the agent. If a rule-based fix regresses inside the verify window (healing.escalate_on_rule_regression=true by default), the next attempt for that detector skips rules entirely and routes to the agent. Rules already failed on that detector; recycling them is wasted budget. The marker clears after use, so a different detector still gets the cheap rule path on its first try.
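Taken together, the escalation pieces above amount to a short routing decision. The sketch below is an illustration, not the monitor's code: `escalate_threshold` and `pick_healer` are hypothetical names, and the regression markers are modeled as a plain set of detector names.

```python
def escalate_threshold(max_restarts: int) -> int:
    """Default for healing.escalate_to_claude_after."""
    return max(1, max_restarts // 5)


def pick_healer(restarts: int, max_restarts: int, detector: str,
                regressed_detectors: set, rule_matches: bool) -> str:
    """Decide whether the next attempt goes to a rule or to the agent."""
    if detector in regressed_detectors:
        regressed_detectors.discard(detector)  # the marker clears after use
        return "agent"
    if restarts >= escalate_threshold(max_restarts):
        return "agent"  # too many unverified rule restarts
    return "rule" if rule_matches else "agent"
```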
The complementary per-detector rate cap
(healing.budget.max_attempts_per_detector_per_hour, default 5) keeps a
runaway failure mode from monopolizing the healer. When it burns
through, the monitor still writes incidents and notifies — but stops
trying fixes for that detector until a manual approve lands in the
Slack inbox.
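One way to picture the per-detector cap is a sliding one-hour window of attempt timestamps. This is a sketch under that assumption (the source does not say whether the window slides or is bucketed), and `AttemptBudget` is a hypothetical name.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class AttemptBudget:
    """Per-detector cap on heal attempts within a sliding window."""

    def __init__(self, max_per_hour: int = 5, window_seconds: float = 3600.0):
        self.max_per_hour = max_per_hour
        self.window_seconds = window_seconds
        self._attempts = defaultdict(deque)

    def try_spend(self, detector: str, now: Optional[float] = None) -> bool:
        """Record an attempt if budget remains; return False when capped."""
        now = time.time() if now is None else now
        stamps = self._attempts[detector]
        # Drop attempts that have aged out of the window.
        while stamps and now - stamps[0] >= self.window_seconds:
            stamps.popleft()
        if len(stamps) >= self.max_per_hour:
            return False
        stamps.append(now)
        return True
```

Each detector has its own deque, so a noisy detector exhausting its budget leaves the others unaffected.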
Notifications
Notifier specs are a list under notifiers:. Built-ins:
notifiers:
  - kind: log                          # always-on default
  - kind: slack_outbox
    outbox_path: .autosentry/slack_outbox.jsonl
    channel: "C0A4UK987ND"             # Slack channel id
    thread_key: "pipeline"
  - kind: discord_outbox
    outbox_path: .autosentry/discord_outbox.jsonl
    channel: "123456789012345678"      # Discord channel id (snowflake)
    thread_key: "pipeline"
  - kind: webhook
    url: "https://hooks.example.com/autosentry"
Neither outbox notifier (slack_outbox, discord_outbox) talks to chat directly; each appends JSON lines to an
outbox file. A separate autosentry dispatcher run daemon drains the
outbox and (with the slack_api or discord_bot backend) also polls
the thread for replies. The dispatcher is lazy:
- Outbox drain is mtime-gated: when nothing has been queued, the dispatcher's loop costs one stat() call.
- Inbound polling is trigger-driven: the monitor touch()es .autosentry/inbox_poll_request on every detection fire, which is what wakes the dispatcher's Slack-thread poll. A long-period sweep (--idle-inbound-seconds 300) catches replies sent during quiet stretches.
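The mtime gate on the outbox can be sketched like this. It is an illustration, not the dispatcher's code: `OutboxDrain` is a hypothetical name, the record fields are made up, and the file size is compared alongside the mtime as a guard against coarse filesystem timestamps (an addition of this example).

```python
import json
from pathlib import Path


class OutboxDrain:
    """Read newly appended JSON lines, paying one stat() call when idle."""

    def __init__(self, path: Path):
        self.path = path
        self._signature = None  # (mtime_ns, size) seen on the last read
        self._offset = 0        # byte offset of the first unread line

    def poll(self) -> list:
        try:
            stat = self.path.stat()
        except FileNotFoundError:
            return []
        signature = (stat.st_mtime_ns, stat.st_size)
        if signature == self._signature:
            return []  # nothing queued since the last poll
        self._signature = signature
        with self.path.open("rb") as fh:
            fh.seek(self._offset)
            data = fh.read()
        head, sep, _partial = data.rpartition(b"\n")
        if not sep:
            return []  # no complete line yet; wait for the writer to finish
        self._offset += len(head) + 1
        return [json.loads(line) for line in head.split(b"\n") if line.strip()]
```

Only complete lines are consumed, so a writer caught mid-append is picked up on the next poll.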
This indirection mirrors the original rad_monitor.py and lets
autosentry run on machines without outbound network. The monitor
consumes slack_inbox.jsonl / discord_inbox.jsonl on its own tick
and applies recognized commands (abort, pause, resume,
set max_restarts N, approve, comment:) directly to the
supervised process.
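For illustration, the recognized commands could be parsed along these lines. This is a sketch, not autosentry's parser: `parse_command` is a hypothetical name, and case-insensitive matching and the returned tuple shape are assumptions.

```python
import re
from typing import Optional, Tuple

_SET_MAX = re.compile(r"^set\s+max_restarts\s+(\d+)$", re.IGNORECASE)
_BARE_COMMANDS = {"abort", "pause", "resume", "approve"}


def parse_command(text: str) -> Optional[Tuple[str, object]]:
    """Map an inbox message to (command, argument), or None if unrecognized."""
    stripped = text.strip()
    lowered = stripped.lower()
    if lowered in _BARE_COMMANDS:
        return (lowered, None)
    match = _SET_MAX.match(stripped)
    if match:
        return ("set_max_restarts", int(match.group(1)))
    if lowered.startswith("comment:"):
        return ("comment", stripped[len("comment:"):].strip())
    return None  # ordinary chatter in the thread is ignored
```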
Pick the backend with env vars or --backend:
| credentials present | backend chosen | inbound? |
|---|---|---|
| SLACK_BOT_TOKEN | slack_api | yes |
| DISCORD_BOT_TOKEN | discord_bot | yes |
| SLACK_WEBHOOK_URL | webhook | no |
| DISCORD_WEBHOOK_URL | discord_webhook | no |
| (none) | stdout | no |
Run a Slack daemon and a Discord daemon side by side — the dispatcher auto-namespaces its state/inbox/marker files per platform.
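The credential table can be read as a lookup. The sketch below assumes that when several credentials are present the table's row order decides, which the source does not state; `choose_backend` is a hypothetical name.

```python
import os
from typing import Mapping, Tuple

# (environment variable, backend, supports inbound), in table order.
_BACKENDS = (
    ("SLACK_BOT_TOKEN", "slack_api", True),
    ("DISCORD_BOT_TOKEN", "discord_bot", True),
    ("SLACK_WEBHOOK_URL", "webhook", False),
    ("DISCORD_WEBHOOK_URL", "discord_webhook", False),
)


def choose_backend(env: Mapping[str, str] = os.environ) -> Tuple[str, bool]:
    """Return (backend, inbound) for the first credential that is set."""
    for var, backend, inbound in _BACKENDS:
        if env.get(var):  # assumption: an empty value counts as unset
            return backend, inbound
    return "stdout", False
```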
Launching from your AI editor
autosentry skills install drops a /autosentry slash-command into
your repo for whichever AI editor you use. Once it's there, typing
/autosentry asks the agent to bootstrap autosentry, configure it
for your process, or walk you through the last incident — without
leaving your editor.
# install in just this repo (default)
autosentry skills install # all tools, /autosentry skill
autosentry skills install --tool claude # one tool
autosentry skills install --skill init # the focused /autosentry-init slash command
autosentry skills install --skill update # the focused /autosentry-update slash command
autosentry skills install --skill all # /autosentry + /autosentry-init + /autosentry-update
# install once, inherit everywhere
autosentry skills install --scope global # writes ~/.claude/commands/, ~/.codex/prompts/, ...
autosentry skills install --scope global --skill all
autosentry skills list # full destination table (local + global)
Three skills land here:
- /autosentry: full operator playbook (install → init → run → operate → interactive recovery).
- /autosentry-init: focused onboarding of a fresh repo (no operator/recovery content). Smaller reading cost for AI agents that only have one job.
- /autosentry-update: focused check-and-upgrade; runs autosentry update --check and applies the right backend (uv / pipx / pip / Homebrew) when you're behind.
Supported tools:
| tool | dropped at | invoke |
|---|---|---|
| Claude Code | .claude/commands/autosentry.md | /autosentry |
| OpenCode | .opencode/command/autosentry.md | /autosentry |
| OpenAI Codex CLI | .codex/prompts/autosentry.md | /autosentry |
| Gemini (Antigravity / CLI) | .gemini/commands/autosentry.toml | /autosentry |
| Cursor | .cursor/commands/autosentry.md | /autosentry |
| Aider | .aider.conf.yml (binds AGENTS.md) | ambient context |
| Continue.dev | .continue/config.json | /autosentry |
| Windsurf (Cascade) | .windsurfrules | ambient context |
| Zed | .zed/prompts/autosentry.md | /autosentry |
| Universal (any AGENTS.md-aware) | AGENTS.md at the repo root | auto-loaded |
All wrappers defer to AGENTS.md for the full playbook; the
single-source-of-truth for the agent's instructions lives there.
The skill prompt itself is canonical and lives at
src/autosentry/templates/skills/autosentry.md inside this repo. All
per-tool wrappers either embed it or reference it.
Update
autosentry update # update to latest stable
autosentry update --check # current vs latest; recommends how to upgrade
autosentry update --check --json # machine-readable: {"current","latest","is_outdated"}
autosentry update --pre # allow pre-releases
autosentry update auto-detects how it was installed — uv tool, pipx,
pip --user, or Homebrew — and runs the matching upgrade (brew upgrade autosentry for tap installs). --check caches the PyPI lookup for a day
(pass --no-cache to force a live query) and always exits 0, so the
/autosentry skill can run it on every invocation and nudge you when a newer
release is out.
Or use the standalone updater (works for installs made by install.sh even
when the CLI itself is broken):
curl -fsSL https://raw.githubusercontent.com/ulmentflam/autosentry/main/update.sh | sh
Status & roadmap
The package is 3 - Alpha on PyPI. Individual subsystems below are
labelled "stable" because the test suite pins their behavior and the
public CLI / YAML schemas are under semver discipline. Alpha applies
to the project shape: defaults may shift, less-trodden combinations
(SLURM + interactive Claude + Discord, say) have had hours of use
rather than weeks. Expect breaking changes in 0.x minor releases;
the CHANGELOG calls them out.
| capability | status |
|---|---|
| Local subprocess supervisor | stable |
| Pattern / traceback / stall / exit detectors | stable |
| YAML rule engine + Claude CLI fallback | stable |
| Tree-sitter source exploder (py/js/ts/go/rust/java) | stable |
| Incident store + structured logs + slack file-outbox | stable |
| install.sh one-liner | stable |
| autosentry update mechanism | stable |
| AI-editor skills (Claude/OpenCode/Codex/Gemini/Cursor) | stable |
| SLURM supervisor | stable |
| Docker supervisor | stable |
| Attach-to-PID supervisor | stable |
| Slack dispatcher daemon (outbound + inbound) | stable |
| autosentry watch status TUI | stable |
| autosentry web incident viewer | stable |
| Fix-branch isolation + outcome verification | stable |
| attempts.tsv ledger + autosentry analyze | stable |
| program.md operator mission statement | stable |
| PyPI release automation | planned |
| Slack interactive buttons (approve/abort UI) | planned |
Comparison
| tool | restarts | reads logs | rule recovery | LLM recovery | incident audit trail |
|---|---|---|---|---|---|
| supervisord | yes | no | no | no | log files |
| systemd | yes | no | no | no | journald |
| k8s livenessProbe | yes | no | no | no | events |
| runit | yes | no | no | no | log files |
| tini / dumb-init | partial | no | no | no | none |
| autosentry | yes | yes | yes (YAML) | yes (Claude) | folder-per-incident |
autosentry is not trying to replace your supervisor of record. It runs above one (or alongside it) and is designed for the long-running, high-friction-to-restart job: the multi-day training run, the slow nightly ETL, the batch service that can't easily be turned into a stateless k8s deployment.
FAQ
Isn't this just systemd + a cron?
For category (1) failures — yes, a Restart=on-failure unit covers
it. autosentry earns its keep when the cost of a bad restart is high
(re-warming a cache, re-loading model weights, re-running an
hour-long preprocessing stage) and when the failure mode isn't
"process exited non-zero" — a stalled training loop, a loss spike, a
silently-degraded throughput. systemd can't read your log stream,
match a regex against it, edit a config file, and verify the fix
didn't regress. autosentry runs above a supervisor of record, not
in place of it.
Why YAML for rules instead of code/Python?
Operators edit autosentry mid-incident, often from a phone over
Slack. YAML diff-reviews cleanly, survives a copy/paste into a chat
thread, and can't trigger arbitrary import-time side effects. The
escape hatch for genuine logic is action: { kind: custom_command }
— run a script, return an action. We took the same trade as
Kubernetes manifests for the same reason.
What if Claude makes a worse fix?
Two safeguards. First, every Claude edit lands on an isolated
autosentry/fix-<incident-id> branch — your working tree is
untouched until verification passes. Second, the verification window
restores the working tree and marks the attempt regressed if the same
detector re-fires inside healing.verify_window_seconds. The diff
stays on disk as a forensic artifact. You can also clamp Claude to
diagnose-only with healing.claude.command: ["claude", "--print", "--no-edit"].
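
The verification-window rule can be pictured with a small sketch. This is not autosentry's implementation: `FixAttempt`, `classify_attempt`, and the `verified` / `pending` labels are hypothetical names chosen for illustration. Only the idea that an attempt is marked regressed when the same detector re-fires inside `healing.verify_window_seconds` comes from the text above.

```python
from dataclasses import dataclass


@dataclass
class FixAttempt:
    """Hypothetical record of one applied fix (not autosentry's schema)."""
    detector: str      # name of the detector that triggered the fix
    applied_at: float  # epoch seconds when the fix was applied


def classify_attempt(attempt, refires, verify_window_seconds, now):
    """Classify a fix attempt against detector re-fires.

    refires is an iterable of (detector_name, timestamp) pairs.
    Returns "regressed" if the same detector fired again inside the
    window, "verified" once the window has elapsed cleanly, and
    "pending" while the window is still open.
    """
    deadline = attempt.applied_at + verify_window_seconds
    for name, fired_at in refires:
        same_detector = name == attempt.detector
        inside_window = attempt.applied_at <= fired_at <= deadline
        if same_detector and inside_window:
            return "regressed"
    return "verified" if now >= deadline else "pending"
```

In this sketch a re-fire from a different detector, or from the same detector after the window has closed, does not count against the attempt; whether autosentry treats those cases the same way is an assumption.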
Does autosentry need Claude installed?
You get the most out of autosentry with an agent — that's the
headline feature. With healing.claude.enabled: false (or simply
with no claude CLI on PATH when enabled: auto), autosentry falls
back to rules-only. doctor won't redline — rules-only is a
supported mode — but you've turned off the part that makes this
different from systemd. Useful on air-gapped boxes, in CI where the
rule set is well-tested, or when you want a hard "no LLM in the
loop" guarantee.
Does it support pre-existing log files?
Yes. Set process.kind: attach and supply process.extra.pid (or
pid_file) plus log_path. The monitor tails the file from EOF and
runs the same detector pipeline against it. It will not restart a
process it didn't start; abort is honoured only when
process.extra.allow_kill: true (sends SIGTERM to the watched PID),
and everything else falls back to pause / custom_command.
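
As a rough illustration of what "tails the file from EOF" means, here is a minimal sketch using only the standard library. The helper names `open_at_eof` and `read_new_lines` are hypothetical, and autosentry's real tailer may well differ (for example in how it handles log rotation, which this sketch ignores).

```python
import os


def open_at_eof(path):
    """Open a log file and skip everything already written to it."""
    fh = open(path, "rb")
    fh.seek(0, os.SEEK_END)
    return fh


def read_new_lines(fh):
    """Return the complete lines appended since the previous call.

    A trailing partial line (no newline yet) is left in the file and
    picked up on a later call, once the writer has finished it.
    """
    lines = []
    while True:
        pos = fh.tell()
        raw = fh.readline()
        if not raw:
            break  # nothing new
        if not raw.endswith(b"\n"):
            fh.seek(pos)  # partial line: rewind and wait for the rest
            break
        lines.append(raw.rstrip(b"\r\n").decode("utf-8", errors="replace"))
    return lines
```

Each returned line would then be handed to the same detectors a locally launched process gets; history written before the attach is never replayed.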
What happens to the process when the monitor itself crashes?
The supervised process keeps running — autosentry uses
start_new_session=True so it doesn't share a process group. On restart,
the monitor reads its persisted state and resumes. If the process is gone
by then, it starts a fresh one.
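
The effect of `start_new_session=True` can be checked with the standard library alone. The sketch below is POSIX-only and is not autosentry's launcher; `launch_detached` is a hypothetical helper name. It shows that a child started this way becomes the leader of its own session, so signals aimed at the launcher's process group do not reach it.

```python
import os
import subprocess
import sys


def launch_detached(command):
    """Start command in a new session (setsid) so it is not part of
    the launcher's process group or session."""
    return subprocess.Popen(command, start_new_session=True)


def is_session_leader(pid):
    """True if the process leads its own session (POSIX only)."""
    return os.getsid(pid) == pid
```

For example, `launch_detached([sys.executable, "train.py"])` returns a `Popen` handle whose `pid` satisfies `is_session_leader`; `train.py` here is a placeholder script name.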
Can rules call out to scripts?
Yes: action: { kind: custom_command, command: [...] }. The command runs in the
configured cwd with the configured env and returns an action.
Will Claude actually edit my code?
By default, yes. It runs in the configured cwd. Any file edits it makes
are captured as a diff in the incident folder so you can review and revert.
Set healing.claude.command to something more restrictive (e.g.
["claude", "--print", "--no-edit"]) if you'd rather Claude only diagnose.
Won't this double-restart with my existing supervisor?
In attach mode autosentry will never restart a process it didn't
start — restart actions raise. In local/slurm/docker mode
autosentry is the supervisor; don't also configure
Restart=on-failure for the same unit. Use one or the other.
What's the resource overhead?
The monitor is a single Python process that wakes on the
poll_interval_seconds tick (default 30s) and on each log line your
process writes. Steady-state memory is dominated by the loaded
tree-sitter grammars plus a bounded log-tail buffer. The dispatcher
is a separate, optional process and is mtime-gated — when no
notifications are queued, its loop is a single stat(). There's no
background polling of remote services and no persistent network
connection.
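
The mtime-gating idea can be sketched in a few lines. This is an illustration under assumptions, not the dispatcher's code: `MtimeGate` is a hypothetical class, and the outbox is assumed to be a single file whose modification time changes whenever a notification is queued.

```python
import os


class MtimeGate:
    """Report whether a file changed since the last check, at the
    cost of one stat() call per check."""

    def __init__(self, path):
        self.path = path
        self._last_mtime_ns = None  # None means "not seen yet / missing"

    def changed(self):
        try:
            mtime_ns = os.stat(self.path).st_mtime_ns
        except FileNotFoundError:
            mtime_ns = None
        if mtime_ns == self._last_mtime_ns:
            return False  # idle tick: nothing to read or parse
        self._last_mtime_ns = mtime_ns
        return True
```

A dispatcher loop built on this would open and parse the outbox only when `changed()` returns `True`, which is what keeps an idle tick down to a single `stat()`.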
iCloud Drive keeps breaking my venv.
See the install note. Either point the venv outside iCloud or
let the Makefile auto-redirect it.
Contributing
See CONTRIBUTING.md. The short version:
git clone https://github.com/ulmentflam/autosentry.git
cd autosentry
make install
make ci # ruff lint + format check + pyrefly + pytest
Open an issue first for non-trivial changes. New behavior needs a test.
License