Thin tmux sidecar supervisor for long-running AI coding agent workflows
thin-supervisor
Long-running AI coding tasks fail silently. The agent asks "should I continue?", you're not watching, and the task stalls. Or worse — the agent says "done" but didn't actually pass the tests.
thin-supervisor fixes this. It's an acceptance-centered run supervisor that sits alongside your existing coding agent (Claude Code, Codex, or any CLI agent), watches what the agent does, and makes structured decisions: continue, verify, retry, branch, escalate, or finish. "Done" means the verifier passed and the acceptance contract is satisfied — not that the agent said so. You stay in your familiar agent UI. The supervisor handles the rest.
Architecture deep-dive: See docs/ARCHITECTURE.md for the six-layer architecture, first-class objects, and design principles.
Docs hub:
- docs/getting-started.md — install and run tmux, open-relay, and JSONL workflows
- docs/ARCHITECTURE.md — object model, layers, and current implementation status
- CHANGELOG.md — release notes and unreleased changes
- docs/design/p2-external-surfaces.md — surface abstraction and roadmap
- docs/design/p3-observation-sources.md — observation sources and normalization
- docs/design/p4-jsonl-observation.md — transcript-backed observation mode
- docs/reviews/2026-04-11-deep-code-review.md — latest deep code review log and remaining-risk audit
- docs/reviews/2026-04-12-amp-supervisor-capability-review.md — Amp-vs-thin-supervisor capability review and oracle-layer roadmap
┌────────────────────────────┐  ┌──────────────────────────┐
│ Your Agent (visible pane)  │  │ Supervisor (sidecar)     │
│ Claude Code / Codex        │  │ reads pane output        │
│                            │  │ parses checkpoints       │
│ ... working ...            │  │ gates decisions          │
│                            │  │ runs verifiers           │
│ <checkpoint>               │──│ injects next step        │
│ status: step_done          │  │                          │
│ </checkpoint>              │  │ state: RUNNING → VERIFY  │
└────────────────────────────┘  └──────────────────────────┘
                        tmux session
When to use this
| Scenario | Without supervisor | With supervisor |
|---|---|---|
| 10-step implementation plan | Agent asks permission at every step | Runs to completion, verifies each step |
| Test-driven workflow | Agent says "done" without running tests | Verifier runs tests, rejects if failing |
| Agent asks "should I continue?" | You miss it, task stalls for hours | Supervisor auto-answers, keeps going |
| Dangerous operation detected | Agent proceeds silently | Supervisor escalates to you |
Core Concepts
Runtime Objects (stable)
| Object | Question it answers | What it is |
|---|---|---|
| WorkflowSpec | What should be done? | YAML task definition with steps, verification criteria, and finish policy |
| CheckpointEvent | What did the agent just report? | Structured status with seq tracking, evidence, and needs |
| SupervisorDecision | What does the control plane think? | Typed gate decision with confidence, reasoning, and causality link |
| HandoffInstruction | What should the agent do next? | Composed instruction with full traceability to the triggering decision |
| ExecutionSurface | How do we talk to the agent? | Protocol for read/inject/cwd — tmux, open-relay, and JSONL observation surfaces |
| SessionRun | Who is this run? | Identity + durable event history — survives crashes, enables recovery |
Emerging Architecture (implemented, maturing)
| Object | Purpose |
|---|---|
| AcceptanceContract | Defines "what counts as truly done" — required evidence, forbidden states, risk class, reviewer gating |
| WorkerProfile | Explicit worker capabilities — provider, model, trust level. Drives supervision intensity. |
| SupervisionPolicy | Three modes: strict_verifier (default) / collaborative_reviewer / directive_lead. Prevents a thin supervisor from micromanaging a strong worker. |
| RoutingDecision | Escalation routing — human, stronger reviewer, or alternate executor |
These form a causality chain: every instruction traces back to the decision that caused it, which traces back to the checkpoint that triggered it.
CheckpointEvent(seq=3) → SupervisorDecision(triggered_by_seq=3) → HandoffInstruction(triggered_by_decision=X)
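The chain above can be sketched as three linked records. This is a minimal illustration, not the project's actual classes: only `seq`, `triggered_by_seq`, and `triggered_by_decision` come from the text above, while `decision_id`, `kind`, `text`, and the `trace` helper are hypothetical names added for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckpointEvent:
    seq: int
    status: str


@dataclass(frozen=True)
class SupervisorDecision:
    decision_id: str       # hypothetical identifier for this sketch
    kind: str              # hypothetical label, e.g. "verify"
    triggered_by_seq: int  # link back to the checkpoint


@dataclass(frozen=True)
class HandoffInstruction:
    text: str                   # hypothetical field
    triggered_by_decision: str  # link back to the decision


def trace(instruction, decisions, checkpoints):
    """Walk instruction -> decision -> checkpoint.

    Raises KeyError if either link in the chain is missing.
    """
    decision = decisions[instruction.triggered_by_decision]
    checkpoint = checkpoints[decision.triggered_by_seq]
    return decision, checkpoint


checkpoints = {3: CheckpointEvent(seq=3, status="step_done")}
decisions = {"X": SupervisorDecision("X", "verify", triggered_by_seq=3)}
instruction = HandoffInstruction("run the verifier", triggered_by_decision="X")
decision, checkpoint = trace(instruction, decisions, checkpoints)
```

Keeping the links as plain identifiers rather than object references means the chain can be rebuilt from a serialized event log.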
Quick Start
Full guide: See docs/getting-started.md for step-by-step instructions covering tmux, open-relay, JSONL observation, and Codex/Claude/OpenCode/Droid workflows.
# Install
pip install thin-supervisor
# Install the Codex / Claude skills automatically when supported
thin-supervisor skill install
# Initialize in your project
cd your-project
thin-supervisor init
# If .supervisor/ exists but is missing config, repair the scaffold in place
thin-supervisor init --repair
# Write a spec (or let the Skill generate one)
cat > .supervisor/specs/my-plan.yaml << 'EOF'
kind: linear_plan
id: my_feature
goal: implement feature X with tests
approval:
  required: true
  status: draft
finish_policy:
  require_all_steps_done: true
  require_verification_pass: true
policy:
  default_continue: true
  max_retries_per_node: 3
steps:
  - id: write_tests
    type: task
    objective: write failing tests for feature X
    verify:
      - type: artifact
        path: tests/test_feature_x.py
        exists: true
  - id: implement
    type: task
    objective: implement feature X until tests pass
    verify:
      - type: command
        run: pytest -q tests/test_feature_x.py
        expect: pass
  - id: final_check
    type: task
    objective: run full test suite
    verify:
      - type: command
        run: pytest -q
        expect: pass
EOF
# Approve the draft spec, then attach
thin-supervisor spec approve --spec .supervisor/specs/my-plan.yaml --by human
scripts/thin-supervisor-attach.sh my-plan
Execution entry points reject draft specs. This is deliberate: the clarify/approve step is part of the contract.
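A rough sketch of what such a guard could look like, assuming the spec has already been loaded into a dictionary. The `approval.required` and `approval.status` keys and the `draft` value come from the example spec; the `approved` status value, the `DraftSpecError` class, and the `ensure_approved` function are assumptions made for illustration.

```python
class DraftSpecError(Exception):
    """Raised when a spec that requires approval has not been approved."""


def ensure_approved(spec: dict) -> None:
    """Refuse to execute a spec whose approval is required but not granted.

    Assumes the approved state is spelled "approved"; the source only
    shows the "draft" value.
    """
    approval = spec.get("approval", {})
    if approval.get("required", False) and approval.get("status") != "approved":
        raise DraftSpecError(
            f"spec {spec.get('id', '<unknown>')!r} is not approved "
            f"(status: {approval.get('status')!r})"
        )


draft = {"id": "my_feature", "approval": {"required": True, "status": "draft"}}
approved = {"id": "my_feature", "approval": {"required": True, "status": "approved"}}
ensure_approved(approved)  # passes silently
```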
What happens next
- Supervisor reads the agent's pane output every 2 seconds
- Agent emits a <checkpoint> block after completing work
- Supervisor parses the checkpoint and makes a gate decision:
  - Continue: agent is making progress, don't interrupt
  - Verify: agent says step is done, run the verifier
  - Retry: verification failed, inject retry instruction with failure details
  - Branch: decision node in workflow, select a path
  - Escalate: missing credentials, dangerous action, or low confidence; pause for human
  - Finish: all steps done, all verifiers pass, finish policy and review requirements satisfied
- If continuing or retrying, supervisor injects the next instruction into the pane
- Run-level decisions are logged to session_log.jsonl; project-level bootstrap and repair incidents are logged to .supervisor/runtime/ops_log.jsonl
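The deterministic part of that gate can be sketched as a small routing function. The `step_done`/`workflow_done` to `VERIFY_STEP` and `blocked` to `ESCALATE_TO_HUMAN` mappings are the ones the eval suites describe; the `CONTINUE` name, the fallback for unknown statuses, and the `after_verify` helper with its return labels are assumptions of this sketch, and the real gate also weighs confidence and branch nodes.

```python
from enum import Enum


class Gate(Enum):
    CONTINUE = "continue"  # hypothetical name
    VERIFY_STEP = "verify_step"
    ESCALATE_TO_HUMAN = "escalate_to_human"


def route(status: str) -> Gate:
    """Map a checkpoint status to a gate decision."""
    if status in ("step_done", "workflow_done"):
        return Gate.VERIFY_STEP
    if status == "blocked":
        return Gate.ESCALATE_TO_HUMAN
    if status == "working":
        return Gate.CONTINUE
    # Assumption: an unrecognized status is escalated rather than guessed at.
    return Gate.ESCALATE_TO_HUMAN


def after_verify(passed: bool, retries_used: int, max_retries_per_node: int = 3) -> str:
    """Decide what follows a verifier run (labels are illustrative)."""
    if passed:
        return "advance"
    if retries_used < max_retries_per_node:
        return "retry"
    return "escalate"
```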
Historical runs can now be turned into stable artifacts and reports:
thin-supervisor run export <run_id> > run.json
thin-supervisor run summarize <run_id> --json
thin-supervisor run replay <run_id> --json
thin-supervisor run postmortem <run_id>
run replay re-evaluates historical checkpoints with the current gate logic but does not inject or verify against live surfaces. run postmortem writes a markdown report under .supervisor/reports/ by default.
If your spec sets acceptance.must_review_by, the run pauses at the finish gate until someone acknowledges review:
thin-supervisor run review <run_id> --by human
# or
thin-supervisor run review <run_id> --by stronger_reviewer
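As a rough sketch, the finish gate combines the spec's finish_policy with any outstanding review requirement. The policy keys are taken from the example spec; treating must_review_by as a list of reviewer names, and the `can_finish` function with its argument shapes, are assumptions made for illustration.

```python
def can_finish(finish_policy, step_states, verifier_results,
               must_review_by=(), acknowledged_by=()):
    """Return (ok, reasons) for a finish attempt.

    step_states maps step id -> state string; verifier_results maps
    step id -> bool. Both shapes are hypothetical.
    """
    reasons = []
    if finish_policy.get("require_all_steps_done") and not all(
        state == "done" for state in step_states.values()
    ):
        reasons.append("not all steps are done")
    if finish_policy.get("require_verification_pass") and not all(
        verifier_results.values()
    ):
        reasons.append("a verifier has not passed")
    missing = [r for r in must_review_by if r not in acknowledged_by]
    if missing:
        reasons.append("awaiting review by: " + ", ".join(missing))
    return (not reasons, reasons)


policy = {"require_all_steps_done": True, "require_verification_pass": True}
```

Returning the reasons alongside the boolean makes it straightforward to surface why a run is still paused.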
When a run enters PAUSED_FOR_HUMAN, thin-supervisor now derives two user-facing fields:
- pause_reason: why the supervisor stopped
- next_action: the exact recovery command to run next
By default the daemon also emits pause notifications through two built-in channels:
- tmux_display: a tmux display-message alert on the supervised pane
- jsonl: durable records in .supervisor/runtime/notifications.jsonl
Pause handling is now also policy-driven:
- pause_handling_mode: notify_only, which notifies and remains paused
- pause_handling_mode: notify_then_ai, which notifies first, then lets the agent attempt an automatic recovery for selected cases such as blocked checkpoints, repeated node mismatch, or retry-budget exhaustion
The default is currently tuned for test periods:
pause_handling_mode: notify_then_ai
max_auto_interventions: 2
The default config now includes:
notification_channels:
- kind: tmux_display
- kind: jsonl
pause_handling_mode: notify_then_ai
max_auto_interventions: 2
Future delivery targets such as Feishu or Telegram plug into the same channel interface in supervisor/notifications.py.
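The interaction between pause_handling_mode and max_auto_interventions can be sketched as follows. The two mode names and the default budget of 2 come from the config above; the pause-kind labels, the action names, and the `handle_pause` function are hypothetical and only illustrate the decision.

```python
# Hypothetical labels for the "selected cases" eligible for auto-recovery.
AUTO_RECOVERABLE = {
    "blocked_checkpoint",
    "repeated_node_mismatch",
    "retry_budget_exhausted",
}


def handle_pause(mode, pause_kind, interventions_so_far, max_auto_interventions=2):
    """Return the ordered actions to take when a run pauses."""
    actions = ["notify"]  # both modes notify first
    eligible = (
        mode == "notify_then_ai"
        and pause_kind in AUTO_RECOVERABLE
        and interventions_so_far < max_auto_interventions
    )
    actions.append("auto_recover" if eligible else "stay_paused")
    return actions
```

Capping the number of automatic interventions keeps a run that cannot recover from looping indefinitely before a human is involved.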
Checkpoint Protocol
Agents must emit structured checkpoints for the supervisor to parse:
<checkpoint>
run_id: <run_id from thin-supervisor status>
checkpoint_seq: <incrementing integer, start from 1>
status: working | blocked | step_done | workflow_done
current_node: <step_id>
summary: <one-line description>
evidence:
- modified: <file path>
- ran: <command>
- result: <short result>
candidate_next_actions:
- <next action>
needs:
- none
question_for_supervisor:
- none
</checkpoint>
The Codex/Claude Code Skills teach agents this protocol automatically.
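For illustration, a supervisor-side parser for this block might look like the sketch below. It is not the project's parser: the regular expression, the function names, and the choice to keep list items as raw strings are all assumptions, and it handles only the flat layout shown above.

```python
import re

CHECKPOINT_RE = re.compile(r"<checkpoint>\s*\n(.*?)\n\s*</checkpoint>", re.DOTALL)
LIST_FIELDS = {"evidence", "candidate_next_actions", "needs", "question_for_supervisor"}


def parse_checkpoint(block: str) -> dict:
    """Parse the body of one checkpoint block into a dictionary."""
    fields = {}
    current_list = None
    for raw in block.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("- ") and current_list is not None:
            fields[current_list].append(line[2:].strip())
            continue
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines that are neither "key: value" nor list items
        key, value = key.strip(), value.strip()
        if key in LIST_FIELDS:
            fields[key] = []
            current_list = key
        else:
            fields[key] = value
            current_list = None
    if "checkpoint_seq" in fields:
        fields["checkpoint_seq"] = int(fields["checkpoint_seq"])
    return fields


def latest_checkpoint(pane_text: str):
    """Return the last complete checkpoint in captured output, or None."""
    blocks = CHECKPOINT_RE.findall(pane_text)
    return parse_checkpoint(blocks[-1]) if blocks else None
```

Taking the last complete block means a half-printed checkpoint at the bottom of the pane is ignored until its closing tag arrives.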
Verification Types
| Type | Fields | Description |
|---|---|---|
| command | run, expect | Run a shell command. expect: pass, fail, contains:<text> |
| artifact | path, exists | Check if a file exists |
| git | check, expect | Check git state (e.g., check: dirty, expect: false) |
| workflow | require_node_done | Check if current node is marked done |
All verifiers run in the agent's working directory (pane cwd), not the supervisor's.
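A minimal sketch of the command and artifact checks, run against a given working directory. The function names are hypothetical, and two behaviors are assumptions rather than documented facts: that pass/fail is judged by exit code, and that contains:<text> is matched against combined stdout and stderr.

```python
import os
import subprocess


def check_command(run: str, expect: str, cwd: str = ".") -> bool:
    """Run a shell command in cwd and compare the outcome to expect."""
    proc = subprocess.run(run, shell=True, cwd=cwd, capture_output=True, text=True)
    if expect == "pass":
        return proc.returncode == 0
    if expect == "fail":
        return proc.returncode != 0
    if expect.startswith("contains:"):
        needle = expect[len("contains:"):]
        return needle in proc.stdout + proc.stderr
    raise ValueError(f"unsupported expect value: {expect!r}")


def check_artifact(path: str, exists: bool = True, cwd: str = ".") -> bool:
    """Check that a path, relative to cwd, does or does not exist."""
    return os.path.exists(os.path.join(cwd, path)) == exists
```

Passing `cwd` explicitly mirrors the rule that verifiers run where the agent works, not where the supervisor was started.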
CLI
thin-supervisor init [--force|--repair] # Create or repair .supervisor/ directory
thin-supervisor deinit [--force] # Remove .supervisor/
thin-supervisor daemon start [--config <path>] # Start background daemon
thin-supervisor daemon stop # Stop daemon
thin-supervisor stop # Legacy alias for daemon stop
thin-supervisor run register --spec <spec> --pane <target> [--surface tmux|open_relay|jsonl]
thin-supervisor run foreground --spec <spec> --pane <target> [--surface ...]
thin-supervisor run stop <run_id>
thin-supervisor run resume --spec <spec> --pane <target> [--surface ...]
thin-supervisor run review <run_id> --by human|stronger_reviewer
thin-supervisor run export <run_id> [--output file]
thin-supervisor run summarize <run_id> [--json]
thin-supervisor run replay <run_id> [--json]
thin-supervisor run postmortem <run_id> [--output file]
thin-supervisor spec approve --spec <spec> [--by human]
thin-supervisor status # Active runs in current worktree
thin-supervisor list # Detailed active-run view
thin-supervisor ps # Registered daemons across worktrees
thin-supervisor pane-owner <pane> # Show which run owns a pane
thin-supervisor observe <run_id> # Read-only observation snapshot
thin-supervisor note add <text> [--type ...] [--run ...] # Shared notes for coordination
thin-supervisor note list [--type ...] [--run ...]
thin-supervisor session detect # Detect current agent session ID
thin-supervisor session jsonl # Resolve current transcript path
thin-supervisor session list # List recent sessions and cwd
thin-supervisor skill install # Install Codex / Claude skills
thin-supervisor bridge <action> [args] # tmux bridge operations
thin-supervisor is the runtime CLI. It is the only command family normal task users should need.
thin-supervisor-dev learn friction add --kind <kind> --message "..." [--run-id <run_id>] [--signal <signal>]
thin-supervisor-dev learn friction list [--run-id <run_id>] [--kind <kind>] [--json]
thin-supervisor-dev learn friction summarize [--run-id <run_id>] [--kind <kind>] [--json]
thin-supervisor-dev learn prefs set --key <key> --value <value>
thin-supervisor-dev learn prefs show [--json]
thin-supervisor-dev eval list
thin-supervisor-dev eval run [--suite approval-core|approval-adversarial|clarify-contract-core|routing-core|escalation-core|finish-gate-core|pause-ux-core] [--json]
thin-supervisor-dev eval replay --run-id <run_id> [--json]
thin-supervisor-dev eval compare --suite approval-core --candidate-policy <policy> [--json]
thin-supervisor-dev eval canary --run-id <run_id> [--run-id <run_id> ...] [--candidate-id <candidate_id>] [--phase shadow|limited] [--json]
thin-supervisor-dev eval expand --suite approval-core --output <path> [--variants-per-case 2]
thin-supervisor-dev eval propose --suite approval-core --objective <goal> [--json]
thin-supervisor-dev eval review-candidate --candidate-id <candidate_id> [--json]
thin-supervisor-dev eval candidate-status --candidate-id <candidate_id> [--json]
thin-supervisor-dev eval gate-candidate --candidate-id <candidate_id> [--run-id <run_id> ...] [--json]
thin-supervisor-dev eval promote-candidate --candidate-id <candidate_id> --approved-by <name> [--run-id <run_id> ...] [--json]
thin-supervisor-dev eval promotion-history [--json]
thin-supervisor-dev eval rollout-history [--candidate-id <candidate_id>] [--json]
thin-supervisor-dev oracle consult --question "..." [--file path ...]
thin-supervisor-dev is the devtime/operator CLI. Use it for local tuning, offline evals, candidate rollout, learning signals, and advisory second opinions. Do not expose it to normal runtime users.
Add --save-report to run, replay, compare, canary, propose, review-candidate, gate-candidate, or promote-candidate to persist a JSON report under .supervisor/evals/reports/. When used with eval propose, thin-supervisor-dev also writes a candidate-lineage manifest under .supervisor/evals/candidates/. The other candidate commands build on that manifest:
- eval review-candidate turns the manifest back into a bounded human review summary.
- eval candidate-status assembles the manifest, latest related reports, and promotion-registry state into one lifecycle dossier.
- eval gate-candidate combines compare plus optional canary signals into a promotion recommendation.
- eval promote-candidate records an approved promotion in .supervisor/evals/promotions.jsonl.
If a daemon-managed run pauses, status and list now show the human-readable reason and the suggested next command. For non-active persisted runs, the same hint appears under Local state found:.
thin-supervisor-dev eval is the first offline evaluation surface for the new skill-evolution work. Bundled suites now cover more than approval copy:
- approval-core checks explicit approval vs re-ask behavior.
- approval-adversarial covers tricky mixed signals and repeat-approval cases.
- clarify-contract-core checks whether the system locks the right delivery contract instead of silently narrowing “real UAT” work into a mock/dev baseline.
- routing-core checks deterministic step_done/workflow_done -> VERIFY_STEP routing.
- escalation-core checks blocked -> ESCALATE_TO_HUMAN.
- finish-gate-core checks reviewer and completion contracts.
- pause-ux-core checks externally visible pause/completion summaries.
The remaining eval subcommands build on those suites:
- thin-supervisor-dev eval replay --run-id ... wraps the existing history replay path into the same evaluation surface so policy candidates can be checked against real historical traces.
- thin-supervisor-dev eval compare ... adds a blind A/B-style comparator over deterministic suite results so baseline and candidate policies can be compared without hard-coding one output format into the report consumer.
- thin-supervisor-dev eval canary ... aggregates replay pass-rate, mismatch kinds, and friction over a set of real runs so shadow-canary promotion decisions become a command instead of a checklist. When you pass --candidate-id, the same command also records a rollout attempt under .supervisor/evals/rollouts.jsonl.
- thin-supervisor-dev eval expand ... generates provenance-tagged synthetic variants from the golden suite so coverage can grow without mutating the original contract set.
- thin-supervisor-dev eval propose ... is the first constrained candidate-generator surface: it summarizes failure cases, consults the advisory/self-review layer, recommends a policy candidate for a stated objective without automatically changing shipped defaults, and can persist a candidate-lineage manifest for later comparison and promotion review.
- thin-supervisor-dev eval review-candidate ... loads one of those manifests and emits the bounded human-review summary for the next promotion step.
- thin-supervisor-dev eval candidate-status ... turns the manifest, related eval reports, promotion-registry state, and recorded rollout attempts into one lifecycle dossier.
- thin-supervisor-dev eval rollout-history ... exposes the rollout ledger directly.
- thin-supervisor-dev eval gate-candidate ... combines the bounded review with deterministic compare output and optional real-run canary signals before a human decides whether to promote.
- thin-supervisor-dev eval promote-candidate ... records the human approval in the promotion registry so candidate history and current promoted policies are queryable later.
Real Canary Loop
Yes, you should run real canaries. A safe sequence is:
- Offline gate
  Run eval run, eval replay, eval compare, and optionally eval propose, all with --save-report.
- Shadow canary
  Pick 3-5 real tasks and keep the baseline behavior in charge. Record each finished run with:
  thin-supervisor run summarize <run_id>
  thin-supervisor run postmortem <run_id>
  thin-supervisor-dev eval replay --run-id <run_id> --save-report
  thin-supervisor-dev eval canary --run-id <run_id> ... --candidate-id <candidate_id> --phase shadow --save-report
  thin-supervisor-dev eval rollout-history --candidate-id <candidate_id> --json
- Limited rollout
  If shadow canary stays clean, run 10-20 real tasks with the candidate under close observation.
For each real canary, log friction explicitly when needed:
thin-supervisor-dev learn friction add \
--kind repeated_confirmation \
--message "user had to approve twice" \
--run-id <run_id> \
--signal user_repeated_approval
Then summarize what actually accumulated for a run:
thin-supervisor-dev learn friction summarize --run-id <run_id> --json
Bridge subcommands
thin-supervisor bridge read <pane> [lines] # Capture pane output
thin-supervisor bridge type <pane> <text> # Send text (no Enter)
thin-supervisor bridge keys <pane> <key>... # Send special keys
thin-supervisor bridge list # Show all panes
thin-supervisor bridge id # Current pane ID
thin-supervisor bridge doctor # Check tmux connectivity
Configuration
.supervisor/config.yaml:
surface_type: "tmux" # tmux | open_relay | jsonl
surface_target: "agent" # pane label / oly session ID / transcript path
poll_interval_sec: 2.0 # seconds between reads
read_lines: 100 # lines captured per read
# LLM judge (null = offline stub mode, rules-only)
judge_model: null # e.g., anthropic/claude-haiku-4-5-20251001
judge_temperature: 0.1
judge_max_tokens: 512
jsonl is observation-only: the supervisor can watch checkpoints from a transcript file, but instruction delivery still depends on the agent skill / hook path.
Override with environment variables: SUPERVISOR_SURFACE_TYPE, SUPERVISOR_SURFACE_TARGET, SUPERVISOR_PANE_TARGET, SUPERVISOR_JUDGE_MODEL, etc.
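The override order can be illustrated with a small loader. This is a sketch under assumptions: the mapping from each variable name to a config key is inferred from the names, the precedence (environment over file over defaults) is inferred from the word "override", and `load_config` is a hypothetical function. SUPERVISOR_PANE_TARGET is left out because its target key is not stated.

```python
import os

DEFAULTS = {
    "surface_type": "tmux",
    "surface_target": "agent",
    "poll_interval_sec": 2.0,
    "read_lines": 100,
    "judge_model": None,
}

# Assumed mapping; the variable names come from the README, the keys are inferred.
ENV_TO_KEY = {
    "SUPERVISOR_SURFACE_TYPE": "surface_type",
    "SUPERVISOR_SURFACE_TARGET": "surface_target",
    "SUPERVISOR_JUDGE_MODEL": "judge_model",
}


def load_config(file_values: dict, env=None) -> dict:
    """Merge defaults, file values, and environment overrides, in that order."""
    env = os.environ if env is None else env
    config = {**DEFAULTS, **file_values}
    for var, key in ENV_TO_KEY.items():
        if var in env:
            config[key] = env[var]
    return config


config = load_config({"surface_type": "tmux"}, env={"SUPERVISOR_SURFACE_TYPE": "jsonl"})
```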
Design Philosophy
Inspired by Anthropic's Scaling Managed Agents:
- The system's memory lives in SessionRun, not in the model's context. Crashes don't lose history. Everything is in session_log.jsonl.
- The execution surface is just a "hand", not the system. Today that includes tmux, open-relay, and transcript-backed JSONL observation. Tomorrow it could be a PTY wrapper or a remote session. The SessionAdapter protocol keeps the supervisor decoupled.
- Harnesses change, primitives don't. The current sidecar loop is one harness. The 6 first-class objects (WorkflowSpec, SessionRun, ExecutionSurface, CheckpointEvent, SupervisorDecision, HandoffInstruction) are the stable interface.
- Verification is deterministic, not verbal. "Done" means the verifier passed, not that the agent said so.
- Skill evolution happens from structured hindsight, not ad-hoc prompt edits. friction_events and user_preference_memory give the system a durable learning substrate. The intended loop is: capture friction -> summarize/postmortem -> replay/eval candidate policy changes -> update skills/rules only when the offline signal says they are better.
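The first principle, durable history outside the model's context, can be sketched as an append-only JSONL log. This is an illustration only: the event fields, the function names, and the choice to stop reading at a torn final line are assumptions, not the project's actual session_log.jsonl format.

```python
import json
import os


def append_event(path: str, event: dict) -> None:
    """Append one event as a JSON line and flush it to disk."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(event) + "\n")
        fh.flush()
        os.fsync(fh.fileno())


def load_events(path: str) -> list:
    """Read events back, tolerating a partial last line left by a crash."""
    events = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                events.append(json.loads(line))
            except json.JSONDecodeError:
                break  # assumption: a torn write can only be the final line
    return events
```

Because each event is a self-contained line, recovery after a crash is a matter of replaying the file from the top.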
Skill Integration
Install for Claude Code:
cp -r skills/thin-supervisor ~/.claude/skills/
Install for Codex:
cp -r skills/thin-supervisor-codex ~/.codex/skills/thin-supervisor
Invoke with /thin-supervisor to start the default flow:
- clarify ambiguous goals
- generate a draft spec
- wait for approval
- attach and execute only after approval
The skill is now split into two layers:
- frozen contract: skills/thin-supervisor*/references/contract.md
- optimizable strategy fragments under skills/thin-supervisor*/strategy/
Future policy optimization should target the strategy fragments, not the whole SKILL.md.
Oracle Consultation
If you want an Amp-style "oracle" second opinion without giving up supervisor control, use:
thin-supervisor-dev oracle consult \
--mode review \
--question "Review the retry policy design" \
--file supervisor/loop.py \
--file supervisor/gates/supervision_policy.py
When an external provider key is configured, thin-supervisor calls that provider as a read-only advisor. Without an external key, it falls back to a self-adversarial review scaffold instead of failing hard. Add --run <run_id> to persist the consultation into the shared notes plane for the active supervised run.
Development
git clone https://github.com/fakechris/thin-supervisor
cd thin-supervisor
pip install -e ".[dev]"
pytest -q
For repo-specific setup and examples, start with docs/getting-started.md.
License
MIT