Thin tmux sidecar supervisor for long-running AI coding agent workflows

thin-supervisor

Long-running AI coding tasks fail silently. The agent asks "should I continue?", you're not watching, and the task stalls. Or worse — the agent says "done" but didn't actually pass the tests.

thin-supervisor fixes this. It's an acceptance-centered run supervisor that sits alongside your existing coding agent (Claude Code, Codex, or any CLI agent), watches what the agent does, and makes structured decisions: continue, verify, retry, branch, escalate, or finish. "Done" means the verifier passed and the acceptance contract is satisfied — not that the agent said so. You stay in your familiar agent UI. The supervisor handles the rest.

Architecture deep-dive: See docs/ARCHITECTURE.md for the six-layer architecture, first-class objects, and design principles.

Docs hub:

┌────────────────────────────┐  ┌──────────────────────────┐
│  Your Agent (visible pane) │  │  Supervisor (sidecar)    │
│  Claude Code / Codex       │  │  reads pane output       │
│                            │  │  parses checkpoints      │
│  ... working ...           │  │  gates decisions         │
│                            │  │  runs verifiers          │
│  <checkpoint>              │──│  injects next step       │
│  status: step_done         │  │                          │
│  </checkpoint>             │  │  state: RUNNING → VERIFY │
└────────────────────────────┘  └──────────────────────────┘
                 tmux session

When to use this

| Scenario | Without supervisor | With supervisor |
|---|---|---|
| 10-step implementation plan | Agent asks permission at every step | Runs to completion, verifies each step |
| Test-driven workflow | Agent says "done" without running tests | Verifier runs tests, rejects if failing |
| Agent asks "should I continue?" | You miss it, task stalls for hours | Supervisor auto-answers, keeps going |
| Dangerous operation detected | Agent proceeds silently | Supervisor escalates to you |

Core Concepts

Runtime Objects (stable)

| Object | Question it answers | What it is |
|---|---|---|
| WorkflowSpec | What should be done? | YAML task definition with steps, verification criteria, and finish policy |
| CheckpointEvent | What did the agent just report? | Structured status with seq tracking, evidence, and needs |
| SupervisorDecision | What does the control plane think? | Typed gate decision with confidence, reasoning, and causality link |
| HandoffInstruction | What should the agent do next? | Composed instruction with full traceability to the triggering decision |
| ExecutionSurface | How do we talk to the agent? | Protocol for read/inject/cwd: tmux, open-relay, and JSONL observation surfaces |
| SessionRun | Who is this run? | Identity plus durable event history; survives crashes, enables recovery |
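
To make the ExecutionSurface row concrete, here is a minimal sketch of what a read/inject/cwd protocol could look like. The method names and signatures are assumptions inferred from the table, and InMemorySurface is a hypothetical test double, not a class shipped by thin-supervisor.

```python
from typing import Protocol


class ExecutionSurface(Protocol):
    """Assumed shape of the read/inject/cwd protocol (names are illustrative)."""

    def read(self, lines: int = 100) -> str: ...
    def inject(self, text: str) -> None: ...
    def cwd(self) -> str: ...


class InMemorySurface:
    """Hypothetical stand-in for a tmux pane, useful for tests."""

    def __init__(self, cwd: str = "/tmp/project") -> None:
        self._cwd = cwd
        self._buffer: list[str] = []

    def read(self, lines: int = 100) -> str:
        # Return at most the last `lines` lines, like a pane capture would.
        return "\n".join(self._buffer[-lines:])

    def inject(self, text: str) -> None:
        self._buffer.append(text)

    def cwd(self) -> str:
        return self._cwd


def tail(surface: ExecutionSurface, lines: int) -> str:
    """Works with any object that structurally satisfies the protocol."""
    return surface.read(lines)


surface = InMemorySurface()
surface.inject("step 1")
surface.inject("step 2")
last_line = tail(surface, 1)
```

Because the protocol is structural, a tmux-backed or transcript-backed surface could be swapped in without changing the code that calls it.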

Emerging Architecture (implemented, maturing)

| Object | Purpose |
|---|---|
| AcceptanceContract | Defines "what counts as truly done": required evidence, forbidden states, risk class, reviewer gating |
| WorkerProfile | Explicit worker capabilities: provider, model, trust level. Drives supervision intensity. |
| SupervisionPolicy | Three modes: strict_verifier (default) / collaborative_reviewer / directive_lead. Prevents a thin supervisor from micromanaging a strong worker. |
| RoutingDecision | Escalation routing: human, stronger reviewer, or alternate executor |

These form a causality chain: every instruction traces back to the decision that caused it, which traces back to the checkpoint that triggered it.

CheckpointEvent(seq=3) → SupervisorDecision(triggered_by_seq=3) → HandoffInstruction(triggered_by_decision=X)
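
The chain above can be sketched with three small records. Only seq, triggered_by_seq, and triggered_by_decision come from the line above; decision_id, status, decision, and text are hypothetical fields added for illustration, and the real objects likely carry more.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckpointEvent:
    seq: int
    status: str  # hypothetical field for this sketch


@dataclass(frozen=True)
class SupervisorDecision:
    decision_id: str  # hypothetical identifier
    triggered_by_seq: int
    decision: str


@dataclass(frozen=True)
class HandoffInstruction:
    triggered_by_decision: str
    text: str


def trace(instruction, decisions, checkpoints):
    """Walk an instruction back to its decision and the checkpoint behind it."""
    by_id = {d.decision_id: d for d in decisions}
    by_seq = {c.seq: c for c in checkpoints}
    decision = by_id[instruction.triggered_by_decision]
    checkpoint = by_seq[decision.triggered_by_seq]
    return checkpoint, decision, instruction


checkpoints = [CheckpointEvent(seq=3, status="step_done")]
decisions = [SupervisorDecision("X", triggered_by_seq=3, decision="VERIFY_STEP")]
instruction = HandoffInstruction(triggered_by_decision="X", text="run the verifier")
chain = trace(instruction, decisions, checkpoints)
```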

Quick Start

Full guide: See docs/getting-started.md for step-by-step instructions covering tmux, open-relay, JSONL observation, and Codex/Claude/OpenCode/Droid workflows.

# Install
pip install thin-supervisor

# Install the Codex / Claude skills automatically when supported
thin-supervisor skill install

# Initialize in your project
cd your-project
thin-supervisor init
# If .supervisor/ exists but is missing config, repair the scaffold in place
thin-supervisor init --repair

# Write a spec (or let the Skill generate one)
cat > .supervisor/specs/my-plan.yaml << 'EOF'
kind: linear_plan
id: my_feature
goal: implement feature X with tests
approval:
  required: true
  status: draft
finish_policy:
  require_all_steps_done: true
  require_verification_pass: true
policy:
  default_continue: true
  max_retries_per_node: 3

steps:
  - id: write_tests
    type: task
    objective: write failing tests for feature X
    verify:
      - type: artifact
        path: tests/test_feature_x.py
        exists: true

  - id: implement
    type: task
    objective: implement feature X until tests pass
    verify:
      - type: command
        run: pytest -q tests/test_feature_x.py
        expect: pass

  - id: final_check
    type: task
    objective: run full test suite
    verify:
      - type: command
        run: pytest -q
        expect: pass
EOF

# Approve the draft spec, then attach
thin-supervisor spec approve --spec .supervisor/specs/my-plan.yaml --by human
scripts/thin-supervisor-attach.sh my-plan

Execution entry points reject draft specs. This is deliberate: the clarify/approve step is part of the contract.
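
As a rough illustration of that rule, an entry point could check the approval block before doing anything else. This is a sketch under assumptions: the spec is shown as an already-parsed dict, DraftSpecError and ensure_approved are hypothetical names, and "approved" as the post-approval status value is a guess rather than something the spec format above confirms.

```python
class DraftSpecError(RuntimeError):
    """Hypothetical error raised when a spec has not been approved."""


def ensure_approved(spec: dict) -> None:
    approval = spec.get("approval", {})
    # Assumption: `spec approve` flips status from "draft" to "approved".
    if approval.get("required", False) and approval.get("status") != "approved":
        raise DraftSpecError(
            f"spec {spec.get('id', '<unknown>')!r} is not approved "
            f"(status={approval.get('status')!r})"
        )


draft = {"id": "my_feature", "approval": {"required": True, "status": "draft"}}
approved = {"id": "my_feature", "approval": {"required": True, "status": "approved"}}

try:
    ensure_approved(draft)
    draft_rejected = False
except DraftSpecError:
    draft_rejected = True

ensure_approved(approved)  # does not raise
```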

What happens next

  1. Supervisor reads the agent's pane output every 2 seconds
  2. Agent emits a <checkpoint> block after completing work
  3. Supervisor parses the checkpoint and makes a gate decision:
    • Continue — agent is making progress, don't interrupt
    • Verify — agent says step is done, run the verifier
    • Retry — verification failed, inject retry instruction with failure details
    • Branch — decision node in workflow, select a path
    • Escalate — missing credentials, dangerous action, or low confidence — pause for human
    • Finish — all steps done, all verifiers pass, finish policy and review requirements satisfied
  4. If continuing or retrying, supervisor injects the next instruction into the pane
  5. Run-level decisions are logged to session_log.jsonl; project-level bootstrap and repair incidents are logged to .supervisor/runtime/ops_log.jsonl
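
The gate step can be pictured as two small rule functions. This is a simplified, rules-only sketch, not the shipped gate logic: the Decision enum and function names are hypothetical, Branch is omitted, and the real Finish gate also checks the finish policy and review requirements.

```python
from enum import Enum


class Decision(Enum):
    CONTINUE = "CONTINUE"
    VERIFY_STEP = "VERIFY_STEP"
    RETRY = "RETRY"
    ESCALATE_TO_HUMAN = "ESCALATE_TO_HUMAN"
    FINISH = "FINISH"


def gate_checkpoint(status: str) -> Decision:
    """Map a checkpoint status to the first gate decision."""
    if status == "working":
        return Decision.CONTINUE
    if status in ("step_done", "workflow_done"):
        return Decision.VERIFY_STEP
    # "blocked" and anything unrecognized pause for a human (assumption).
    return Decision.ESCALATE_TO_HUMAN


def gate_verification(passed: bool, retries_used: int, max_retries: int,
                      is_last_step: bool) -> Decision:
    """Decide what follows a verifier run for the current step."""
    if passed:
        return Decision.FINISH if is_last_step else Decision.CONTINUE
    if retries_used < max_retries:
        return Decision.RETRY
    # Retry budget exhausted: stop and ask a human.
    return Decision.ESCALATE_TO_HUMAN
```

With max_retries_per_node: 3 from the example spec, a step that fails verification is retried three times before the sketch escalates.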

Historical runs can now be turned into stable artifacts and reports:

thin-supervisor run export <run_id> > run.json
thin-supervisor run summarize <run_id> --json
thin-supervisor run replay <run_id> --json
thin-supervisor run postmortem <run_id>

run replay re-evaluates historical checkpoints with the current gate logic but does not inject or verify against live surfaces. run postmortem writes a markdown report under .supervisor/reports/ by default.

If your spec sets acceptance.must_review_by, the run pauses at the finish gate until someone acknowledges review:

thin-supervisor run review <run_id> --by human
# or
thin-supervisor run review <run_id> --by stronger_reviewer

When a run enters PAUSED_FOR_HUMAN, thin-supervisor now derives two user-facing fields:

  • pause_reason — why the supervisor stopped
  • next_action — the exact recovery command to run next

By default the daemon also emits pause notifications through two built-in channels:

  • tmux_display — a tmux display-message alert on the supervised pane
  • jsonl — durable records in .supervisor/runtime/notifications.jsonl

Pause handling is now also policy-driven:

  • pause_handling_mode: notify_only — notify and remain paused
  • pause_handling_mode: notify_then_ai — notify first, then let the agent attempt an automatic recovery for selected cases such as blocked checkpoints, repeated node mismatch, or retry-budget exhaustion

The default config is currently tuned for test periods and includes:

notification_channels:
  - kind: tmux_display
  - kind: jsonl
pause_handling_mode: notify_then_ai
max_auto_interventions: 2

Future delivery targets such as Feishu or Telegram plug into the same channel interface in supervisor/notifications.py.
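
One way to read the pause policy is as a small decision function plus a JSONL record writer. The sketch below is illustrative only: plan_pause_actions, notification_record, and the action labels are hypothetical, and the record layout is an assumption built around the pause_reason and next_action fields described above.

```python
import json

VALID_MODES = ("notify_only", "notify_then_ai")


def plan_pause_actions(mode: str, auto_interventions_used: int,
                       max_auto_interventions: int = 2) -> list[str]:
    """Return the ordered actions to take when a run pauses."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown pause_handling_mode: {mode!r}")
    actions = ["notify"]  # notification always comes first
    if mode == "notify_then_ai" and auto_interventions_used < max_auto_interventions:
        actions.append("auto_recover")
    else:
        actions.append("remain_paused")
    return actions


def notification_record(run_id: str, pause_reason: str, next_action: str) -> str:
    """Build one JSONL line; the field layout is assumed, not the real schema."""
    return json.dumps(
        {"run_id": run_id, "pause_reason": pause_reason, "next_action": next_action},
        sort_keys=True,
    )


first = plan_pause_actions("notify_then_ai", auto_interventions_used=0)
exhausted = plan_pause_actions("notify_then_ai", auto_interventions_used=2)
record = notification_record("run_abc", "retry budget exhausted",
                             "thin-supervisor run resume --spec <spec> --pane <target>")
```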

Checkpoint Protocol

Agents must emit structured checkpoints for the supervisor to parse:

<checkpoint>
run_id: <run_id from thin-supervisor status>
checkpoint_seq: <incrementing integer, start from 1>
status: working | blocked | step_done | workflow_done
current_node: <step_id>
summary: <one-line description>
evidence:
  - modified: <file path>
  - ran: <command>
  - result: <short result>
candidate_next_actions:
  - <next action>
needs:
  - none
question_for_supervisor:
  - none
</checkpoint>

The Codex/Claude Code Skills teach agents this protocol automatically.
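
For intuition, the block format can be parsed with a short regex-based reader. This is a minimal sketch, not the supervisor's parser: parse_checkpoints is a hypothetical name, list fields are kept as plain strings, and the sample values (run_abc, 4 passed) are made up for the example.

```python
import re

CHECKPOINT_RE = re.compile(r"<checkpoint>\s*\n(.*?)\n\s*</checkpoint>", re.DOTALL)
LIST_FIELDS = {"evidence", "candidate_next_actions", "needs", "question_for_supervisor"}


def parse_checkpoints(pane_text: str) -> list[dict]:
    """Extract every well-formed <checkpoint> block from captured output."""
    checkpoints = []
    for match in CHECKPOINT_RE.finditer(pane_text):
        fields: dict = {}
        current_list = None
        for raw in match.group(1).splitlines():
            line = raw.strip()
            if not line:
                continue
            if line.startswith("- ") and current_list is not None:
                current_list.append(line[2:].strip())
                continue
            key, sep, value = line.partition(":")
            if not sep:
                continue  # not a "key: value" line; ignore it
            key, value = key.strip(), value.strip()
            if key in LIST_FIELDS:
                current_list = fields.setdefault(key, [])
                if value:
                    current_list.append(value)
            else:
                current_list = None
                fields[key] = value
        try:
            fields["checkpoint_seq"] = int(fields["checkpoint_seq"])
        except (KeyError, ValueError):
            continue  # skip blocks without a usable sequence number
        checkpoints.append(fields)
    return checkpoints


sample = """... working ...
<checkpoint>
run_id: run_abc
checkpoint_seq: 3
status: step_done
current_node: implement
summary: tests pass
evidence:
  - ran: pytest -q tests/test_feature_x.py
  - result: 4 passed
needs:
  - none
</checkpoint>
"""
parsed = parse_checkpoints(sample)
```

Tracking checkpoint_seq as an integer lets a caller ignore blocks it has already seen when the same pane text is read on consecutive polls.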

Verification Types

| Type | Fields | Description |
|---|---|---|
| command | run, expect | Run a shell command. expect: pass, fail, contains:<text> |
| artifact | path, exists | Check if a file exists |
| git | check, expect | Check git state (e.g., check: dirty, expect: false) |
| workflow | require_node_done | Check if current node is marked done |

All verifiers run in the agent's working directory (pane cwd), not the supervisor's.
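
The command and artifact verifiers could be approximated as follows. This is a sketch under assumptions: verify_command and verify_artifact are hypothetical names, "pass" is taken to mean exit code 0, and contains:<text> is checked against combined stdout and stderr regardless of exit code, which the table above does not specify.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def verify_command(run: str, expect: str, cwd: str) -> bool:
    """Run a shell command in the agent's working directory and check it."""
    proc = subprocess.run(run, shell=True, cwd=cwd, capture_output=True, text=True)
    if expect == "pass":
        return proc.returncode == 0
    if expect == "fail":
        return proc.returncode != 0
    if expect.startswith("contains:"):
        return expect[len("contains:"):] in (proc.stdout + proc.stderr)
    raise ValueError(f"unsupported expect value: {expect!r}")


def verify_artifact(path: str, exists: bool, cwd: str) -> bool:
    """Check file presence relative to the agent's working directory."""
    return (Path(cwd) / path).exists() == exists


with tempfile.TemporaryDirectory() as workdir:
    Path(workdir, "report.txt").write_text("ok", encoding="utf-8")
    py = f'"{sys.executable}"'  # quoted in case the path contains spaces
    results = {
        "pass": verify_command(f'{py} -c "print(1)"', "pass", workdir),
        "fail": verify_command(f'{py} -c "raise SystemExit(1)"', "fail", workdir),
        "contains": verify_command(f'{py} -c "print(42)"', "contains:42", workdir),
        "artifact": verify_artifact("report.txt", True, workdir),
        "missing": verify_artifact("nope.txt", True, workdir),
    }
```

Passing cwd explicitly mirrors the rule that verifiers run where the agent works, so relative paths in a spec resolve against the pane's directory.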

CLI

thin-supervisor init [--force|--repair]                   # Create or repair .supervisor/ directory
thin-supervisor deinit [--force]                           # Remove .supervisor/

thin-supervisor daemon start [--config <path>]             # Start background daemon
thin-supervisor daemon stop                                # Stop daemon
thin-supervisor stop                                       # Legacy alias for daemon stop

thin-supervisor run register --spec <spec> --pane <target> [--surface tmux|open_relay|jsonl]
thin-supervisor run foreground --spec <spec> --pane <target> [--surface ...]
thin-supervisor run stop <run_id>
thin-supervisor run resume --spec <spec> --pane <target> [--surface ...]
thin-supervisor run review <run_id> --by human|stronger_reviewer
thin-supervisor run export <run_id> [--output file]
thin-supervisor run summarize <run_id> [--json]
thin-supervisor run replay <run_id> [--json]
thin-supervisor run postmortem <run_id> [--output file]
thin-supervisor spec approve --spec <spec> [--by human]

thin-supervisor status                                     # Active runs in current worktree
thin-supervisor list                                       # Detailed active-run view
thin-supervisor ps                                         # Registered daemons across worktrees
thin-supervisor pane-owner <pane>                          # Show which run owns a pane
thin-supervisor observe <run_id>                           # Read-only observation snapshot
thin-supervisor note add <text> [--type ...] [--run ...]  # Shared notes for coordination
thin-supervisor note list [--type ...] [--run ...]

thin-supervisor session detect                             # Detect current agent session ID
thin-supervisor session jsonl                              # Resolve current transcript path
thin-supervisor session list                               # List recent sessions and cwd

thin-supervisor skill install                              # Install Codex / Claude skills
thin-supervisor bridge <action> [args]                     # tmux bridge operations

thin-supervisor is the runtime CLI. It is the only command family normal task users should need.

thin-supervisor-dev learn friction add --kind <kind> --message "..." [--run-id <run_id>] [--signal <signal>]
thin-supervisor-dev learn friction list [--run-id <run_id>] [--kind <kind>] [--json]
thin-supervisor-dev learn friction summarize [--run-id <run_id>] [--kind <kind>] [--json]
thin-supervisor-dev learn prefs set --key <key> --value <value>
thin-supervisor-dev learn prefs show [--json]
thin-supervisor-dev eval list
thin-supervisor-dev eval run [--suite approval-core|approval-adversarial|clarify-contract-core|routing-core|escalation-core|finish-gate-core|pause-ux-core] [--json]
thin-supervisor-dev eval replay --run-id <run_id> [--json]
thin-supervisor-dev eval compare --suite approval-core --candidate-policy <policy> [--json]
thin-supervisor-dev eval canary --run-id <run_id> [--run-id <run_id> ...] [--candidate-id <candidate_id>] [--phase shadow|limited] [--json]
thin-supervisor-dev eval expand --suite approval-core --output <path> [--variants-per-case 2]
thin-supervisor-dev eval propose --suite approval-core --objective <goal> [--json]
thin-supervisor-dev eval review-candidate --candidate-id <candidate_id> [--json]
thin-supervisor-dev eval candidate-status --candidate-id <candidate_id> [--json]
thin-supervisor-dev eval gate-candidate --candidate-id <candidate_id> [--run-id <run_id> ...] [--json]
thin-supervisor-dev eval promote-candidate --candidate-id <candidate_id> --approved-by <name> [--run-id <run_id> ...] [--json]
thin-supervisor-dev eval promotion-history [--json]
thin-supervisor-dev eval rollout-history [--candidate-id <candidate_id>] [--json]
thin-supervisor-dev oracle consult --question "..." [--file path ...]

thin-supervisor-dev is the devtime/operator CLI. Use it for local tuning, offline evals, candidate rollout, learning signals, and advisory second opinions. Do not expose it to normal runtime users.

Add --save-report to run, replay, compare, canary, propose, review-candidate, gate-candidate, or promote-candidate to persist a JSON report under .supervisor/evals/reports/. Related artifacts:

  • eval propose (with --save-report) also writes a candidate-lineage manifest under .supervisor/evals/candidates/
  • eval review-candidate turns that manifest back into a bounded human review summary
  • eval candidate-status assembles the manifest, latest related reports, and promotion-registry state into one lifecycle dossier
  • eval gate-candidate combines compare plus optional canary signals into a promotion recommendation
  • eval promote-candidate records an approved promotion in .supervisor/evals/promotions.jsonl

If a daemon-managed run pauses, status and list now show the human-readable reason and the suggested next command. For non-active persisted runs, the same hint appears under Local state found:.

thin-supervisor-dev eval is the first offline evaluation surface for the new skill-evolution work. Bundled suites now cover more than approval copy:

  • approval-core checks explicit approval vs re-ask behavior
  • approval-adversarial covers tricky mixed signals and repeat-approval cases
  • clarify-contract-core checks whether the system locks the right delivery contract instead of silently narrowing “real UAT” work into a mock/dev baseline
  • routing-core checks deterministic step_done/workflow_done -> VERIFY_STEP routing
  • escalation-core checks blocked -> ESCALATE_TO_HUMAN
  • finish-gate-core checks reviewer and completion contracts
  • pause-ux-core checks externally visible pause/completion summaries

The other eval subcommands build on the same surface:

  • thin-supervisor-dev eval replay --run-id ... wraps the existing history replay path into the same evaluation surface so policy candidates can be checked against real historical traces.
  • thin-supervisor-dev eval compare ... adds a blind A/B-style comparator over deterministic suite results so baseline and candidate policies can be compared without hard-coding one output format into the report consumer.
  • thin-supervisor-dev eval canary ... aggregates replay pass-rate, mismatch kinds, and friction over a set of real runs so shadow-canary promotion decisions become a command instead of a checklist. When you pass --candidate-id, the same command also records a rollout attempt under .supervisor/evals/rollouts.jsonl.
  • thin-supervisor-dev eval expand ... generates provenance-tagged synthetic variants from the golden suite so coverage can grow without mutating the original contract set.
  • thin-supervisor-dev eval propose ... is the first constrained candidate-generator surface: it summarizes failure cases, consults the advisory/self-review layer, recommends a policy candidate for a stated objective without automatically changing shipped defaults, and can persist a candidate-lineage manifest for later comparison and promotion review.
  • thin-supervisor-dev eval review-candidate ... loads one of those manifests and emits the bounded human-review summary for the next promotion step.
  • thin-supervisor-dev eval candidate-status ... turns the manifest, related eval reports, promotion-registry state, and recorded rollout attempts into one lifecycle dossier.
  • thin-supervisor-dev eval rollout-history ... exposes the rollout ledger directly.
  • thin-supervisor-dev eval gate-candidate ... combines that bounded review with deterministic compare output and optional real-run canary signals before a human decides whether to promote.
  • thin-supervisor-dev eval promote-candidate ... records that approval in the promotion registry so candidate history and current promoted policies are queryable later.

Real Canary Loop

Real canaries are worth running before a candidate is promoted. A safe sequence is:

  1. Offline gate: run eval run, eval replay, eval compare, and optionally eval propose, all with --save-report.
  2. Shadow canary: pick 3-5 real tasks and keep the baseline behavior in charge. Record each finished run with:

     thin-supervisor run summarize <run_id>
     thin-supervisor run postmortem <run_id>
     thin-supervisor-dev eval replay --run-id <run_id> --save-report
     thin-supervisor-dev eval canary --run-id <run_id> ... --candidate-id <candidate_id> --phase shadow --save-report
     thin-supervisor-dev eval rollout-history --candidate-id <candidate_id> --json

  3. Limited rollout: if shadow canary stays clean, run 10-20 real tasks with the candidate under close observation.

For each real canary, log friction explicitly when needed:

thin-supervisor-dev learn friction add \
  --kind repeated_confirmation \
  --message "user had to approve twice" \
  --run-id <run_id> \
  --signal user_repeated_approval

Then summarize what actually accumulated for a run:

thin-supervisor-dev learn friction summarize --run-id <run_id> --json

Bridge subcommands

thin-supervisor bridge read <pane> [lines]   # Capture pane output
thin-supervisor bridge type <pane> <text>     # Send text (no Enter)
thin-supervisor bridge keys <pane> <key>...   # Send special keys
thin-supervisor bridge list                   # Show all panes
thin-supervisor bridge id                     # Current pane ID
thin-supervisor bridge doctor                 # Check tmux connectivity

Configuration

.supervisor/config.yaml:

surface_type: "tmux"              # tmux | open_relay | jsonl
surface_target: "agent"           # pane label / oly session ID / transcript path
poll_interval_sec: 2.0            # seconds between reads
read_lines: 100                   # lines captured per read

# LLM judge (null = offline stub mode, rules-only)
judge_model: null                 # e.g., anthropic/claude-haiku-4-5-20251001
judge_temperature: 0.1
judge_max_tokens: 512

jsonl is observation-only: the supervisor can watch checkpoints from a transcript file, but instruction delivery still depends on the agent skill / hook path.

Override with environment variables: SUPERVISOR_SURFACE_TYPE, SUPERVISOR_SURFACE_TARGET, SUPERVISOR_PANE_TARGET, SUPERVISOR_JUDGE_MODEL, etc.
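
The override order can be sketched as defaults, then file values, then environment variables. This is an illustrative sketch: load_config is a hypothetical helper, the mapping covers only the variables whose target key is obvious from the names above, and SUPERVISOR_PANE_TARGET is left out because its config key is not stated here.

```python
import os

DEFAULTS = {
    "surface_type": "tmux",
    "surface_target": "agent",
    "poll_interval_sec": 2.0,
    "read_lines": 100,
    "judge_model": None,
}

# Assumed env-to-key mapping, inferred from the variable names.
ENV_OVERRIDES = {
    "SUPERVISOR_SURFACE_TYPE": "surface_type",
    "SUPERVISOR_SURFACE_TARGET": "surface_target",
    "SUPERVISOR_JUDGE_MODEL": "judge_model",
}


def load_config(file_values: dict, environ=None) -> dict:
    """Merge defaults, config-file values, and environment overrides."""
    environ = os.environ if environ is None else environ
    config = {**DEFAULTS, **file_values}
    for env_name, key in ENV_OVERRIDES.items():
        if env_name in environ:
            config[key] = environ[env_name]  # environment wins over the file
    return config


config = load_config(
    {"poll_interval_sec": 1.0},
    environ={"SUPERVISOR_SURFACE_TYPE": "jsonl"},
)
```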

Design Philosophy

Inspired by Anthropic's Scaling Managed Agents:

  1. The system's memory lives in SessionRun, not in the model's context. Crashes don't lose history. Everything is in session_log.jsonl.

  2. The execution surface is just a "hand", not the system. Today that includes tmux, open-relay, and transcript-backed JSONL observation. Tomorrow it could be a PTY wrapper or a remote session. The SessionAdapter protocol keeps the supervisor decoupled.

  3. Harnesses change, primitives don't. The current sidecar loop is one harness. The 6 first-class objects (WorkflowSpec, SessionRun, ExecutionSurface, CheckpointEvent, SupervisorDecision, HandoffInstruction) are the stable interface.

  4. Verification is deterministic, not verbal. "Done" means the verifier passed, not that the agent said so.

  5. Skill evolution happens from structured hindsight, not ad-hoc prompt edits. friction_events and user_preference_memory give the system a durable learning substrate. The intended loop is: capture friction -> summarize/postmortem -> replay/eval candidate policy changes -> update skills/rules only when the offline signal says they are better.
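
The first principle, memory that lives in a durable log instead of model context, can be illustrated with an append-only JSONL file that is replayed after a crash. The event fields (type, seq, decision) and both function names are hypothetical; the real session_log.jsonl schema is not shown in this document.

```python
import json
import tempfile
from pathlib import Path


def append_event(log_path: Path, event: dict) -> None:
    """Append one event as a single JSON line."""
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(event) + "\n")


def recover(log_path: Path) -> dict:
    """Rebuild minimal run state by replaying the log from the start."""
    state = {"last_checkpoint_seq": 0, "last_decision": None}
    if not log_path.exists():
        return state
    with log_path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                break  # a torn final line from a crash; stop replaying
            if event.get("type") == "checkpoint":
                state["last_checkpoint_seq"] = max(
                    state["last_checkpoint_seq"], event["seq"]
                )
            elif event.get("type") == "decision":
                state["last_decision"] = event["decision"]
    return state


with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "session_log.jsonl"
    append_event(log, {"type": "checkpoint", "seq": 1})
    append_event(log, {"type": "decision", "decision": "CONTINUE"})
    append_event(log, {"type": "checkpoint", "seq": 2})
    recovered = recover(log)
```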

Skill Integration

Install for Claude Code:

cp -r skills/thin-supervisor ~/.claude/skills/

Install for Codex:

cp -r skills/thin-supervisor-codex ~/.codex/skills/thin-supervisor

Invoke with /thin-supervisor to start the default flow:

  • clarify ambiguous goals
  • generate a draft spec
  • wait for approval
  • attach and execute only after approval

The skill is now split into two layers:

  • frozen contract: skills/thin-supervisor*/references/contract.md
  • optimizable strategy fragments under skills/thin-supervisor*/strategy/

Future policy optimization should target the strategy fragments, not the whole SKILL.md.

Oracle Consultation

If you want an Amp-style "oracle" second opinion without giving up supervisor control, use:

thin-supervisor-dev oracle consult \
  --mode review \
  --question "Review the retry policy design" \
  --file supervisor/loop.py \
  --file supervisor/gates/supervision_policy.py

When an external provider key is configured, thin-supervisor calls that provider as a read-only advisor. Without an external key, it falls back to a self-adversarial review scaffold instead of failing hard. Add --run <run_id> to persist the consultation into the shared notes plane for the active supervised run.

Development

git clone https://github.com/fakechris/thin-supervisor
cd thin-supervisor
pip install -e ".[dev]"
pytest -q

For repo-specific setup and examples, start with docs/getting-started.md.

License

MIT
