Colleague is a swappable coder-agent harness that turns different models into repo workers behind one shared task contract.
colleague
Colleague is a swappable coder-agent harness that turns different model backends into repo workers behind one shared task runtime.
One runtime, many minds.
Colleague is the harness around the model. The model is the backend; Colleague supplies the task runtime, controls, shared task contract, and handoff that turn that backend into a usable repo worker. Point it at a repo task and it runs the work through whichever coder backend you select — and the caller never has to care which one ran.
Architecture
| Part | In Colleague |
|---|---|
| Mind / backend | the model/coder backend (a local vLLM model, an OpenAI-compatible endpoint, …) |
| Adapter | the code that invokes and controls one backend (colleague/engines/) |
| Task runtime | the shared task contract + lifecycle (Task → TaskResult) |
| Tool loop | the bounded agentic loop the backend drives the repo through |
| Plugins | replaceable backend adapters, discovered via Python entry points |
| Run report | the JSON result artifact + step trace each run writes |
| Telemetry | opt-in OpenTelemetry traces + metrics (colleague/telemetry/) |
| Handoff | branch/commit/push + gh pr create, gated for offline/CI (colleague/handoff.py) |
| Doctor | colleague doctor — read-only configuration-readiness health check (colleague/oilcheck/) |
| Registry | colleague wheels list — the backend plugins installed in this environment |
| Approval gate | colleague/policy.py — operator-declared .colleague/approvals.json that controls what the harness executes |
What ships in v0
- A shared task contract: a typed Task and TaskResult that every backend consumes and produces identically.
- A bounded agentic tool-loop: the backend calls read_file, write_file, list_dir, run_command, culture (AgentCulture CLIs), and finish, confined to the target repo, until it finishes or hits the step budget.
- Two backends, both registered through the same colleague.engines entry-point group an out-of-tree plugin would use:
  - mock: deterministic and networkless; the CI workhorse.
  - vllm-openai: drives any OpenAI-compatible /v1/chat/completions endpoint with tool calling. The built-in default model is sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP (an NVFP4 Qwen3 checkpoint on a vLLM server, which is what doctor checks for); any tool-calling model works.
- Git/PR handoff: branch → commit → push → gh pr create, gated so --no-pr (or no remote) stays a local commit and CI never pushes.
- A result artifact (.colleague/<task-id>.json) for handoff back to Guildmaster / Taskmaster / Steward.
- Command templates: reusable, parameterized task recipes stored under .colleague/commands/*.md, invoked with drive --command <name> [args…] or selected in the interactive palette.
- Lifecycle hooks: operator-authored shell commands that fire at task_start, pre_tool, post_tool, and finish events; a pre_tool hook can allow, deny, or rewrite tool calls before the backend executes them. A per-model hooks overlay (.colleague/<model>/hooks.json) layers model-specific fixes ahead of the base hooks, giving the operator a precision tool for recurring model biases. It is applied only for the targeted model and is a strict no-op for all others; no new runtime dep, socket, or daemon.
- Interactive palette: colleague session opens a foreground command browser so operators can select templates and run ad-hoc instructions without leaving the shell.
- Layered per-model config: AGENTS instructions (AGENTS.md → AGENTS.colleague.md → AGENTS.colleague.<model>.md) and skills (.colleague/skills/*.md → .colleague/<model>/skills/*.md) compose into a model-specific system prompt, with strict per-model isolation; inspect them with colleague agents list / colleague skills list.
- Telemetry: OpenTelemetry observability. Opt-in traces + metrics over OTLP, emitted identically by every backend; off by default and a strict no-op, with the SDK as an optional [otel] extra so the base install stays dep-free.
- doctor: a read-only configuration-readiness health check across identity, provider, engines, otel-readiness, and environment; emits a rubric-shaped report and exits non-zero when unhealthy.
- Mesh-member integration: a drive resolves a process-level identity (the repo's culture.yaml nick or .colleague/identity.json) and propagates it to subcommands via COLLEAGUE_IDENTITY. The loop exposes one curated culture tool (allow-list: agtag, devex) that shells out to the operator-installed CLIs with the identity injected. Operators opt into read-only ephemeral neighbour clones via .colleague/neighbours.json (defaults to empty; no new runtime dependency).
- Destination: colleague's sibling to telemetry. When a task warrants it, a backend can set a curated devague loop tool to open and converge a goal-frame before driving the repo, and declare the announcement on arrival. The destination (frame slug + announcement) is recorded in the JSON artifact; the curated allow-list excludes confirm/reject (user-only) and export (operator-only), and convergence is advisory: only human-confirmed claims are authoritative. Setting a destination is optional and backend-judged.
- Approval gate: an operator-declared .colleague/approvals.json that gates what the harness executes. Approval is tamper-protection, not just a name list: approve records a file's content checksum; if the file changes later the approval is void. Three categories are gated, each opt-in by presence of its section: run_command CLIs by program token (allow/deny lists), lifecycle hook scripts by checksum, and command templates by checksum. Skills and AGENTS instructions load freely; they are never gated. See the Approval gate section below for the full config shape and usage.
- Startup banner: colleague drive and colleague session greet an interactive terminal with an ASCII banner. It's decorative chrome: written to stderr, shown only on a TTY, and suppressed under --json, so it never pollutes the stdout result stream or agent-parsed output.
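The backends bullet above says both bundled backends and any out-of-tree plugin register under the colleague.engines entry-point group. As a rough illustration of what that kind of discovery looks like with the standard library, here is a minimal sketch. The helper name discover_engines is hypothetical and this is not Colleague's own registry code; colleague wheels list remains the supported way to see installed backends.

```python
from importlib.metadata import entry_points

ENGINE_GROUP = "colleague.engines"  # group name taken from the README


def discover_engines(group: str = ENGINE_GROUP) -> dict:
    """Map entry-point name -> 'module:attr' target for one group.

    Hypothetical helper for illustration only. It reads packaging
    metadata and does not import or load any backend.
    """
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+ exposes a selection API.
        selected = eps.select(group=group)
    else:
        # Python 3.8/3.9 return a plain dict of group -> entry points.
        selected = eps.get(group, [])
    return {ep.name: ep.value for ep in selected}


for name, target in sorted(discover_engines().items()):
    print(f"{name} -> {target}")
```

In an environment where the package is installed, the loop would be expected to list at least the two bundled backends; elsewhere it prints nothing.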
Not in v0 (by design): a multi-backend router / routing policy, an execution
sandbox, a daemon mode, Codex/Claude/Gemini adapters, a --no-hooks escape hatch
(there is no such flag; the approval gate is the landed hook-trust increment —
a policy gate, not a sandbox), and a live MCP runtime (no mcp.json, no mcp
verb; the curated culture tool shells out to operator CLIs — no socket, no MCP
transport). The runtime package has no third-party dependencies — the vLLM
adapter speaks the OpenAI wire format over the standard library.
Feature docs
Each shipped feature has a focused page under docs/features/
— start at the feature index. For the per-version
list of what shipped (and when), see CHANGELOG.md.
| Feature | Doc |
|---|---|
| Drive & the tool-loop | drive-and-loop.md |
| Context budget / graceful degradation | graceful-degradation.md |
| Backends & plugins | engines.md |
| Model & endpoint selection | model-selection.md |
| Git/PR handoff | handoff.md |
| Result artifact | artifact.md |
| Command templates | command-templates.md |
| Lifecycle hooks | hooks.md |
| Interactive palette | session.md |
| Cockpit views (tui / TAUI) | tui.md |
| Layered per-model config | layered-config.md |
| Telemetry: OpenTelemetry | telemetry.md |
| Drive stats & feedback (ROI) | stats-and-feedback.md |
| doctor (health check) | doctor.md |
| Agent-first CLI | agent-cli.md |
| Mesh-member integration | mesh-member.md |
| Destination | destination.md |
| Subagents | subagents.md |
| Parallel subagents (convoy) | parallel-subagents.md |
| Audit fan-out | audit-fanout.md |
| Per-model configuration | per-model-configuration.md |
| Approval gate | See Approval gate section below |
| Outsource (a different mind) | outsource.md |
| Escalation (agtag continuation) | escalation.md |
The detailed sections below remain the canonical reference; the feature pages add per-feature source pointers and cross-links.
Before → after: the extensibility layer
Before this layer, colleague drive accepted one raw instruction string
and ran the tool-loop with no operator gate and no saved recipes: run_command
and write_file executed unconditionally, and every task had to be typed from
scratch.
After, operators drop files into .colleague/ and gain three things that
work identically across every backend (the all-engines rule):
- Command templates: author a recipe once, invoke it by name with positional arguments; drive --command <name> [args…] expands it into the same Task shape a raw drive "…" produces.
- Lifecycle hooks: pre_tool hooks can allow, deny (reason fed back to the model), or rewrite tool arguments before they execute; post_tool hooks run formatters or linters after; task_start and finish hooks bracket the whole drive. Every firing is recorded in the result artifact.
- Interactive palette: colleague session lists discovered templates, accepts a selection (by number or name) plus optional arguments, and runs the chosen task through the same drive path, loop, hooks, and artifact, with no parallel code path.
This extensibility lives in the runtime (colleague/loop.py), not in any one
backend, so it binds equally to mock, vllm-openai, and any future plugin.
Quickstart
uv sync
uv run pytest -n auto # full suite, no network needed
# Open the interactive harness (the session palette) at a terminal:
uv run colleague
# Discover the backends installed in this environment:
uv run colleague wheels list
# Drive toward a goal with the deterministic mock backend (no model, no network):
uv run colleague drive "add a CONTRIBUTING.md stub" --repo . --engine mock --no-pr
Driving a real model (vLLM)
Start an OpenAI-compatible vLLM server with tool calling enabled:
vllm serve Qwen/Qwen3-32B \
--port 8001 \
--enable-auto-tool-choice \
--tool-call-parser hermes
The right --tool-call-parser depends on the model and the vLLM build:
hermes works for many models (including Qwen/Qwen3-32B above), while other
builds need a different one — e.g. an NVFP4 Qwen3 checkpoint served via vLLM may
want qwen3_coder. The backend itself is parser-agnostic — any parser that makes
the server emit OpenAI-format tool calls works.
Tip (anecdotal). With an NVFP4 Qwen3 checkpoint, qwen3_coder handled tool-argument escaping more reliably than hermes in our testing: a hermes run over-escaped the triple-quotes in a generated docstring (writing \"\"\" instead of """), producing a SyntaxError, where qwen3_coder wrote the same file cleanly. This is a single observation, not a benchmark, but if a parser garbles quote-heavy edits, trying the other one is worth a shot.
Then point Colleague at it (defaults already target localhost:8001):
uv run colleague drive "fix the typo in the README title" \
--repo /path/to/target/repo \
--engine vllm-openai \
--base-url http://localhost:8001/v1 \
--model Qwen/Qwen3-32B
Configuration resolves in the order: explicit flag → COLLEAGUE_* env →
OPENAI_* env → default. The built-in default --model is
sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP; the example above overrides it with
--model to match the server you started. Because the adapter only touches the
OpenAI surface, pointing --base-url at any compatible server (llama.cpp, an
OpenAI proxy) needs no code change.
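The resolution order above (explicit flag → COLLEAGUE_* env → OPENAI_* env → default) can be pictured as a first-match lookup. The sketch below is an illustration under assumptions, not the adapter's code: the helper resolve_setting, the specific variable names, the default base URL string, and the choice to treat an empty string as unset are all hypothetical.

```python
import os
from typing import Mapping, Optional


def resolve_setting(
    flag: Optional[str],
    colleague_var: str,
    openai_var: str,
    default: str,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Return the first value that is set: flag, COLLEAGUE_*, OPENAI_*, default."""
    env = os.environ if env is None else env
    # Assumption: an empty string counts as "not set".
    for candidate in (flag, env.get(colleague_var), env.get(openai_var)):
        if candidate:
            return candidate
    return default


# The variable names below are illustrative guesses, not confirmed names.
fake_env = {
    "COLLEAGUE_MODEL": "Qwen/Qwen3-32B",
    "OPENAI_BASE_URL": "http://gpu-box:8001/v1",
}

model = resolve_setting(
    None, "COLLEAGUE_MODEL", "OPENAI_MODEL",
    "sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP", fake_env,
)
base_url = resolve_setting(
    None, "COLLEAGUE_BASE_URL", "OPENAI_BASE_URL",
    "http://localhost:8001/v1", fake_env,
)
print(model)     # Qwen/Qwen3-32B (from the COLLEAGUE_* layer)
print(base_url)  # http://gpu-box:8001/v1 (from the OPENAI_* layer)
```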
The opt-in live end-to-end test proves this against a real server:
COLLEAGUE_VLLM_E2E=1 uv run pytest tests/test_vllm_live.py -v
Command templates
Operators save reusable task recipes as Markdown files under
.colleague/commands/<name>.md (repo-level or ~/.colleague/commands/ for
user-level; repo-level shadows user-level by stem).
.colleague/commands/ is committable — it is the one part of the otherwise
gitignored .colleague/ dir that git tracks, so a team can share recipes in-repo
(run artifacts, hooks.json, and approvals.json stay local). The committed
doc-review recipe is a worked example.
Template file format
A template may open with an optional --- metadata block:
---
description: Fix lint errors under a path
engine: mock
constraints: keep diffs minimal, run the formatter
arg-hint: <path>
---
Fix all lint errors under $1. Then run the formatter. $ARGUMENTS
Supported metadata keys:
| Key | Meaning |
|---|---|
| description | One-line description shown in listings |
| engine | Engine to use when running this command (overridden by --engine) |
| constraints | Comma-separated constraints added to the Task |
| arg-hint | Short argument hint shown in commands list |
If no --- block is present, the entire file content is the body.
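To make the file format concrete, here is a minimal sketch of splitting a template into metadata and body. It assumes the metadata block holds plain key: value lines as in the example above; the function name parse_template is hypothetical, and the real parser may accept more (or behave differently on an unterminated block).

```python
from typing import Dict, Tuple


def parse_template(text: str) -> Tuple[Dict[str, str], str]:
    """Split a template into (metadata, body)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no metadata block: the whole file is the body
    try:
        end = next(i for i in range(1, len(lines)) if lines[i].strip() == "---")
    except StopIteration:
        # Assumption: an unterminated block is treated as plain body text.
        return {}, text
    meta: Dict[str, str] = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return meta, "\n".join(lines[end + 1:])


sample = """---
description: Fix lint errors under a path
engine: mock
constraints: keep diffs minimal, run the formatter
arg-hint: <path>
---
Fix all lint errors under $1. Then run the formatter. $ARGUMENTS
"""

meta, body = parse_template(sample)
# "constraints" is documented as comma-separated, so split it into a list.
constraints = [c.strip() for c in meta["constraints"].split(",")]
print(meta["engine"], constraints)
print(body)
```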
Argument substitution
| Placeholder | Expands to |
|---|---|
| $ARGUMENTS | All arguments joined by a space |
| $1, $2, … | The N-th positional argument (empty string if not supplied) |
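The substitution rules in the table can be sketched in a few lines. This is an illustration rather than the shipped expander; expand_placeholders is a hypothetical name. A single regex pass is used so that an argument that itself contains "$1" is not expanded a second time.

```python
import re
from typing import Sequence

_PLACEHOLDER = re.compile(r"\$(ARGUMENTS|\d+)")


def expand_placeholders(body: str, args: Sequence[str]) -> str:
    """Expand $ARGUMENTS and $1, $2, ... in a template body."""

    def replace(match: "re.Match[str]") -> str:
        token = match.group(1)
        if token == "ARGUMENTS":
            return " ".join(args)
        index = int(token) - 1  # $1 is the first argument
        # A missing positional argument expands to the empty string.
        return args[index] if 0 <= index < len(args) else ""

    return _PLACEHOLDER.sub(replace, body)


body = "Fix all lint errors under $1. Then run the formatter. $ARGUMENTS"
print(expand_placeholders(body, ["src/"]))
# Fix all lint errors under src/. Then run the formatter. src/
```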
Running a command template
# One-shot via drive:
uv run colleague drive --command fix-lint src/ --repo /path/to/repo --engine mock --no-pr
# List all discovered templates:
uv run colleague commands list --repo .
# Surface overview:
uv run colleague commands overview
The --command flag and a positional instruction are mutually exclusive; any
tokens after --command <name> are passed as template arguments ($1, $2,
$ARGUMENTS).
Lifecycle hooks
Hooks are operator-authored shell commands registered in
.colleague/hooks.json (repo-level or ~/.colleague/hooks.json for
user-level; repo-level wins).
Config format
{
"hooks": {
"pre_tool": [{ "matcher": "run_command", "command": "my-policy-gate.sh" }],
"post_tool": [{ "matcher": "write_file", "command": "black $file 2>/dev/null; true" }],
"task_start":[{ "command": "echo task starting" }],
"finish": [{ "command": "echo done" }]
}
}
Each entry has:
| Field | Meaning |
|---|---|
| matcher | Regex (re.fullmatch) tested against the tool name. Absent or empty matches every tool. Ignored for task_start / finish events. |
| command | Shell command run in the target repo directory. |
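Because the matcher is applied with re.fullmatch, a pattern has to cover the whole tool name, not just a prefix. A minimal sketch of that rule (the helper hook_matches is a hypothetical name):

```python
import re
from typing import Optional


def hook_matches(matcher: Optional[str], tool_name: str) -> bool:
    """Apply the documented matcher rule to one tool name."""
    if not matcher:
        return True  # absent or empty matcher matches every tool
    return re.fullmatch(matcher, tool_name) is not None


print(hook_matches("run_command", "run_command"))  # True
print(hook_matches("write", "write_file"))         # False: fullmatch, not prefix
print(hook_matches("write_.*", "write_file"))      # True
print(hook_matches("", "list_dir"))                # True
```

A pattern such as read_file|write_file is one way to target several tools with a single entry.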
Lifecycle events
| Event | When it fires | Pre/post effect |
|---|---|---|
| task_start | Before the first tool call | Observe only |
| pre_tool | Before each tool call | Can allow, deny, or rewrite |
| post_tool | After each tool call | Observe only (side-effects OK) |
| finish | After the loop ends | Observe only |
Hook I/O contract
The hook receives a JSON payload on stdin:
{
"event": "pre_tool",
"tool": "run_command",
"arguments": { "command": "pytest" },
"task_id": "<uuid>",
"repo_path": "/path/to/repo"
}
The hook signals its decision via exit code and optional structured stdout:
| Exit code | Stdout | Decision |
|---|---|---|
| non-zero | any | deny: stderr (fallback: stdout) is fed back to the model as the tool result |
| 0 | empty or non-JSON | allow: tool runs as-is |
| 0 | {"decision":"allow", ...} | allow |
| 0 | {"decision":"deny", "reason":"..."} | deny: reason fed back to model |
| 0 | {"decision":"rewrite","arguments":{...}} | rewrite: tool runs with the supplied replacement arguments |
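Read as code, the decision table looks roughly like the sketch below. It is an interpretation of the table, not the runtime's implementation: interpret_hook_result is a hypothetical name, and treating an unknown decision value, a non-object JSON reply, or a rewrite without arguments as allow is an assumption the table does not settle.

```python
import json
from typing import Any, Dict


def interpret_hook_result(exit_code: int, stdout: str, stderr: str) -> Dict[str, Any]:
    """Turn a hook's exit code and output into a decision dict."""
    if exit_code != 0:
        # Non-zero exit denies; stderr is the reason, stdout the fallback.
        return {"decision": "deny", "reason": stderr.strip() or stdout.strip()}
    try:
        reply = json.loads(stdout)
    except ValueError:
        return {"decision": "allow"}  # empty or non-JSON stdout
    if not isinstance(reply, dict):
        return {"decision": "allow"}  # assumption: only JSON objects carry decisions
    decision = reply.get("decision")
    if decision == "deny":
        return {"decision": "deny", "reason": reply.get("reason", "")}
    if decision == "rewrite" and isinstance(reply.get("arguments"), dict):
        return {"decision": "rewrite", "arguments": reply["arguments"]}
    return {"decision": "allow"}  # explicit allow, or (assumed) anything else


print(interpret_hook_result(1, "", "tests must pass first"))
print(interpret_hook_result(0, "", ""))
print(interpret_hook_result(0, '{"decision":"rewrite","arguments":{"command":"pytest -q"}}', ""))
```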
Any response may carry an "additionalContext" string. Every firing (event,
matched command, decision, exit code) is recorded in TaskResult.hook_firings
and appears in the result artifact JSON.
post_tool, task_start, and finish hooks are observe-only: a deny from
these events is recorded but does not halt the loop.
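From the hook author's side, the contract amounts to reading one JSON payload from stdin and printing one decision object. The sketch below shows a pre_tool policy written in Python under that contract. The blocked prefixes, the pytest rewrite, and the names decide and main are all illustrative choices, not something the project ships.

```python
import json
import sys
from typing import Any, Dict

# Illustrative policy only; pick prefixes that fit your own repo.
BLOCKED_PREFIXES = ("rm -rf", "git push --force")


def decide(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Return an allow / deny / rewrite decision for one hook payload."""
    if payload.get("event") != "pre_tool" or payload.get("tool") != "run_command":
        return {"decision": "allow"}
    arguments = payload.get("arguments") or {}
    command = str(arguments.get("command", "")).strip()
    if command.startswith(BLOCKED_PREFIXES):
        return {"decision": "deny", "reason": f"blocked by local policy: {command!r}"}
    if command == "pytest":
        # Rewrite replaces the tool arguments wholesale.
        return {"decision": "rewrite", "arguments": {"command": "pytest -q"}}
    return {"decision": "allow"}


def main() -> int:
    """Read the payload from stdin, print the decision, exit 0."""
    payload = json.loads(sys.stdin.read() or "{}")
    json.dump(decide(payload), sys.stdout)
    return 0
```

To use something like this as a hook, the file would end with raise SystemExit(main()) and be referenced from a pre_tool entry's command field. Exiting 0 and putting the verdict in stdout keeps the reason text structured; exiting non-zero with a message on stderr is the simpler way to deny.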
Inspecting hooks
uv run colleague hooks list --repo .
uv run colleague hooks overview
Interactive cockpit (session)
colleague session opens a foreground interactive cockpit (#74 A2): it
renders one CockpitState — a command palette + a running conversation + popups —
and runs each selection through the same drive path (same Task, loop, hooks,
and artifact — no parallel code path):
uv run colleague session --repo /path/to/repo --engine vllm-openai
Input is line-based. At the prompt, plain text runs a drive — a number
(palette entry), a template name, or a free-text instruction (ad-hoc
task). A line starting with / is a slash command — the meta/system
namespace, akin to Claude Code / Codex:
- Introspection (surface an existing noun in the cockpit): /help, /commands, /skills, /agents, /config (the doctor readiness view), /engines, /telemetry, /feedback.
- Live config: /engine <name>, /model <name>, /base <branch>, /pr (toggle push + PR). These change the session without restarting it.
- /quit (or q / empty line) ends the session.
Three render tiers of the one state, chosen automatically:
- Interactive (a colour TTY): the dynamic ANSI cockpit, with redraw-in-place and popups on real events (an error popup when a drive step fails).
- Non-interactive (piped / captured): Markdown menus (the static but full agent-readable view), the default off a TTY. --no-tui forces it on a TTY.
- --json: stdout carries only the drive TaskResult (one JSON object each, preserving the machine contract); the cockpit renders to stderr as chrome.
Running colleague with no arguments at a terminal opens this same cockpit
(backend resolved like drive: --engine > COLLEAGUE_ENGINE > vllm-openai,
never a silent mock). By default it is a "talk + iterate" loop — each drive
commits locally but does not push or open a PR; /pr or --pr opts in.
Cockpit views (tui)
colleague tui exposes a headless, stdlib-only cockpit in three views of one
CockpitState: a JSON/TAUI mirror (the agent-readable, selector-addressed
source of truth — tui state), an ANSI frame (the visual render — tui render,
the default), and a Markdown view (the agent-facing readable render —
tui render --format markdown). TAUI (Textual Agentic UI) lets an agent read and
operate the UI without screen-scraping, an LLM, or any colleague import.
The cockpit is a pure reducer (event → reduce(state, event) → CockpitState) and
its state carries a popup model (skill_suggestion / confirmation / error /
progress / diff / help). tui snapshot captures a moment; tui diagnose
classifies cross-view disagreements (no LLM, no network); tui live opens the
foreground TTY cockpit.
uv run colleague tui state # the TAUI JSON mirror
uv run colleague tui render --state <file> # the ANSI frame
uv run colleague tui render --format markdown --state <file>
uv run colleague tui overview
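The pure-reducer shape described above (event → reduce(state, event) → CockpitState) can be illustrated with a toy version. Everything in this sketch is hypothetical: the two state fields, the StepEvent type, and the rule that a failed step raises an error popup are stand-ins chosen for illustration, not the real CockpitState schema.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class CockpitState:
    # Hypothetical, simplified fields; the real state carries more.
    conversation: Tuple[str, ...] = ()
    popup: Optional[str] = None


@dataclass(frozen=True)
class StepEvent:
    number: int
    tool: str
    ok: bool


def reduce(state: CockpitState, event: StepEvent) -> CockpitState:
    """Return a new state; the input state is never mutated."""
    line = f"step {event.number}: {event.tool} [{'ok' if event.ok else 'err'}]"
    return replace(
        state,
        conversation=state.conversation + (line,),
        popup=None if event.ok else "error",
    )


state = CockpitState()
for event in (StepEvent(1, "read_file", True), StepEvent(2, "run_command", False)):
    state = reduce(state, event)
print(state.conversation)
print(state.popup)
```

Because the reducer is a pure function over immutable values, replaying the same events always yields the same state, which is the property that lets live and replayed drives render identically.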
Watching a live drive. A real drive feeds the cockpit (#74):
uv run colleague drive "<task>" --engine mock # auto: live cockpit on a TTY
uv run colleague drive "<task>" --engine mock --no-tui # force the plain step lines
uv run colleague drive "<task>" --engine mock --tui-events run.jsonl # live event stream
uv run colleague tui replay --trace .colleague/<id>.trace.jsonl # replay a finished drive
- Live cockpit (A1): on an interactive terminal a drive renders the cockpit as it runs (conversation per step, popups on real events, e.g. an error popup when a tool step fails). Auto-on a TTY; --tui / --no-tui force it. Off a TTY (pipes/agents/CI) it falls back to the plain step N: <tool> [ok|err] stderr lines, byte-for-byte unchanged.
- Live event stream (A3): --tui-events <path> appends one DriveStep JSONL line per step as the drive runs, so an agent can follow it turn-by-turn or tui replay it. (A stream written into the driven repo is treated as harness telemetry and is never swept into the drive branch.)
- Replay a real drive (A4): tui replay --trace <id>.trace.jsonl folds a finished drive's loop-step trace into the cockpit (live and replayed steps read identically, through one shared converter).
- Interactive cockpit (A2): colleague session is now cockpit-rendered with slash commands; see Interactive cockpit (session).
A mid-loop failure still writes a partial artifact (status=error) with the
steps, usage, and changed files accumulated so far.
See docs/features/tui.md for the full surface.
Telemetry: OpenTelemetry observability
A drive can emit OpenTelemetry traces + metrics so it's observable against an OTLP collector — not just the per-run JSON artifact. Telemetry lives in the runtime (the loop + the shared drive path), so every backend emits it identically, exactly like lifecycle hooks.
It is off by default and a strict no-op when off (no spans, no SDK import, the result artifact unchanged). The OpenTelemetry SDK is an optional extra — the base install keeps zero runtime dependencies:
pip install 'colleague[otel]' # or: uv sync --extra otel
export COLLEAGUE_OTEL_ENABLED=1
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318 # OTLP/HTTP collector
uv run colleague drive "<task>" --repo . --engine mock --no-pr
# -> stderr prints "trace: <id>"; the collector receives the spans + metrics
Requested without the extra installed, colleague degrades to a no-op with a one-line stderr notice — it never fails the drive.
Signals. Spans: colleague.drive (root) → colleague.tool.* (per tool
call) → colleague.handoff. Metrics: colleague.steps, colleague.tokens,
colleague.tool.latency, colleague.tool.calls, colleague.hook.denials,
colleague.drive.duration.
Config (precedence: explicit > COLLEAGUE_OTEL_* > standard OTEL_* >
default): COLLEAGUE_OTEL_ENABLED, COLLEAGUE_OTEL_ENDPOINT /
OTEL_EXPORTER_OTLP_ENDPOINT, COLLEAGUE_OTEL_SERVICE_NAME /
OTEL_SERVICE_NAME. OTEL_SDK_DISABLED=true is honored as a kill-switch.
uv run colleague telemetry status # resolved config + whether the SDK is installed
uv run colleague telemetry overview # describe the surface
Drive stats & the feedback loop (ROI)
Together these let a caller compute the ROI of outsourcing a task: the stats say what a drive cost, a feedback record says how good it was.
Always-on stats. Every TaskResult carries a stats block
(DriveStats), written into the artifact on every drive — no flag, no opt-in.
It records the request, ISO start + wall-clock duration, model turns, step count,
per-tool counts, files changed, exact UTF-8 bytes_written, and
reasoning-vs-answer char/byte sizes. Exact token counts stay on usage, verbatim
from the model response (never estimated). Like hooks and telemetry, stats are
runtime-owned — identical for mock and vllm-openai.
Honest token limit. Colleague has no tokenizer (zero deps), and the served model reports no reasoning-token breakdown — so "thought vs written" is measured as chars/bytes, not tokens.
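The char/byte distinction matters as soon as text is not pure ASCII. A small sketch of the two measurements (text_sizes is a hypothetical helper, not the DriveStats code):

```python
def text_sizes(text: str) -> dict:
    """Measure text in characters and in exact UTF-8 bytes (no tokenizer)."""
    return {"chars": len(text), "bytes": len(text.encode("utf-8"))}


print(text_sizes("hello"))       # {'chars': 5, 'bytes': 5}
print(text_sizes("naïve café"))  # {'chars': 10, 'bytes': 12}
```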
Feedback. A single record per drive (re-grading overwrites) lives beside the
artifact at .colleague/<task_id>.feedback.json; a per-repo last_drive pointer
lets you grade the most recent drive without quoting its id. An ungraded drive
reads back as a clean "no feedback yet" state, never an error.
uv run colleague feedback record last --rating 4 --notes "correct but verbose"
uv run colleague feedback show last --repo .
uv run colleague feedback overview
The agent-facing entry is the outsource feedback skill verb. See
docs/features/stats-and-feedback.md.
Configuration readiness: doctor
Before you hand colleague work, colleague doctor answers "is this install
actually ready to drive?" It is colleague's read-only, diagnose-only
health check (no --fix, zero new runtime deps) that emits a
rubric-shaped {healthy, checks[]} report across five ordered check-groups:
| Group | Checks (severity) |
|---|---|
| identity | prompt_file_present / backend_consistency (error), skills_present (warning) |
| provider | resolved base_url/model with redacted api_key (info); credentials + budget advisories (warning) on a non-default provider |
| engines | backends discovered + both bundled backends present + each plugin loads (error; all-engines rule) |
| otel | telemetry enabled / SDK importable / endpoint configured (info; error only when enabled but the [otel] extra is missing) |
| environment | .colleague/ config, hooks.json validity, command-template parsing, AGENTS/skills layering, git (error) + gh (warning) on PATH, CLI integrity |
Only a failed error check flips the report unhealthy; warnings and info are
advisory. doctor exits 1 when unhealthy, else 0. The diagnostic logic lives
in the runtime-level colleague/oilcheck/ package (like telemetry); the verb
is a thin renderer. Add a check-group by appending a read-only checks() callable
to CHECK_GROUPS — see colleague explain doctor and
docs/features/doctor.md.
uv run colleague doctor # human-readable rubric; exit 1 if unhealthy
uv run colleague doctor --json # structured {healthy, checks[]}
uv run colleague doctor --probe # + live provider checks (the only networked ones)
--probe adds two opt-in checks that open a network connection — provider_reachable
(can colleague reach the endpoint?) and provider_model_available (is the
configured model actually served at that endpoint?) — gated behind the flag so the
default doctor stays network-free.
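The health rule (only a failed error-severity check flips the report unhealthy, and the exit code follows) can be sketched as below. The per-check field names severity and ok are assumptions made for illustration; the documented report shape only promises {healthy, checks[]}.

```python
from typing import Any, Dict, List


def summarize(checks: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Build a {healthy, checks[]} report from individual check results."""
    # Warnings and info never affect health; only failed errors do.
    healthy = not any(c["severity"] == "error" and not c["ok"] for c in checks)
    return {"healthy": healthy, "checks": checks}


def exit_code(report: Dict[str, Any]) -> int:
    return 0 if report["healthy"] else 1


advisory_only = summarize([
    {"name": "prompt_file_present", "severity": "error", "ok": True},
    {"name": "skills_present", "severity": "warning", "ok": False},
])
broken = summarize([
    {"name": "prompt_file_present", "severity": "error", "ok": False},
])
print(advisory_only["healthy"], exit_code(advisory_only))  # True 0
print(broken["healthy"], exit_code(broken))                # False 1
```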
Per-model instructions & skills
Colleague composes a model-specific system prompt for every drive from two layered families, resolved relative to the model currently driving. Strict per-model isolation: driving model X reads only X's overlay plus the shared base — it never even opens model Y's files (isolation is structural, built from exact paths, not filtered).
AGENTS instructions cascade from the repo root (the cross-tool standard
location — sibling agent tools read AGENTS.md there too), general → specific,
with a ~/.colleague/ user-level fallback:
AGENTS.md # shared base
AGENTS.colleague.md # colleague overlay
AGENTS.colleague.<model>.md # model overlay
Skills are markdown capability docs under .colleague/, folded into the
prompt as a compact name + one-line-summary catalog (a skill is instructional
text only — there is no skill execution in v0):
.colleague/skills/*.md # base
.colleague/<model>/skills/*.md # model overlay (shadows base by stem)
<model> is sanitized to a filename-safe token (e.g. Qwen/Qwen3-32B →
Qwen-Qwen3-32B). Inspect what resolves for a model:
uv run colleague agents list --model Qwen/Qwen3-32B --repo .
uv run colleague skills list --model Qwen/Qwen3-32B --repo .
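The layering rules can be sketched as exact path construction, which is also why isolation is structural: a sibling model's paths are never generated. The sanitization rule here (any character outside letters, digits, dot, underscore, and hyphen becomes a hyphen) is an assumption that merely reproduces the documented Qwen/Qwen3-32B example, all function names are hypothetical, and the ~/.colleague/ user-level fallback is left out.

```python
import re
from pathlib import Path
from typing import Dict, List


def sanitize_model(model: str) -> str:
    # Assumed rule; it matches the README example but may differ in detail.
    return re.sub(r"[^A-Za-z0-9._-]", "-", model)


def agents_candidates(repo: Path, model: str) -> List[Path]:
    """AGENTS files in cascade order, general to specific."""
    token = sanitize_model(model)
    return [
        repo / "AGENTS.md",
        repo / "AGENTS.colleague.md",
        repo / f"AGENTS.colleague.{token}.md",
    ]


def resolve_skills(repo: Path, model: str) -> Dict[str, Path]:
    """Skill stem -> file, with the model overlay shadowing the base."""
    token = sanitize_model(model)
    layers = [repo / ".colleague" / "skills", repo / ".colleague" / token / "skills"]
    chosen: Dict[str, Path] = {}
    for directory in layers:  # base first, overlay second
        if directory.is_dir():
            for path in sorted(directory.glob("*.md")):
                chosen[path.stem] = path
    return chosen


print(sanitize_model("Qwen/Qwen3-32B"))  # Qwen-Qwen3-32B
print([p.name for p in agents_candidates(Path("."), "Qwen/Qwen3-32B")])
```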
MCP layering is not built yet. Colleague does not read mcp.json or connect to any MCP server today; a live MCP client needs its own spec. There is no mcp verb, so don't rely on a non-existent surface.
Subagents
Mid-drive, a backend may delegate scoped sub-tasks via two loop tools:
subagent (a single child) or subagents (a batch that runs concurrently). Each
child runs the same bounded tool-loop as a nested in-process call, isolated in
its own throwaway git worktree on a sub/<id> branch; its result is returned to
the parent and folded into TaskResult.sub_results (omitted when empty). A
sequential merge-subagent integrates the branches afterward, surfacing (never
force-merging) unresolvable conflicts. An optional engine/model parameter lets
a child run on a different backend or model, resolved through the existing
registry.load + EngineConfig inheritance (a config-level switch, no backend
code change).
Concurrency is opt-in: COLLEAGUE_SUBAGENT_CONCURRENCY (default 1 =
byte-identical sequential behavior); with width > 1, up to
MIN(width, MAX_SUBAGENT_FANOUT-1) children run in parallel via
concurrent.futures (threads confined to colleague/subagents.py), reserving one
slot for the merge child. Delegation is backend-judged and optional (like the
devague destination tool), never a forced gate. Termination is structural:
MAX_SUBAGENT_DEPTH=2 (checked before any child work) and MAX_SUBAGENT_FANOUT=4
(per-drive, including the merge child). Only the top-level drive hands off —
sub-drives never branch, commit, or open a PR. This is runtime-owned (the tools
fire identically for every backend) and is explicitly not the out-of-scope
multi-backend router / routing policy: there is no automatic task→backend routing.
Honest limit: real wall-clock speedup requires the served model to handle concurrent requests; on a serializing server, gain is bounded by overlapped I/O wait, not model compute.
uv run colleague explain subagent # the loop tool's contract (not a CLI verb)
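The concurrency and termination limits above reduce to a little arithmetic. The sketch below shows the effective parallel width, MIN(width, MAX_SUBAGENT_FANOUT-1), driving a thread pool. It is an illustration only: run_children, the child_depth convention (top-level drive at depth 0), and the exact comparison used for the depth check are assumptions, and the real children run in git worktrees rather than calling a plain function.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence

MAX_SUBAGENT_DEPTH = 2
MAX_SUBAGENT_FANOUT = 4  # per drive, including the merge child


def parallel_width(requested: int) -> int:
    """Effective parallelism: one fan-out slot is reserved for the merge child."""
    return max(1, min(requested, MAX_SUBAGENT_FANOUT - 1))


def run_children(
    tasks: Sequence[str],
    run_one: Callable[[str], str],
    requested_width: int,
    child_depth: int,
) -> List[str]:
    # Structural termination is checked before any child work starts.
    if child_depth > MAX_SUBAGENT_DEPTH:
        raise RuntimeError("subagent depth limit reached")
    if len(tasks) > MAX_SUBAGENT_FANOUT - 1:
        raise ValueError("too many children for one drive")
    with ThreadPoolExecutor(max_workers=parallel_width(requested_width)) as pool:
        # map() returns results in task order, regardless of finish order.
        return list(pool.map(run_one, tasks))


print(parallel_width(1))  # 1 (the default: sequential)
print(parallel_width(8))  # 3
print(run_children(["scan docs", "scan tests"], str.upper, 2, child_depth=1))
```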
Outsource (a different mind)
outsource is colleague's one first-party Claude Code skill — the inverse of
the vendored skills. It lets another agent hand a scoped task to colleague: a
different backend/model (e.g. a local vLLM Qwen), not a stronger one — diversity
is the point. Four verbs over colleague drive:
| Verb | What it does |
|---|---|
| outsource explore | Read-only investigation of an area (worktree-isolated). |
| outsource review | A diverse second opinion on the committed <base>...HEAD diff (the headline verb). |
| outsource write | Delegate a small change; previews by default, --apply lands a drive branch, --pr opens a PR. |
| outsource feedback | Grade a finished drive (close the ROI loop). |
explore/review run in a throwaway git worktree (no working-tree side effects);
write previews in one too unless --apply/--pr. See
docs/features/outsource.md.
Approval gate
The approval gate is an operator-declared allow-list that controls what the
harness executes — not just what it discovers. Approval is tamper-protection:
approve records the file's current content checksum; if the file changes after
approval, the checksum no longer matches and the approval is void.
Config shape
{
"run_command": { "allow": ["git", "pytest", "uv"], "deny": [] },
"hooks": { ".colleague/lint.sh": "sha256:<hex>" },
"commands": { "fix-lint": "sha256:<hex>" }
}
Place this file at .colleague/approvals.json in the target repo (or
~/.colleague/approvals.json for user-level defaults; repo-level wins). A
per-model overlay at .colleague/<sanitized-model>/approvals.json is composed
ahead — per-model keys replace base keys for the same section; no sibling model
is ever read.
What is gated (and what is not)
| Category | Gated by | Absent section |
|---|---|---|
| run_command | Program token (shlex first token): allow/deny lists | No-op (all commands allowed) |
| hooks | Content checksum of the referenced hook script file | No-op (all hooks run) |
| commands | Content checksum of the template .md file (checked at expansion) | No-op (all templates expand) |
| Skills / AGENTS | Never gated — declarative, load freely | — |
A section is gated only when it is present in approvals.json. An absent
section is a strict no-op: byte-identical to behavior before the gate existed.
When a section is present, allow-list semantics apply: anything unlisted,
unapproved, or tampered is denied.
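For the run_command category, the semantics above come down to a first-token lookup. The following sketch is one reading of those rules, not the code in colleague/policy.py: the function name is hypothetical, and letting deny win over allow and denying unparseable commands are assumptions.

```python
import shlex
from typing import Any, Dict


def run_command_allowed(command: str, approvals: Dict[str, Any]) -> bool:
    """Check a command line against the run_command section, if present."""
    policy = approvals.get("run_command")
    if policy is None:
        return True  # absent section: the gate is a strict no-op
    try:
        tokens = shlex.split(command)
    except ValueError:
        return False  # assumption: an unparseable command is denied
    if not tokens:
        return False
    program = tokens[0]  # only the first shell token is inspected
    if program in policy.get("deny", []):
        return False  # assumption: deny takes precedence over allow
    return program in policy.get("allow", [])


approvals = {"run_command": {"allow": ["git", "pytest", "uv"], "deny": []}}
print(run_command_allowed("pytest -q", approvals))     # True
print(run_command_allowed("rm -rf build", approvals))  # False (unlisted)
print(run_command_allowed("rm -rf build", {}))         # True (section absent)
```

The first-token rule is also where the gate's limits come from: if sh were on the allow list, sh -c with any payload would pass this check.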
Approving files
# Approve a command template by checksum (default: sha256):
uv run colleague commands approve fix-lint --repo .
# Approve a hook script by repo-relative path:
uv run colleague hooks approve .colleague/lint.sh --repo .
# Use md5 instead (drift detection; not recommended for integrity):
uv run colleague commands approve fix-lint --repo . --algo md5
# Both commands support --json for machine-readable output.
Inspecting approval status
# commands list shows: approved | drifted | unapproved | ungated
uv run colleague commands list --repo .
# hooks list shows approval status per entry + the run_command policy if present
uv run colleague hooks list --repo .
# skills list always shows: accessible (never gated)
uv run colleague skills list --repo .
Status values:
| Status | Meaning |
|---|---|
| approved | Entry present, checksum matches current file content |
| drifted | Entry present, but file changed since approval; approval void |
| unapproved | Section present but no entry for this name |
| ungated | Section absent from approvals.json; gate not active |
| accessible | Skills / AGENTS: never gated, always accessible |
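The four gated status values follow mechanically from the approvals file and the current file content. A minimal sketch, assuming each entry has the algo:hex form shown in the config shape (approval_status is a hypothetical name, not the project's API):

```python
import hashlib
from typing import Any, Dict


def approval_status(approvals: Dict[str, Any], section: str, name: str, content: bytes) -> str:
    """Classify one hook script or command template against approvals.json."""
    entries = approvals.get(section)
    if entries is None:
        return "ungated"  # section absent: gate not active
    recorded = entries.get(name)
    if recorded is None:
        return "unapproved"  # section present, no entry for this name
    algo, _, expected = recorded.partition(":")
    actual = hashlib.new(algo, content).hexdigest()
    return "approved" if actual == expected else "drifted"


template = b"Fix all lint errors under $1.\n"
approvals = {"commands": {"fix-lint": "sha256:" + hashlib.sha256(template).hexdigest()}}

print(approval_status(approvals, "commands", "fix-lint", template))            # approved
print(approval_status(approvals, "commands", "fix-lint", template + b"# x"))   # drifted
print(approval_status(approvals, "commands", "other", template))               # unapproved
print(approval_status(approvals, "hooks", ".colleague/lint.sh", b"#!/bin/sh")) # ungated
```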
Honest limits
This is a policy gate, not a sandbox.
- The run_command check inspects the first shell token only. It is trivially bypassable by sh -c '...', shell pipelines, command substitution, shell expansion, or an absolute path to a renamed binary. The gate encodes operator intent; it does not contain a hostile process. An airtight execution sandbox is explicitly out of v0 scope.
- md5 detects accidental drift (file edited by mistake), not a deliberate attacker who can recompute a hash. Use sha256 when integrity matters.
- Checksum-only in v0. There is no version-based pinning. Approvals are recorded and verified by content hash only. Version pinning is a documented follow-up that is not yet built; do not rely on it.
- This is the landed increment of the tracked "per-repo hook trust gate" from the security section below. There is still no --no-hooks flag; that remains a future follow-up.
⚠ Security: repo-shipped hooks run by default
This is a code-execution risk. Read before driving an untrusted repo.
When you run colleague drive (or colleague session) against a repo that
contains a .colleague/hooks.json, those hooks execute automatically with
your operating-system privileges. There is no confirmation prompt and no
sandboxing. Cloning a malicious repository and pointing Colleague at it will
run whatever shell commands that repository's hooks.json specifies.
This behavior is intentional under Colleague's trusted-operator-env model
(D2): the same design tradeoff Claude Code and Codex make for their .claude/
and .codex/ hook configs. You are expected to trust (or audit) the repos you
drive.
What is implemented: the approval gate lets you gate hook scripts by checksum.
An unapproved or tampered hook script is skipped rather than hard-denying the
tool call, and the skip is recorded as a skipped firing in the artifact. The
run_command allow/deny list gates which CLI programs the loop may invoke.
What is NOT yet implemented: a --no-hooks escape hatch or any other
mechanism to disable repo-shipped hooks without editing .colleague/hooks.json
yourself. The approval gate is a policy gate, not a sandbox — see its honest
limits above. A further hardening increment is tracked but has not shipped
in the current version. Do not rely on a non-existent flag.
Safe practices until further hardening ships:
- Only drive repos you own or have audited.
- Review .colleague/hooks.json before running drive in an unfamiliar repo.
- Use user-level (~/.colleague/hooks.json) hooks as an allow-list approach if you want hooks without trusting any repo's config.
CLI
| Verb | What it does |
|---|---|
| drive <goal> | Drive toward a goal/instruction: work autonomously through a coder backend; write the artifact; hand off. |
| drive --command <name> [args…] | Expand a saved command template and drive it. |
| commands list | List discovered command templates for a repo (shows approval status). |
| commands approve <name> | Record a checksum approval for a command template. |
| commands overview | Describe the commands surface. |
| hooks list | List configured hook entries for a repo (shows approval status + run_command policy). |
| hooks approve <script> | Record a checksum approval for a hook script file (repo-relative path). |
| hooks overview | Describe the hooks surface. |
| agents list | List resolved AGENTS instruction layers for a model. |
| agents overview | Describe the agents surface. |
| skills list | List resolved skill docs for a model. |
| skills overview | Describe the skills surface. |
| telemetry status | Show the resolved telemetry / OpenTelemetry config + whether the SDK is installed. |
| telemetry overview | Describe the telemetry surface. |
| session | Open a foreground interactive palette. |
| wheels list | List discovered backend plugins (the registry). |
| whoami | Report nick, version, mesh backend, and the live drive engine + model (the delegate an outsource would actually run). |
| learn | Print a structured self-teaching prompt. |
| explain <path> | Markdown docs for any noun/verb path. |
| overview | Read-only descriptive snapshot of the agent. |
| doctor | Configuration-readiness health check: identity, provider, engines, otel, environment. |
| cli overview | Describe the CLI surface itself. |
Every command supports --json. Results go to stdout, errors/diagnostics to
stderr (never mixed). Exit codes: 0 success, 1 user error, 2 environment
error, 3+ reserved.
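A caller can lean on that contract (JSON on stdout, diagnostics on stderr, the exit-code table) as in the sketch below. The helper name run_json and the returned dictionary shape are illustrative assumptions, and the demo uses a stand-in Python subprocess instead of the real CLI, because the JSON that Colleague emits is not specified here.

```python
import json
import subprocess
import sys

# Exit-code meanings as listed above; anything else is treated as reserved/unknown.
EXIT_MEANING = {0: "success", 1: "user error", 2: "environment error"}


def run_json(argv: list[str]) -> dict:
    """Run a command, keeping stdout (results) and stderr (diagnostics) apart."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    payload = None
    if proc.returncode == 0 and proc.stdout.strip():
        payload = json.loads(proc.stdout)  # only parse results on success
    return {
        "exit_code": proc.returncode,
        "meaning": EXIT_MEANING.get(proc.returncode, "reserved or unknown"),
        "payload": payload,
        "diagnostics": proc.stderr,
    }


# Stand-in commands so the sketch runs without Colleague installed.
ok_cmd = [sys.executable, "-c",
          "import sys; print('{\"status\": \"ok\"}'); print('warming up', file=sys.stderr)"]
env_fail_cmd = [sys.executable, "-c", "import sys; sys.exit(2)"]

ok = run_json(ok_cmd)
failed = run_json(env_fail_cmd)
print(ok["meaning"], ok["payload"])
print(failed["meaning"], failed["payload"])
```

Against the real tool the argument list would be something like ["uv", "run", "colleague", "doctor", "--json"], built from the invocations shown in this document.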
Writing your own backend plugin
A backend is a class implementing colleague.engine.Engine, which has one
method: drive(task, config) -> TaskResult. Advertise it under the
colleague.engines entry-point group and colleague wheels list discovers it,
with no change to Colleague core:
[project.entry-points."colleague.engines"]
my-engine = "my_package.engine:MyEngine"
Most backends never re-implement the loop — they delegate to
colleague.loop.run and only supply how the model is called. Because the
loop owns hook firing, a custom backend inherits the full lifecycle extensibility
layer for free.
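The shape of such a backend can be sketched as follows. Everything here is a stand-in so the example runs on its own: the Task and TaskResult fields, the run_loop function, and its signature are hypothetical and do not reproduce colleague.engine or colleague.loop.run. Only the drive(task, config) -> TaskResult method shape comes from the text above.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Protocol


@dataclass
class Task:  # stand-in; real fields are not shown in this document
    goal: str
    repo: str = "."


@dataclass
class TaskResult:  # stand-in; real fields are not shown in this document
    ok: bool
    summary: str
    steps: list[str] = field(default_factory=list)


class Engine(Protocol):
    def drive(self, task: Task, config: dict[str, Any]) -> TaskResult: ...


def run_loop(task: Task, call_model: Callable[[Task, list[str]], str],
             max_steps: int = 8) -> TaskResult:
    """Hypothetical shared loop: ask for actions until 'finish' or the step budget."""
    steps: list[str] = []
    for _ in range(max_steps):
        action = call_model(task, steps)
        steps.append(action)
        if action == "finish":
            return TaskResult(True, f"finished in {len(steps)} steps", steps)
    return TaskResult(False, "step budget exhausted", steps)


class MyEngine:
    """Supplies only how the model is called; the loop is delegated."""

    def _call_model(self, task: Task, history: list[str]) -> str:
        # A real backend would query its model here; this toy reads once, then finishes.
        return "read_file" if not history else "finish"

    def drive(self, task: Task, config: dict[str, Any]) -> TaskResult:
        return run_loop(task, self._call_model, max_steps=config.get("max_steps", 8))


result = MyEngine().drive(Task(goal="fix lint"), {"max_steps": 4})
stuck = run_loop(Task(goal="never ends"), lambda task, history: "list_dir", max_steps=3)
print(result.ok, result.summary)
print(stuck.ok, stuck.summary)
```

Keeping the loop outside the engine is what lets cross-cutting behavior such as hook firing live in one place, so each backend stays a thin adapter.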
License
MIT — see LICENSE.
Download files
File details
Details for the file colleague-0.33.2.tar.gz.
File metadata
- Download URL: colleague-0.33.2.tar.gz
- Upload date:
- Size: 1.2 MB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: uv/0.11.19 {"installer":{"name":"uv","version":"0.11.19","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8a768259abeba83746f2163a9407ed45f241d192dbfcdd1083828b082d68ffd2 |
| MD5 | 55954d764bb3b50d448565e7f80c54d6 |
| BLAKE2b-256 | 08fb14598208eb441b8980c151bb2756d55a1f7bcc22ae96a9cb41d5693d0705 |
File details
Details for the file colleague-0.33.2-py3-none-any.whl.
File metadata
- Download URL: colleague-0.33.2-py3-none-any.whl
- Upload date:
- Size: 267.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: uv/0.11.19 {"installer":{"name":"uv","version":"0.11.19","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6a94ea99557d1b7efc946e1b92e3d350174c6c83bb27a73eb6f401fa6bd8d83e |
| MD5 | 1cc13d27790d9f79af9b14f80a6c5c1d |
| BLAKE2b-256 | 9bc8aac8b4f208ae3d581c3cd4125c9987ec4d2215524ece2b70067ffd09238d |