CLI for unattended AI-driven development with multi-backend support (Codex, Claude, Gemini, OpenCode)
autodev
autodev is a local CLI for unattended AI-driven development. The intended workflow is: the developer describes what they want, autodev plan turns that intent or document into a managed runtime queue in task.json, autodev task list lets you inspect that queue, and then autodev executes those tasks, checks the result, writes progress logs, and optionally commits successful work to git.
Features
- Supports `claude`, `codex`, `gemini`, and `opencode` backends.
- Runs tasks in non-interactive mode.
- Defaults all supported backends to full-auto / YOLO-style execution.
- Bootstraps tool-native project conventions for Claude Code, Codex, Gemini CLI, and OpenCode.
- Can generate COCA specs before task planning.
- Tracks changed files with directory snapshots.
- Separates `verification` evidence collection from `completion` pass/fail judgment.
- Gives every task an observable completion contract before considering work complete.
- Marks failed tasks as blocked and continues.
- Appends structured progress to `progress.txt`.
- Mechanically audits generated and refined tasks before accepting them.
- Supports bounded iterative metric-driven tasks with baseline measurement, metric comparison, and automatic keep/revert behavior.
Installation
From the repository root:
pip install -e .
If you do not want to install it yet, you can also invoke it with:
python3 -m autodev.cli --help
Prerequisites
Before running autodev, make sure the backend CLI you want to use is installed and already authenticated:
- Claude Code: `claude`
- Codex CLI: `codex`
- Gemini CLI: `gemini`
- OpenCode CLI: `opencode`
autodev checks this at startup and exits early if the selected command is not available.
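A startup check of this kind can be approximated with the standard library. The sketch below is only an illustration, not autodev's actual implementation: the names `BACKEND_COMMANDS`, `backend_available`, and `require_backend`, and the injectable `which` parameter, are assumptions introduced here.

```python
import shutil
import sys

# Backend names mapped to the CLI command each one needs on PATH,
# as listed in the prerequisites above.
BACKEND_COMMANDS = {
    "claude": "claude",
    "codex": "codex",
    "gemini": "gemini",
    "opencode": "opencode",
}


def backend_available(backend, which=shutil.which):
    """Return True when the CLI for `backend` can be found on PATH."""
    command = BACKEND_COMMANDS[backend]
    return which(command) is not None


def require_backend(backend, which=shutil.which):
    """Exit early (status 1) when the selected backend CLI is missing."""
    if not backend_available(backend, which):
        # sys.exit with a string prints it to stderr and exits with status 1.
        sys.exit(f"backend '{backend}' requires '{BACKEND_COMMANDS[backend]}' on PATH")
```

Passing `which` as a parameter keeps the check testable without touching the real PATH.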
Verified Behavior
The usage below was re-checked on March 22, 2026 in a local environment with:
- `Claude Code 2.0.76`
- `codex-cli 0.46.0`
- `OpenCode 1.2.27`
- Gemini CLI headless mode from the official docs
What was verified locally:
- `claude -p` is the non-interactive entrypoint.
- `claude --dangerously-skip-permissions` is available.
- `codex exec` is the non-interactive entrypoint.
- `codex exec --full-auto` and `codex exec --dangerously-bypass-approvals-and-sandbox` are documented in local help.
- `codex exec --yolo` is accepted by the local CLI parser in this environment, even though it is not listed in `codex exec --help`.
- `gemini -p` is the official headless entrypoint and supports `--model` plus `--yolo`.
- `opencode run` is the non-interactive entrypoint.
Quick Start
Detailed tutorial: docs/how-to-use-autodev.md
Architecture note: docs/skills-main-integration.md
- Initialize a project:
autodev init .
Choose exactly one tool wrapper to scaffold per init run:
autodev init . --use codex
autodev init . --use claude
autodev init . --use gemini
autodev init . --use opencode
autodev init now writes backend.default like this:
- no `--use`: default to `codex`
- one explicit tool such as `--use codex`: use that tool as `backend.default`
You can re-run autodev init --use <tool> later to add another tool wrapper for the same project without overwriting existing files.
This creates:
- `autodev.toml`
- `task.json` runtime queue scaffold
- `AGENT.md`
- `TASK.md`
- `progress.txt`
- `logs/`
- `.skills/` canonical shared skills
- Tool-specific wrapper files for the selected CLI such as `.claude/`, `.codex/`, `.gemini/`, or `.opencode/`
The default shared skills copied into .skills/ are:
- `autodev-runtime`
- `coca-spec`
- `spec-driven-develop`
- `find-skills`
- `skill-creator`
If you also want that host tool to discover the generated skills through its native install flow, run:
autodev install-skills
autodev install-skills reads autodev.toml and installs for [backend].default.
You can inspect or search those skills directly from the CLI:
autodev skills list
autodev skills recommend "create a new skill"
autodev skills recommend "find a code review workflow"
autodev skills doctor
autodev skills doctor checks the current project's .skills/ layout, the
selected tool wrapper for backend.default, and the install state for user-level
skill links when that can be inspected locally.
- Check or adjust the backend in `autodev.toml`:
[backend]
default = "codex"
- Describe your intent and let `autodev plan` handle the spec + planning flow. For long free-form text, prefer `--intent`:
autodev plan --intent "Build a small FastAPI service for todo items with CRUD APIs, SQLite storage, and unittest coverage."
You can also use a PRD or spec file explicitly:
autodev plan -f ./prd.md
autodev plan -f docs/specs/intent-coca-spec.md
If the request is not already a COCA spec, autodev plan now generates an
intermediate COCA spec automatically, saves it under docs/specs/, and then
creates task.json.
You can still plan directly from an existing spec:
autodev plan -f docs/specs/intent-coca-spec.md
When the input document already looks like a COCA spec, autodev plan
automatically switches to a COCA-aware task decomposition prompt.
When you plan from a file, autodev also injects that source document path
into each generated task's docs field.
- Preview what would run:
autodev run --dry-run
- Start execution:
autodev run
During execution, autodev now writes a live dashboard snapshot to
logs/dashboard.html and a machine-readable snapshot to
logs/runtime-status.json.
To monitor all projects through a web dashboard:
pip install -e ".[web]"
autodev web
Then open:
http://127.0.0.1:8080
Overnight Tutorial
The commands below are a practical end-to-end example for unattended overnight work.
This example uses codex as the backend. If you prefer claude, gemini, or opencode,
change the backend name in autodev.toml.
- Install `autodev` from the repository root:
cd /mnt/e/projects/autodev
pip install -e .
autodev --help
- Make sure your backend CLI is installed and already authenticated:
codex --help
- Create or enter your target project:
mkdir -p /mnt/e/projects/asr-realtime-cpp
cd /mnt/e/projects/asr-realtime-cpp
git init
- Initialize the project for `autodev`:
autodev init .
- Set the default backend if you want something other than the init default:
autodev init already defaults new projects to codex, so you can usually skip this step.
- Generate the task plan. `autodev plan` will automatically create an intermediate COCA spec first when the request is still plain intent text.
Example using --intent text input:
autodev plan --intent "Use C++ to build a realtime ASR speech-to-text framework. Requirements: CMake project; modular design; microphone and streaming audio input support; chunked streaming pipeline; VAD abstraction; ASR engine abstraction layer; start with the framework and do not hard-bind to one cloud vendor; provide a sample CLI; include basic unit tests; keep the directory structure clean; prioritize extensibility, successful compilation, and future adapters for whisper.cpp, sherpa-onnx, and FunASR."
Example using a PRD / spec file:
autodev plan -f ./prd.md
- Review the generated work queue:
autodev task list
- Do one safety preview before sleeping:
autodev run --dry-run
- Start the unattended overnight run in a detached tmux session:
autodev run --detach --epochs 5 --max-retries 20 --max-tasks 999
- Start the web dashboard to monitor all projects:
autodev web
Then open:
http://127.0.0.1:8080
- Confirm it started:
autodev list
tmux attach -t autodev-asr-realtime-cpp
- Check the result in the morning:
autodev status
autodev task list
cat progress.txt
tail -n 200 logs/autodev.log
git log --oneline --decorate -n 20
Important note:
- `--max-retries 20` means each task may be retried up to 20 times.
- It does not mean 20 full outer planning-and-execution iterations of the whole project; that outer loop is controlled by `--epochs`.
Core Files
- `autodev.toml`: Main configuration file.
- `task.json`: Generated runtime queue that `autodev run` executes.
- `TASK.md`: Current active task summary written by `autodev` during execution.
- `progress.txt`: Structured execution history.
- `AGENT.md`: Root canonical rule file shared by all supported tools.
- `logs/`: Main log and per-attempt logs.
Tool-Native Scaffolding
autodev init bootstraps a lightweight agent layout that is intentionally split into:
- One shared canonical source:
  - `AGENT.md`
  - `TASK.md`
  - `.skills/autodev-runtime/SKILL.md`
  - `.skills/autodev-runtime/references/task-lifecycle.md`
  - `.skills/autodev-runtime/references/skills-main-integration.md`
  - `.skills/autodev-runtime/references/mcp-essentials.md`
  - `.skills/coca-spec/SKILL.md`
  - `.skills/spec-driven-develop/SKILL.md`
  - `.skills/spec-driven-develop/references/doc-templates.md`
- Thin tool-native wrappers copied into the directories expected by the selected CLI tool:
  - Claude Code: `.claude/CLAUDE.md`, `.claude/rules/core.md`, `.claude/skills -> ../.skills`, `.claude/commands/spec-dev.md`, `.claude/commands/coca-spec.md`
  - Codex: `.codex/AGENTS.md`, `.codex/rules/core.md`, `.codex/skills -> ../.skills`
  - Gemini CLI: `.gemini/GEMINI.md`, `.gemini/rules/core.md`, `.gemini/skills -> ../.skills`, `.gemini/commands/autodev/*.toml`, `.gemini/settings.json`
  - OpenCode: `.opencode/AGENTS.md`, `.opencode/rules/core.md`, `.opencode/skills -> ../.skills`
This is the lightweight compromise that works best in practice:
- The canonical project rules exist once in `AGENT.md`
- The canonical skills exist once in `.skills/`
- The tool entry files stay thin
- The always-on `rules/` files stay very small
- Tool-local `skills/` paths point back to the canonical shared skills
Why not make everything a single physical file with no wrappers at all?
- Because each CLI discovers context from its own native paths
- Always-on rules need to exist where the CLI will load them
- Command support differs per tool
- A tiny wrapper plus project-local skill links is more portable than relying only on tool-global installs
This scaffolding is intentionally simple and project-local. It does not try to install
global plugins or mutate ~/.codex, ~/.claude, ~/.gemini, or ~/.config/opencode.
Use autodev install-skills when you want the explicit second step
that registers the generated wrappers for the configured backend.default.
autodev init --use supports exactly one tool per invocation:
- `claude`
- `codex`
- `gemini`
- `opencode`
If omitted, autodev init defaults to codex.
Intent-First Workflow
For a step-by-step end-to-end guide, see docs/how-to-use-autodev.md.
You should not need to hand-write task.json in normal usage.
The default flow is:
- Explain the project intent in one sentence or a short paragraph.
- Run `autodev plan --intent "..."`.
- Review the planned queue with `autodev task list`.
- Run `autodev run`.
If you already have a requirements document, you can pass it explicitly:
autodev plan -f docs/prd.md
You can also pipe intent text from another command or file:
cat idea.md | autodev plan
printf '%s\n' "Build a CLI that syncs local notes to S3 with tests." | autodev plan
autodev spec still exists as an explicit advanced step when you want to
inspect or edit the generated COCA spec before planning:
autodev spec --intent "Add a billing dashboard for team admins."
autodev spec -f docs/prd.md
Spec-Driven Workflow
The generated agent scaffolding now includes a lightweight coca-spec skill.
In normal CLI usage, you usually do not need to call it manually because
autodev plan already uses the same idea internally.
Recommended default sequence:
- Run `autodev plan --intent "..."`.
- Let `autodev` generate an intermediate COCA spec when needed.
- Review the planned queue with `autodev task list`.
- Run `autodev run`.
Advanced sequence when you want to inspect the spec explicitly:
- Run `autodev spec --intent "..."`.
- Review or edit `docs/specs/<name>-coca-spec.md`.
- Run `autodev plan -f docs/specs/<name>-coca-spec.md`.
- Run `autodev run`.
autodev plan auto-detects COCA spec headings and generates tasks with more
emphasis on constraints, assertions, and spec-linked docs.
The generated tool-specific skills include a lightweight spec-driven-develop
workflow adapted from:
The local version keeps the same core idea but in a simpler form:
- Clarify intent
- Analyze the codebase
- Create planning docs under `docs/`
- Use `docs/progress/MASTER.md` as the continuity anchor
- Convert the approved plan into executable work
This is especially useful for large rewrites, migrations, and architecture-heavy tasks.
The generated rules also point agents at .skills/autodev-runtime/references/task-lifecycle.md,
which documents how autodev expects task.json and progress.txt to stay aligned.
Backend Configuration
By default, autodev is configured to let the selected coding agent act in a fully automatic mode:
- Claude Code defaults to `skip_permissions = true` with `permission_mode = "bypassPermissions"`.
- Codex defaults to `yolo = true`.
- Gemini CLI defaults to `yolo = true`.
- OpenCode defaults to permissive tool access via `OPENCODE_PERMISSION`.
These defaults are intentionally aggressive. They are best suited for trusted local repos, disposable sandboxes, or CI environments that are already isolated externally.
Claude Code
Uses claude -p in non-interactive mode.
[backend]
default = "claude"
[backend.claude]
skip_permissions = true
permission_mode = "bypassPermissions"
output_format = "stream-json"
model = ""
verbose = true
Codex
Uses codex exec in non-interactive mode.
[backend]
default = "codex"
[backend.codex]
model = "gpt-5-codex"
yolo = true
full_auto = false
dangerously_bypass_approvals_and_sandbox = true
ephemeral = false
Notes:
- `yolo = true` makes `autodev` call `codex exec --yolo`, which is the default.
- On the locally verified `codex-cli 0.46.0`, `--yolo` is accepted but not shown in `codex exec --help`.
- If you turn `yolo` off, `autodev` falls back to the documented split flags `full_auto` and `dangerously_bypass_approvals_and_sandbox`.
- `ephemeral = true` avoids persisting Codex session data.
Gemini CLI
Uses gemini -p in non-interactive mode.
[backend]
default = "gemini"
[backend.gemini]
model = ""
yolo = true
approval_mode = ""
output_format = "text"
all_files = false
include_directories = ""
debug = false
OpenCode
Uses opencode run in non-interactive mode.
[backend]
default = "opencode"
[backend.opencode]
model = ""
format = "default"
permissions = '{"read":"allow","edit":"allow","bash":"allow","glob":"allow","grep":"allow"}'
log_level = ""
Generated Task File Format
task.json is generated by autodev plan. In normal usage you should inspect it with autodev task list, not hand-edit it.
Each task now has two explicit contracts:
- `verification`: how `autodev` gathers evidence that the implementation changed the right things and passes validation commands.
- `completion`: how `autodev` decides the task is actually complete.
execution is separate from completion semantics:
- `execution.strategy = "single_pass"` is the default for normal delivery work.
- `execution.strategy = "iterative"` is used for bounded metric-driven optimization loops.
A normal delivery-style task looks like this:
{
  "project": "Example Project",
  "tasks": [
    {
      "id": "P0-1",
      "title": "Implement authentication",
      "description": "Add login flow and session validation.",
      "steps": [
        "Add auth service",
        "Implement login endpoint",
        "Write tests"
      ],
      "docs": [],
      "passes": false,
      "blocked": false,
      "block_reason": "",
      "verification": {
        "path_patterns": ["src/auth/*", "tests/test_auth.py"],
        "validate_commands": ["python3 -m unittest"],
        "validate_timeout_seconds": 1800
      },
      "completion": {
        "kind": "boolean",
        "source": "gate",
        "success_when": "all_checks_pass"
      },
      "execution": {
        "strategy": "single_pass"
      },
      "output": ["src/auth.py", "tests/test_auth.py"]
    }
  ]
}
This means ordinary feature work also has an observable completion metric: a boolean completion result derived from the gate.
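For read-only inspection outside the CLI, the `passes` and `blocked` fields are enough to classify each task. The sketch below is an assumption-laden illustration: the state mapping, the file-order selection in `next_pending`, and the extra sample tasks `P0-2` and `P0-3` are hypothetical and may not match how `autodev task list` or `autodev task next` behave.

```python
import json

# Trimmed sample queue; only the fields this sketch reads are included.
TASK_FILE_TEXT = """
{
  "project": "Example Project",
  "tasks": [
    {"id": "P0-1", "title": "Implement authentication", "passes": true, "blocked": false},
    {"id": "P0-2", "title": "Add rate limiting", "passes": false, "blocked": true,
     "block_reason": "waiting for API credentials"},
    {"id": "P0-3", "title": "Write docs", "passes": false, "blocked": false}
  ]
}
"""


def task_state(task):
    """Classify a task using only the `passes` and `blocked` fields."""
    if task.get("passes"):
        return "completed"
    if task.get("blocked"):
        return "blocked"
    return "pending"


def next_pending(task_file):
    """Return the first pending task in file order, or None if there is none."""
    for task in task_file["tasks"]:
        if task_state(task) == "pending":
            return task
    return None


data = json.loads(TASK_FILE_TEXT)
states = [(task["id"], task_state(task)) for task in data["tasks"]]
print(states)
print(next_pending(data)["id"])  # P0-3
```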
Command Reference
Top-level commands:
- `autodev init`: initialize a project and scaffold config, logs, and agent wrapper files
- `autodev run`: execute pending tasks with the configured backend
- `autodev task`: inspect or manage tasks in `task.json`
- `autodev plan`: primary planning command; generate `task.json` from intent text, stdin, or a requirements/spec file
- `autodev spec`: explicitly generate a COCA spec when you want to review the intermediate spec before planning
- `autodev verify`: run task completion verification manually
- `autodev status`: show the current run state, queue counts, and task statuses
- `autodev web`: launch the web dashboard for multi-project management (requires `pip install autodev[web]`)
- `autodev list`: show all running detached autodev tmux sessions
- `autodev attach`: attach to a running detached session
- `autodev stop`: stop a running detached session
Most common examples:
autodev init ./my-project --name "My Project"
autodev init ./my-project --use codex
autodev plan --intent "Build a FastAPI todo service with SQLite and tests."
autodev plan -f docs/prd.md
autodev spec -f docs/prd.md
autodev run
autodev run --dry-run
autodev run --backend codex
autodev run --backend claude
autodev run --backend gemini
autodev run --backend opencode
autodev run --epochs 5 --max-retries 10
autodev run --detach
autodev run --detach --epochs 5 --max-retries 20
autodev list
autodev attach autodev-my-project
autodev stop autodev-my-project
autodev stop --all
autodev status
autodev web
autodev verify P0-1 --changed-file src/auth.py --changed-file tests/test_auth.py
For metric-driven iterative tasks, the CLI entrypoint is the same:
autodev plan --intent "Optimize benchmark latency with a measurable JSON metric."
autodev task list
autodev run --dry-run
autodev run --epochs 3 --max-retries 5
autodev status
autodev web
autodev run Parameter Table
| CLI parameter | autodev.toml default | Meaning | Example |
|---|---|---|---|
| `--backend {claude,codex,gemini,opencode}` | `[backend].default` | Select the backend for this run only | `autodev run --backend codex` |
| `--max-tasks N` | `[run].max_tasks` | Stop after processing at most N tasks in this run | `autodev run --max-tasks 3` |
| `--max-retries N` | `[run].max_retries` | Retry each task at most N times before marking it blocked | `autodev run --max-retries 10` |
| `--epochs N` | `[run].max_epochs` | Run up to N workflow epochs; after one queue is exhausted, autodev can re-plan the next queue automatically | `autodev run --epochs 5 --max-retries 10` |
| `--detach` | none | Run in a background tmux session instead of the foreground | `autodev run --detach --epochs 3` |
| `--dry-run` | none | Preview prompts and queue behavior without calling the backend | `autodev run --dry-run` |
Related run-time config in autodev.toml:
| Config key | Default | Meaning |
|---|---|---|
| `[run].max_retries` | `3` | Default retry count per task |
| `[run].max_tasks` | `999` | Default max tasks per run |
| `[run].max_epochs` | `1` | Default max workflow epochs per run |
| `[run].heartbeat_interval` | `20` | Seconds between heartbeat updates while a task is running |
| `[run].delay_between_tasks` | `2` | Seconds to wait before the next retry or next task |
| `[reflection].enabled` | `true` | Enable failed-attempt reflection and task refinement |
| `[reflection].max_refinements_per_task` | `3` | Limit how many times one task can auto-refine itself |
Workflow Epochs
autodev run --epochs N adds a higher-level autonomous workflow loop:
- use the current runtime queue in
task.json - execute tasks
- reflect and learn during task execution
- when the current queue is exhausted, re-plan the next queue
- continue until the epoch limit is reached or no further tasks remain
This is different from task retries:
- `--max-retries`: task-level iteration inside one task
- `--epochs`: workflow-level iteration across `plan -> tasks -> dev`
Recommended unattended command:
autodev run --detach --epochs 5 --max-retries 20 --max-tasks 999
Notes:
- `epochs` works best when `task.json` was generated by `autodev plan`, because `autodev` needs the persisted `planning_source` metadata to re-plan automatically.
- If a queue still has pending tasks at the end of one epoch, the next epoch continues that queue instead of re-planning immediately.
- If re-planning produces zero new tasks, `autodev` stops early even if the epoch limit was larger.
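The difference between the two loops is easier to see as nested iteration. The following is a simplified model, not autodev's scheduler: `plan_queue` and `attempt_task` are hypothetical stand-ins for planning and backend execution, `max_retries` is treated as the number of attempts per task, and carrying pending tasks over between epochs is left out.

```python
def run_workflow(plan_queue, attempt_task, max_epochs, max_retries):
    """Run up to `max_epochs` queues, trying each task up to `max_retries` times.

    `plan_queue(epoch)` returns a list of task ids for that epoch (empty to stop).
    `attempt_task(task_id, attempt)` returns True when the attempt succeeds.
    """
    completed, blocked = [], []
    for epoch in range(1, max_epochs + 1):  # workflow-level loop (--epochs)
        queue = plan_queue(epoch)
        if not queue:
            break  # re-planning produced no tasks: stop before the epoch limit
        for task_id in queue:
            for attempt in range(1, max_retries + 1):  # task-level loop (--max-retries)
                if attempt_task(task_id, attempt):
                    completed.append(task_id)
                    break
            else:
                # Every attempt failed: mark blocked and continue with the next task.
                blocked.append(task_id)
    return completed, blocked


queues = {1: ["A", "B"], 2: ["C"]}
# "B" never succeeds; every other task succeeds on its second attempt.
result = run_workflow(
    plan_queue=lambda epoch: queues.get(epoch, []),
    attempt_task=lambda task_id, attempt: task_id != "B" and attempt >= 2,
    max_epochs=5,
    max_retries=3,
)
print(result)  # (['A', 'C'], ['B'])
```

In this model the run ends after two epochs even though five were allowed, because the third planning call returns an empty queue.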
Detached Mode
autodev run --detach launches the run inside a background tmux session. This is the recommended way to run unattended overnight builds instead of nohup.
autodev run --detach --epochs 5 --max-retries 20
Each detached session is named autodev-<project-name> (derived from [project].name in autodev.toml). You can manage sessions with:
autodev list # show all running autodev sessions
autodev attach autodev-my-project # attach to watch live output
autodev stop autodev-my-project # stop a specific session
autodev stop --all # stop all autodev sessions
You can also use tmux directly:
tmux attach -t autodev-my-project # attach
tmux kill-session -t autodev-my-project # kill
Multiple projects in parallel: Each project runs in its own isolated tmux session with its own backend process (codex in these examples), task queue, and working directory. Because the sessions share no files, they do not conflict with one another.
cd /path/to/project-a && autodev run --detach --epochs 3
cd /path/to/project-b && autodev run --detach --epochs 2
cd /path/to/project-c && autodev run --detach
autodev list # shows all three sessions
Related config in autodev.toml:
| Config key | Default | Meaning |
|---|---|---|
| `[detach].tmux_session_prefix` | `"autodev"` | Prefix for tmux session names |
Prerequisite: tmux must be installed and available in PATH.
autodev task subcommands:
- `autodev task list`: show all tasks and their current status with live running-task detection
- `autodev task next`: show the next pending task
- `autodev task reset`: reset selected tasks, or all tasks, back to pending
- `autodev task retry`: reset only blocked tasks back to pending
- `autodev task block`: mark a task as blocked manually
Task command examples:
autodev task list
autodev task next
autodev task reset --ids P0-1,P0-2
autodev task retry
autodev task retry --ids P1-3,P1-4
autodev task block P1-4 "waiting for API credentials"
Live Monitoring
CLI output now uses color-coded task state badges so you can quickly distinguish:
- `PENDING`: the task has not started yet
- `RUNNING`: the task has started and is actively executing
- `WAITING`: the task has started, but is currently waiting for model output or the next action
- `COMPLETED`: the task finished successfully
- `BLOCKED`: the task exhausted retries or hit a blocker
- `RETRY`: the current attempt failed and `autodev` is preparing the next retry
Useful monitoring commands:
autodev status
autodev task list
autodev web
Recommended local workflow:
- In terminal 1, start development:
autodev run --detach
- Start the web dashboard:
autodev web
- Open the dashboard in your browser:
http://127.0.0.1:8080
The web dashboard shows all projects, their task queues, current tasks, and live logs. It auto-refreshes every few seconds.
If you only want to check generated per-project status files without running the web server:
xdg-open logs/dashboard.html
The per-project status files are written during execution:
- `logs/dashboard.html`: per-project HTML snapshot (auto-refreshes)
- `logs/runtime-status.json`: machine-readable live snapshot
C++ And CUDA Verification
autodev now includes more stable defaults for C++ and CUDA projects:
- verification commands default to a longer timeout with `verification.validate_timeout_seconds = 1800`
- snapshot filtering ignores common out-of-source build paths such as `build-*`, `cmake-build-*`, and `out-*`
- common compiled artifacts such as `*.o`, `*.so`, `*.a`, `*.ptx`, and `*.cubin` are ignored in changed-file tracking by default
- snapshot filtering can optionally track only relevant source/header paths via `snapshot.include_path_globs`
- verification commands can run from a dedicated directory and with explicit environment variables via `validate_working_directory` and `validate_environment`
Recommended verification style for C++ / CUDA tasks:
- keep `verification.path_patterns` focused on source, headers, CMake files, and tests
- use explicit out-of-source build commands in `verification.validate_commands`
- use `validate_working_directory` when the real project root is a subdirectory
- use `validate_environment` when CUDA or toolchain variables must be injected consistently
- avoid treating build artifacts as required task outputs unless that is truly the goal
Example:
{
  "verification": {
    "path_patterns": [
      "src/**/*.cpp",
      "src/**/*.cu",
      "include/**/*.hpp",
      "include/**/*.cuh",
      "tests/**",
      "CMakeLists.txt",
      "CMakePresets.json"
    ],
    "validate_commands": [
      "cmake --preset dev-debug",
      "cmake --build --preset dev-debug -j",
      "ctest --test-dir build/dev-debug --output-on-failure"
    ],
    "validate_timeout_seconds": 3600,
    "validate_working_directory": "",
    "validate_environment": {
      "CMAKE_BUILD_PARALLEL_LEVEL": "8",
      "CUDAARCHS": "native"
    }
  }
}
Optional source-focused snapshot configuration:
[snapshot]
watch_dirs = ["."]
include_path_globs = [
"src/**/*.cpp",
"src/**/*.cc",
"src/**/*.cu",
"include/**/*.hpp",
"include/**/*.hh",
"include/**/*.cuh",
"tests/**",
"CMakeLists.txt",
"CMakePresets.json",
]
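The changed-file tracking described in this section can be pictured as a diff between two directory snapshots with build output filtered out. The sketch below is one possible approach under stated assumptions: content hashing with SHA-256, matching ignore patterns against single path components with `fnmatch`, and the helper names `take_snapshot` and `changed_files` are all illustrative choices, not autodev internals. It does not model `include_path_globs`.

```python
import fnmatch
import hashlib
import os
import tempfile

# Ignore patterns taken from the defaults listed above.
IGNORED_DIRS = ["build-*", "cmake-build-*", "out-*"]
IGNORED_FILES = ["*.o", "*.so", "*.a", "*.ptx", "*.cubin"]


def _matches(name, patterns):
    return any(fnmatch.fnmatch(name, pattern) for pattern in patterns)


def take_snapshot(root):
    """Map each tracked file's relative path to a SHA-256 digest of its content."""
    snapshot = {}
    for current, dirs, files in os.walk(root):
        # Prune ignored build directories in place so os.walk skips them.
        dirs[:] = [d for d in dirs if not _matches(d, IGNORED_DIRS)]
        for name in files:
            if _matches(name, IGNORED_FILES):
                continue
            path = os.path.join(current, name)
            with open(path, "rb") as handle:
                digest = hashlib.sha256(handle.read()).hexdigest()
            snapshot[os.path.relpath(path, root)] = digest
    return snapshot


def changed_files(before, after):
    """Return sorted paths that were added, removed, or modified."""
    paths = before.keys() | after.keys()
    return sorted(p for p in paths if before.get(p) != after.get(p))


with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "src"))
    os.makedirs(os.path.join(root, "build-debug"))
    with open(os.path.join(root, "src", "main.cpp"), "w") as f:
        f.write("int main() { return 0; }\n")
    before = take_snapshot(root)
    # Modify a source file, then add an object file and build output.
    with open(os.path.join(root, "src", "main.cpp"), "w") as f:
        f.write("int main() { return 1; }\n")
    with open(os.path.join(root, "src", "main.o"), "w") as f:
        f.write("object")
    with open(os.path.join(root, "build-debug", "cache.txt"), "w") as f:
        f.write("cache")
    result = changed_files(before, take_snapshot(root))

print(result)  # only the modified source file is reported
```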
Iterative Self-Improvement
autodev run now models task completion through explicit completion and execution contracts.
Boolean completion for delivery work
Normal feature and bug-fix tasks use:
- `completion.kind = "boolean"`
- `completion.source = "gate"`
- `completion.success_when = "all_checks_pass"`
- `execution.strategy = "single_pass"`
That means delivery work also has an observable completion metric. The metric is boolean: the task is complete only when the unified gate result says the completion contract is met.
The normal unattended workflow remains:
- execute the task
- verify changed files and validation commands
- evaluate boolean completion through the gate
- reflect on failures without changing the task goal
- retry up to the configured limit
- mark the task completed or blocked
Strict task audit
Before autodev accepts generated tasks or refined tasks, it now runs a mechanical audit.
This prevents weak or ambiguous tasks from silently entering the queue.
At minimum, each task must have:
- `id`
- `title`
- `description`
- non-empty `steps`
- meaningful `verification`
Generated or refined tasks are rejected when they do things like:
- omit `description`
- leave `steps` empty or too weak to execute
- remove existing verification strength
- pre-mark work as `passes = true` or `blocked = true`
- omit a valid `completion` contract
- define invalid numeric completion without a usable machine-readable metric
- define `execution.strategy = "iterative"` without numeric completion
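A mechanical audit along these lines could look roughly like the sketch below. It is a hypothetical approximation: the function name `audit_task`, the problem messages, and the reading of "meaningful verification" as "has path patterns or validation commands" are assumptions, and the rule about weakened verification is omitted because it needs the previous task version for comparison.

```python
def audit_task(task):
    """Return a list of audit problems; an empty list means the task is accepted."""
    problems = []
    for field in ("id", "title", "description"):
        if not str(task.get(field, "")).strip():
            problems.append(f"missing {field}")
    if not task.get("steps"):
        problems.append("steps must be non-empty")
    verification = task.get("verification") or {}
    # "Meaningful" is approximated as having path patterns or validation commands.
    if not (verification.get("path_patterns") or verification.get("validate_commands")):
        problems.append("verification is missing or empty")
    if task.get("passes") or task.get("blocked"):
        problems.append("task is pre-marked as passed or blocked")
    completion = task.get("completion") or {}
    if completion.get("kind") not in ("boolean", "numeric"):
        problems.append("missing a valid completion contract")
    if completion.get("kind") == "numeric" and not completion.get("json_path"):
        problems.append("numeric completion needs a machine-readable metric")
    strategy = (task.get("execution") or {}).get("strategy", "single_pass")
    if strategy == "iterative" and completion.get("kind") != "numeric":
        problems.append("iterative execution requires numeric completion")
    return problems


good_task = {
    "id": "P0-1",
    "title": "Implement authentication",
    "description": "Add login flow and session validation.",
    "steps": ["Add auth service"],
    "verification": {"validate_commands": ["python3 -m unittest"]},
    "completion": {"kind": "boolean", "source": "gate"},
    "execution": {"strategy": "single_pass"},
}
print(audit_task(good_task))  # []
```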
Reflection constraints
When an attempt fails, autodev may refine only the execution guidance, not the goal itself.
The following fields stay fixed:
- `id`
- `title`
- `description`
- `completion`
- `execution`
Reflection may refine:
- `steps`
- `docs`
- `output`
- `implementation_notes`
- `verification_notes`
- `learning_notes`
- `verification.*`
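The fixed/refinable split above amounts to a simple invariant between the task before and after reflection. As a minimal sketch (the helper name `violated_fixed_fields` is hypothetical and this is not autodev's own check), the invariant can be expressed as:

```python
# Fields that reflection must leave untouched, per the list above.
FIXED_FIELDS = ("id", "title", "description", "completion", "execution")


def violated_fixed_fields(original, refined):
    """Return the fixed fields whose values differ between the two task dicts."""
    return [name for name in FIXED_FIELDS if original.get(name) != refined.get(name)]


original = {
    "id": "P1-3",
    "title": "Tune latency",
    "description": "Reduce end-to-end latency for the benchmark path.",
    "steps": ["Measure the current baseline"],
    "completion": {"kind": "numeric"},
    "execution": {"strategy": "iterative"},
}

# Refining `steps` is allowed and changes no fixed field.
allowed = dict(original, steps=["Measure the current baseline", "Profile the hot loop"])
# Rewriting `title` or `completion` changes the goal and should be rejected.
rejected = dict(original, title="Rewrite parser", completion={"kind": "boolean"})

print(violated_fixed_fields(original, allowed))   # []
print(violated_fixed_fields(original, rejected))  # ['title', 'completion']
```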
Each failed or completed attempt is recorded into:
- `task.attempt_history`
- `task.learning_notes`
- top-level `learning_journal`
The next attempt prompt automatically includes recent task and project learnings.
When --epochs is greater than 1, autodev can also re-plan a fresh task queue for the next workflow epoch after the current queue is exhausted.
Default reflection config:
[reflection]
enabled = true
max_refinements_per_task = 3
prompt_timeout_seconds = 180
log_tail_lines = 80
max_attempt_history_entries = 12
max_learning_notes = 20
max_project_learning_entries = 50
prompt_learning_limit = 6
This makes autodev behave more like an unattended engineering loop:
- try the task
- verify the result
- diagnose what went wrong
- strengthen the task guidance and verification
- retry with the new learning context
- when an epoch finishes, generate the next task queue if more work remains
Numeric Completion Tasks
Use numeric completion when the goal is not just “make the task pass”, but “improve or hit a measurable metric with bounded autonomous iterations”.
Typical use cases:
- tuning latency or throughput
- reducing benchmark time
- improving score-based output quality when the score is machine-readable
- self-improving `autodev` itself on objective verification loops
Numeric completion requirements
A metric-driven iterative task should define:
- normal `verification.validate_commands`
- `completion.kind = "numeric"`
- a machine-readable metric source
- `execution.strategy = "iterative"`
Current MVP metric support is intentionally strict:
- `completion.source = "json_stdout"`
- `completion.direction = "lower_is_better"` or `"higher_is_better"`
- `completion.json_path` must point to a numeric value in stdout JSON
Example task:
{
  "id": "P1-1",
  "title": "Tune latency",
  "description": "Reduce end-to-end latency for the benchmark path.",
  "steps": [
    "Measure the current baseline",
    "Make one focused optimization per iteration"
  ],
  "docs": ["docs/benchmarks/latency.md"],
  "passes": false,
  "blocked": false,
  "verification": {
    "path_patterns": ["src/**/*.py", "tests/**/*.py"],
    "validate_commands": ["python3 scripts/benchmark_latency.py"],
    "validate_timeout_seconds": 1800
  },
  "completion": {
    "kind": "numeric",
    "name": "latency_ms",
    "source": "json_stdout",
    "json_path": "$.metrics.latency_ms",
    "direction": "lower_is_better",
    "min_improvement": 1,
    "unchanged_tolerance": 0
  },
  "execution": {
    "strategy": "iterative",
    "max_iterations": 5,
    "rollback_on_failure": true,
    "keep_on_equal": false,
    "commit_prefix": "experiment",
    "stop_after_no_improvement": 2,
    "stop_after_invalid": 2
  }
}
Validation command output for numeric completion
For json_stdout metrics, the validation command must print JSON to stdout.
For example:
{
  "metrics": {
    "latency_ms": 95.2
  }
}
If json_path = "$.metrics.latency_ms", autodev extracts 95.2 and compares it to the baseline or best-so-far result.
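The extraction step can be sketched in a few lines. This illustration supports only simple dotted paths such as `$.metrics.latency_ms`; whether autodev accepts richer JSONPath syntax is not stated here, and the function name `extract_metric` is an assumption.

```python
import json


def extract_metric(stdout_text, json_path):
    """Resolve a simple `$.a.b.c` path against JSON printed on stdout.

    Returns the value as a float, or None when the output is not valid JSON,
    the path is missing, or the value is not numeric.
    """
    if not json_path.startswith("$."):
        return None
    try:
        node = json.loads(stdout_text)
    except json.JSONDecodeError:
        return None
    for key in json_path[2:].split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(node, bool) or not isinstance(node, (int, float)):
        return None
    return float(node)


stdout_text = '{"metrics": {"latency_ms": 95.2}}'
print(extract_metric(stdout_text, "$.metrics.latency_ms"))  # 95.2
```

Returning `None` for unusable output gives the caller a single signal it can treat as an invalid measurement.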
Iterative execution flow
For metric-driven iterative tasks, autodev runs a bounded optimization loop:
- run validation first to collect a baseline metric
- ask the backend to make one focused change
- create a task-scoped experiment commit before comparison
- run normal verification plus metric extraction
- classify the result as `improved`, `unchanged`, `regressed`, or `invalid`
- keep or revert the commit according to policy
- record the iteration in `logs/experiments.jsonl`
- stop when the iteration limit or stop threshold is reached
Current keep/revert behavior:
- `improved`: keep the change
- `unchanged`: keep only if `keep_on_equal = true`, otherwise revert
- `regressed`: revert when `rollback_on_regression = true`, otherwise block for manual review
- `invalid`: revert when `rollback_on_regression = true`, otherwise block for manual review
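The classification and keep/revert policy can be modeled as two small functions. The exact semantics of `min_improvement` and `unchanged_tolerance` are not spelled out in this document, so the interpretation below is an assumption: a gain of at least `min_improvement` counts as improved, a loss larger than `unchanged_tolerance` counts as regressed, and anything in between counts as unchanged. The function names `classify` and `decide` are hypothetical, and the flag name `rollback_on_regression` follows the list above.

```python
def classify(best, measured, direction, min_improvement=0.0, unchanged_tolerance=0.0):
    """Classify one iteration as improved, unchanged, regressed, or invalid."""
    if measured is None:
        return "invalid"  # the metric could not be extracted
    # Express the change so that a positive gain always means "better".
    gain = best - measured if direction == "lower_is_better" else measured - best
    if gain > 0 and gain >= min_improvement:
        return "improved"
    if gain < -unchanged_tolerance:
        return "regressed"
    return "unchanged"


def decide(outcome, keep_on_equal=False, rollback_on_regression=True):
    """Map an outcome to keep, revert, or block, following the list above."""
    if outcome == "improved":
        return "keep"
    if outcome == "unchanged":
        return "keep" if keep_on_equal else "revert"
    # regressed and invalid share the same policy
    return "revert" if rollback_on_regression else "block"


outcome = classify(100.0, 95.2, "lower_is_better", min_improvement=1)
print(outcome, decide(outcome))  # improved keep
```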
Completion observability
While a task is running, autodev now exposes completion context in prompts and runtime status.
For every task:
- `completion_kind`
- `completion_name`
- `completion_target_summary`
- `last_completion_outcome`
For iterative numeric tasks it also exposes:
- baseline metric
- best metric so far
- last measured metric
- last outcome
- kept count
- reverted count
- no-improvement streak
- recent experiment history
- recent git history
You can monitor this through:
- `logs/experiments.jsonl`
- `logs/runtime-status.json`
- `logs/dashboard.html`
- `autodev status`
- `autodev web`
Failure Recovery
When a task fails, there are three different retry layers:
- `autodev run --max-retries N`: retry inside the same run before the task becomes blocked
- `autodev task retry`: re-open blocked tasks only, then run again
- `autodev task reset`: force any selected task back to pending, including completed tasks
Recommended recovery commands:
autodev task list
autodev task retry
autodev run --epochs 1
Retry only one blocked task:
autodev task retry --ids P1-3
autodev run --epochs 1
Force a full or selective reset:
autodev task reset --ids P1-3
autodev run --epochs 1
autodev task retry and autodev task reset now create a timestamped task.json.bak.<UTCSTAMP> backup automatically before they write changes, so you no longer need a separate --backup flag.
Runtime Behavior
For each task, autodev run does the following:
- Loads the next pending task from `task.json`.
- Renders the prompt from the task, recent attempt history, project learnings, and iterative execution context when applicable.
- Launches the selected backend in non-interactive mode.
- Captures output into the main log and the per-attempt log.
- Computes changed files using filesystem snapshots.
- Runs `verification` checks and computes a unified `completion` result.
- If the task uses `execution.strategy = "iterative"`, collects a baseline metric, performs commit-before-compare, and automatically keeps or reverts iterations according to policy.
- If an attempt fails, reflects on the failure and refines the current task guidance when possible.
- Updates `logs/runtime-status.json` and `logs/dashboard.html` for live monitoring, including completion observability.
- Appends structured execution and learning records to `progress.txt`.
- Marks the task complete or blocked.
- Creates a git commit if auto-commit is enabled and the directory is a git repo.
When --epochs N is greater than 1, autodev run also does this between epochs:
- Detects whether the current queue is exhausted.
- Uses persisted `planning_source` metadata plus the learning journal to re-plan the next queue.
- Writes the next epoch's `task.json`.
- Continues until the epoch limit is reached or no further tasks are generated.
Exit Codes
- `0`: Run completed without blocked tasks or environment errors.
- `1`: Environment or runtime failure, such as missing CLI dependencies or log write errors.
- `2`: The run completed, but at least one task is blocked.
- `130`: Interrupted by signal.
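When autodev is driven from a wrapper script or CI job, these exit codes can be mapped to messages. The sketch below is a hypothetical wrapper: `describe_exit` and `run_and_report` are illustrative names, and `run_and_report` assumes `autodev` is installed and on PATH.

```python
import subprocess

# Meanings copied from the exit-code list above.
EXIT_MEANINGS = {
    0: "completed without blocked tasks or environment errors",
    1: "environment or runtime failure",
    2: "completed, but at least one task is blocked",
    130: "interrupted by signal",
}


def describe_exit(code):
    """Translate an autodev exit code into a short human-readable message."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_and_report(args=("autodev", "run")):
    """Run autodev and return (exit code, description)."""
    completed = subprocess.run(list(args), check=False)
    return completed.returncode, describe_exit(completed.returncode)


print(describe_exit(2))  # completed, but at least one task is blocked
```

Treating exit code `2` separately from `1` lets a wrapper distinguish "the run finished but needs attention" from "the run could not proceed".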
Running Tests
Run the test suite from the repository root:
python3 -m unittest discover -s tests -v
Tips
- Start with `autodev run --dry-run` before handing over a real task queue.
- Keep each task small enough for one model session.
- Add `verification.validate_commands` wherever possible so the agent has an objective success condition.
- Prefer `autodev task list` to inspect the current queue; if a task keeps refining itself, it usually means the task should be split.
- If you are using auto-commit, run inside a git repo and keep unrelated local changes out of the worktree.