Multi-provider autonomous coding agent harness with API rotation pool

🔥 claw-forge

Autonomous coding agent harness for serious Python projects.

Multi-provider API rotation · Claude Agent SDK core · 18 pre-installed skills · Pure asyncio · Zero Node.js

What it does

claw-forge runs autonomous coding agents that implement features, fix bugs, write tests, and review code — in parallel, across multiple AI providers, with live progress tracked in a Kanban UI.

It is built on the Claude Agent SDK as its execution engine, with a provider pool layer on top for API key rotation, circuit breaking, and cost tracking.

Quick Start

# 1. Install
pip install --pre claw-forge   # or: uv tool install claw-forge --prerelease allow

# 2. Bootstrap your project (scaffolds .claude/, CLAUDE.md, config, and app_spec.example.xml)
mkdir my-api && cd my-api
claw-forge init
# Output:
# ✓ Created claw-forge.yaml   (edit providers/API keys)
# ✓ Created .env.example      (copy to .env and fill keys)
# ✓ Created .claude/ with settings.json
# ✓ Created app_spec.example.xml  (XML format reference)
# ✓ Scaffolded 8 slash commands → .claude/commands/

# 3. Add your API key
cp .env.example .env
echo "ANTHROPIC_API_KEY=sk-ant-..." >> .env

# 4. Create your spec — two options:
#   Option A: Open Claude Code in this dir and run /create-spec
#   Option B: Paste your PRD into Claude with this prompt:
#     "Convert this PRD to claw-forge XML. Use app_spec.example.xml as the template.
#      Write the result to app_spec.txt."

# 5. Validate spec quality (structural + LLM checks)
claw-forge validate-spec app_spec.txt

# 6. Parse spec → feature DAG (uses Opus for accurate planning)
claw-forge plan app_spec.txt
# Output:
# ✅ Spec parsed: my-api
# ┌──────────────────────┬───────┐
# │ Authentication       │    8  │
# │ Task Management      │   22  │
# │ Total                │   30  │
# └──────────────────────┴───────┘
# Dependency waves: 3
# Estimated run time: ~12 minutes (at concurrency=5)
# Next: claw-forge run --concurrency 5

# 7. Run agents
claw-forge run --concurrency 5

# 8. Open the Kanban board
claw-forge ui

Writing a Project Spec

The spec is the single input that drives everything. The initializer agent reads it, breaks it into atomic tasks, infers dependencies, and creates an execution plan. The quality of your spec directly determines the quality of what gets built.

claw-forge supports three spec formats:

Format	When to use	Command
Plain text (`app_spec.txt`)	Quick greenfield projects	`claw-forge plan app_spec.txt`
XML (`app_spec.xml`)	Richer greenfield specs with phases, DB schema, API summary	`claw-forge plan app_spec.xml`
Brownfield XML (`additions_spec.xml`)	Adding features to an existing codebase	`claw-forge add --spec additions_spec.xml`

The /create-spec slash command generates whichever format is appropriate — it auto-detects whether you're in a greenfield or brownfield project and runs the matching conversational flow.

5 ways to create a spec

Option 1 — Write it yourself (best control)

Project: my-api
Stack:   Python 3.12, FastAPI, SQLAlchemy, pytest

Description:
  A REST API for managing tasks with JWT auth, tag filtering, and reminders.

Features:

1. User authentication
   Description: JWT-based register/login/logout using bcrypt.
     Access tokens expire in 1h, refresh tokens in 7d.
   Acceptance criteria:
   - POST /auth/register creates user, returns 201 with user_id
   - POST /auth/login returns {access_token, refresh_token, expires_in}
   - POST /auth/refresh exchanges refresh_token for new access_token
   - POST /auth/logout invalidates the refresh token
   - Passwords hashed with bcrypt (cost factor 12)
   - 15 unit tests, all passing
   Tech notes: Use python-jose for JWT, passlib for bcrypt.

2. Task CRUD
   Description: Full CRUD for tasks with title, description, status
     (todo/in_progress/done), priority (1–5), due_date, and tags.
   Acceptance criteria:
   - POST /tasks — create, returns 201
   - GET /tasks  — list with pagination (?page=1&per_page=20)
   - PATCH /tasks/{id} — partial update
   - DELETE /tasks/{id} — soft delete (sets deleted_at)
   - All endpoints require valid JWT (401 if missing)
   - Users can only access their own tasks (403 otherwise)
   Depends on: 1

3. Integration test suite
   Description: End-to-end API tests using pytest + httpx AsyncClient.
     Runs against in-memory SQLite — no external dependencies.
   Acceptance criteria:
   - Coverage ≥ 90% across all modules
   - Tests cover auth flow, CRUD, and error cases
   - All tests pass: pytest tests/ -v
   Depends on: 1, 2

Option 2 — XML spec (recommended for complex projects)

XML gives you richer structure: implementation phases, DB schema, API endpoint summary, design system, and success criteria. The parser generates 100–400 granular features automatically.

<!-- app_spec.xml -->
<project_specification>
  <project_name>my-api</project_name>
  <overview>A REST API for task management with JWT auth.</overview>
  <technology_stack>Python 3.12, FastAPI, SQLAlchemy, PostgreSQL, pytest</technology_stack>

  <core_features>
    User can register with email and password
    System validates email format and enforces password strength
    User can log in and receive JWT access + refresh tokens
    User can create tasks with title, description, status, and priority
    User can list their tasks with pagination
  </core_features>

  <database_schema>
    Users: id, email, password_hash, created_at
    Tasks: id, owner_id, title, description, status, priority, created_at
  </database_schema>

  <implementation_steps>
    <phase name="Auth">User registration, login, JWT tokens</phase>
    <phase name="Tasks">Task CRUD endpoints</phase>
    <phase name="Tests">Integration test suite, coverage ≥ 90%</phase>
  </implementation_steps>

  <success_criteria>
    All endpoints tested, coverage ≥ 90%, ruff + mypy clean
  </success_criteria>
</project_specification>

claw-forge plan app_spec.xml

Use app_spec.example.xml as your starting point — claw-forge init copies it into your project automatically.

Option 3 — Brownfield XML spec (adding features to an existing project)

When you want to add a set of related features to an existing codebase, use additions_spec.xml. The mode="brownfield" attribute tells claw-forge to load brownfield_manifest.json and inject existing conventions into agent context.

<!-- additions_spec.xml -->
<project_specification mode="brownfield">
  <project_name>my-api — Stripe Integration</project_name>
  <addition_summary>Add Stripe payment processing to the existing FastAPI app.</addition_summary>

  <existing_context>
    <!-- Auto-populated by /create-spec when brownfield_manifest.json exists -->
    <stack>Python / FastAPI / PostgreSQL</stack>
    <test_baseline>47 tests passing, 87% coverage</test_baseline>
    <conventions>snake_case, async handlers, pydantic v2 models</conventions>
  </existing_context>

  <features_to_add>
    User can add a payment method via Stripe Elements
    System charges stored payment method on subscription creation
    Admin can issue refunds from the dashboard
  </features_to_add>

  <integration_points>
    Extends User model with stripe_customer_id field
    Adds /payments router alongside existing /auth and /tasks routers
  </integration_points>

  <constraints>
    Must not modify existing auth flow
    All 47 existing tests must stay green
    Follow existing async handler pattern in routers/
  </constraints>

  <implementation_steps>
    <phase name="Stripe Setup">Stripe client, webhook handler, config</phase>
    <phase name="Payment Methods">Add/remove cards, default card</phase>
    <phase name="Subscriptions">Create, cancel, webhook events</phase>
  </implementation_steps>
</project_specification>

claw-forge add --spec additions_spec.xml

Use skills/app_spec.brownfield.template.xml as a starting point.

Option 4 — Interactive with /create-spec (easiest)

Open Claude Code in your project directory and type /create-spec. It auto-detects the mode:

No brownfield_manifest.json → greenfield flow: walks through project name, stack, features, DB schema, API endpoints — outputs app_spec.xml + claw-forge.yaml
brownfield_manifest.json exists → brownfield flow: pre-populates <existing_context> from the manifest, asks what to add and what constraints apply — outputs additions_spec.xml

# In Claude Code (greenfield — new project):
/create-spec

# In Claude Code (brownfield — after claw-forge analyze):
claw-forge analyze     # generates brownfield_manifest.json first
/create-spec           # detects manifest → runs brownfield flow automatically

Option 5 — From an existing doc

Paste a PRD, Notion export, or detailed README into Claude and ask:

"Convert this into a claw-forge app_spec.txt. Break each requirement into a concrete feature with acceptance criteria and Depends on: links."

Spec format reference

Plain text fields:

Field	Required	Description
`Project:`	✅	Project name
`Stack:`	✅	Languages, frameworks, databases
`Description:`	optional	One paragraph overview
`Features:`	✅	List of numbered features
`Acceptance criteria:`	✅	Bullet list — each item is a verifiable condition
`Depends on:`	optional	Comma-separated feature numbers
`Tech notes:`	optional	Library preferences, patterns, constraints

XML elements (greenfield app_spec.xml):

Element	Required	Description
`<project_name>`	✅	Project name
`<overview>`	✅	One paragraph summary
`<technology_stack>`	✅	Languages, frameworks, databases
`<core_features>`	✅	One feature per line, action-verb format
`<database_schema>`	optional	Table definitions
`<api_endpoints_summary>`	optional	Route list
`<ui_layout>`	optional	Page/component structure
`<implementation_steps>`	optional	`<phase name="...">` blocks — features grouped by phase
`<success_criteria>`	optional	Done conditions

XML elements (brownfield additions_spec.xml, mode="brownfield"):

Element	Required	Description
`<project_name>`	✅	Project + addition name
`<addition_summary>`	✅	What you're adding and why
`<existing_context>`	optional	`<stack>`, `<test_baseline>`, `<conventions>` — auto-populated by `/create-spec`
`<features_to_add>`	✅	One feature per line (same format as `<core_features>`)
`<integration_points>`	optional	Where new code hooks into existing code
`<constraints>`	optional	What must NOT change
`<implementation_steps>`	optional	`<phase name="...">` blocks

Tips for a good spec

One feature = one atomic unit of work. If it would take >2 hours to implement, split it.
Acceptance criteria are tests. Write them as if the agent must tick each one before marking done.
Use Depends on: for real hard dependencies. Features with no dependencies run in parallel in Wave 1.
Start with 5–10 features. Add more with /expand-project once the first wave is running.
Be specific about libraries. "Use python-jose for JWT" beats "implement JWT".
Include test count targets. "15 unit tests" gives the agent a concrete goal.
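
The wave behaviour mentioned in the tips above can be pictured as a layered topological sort over the Depends on: links. The following is a minimal sketch under that assumption, not claw-forge's actual planner; `dependency_waves` and its input mapping are hypothetical names used only for illustration.

```python
from typing import Dict, List, Set


def dependency_waves(depends_on: Dict[int, Set[int]]) -> List[List[int]]:
    """Group feature numbers into waves; each wave depends only on earlier waves.

    Hypothetical helper for illustration; not part of the claw-forge API.
    """
    remaining = {feature: set(deps) for feature, deps in depends_on.items()}
    done: Set[int] = set()
    waves: List[List[int]] = []
    while remaining:
        # A feature is ready once every one of its dependencies has completed.
        ready = sorted(f for f, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError(f"cycle or unknown dependency among: {sorted(remaining)}")
        waves.append(ready)
        done.update(ready)
        for feature in ready:
            del remaining[feature]
    return waves


# Features 1-3 from the plain-text example: 2 depends on 1, 3 depends on 1 and 2.
print(dependency_waves({1: set(), 2: {1}, 3: {1, 2}}))
```

Under this model, removing an unnecessary Depends on: link moves a feature into an earlier wave, which is why only hard dependencies should be declared.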

Adding features to a running project

claw-forge plan reconciles by default — it matches spec features against existing tasks by description, keeps completed/pending/failed tasks untouched, and only inserts new features as pending tasks. Dependencies are wired correctly across old and new tasks.

# Option A — /expand-project slash command in Claude Code
/expand-project
# Claude lists current features, asks what to add, POSTs them atomically

# Option B — edit spec and re-plan (only new features are added)
echo "
4. Email reminders
   Description: Send due-date reminders 24h before via SendGrid.
   Acceptance criteria:
   - Celery task runs hourly, finds tasks due in next 24h
   - Sends email via SendGrid with task title and due date
   - Emails only sent once per task per due date
   Depends on: 2
" >> app_spec.txt
claw-forge plan app_spec.txt --project my-api
# Output:
# Reconciled with existing session:
#   3 completed  (kept)
#   1 new        (added)

# Use --fresh to ignore existing state and start from scratch
claw-forge plan app_spec.txt --project my-api --fresh
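
The reconcile step described above can be sketched as a match on normalized descriptions. This is an illustration only: the `Task` class, `normalize`, and `reconcile` are hypothetical, and the exact matching rule claw-forge applies (here: case-insensitive with collapsed whitespace) is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Task:
    description: str
    status: str  # e.g. "completed", "pending", "failed"


def normalize(description: str) -> str:
    """Assumed matching rule: case-insensitive, whitespace-collapsed."""
    return " ".join(description.lower().split())


def reconcile(existing: List[Task], spec_descriptions: List[str]) -> List[Task]:
    """Keep existing tasks untouched; append unseen spec features as pending."""
    known = {normalize(task.description) for task in existing}
    merged = list(existing)
    for description in spec_descriptions:
        key = normalize(description)
        if key not in known:
            merged.append(Task(description=description, status="pending"))
            known.add(key)
    return merged


existing = [
    Task("User authentication", "completed"),
    Task("Task CRUD", "completed"),
    Task("Integration test suite", "completed"),
]
spec = ["User authentication", "Task CRUD", "Integration test suite", "Email reminders"]
merged = reconcile(existing, spec)
print([(task.description, task.status) for task in merged[len(existing):]])
```

A consequence of description-based matching, if it works this way, is that rewording an existing feature in the spec would make it look new, so edits to shipped features are safer as separate additions.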

Provider Pool

Keep agents running when one provider hits a rate limit. Configure multiple providers with automatic failover:

# claw-forge.yaml
providers:
  # Use your `claude login` token — no API key needed
  claude-oauth:
    type: anthropic_oauth
    priority: 1

  # Direct API key as primary fallback
  anthropic-primary:
    type: anthropic
    api_key: ${ANTHROPIC_KEY_1}
    priority: 2

  # Second API key for burst capacity
  anthropic-secondary:
    type: anthropic
    api_key: ${ANTHROPIC_KEY_2}
    priority: 2

  # Cloud providers for enterprise scale
  aws-bedrock:
    type: bedrock
    region: us-east-1
    priority: 3

  azure-ai:
    type: azure
    endpoint: https://my-resource.openai.azure.com
    api_key: ${AZURE_KEY}
    priority: 4

  # Free-tier providers for lightweight tasks
  groq-free:
    type: openai_compat
    base_url: https://api.groq.com/openai/v1
    api_key: ${GROQ_KEY}
    priority: 5

  # Local model via Ollama
  local-ollama:
    type: ollama
    base_url: http://localhost:11434
    model: qwen2.5-coder
    priority: 6

The pool automatically:

Routes requests through providers in priority order
Backs off per-provider when rate limited (parses Retry-After headers)
Opens per-provider circuit breakers on persistent failures
Tracks cost, latency, and RPM per provider
Falls through the entire chain before giving up

5 routing strategies: priority (default) · round_robin · weighted_random · least_cost · least_latency
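
The priority strategy, per-provider backoff, and circuit breaking listed above can be sketched in a few lines. This is a simplified model, not the `ProviderPoolManager` implementation: `Provider`, `RateLimited`, `route`, and the threshold values are hypothetical, and real provider calls would be async HTTP requests.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional


class RateLimited(Exception):
    """Raised by a provider call; carries the Retry-After delay in seconds."""

    def __init__(self, retry_after: float) -> None:
        super().__init__(f"rate limited, retry after {retry_after}s")
        self.retry_after = retry_after


@dataclass
class Provider:
    name: str
    priority: int
    call: Callable[[str], str]
    failures: int = 0            # consecutive failures that were not rate limits
    blocked_until: float = 0.0   # provider is skipped until this clock value


def route(providers: List[Provider], prompt: str,
          failure_threshold: int = 3, open_seconds: float = 60.0,
          now: Callable[[], float] = time.monotonic) -> str:
    """Try providers in priority order, falling through the chain on failure."""
    last_error: Optional[Exception] = None
    for provider in sorted(providers, key=lambda p: p.priority):
        if now() < provider.blocked_until:
            continue  # still backing off, or circuit breaker is open
        try:
            result = provider.call(prompt)
        except RateLimited as exc:
            # Honour the server's Retry-After hint for this provider only.
            provider.blocked_until = now() + exc.retry_after
            last_error = exc
        except Exception as exc:
            provider.failures += 1
            if provider.failures >= failure_threshold:
                provider.blocked_until = now() + open_seconds  # open the breaker
                provider.failures = 0
            last_error = exc
        else:
            provider.failures = 0
            return result
    raise RuntimeError("all providers unavailable") from last_error
```

The injectable `now` clock is a design choice for testability; the other routing strategies would differ only in how the provider list is ordered before the loop.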

Supported Providers

Provider	Type	Auth
Anthropic direct	`anthropic`	API key
Claude OAuth	`anthropic_oauth`	Auto-reads `~/.claude/.credentials.json`
Anthropic-format proxy	`anthropic_compat`	`x-api-key` or none (internal proxies)
AWS Bedrock	`bedrock`	IAM / instance role
Azure AI Foundry	`azure`	API key
Google Vertex AI	`vertex`	Application Default Credentials
Groq / Cerebras	`openai_compat`	API key
Any OpenAI-compat endpoint	`openai_compat`	Optional API key
Ollama (local)	`ollama`	Optional (usually none)

Agent Runtime

claw-forge uses the Claude Agent SDK for all agent execution. The SDK handles the tool-use loop, MCP server connections, permission hooks, and streaming; claw-forge adds the orchestration, state management, and provider rotation layer on top.

Bidirectional sessions

from pathlib import Path
from claw_forge.agent import AgentSession
from claw_forge.agent.thinking import thinking_for_task
from claw_forge.agent.output import CODE_REVIEW_SCHEMA
from claude_agent_sdk import ClaudeAgentOptions, ResultMessage

options = ClaudeAgentOptions(
    model="claude-sonnet-4-6",
    cwd=Path("./my-project"),
    thinking=thinking_for_task("coding"),      # adaptive by default
    max_budget_usd=1.00,                        # hard cost cap
    betas=["context-1m-2025-08-07"],            # 1M token context
)

async with AgentSession(options) as session:
    # Primary task (handle_message is a placeholder for your own message callback)
    async for msg in session.run("Implement the OAuth2 login flow"):
        handle_message(msg)

    # Tests failed — guide without restarting the session
    async for msg in session.follow_up("Focus on fixing test_auth.py first"):
        handle_message(msg)

    # Refactor went wrong — rewind files to before the last step
    await session.rewind(steps_back=1)

    # Escalate to Opus for the hard security review
    await session.switch_model("claude-opus-4-6")

One-shot queries

from pathlib import Path

from claw_forge.agent import collect_result, collect_structured_result
from claw_forge.agent.output import CODE_REVIEW_SCHEMA

# Simple text result
result = await collect_result(
    "Write unit tests for src/auth.py",
    cwd=Path("./my-project"),
    agent_type="testing",   # gets correct tool list + max_turns
)

# Structured JSON output (schema enforced)
review = await collect_structured_result(
    "Review the PR diff for security issues",
    cwd=Path("./my-project"),
    agent_type="reviewer",
    output_format=CODE_REVIEW_SCHEMA,
)
# review = {"verdict": "request_changes", "blockers": [...], "security_issues": [...]}

Provider pool + agent

from pathlib import Path

from claw_forge.pool import ProviderPoolManager
from claw_forge.agent import collect_result

pool = ProviderPoolManager.from_config("claw-forge.yaml")
provider = await pool.acquire("claude-sonnet-4-6")

result = await collect_result(
    "Fix the failing integration tests",
    cwd=Path("./project"),
    provider_config=provider,  # routes through pool with failover
)

Advanced SDK Features

claw-forge exposes all 20 Claude Agent SDK APIs. See docs/sdk-api-guide.md for detailed examples. Highlights:

Feature	API	Use case
File undo	`enable_file_checkpointing` + `rewind_files()`	Roll back bad refactors
Cost cap	`max_budget_usd`	Hard limit per session
Structured output	`output_format` schema	Typed review verdicts
Thinking depth	`ThinkingConfig`	Deep for planning, off for monitoring
Named sub-agents	`AgentDefinition`	Planner / Coder / Reviewer roles
OS sandbox	`SandboxSettings`	Filesystem + network isolation
Model fallback	`fallback_model`	Auto-retry on model error

Security

Three-layer defence:

CanUseTool callback — Python function runs before every tool; blocks dangerous commands, restricts writes to project directory, can mutate tool inputs
Bash security hook — hardcoded blocklist (sudo, dd, shutdown, ...) + per-project allowed_commands.yaml
SandboxSettings — OS-level bash isolation on macOS/Linux (optional)
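
To make the second layer concrete, here is a rough sketch of how a blocklist check on a bash command might look. It is not the hook claw-forge ships: `BLOCKED`, `is_command_allowed`, and the parsing approach are assumptions, and the check is deliberately naive (it ignores command substitution, environment-variable prefixes, and similar shell features).

```python
import shlex
from typing import Iterable, Set

# Illustrative blocklist; the text names sudo, dd, and shutdown as examples.
BLOCKED: Set[str] = {"sudo", "dd", "shutdown"}
# Tokens after which the next word is treated as a program name.
OPERATORS: Set[str] = {"|", "||", "&&", ";", "&", "("}


def is_command_allowed(command: str, allowed: Iterable[str] = ()) -> bool:
    """Return False if any segment of the command line starts with a blocked program.

    If `allowed` is non-empty, every program must also appear in it
    (a stand-in for a per-project allowlist).
    """
    lexer = shlex.shlex(command, posix=True, punctuation_chars=True)
    lexer.whitespace_split = True
    allowed_set = set(allowed)
    expect_program = True
    for token in lexer:
        if token in OPERATORS:
            expect_program = True
            continue
        if expect_program:
            program = token.rsplit("/", 1)[-1]  # /usr/bin/sudo -> sudo
            if program in BLOCKED:
                return False
            if allowed_set and program not in allowed_set:
                return False
            expect_program = False
    return True


print(is_command_allowed("ls -la | grep py"))
print(is_command_allowed("echo hi && sudo rm -rf /"))
```

Because a token-level check like this is easy to bypass, it only makes sense as one layer among several, which matches the layered design described above.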

Agent lock file (.claw-forge.lock) prevents two agents running on the same project simultaneously.
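
One common way to implement such a lock is exclusive file creation, which fails atomically if the file already exists. The sketch below assumes that mechanism; `project_lock` is a hypothetical helper, and the real lock may store different contents or handle stale locks differently.

```python
import os
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator


@contextmanager
def project_lock(project_dir: Path, name: str = ".claw-forge.lock") -> Iterator[Path]:
    """Hold an exclusive lock file in project_dir for the duration of the block."""
    lock_path = project_dir / name
    try:
        # O_CREAT | O_EXCL makes creation fail if the file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise RuntimeError(f"another agent holds {lock_path}") from None
    with os.fdopen(fd, "w") as handle:
        handle.write(str(os.getpid()))  # record the owner for debugging
    try:
        yield lock_path
    finally:
        lock_path.unlink(missing_ok=True)
```

Usage would be `with project_lock(Path("./my-project")): ...` around the agent run. This sketch does not recover from a crashed process that left its lock behind.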

Commands

CLI Commands

Command	Description	Key flags
`claw-forge init`	Bootstrap project — scaffold `.claude/`, config, `app_spec.example.xml`	`--project`, `--config`
`claw-forge plan`	Parse spec → feature DAG in state DB (reconciles with existing session by default)	`spec` (positional), `--model`, `--concurrency`, `--fresh`
`claw-forge run`	Start agent pool, dispatch features in parallel	`--concurrency`, `--model`, `--yolo`, `--edit-mode`, `--loop-detect-threshold`, `--verify-on-exit`
`claw-forge add`	Add features to existing project (single or brownfield spec)	`--spec`, `--branch/--no-branch`
`claw-forge fix`	TDD bug-fix: write failing test → fix → regression suite	`--report`, `--branch/--no-branch`
`claw-forge status`	Project progress, phase bars, active agents, next action	`--config`
`claw-forge analyze`	Scan codebase → `brownfield_manifest.json` (stack, tests, conventions)	`--project`
`claw-forge ui`	Launch real-time Kanban board (React + WebSocket)	`--port`, `--session`, `--dev`
`claw-forge dev`	Start API (hot-reload) + UI (Vite HMR) for local development; add `--run` to also launch the agent orchestrator	`--state-port`, `--ui-port`, `--project`, `--run`
`claw-forge pool-status`	Provider health table (status, RPM, latency, cost)	`--config`
`claw-forge pause`	Drain mode — finish in-flight agents, start no new ones	—
`claw-forge resume`	Resume a paused project	—
`claw-forge input`	Answer human-input questions from blocked agents	—
`claw-forge state`	Start the state service REST API (port 8420)	`--port`, `--host`, `--reload`

Slash Commands (Claude Code)

Command	When to use	What it produces
`/create-spec`	Starting a new project or adding features	`app_spec.txt` or `additions_spec.xml`
`/expand-project`	Adding features mid-sprint	New tasks in state DB
`/check-code`	Pre-PR quality gate	Ruff + mypy + pytest report
`/checkpoint`	Saving progress at a known-good state	Git commit + JSON snapshot + CHECKPOINT.md
`/review-pr`	Code review before merge	Structured verdict (APPROVE / REQUEST CHANGES)
`/pool-status`	Diagnosing slow agents or cost spikes	Provider health + recommendations
`/create-bug-report`	Structured bug reporting before fix	`bug_report.md` → auto-runs `claw-forge fix`
`/claw-forge-status`	Re-entry after leaving a session	Full project status card

📚 Full details: docs/commands.md · End-to-end workflows: docs/workflows.md

Quick Workflow Cheatsheet

New project:

claw-forge init → /create-spec → claw-forge plan → claw-forge run

Add features to existing app:

claw-forge analyze → /create-spec → claw-forge add → claw-forge run

Fix a bug:

/create-bug-report → claw-forge fix

Check health:

claw-forge status → /pool-status → /check-code

Save progress:

/checkpoint → /review-pr → git push

Workflow Features

Feature	Command / Flag
YOLO mode	`claw-forge run my-app --yolo`
Fix a bug (one-liner)	`claw-forge fix "description"`
Fix a bug from report	`claw-forge fix --report report.md`
Pause (drain)	`claw-forge pause my-app`
Resume	`claw-forge resume my-app`
Human input	`claw-forge input my-app "Here's the API key"`
Project status	`claw-forge status`
Batch features	`--batch-size 3`
Specific features	`--batch-features 1,2,3`
Pool health	`claw-forge pool-status`

Kanban UI

Real-time board tracking feature progress across all agents. Cards flow across five columns as agents work — with live provider health, cost tracking, and regression status in the header.

[Screenshot: claw-forge Kanban board, light mode, with 18 features across the Pending, In Progress, Passing, Failed, and Blocked columns]

[Screenshot: claw-forge Kanban board, dark mode]

claw-forge dev               # API (:8420, hot-reload) + UI (:5173, HMR) in one command
# or separately:
claw-forge state --reload &  # start REST + WebSocket server on :8420 with hot-reload
claw-forge ui --dev          # http://localhost:5173/?session=<uuid>

Columns: Pending · In Progress · Passing · Failed · Blocked

Header: provider health dots (green/yellow/red per provider) · progress bar (X/Y passing) · live agent count · cost sparkline · regression health bar

Real-time: WebSocket pushes feature status changes, agent streaming logs, provider health, and cost updates. Drag-and-drop cards to retry failed tasks.

Claude Commands

Eight slash commands in .claude/commands/ for use inside Claude Code, automatically scaffolded by claw-forge init. See docs/commands.md for full reference with real-world examples, and docs/workflows.md for end-to-end walkthroughs.

Command	Purpose	Key flags	When to use
`/create-spec`	Interactive spec creation — auto-detects greenfield vs brownfield, outputs XML	`--brownfield`, `--output`	Starting a new project or adding features
`/expand-project`	Add features to an existing project	`--append`	Mid-run when you think of more features
`/check-code`	Run ruff + mypy + pytest and report	`--fix`, `--strict`	After agent completes a batch of features
`/checkpoint`	Commit + DB snapshot + session summary	`--message`	Before stepping away or at phase boundaries
`/review-pr`	Structured PR review with verdict	`--strict`	Before merging any agent-generated branch
`/pool-status`	Provider health and cost analysis	`--watch`	When you suspect rate limits or cost spikes
`/claw-forge-status`	Project progress, phase bars, agent state, next action	—	Quick health check at any time
`/create-bug-report`	Guided 6-phase bug report creation → runs fix	`--tdd`	When a bug is reported and you need a proper fix

Four agent definitions in .claude/agents/:

Agent	Model	Purpose
`coding`	sonnet	Implement features, TDD-first
`testing`	sonnet	Run regression tests, report failures
`reviewing`	opus	Code review with blocking/suggestion/approve verdict
`initializer`	sonnet	Parse spec, create feature DAG

Quick Workflow Cheatsheet

Goal	Command chain
New project	`claw-forge init` → `/create-spec` → `claw-forge plan` → `claw-forge run`
Add features to existing app	`claw-forge analyze` → `/create-spec` → `claw-forge add`
Fix a bug (TDD)	`/create-bug-report` → `claw-forge fix` → `/review-pr`
Quality check	`/check-code` → `/review-pr`
Save progress	`/checkpoint` → `git push`
Monitor providers	`/pool-status`
Resume after crash	`claw-forge status` → `claw-forge run`

📖 Full workflow guides: docs/workflows.md 📖 Command reference: docs/commands.md

Pre-installed Skills (18) + Templates

Skills are auto-injected into agent sessions via three layers:

LSP by file extension — detect_lsp_plugins() scans the project directory and injects the right language server based on file types found (.py → pyright, .ts → typescript-lsp, .go → gopls, etc.)
Agent type — skills_for_agent() injects role-appropriate skills (e.g. coding agents get systematic-debug + verification-gate, reviewers get code-review + security-audit)
Task keywords — same function scans the task description for keywords ("database" → database skill, "docker" → docker skill, "security" → security-audit, etc.)

Injection is controlled by two flags on run_agent(): auto_inject_skills=True for skills and auto_detect_lsp=True for LSP detection.
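
The three layers can be approximated with simple lookup tables. The sketch below is not the code behind detect_lsp_plugins() or skills_for_agent(); the function names `lsp_for_project` and `skills_for` and the table contents are assumptions built only from the examples given above.

```python
from pathlib import Path
from typing import Dict, List, Set

# Illustrative tables, limited to the mappings mentioned in the text.
LSP_BY_EXTENSION: Dict[str, str] = {
    ".py": "pyright",
    ".ts": "typescript-lsp",
    ".go": "gopls",
}
SKILLS_BY_AGENT: Dict[str, List[str]] = {
    "coding": ["systematic-debug", "verification-gate"],
    "reviewer": ["code-review", "security-audit"],
}
SKILL_BY_KEYWORD: Dict[str, str] = {
    "database": "database",
    "docker": "docker",
    "security": "security-audit",
}


def lsp_for_project(project_dir: Path) -> Set[str]:
    """Layer 1: pick language servers from the file extensions found on disk."""
    return {
        LSP_BY_EXTENSION[path.suffix]
        for path in project_dir.rglob("*")
        if path.is_file() and path.suffix in LSP_BY_EXTENSION
    }


def skills_for(agent_type: str, task: str) -> List[str]:
    """Layers 2 and 3: role defaults first, then keyword matches from the task text."""
    skills = list(SKILLS_BY_AGENT.get(agent_type, []))
    lowered = task.lower()
    for keyword, skill in SKILL_BY_KEYWORD.items():
        if keyword in lowered and skill not in skills:
            skills.append(skill)
    return skills


print(skills_for("coding", "Add a Docker image and database migrations"))
```

Plain substring matching is the simplest possible keyword rule; a real implementation may use word boundaries or a richer classifier.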

LSP servers (6): pyright · gopls · rust-analyzer · typescript-lsp · clangd · solidity-lsp

Process skills (6): systematic-debug · verification-gate · parallel-dispatch · test-driven · code-review · web-research

Integration skills (6): git-workflow · api-client · docker · security-audit · performance · database

Templates: skills/bug_report.template.md — structured bug report template for use with claw-forge fix --report

Brownfield Projects

claw-forge works on existing codebases — not just greenfield projects. The brownfield workflow analyzes your project, learns its conventions, and adds features or fixes bugs without breaking existing patterns.

# 1. Analyze your project (creates brownfield_manifest.json)
claw-forge analyze

# 2. Add a feature
claw-forge add "Add rate limiting to all API endpoints"

# 3. Fix a bug
claw-forge fix "Login fails when email contains uppercase letters"

The analyzer detects your stack, parses git history for hot files, establishes a test baseline, and writes brownfield_manifest.json. Subsequent add/fix runs load this manifest to match your existing conventions.

Fixing bugs

Quick fix (one-liner):

claw-forge fix "users get 500 when uploading files > 5MB"

Structured fix (recommended for complex bugs):

# Generate a structured bug report interactively
/create-bug-report

# Or copy the template and fill it in
cp skills/bug_report.template.md bug_report.md
# edit bug_report.md...

# Run the fix
claw-forge fix --report bug_report.md

The bug-fix agent follows a strict RED → GREEN protocol:

Writes a failing regression test first (RED)
Locates root cause using systematic-debug skill
Makes surgical fix — touches only what's needed
Confirms regression test passes (GREEN)
Runs full test suite — must be 100% green
Commits with fix: <title>\n\nRegression test: <test_name>

Status: analyze, add, and fix commands are planned — see docs/brownfield.md for the full design.

Plugin System

Extend claw-forge with custom agent types via Python entry points — no fork required:

Built-in plugins (registered in pyproject.toml):

Entry point	Plugin	Purpose
`initializer`	`InitializerPlugin`	Parse spec, create feature DAG
`coding`	`CodingPlugin`	TDD-first feature implementation
`testing`	`TestingPlugin`	Run regression tests, report failures
`reviewer`	`ReviewerPlugin`	Structured code review with verdict
`bugfix`	`BugFixPlugin`	Reproduce-first bug fix: RED→GREEN protocol

Add your own:

# your-package/pyproject.toml
[project.entry-points."claw_forge.plugins"]
my_agent = "my_package.plugin:MyAgentPlugin"

from claw_forge.plugins.base import AgentPlugin, PluginContext, PluginResult

class MyAgentPlugin(AgentPlugin):
    name = "my_agent"
    description = "Does something custom"
    version = "1.0.0"

    def get_system_prompt(self, context: PluginContext) -> str:
        return "You are a specialist in ..."

    async def execute(self, context: PluginContext) -> PluginResult:
        from claw_forge.agent import collect_result
        result = await collect_result(
            self.get_system_prompt(context),
            cwd=context.project_dir,
            agent_type="coding",
        )
        return PluginResult(success=True, output=result)

Development

Prerequisites

Python 3.12+
uv package manager
Node.js 18+ (for the Kanban UI)

Setup

git clone https://github.com/clawinfra/claw-forge.git
cd claw-forge
uv sync --extra dev          # install Python deps (including dev tools)
npm install --prefix ui      # install UI deps

Local Dev (API + UI with hot-reload)

The fastest way to develop locally — one command starts both servers with automatic reload on file changes:

claw-forge dev

This starts:

State API on http://localhost:8420 — uvicorn with --reload, restarts on any Python file change under claw_forge/
Kanban UI on http://localhost:5173 — Vite dev server with HMR, instant updates on React/TypeScript changes

Options:

claw-forge dev --state-port 9000    # custom API port
claw-forge dev --ui-port 3000       # custom UI port
claw-forge dev --project /path/to/app  # point at a different project directory
claw-forge dev --no-open            # don't auto-open the browser

Press Ctrl+C to stop both servers.

Running servers separately

If you prefer separate terminals (useful for cleaner log output):

# Terminal 1 — API with hot-reload
claw-forge state --reload

# Terminal 2 — UI with Vite HMR
claw-forge ui --dev

The --reload flag on state enables uvicorn file watching. The --dev flag on ui runs Vite instead of serving the pre-built static bundle.

Running Vite directly (without the CLI)

If you want to run npm run dev from the ui/ directory without going through claw-forge ui --dev, set the proxy target port via environment variables:

cd ui
VITE_API_PORT=8420 npm run dev

Or create ui/.env.local:

VITE_API_PORT=8420
VITE_WS_PORT=8420

Tests & linting

uv run pytest tests/ -q                                          # full test suite
uv run pytest tests/path/to/test.py::Class::test_name -v         # single test
uv run pytest tests/ -q --cov=claw_forge --cov-report=term-missing  # with coverage (must reach 90%)
uv run ruff check claw_forge/ tests/                             # lint
uv run ruff check claw_forge/ tests/ --fix                       # auto-fix
uv run mypy claw_forge/ --ignore-missing-imports                 # type check

Documentation

Document	Contents
`ARCHITECTURE.md`	System design, data flow, component details
`docs/commands.md`	Full CLI command reference with options and examples
`docs/workflows.md`	End-to-end workflow walkthroughs
`docs/sdk-api-guide.md`	20 Claude Agent SDK APIs with claw-forge examples
`docs/bmad-integration.md`	Using claw-forge with BMAD Method — convert epics/stories to spec
`docs/brownfield.md`	Brownfield mode — analyze existing codebases, add features, fix bugs
`docs/agent-skill.md`	OpenClaw agent skill — install via `clawhub install claw-forge-cli`
`docs/middleware/pre-completion-checklist.md`	Design doc: PreCompletionChecklistMiddleware (issue #4)
`docs/middleware/loop-detection.md`	Design doc: LoopDetectionMiddleware (issue #5)
`docs/benchmarks/terminal-bench.md`	Terminal Bench 2.0 evaluation harness design (issue #6)
`claw-forge.yaml`	Annotated configuration reference
`website/tutorial.html`	End-to-end getting started guide
`website/features.html`	Full feature list

Harness Design Patterns

claw-forge includes advanced harness patterns inspired by Anthropic's research on long-running AI applications. These patterns improve output quality, manage context limits, and enable strategic decision-making during multi-iteration agent workflows.

Context Resets (`--reset-threshold`)

Problem: Long-running agent sessions eventually hit context limits or degrade due to accumulated conversation history ("context anxiety"). The agent may prematurely wrap up work to avoid losing context.

Solution: After N tool calls, the builder saves a structured HANDOFF.md artifact and spawns a fresh builder with it as context. The new builder gets a clean slate with just the essential state.

from claw_forge.harness import ContextResetManager, HandoffArtifact

# Configure: trigger reset after 80 tool calls
reset_mgr = ContextResetManager(project_dir="/path/to/project", threshold=80)

# Track progress (agent, tasks, and spawn_fresh_builder are placeholders defined elsewhere)
for task in tasks:
    result = await agent.run(task)
    if reset_mgr.record_tool_call():  # Returns True when threshold reached
        # Save handoff and spawn fresh builder
        handoff = HandoffArtifact(
            completed=["feat: auth (abc123)", "feat: database (def456)"],
            state=["src/auth.py — 150 lines", "tests/ — 85% coverage"],
            next_steps=["Add rate limiting", "Write API docs"],
            decisions_made=["Using SQLite for local dev"],
            quality_bar="6.5/10 — needs error handling",
        )
        reset_mgr.save_handoff(handoff)
        # Spawn new builder with HANDOFF.md as context
        await spawn_fresh_builder(project_dir="/path/to/project")
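The threshold trigger itself is simple to reason about in isolation. The following is a minimal sketch, not the `ContextResetManager` implementation: `ToolCallCounter` and `record` are hypothetical names, and resetting the count to zero after a trigger is an assumption about how a fresh builder would start.

```python
class ToolCallCounter:
    """Hypothetical counter; not the ContextResetManager implementation."""

    def __init__(self, threshold: int) -> None:
        if threshold < 1:
            raise ValueError("threshold must be >= 1")
        self.threshold = threshold
        self.count = 0

    def record(self) -> bool:
        """Count one tool call; return True when a reset is due."""
        self.count += 1
        if self.count >= self.threshold:
            # Assumption: the fresh builder starts counting from zero again.
            self.count = 0
            return True
        return False


counter = ToolCallCounter(threshold=3)
fired = [counter.record() for _ in range(7)]
print(fired)
```

With a threshold of 3, the counter fires on the third and sixth calls, so a long session is split into fixed-size windows of tool calls.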

HANDOFF.md schema:

## Completed — Items done (with commit hashes for traceability)
## State — Current codebase structure, test coverage, file counts
## Next Steps — Ordered action items for the next builder
## Decisions Made — Architectural decisions to avoid revisiting
## Quality Bar — Current evaluator score, what needs improvement
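To make the schema concrete, here is a rough sketch of how a handoff record could be rendered into those five sections. `HandoffSketch` and `render_handoff` are hypothetical names used only for illustration; the file claw-forge actually writes may be laid out differently.

```python
from dataclasses import dataclass, field


@dataclass
class HandoffSketch:
    """Hypothetical stand-in for a handoff record; not the claw_forge class."""
    completed: list[str] = field(default_factory=list)
    state: list[str] = field(default_factory=list)
    next_steps: list[str] = field(default_factory=list)
    decisions_made: list[str] = field(default_factory=list)
    quality_bar: str = ""


def render_handoff(h: HandoffSketch) -> str:
    """Render the five schema sections as markdown text."""
    def bullets(items: list[str]) -> str:
        return "\n".join(f"- {item}" for item in items) or "- (none)"

    # Next Steps is ordered, so it is numbered rather than bulleted.
    steps = "\n".join(f"{i}. {s}" for i, s in enumerate(h.next_steps, 1)) or "- (none)"
    sections = [
        ("Completed", bullets(h.completed)),
        ("State", bullets(h.state)),
        ("Next Steps", steps),
        ("Decisions Made", bullets(h.decisions_made)),
        ("Quality Bar", h.quality_bar or "(not yet evaluated)"),
    ]
    return "\n\n".join(f"## {title}\n{body}" for title, body in sections) + "\n"


doc = render_handoff(HandoffSketch(
    completed=["feat: auth (abc123)"],
    next_steps=["Add rate limiting", "Write API docs"],
    quality_bar="6.5/10, needs error handling",
))
print(doc)
```

Empty sections are still emitted with a placeholder, so the next builder can tell "nothing recorded" apart from "section missing".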

Adversarial Review (`--adversarial-review`)

Problem: Agents are overly optimistic about their own work. Self-evaluation scores are inflated (average 8.5/10 even for mediocre output), and vague praise ("looks good!") provides no actionable feedback.

Solution: Separate the generator from the evaluator. The evaluator uses a skeptical, evidence-based rubric with weighted grading dimensions:

Dimension	Weight	Focus
Correctness	×3	Bugs, edge cases, error handling
Completeness	×3	Missing features, unimplemented specs
Quality	×2	Maintainability, duplication, architecture
Originality	×1	Novel approaches, design choices

from claw_forge.harness import AdversarialEvaluator, GradingDimension

evaluator = AdversarialEvaluator(
    approval_threshold=7.0,
    dimensions=[
        GradingDimension.CORRECTNESS,  # ×3
        GradingDimension.COMPLETENESS,  # ×3
        GradingDimension.QUALITY,  # ×2
        GradingDimension.ORIGINALITY,  # ×1
    ],
)

# Run adversarial review (via the reviewer plugin with the --adversarial-review flag)
# The evaluator returns:
# - Per-dimension scores (1-10) with specific evidence
# - Overall weighted score
# - Verdict: APPROVE or REQUEST_CHANGES
# - Actionable findings (blocking issues, suggestions)
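The effect of the weights is easiest to see with a worked example. The sketch below assumes the overall score is a weighted mean normalized by the sum of the weights; the README does not spell out the formula, so treat `weighted_score` and `verdict` as illustrative helpers rather than the evaluator's API.

```python
# Weights come from the dimension table; everything else is illustrative.
WEIGHTS = {"correctness": 3, "completeness": 3, "quality": 2, "originality": 1}


def weighted_score(scores: dict[str, float]) -> float:
    """Weighted mean of per-dimension scores (each on a 1-10 scale)."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    total = sum(WEIGHTS[d] * scores[d] for d in WEIGHTS)
    return total / sum(WEIGHTS.values())


def verdict(score: float, approval_threshold: float = 7.0) -> str:
    """Map an overall score to one of the two verdicts."""
    return "APPROVE" if score >= approval_threshold else "REQUEST_CHANGES"


scores = {"correctness": 5, "completeness": 7, "quality": 8, "originality": 9}
overall = weighted_score(scores)
plain_mean = sum(scores.values()) / len(scores)
print(round(overall, 2), round(plain_mean, 2), verdict(overall))
```

Under this assumption, a weak correctness score drags the result below 7.0 even though the unweighted mean of the same four scores would pass, which is the point of weighting correctness and completeness most heavily.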

Adversarial prompt features:

"Assume the code has bugs until proven otherwise"
Requires specific evidence (file names, line numbers, test names)
Few-shot examples calibrated to distinguish good vs. bad reviews
Generic praise ("well done!") is explicitly discouraged

Strategic Pivot

Problem: Agents can get stuck in a local optimum, iterating endlessly on an approach that isn't converging. Declining scores over multiple iterations signal that the current direction is flawed.

Solution: Track evaluator scores across iterations and force a strategic decision after each review cycle:

Score ≥ threshold (7.0) → APPROVE — work meets quality bar
Score trending up → REFINE — continue with improvements
Score flat, below threshold → REFINE — iterate with specific changes
Score declining for 2+ iterations → PIVOT — abandon approach, try something different

import logging

from claw_forge.harness import PivotTracker, PivotAction

logger = logging.getLogger(__name__)

tracker = PivotTracker(
    forced_pivot_streak=2,  # Force PIVOT after 2 consecutive declining scores
    approval_threshold=7.0,
)

# After each evaluator cycle:
decision = tracker.decide(score=6.2, iteration=3)
if decision.action == PivotAction.PIVOT:
    logger.warning("Scores declining: 7.5 → 6.8 → 6.2. Pivoting to new approach.")
    # Log to PLAN.md for traceability
    tracker.log_to_plan("PLAN.md")
    # Agent switches strategies entirely
elif decision.action == PivotAction.REFINE:
    logger.info("Score improving or flat. Continue with evaluator feedback.")
    # Agent iterates on current approach
elif decision.action == PivotAction.APPROVE:
    logger.info("Score meets threshold. Work approved.")
    # Agent finalizes and exits
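The decision rules listed earlier can also be written as a small pure function. This is a sketch of the rule set as described, not the `PivotTracker` source: `Action` and `decide` are hypothetical names, and counting only strictly lower scores as a decline is an assumption.

```python
from enum import Enum


class Action(Enum):
    APPROVE = "approve"
    REFINE = "refine"
    PIVOT = "pivot"


def decide(history: list[float], approval_threshold: float = 7.0,
           forced_pivot_streak: int = 2) -> Action:
    """Pick an action from the full score history (oldest first)."""
    if not history:
        raise ValueError("history must contain at least one score")
    if history[-1] >= approval_threshold:
        return Action.APPROVE
    # Count consecutive declines ending at the latest score.
    streak = 0
    for prev, cur in zip(reversed(history[:-1]), reversed(history)):
        if cur < prev:
            streak += 1
        else:
            break
    return Action.PIVOT if streak >= forced_pivot_streak else Action.REFINE


print(decide([7.5, 6.8, 6.2]))  # two consecutive declines
print(decide([5.0, 6.0]))       # improving but below threshold
```

Checking the threshold first means a score that meets the bar is approved even if it is lower than the previous one; whether claw-forge orders the checks this way is not stated.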

Pivot log in PLAN.md:

## Pivot Decision Log

### Iteration 3 — PIVOT
- **Score:** 6.2/10
- **Trend:** 7.5 → 6.8 → 6.2
- **Reasoning:** Score declining for 2+ consecutive iterations (7.5 → 6.8 → 6.2). Current approach is not converging — pivot to a different strategy.
- **Time:** 2026-03-28T12:34:56Z

Usage

These patterns are integrated into the reviewer and coding plugins:

# Enable adversarial review in the reviewer plugin
claw-forge run --config reviewer.yaml  # with config: {adversarial: true}

# Coding plugin automatically loads HANDOFF.md when resuming
claw-forge run  # continues from previous handoff if HANDOFF.md exists

See the claw_forge.harness module for full API documentation and docs/harness-patterns.md for detailed usage patterns.

Benchmark Results

Ablation study on claw-forge-bench — 30 Python coding tasks (easy / medium / hard), model: claude-opus-4-6.

Config	Middleware	Pass Rate	Δ vs Baseline
A — baseline	none	96.7%	—
B — hashline	hashline edit mode	96.7%	+0.0pp
C — loop	loop detection	96.7%	+0.0pp
D — verify	verify-on-exit	96.7%	+0.0pp
E — full stack	hashline + loop + verify	100%	+3.3pp

Finding: Individual middleware layers show no uplift in isolation. The combination of all three reaches 100% — an interaction effect where each layer compensates for the others' blind spots.
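When reading the table, it helps to translate the percentages back into task counts. The sketch below assumes each of the 30 tasks is scored once per configuration, in which case 96.7% can only mean 29 passes; `pass_rate` is a hypothetical helper.

```python
TOTAL_TASKS = 30  # from the benchmark description


def pass_rate(passed: int, total: int = TOTAL_TASKS) -> float:
    """Pass rate as a percentage, rounded to one decimal place."""
    return round(100 * passed / total, 1)


baseline = pass_rate(29)
full_stack = pass_rate(30)
delta_pp = round(full_stack - baseline, 1)
print(baseline, full_stack, delta_pp)
```

Under that assumption, the +3.3pp uplift of the full stack corresponds to one additional task passing.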

Recommended production stack:

claw-forge run --edit-mode hashline --loop-detect-threshold 5 --verify-on-exit

→ Full results + methodology · Benchmark repo

OpenClaw Agent Skill

claw-forge is available as an installable OpenClaw agent skill on ClawHub. Any AI agent running on OpenClaw can use it to run the full claw-forge workflow autonomously.

clawhub install claw-forge-cli

See docs/agent-skill.md for details on --edit-mode hashline, the middleware stack, Terminal Bench ablation numbers, and how the skill integrates with OpenClaw.

License

Apache-2.0 · Built by ClawInfra

Release history

0.5.12

Apr 13, 2026

This version

0.5.11

Apr 13, 2026

0.5.10

Apr 12, 2026

0.5.9

Apr 12, 2026

0.5.8

Apr 12, 2026

0.5.6

Mar 28, 2026

0.5.5

Mar 25, 2026

0.5.4

Mar 23, 2026

0.5.3

Mar 22, 2026

0.5.2

Mar 22, 2026

0.5.1

Mar 22, 2026

0.5.0

Mar 21, 2026

0.4.8

Mar 20, 2026

0.4.7

Mar 15, 2026

0.4.6

Mar 14, 2026

0.4.5

Mar 13, 2026

0.4.2

Mar 13, 2026

0.4.1

Mar 12, 2026

0.4.0

Mar 12, 2026

0.3.2

Mar 8, 2026

0.3.1

Mar 8, 2026

0.1.0a63 pre-release

Mar 7, 2026

0.1.0a61 pre-release

Mar 6, 2026

0.1.0a60 pre-release

Mar 6, 2026

0.1.0a59 pre-release

Mar 5, 2026

0.1.0a58 pre-release

Mar 5, 2026

0.1.0a56 pre-release

Mar 4, 2026

0.1.0a55 pre-release

Mar 4, 2026

0.1.0a54 pre-release

Mar 4, 2026

0.1.0a52 pre-release

Mar 4, 2026

0.1.0a51 pre-release

Mar 4, 2026

0.1.0a50 pre-release

Mar 4, 2026

0.1.0a49 pre-release

Mar 4, 2026

0.1.0a48 pre-release

Mar 4, 2026

0.1.0a47 pre-release

Mar 4, 2026

0.1.0a46 pre-release

Mar 4, 2026

0.1.0a45 pre-release

Mar 4, 2026

0.1.0a44 pre-release

Mar 4, 2026

0.1.0a43 pre-release

Mar 4, 2026

0.1.0a42 pre-release

Mar 4, 2026

0.1.0a41 pre-release

Mar 4, 2026

0.1.0a40 pre-release

Mar 3, 2026

0.1.0a34 pre-release

Mar 3, 2026

0.1.0a33 pre-release

Mar 3, 2026

0.1.0a32 pre-release

Mar 3, 2026

0.1.0a31 pre-release

Mar 3, 2026

0.1.0a30 pre-release

Mar 3, 2026

0.1.0a29 pre-release

Mar 3, 2026

0.1.0a27 pre-release

Mar 3, 2026

0.1.0a26 pre-release

Mar 3, 2026

0.1.0a25 pre-release

Mar 3, 2026

0.1.0a24 pre-release

Mar 3, 2026

0.1.0a23 pre-release

Mar 3, 2026

0.1.0a22 pre-release

Mar 3, 2026

0.1.0a21 pre-release

Mar 3, 2026

0.1.0a20 pre-release

Mar 3, 2026

0.1.0a19 pre-release

Mar 3, 2026

0.1.0a16 pre-release

Mar 3, 2026

0.1.0a15 pre-release

Mar 3, 2026

0.1.0a14 pre-release

Mar 3, 2026

0.1.0a13 pre-release

Mar 3, 2026

0.1.0a12 pre-release

Mar 3, 2026

0.1.0a11 pre-release

Mar 3, 2026

0.1.0a10 pre-release

Mar 3, 2026

0.1.0a9 pre-release

Mar 3, 2026

0.1.0a8 pre-release

Mar 3, 2026

0.1.0a7 pre-release

Mar 3, 2026

0.1.0a6 pre-release

Mar 3, 2026

0.1.0a5 pre-release

Mar 3, 2026

0.1.0a4 pre-release

Mar 3, 2026

0.1.0a3 pre-release

Mar 3, 2026

0.1.0a2 pre-release

Mar 3, 2026

0.1.0a1 pre-release

Mar 2, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

claw_forge-0.5.11.tar.gz (3.2 MB view details)

Uploaded Apr 13, 2026 Source

Built Distribution

claw_forge-0.5.11-py3-none-any.whl (605.2 kB view details)

Uploaded Apr 13, 2026 Python 3

File details

Details for the file claw_forge-0.5.11.tar.gz.

File metadata

Download URL: claw_forge-0.5.11.tar.gz
Upload date: Apr 13, 2026
Size: 3.2 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.11.6 {"installer":{"name":"uv","version":"0.11.6","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for claw_forge-0.5.11.tar.gz
Algorithm	Hash digest
SHA256	`b95b36fdcea2d1f43ff0c03d0dd3bc93048f44a9310d80898f0a1f7437b25949`
MD5	`d0223ce2e246d847e59bb7e0aec6e0c9`
BLAKE2b-256	`12f17d1fe9a53227636a2097e206b8cfed8900503473b0c5e59cbc31363dde27`

File details

Details for the file claw_forge-0.5.11-py3-none-any.whl.

File metadata

Download URL: claw_forge-0.5.11-py3-none-any.whl
Upload date: Apr 13, 2026
Size: 605.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.11.6 {"installer":{"name":"uv","version":"0.11.6","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for claw_forge-0.5.11-py3-none-any.whl
Algorithm	Hash digest
SHA256	`53143b2cde3aa2e84a27602309885850ce671424341ee649db48e7f67019764a`
MD5	`7dc71f867e36d546859a1844f25468d8`
BLAKE2b-256	`0bd9ded360eac784ed6168d225ad2df278e64de9e6a63b414b234af3f570d375`
