rote

Graduate fuzzy AI skills into deterministic, reliable workflows.

rote is a CLI that takes an Anthropic-style Skill (a SKILL.md plus references/) and turns it into a runnable background pipeline in one shot. An LLM agent (itself defined as a skill) reads the source skill, applies a structured graduation rubric, and emits:

  • a runtime-agnostic intermediate representation (pipeline.yaml),
  • extracted Python modules for the deterministic parts of the skill,
  • typed signature stubs for the LLM-judge parts,
  • and runtime code for your durable execution engine of choice.

pip install rote-cli    # or zero-install: uvx --from rote-cli rote ...

# Default target is DBOS — durable execution as a plain Python library,
# no orchestrator to run, SQLite for dev / Postgres for prod:
rote graduate ./examples/bdr-outreach/skill --out ./graduated/

# Or target Temporal (Python) or Cloudflare Workflows (TypeScript):
rote graduate ./examples/bdr-outreach/skill --runtime temporal   --out ./graduated/
rote graduate ./examples/bdr-outreach/skill --runtime cloudflare --out ./graduated/

The name comes from rote learning — doing something so many times, so reliably, that it becomes mechanical. That's what graduation does to a skill: a fuzzy 10-20 minute agent loop becomes a deterministic pipeline that runs in the background, costs a fraction of the tokens, and can be regression-tested.


Why compile agents?

There's now third-party data for what compilation buys. "Compiled AI: Deterministic Code Generation for LLM-Based Workflow Automation" (Trooskens et al., Apr 2026) measured compiling LLM workflows into deterministic code against running them through direct LLM calls: 57× fewer tokens at 1,000 transactions, 450× lower median latency, 100% reproducibility where direct inference at temperature 0 managed 95%, and roughly 40× lower TCO at a million transactions a month. The numbers come from a structured function-calling benchmark (BFCL), the friendliest case for compilation, and the token and cost multiples grow with volume. But the shape of the result holds: once a workflow is proven, every run through an agent loop pays LLM prices for work code does for free.
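
Why the multiples grow with volume can be seen with a little arithmetic. The sketch below is an illustration only: it assumes compilation is a one-time token cost and that a compiled run may still spend a small number of tokens, and every number in it is made up rather than taken from the paper.

```python
def token_ratio(n_runs: int, direct_per_run: float,
                compile_once: float, compiled_per_run: float) -> float:
    """Tokens spent by direct inference divided by tokens spent when compiled."""
    direct_total = n_runs * direct_per_run
    compiled_total = compile_once + n_runs * compiled_per_run
    return direct_total / compiled_total


# Hypothetical figures, NOT from the paper: 2,000 tokens per direct run,
# 50,000 tokens to compile once, 5 residual tokens per compiled run.
for n in (10, 1_000, 100_000):
    print(n, round(token_ratio(n, 2_000, 50_000, 5), 1))
```

At low volume the one-time compile cost dominates and the ratio can sit below 1; as the run count grows the fixed cost is amortized and the ratio climbs toward direct_per_run / compiled_per_run.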

One distinction worth being precise about. Durable-execution vendors make fuzzy agents durable: wrap the loop in retries and state so it survives crashes, still fuzzy inside. rote removes the fuzzy loop: compile the proven parts to deterministic code and keep the LLM only where inputs are genuinely unbounded. The two compose rather than compete — Temporal and Cloudflare Workflows are rote's compile targets, not its rivals.

When not to use rote: exploratory and one-off work should stay an agent loop; flexibility is the whole point there, and there's nothing proven to compile yet. rote is for the skill you've run twenty times and want to run a thousand more, unattended.


What just happened on the bundled example

The repository includes a real BDR outreach skill (lead generation, contact vetting, CRM upload, mandatory exclusion checks, email personalization, manual enrollment handoff). Running rote graduate on it with the default claude driver and Sonnet 4.6 produces:

| Output | Value |
| --- | --- |
| Total nodes in the produced IR | 22 |
| Codifiable percentage | 78.9% (15 of 19 non-gate nodes) |
| Extracted Python modules | 5 (zoominfo, hubspot, conference, exclusions, report) |
| Typed LLM-judge signatures | 2 (vet_contact, personalize_email) |
| Mandatory nodes (cannot be skipped) | 4 |
| Human-in-the-loop gates | 3 |
| Wall-clock time | ~13 minutes |
| Subscription cost | ~$0.70 (Sonnet 4.6 via Claude Code) |
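
The codifiable percentage is plain arithmetic over node kinds. The sketch below assumes that "codifiable" means a node that needs no LLM at run time (pure_function or external_call); the per-kind split is hypothetical and chosen only so that it adds up to the reported totals.

```python
# Assumption: a node is "codifiable" when it needs no LLM at run time.
CODIFIABLE_KINDS = {"pure_function", "external_call"}


def codifiable_percentage(kinds: list[str]) -> float:
    """Share of non-gate nodes that are codifiable, as a percentage."""
    non_gate = [k for k in kinds if k != "hitl_gate"]
    if not non_gate:
        return 0.0
    codifiable = sum(1 for k in non_gate if k in CODIFIABLE_KINDS)
    return 100.0 * codifiable / len(non_gate)


# Hypothetical split matching the reported totals (22 nodes, 3 gates,
# 15 codifiable); the real per-kind breakdown is not stated in this README.
example_kinds = (
    ["pure_function"] * 9
    + ["external_call"] * 6
    + ["llm_judge"] * 2
    + ["agent_loop"] * 2
    + ["hitl_gate"] * 3
)

print(f"{codifiable_percentage(example_kinds):.1f}%")  # 15 / 19 -> 78.9%
```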

The graduator independently:

  • Identified the three MANDATORY exclusion checks from prose-only enforcement and marked them mandatory: true in the IR.
  • Extracted four constants (batch sizes of 10, 100, and 250, plus a 30-day window) that lived only in prompt prose.
  • Lifted a literal Python keyword classifier out of a reference file into a real module.
  • Modeled a parallel "conference list" entry path with its own HITL gate that the human-written baseline missed.
  • Surfaced five Open Questions with explicit "review ask" notes for the human reviewer (e.g. "how does the adapter dispatch external_call nodes — via the impl Python function or via an MCP tool registry?").

After the agent finishes, a runtime adapter consumes the IR and emits the target runtime's native code shape:

  • The DBOS adapter (the default) emits a single main.py — one @DBOS.workflow DAG plus a @DBOS.step per node, checkpointing to SQLite or Postgres. python main.py is the runtime; there is no orchestrator to deploy.
  • The Temporal adapter emits workflow.py (the orchestration class with @workflow.defn and signal handlers for the HITL gates) and activities.py (one @activity.defn per node, lazy-importing the extracted functions).
  • The Cloudflare Workflows adapter emits a TypeScript WorkflowEntrypoint class with step.do(...) for each unit of work and step.waitForEvent(...) for each HITL gate, plus signatures/*.ts (Zod schemas + Anthropic SDK calls) and the supporting wrangler.jsonc / package.json / tsconfig.json. The output is wrangler deploy-ready.
  • The DBOS (TypeScript) adapter (--runtime dbos-ts) emits a Node.js app for DBOS Transact: src/main.ts registers one durable workflow (DBOS.registerWorkflow) running the DAG waves and one DBOS.registerStep per node, with DBOS.recv(...) parking each HITL gate durably in Postgres until DBOS.send(...) resumes it. Zero-orchestrator like the Python DBOS target — node dist/main.js is the runtime — but note the TS SDK is Postgres-only (no SQLite mode; npx dbos postgres start covers local dev).

None of the emitted code references an MCP runtime, in either language — the agent's crystallization step replaces tool calls with deterministic implementations.
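
The "DAG waves" mentioned above can be pictured as a layered topological sort: every node in a wave depends only on nodes in earlier waves, so a wave's members could run concurrently. This is a minimal sketch of that idea, not the code the adapters emit; the node ids are hypothetical.

```python
from collections import defaultdict


def dag_waves(nodes: list[str], edges: list[tuple[str, str]]) -> list[list[str]]:
    """Group nodes into waves; each wave depends only on earlier waves."""
    indegree = {n: 0 for n in nodes}
    children: dict[str, list[str]] = defaultdict(list)
    for src, dst in edges:
        children[src].append(dst)
        indegree[dst] += 1

    waves: list[list[str]] = []
    ready = sorted(n for n, d in indegree.items() if d == 0)
    done = 0
    while ready:
        waves.append(ready)
        done += len(ready)
        next_ready = []
        for node in ready:
            for child in children[node]:
                indegree[child] -= 1
                if indegree[child] == 0:
                    next_ready.append(child)
        ready = sorted(next_ready)  # sorted so the result is reproducible
    if done != len(nodes):
        raise ValueError("cycle detected: pipeline is not a DAG")
    return waves


# Hypothetical node ids, loosely modeled on the BDR example.
nodes = ["fetch_leads", "check_exclusions", "vet_contact", "approve", "upload"]
edges = [
    ("fetch_leads", "check_exclusions"),
    ("fetch_leads", "vet_contact"),
    ("check_exclusions", "approve"),
    ("vet_contact", "approve"),
    ("approve", "upload"),
]
print(dag_waves(nodes, edges))
```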


Why this exists

Fuzzy AI skills work, but in production they're slow, expensive, and non-deterministic:

  • Slow. A 10-20 minute agent loop per run is fine for human-in-the-loop use, but unacceptable as a background job.
  • Expensive. Multi-agent loops use ~15× the tokens of a single chat, per Anthropic's own measurements. Most of those tokens go to re-deriving procedures the skill author already wrote down.
  • Non-deterministic. A "MANDATORY" check enforced only by prose can be silently skipped if prompt drift is bad or the trajectory gets long. There's no way to regression-test a behavior the LLM has to remember.

The fix is to identify which parts of a skill are actually fuzzy and which are deterministic procedures wearing fuzzy clothing. Then move the deterministic parts into code, keep the LLM at the points where the input is genuinely unbounded (parsing, classifying, drafting), and wrap the whole thing in a durable execution engine with explicit HITL gates.

That graduation step is what rote automates.


How it works

rote is a three-layer system. Each layer has one job and talks to its neighbors through a small interface:

   ┌────────────────────┐
   │  SKILL.md +        │   Source skill bundle (untouched)
   │  references/       │
   └─────────┬──────────┘
             │
             │  rote graduate
             ▼
   ┌────────────────────┐
   │  graduator agent   │   LLM agent runs the rote-graduate
   │  (Claude / Codex / │   skill against the source bundle.
   │   Anthropic SDK)   │   Pluggable driver layer.
   └─────────┬──────────┘
             │
             │  filesystem contract:
             │    work_dir/pipeline.yaml + extracted/ + signatures/
             ▼
   ┌────────────────────┐
   │  Pipeline IR       │   Pydantic-validated DAG of typed
   │  (pipeline.yaml)   │   nodes. Five node kinds. Runtime-
   │                    │   agnostic.
   └─────────┬──────────┘
             │
             │  rote.adapters.<runtime>
             ▼
   ┌────────────────────┐
   │  emitted runtime   │   Workflow + activities for the
   │  code              │   target durable execution engine.
   └────────────────────┘

The three layers are:

  1. The graduator agent (skills/rote-graduate/) — a regular Anthropic Skill with a SKILL.md and four reference files (node-kinds.md, crystallization-heuristics.md, ir-schema.md, llm-judge-extraction.md). This is the brain of rote. It can run inside any Skills-compatible surface; you don't need rote to use it.
  2. The IR (src/rote/ir.py) — Pydantic models for the five node kinds plus edges, retries, HITL gates, and pipeline metadata. The IR is the source of truth; everything downstream is template substitution from it.
  3. Runtime adapters (src/rote/adapters/<runtime>.py) — pluggable modules that consume an IR and emit runnable code for a specific durable execution engine.

The graduator's job ends when it has produced a valid pipeline.yaml (plus extracted modules and signatures). It does not emit runtime code. Code emission is deterministic Python in rote.adapters — never agent-driven — so the same IR always produces byte-identical output.
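
A toy illustration of why template-driven emission is reproducible: with no randomness, no timestamps, and a fixed node order, the same IR renders to the same bytes every time. The template and the IR shape below are hypothetical and far simpler than what rote.adapters actually produces.

```python
import hashlib
from string import Template

# Hypothetical template; real adapters emit much richer code.
STEP_TEMPLATE = Template("def $node_id(payload):\n    return $impl(payload)\n")


def emit(ir: dict) -> str:
    """Render source text from an IR dict using a fixed node order."""
    nodes = sorted(ir["nodes"], key=lambda n: n["id"])
    parts = [f"# pipeline: {ir['name']}\n"]
    for node in nodes:
        parts.append(STEP_TEMPLATE.substitute(node_id=node["id"], impl=node["impl"]))
    return "\n".join(parts)


def digest(text: str) -> str:
    """SHA-256 of the emitted text, used to compare two renders."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


ir = {
    "name": "demo",
    "nodes": [
        {"id": "upload", "impl": "hubspot_upload"},
        {"id": "fetch", "impl": "zoominfo_fetch"},
    ],
}
same_ir_reordered = {"name": "demo", "nodes": list(reversed(ir["nodes"]))}

print(digest(emit(ir)) == digest(emit(same_ir_reordered)))
```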


Quickstart

Use from Claude Code (recommended)

rote ships as a Claude Code plugin, so you can graduate a skill without leaving Claude or touching Python tooling:

/plugin marketplace add trevhud/rote
/plugin install rote@rote

Then say "graduate this skill" (or run /rote:graduate directly). The plugin confirms the source skill directory, asks which runtime you want (Temporal, Cloudflare Workflows, or DBOS), runs the CLI via uv in the background, and reports the emitted pipeline. A second skill, /rote:serve, wires graduated pipelines up as MCP tools so Claude can trigger the deployed workflows.

Prefer a terminal? The same thing is one uvx command:

uvx --from rote-cli rote graduate ./my-skill --runtime dbos --out ./graduated

# or straight from GitHub for unreleased changes:
uvx --from git+https://github.com/trevhud/rote rote graduate \
  ./my-skill --runtime dbos --out ./graduated

Naming note: the rote package on PyPI is an unrelated memoization library that also installs under the import name rote, so the two can't share an environment. This project's distribution is rote-cli, while the CLI command and import name stay rote; hence uvx --from rote-cli rote .... See docs/releasing.md.

Install from source (development)

Clone and install in editable mode:

git clone https://github.com/trevhud/rote.git
cd rote
python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"

This installs rote plus everything you need to run the tests (temporalio, anthropic, pytest, pytest-asyncio).

Run on the bundled example

The repository includes a real BDR outreach skill in examples/bdr-outreach/skill/. Graduate it:

rote graduate examples/bdr-outreach/skill --out /tmp/bdr-graduated

This targets DBOS by default; pass --runtime temporal or --runtime cloudflare for the other adapters.

By default rote auto-detects an available agent driver in this order: claude (Claude Code CLI) → codex (Codex CLI) → api (Anthropic SDK with ANTHROPIC_API_KEY). Override with --agent:

rote graduate examples/bdr-outreach/skill --agent api --out /tmp/bdr-graduated
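
The detection order described above can be sketched as a simple fall-through. This is an assumption about the logic, not rote's actual auto_detect: it treats a driver as available when its executable is on PATH (or, for api, when ANTHROPIC_API_KEY is set), and the real availability checks may be stricter.

```python
import os
import shutil
from typing import Callable, Mapping, Optional


def auto_detect(
    which: Callable[[str], Optional[str]] = shutil.which,
    env: Mapping[str, str] = os.environ,
) -> str:
    """Return the first available driver name in the order claude, codex, api."""
    if which("claude"):
        return "claude"
    if which("codex"):
        return "codex"
    if env.get("ANTHROPIC_API_KEY"):
        return "api"
    raise RuntimeError("no graduator driver available; pass --agent explicitly")


# Simulated machine with no CLIs installed but an API key present.
print(auto_detect(which=lambda name: None, env={"ANTHROPIC_API_KEY": "placeholder"}))
```

Injecting which and env keeps the function testable without touching the real PATH or environment.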

The output directory is structured as:

/tmp/bdr-graduated/
├── graduated/                   # produced by the graduator agent
│   ├── pipeline.yaml            # the IR
│   ├── extracted/*.py           # deterministic functions
│   ├── signatures/*.py          # typed LLM-judge signatures
│   ├── evals/*.jsonl            # seed eval examples
│   └── graduation-report.md     # human-readable summary
└── runtime/dbos/                # produced by the adapter
    ├── main.py                  # @DBOS.workflow + one @DBOS.step per node
    ├── extracted/*.py           # copied deterministic functions
    ├── signatures/*.py          # generated Pydantic + vendor-SDK judges
    ├── dbos-config.yaml
    └── README.md                # how to run, signal HITL gates, deploy
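
If you script around rote, a quick sanity check of the output layout can catch a failed run early. The helper below is hypothetical (it is not part of the rote CLI) and checks only a few of the paths from the tree above, for the default DBOS target.

```python
from pathlib import Path

# A subset of the layout shown above, for the default DBOS target.
REQUIRED = [
    "graduated/pipeline.yaml",
    "graduated/extracted",
    "graduated/signatures",
    "runtime/dbos/main.py",
]


def missing_outputs(out_dir: Path) -> list[str]:
    """Return the required relative paths that do not exist under out_dir."""
    return [rel for rel in REQUIRED if not (out_dir / rel).exists()]


# An empty list means every checked path is present.
print(missing_outputs(Path("/tmp/bdr-graduated")))
```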

Render an IR without re-running the agent

If you already have a pipeline.yaml (hand-written or from a previous graduation), rote emit runs just the adapter step:

rote emit /path/to/pipeline.yaml --out /tmp/emitted/            # dbos (default)
rote emit /path/to/pipeline.yaml --runtime temporal --out /tmp/emitted/

This is the cheap inner loop while iterating on adapters or IR shapes.


The five node kinds

Every step in a graduated pipeline is exactly one of five kinds. The target runtime's adapter knows how to emit each one. Full classification guidance lives in skills/rote-graduate/references/node-kinds.md.

| Kind | What it is | Where the LLM lives |
| --- | --- | --- |
| pure_function | Fixed logic, deterministic I/O | Not involved |
| external_call | Vendor API call with fixed semantics + retries | Not involved |
| llm_judge | Fuzzy classification against a rubric, typed I/O | Typed signature: DSPy/BAML in Python; Zod + vendor SDK in TypeScript. The IR carries a runtime-agnostic signature_spec (JSON Schema + prompt) so each adapter derives the right native shape. |
| agent_loop | Genuinely exploratory tool use | Bounded agent loop |
| hitl_gate | Explicit human approval, suspend until signal | Durable suspend/resume |

The guiding rule: keep the LLM at points where the input is unbounded or ambiguous, and codify everything else. When a step could be classified two ways, prefer the more deterministic kind.
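
The tie-break in the guiding rule can be written down as a tiny helper. This is a sketch, not the IR's real Pydantic models: the enum values mirror the kind names in the table, while the ranking is an assumption about what "more deterministic" means.

```python
from enum import Enum


class NodeKind(str, Enum):
    PURE_FUNCTION = "pure_function"
    EXTERNAL_CALL = "external_call"
    LLM_JUDGE = "llm_judge"
    AGENT_LOOP = "agent_loop"
    HITL_GATE = "hitl_gate"


# Assumed ranking, most deterministic first. hitl_gate is left out because a
# human approval step is not an alternative classification for LLM work.
DETERMINISM_ORDER = [
    NodeKind.PURE_FUNCTION,
    NodeKind.EXTERNAL_CALL,
    NodeKind.LLM_JUDGE,
    NodeKind.AGENT_LOOP,
]


def prefer_deterministic(candidates: set[NodeKind]) -> NodeKind:
    """Pick the most deterministic kind among plausible classifications."""
    for kind in DETERMINISM_ORDER:
        if kind in candidates:
            return kind
    raise ValueError("no rankable node kind among candidates")


# A step that could be either an llm_judge or a pure_function.
print(prefer_deterministic({NodeKind.LLM_JUDGE, NodeKind.PURE_FUNCTION}).value)
```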


Driver matrix

rote defines three interchangeable graduator drivers (the codex driver is currently a stub; see Status). Pick whichever matches your auth situation; the same pipeline.yaml comes out whichever you choose.

| Driver | Backend | Auth | Install | Default model |
| --- | --- | --- | --- | --- |
| claude | claude -p subprocess | Claude Max/Pro OAuth or CLAUDE_CODE_OAUTH_TOKEN | Install Claude Code separately | claude-sonnet-4-6 |
| codex | codex exec subprocess | ChatGPT Plus/Pro OAuth | Install Codex CLI separately | (driver default) |
| api | anthropic Python SDK | ANTHROPIC_API_KEY env var | pip install 'rote-cli[api]' | claude-sonnet-4-6 |

The claude driver is the default for subscription users — it scrubs ANTHROPIC_API_KEY from the subprocess environment so the user's Claude Code login wins, sets CLAUDE_CODE_DISABLE_NONINTERACTIVE_ANIMATIONS=1 for clean output, and limits the agent to read/write/glob/grep tools (no shell, no network). See docs/agent-runtime.md for the full design record including the auth gotcha that motivates the env scrub.
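
The environment handling described above amounts to copying the parent environment, dropping one variable, and setting another. The function name below is hypothetical and the sketch covers only the environment, not how the driver actually invokes the CLI.

```python
import os
from typing import Mapping


def claude_subprocess_env(base: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Build the child environment for a Claude Code subprocess."""
    env = dict(base)  # copy, so the parent process environment is untouched
    # Drop the API key so the subprocess uses the Claude Code login instead.
    env.pop("ANTHROPIC_API_KEY", None)
    # Turn off animations so captured output stays clean.
    env["CLAUDE_CODE_DISABLE_NONINTERACTIVE_ANIMATIONS"] = "1"
    return env


child_env = claude_subprocess_env({"PATH": "/usr/bin", "ANTHROPIC_API_KEY": "placeholder"})
print(sorted(child_env))
```

The resulting dict would be passed as the env= argument of subprocess.run when spawning the CLI.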

The model defaults to Sonnet 4.6 rather than Opus because the graduator's task is structured-rubric-following, not deep reasoning. Sonnet brings per-run cost from ~$3.50 to ~$0.70 in subscription accounting, which makes iterative rubric tuning feasible. Override with rote graduate --model claude-opus-4-6 for complex skills where Opus's extra reasoning earns its cost.


Status

rote is pre-1.0. The end-to-end flow works on the BDR example and the test suite covers each layer (231 tests in the fast suite, plus 5 slow tests that run the emitted code against real runtimes: a DBOS runtime over SQLite, Temporal's time-skipping test server, the emitted Cloudflare TypeScript compiled against real @cloudflare/workers-types and driven through both HITL gates via wrangler dev, and the MCP server over a real stdio transport). Run pytest tests/ for the fast suite; pytest tests/ -m slow for the toolchain-dependent integration tests.

| Component | Status |
| --- | --- |
| IR (Pydantic schema, validation, YAML loader) | working |
| Temporal adapter | working (validated with mocked-activities e2e test) |
| Cloudflare Workflows adapter | working (validated with tsc --noEmit over the real emitted output) |
| DBOS adapter | working (validated against a real DBOS runtime over SQLite in the e2e test) |
| DBOS TypeScript adapter (dbos-ts) | working (validated with tsc --noEmit over the real emitted output and a live run on the DBOS TS runtime against Docker Postgres) |
| Graduator orchestrator | working |
| rote graduate / rote emit CLI commands | working |
| claude driver | working |
| api (Anthropic SDK) driver | working |
| codex driver | stub (is_available works; run not implemented) |
| Inngest / Restate adapters | planned |
| Real implementations of the extracted modules | the agent produces stubs that raise NotImplementedError; humans fill them in with real API client code |
| Workflow data flow between activities | working: nodes declare inputs: bindings and all four adapters (Temporal, Cloudflare, DBOS, DBOS-TS) thread real payloads through the DAG, validated in the runtime e2e tests |
| Distribution via PyPI | published as rote-cli (pip install rote-cli); tag-driven Trusted Publishing releases; see docs/releasing.md |

The project explicitly does not depend on claude-agent-sdk. Anthropic's terms of service forbid third-party agents built on the Agent SDK from using claude.ai login credentials without prior approval, which would defeat the subscription path. We use the bare anthropic SDK or spawn claude directly instead.


Repository layout

rote/
├── README.md
├── LICENSE                                  # Apache-2.0
├── pyproject.toml                           # rote + optional [temporal] / [api] / [dev] extras
├── docs/
│   └── agent-runtime.md                     # decision record for the driver layer
├── skills/
│   └── rote-graduate/
│       ├── SKILL.md                         # the graduator agent's instructions
│       └── references/
│           ├── node-kinds.md                # 5-kind classification rubric
│           ├── crystallization-heuristics.md  # patterns for moving prose into code
│           ├── ir-schema.md                 # pipeline.yaml reference
│           └── llm-judge-extraction.md      # how to design typed signatures
├── src/rote/
│   ├── cli.py                               # rote graduate / rote emit
│   ├── ir.py                                # Pydantic IR models + load_pipeline
│   ├── graduator/
│   │   ├── __init__.py                      # Graduator orchestrator
│   │   └── drivers/
│   │       ├── __init__.py                  # Protocol + registry + auto_detect
│   │       ├── claude.py                    # ClaudeDriver (subprocess)
│   │       ├── codex.py                     # CodexDriver (stub)
│   │       └── anthropic_api.py             # AnthropicApiDriver (in-process)
│   └── adapters/
│       ├── __init__.py                      # adapter registry
│       ├── temporal.py                      # TemporalAdapter (Python emitter)
│       └── cloudflare.py                    # CloudflareAdapter (TypeScript emitter)
├── examples/
│   └── bdr-outreach/
│       ├── skill/                           # the source skill (graduator input)
│       ├── expected/                        # hand-drafted IR + stubs (regression baseline)
│       └── runs/                            # snapshots of real graduator runs
└── tests/                                   # test suite (see Status for current counts)

How it differs from other tools

  • vs. raw Temporal / Cloudflare Workflows / Inngest / Restate: durable execution engines give you the workflow runtime; they don't help you decide what should be a workflow in the first place. rote is the missing step that converts a working skill into something worth running on a durable engine.
  • vs. LangGraph: LangGraph is an excellent state machine for agent loops, but its graph is hand-built. rote produces a graph from prose, classifies its nodes by determinism, and pushes work out of the agent loop wherever the data supports it.
  • vs. just using Claude Code Skills directly: Skills run great in interactive use. rote is what you reach for when a skill becomes business-critical and needs to run unattended in the background with hard reliability guarantees and per-step regression tests.
  • vs. claude-agent-sdk: see the Status section. The Agent SDK is API-key-only for third-party tooling per Anthropic's ToS, which defeats the subscription path that rote's primary claude driver enables.

Roadmap

In rough priority order:

  1. CodexDriver implementation. Same shape as ClaudeDriver but spawning codex exec. Unlocks ChatGPT subscribers.
  2. End-to-end re-graduation of BDR with signature_spec. The current bundled pipeline.yaml was hand-extended with structured schemas for the Cloudflare adapter; the rubric in skills/rote-graduate/references/ was updated to teach the graduator the new field, but no real graduator run has produced one yet. Re-running rote graduate examples/bdr-outreach/skill should produce the structured form natively.
  3. Another runtime adapter. Probably Inngest, since its programming model is meaningfully different from both Temporal and Cloudflare. Each new adapter is also a stress test on whether the IR is genuinely runtime-agnostic vs. accidentally shaped like one of the existing targets.
  4. Pre-filter as pure_function node. Today the rubric lifts hard thresholds into a Python forward() method, which works for Temporal but not for Cloudflare. Modeling the pre-filter as a separate pure_function node before the llm_judge makes the short-circuit work uniformly across runtimes.
  5. Explicit data-flow threading. (Done.) Nodes declare inputs:, a parameter → source-reference mapping with a deliberately tiny grammar (pipeline.input[.field] / <node_id>.output[.field]), and all four adapters thread real payloads through the DAG. Remaining follow-up: per-element dispatch for fan_out nodes, which currently receive the whole upstream list in one invocation.
  6. More example skills. BDR is rich but it's one shape of skill. Additional examples (research-heavy, retrieval-heavy, code-review) stress-test the IR and the rubric in different ways.
  7. PyPI distribution. (Done.) Published as rote-cli with tag-driven Trusted Publishing releases; see docs/releasing.md.
  8. The graduator graduating itself. The rote-graduate skill is itself a SKILL.md. Pointing rote graduate at it should produce a graduated meta-graduator where the rubric-grade pieces are crystallized into Python and only the genuinely fuzzy judgments stay in the agent loop.
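
The inputs: grammar from item 5 is small enough to resolve with one regular expression. The sketch below is an illustration, not rote's implementation: it assumes identifiers are plain Python-style names, that at most one .field level is allowed, and that outputs are dict-like when a field is requested.

```python
import re
from typing import Any

# pipeline.input[.field]  or  <node_id>.output[.field]
_REF = re.compile(
    r"^(?:(pipeline)\.input|([A-Za-z_][A-Za-z0-9_]*)\.output)"
    r"(?:\.([A-Za-z_][A-Za-z0-9_]*))?$"
)


def resolve(ref: str, pipeline_input: dict, outputs: dict[str, Any]) -> Any:
    """Resolve one source reference against the pipeline input and node outputs."""
    match = _REF.match(ref)
    if match is None:
        raise ValueError(f"invalid source reference: {ref!r}")
    is_pipeline, node_id, field = match.groups()
    value = pipeline_input if is_pipeline else outputs[node_id]
    return value if field is None else value[field]


# Hypothetical node id and payloads.
pipeline_input = {"region": "EMEA"}
outputs = {"fetch_leads": {"contacts": ["a@example.com"], "count": 1}}

print(resolve("pipeline.input.region", pipeline_input, outputs))
print(resolve("fetch_leads.output.contacts", pipeline_input, outputs))
```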

Contributing

The most useful contributions right now are:

  • Run rote graduate on a real skill of your own and report what happens. The rubric was designed against one skill (BDR); it needs to be tested against more.
  • Add a runtime adapter. The Temporal adapter in src/rote/adapters/temporal.py is ~450 lines and follows a clear pattern. Inngest, Restate, and Hatchet are all good targets.
  • Add a graduator driver. The Protocol in src/rote/graduator/drivers/__init__.py is simple. Aider, Gemini CLI, and Cursor Agent are reasonable additions.
  • Improve the rubric. Every change to a file under skills/rote-graduate/references/ is tracked in git, so improvements can be A/B tested across runs.

The test suite (pytest tests/) covers each layer in isolation plus the full pipeline against the BDR example. New work should land with matching tests.


License

Apache-2.0. See LICENSE.
