Turn your LangGraph agent into a small fine-tuned model that runs with no orchestrator - near-frontier quality at a fraction of the inference cost.
Project description
agent2model
Turn your LangGraph agent into a small open model that runs with no orchestrator.
Point agent2model at an agent procedure (a LangGraph graph, or a YAML flowchart) and it bakes the whole procedure into a small model's weights — so the model self-orchestrates with no runtime orchestrator and no per-turn frontier calls. The method reports near-frontier quality at 128–462× lower inference cost (Dennis et al. 2026); those are the paper's figures, not yet independently reproduced in this repo — see Benchmarks.
The idea in one diagram
┌──────────────────────────────────────────────────────────────────────────┐
│ │
│ THE OLD WAY THE agent2model WAY │
│ ─────────── ────────────────── │
│ │
│ ┌──────────────┐ ┌──────────────────┐ │
│ │ User message │ │ User message │ │
│ └──────┬───────┘ └────────┬─────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌──────────────┐ ┌──────────────────┐ │
│ │ LangGraph │ (every turn: │ Compiled small │ one call: │
│ │ orchestrator │ prompt + entire │ model (Qwen) │ procedure │
│ └──────┬───────┘ flowchart sent └────────┬─────────┘ is in the │
│ │ to frontier) │ weights │
│ ▼ ▼ │
│ ┌──────────────┐ ┌──────────────────┐ │
│ │ Sonnet/ │ │ Agent response │ │
│ │ GPT-4 │ ($$$ per turn) │ (no Claude) │ (¢¢/turn) │
│ └──────┬───────┘ └──────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────┐ │
│ │Agent response│ │
│ └──────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────────────┘
You describe your agent's procedure once — or import it from a LangGraph
StateGraph. agent2model walks the procedure, has Claude write thousands of
synthetic conversations through every possible path, and fine-tunes a small
open-source model on those conversations. The result: a self-contained model
that knows the procedure — no external orchestrator, no per-turn frontier calls.
How it differs from what you already use: prompt-optimizers like DSPy and GEPA
make a frontier model follow a procedure better but keep a runtime program;
orchestrators like LangGraph and CrewAI run the procedure live on every turn.
agent2model is the only one that removes the orchestrator entirely by baking the
procedure into the weights.
Based on Dennis et al. 2026, Compiling Agentic Workflows into LLM Weights (arXiv:2605.22502).
The 4-command pipeline
# 1. Validate a workflow and emit the canonical IR. [free, offline, no GPU]
agent2model compile examples/travel_booking/flowchart.yaml --out build/travel
# 2. Generate synthetic training conversations via Claude. [Anthropic API $; --budget caps it]
agent2model generate build/travel --n 2000 --budget 60
# 3. Fine-tune Qwen 2.5/3 on the generated data. [needs a GPU + the [train] extra]
agent2model train build/travel --base Qwen/Qwen2.5-3B-Instruct --epochs 20
# 4. Evaluate against baselines, then serve via vLLM. [eval = Anthropic API $; serve needs a GPU]
agent2model eval build/travel --baselines in_context,langgraph --n 200
agent2model serve build/travel --port 8000
Cost/hardware at a glance:
compileis free and offline.generateandevalmake Anthropic API calls (each prints an estimate first and is capped by--budget).trainandserveneed a GPU — locally with the[train]/[serve]extras, or on Modal (below) if you have no GPU. A full paper reproduction (generate → train → eval) runs ~$30–50 end-to-end.
Don't have a GPU? The whole pipeline runs on Modal:
agent2model cloud setup # one-time wizard
agent2model cloud run my_workflow.yaml --size 3b --n 2000 --epochs 20
Start from the agent you already have
You don't have to write YAML. If your procedure already lives in a LangGraph
StateGraph, point compile straight at the .py and agent2model extracts
the nodes, edges, and decision points into the IR for you:
pip install "agent2model[langgraph]" # needed only to import a .py graph
agent2model compile my_graph.py --out build/mine # LangGraph → IR, automatically
Compiling a
.pygraph imports and runs it — only pointcompileat files you trust (see docs/adapters.md). The adapter recovers the procedure's structure (nodes/edges/branches); you then fill in node prompts and terminal types where they can't be inferred (generatewill refuse until the placeholderTODO:prompts are replaced). CrewAI Flows and Oracle Agent Spec import are on the near-term roadmap.
Prefer to author from scratch, or have no existing graph? Write the Flowchart IR directly — it's a few lines of YAML.
Results from the paper (Dennis et al. 2026)
These are the paper's published targets, reproduced here as the benchmark we hold ourselves to. Independently-reproduced numbers (this repo, your hardware) are tracked in
benchmarks/— see Benchmarks & reproduction.
Compiled small models vs. frontier baselines, n=200 conversations per condition, scored 1–5 on each of 5 criteria.
Travel booking (3B, 14 nodes)
| Criterion | agent2model 3B (paper) | Same-model orch. | LangGraph | In-context (Sonnet) |
|---|---|---|---|---|
| Task Success | 4.11 | 3.93 | 4.17 | 4.53 |
| Information Acc. | 4.75 | 4.69 | 4.21 | 4.64 |
| Consistency | 4.34 | 4.12 | 4.32 | 4.96 |
| Graceful Handling | 4.07 | 3.87 | 4.62 | 4.96 |
| Naturalness | 4.12 | 3.96 | 4.84 | 5.00 |
Cost per conversation
| Domain | In-context Sonnet | LangGraph | agent2model (paper target) | Reduction |
|---|---|---|---|---|
| Travel (14 nodes) | $0.133 | $0.077 | $0.0010 | 128× |
| Zoom Support (14) | $0.103 | $0.054 | $0.0003 | 296× |
| Insurance Claims (55+) | $0.327 | $0.174 | $0.0007 | 462× |
→ The paper reports 87–98% of frontier quality at a fraction of a percent of the cost.
These figures are from Dennis et al. 2026; this repo's own reproduced numbers live in
benchmarks/ and are still being filled in.
The Flowchart IR
Procedures are described as YAML. This is the library's public contract.
name: travel_booking
description: Help a customer plan and book a trip
start: greet
nodes:
greet:
role: agent
prompt: Warmly greet the customer and ask what they need.
next: [gather_preferences]
gather_preferences:
role: agent
prompt: Ask about destination, dates, budget, group size — one at a time.
next: [assess_readiness]
assess_readiness:
role: decision # an LLM picks at *generation* time, never runtime
next:
- to: present_options
when: user has provided all required info
- to: gather_preferences
when: details are still missing
booking_confirmed:
terminal: success # success | abandonment | escalation
scenario_variables:
destination_pool: [Japan, Italy, Iceland, Portugal]
budget_range: [500, 5000]
user_styles: [decisive, indecisive, skeptical, enthusiastic]
The validator enforces every invariant the paper requires:
- ✓ Every non-terminal node has ≥1 outgoing edge
- ✓ Every terminal is reachable from
start - ✓ Cycles must contain a terminal-reaching escape edge (no dead-end loops)
- ✓
role: decisionis resolved only during data generation — the trained model self-orchestrates with no runtime router
Already have a LangGraph workflow? Skip the YAML — point compile at the .py:
agent2model compile my_graph.py --out build/mine
What you actually get from a run
build/travel/
├── flowchart.json # canonical IR
├── dataset.jsonl # n synthetic conversations (HF chat-template)
├── cost.json # Anthropic token + USD ledger
├── generation_state.json # checkpoint — resumes on rerun
├── model/best/ # fine-tuned Qwen, the compiled model
├── eval_report.pdf # per-criterion bar charts, CIs, costs
└── eval_report.json # machine-readable scores + bootstrap CIs
Install
pip install agent2model
That core install is enough to compile a flowchart, generate data, and eval
(no GPU needed). Pull in extras for the heavier paths:
pip install "agent2model[train]" # torch + trl + transformers + deepspeed (GPU)
pip install "agent2model[serve]" # vLLM OpenAI-compatible serving (GPU/Linux)
pip install "agent2model[cloud]" # Modal cloud recipes
pip install "agent2model[langgraph]" # compile FROM a LangGraph .py graph
Cloud-first (no local GPU)
pip install "agent2model[cloud]"
agent2model cloud setup # wizard: Modal account, token, anthropic-secret
agent2model cloud doctor # checklist; tells you what's missing
From source (development)
git clone https://github.com/kamaalg/agent2model && cd agent2model
pip install -e ".[dev]"
Reproduce the paper
The library ships three reproduction entrypoints — each a single Modal command:
modal run -m agent2model.cloud.modal_app::reproduce_travel # 3B, ~3.5h, ~$30-50
modal run -m agent2model.cloud.modal_app::reproduce_zoom # 8B, ~30 min train
modal run -m agent2model.cloud.modal_app::reproduce_insurance # 8B, 55+ nodes
Each chains generate → train → evaluate end-to-end and writes a PDF report.
The CI gate fails any release whose numbers regress > 5% below
benchmarks/targets.json.
Architecture
┌───────────────────────────────────────────┐
│ YOUR WORKFLOW │
│ ( YAML flowchart or LangGraph .py ) │
└──────────────────┬────────────────────────┘
│
agent2model compile
│
▼
┌───────────────────────────────────────────┐
│ Canonical Flowchart IR │
│ (Pydantic, validated, version-stable) │
└──────────────────┬────────────────────────┘
│
agent2model generate
(Claude Sonnet 4.5,
async, prompt-cached,
budget-capped, resumable)
│
▼
┌───────────────────────────────────────────┐
│ N synthetic conversations (JSONL) │
│ flowchart NEVER appears in training data │
└──────────────────┬────────────────────────┘
│
agent2model train
(TRL SFTTrainer,
full-param only,
DeepSpeed ZeRO-3 for 8B)
│
▼
┌───────────────────────────────────────────┐
│ Compiled Qwen 2.5/3 model │
│ (best checkpoint by held-out eval loss) │
└──────────┬────────────────────────┬───────┘
│ │
agent2model eval agent2model serve
(5-criterion LLM-judge, (vLLM, OpenAI-compatible
user simulator, baselines, endpoint, autoscaling)
SciPy stats, PDF report)
Benchmarks
Anyone can fine-tune a model. What's hard — and what agent2model ships as a
first-class artifact — is measuring whether a model actually follows a
multi-turn procedure. The eval harness is a standalone, reproducible benchmark:
- a 5-criterion LLM-judge rubric (task success, information accuracy, consistency, graceful handling, naturalness) modelled on the paper's criteria, with agent2model's own behavioral anchors;
- a dynamic user simulator that role-plays customers with no knowledge of the flowchart, so scores reflect generalisation, not memorisation;
- baselines in the same harness — in-context frontier, LangGraph orchestrator, and same-base-model-orchestrated — to isolate the effect of compilation;
- proper statistics: bootstrap 95% CIs (10k resamples), Wilcoxon/Mann-Whitney, Holm-Bonferroni correction, failure rates, and cost-per-conversation.
No competing agent or distillation tool ships procedure-adherence evaluation — they benchmark QA/math/tool tasks. Run the whole thing in one command:
agent2model eval build/travel --baselines in_context,langgraph,same_model_orch --n 200
The live leaderboard (compiled-3B/8B vs. every baseline, reproduced on real
hardware) lives in benchmarks/. Paper targets are in
benchmarks/targets.json; a release is blocked if any
measured criterion regresses > 5% below target.
Documentation
| Page | What it covers |
|---|---|
| Quickstart | First 10 minutes, end-to-end |
| Cloud Quickstart | Brutally explicit cloud prereqs + costs |
| IR Spec Reference | Every field of the flowchart YAML |
| Training Guide | Hyperparameters, GPU sizing, DeepSpeed |
| Evaluation Guide | The 5-criterion rubric, baselines, stats |
| Cloud Deployment | Modal recipes + RunPod templates |
| Troubleshooting | Common errors and fixes |
| FAQ | What it does, what it doesn't, edge cases |
Build locally: pip install -e ".[docs]" && mkdocs serve.
Project status
- ✅ All 8 phases complete — IR, generation, LangGraph adapter, training, serving, eval, cloud, examples/docs/release
- ✅ Unit tests passing,
ruff/black/mypy --strictclean - ✅ Verified end-to-end —
compile,generate(real Anthropic),eval(real harness with judge + simulator + baseline + stats + PDF); 3B training verified on Modal - ⏳ vLLM serving — container-verified (model loads, OpenAI routes register); HTTP end-to-end pending
- ⏳ PyPI publish — pending first paper-faithful reproduction
What v1 ships
| Feature | Status |
|---|---|
| Flowchart IR (YAML schema + validator) | ✅ |
LangGraph adapter (.py → IR) |
✅ |
| Synthetic data generation (async, prompt-cached, resumable) | ✅ |
| Full-parameter SFT (Qwen 3B + Qwen 8B ZeRO-3) | ✅ |
| vLLM serving (OpenAI-compatible endpoint) | ⏳* |
| 5-criterion LLM-judge eval + user simulator + baselines | ✅ |
| Bootstrap CIs, Wilcoxon/Mann-Whitney, Holm-Bonferroni | ✅ |
| PDF eval report | ✅ |
| Modal + RunPod cloud recipes | ✅ |
cloud doctor / cloud setup / cost-prompt UX |
✅ |
| 3 paper reproductions ready to run | ✅ |
Generic cloud run entrypoint for arbitrary workflows |
✅ |
* vLLM serving is container-verified (the model loads and the OpenAI routes register); the HTTP /v1/chat/completions path is pending end-to-end verification.
The scope is the feature
agent2model does one thing: internalise a single-agent procedural conversation
into a small model's weights, and prove it worked. It is deliberately not a
do-everything agent framework — that focus is what makes the compilation actually
work and what keeps it distinct from prompt-optimizers and orchestrators:
- Full-parameter SFT only — no LoRA. Dennis et al. 2026b shows LoRA fails to
internalise procedures at any rank, so the CLI refuses
--lorawith a link to the companion paper. Shipping a known-broken path would only erode trust. - One agent, declared procedure, no tool-use. This is the exact slice that bakes cleanly into weights — as opposed to cloning open-ended tool/reasoning trajectories.
- RLHF/DPO, online learning, multi-agent handoffs, tool-use during inference are v2+. Out of scope on purpose, not missing.
Development
ruff check . && black --check . && mypy src && pytest tests/unit
Three test tiers:
unit— fast, mocked, every PR (defaultpytest)integration— real Anthropic API, tiny budget, nightly CI (pytest -m integration)e2e— full reproduction on Modal, release gate (pytest -m e2e)
CI workflow: .github/workflows/ci.yml.
Citation
If you use this library, please cite the paper it reproduces:
@misc{dennis2026compiling,
title = {Compiling Agentic Workflows into LLM Weights:
Near-Frontier Quality at Two Orders of Magnitude Less Cost},
author = {Dennis, et al.},
year = {2026},
eprint = {2605.22502},
archivePrefix = {arXiv},
}
License
Apache-2.0 © 2026 agent2model contributors
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file agent2model-0.1.0.tar.gz.
File metadata
- Download URL: agent2model-0.1.0.tar.gz
- Upload date:
- Size: 125.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
66fb3765fb255d2754e9592418aadf6a662565cbfd0cb90999b3c69b04ddfc95
|
|
| MD5 |
f078d0e0109844e6df036f63e2142092
|
|
| BLAKE2b-256 |
370e4a82b3586df93745fb2019d446bf4bb394c96fe53754b5584c0ab803f118
|
Provenance
The following attestation bundles were made for agent2model-0.1.0.tar.gz:
Publisher:
release.yml on kamaalg/agent2model
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
agent2model-0.1.0.tar.gz -
Subject digest:
66fb3765fb255d2754e9592418aadf6a662565cbfd0cb90999b3c69b04ddfc95 - Sigstore transparency entry: 1675498147
- Sigstore integration time:
-
Permalink:
kamaalg/agent2model@9641160872629a687bd7b6863dd2a2fe8a019010 -
Branch / Tag:
refs/tags/v0.1.0 - Owner: https://github.com/kamaalg
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
release.yml@9641160872629a687bd7b6863dd2a2fe8a019010 -
Trigger Event:
push
-
Statement type:
File details
Details for the file agent2model-0.1.0-py3-none-any.whl.
File metadata
- Download URL: agent2model-0.1.0-py3-none-any.whl
- Upload date:
- Size: 147.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
c66bcb16f105c51169aff85b928d2471683cc58f0ca4b8a7206d7229d69697e3
|
|
| MD5 |
5abcc378c3f8a4c6d4e46db903c86877
|
|
| BLAKE2b-256 |
3e03bc476d1087c04111cab97f4f991729d89cdd6c9b82dba838a541045555bc
|
Provenance
The following attestation bundles were made for agent2model-0.1.0-py3-none-any.whl:
Publisher:
release.yml on kamaalg/agent2model
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
agent2model-0.1.0-py3-none-any.whl -
Subject digest:
c66bcb16f105c51169aff85b928d2471683cc58f0ca4b8a7206d7229d69697e3 - Sigstore transparency entry: 1675498156
- Sigstore integration time:
-
Permalink:
kamaalg/agent2model@9641160872629a687bd7b6863dd2a2fe8a019010 -
Branch / Tag:
refs/tags/v0.1.0 - Owner: https://github.com/kamaalg
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
release.yml@9641160872629a687bd7b6863dd2a2fe8a019010 -
Trigger Event:
push
-
Statement type: