Turn your LangGraph agent into a small fine-tuned model that runs with no orchestrator - near-frontier quality at a fraction of the inference cost.

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

kamaalg

These details have not been verified by PyPI

Project description

agent2model

Turn your LangGraph agent into a small open model that runs with no orchestrator.

Point agent2model at an agent procedure (a LangGraph graph, or a YAML flowchart) and it bakes the whole procedure into a small model's weights — so the model self-orchestrates with no runtime orchestrator and no per-turn frontier calls. The method reports near-frontier quality at 128–462× lower inference cost (Dennis et al. 2026); those are the paper's figures, not yet independently reproduced in this repo — see Benchmarks.

The idea in one diagram

┌──────────────────────────────────────────────────────────────────────────┐
│                                                                          │
│   THE OLD WAY                            THE agent2model WAY              │
│   ───────────                            ──────────────────              │
│                                                                          │
│   ┌──────────────┐                       ┌──────────────────┐            │
│   │ User message │                       │   User message   │            │
│   └──────┬───────┘                       └────────┬─────────┘            │
│          │                                        │                      │
│          ▼                                        ▼                      │
│   ┌──────────────┐                       ┌──────────────────┐            │
│   │  LangGraph   │  (every turn:         │  Compiled small  │  one call: │
│   │ orchestrator │   prompt + entire     │   model (Qwen)   │  procedure │
│   └──────┬───────┘   flowchart sent      └────────┬─────────┘  is in the │
│          │           to frontier)                 │             weights  │
│          ▼                                        ▼                      │
│   ┌──────────────┐                       ┌──────────────────┐            │
│   │   Sonnet/    │                       │  Agent response  │            │
│   │   GPT-4      │  ($$$ per turn)       │   (no Claude)    │  (¢¢/turn) │
│   └──────┬───────┘                       └──────────────────┘            │
│          │                                                               │
│          ▼                                                               │
│   ┌──────────────┐                                                       │
│   │Agent response│                                                       │
│   └──────────────┘                                                       │
│                                                                          │
└──────────────────────────────────────────────────────────────────────────┘

You describe your agent's procedure once — or import it from a LangGraph StateGraph. agent2model walks the procedure, has Claude write thousands of synthetic conversations through every possible path, and fine-tunes a small open-source model on those conversations. The result: a self-contained model that knows the procedure — no external orchestrator, no per-turn frontier calls.

How it differs from what you already use: prompt-optimizers like DSPy and GEPA make a frontier model follow a procedure better but keep a runtime program; orchestrators like LangGraph and CrewAI run the procedure live on every turn. agent2model is the only one that removes the orchestrator entirely by baking the procedure into the weights.

Based on Dennis et al. 2026, Compiling Agentic Workflows into LLM Weights (arXiv:2605.22502).

The 4-command pipeline

# 1. Validate a workflow and emit the canonical IR.   [free, offline, no GPU]
agent2model compile examples/travel_booking/flowchart.yaml --out build/travel

# 2. Generate synthetic training conversations via Claude.  [Anthropic API $; --budget caps it]
agent2model generate build/travel --n 2000 --budget 60

# 3. Fine-tune Qwen 2.5/3 on the generated data.      [needs a GPU + the [train] extra]
agent2model train build/travel --base Qwen/Qwen2.5-3B-Instruct --epochs 20

# 4. Evaluate against baselines, then serve via vLLM.  [eval = Anthropic API $; serve needs a GPU]
agent2model eval  build/travel --baselines in_context,langgraph --n 200
agent2model serve build/travel --port 8000

Cost/hardware at a glance: compile is free and offline. generate and eval make Anthropic API calls (each prints an estimate first and is capped by --budget). train and serve need a GPU — locally with the [train]/[serve] extras, or on Modal (below) if you have no GPU. A full paper reproduction (generate → train → eval) runs ~$30–50 end-to-end.

Don't have a GPU? The whole pipeline runs on Modal:

agent2model cloud setup                                # one-time wizard
agent2model cloud run my_workflow.yaml --size 3b --n 2000 --epochs 20

Start from the agent you already have

You don't have to write YAML. If your procedure already lives in a LangGraph StateGraph, point compile straight at the .py and agent2model extracts the nodes, edges, and decision points into the IR for you:

pip install "agent2model[langgraph]"                  # needed only to import a .py graph
agent2model compile my_graph.py --out build/mine      # LangGraph → IR, automatically

Compiling a .py graph imports and runs it — only point compile at files you trust (see docs/adapters.md). The adapter recovers the procedure's structure (nodes/edges/branches); you then fill in node prompts and terminal types where they can't be inferred (generate will refuse until the placeholder TODO: prompts are replaced). CrewAI Flows and Oracle Agent Spec import are on the near-term roadmap.

Prefer to author from scratch, or have no existing graph? Write the Flowchart IR directly — it's a few lines of YAML.

Results from the paper (Dennis et al. 2026)

These are the paper's published targets, reproduced here as the benchmark we hold ourselves to. Independently-reproduced numbers (this repo, your hardware) are tracked in benchmarks/ — see Benchmarks & reproduction.

Compiled small models vs. frontier baselines, n=200 conversations per condition, scored 1–5 on each of 5 criteria.

Travel booking (3B, 14 nodes)

Criterion	agent2model 3B (paper)	Same-model orch.	LangGraph	In-context (Sonnet)
Task Success	4.11	3.93	4.17	4.53
Information Acc.	4.75	4.69	4.21	4.64
Consistency	4.34	4.12	4.32	4.96
Graceful Handling	4.07	3.87	4.62	4.96
Naturalness	4.12	3.96	4.84	5.00

Cost per conversation

Domain	In-context Sonnet	LangGraph	agent2model (paper target)	Reduction
Travel (14 nodes)	$0.133	$0.077	$0.0010	128×
Zoom Support (14)	$0.103	$0.054	$0.0003	296×
Insurance Claims (55+)	$0.327	$0.174	$0.0007	462×

→ The paper reports 87–98% of frontier quality at a fraction of a percent of the cost. These figures are from Dennis et al. 2026; this repo's own reproduced numbers live in benchmarks/ and are still being filled in.

The Flowchart IR

Procedures are described as YAML. This is the library's public contract.

name: travel_booking
description: Help a customer plan and book a trip
start: greet

nodes:
  greet:
    role: agent
    prompt: Warmly greet the customer and ask what they need.
    next: [gather_preferences]

  gather_preferences:
    role: agent
    prompt: Ask about destination, dates, budget, group size — one at a time.
    next: [assess_readiness]

  assess_readiness:
    role: decision                # an LLM picks at *generation* time, never runtime
    next:
      - to: present_options
        when: user has provided all required info
      - to: gather_preferences
        when: details are still missing

  booking_confirmed:
    terminal: success             # success | abandonment | escalation

scenario_variables:
  destination_pool: [Japan, Italy, Iceland, Portugal]
  budget_range: [500, 5000]
  user_styles: [decisive, indecisive, skeptical, enthusiastic]

The validator enforces every invariant the paper requires:

✓ Every non-terminal node has ≥1 outgoing edge
✓ Every terminal is reachable from start
✓ Cycles must contain a terminal-reaching escape edge (no dead-end loops)
✓ role: decision is resolved only during data generation — the trained model self-orchestrates with no runtime router

Already have a LangGraph workflow? Skip the YAML — point compile at the .py:

agent2model compile my_graph.py --out build/mine

What you actually get from a run

build/travel/
├── flowchart.json          # canonical IR
├── dataset.jsonl           # n synthetic conversations (HF chat-template)
├── cost.json               # Anthropic token + USD ledger
├── generation_state.json   # checkpoint — resumes on rerun
├── model/best/             # fine-tuned Qwen, the compiled model
├── eval_report.pdf         # per-criterion bar charts, CIs, costs
└── eval_report.json        # machine-readable scores + bootstrap CIs

Install

pip install agent2model

That core install is enough to compile a flowchart, generate data, and eval (no GPU needed). Pull in extras for the heavier paths:

pip install "agent2model[train]"      # torch + trl + transformers + deepspeed (GPU)
pip install "agent2model[serve]"      # vLLM OpenAI-compatible serving (GPU/Linux)
pip install "agent2model[cloud]"      # Modal cloud recipes
pip install "agent2model[langgraph]"  # compile FROM a LangGraph .py graph

Cloud-first (no local GPU)

pip install "agent2model[cloud]"
agent2model cloud setup           # wizard: Modal account, token, anthropic-secret
agent2model cloud doctor          # checklist; tells you what's missing

From source (development)

git clone https://github.com/kamaalg/agent2model && cd agent2model
pip install -e ".[dev]"

Reproduce the paper

The library ships three reproduction entrypoints — each a single Modal command:

modal run -m agent2model.cloud.modal_app::reproduce_travel       # 3B, ~3.5h, ~$30-50
modal run -m agent2model.cloud.modal_app::reproduce_zoom         # 8B, ~30 min train
modal run -m agent2model.cloud.modal_app::reproduce_insurance    # 8B, 55+ nodes

Each chains generate → train → evaluate end-to-end and writes a PDF report. The CI gate fails any release whose numbers regress > 5% below benchmarks/targets.json.

Architecture

                       ┌───────────────────────────────────────────┐
                       │              YOUR WORKFLOW                │
                       │     ( YAML flowchart  or  LangGraph .py ) │
                       └──────────────────┬────────────────────────┘
                                          │
                              agent2model compile
                                          │
                                          ▼
                       ┌───────────────────────────────────────────┐
                       │        Canonical Flowchart IR             │
                       │   (Pydantic, validated, version-stable)   │
                       └──────────────────┬────────────────────────┘
                                          │
                              agent2model generate
                                  (Claude Sonnet 4.5,
                                  async, prompt-cached,
                                  budget-capped, resumable)
                                          │
                                          ▼
                       ┌───────────────────────────────────────────┐
                       │     N synthetic conversations (JSONL)     │
                       │  flowchart NEVER appears in training data │
                       └──────────────────┬────────────────────────┘
                                          │
                              agent2model train
                                  (TRL SFTTrainer,
                                  full-param only,
                                  DeepSpeed ZeRO-3 for 8B)
                                          │
                                          ▼
                       ┌───────────────────────────────────────────┐
                       │        Compiled Qwen 2.5/3 model          │
                       │   (best checkpoint by held-out eval loss) │
                       └──────────┬────────────────────────┬───────┘
                                  │                        │
                       agent2model eval         agent2model serve
                  (5-criterion LLM-judge,         (vLLM, OpenAI-compatible
                  user simulator, baselines,       endpoint, autoscaling)
                  SciPy stats, PDF report)

Benchmarks

Anyone can fine-tune a model. What's hard — and what agent2model ships as a first-class artifact — is measuring whether a model actually follows a multi-turn procedure. The eval harness is a standalone, reproducible benchmark:

a 5-criterion LLM-judge rubric (task success, information accuracy, consistency, graceful handling, naturalness) modelled on the paper's criteria, with agent2model's own behavioral anchors;
a dynamic user simulator that role-plays customers with no knowledge of the flowchart, so scores reflect generalisation, not memorisation;
baselines in the same harness — in-context frontier, LangGraph orchestrator, and same-base-model-orchestrated — to isolate the effect of compilation;
proper statistics: bootstrap 95% CIs (10k resamples), Wilcoxon/Mann-Whitney, Holm-Bonferroni correction, failure rates, and cost-per-conversation.

No competing agent or distillation tool ships procedure-adherence evaluation — they benchmark QA/math/tool tasks. Run the whole thing in one command:

agent2model eval build/travel --baselines in_context,langgraph,same_model_orch --n 200

The live leaderboard (compiled-3B/8B vs. every baseline, reproduced on real hardware) lives in benchmarks/. Paper targets are in benchmarks/targets.json; a release is blocked if any measured criterion regresses > 5% below target.

Documentation

Page	What it covers
Quickstart	First 10 minutes, end-to-end
Cloud Quickstart	Brutally explicit cloud prereqs + costs
IR Spec Reference	Every field of the flowchart YAML
Training Guide	Hyperparameters, GPU sizing, DeepSpeed
Evaluation Guide	The 5-criterion rubric, baselines, stats
Cloud Deployment	Modal recipes + RunPod templates
Troubleshooting	Common errors and fixes
FAQ	What it does, what it doesn't, edge cases

Build locally: pip install -e ".[docs]" && mkdocs serve.

Project status

✅ All 8 phases complete — IR, generation, LangGraph adapter, training, serving, eval, cloud, examples/docs/release
✅ Unit tests passing, ruff / black / mypy --strict clean
✅ Verified end-to-end — compile, generate (real Anthropic), eval (real harness with judge + simulator + baseline + stats + PDF); 3B training verified on Modal
⏳ vLLM serving — container-verified (model loads, OpenAI routes register); HTTP end-to-end pending
⏳ PyPI publish — pending first paper-faithful reproduction

What v1 ships

Feature	Status
Flowchart IR (YAML schema + validator)	✅
LangGraph adapter (`.py` → IR)	✅
Synthetic data generation (async, prompt-cached, resumable)	✅
Full-parameter SFT (Qwen 3B + Qwen 8B ZeRO-3)	✅
vLLM serving (OpenAI-compatible endpoint)	⏳*
5-criterion LLM-judge eval + user simulator + baselines	✅
Bootstrap CIs, Wilcoxon/Mann-Whitney, Holm-Bonferroni	✅
PDF eval report	✅
Modal + RunPod cloud recipes	✅
`cloud doctor` / `cloud setup` / cost-prompt UX	✅
3 paper reproductions ready to run	✅
Generic `cloud run` entrypoint for arbitrary workflows	✅

_{* vLLM serving is container-verified (the model loads and the OpenAI routes register); the HTTP /v1/chat/completions path is pending end-to-end verification.}

The scope is the feature

agent2model does one thing: internalise a single-agent procedural conversation into a small model's weights, and prove it worked. It is deliberately not a do-everything agent framework — that focus is what makes the compilation actually work and what keeps it distinct from prompt-optimizers and orchestrators:

Full-parameter SFT only — no LoRA. Dennis et al. 2026b shows LoRA fails to internalise procedures at any rank, so the CLI refuses --lora with a link to the companion paper. Shipping a known-broken path would only erode trust.
One agent, declared procedure, no tool-use. This is the exact slice that bakes cleanly into weights — as opposed to cloning open-ended tool/reasoning trajectories.
RLHF/DPO, online learning, multi-agent handoffs, tool-use during inference are v2+. Out of scope on purpose, not missing.

Development

ruff check . && black --check . && mypy src && pytest tests/unit

Three test tiers:

unit — fast, mocked, every PR (default pytest)
integration — real Anthropic API, tiny budget, nightly CI (pytest -m integration)
e2e — full reproduction on Modal, release gate (pytest -m e2e)

CI workflow: .github/workflows/ci.yml.

Citation

If you use this library, please cite the paper it reproduces:

@misc{dennis2026compiling,
  title  = {Compiling Agentic Workflows into LLM Weights:
            Near-Frontier Quality at Two Orders of Magnitude Less Cost},
  author = {Dennis, et al.},
  year   = {2026},
  eprint = {2605.22502},
  archivePrefix = {arXiv},
}

License

Project details

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

kamaalg

These details have not been verified by PyPI

Release history Release notifications | RSS feed

This version

0.1.0

May 30, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

agent2model-0.1.0.tar.gz (125.1 kB view details)

Uploaded May 30, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

agent2model-0.1.0-py3-none-any.whl (147.2 kB view details)

Uploaded May 30, 2026 Python 3

File details

Details for the file agent2model-0.1.0.tar.gz.

File metadata

Download URL: agent2model-0.1.0.tar.gz
Upload date: May 30, 2026
Size: 125.1 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for agent2model-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`66fb3765fb255d2754e9592418aadf6a662565cbfd0cb90999b3c69b04ddfc95`
MD5	`f078d0e0109844e6df036f63e2142092`
BLAKE2b-256	`370e4a82b3586df93745fb2019d446bf4bb394c96fe53754b5584c0ab803f118`

See more details on using hashes here.

Provenance

The following attestation bundles were made for agent2model-0.1.0.tar.gz:

Publisher: release.yml on kamaalg/agent2model

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: agent2model-0.1.0.tar.gz
- Subject digest: 66fb3765fb255d2754e9592418aadf6a662565cbfd0cb90999b3c69b04ddfc95
- Sigstore transparency entry: 1675498147
- Sigstore integration time: May 30, 2026
Source repository:
- Permalink: kamaalg/agent2model@9641160872629a687bd7b6863dd2a2fe8a019010
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/kamaalg
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@9641160872629a687bd7b6863dd2a2fe8a019010
- Trigger Event: push

File details

Details for the file agent2model-0.1.0-py3-none-any.whl.

File metadata

Download URL: agent2model-0.1.0-py3-none-any.whl
Upload date: May 30, 2026
Size: 147.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for agent2model-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c66bcb16f105c51169aff85b928d2471683cc58f0ca4b8a7206d7229d69697e3`
MD5	`5abcc378c3f8a4c6d4e46db903c86877`
BLAKE2b-256	`3e03bc476d1087c04111cab97f4f991729d89cdd6c9b82dba838a541045555bc`

See more details on using hashes here.

Provenance

The following attestation bundles were made for agent2model-0.1.0-py3-none-any.whl:

Publisher: release.yml on kamaalg/agent2model

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: agent2model-0.1.0-py3-none-any.whl
- Subject digest: c66bcb16f105c51169aff85b928d2471683cc58f0ca4b8a7206d7229d69697e3
- Sigstore transparency entry: 1675498156
- Sigstore integration time: May 30, 2026
Source repository:
- Permalink: kamaalg/agent2model@9641160872629a687bd7b6863dd2a2fe8a019010
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/kamaalg
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@9641160872629a687bd7b6863dd2a2fe8a019010
- Trigger Event: push

agent2model 0.1.0

Navigation

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Classifiers

Project description

agent2model

Turn your LangGraph agent into a small open model that runs with no orchestrator.

The idea in one diagram

The 4-command pipeline

Start from the agent you already have

Results from the paper (Dennis et al. 2026)

Travel booking (3B, 14 nodes)

Cost per conversation

The Flowchart IR

What you actually get from a run

Install

Cloud-first (no local GPU)

From source (development)

Reproduce the paper

Architecture

Benchmarks

Documentation

Project status

What v1 ships

The scope is the feature

Development

Citation

License

Project details

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

Provenance

File details

File metadata

File hashes

Provenance