Cost-aware model routing for the Claude API — reference implementation with a Claude Code subagent hook
Project description
claude-model-router
Reference implementation of cost-aware model routing for the Claude API — plus a Claude Code hook that stops subagents from silently inheriting your most expensive model.
A cheap Haiku 4.5 call classifies each prompt's complexity; the prompt then executes on the cheapest capable tier. Small, tested, and built around the parts naive routers get wrong.
Why naive Claude routers break
Most routing examples treat model IDs as interchangeable strings. On the Claude 5 family they are not:
| Trap | What actually happens | Where handled |
|---|---|---|
| Refusals are not exceptions | Fable 5 refusals return HTTP 200 with stop_reason: "refusal" — a try/except fallback never fires |
server-side fallback beta + explicit stop_reason check (router.py) |
thinking differs per model |
Fable 5: always on, explicit config → 400. Opus 4.8: omitted = off. Sonnet 5: omitted = adaptive. Haiku 4.5: adaptive → 400 | per-model request shape in build_request() |
effort is not universal |
output_config: {"effort": ...} → 400 on Haiku 4.5 |
effort omitted on the trivial tier |
content[0].text crashes |
refusals ship empty content; thinking blocks precede text blocks |
block-type-filtered extraction |
Small max_tokens + high effort = truncation |
thinking tokens spend from max_tokens; xhigh turns run minutes |
64K max_tokens + streaming everywhere |
| Cache dies on model switch | prompt caches are model-scoped; per-turn routing invalidates the cache every turn | route once per conversation, reuse via execute() |
| Context windows differ | Haiku 4.5 = 200K; every other tier model = 1M | trivial bumped to low above 180K context tokens |
Tiers
| Tier | Model | Effort | Notes |
|---|---|---|---|
| trivial | claude-haiku-4-5 |
— | reformatting, extraction, classification |
| low | claude-sonnet-5 |
medium | summaries, single-function edits, simple Q&A |
| mid | claude-opus-4-8 |
high | debugging, security review, single-file refactors |
| high | claude-fable-5 |
xhigh | multi-file refactors, migrations, architecture |
The keyword-heuristic fallback (used when the classifier call fails) never assigns trivial — misrouting real work to Haiku costs more in retries than the tier saves, so only the classifier may pick it.
Usage
import anthropic
from model_router import run, execute
client = anthropic.Anthropic()
result = run(client, "Refactor the auth module across services")
print(result.route.model, result.route.source) # claude-fable-5 classifier
print(result.served_by) # may be opus-4-8 after a fallback
print(result.text)
Multi-turn: prompt caches are model-scoped, so route once per conversation and reuse the route for follow-ups —
route = result.route
result2 = execute(client, route, full_message_history)
Cost model
1,000 prompts, mixed workload (30% trivial / 40% low / 20% mid / 10% high), list pricing, output token counts held equal across models (conservative — Fable at xhigh thinks longer than smaller models on the same prompt):
| Tier | Prompts | Routed model | Routed cost | All-Fable cost |
|---|---|---|---|---|
| trivial | 300 | claude-haiku-4-5 |
$0.54 | $5.40 |
| low | 400 | claude-sonnet-5 |
$6.60 | $22.00 |
| mid | 200 | claude-opus-4-8 |
$18.50 | $37.00 |
| high | 100 | claude-fable-5 |
$80.00 | $80.00 |
| classifier overhead | 1000 | claude-haiku-4-5 |
$0.72 | — |
| total | $106.36 | $144.40 |
26% saved on this mix — and the mix is the whole story: the 10% of prompts that legitimately need Fable dominate the bill. A support-bot-shaped workload (mostly trivial/low) saves 70%+. Rerun with your own shape:
python benchmarks/cost_model.py
This is a pricing calculation, not a live benchmark — it spends no tokens and its assumptions are at the top of the script.
Live benchmark (real tokens, real dollars)
benchmarks/live_bench.py runs a 17-prompt
mixed workload through both arms — routed vs. all-Fable — with real API
calls, and prices each call from the usage block the API actually
returned (by the model that actually served it, which matters after a
fallback). Classifier overhead is metered from real Haiku usage, not
estimated.
ANTHROPIC_API_KEY=... python benchmarks/live_bench.py --tiers trivial,low # cheap smoke run
ANTHROPIC_API_KEY=... python benchmarks/live_bench.py --out live_results.md # full run, ~$2-6
It spends real money (the all-Fable arm at xhigh dominates), refuses to
start without a key, and flags every prompt where the classifier's tier
disagreed with the expected tier — so you can audit misroutes instead of
trusting a single savings number. RoutedResult.usage exposes the same
real token counts in your own code.
This is the honest version of the standard methodology: the same cost-quality comparison RouterBench and RouteLLM run offline, at a scale one person can afford to reproduce.
Claude Code hook: stop subagents from burning Fable tokens
Separate deliverable, same philosophy. In Claude Code, subagents inherit the session model by default — run your session on an expensive model and every spawned agent (including greps and file listings) bills at that rate.
hooks/subagent-router.py is a PreToolUse hook
that denies any Agent spawn missing an explicit model param. The deny
reason carries a routing table, so the session model immediately re-issues
the spawn with the cheapest capable tier. Self-correcting, one file, no
dependencies.
Install:
cp hooks/subagent-router.py ~/.claude/hooks/
Then add to ~/.claude/settings.json:
"hooks": {
"PreToolUse": [
{
"matcher": "Agent",
"hooks": [
{
"type": "command",
"command": "python3 ~/.claude/hooks/subagent-router.py"
}
]
}
]
}
New sessions then enforce: unrouted spawn → denied with policy → re-spawned
with explicit model (haiku for greps/locates, sonnet for single-file work
and reviews, opus for multi-file/hard debugging, session model only for
judgment-critical synthesis).
Install
pip install claude-model-router
Or from source:
git clone https://github.com/junoseong/claude-model-router
cd claude-model-router
python3 -m venv .venv && .venv/bin/pip install -e '.[dev]'
.venv/bin/pytest
Notes
- Fable 5 requires 30-day data retention; zero-data-retention orgs get 400s
on the high tier — remap
hightoclaude-opus-4-8in_TIER_ROUTES. - Sonnet 5 has intro pricing ($2/$10 per MTok) through 2026-08-31; the cost model uses sticker prices ($3/$15).
MIT licensed.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file claude_model_router-0.2.0.tar.gz.
File metadata
- Download URL: claude_model_router-0.2.0.tar.gz
- Upload date:
- Size: 10.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
39b88a082b703af900d03972cd6ac9126b26ccbf6a596cdcb8607738726246d0
|
|
| MD5 |
f37b9a10eb02ace0c9d7002fb152c5e0
|
|
| BLAKE2b-256 |
7a270428110e88ccaf00160cd0a43ed40ee188c0eec786781ed1781bb05a29db
|
Provenance
The following attestation bundles were made for claude_model_router-0.2.0.tar.gz:
Publisher:
publish.yml on junoseong/claude-model-router
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
claude_model_router-0.2.0.tar.gz -
Subject digest:
39b88a082b703af900d03972cd6ac9126b26ccbf6a596cdcb8607738726246d0 - Sigstore transparency entry: 2048611626
- Sigstore integration time:
-
Permalink:
junoseong/claude-model-router@9cc60f4d78e0e55552bb35b1552e00100f660f79 -
Branch / Tag:
refs/tags/v0.2.0 - Owner: https://github.com/junoseong
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@9cc60f4d78e0e55552bb35b1552e00100f660f79 -
Trigger Event:
release
-
Statement type:
File details
Details for the file claude_model_router-0.2.0-py3-none-any.whl.
File metadata
- Download URL: claude_model_router-0.2.0-py3-none-any.whl
- Upload date:
- Size: 9.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
1d82c6a1130571e37ebc1caa06b8b4e52b26a82cb5e995b38005174638676cda
|
|
| MD5 |
720cfd39912b9cfebc9b326d6d3b4325
|
|
| BLAKE2b-256 |
6dd318b540046d58a265cb8612473333fa62923d8b9154d2108b09e0ce42f7ee
|
Provenance
The following attestation bundles were made for claude_model_router-0.2.0-py3-none-any.whl:
Publisher:
publish.yml on junoseong/claude-model-router
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
claude_model_router-0.2.0-py3-none-any.whl -
Subject digest:
1d82c6a1130571e37ebc1caa06b8b4e52b26a82cb5e995b38005174638676cda - Sigstore transparency entry: 2048611633
- Sigstore integration time:
-
Permalink:
junoseong/claude-model-router@9cc60f4d78e0e55552bb35b1552e00100f660f79 -
Branch / Tag:
refs/tags/v0.2.0 - Owner: https://github.com/junoseong
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@9cc60f4d78e0e55552bb35b1552e00100f660f79 -
Trigger Event:
release
-
Statement type: