Intelligent LLM task planner — decompose tasks, route to optimal models, enforce budgets
TokenWise
Production-grade LLM routing with budget ceilings, tiered escalation, and multi-provider failover.
30-Second Demo
pip install tokenwise-llm
# Route a query to the best model within budget
tokenwise route "Debug this segfault" --strategy best_quality --budget 0.05
# Decompose a task, execute it, track spend
tokenwise plan "Write a Python function to validate email addresses, \
then write unit tests for it" --budget 0.05 --execute
Example output:
Plan: 4 steps | Budget: $0.05 | Estimated: $0.0002
Status: Success | Total cost: $0.0007 | Budget remaining: $0.0493
If a step fails, TokenWise automatically escalates to a stronger model and retries within budget.
Why
Most LLM frameworks optimize for capability. TokenWise optimizes for cost governance. You declare a budget per request — TokenWise enforces it, selects the best model within that constraint, and escalates only when needed. No hidden traffic allocation, no implicit ceilings. All decisions are explicit and per-request.
Quick Start
Set your API key
export OPENROUTER_API_KEY="sk-or-..."
TokenWise uses OpenRouter as the default gateway. Set `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, or `GOOGLE_API_KEY` to bypass OpenRouter for those providers.
Route a query
tokenwise route "Write a haiku about Python"
Route with budget ceiling
tokenwise route "Debug this segfault" --strategy best_quality --budget 0.05
Plan and execute
tokenwise plan "Build a REST API" --budget 0.50 --execute
Inspect spend
tokenwise ledger --summary
Routing strategies
| Strategy | When to Use | How It Works |
|---|---|---|
| `cheapest` | Minimize cost | Lowest-price capable model |
| `best_quality` | Maximize quality | Best flagship-tier capable model |
| `balanced` | Default | Matches model tier to query complexity |
Budget is a universal parameter on all strategies. Pass `budget_strict=False` to fall back to best-effort selection.
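A minimal sketch of the strict-vs-best-effort distinction, in plain Python (the function and argument names here are illustrative, not the TokenWise internals):

```python
def pick_within_budget(candidates, budget, budget_strict=True):
    """candidates: (model_id, estimated_cost) pairs, sorted by preference."""
    affordable = [(mid, cost) for mid, cost in candidates if cost <= budget]
    if affordable:
        return affordable[0][0]          # most-preferred affordable candidate
    if budget_strict:
        raise ValueError("no model fits the budget ceiling")
    # best-effort fallback: cheapest model overall
    return min(candidates, key=lambda c: c[1])[0]
```

With a strict budget the caller gets an error instead of a silent overrun; with `budget_strict=False` the cheapest model is used even if it exceeds the ceiling.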
Python API
from tokenwise import Router, Planner, Executor
# Route a single query
router = Router()
model = router.route("Explain quantum computing", strategy="balanced", budget=0.10)
print(f"{model.id} (${model.input_price}/M input)")
# Plan and execute a complex task
planner = Planner()
plan = planner.plan(task="Build a REST API for a todo app", budget=0.50)
executor = Executor()
result = executor.execute(plan)
print(f"Cost: ${result.total_cost:.4f}, success: {result.success}")
OpenAI-compatible proxy
tokenwise serve --port 8000
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
response = client.chat.completions.create(
model="auto", # TokenWise picks the best model
messages=[{"role": "user", "content": "Hello!"}],
)
Core Features
- Budget-aware routing — cost ceilings enforced via `max_tokens` caps with conservative estimation.
- Tiered escalation — budget, mid, flagship; escalates upward on failure, never downward.
- Capability-aware fallback — routes and fallbacks filtered by `code`, `reasoning`, `math`, or `general`.
- Task decomposition — LLM-powered planning with per-step model assignment and async DAG scheduling.
- Cost ledger — structured per-call accounting including failures and retries, persisted to JSONL.
- Multi-provider failover — OpenRouter, OpenAI, Anthropic, and Google with connection pooling.
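To illustrate the capability-aware fallback idea, here is a rough sketch (the `Model` shape and helper below are hypothetical, not TokenWise's actual types):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    id: str
    tier: str                # "budget" | "mid" | "flagship"
    capabilities: frozenset  # e.g. frozenset({"code", "general"})
    input_price: float       # USD per million input tokens


def capable_fallbacks(models, capability, exclude=()):
    """Cheapest-first fallback chain, restricted to models with the capability."""
    pool = [m for m in models if capability in m.capabilities and m.id not in exclude]
    return sorted(pool, key=lambda m: m.input_price)
```

The key property is that a `code` query never falls back to a model without the `code` capability, no matter how cheap it is.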
Benchmark: Cost–Quality Frontier
Cost–quality frontier plot (assets/pareto.png). X-axis: average cost per task (USD). Y-axis: success rate (%).
The star marks the TokenWise escalation strategy, which uses Router.route()
with escalating strategies (cheapest → balanced → best_quality) and
escalates to higher tiers when a task fails validation (max 1 escalation
per tier). Baselines use a single fixed model for all tasks.
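The benchmark's escalation loop can be sketched like this (a simplification; `attempt` and `validate` stand in for the real model call and per-task validator):

```python
STRATEGIES = ["cheapest", "balanced", "best_quality"]


def run_with_escalation(task, attempt, validate):
    """Try each strategy in order, escalating upward on validation failure.
    One attempt per tier, never downgrading."""
    total_cost = 0.0
    for strategy in STRATEGIES:
        output, cost = attempt(task, strategy)
        total_cost += cost
        if validate(task, output):
            return output, strategy, total_cost
    return None, STRATEGIES[-1], total_cost  # all tiers exhausted
```

Most tasks succeed at the cheapest tier, so the average cost stays close to the budget tier while the success rate approaches the flagship tier's.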
On this 20-task benchmark set, TokenWise Escalation is the only strategy reaching 100% success, while reducing average cost per task by ~5x versus Flagship Only.
| Strategy | Success | Avg Cost / Task | Cost Std | Models |
|---|---|---|---|---|
| Budget Only | 85% | $0.000177 | $0.000120 | gpt-4.1-nano |
| Mid Only | 90% | $0.003842 | $0.002430 | gpt-4.1 |
| Flagship Only | 95% | $0.009492 | $0.005994 | claude-sonnet-4 |
| TokenWise Escalation | 100% | $0.001985 | $0.004062 | Router-selected |
Results from a single fixed-seed run. Model pricing and outputs may vary over time.
In multi-step workflows (10–50 steps), the per-step savings scale with the workflow: a ~5x reduction per step yields roughly ~5x for the whole run. More steps also mean more opportunities to run easy sub-tasks on cheap models while still escalating where needed.
Success metric: 15 of 20 tasks have dedicated validators (reasoning correctness, code structure, substantiveness checks). The remaining 5 simple tasks use a length-check fallback (>20 chars). Validators are defined in `benchmarks/strategy_pareto.py`.
Budget: $0.03/task soft target — used for Router model filtering but not enforced as a hard ceiling.
How to reproduce
# Requires an OpenRouter API key (or direct provider keys)
export OPENROUTER_API_KEY="sk-or-..."
uv sync --group benchmark
uv run python benchmarks/strategy_pareto.py
Generated artifacts:
| File | Contents |
|---|---|
| `assets/pareto.png` | Cost–quality scatter plot |
| `benchmarks/strategy_results.csv` | Per-task results: strategy, task, category, success, cost, model, escalated, latency, budget, budget_violation |
Use `--dry-run` to preview the plan without making API calls.
Limitations
- Small task set (20 tasks across 4 categories); not a comprehensive LLM benchmark.
- Validators are heuristic — they check for known correct patterns, not full semantic correctness.
- Model pricing and availability change over time; results are provider-dependent.
- Single-run results; no confidence intervals. The run date is printed in the script output.
- Escalation in the benchmark uses a validation-retry loop; in production, the Executor escalates on execution failure.
Comparison
Most routing tools optimize per-request model choice. TokenWise treats routing as a workflow-level control system.
High-level comparison (as of February 2026). Corrections welcome via issues.
| Feature | TokenWise | RouteLLM | LiteLLM | Not Diamond | Martian | Portkey | OpenRouter |
|---|---|---|---|---|---|---|---|
| Task decomposition | Yes | - | - | - | - | - | - |
| Strict budget ceiling | Yes | - | Yes | - | Per-request | Yes | Yes |
| Tier-based escalation | Yes | - | Yes | - | - | Yes | - |
| Capability-aware fallback | Yes | - | - | Partial | Yes | Partial | Partial |
| Cost ledger | Yes | - | Yes | - | - | Yes | Dashboard |
| OpenAI-compatible proxy | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| CLI | Yes | - | Yes | - | - | - | - |
| Self-hosted / open source | Yes | Yes | Yes | - | - | Gateway only | - |
How It Works
Architecture
┌───────────────────────────────────────────────────────┐
│ TokenWise │
│ │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ │
│ │ Router │ │ Planner │ │ Executor │ │
│ │ │ │ │ │ │ │
│ │ 1. Detect │ │ Breaks │ │ Runs the │ │
│ │ scenario │ │ task into │ │ plan, │ │
│ │ 2. Route │ │ steps + │ │ tracks │ │
│ │ within │ │ assigns │ │ spend, │ │
│ │ budget │ │ models │ │ retries │ │
│ └─────┬──────┘ └─────┬──────┘ └─────┬──────┘ │
│ │ │ │ │
│ └───────────────┼───────────────┘ │
│ ▼ │
│ ┌──────────────────────────┐ │
│ │ ProviderResolver │ ← LLM calls │
│ │ │ │
│ │ OpenAI · Anthropic │ │
│ │ Google · OpenRouter │ │
│ └──────────────────────────┘ │
│ │
│ ┌──────────────┐ │
│ │ Registry │ ← metadata + pricing │
│ └──────────────┘ │
└───────────────────────────────────────────────────────┘
Router pipeline
The router uses a two-stage pipeline: `detect` (capabilities + complexity), then `route` (filter by budget, apply strategy: `cheapest` / `balanced` / `best_quality`).
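A toy rendition of the two stages under the balanced strategy (the heuristics and model dicts below are invented for illustration; the real router consults the registry's capability and pricing metadata):

```python
def detect(query: str):
    # Stage 1: infer required capability and rough complexity from the query.
    code_markers = ("debug", "function", "api", "segfault")
    capability = "code" if any(k in query.lower() for k in code_markers) else "general"
    complexity = "high" if len(query) > 200 else "low"
    return capability, complexity


def route(models, capability, complexity, budget):
    # Stage 2: filter by capability and budget, then apply the strategy.
    pool = [m for m in models if capability in m["caps"] and m["est_cost"] <= budget]
    if not pool:
        return None
    if complexity == "high":  # balanced: reserve flagship tiers for hard queries
        pool = [m for m in pool if m["tier"] == "flagship"] or pool
    return min(pool, key=lambda m: m["est_cost"])["id"]
```

The separation matters: detection is query-only and cheap, while routing is a pure function over the model pool, so each stage can be tested in isolation.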
Planner and Executor
The Planner decomposes a task into subtasks using a cheap LLM, assigns the optimal model to each step within budget, and auto-downgrades expensive steps if over budget. The Executor runs the plan via async DAG scheduling, tracks actual cost via `CostLedger`, and escalates to stronger models on failure (flagship before mid, filtered by capability).

If `executor.execute(plan)` is called inside an existing event loop (Jupyter, FastAPI), it falls back to sequential execution. Use `await executor.aexecute(plan)` directly for concurrent DAG scheduling in async code.
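Conceptually, concurrent DAG scheduling looks like this plain-asyncio sketch (independent of TokenWise; step execution is stubbed out via `run_step`):

```python
import asyncio


async def run_dag(steps, deps, run_step):
    """steps: step ids in topological order; deps maps step -> prerequisite steps.
    Each step starts as soon as all of its prerequisites have finished."""
    tasks = {}

    async def runner(step):
        if deps.get(step):
            # Awaiting the same Task from multiple dependents is safe in asyncio.
            await asyncio.gather(*(tasks[d] for d in deps[step]))
        return await run_step(step)

    for step in steps:  # all tasks exist before any runner body awaits a dependency
        tasks[step] = asyncio.ensure_future(runner(step))
    results = await asyncio.gather(*(tasks[s] for s in steps))
    return dict(zip(steps, results))
```

Independent steps overlap their LLM calls, which is where most of the wall-clock savings come from in multi-step plans.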
Observability
Every execution produces a structured trace:
result = executor.execute(plan)
for sr in result.step_results:
print(f"Step {sr.step_id}: model={sr.model_id}, "
f"cost=${sr.actual_cost:.4f}, escalated={sr.escalated}")
for entry in result.ledger.entries:
print(f" {entry.reason}: {entry.model_id} "
f"({entry.input_tokens}in/{entry.output_tokens}out) "
f"${entry.cost:.6f} {'ok' if entry.success else 'FAIL'}")
print(f"Total: ${result.total_cost:.4f}, "
f"wasted: ${result.ledger.wasted_cost:.4f}, "
f"remaining: ${result.budget_remaining:.4f}")
Example output when step 1 fails and escalates:
Step 1: model=openai/gpt-4.1, cost=$0.0052, escalated=True
step 1 attempt 1: openai/gpt-4.1-mini (82in/0out) $0.000000 FAIL
step 1 escalation attempt 1: openai/gpt-4.1 (82in/204out) $0.001800 ok
Total: $0.0052, wasted: $0.0000, remaining: $0.9948
Budget Semantics
TokenWise enforces budget ceilings by capping `max_tokens` before each LLM call. Input token counts are estimated using a `chars / 4` heuristic with a 1.2x safety margin — not a tokenizer. The budget ceiling is real and enforced, but small overruns are possible when the heuristic underestimates input tokens. A future release will support pluggable tokenizer-based estimation for stricter guarantees.
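In rough terms, the estimation and cap derivation might look like this (the helper names are hypothetical; the `chars / 4` ratio and 1.2x margin are the ones described above):

```python
import math


def estimate_input_tokens(text: str) -> int:
    # chars/4 heuristic with a 1.2x safety margin (not a real tokenizer)
    return math.ceil(len(text) / 4 * 1.2)


def max_tokens_cap(budget_usd, input_tokens, in_price_per_m, out_price_per_m):
    # Spend whatever budget remains after the estimated input cost on output tokens.
    input_cost = input_tokens / 1e6 * in_price_per_m
    remaining = max(budget_usd - input_cost, 0.0)
    return int(remaining / out_price_per_m * 1e6)
```

Because the margin inflates the input estimate, the derived `max_tokens` cap errs low, which is why overruns are rare rather than impossible.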
Known Limitations (v0.4)
All three v0.3 limitations have been resolved:
- Planner cost not budgeted — tracked and deducted (v0.4)
- Linear execution — parallel DAG scheduling (v0.4)
- No persistent spend tracking — JSONL ledger (v0.4)
Configuration
TokenWise reads configuration from environment variables and an optional config file (`~/.config/tokenwise/config.yaml`).
| Variable | Required | Description | Default |
|---|---|---|---|
| `OPENROUTER_API_KEY` | Yes | OpenRouter API key | — |
| `OPENAI_API_KEY` | Optional | Direct OpenAI API key | — |
| `ANTHROPIC_API_KEY` | Optional | Direct Anthropic API key | — |
| `GOOGLE_API_KEY` | Optional | Direct Google AI API key | — |
| `OPENROUTER_BASE_URL` | Optional | OpenRouter base URL | https://openrouter.ai/api/v1 |
| `TOKENWISE_DEFAULT_STRATEGY` | Optional | Routing strategy | balanced |
| `TOKENWISE_DEFAULT_BUDGET` | Optional | Budget in USD | 1.00 |
| `TOKENWISE_PLANNER_MODEL` | Optional | Decomposition model | openai/gpt-4.1-mini |
| `TOKENWISE_PROXY_HOST` | Optional | Proxy bind host | 127.0.0.1 |
| `TOKENWISE_PROXY_PORT` | Optional | Proxy bind port | 8000 |
| `TOKENWISE_CACHE_TTL` | Optional | Registry cache TTL (s) | 3600 |
| `TOKENWISE_LEDGER_PATH` | Optional | Ledger JSONL path | ~/.config/tokenwise/ledger.jsonl |
| `TOKENWISE_MIN_OUTPUT_TOKENS` | Optional | Min output tokens per step | 100 |
| `TOKENWISE_LOCAL_MODELS` | Optional | Local models YAML | — |
# ~/.config/tokenwise/config.yaml
default_strategy: balanced
default_budget: 0.50
planner_model: openai/gpt-4.1-mini
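The JSONL ledger makes spend auditing straightforward with standard tools. As a sketch (the entry fields mirror the trace fields shown earlier, but treat the exact on-disk schema as an assumption):

```python
import json


def ledger_summary(path):
    """Sum total and wasted (failed-call) spend from a JSONL ledger file."""
    total = wasted = 0.0
    with open(path) as f:
        for line in f:
            entry = json.loads(line)
            total += entry["cost"]
            if not entry["success"]:
                wasted += entry["cost"]  # failed attempts still cost money
    return {"total_cost": total, "wasted_cost": wasted}
```

One JSON object per line means the ledger is append-only and greppable, and partial writes never corrupt earlier entries.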
Development
git clone https://github.com/itsarbit/tokenwise.git
cd tokenwise
uv sync
uv run pytest
uv run ruff check src/ tests/
uv run mypy src/
src/tokenwise/
├── models.py # Pydantic data models
├── config.py # Settings from env vars and config file
├── registry.py # ModelRegistry — fetches/caches models
├── router.py # Two-stage pipeline: scenario → strategy
├── planner.py # Decomposes tasks, assigns models
├── executor.py # Runs plans, tracks spend, escalates
├── ledger_store.py # Persistent JSONL spend history
├── cli.py # Typer CLI
├── proxy.py # FastAPI OpenAI-compatible proxy
├── providers/ # LLM provider adapters
│ ├── openrouter.py
│ ├── openai.py
│ ├── anthropic.py
│ ├── google.py
│ └── resolver.py # Maps model IDs → provider instances
└── data/
└── model_capabilities.json
Philosophy
LLM systems should be treated like distributed systems. That means clear failure semantics, explicit cost ceilings, predictable escalation, and observability. TokenWise is designed with that philosophy.
Background reading: LLM Routers Are Not Enough — the blog post that motivated TokenWise's design.
License
MIT
File details
Details for the file tokenwise_llm-0.4.5.tar.gz.
File metadata
- Download URL: tokenwise_llm-0.4.5.tar.gz
- Size: 671.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `ee321f979e4bbc7fcb12d9e28797d475377d0078504c1ca64543a180370f0810` |
| MD5 | `fa3e29adfd7b20957ed649344e296ed6` |
| BLAKE2b-256 | `0fbbd727aa4ae12d4169e0a5b46e01870c5ef83f13b2843710cfa65e54efd58e` |
Provenance
The following attestation bundles were made for tokenwise_llm-0.4.5.tar.gz:
Publisher: release.yml on itsarbit/tokenwise
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: tokenwise_llm-0.4.5.tar.gz
- Subject digest: `ee321f979e4bbc7fcb12d9e28797d475377d0078504c1ca64543a180370f0810`
- Sigstore transparency entry: 976935439
- Permalink: itsarbit/tokenwise@d1eba9d90e31056b53781f2b697295940b8e9e52
- Branch / Tag: refs/tags/v0.4.5
- Owner: https://github.com/itsarbit
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@d1eba9d90e31056b53781f2b697295940b8e9e52
- Trigger Event: push
File details
Details for the file tokenwise_llm-0.4.5-py3-none-any.whl.
File metadata
- Download URL: tokenwise_llm-0.4.5-py3-none-any.whl
- Size: 45.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `fe05100eb4f84c8f7662623b868ae8298cd1b918bb6c1df1e5882f22aead4564` |
| MD5 | `b1e58a4cfa726cdc30d1e39110ff27ca` |
| BLAKE2b-256 | `f067aae6750d5f3109bc15a3ef224d3ae01b55d3294220fa762812779d8a85a7` |
Provenance
The following attestation bundles were made for tokenwise_llm-0.4.5-py3-none-any.whl:
Publisher: release.yml on itsarbit/tokenwise
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: tokenwise_llm-0.4.5-py3-none-any.whl
- Subject digest: `fe05100eb4f84c8f7662623b868ae8298cd1b918bb6c1df1e5882f22aead4564`
- Sigstore transparency entry: 976935458
- Permalink: itsarbit/tokenwise@d1eba9d90e31056b53781f2b697295940b8e9e52
- Branch / Tag: refs/tags/v0.4.5
- Owner: https://github.com/itsarbit
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@d1eba9d90e31056b53781f2b697295940b8e9e52
- Trigger Event: push