TokenWise

Intelligent LLM task planner — decompose tasks, route to optimal models, enforce budgets.

Production-grade LLM routing with budget ceilings, tiered escalation, and multi-provider failover.


TokenWise is not just a model picker.

It is a lightweight control layer for LLM systems that need:

  • Strict budget enforcement — hard cost ceilings that fail fast, never silently overspend*
  • Capability-aware routing — routes and fallbacks filtered by what the task actually needs (code, reasoning, math)
  • Deterministic escalation — budget → mid → flagship, never downward
  • Task decomposition — break complex work into subtasks, each routed to the right model
  • Multi-provider failover — OpenRouter, OpenAI, Anthropic, Google — with a shared connection pool
  • An OpenAI-compatible proxy — drop-in replacement for any existing SDK

Modern LLM applications are production systems. Production systems need guardrails. TokenWise provides those guardrails.

Why TokenWise Exists

Most LLM routers do one thing: pick a model per request. That is not enough for real systems.

In production, you need a hard budget ceiling per task. You need tiered escalation that tries stronger models when weaker ones fail. You need provider failover. You need capability-aware routing that knows a coding task should not fall back to a model that cannot code. You need deterministic behavior you can reason about.

TokenWise treats routing as infrastructure — not a convenience feature.

Note: TokenWise uses OpenRouter as the default model gateway for model discovery and routing. You can also use direct provider APIs (OpenAI, Anthropic, Google) by setting the corresponding API keys — when a direct key is available, requests for that provider bypass OpenRouter automatically.

Comparison

Feature                    TokenWise  RouteLLM  LiteLLM  Not Diamond  Martian         Portkey       OpenRouter
Task decomposition         Yes        -         -        -            -               -             -
Strict budget ceiling      Yes        -         Yes      -            Per-request     Yes           Yes
Tier-based escalation      Yes        -         Yes      -            -               Yes           -
Capability-aware fallback  Yes        -         -        Partial      Yes             Partial       Partial
Cost ledger                Yes        -         Yes      -            -               Yes           Dashboard
OpenAI-compatible proxy    Yes        Yes       Yes      Yes          Yes             Yes           Yes
CLI                        Yes        -         Yes      -            -               -             -
Python API                 Yes        Yes       Yes      Yes          Via OpenAI SDK  Yes           Yes
Self-hosted / open source  Yes        Yes       Yes      -            -               Gateway only  -

What these terms mean in TokenWise's context:

  • Task decomposition — breaks a complex prompt into multiple LLM steps, each assigned to a different model. Not just model selection per request.
  • Strict budget ceiling — hard cap on total USD spend; execution stops rather than overshooting. Some tools offer per-request limits but not cross-step budgets.
  • Tier-based escalation — on failure, retries with a stronger-tier model (budget, mid, flagship), never downward.
  • Capability-aware fallback — fallback candidates are filtered by required capabilities (code, reasoning, math), not just price or tier.
  • Cost ledger — structured per-call log of model, tokens, cost, and success/failure — including failed attempts and escalations.

Note: some competitors may partially cover these features. The table reflects our understanding as of February 2026; corrections welcome via issues.


Core Features

Budget-Aware Routing

Enforce a strict maximum cost per request or workflow. If no model fits within the ceiling, TokenWise fails fast. No silent overspending.*

from tokenwise import Router

router = Router()
model = router.route(
    "Debug this segfault",
    strategy="best_quality",
    budget=0.05,
)
# Raises ValueError if nothing fits

The executor caps max_tokens per call using a 1.2x safety margin on input token estimates. Steps producing fewer than min_output_tokens (default 100) are skipped — configure via TOKENWISE_MIN_OUTPUT_TOKENS or Executor(min_output_tokens=N) for workflows that need tiny outputs under tight budgets.
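
For workflows that legitimately need very short outputs (labels, one-line verdicts), lower the floor so tight budgets don't skip those steps. A minimal sketch using the two knobs named above:

from tokenwise import Executor

# Allow steps that produce as few as 10 output tokens
executor = Executor(min_output_tokens=10)

# Equivalent via the environment:
#   export TOKENWISE_MIN_OUTPUT_TOKENS=10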

Tiered Escalation

Three model tiers: budget, mid, flagship.

If a model fails, TokenWise escalates strictly upward. It never downgrades. Escalation preserves required capabilities — a failed code model is replaced by a stronger code model, not a generic one.

Capability-Aware Selection

Routing considers capabilities: code, reasoning, math, general.

Fallback never selects a model that cannot perform the required task. Capabilities are tracked per step, not inferred at retry time.
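
Taken together with tiered escalation, the selection rule can be pictured as a single filter. An illustrative sketch only — TIER_ORDER and the model attributes are assumptions for illustration, not TokenWise internals:

TIER_ORDER = {"budget": 0, "mid": 1, "flagship": 2}

def escalation_candidates(models, failed_tier, required_caps):
    """Illustrative: strictly stronger tier, capabilities preserved."""
    return [
        m for m in models
        if TIER_ORDER[m.tier] > TIER_ORDER[failed_tier]
        and required_caps <= set(m.capabilities)
    ]

# e.g. escalation_candidates(models, "budget", {"code"}) never returns
# a same-tier model or one that cannot code.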

Task Decomposition

Break complex tasks into subtasks. Each step gets the right model at the right price.

from tokenwise import Planner

planner = Planner()
plan = planner.plan(
    "Build a REST API for a todo app",
    budget=0.50,
)
# 4 steps, each with the cheapest viable model

Cost Ledger

All LLM calls are recorded in a structured CostLedger, including failed attempts and escalations. See exactly where your money went.

Multi-Provider Failover

Supports OpenRouter, OpenAI, Anthropic, and Google. Direct API keys bypass OpenRouter automatically. The proxy shares a single httpx.AsyncClient across all providers for connection pooling.
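
Because keys are read from the environment, enabling a direct provider needs no code changes. A sketch of the bypass behavior described above (routing is unchanged; only the transport differs):

from tokenwise import Router

# Assuming both OPENROUTER_API_KEY and ANTHROPIC_API_KEY are set:
# requests resolved to an anthropic/* model call the Anthropic API
# directly, while all other models still go through OpenRouter.
router = Router()
model = router.route("Refactor this module", strategy="balanced", budget=0.10)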


Install

pip install tokenwise-llm

Quick Start

1. Set your API key

export OPENROUTER_API_KEY="sk-or-..."

2. Use it

CLI:

# Route a query
tokenwise route "Write a haiku about Python"

# Route with budget ceiling
tokenwise route "Debug this segfault" \
  --strategy best_quality --budget 0.05

# Plan and execute a complex task
tokenwise plan "Build a REST API for a todo app" \
  --budget 0.50 --execute

# View spend history
tokenwise ledger
tokenwise ledger --summary

# Start the OpenAI-compatible proxy
tokenwise serve --port 8000

# List models and pricing
tokenwise models

Python API:

from tokenwise import Router, Planner, Executor

# Route a single query
router = Router()
model = router.route(
    "Explain quantum computing",
    strategy="balanced",
    budget=0.10,
)
print(f"Use model: {model.id} "
      f"(${model.input_price}/M input tokens)")

# Plan a complex task
planner = Planner()
plan = planner.plan(
    task="Build a REST API for a todo app",
    budget=0.50,
)
print(f"Plan: {len(plan.steps)} steps, "
      f"estimated ${plan.total_estimated_cost:.4f}")

# Execute the plan — tracks spend, escalates on failure
executor = Executor()
result = executor.execute(plan)
print(f"Done! Cost: ${result.total_cost:.4f}, "
      f"success: {result.success}")

OpenAI-compatible proxy:

# In one terminal, start the proxy:
tokenwise serve --port 8000

# Then point any OpenAI-compatible SDK at it:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="unused",
)
response = client.chat.completions.create(
    model="auto",  # TokenWise picks the best model
    messages=[{"role": "user", "content": "Hello!"}],
)

Background reading: LLM Routers Are Not Enough — the blog post that motivated TokenWise's design.

Example

Plan a task, execute it, and inspect the cost ledger — all in two commands:

# 1. Plan and execute a task ($0.05 budget)
tokenwise plan "Write a Python function to validate \
  email addresses, then write unit tests for it" \
  --budget 0.05 --execute

# 2. View your spend history
tokenwise ledger --summary

Example output:

Plan for: Write a Python function to validate email addresses...
Budget: $0.05
Estimated cost: $0.0023

┌─────────────────────────────────────────────────────────────┐
│ #  Description              Model               Est. Cost   │
│ 1  Write validation func    openai/gpt-4.1-mini  $0.0009    │
│ 2  Write unit tests         openai/gpt-4.1-mini  $0.0014    │
└─────────────────────────────────────────────────────────────┘

Status: Success
Total cost: $0.0019
Budget remaining: $0.0481

How It Works

┌───────────────────────────────────────────────────────┐
│                       TokenWise                       │
│                                                       │
│  ┌────────────┐  ┌────────────┐  ┌────────────┐       │
│  │   Router   │  │  Planner   │  │  Executor  │       │
│  │            │  │            │  │            │       │
│  │  1. Detect │  │  Breaks    │  │  Runs the  │       │
│  │  scenario  │  │  task into │  │  plan,     │       │
│  │  2. Route  │  │  steps +   │  │  tracks    │       │
│  │  within    │  │  assigns   │  │  spend,    │       │
│  │  budget    │  │  models    │  │  retries   │       │
│  └─────┬──────┘  └─────┬──────┘  └─────┬──────┘       │
│        │               │               │              │
│        └───────────────┼───────────────┘              │
│                        ▼                              │
│          ┌──────────────────────────┐                 │
│          │    ProviderResolver      │  ← LLM calls    │
│          │                          │                 │
│          │  OpenAI    · Anthropic   │                 │
│          │  Google    · OpenRouter  │                 │
│          └──────────────────────────┘                 │
│                                                       │
│            ┌──────────────┐                           │
│            │   Registry   │  ← metadata + pricing     │
│            └──────────────┘                           │
└───────────────────────────────────────────────────────┘

Router uses a two-stage pipeline for every request:

            ┌───────────────────┐    ┌──────────────────┐
 query ───▶ │  1. Detect        │───▶│  2. Route        │───▶ model
            │     Scenario      │    │     w/ Strategy   │
            │                   │    │                   │
            │  · capabilities   │    │  · filter budget  │
            │    (code, reason, │    │  · cheapest /     │
            │     math)         │    │    balanced /     │
            │  · complexity     │    │    best_quality   │
            │    (simple→hard)  │    │                   │
            └───────────────────┘    └──────────────────┘

Router separates understanding what the query needs from choosing how to spend. Budget is a universal parameter — not a strategy. By default, the router enforces the budget as a hard ceiling: if no model fits, it raises an error instead of silently exceeding the limit.

Planner decomposes a complex task into subtasks using a cheap LLM, then assigns the optimal model to each step within your budget. If the plan exceeds budget, it automatically downgrades expensive steps.
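
A quick way to see the fitting behavior is to inspect the plan's estimate against the budget (only attributes shown elsewhere in this README are used; the task string is arbitrary):

from tokenwise import Planner

planner = Planner()
plan = planner.plan(
    task="Summarize 20 research papers into one report",
    budget=0.10,
)
# After downgrading, the estimate should sit at or under the budget
print(f"{len(plan.steps)} steps, estimated ${plan.total_estimated_cost:.4f}")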

Executor runs a plan step by step, tracks actual token usage and cost via a CostLedger, and escalates to a stronger model if a step fails. Escalation tries stronger tiers first (flagship before mid) and filters by the step's required capabilities.

Observability

Every execution produces a structured trace. Inspect which model was used, whether escalation occurred, and where each dollar went:

result = executor.execute(plan)

# Per-step: which model ran, whether it was escalated
for sr in result.step_results:
    print(f"Step {sr.step_id}: model={sr.model_id}, "
          f"cost=${sr.actual_cost:.4f}, "
          f"escalated={sr.escalated}")

# Cost ledger: every LLM call including failed attempts
for entry in result.ledger.entries:
    print(f"  {entry.reason}: {entry.model_id} "
          f"({entry.input_tokens}in/"
          f"{entry.output_tokens}out) "
          f"${entry.cost:.6f} "
          f"{'ok' if entry.success else 'FAIL'}")

# Aggregate
print(f"Total: ${result.total_cost:.4f}, "
      f"wasted: ${result.ledger.wasted_cost:.4f}, "
      f"remaining: ${result.budget_remaining:.4f}")

Example output when step 1 fails and escalates:

Step 1: model=openai/gpt-4.1, cost=$0.0018, escalated=True
  step 1 attempt 1: openai/gpt-4.1-mini (82in/0out) $0.000000 FAIL
  step 1 escalation attempt 1: openai/gpt-4.1 (82in/204out) $0.001800 ok
Total: $0.0018, wasted: $0.0000, remaining: $0.9982

Routing Strategies

Strategy      When to Use       How It Works
cheapest      Minimize cost     Picks the lowest-price capable model
best_quality  Maximize quality  Picks the best flagship-tier capable model
balanced      Default           Matches model tier to query complexity

All strategies enforce the budget as a hard ceiling. Pass budget_strict=False in the Python API to fall back to best-effort behavior.
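
A short example of both modes (assuming budget_strict is a keyword argument to route(), as the note above implies):

from tokenwise import Router

router = Router()

# Default: hard ceiling, raises ValueError if no model fits the budget
model = router.route("Summarize this paper", budget=0.002)

# Best-effort: returns the closest affordable choice instead of raising
model = router.route(
    "Summarize this paper",
    budget=0.002,
    budget_strict=False,
)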

Configuration

TokenWise reads configuration from environment variables and an optional config file (~/.config/tokenwise/config.yaml).

Variable                     Required  Description                   Default
OPENROUTER_API_KEY           Yes       OpenRouter API key            -
OPENAI_API_KEY               Optional  Direct OpenAI API key         -
ANTHROPIC_API_KEY            Optional  Direct Anthropic API key      -
GOOGLE_API_KEY               Optional  Direct Google AI API key      -
OPENROUTER_BASE_URL          Optional  OpenRouter base URL           https://openrouter.ai/api/v1
TOKENWISE_DEFAULT_STRATEGY   Optional  Routing strategy              balanced
TOKENWISE_DEFAULT_BUDGET     Optional  Budget in USD                 1.00
TOKENWISE_PLANNER_MODEL      Optional  Decomposition model           openai/gpt-4.1-mini
TOKENWISE_PROXY_HOST         Optional  Proxy bind host               127.0.0.1
TOKENWISE_PROXY_PORT         Optional  Proxy bind port               8000
TOKENWISE_CACHE_TTL          Optional  Registry cache TTL (s)        3600
TOKENWISE_LEDGER_PATH        Optional  Ledger JSONL path             ~/.config/tokenwise/ledger.jsonl
TOKENWISE_MIN_OUTPUT_TOKENS  Optional  Min output tokens per step    100
TOKENWISE_LOCAL_MODELS       Optional  Local models YAML             -

# ~/.config/tokenwise/config.yaml
default_strategy: balanced
default_budget: 0.50
planner_model: openai/gpt-4.1-mini

Architecture

src/tokenwise/
├── models.py        # Pydantic data models
├── config.py        # Settings from env vars and config file
├── registry.py      # ModelRegistry — fetches/caches models
├── router.py        # Two-stage pipeline: scenario → strategy
├── planner.py       # Decomposes tasks, assigns models
├── executor.py      # Runs plans, tracks spend, escalates
├── ledger_store.py  # Persistent JSONL spend history
├── cli.py           # Typer CLI
├── proxy.py         # FastAPI OpenAI-compatible proxy
├── providers/       # LLM provider adapters
│   ├── openrouter.py
│   ├── openai.py
│   ├── anthropic.py
│   ├── google.py
│   └── resolver.py  # Maps model IDs → provider instances
└── data/
    └── model_capabilities.json

Philosophy

LLM systems should be treated like distributed systems.

That means clear failure semantics, explicit cost ceilings, predictable escalation, and observability. TokenWise is designed with that philosophy.

Benchmarks

benchmarks/pareto.py runs 5 tasks across models at different price tiers and reports cost vs success rate. Single command to reproduce (outputs benchmarks/results.csv and benchmarks/pareto.png):

uv sync --group benchmark && uv run python benchmarks/pareto.py \
  --models openai/gpt-4.1-nano deepseek/deepseek-chat \
    openai/gpt-4.1-mini google/gemini-2.5-flash \
    openai/gpt-4.1 anthropic/claude-sonnet-4 \
    google/gemini-2.5-pro anthropic/claude-opus-4.6 \
    openai/o4-mini google/gemini-3.1-pro-preview \
  --csv benchmarks/results.csv --output benchmarks/pareto.png

Sample results (February 2026, 5 simple tasks per model):

Model                          Tier      Success  Avg Cost / Task
openai/gpt-4.1-nano            budget    100%     $0.000059
deepseek/deepseek-chat         budget    100%     $0.000174
openai/gpt-4.1-mini            budget    100%     $0.000238
google/gemini-2.5-flash        budget    100%     $0.000498
openai/o4-mini                 mid       100%     $0.001137
openai/gpt-4.1                 mid       100%     $0.001201
anthropic/claude-sonnet-4      mid       100%     $0.002681
google/gemini-2.5-pro          mid       100%     $0.002913
google/gemini-3.1-pro-preview  flagship  100%     $0.003490
anthropic/claude-opus-4.6      flagship  100%     $0.005029
All models pass these simple tasks, so the differentiation here is purely cost: a ~85x spread between the cheapest and the most expensive. Harder tasks (multi-step reasoning, long-context coding) are where quality differences should appear.

Known Limitations (v0.4)

All three v0.3 limitations have been resolved:

  • Planner cost not budgeted — planner LLM cost is now tracked and deducted from budget (v0.4)
  • Linear execution — independent steps now run in parallel via async DAG scheduling (v0.4)
  • No persistent spend tracking — execution history is persisted to JSONL; see tokenwise ledger (v0.4)

Note on execute() inside async contexts: If you call executor.execute(plan) from inside an existing event loop (Jupyter, FastAPI, etc.), it automatically falls back to sequential step execution. For concurrent DAG scheduling in async code, use await executor.aexecute(plan) directly.
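
A minimal sketch of the async path, using the aexecute method named above:

import asyncio
from tokenwise import Planner, Executor

async def main():
    plan = Planner().plan(task="Build a REST API for a todo app", budget=0.50)
    executor = Executor()
    # Concurrent DAG scheduling, safe inside an existing event loop
    result = await executor.aexecute(plan)
    print(f"cost=${result.total_cost:.4f}, success={result.success}")

asyncio.run(main())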

* Budget accuracy note: TokenWise enforces budget ceilings by capping max_tokens before each LLM call. Input token counts are estimated using a chars / 4 heuristic with a 1.2x safety margin — not a tokenizer. This means actual input cost may differ slightly from the estimate. The budget ceiling is real and enforced, but small overruns are possible when the heuristic underestimates input tokens. A future release will support pluggable tokenizer-based estimation for stricter guarantees.
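
The heuristic is simple enough to reason about by hand. An illustrative reconstruction (estimate_input_tokens is a hypothetical name for this sketch, not a TokenWise API):

def estimate_input_tokens(prompt: str) -> int:
    # chars / 4 approximation, padded by the 1.2x safety margin
    return int(len(prompt) / 4 * 1.2)

# A 2,000-character prompt is budgeted as about 600 input tokens,
# where a real tokenizer might count closer to 500.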

Development

git clone https://github.com/itsarbit/tokenwise.git
cd tokenwise
uv sync
uv run pytest
uv run ruff check src/ tests/
uv run mypy src/

License

MIT
