TokenWise

Intelligent LLM task planner — decompose tasks, route to optimal models, enforce budgets

Production-grade LLM routing with budget ceilings, tiered escalation, and multi-provider failover.

30-Second Demo

pip install tokenwise-llm

# Route a query to the best model within budget
tokenwise route "Debug this segfault" --strategy best_quality --budget 0.05

# Decompose a task, execute it, track spend
tokenwise plan "Write a Python function to validate email addresses, \
  then write unit tests for it" --budget 0.05 --execute

Example output:

Plan: 4 steps | Budget: $0.05 | Estimated: $0.0002
Step 2 failed (nano) → escalated to mid

Status: Success | Total cost: $0.0007 | Budget remaining: $0.0493

Quick Start

Set your API key

export OPENROUTER_API_KEY="sk-or-..."

TokenWise uses OpenRouter as the default gateway. Set OPENAI_API_KEY, ANTHROPIC_API_KEY, or GOOGLE_API_KEY to bypass OpenRouter for those providers.

Route a query

tokenwise route "Write a haiku about Python"

Route with budget ceiling

tokenwise route "Debug this segfault" --strategy best_quality --budget 0.05

Plan and execute

tokenwise plan "Build a REST API" --budget 0.50 --execute

Inspect spend

tokenwise ledger --summary

Routing strategies

Strategy      When to Use       How It Works
cheapest      Minimize cost     Lowest-price capable model
best_quality  Maximize quality  Best flagship-tier capable model
balanced      Default           Matches model tier to query complexity

Budget is a universal parameter across all strategies. Pass budget_strict=False to fall back to best-effort routing when no model fits the ceiling.
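To make the strict vs. best-effort semantics concrete, here is a minimal sketch in plain Python. The model table, prices, and pick helper are illustrative stand-ins, not TokenWise internals:

```python
# Sketch of strict vs. best-effort budget filtering.
# MODELS, est_cost values, and pick() are illustrative, not TokenWise internals.
MODELS = [
    {"id": "nano",     "est_cost": 0.0002},
    {"id": "mid",      "est_cost": 0.004},
    {"id": "flagship", "est_cost": 0.009},
]

def pick(budget: float, budget_strict: bool = True) -> str:
    affordable = [m for m in MODELS if m["est_cost"] <= budget]
    if affordable:
        # Here: cheapest affordable model; real strategies differ.
        return min(affordable, key=lambda m: m["est_cost"])["id"]
    if budget_strict:
        raise ValueError("no model fits the budget ceiling")
    # Best-effort fallback: cheapest model overall, ceiling be damned.
    return min(MODELS, key=lambda m: m["est_cost"])["id"]
```

With budget_strict=True an unaffordable request fails loudly; with budget_strict=False it degrades to the cheapest option instead.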

Python API

from tokenwise import Router, Planner, Executor

# Route a single query
router = Router()
model = router.route("Explain quantum computing", strategy="balanced", budget=0.10)
print(f"{model.id} (${model.input_price}/M input)")

# Plan and execute a complex task
planner = Planner()
plan = planner.plan(task="Build a REST API for a todo app", budget=0.50)

executor = Executor()
result = executor.execute(plan)
print(f"Cost: ${result.total_cost:.4f}, success: {result.success}")

OpenAI-compatible proxy

tokenwise serve --port 8000

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
response = client.chat.completions.create(
    model="auto",  # TokenWise picks the best model
    messages=[{"role": "user", "content": "Hello!"}],
)

Core Features

  • Budget-aware routing — strict cost ceilings enforced via max_tokens caps with 1.2x safety margin.
  • Tiered escalation — budget, mid, flagship; escalates upward on failure, never downward.
  • Capability-aware fallback — routes and fallbacks filtered by code, reasoning, math, or general.
  • Task decomposition — LLM-powered planning with per-step model assignment and async DAG scheduling.
  • Cost ledger — structured per-call accounting including failures and retries, persisted to JSONL.
  • Multi-provider failover — OpenRouter, OpenAI, Anthropic, and Google with connection pooling.
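The upward-only escalation policy can be sketched in a few lines. The tier names come from the list above; the attempt callback and the simulated failure are illustrative stand-ins for real provider calls:

```python
# Sketch of upward-only tier escalation: on failure, retry on the next
# stronger tier; never downgrade. attempt() stands in for a provider call.
TIERS = ["budget", "mid", "flagship"]

def run_with_escalation(task, attempt, start_tier="budget"):
    """Try each tier from start_tier upward until one succeeds."""
    for tier in TIERS[TIERS.index(start_tier):]:
        ok, output = attempt(task, tier)
        if ok:
            return tier, output
    raise RuntimeError("all tiers failed")

def flaky(task, tier):
    # Simulated provider: the budget tier fails, stronger tiers succeed.
    return (tier != "budget", f"{tier} answered: {task}")

tier, out = run_with_escalation("hard reasoning task", flaky)
```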

Benchmarks

Cost-Quality Frontier

TokenWise escalation achieves flagship-level reliability at ~9x lower cost than flagship-only execution.

benchmarks/strategy_pareto.py runs 20 tasks (simple, reasoning, coding, hard) across four strategies and validates answer correctness. Single command to reproduce:

uv sync --group benchmark && uv run python benchmarks/strategy_pareto.py

Results (February 2026, 20 tasks per strategy):

Strategy                         Success  Avg Cost / Task
Budget Only (gpt-4.1-nano)       90%      $0.000183
Mid Only (gpt-4.1)               90%      $0.003573
Flagship Only (claude-sonnet-4)  100%     $0.009354
TokenWise Escalation             100%     $0.001009

Budget-only is cheapest but gets reasoning tasks wrong. Flagship is strongest but 51x more expensive than budget-only. Escalation starts cheap, detects failures, and upgrades — achieving 100% success at roughly a ninth of the flagship cost. This matters when deploying multi-step LLM workflows in production, where cost compounds across retries and task decomposition.
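The cost multiples quoted above follow directly from the table:

```python
# Average cost per task, from the February 2026 benchmark table above.
flagship   = 0.009354  # Flagship Only (claude-sonnet-4)
budget     = 0.000183  # Budget Only (gpt-4.1-nano)
escalation = 0.001009  # TokenWise Escalation

flagship_vs_budget     = flagship / budget      # ~51x
flagship_vs_escalation = flagship / escalation  # ~9.3x
```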

Comparison

Most routing tools optimize per-request model choice. TokenWise treats routing as a workflow-level control system.

High-level comparison (as of February 2026). Corrections welcome via issues.

Feature                    TokenWise  RouteLLM  LiteLLM  Not Diamond  Martian      Portkey       OpenRouter
Task decomposition         Yes        -         -        -            -            -             -
Strict budget ceiling      Yes        -         Yes      -            Per-request  Yes           Yes
Tier-based escalation      Yes        -         Yes      -            -            Yes           -
Capability-aware fallback  Yes        -         -        Partial      Yes          Partial       Partial
Cost ledger                Yes        -         Yes      -            -            Yes           Dashboard
OpenAI-compatible proxy    Yes        Yes       Yes      Yes          Yes          Yes           Yes
CLI                        Yes        -         Yes      -            -            -             -
Self-hosted / open source  Yes        Yes       Yes      -            -            Gateway only  -

How It Works

Architecture

┌───────────────────────────────────────────────────────┐
│                       TokenWise                       │
│                                                       │
│  ┌────────────┐  ┌────────────┐  ┌────────────┐       │
│  │   Router   │  │  Planner   │  │  Executor  │       │
│  │            │  │            │  │            │       │
│  │  1. Detect │  │  Breaks    │  │  Runs the  │       │
│  │  scenario  │  │  task into │  │  plan,     │       │
│  │  2. Route  │  │  steps +   │  │  tracks    │       │
│  │  within    │  │  assigns   │  │  spend,    │       │
│  │  budget    │  │  models    │  │  retries   │       │
│  └─────┬──────┘  └─────┬──────┘  └─────┬──────┘       │
│        │               │               │              │
│        └───────────────┼───────────────┘              │
│                        ▼                              │
│          ┌──────────────────────────┐                 │
│          │    ProviderResolver      │  ← LLM calls    │
│          │                          │                 │
│          │  OpenAI    · Anthropic   │                 │
│          │  Google    · OpenRouter  │                 │
│          └──────────────────────────┘                 │
│                                                       │
│            ┌──────────────┐                           │
│            │   Registry   │  ← metadata + pricing     │
│            └──────────────┘                           │
└───────────────────────────────────────────────────────┘

Router pipeline

The router uses a two-stage pipeline: detect (capabilities + complexity) then route (filter by budget, apply strategy: cheapest / balanced / best_quality).
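The two stages can be sketched in plain Python. The registry, detection heuristics, and prices below are illustrative stand-ins, not TokenWise's actual logic:

```python
# Illustrative two-stage pipeline: detect, then route.
# REGISTRY, the keyword heuristics, and est_cost are stand-ins.
REGISTRY = [
    {"id": "nano",     "tier": "budget",   "caps": {"general"},                       "est_cost": 0.0002},
    {"id": "mid",      "tier": "mid",      "caps": {"general", "code"},               "est_cost": 0.004},
    {"id": "flagship", "tier": "flagship", "caps": {"general", "code", "reasoning"},  "est_cost": 0.009},
]

def detect(query: str) -> dict:
    # Stage 1: crude capability + complexity detection.
    caps = {"general"}
    if any(w in query.lower() for w in ("bug", "code", "function")):
        caps.add("code")
    complexity = "hard" if len(query) > 120 else "simple"
    return {"caps": caps, "complexity": complexity}

def route(query: str, strategy: str, budget: float) -> str:
    # Stage 2: filter by capability and budget, then apply the strategy.
    need = detect(query)
    pool = [m for m in REGISTRY
            if need["caps"] <= m["caps"] and m["est_cost"] <= budget]
    if not pool:
        raise ValueError("no capable model within budget")
    if strategy == "cheapest":
        return min(pool, key=lambda m: m["est_cost"])["id"]
    if strategy == "best_quality":
        return max(pool, key=lambda m: m["est_cost"])["id"]
    # balanced: simple queries stay on the cheapest capable model.
    key = min if need["complexity"] == "simple" else max
    return key(pool, key=lambda m: m["est_cost"])["id"]
```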

Planner and Executor

Planner decomposes a task into subtasks using a cheap LLM, assigns the optimal model to each step within budget, and auto-downgrades expensive steps if over budget.

Executor runs the plan via async DAG scheduling, tracks actual cost via CostLedger, and escalates to stronger models on failure (flagship before mid, filtered by capability).
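Level-by-level DAG scheduling of independent steps can be sketched with stdlib asyncio. This is a simplification for illustration, not TokenWise's scheduler:

```python
import asyncio

# Illustrative DAG scheduling: run every ready step concurrently,
# then unlock its dependents, until the plan is done.
async def run_dag(steps: dict, deps: dict) -> list:
    """steps: id -> coroutine factory; deps: id -> set of prerequisite ids."""
    done, order = set(), []
    while len(done) < len(steps):
        ready = [s for s in steps if s not in done and deps.get(s, set()) <= done]
        if not ready:
            raise ValueError("cycle in plan")
        await asyncio.gather(*(steps[s]() for s in ready))  # concurrent level
        done.update(ready)
        order.append(sorted(ready))
    return order

async def noop():
    await asyncio.sleep(0)

order = asyncio.run(run_dag(
    {"a": noop, "b": noop, "c": noop},
    {"c": {"a", "b"}},   # c waits for both a and b
))
```

Independent steps a and b run in the same batch; c runs only after both complete.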

If executor.execute(plan) is called inside an existing event loop (Jupyter, FastAPI), it falls back to sequential execution. Use await executor.aexecute(plan) directly for concurrent DAG scheduling in async code.
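The loop detection behind that fallback can be sketched with stdlib asyncio; in_event_loop is a hypothetical helper for illustration, not part of the TokenWise API:

```python
import asyncio

# Sketch of the sync/async fallback check: a running event loop means we
# must not call asyncio.run() and should use the async path instead.
def in_event_loop() -> bool:
    try:
        asyncio.get_running_loop()
        return True
    except RuntimeError:
        return False

async def _probe() -> bool:
    return in_event_loop()

inside = asyncio.run(_probe())   # True: checked from within a running loop
outside = in_event_loop()        # False: no loop at top level
```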

Observability

Every execution produces a structured trace:

result = executor.execute(plan)

for sr in result.step_results:
    print(f"Step {sr.step_id}: model={sr.model_id}, "
          f"cost=${sr.actual_cost:.4f}, escalated={sr.escalated}")

for entry in result.ledger.entries:
    print(f"  {entry.reason}: {entry.model_id} "
          f"({entry.input_tokens}in/{entry.output_tokens}out) "
          f"${entry.cost:.6f} {'ok' if entry.success else 'FAIL'}")

print(f"Total: ${result.total_cost:.4f}, "
      f"wasted: ${result.ledger.wasted_cost:.4f}, "
      f"remaining: ${result.budget_remaining:.4f}")

Example output when step 1 fails and escalates:

Step 1: model=openai/gpt-4.1, cost=$0.0052, escalated=True
  step 1 attempt 1: openai/gpt-4.1-mini (82in/0out) $0.000000 FAIL
  step 1 escalation attempt 1: openai/gpt-4.1 (82in/204out) $0.001800 ok
Total: $0.0052, wasted: $0.0000, remaining: $0.9948

Budget Semantics

TokenWise enforces budget ceilings by capping max_tokens before each LLM call. Input token counts are estimated using a chars / 4 heuristic with a 1.2x safety margin — not a tokenizer. The budget ceiling is real and enforced, but small overruns are possible when the heuristic underestimates input tokens. A future release will support pluggable tokenizer-based estimation for stricter guarantees.
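The cap computation can be sketched as follows. The formula is a reconstruction from the stated heuristic (chars / 4 input estimate, 1.2x safety margin, per-million-token prices), not TokenWise's exact code:

```python
# Sketch of deriving a max_tokens cap from a budget ceiling.
# Prices are USD per million tokens; the formula is a reconstruction.
def max_output_tokens(prompt: str, budget: float,
                      in_price: float, out_price: float,
                      margin: float = 1.2) -> int:
    est_input = (len(prompt) / 4) * margin            # chars/4 heuristic + margin
    input_cost = est_input * in_price / 1_000_000
    remaining = budget - input_cost
    if remaining <= 0:
        return 0                                      # nothing left for output
    return int(remaining / (out_price / 1_000_000))   # cap output spend

cap = max_output_tokens("Explain quantum computing", budget=0.01,
                        in_price=2.0, out_price=8.0)
```

Because the input side is a heuristic, a prompt whose true token count exceeds the estimate can still push actual spend slightly over the ceiling — exactly the overrun case described above.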

Known Limitations (v0.4)

All three limitations noted in the v0.3 release have been resolved in v0.4:

  • Planner cost not budgeted — tracked and deducted (v0.4)
  • Linear execution — parallel DAG scheduling (v0.4)
  • No persistent spend tracking — JSONL ledger (v0.4)
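The JSONL ledger format can be sketched as a one-JSON-object-per-line file. The entry fields here follow the trace fields shown in the Observability section (model_id, cost, success); the exact schema is TokenWise-internal:

```python
import io
import json

# Illustrative JSONL ledger round-trip: append one JSON object per line,
# then stream the file back to total spend. Field names are assumptions.
entries = [
    {"model_id": "openai/gpt-4.1-mini", "cost": 0.0,    "success": False},
    {"model_id": "openai/gpt-4.1",      "cost": 0.0018, "success": True},
]

buf = io.StringIO()            # stands in for ~/.config/tokenwise/ledger.jsonl
for e in entries:
    buf.write(json.dumps(e) + "\n")

buf.seek(0)
total = sum(json.loads(line)["cost"] for line in buf)
```

Append-only JSONL keeps writes cheap and makes the ledger trivially greppable and streamable, even across failed or retried calls.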

Configuration

TokenWise reads configuration from environment variables and an optional config file (~/.config/tokenwise/config.yaml).

Variable                     Required  Description                 Default
OPENROUTER_API_KEY           Yes       OpenRouter API key
OPENAI_API_KEY               Optional  Direct OpenAI API key
ANTHROPIC_API_KEY            Optional  Direct Anthropic API key
GOOGLE_API_KEY               Optional  Direct Google AI API key
OPENROUTER_BASE_URL          Optional  OpenRouter base URL         https://openrouter.ai/api/v1
TOKENWISE_DEFAULT_STRATEGY   Optional  Routing strategy            balanced
TOKENWISE_DEFAULT_BUDGET     Optional  Budget in USD               1.00
TOKENWISE_PLANNER_MODEL      Optional  Decomposition model         openai/gpt-4.1-mini
TOKENWISE_PROXY_HOST         Optional  Proxy bind host             127.0.0.1
TOKENWISE_PROXY_PORT         Optional  Proxy bind port             8000
TOKENWISE_CACHE_TTL          Optional  Registry cache TTL (s)      3600
TOKENWISE_LEDGER_PATH        Optional  Ledger JSONL path           ~/.config/tokenwise/ledger.jsonl
TOKENWISE_MIN_OUTPUT_TOKENS  Optional  Min output tokens per step  100
TOKENWISE_LOCAL_MODELS       Optional  Local models YAML

# ~/.config/tokenwise/config.yaml
default_strategy: balanced
default_budget: 0.50
planner_model: openai/gpt-4.1-mini

Development

git clone https://github.com/itsarbit/tokenwise.git
cd tokenwise
uv sync
uv run pytest
uv run ruff check src/ tests/
uv run mypy src/
Project layout:

src/tokenwise/
├── models.py        # Pydantic data models
├── config.py        # Settings from env vars and config file
├── registry.py      # ModelRegistry — fetches/caches models
├── router.py        # Two-stage pipeline: scenario → strategy
├── planner.py       # Decomposes tasks, assigns models
├── executor.py      # Runs plans, tracks spend, escalates
├── ledger_store.py  # Persistent JSONL spend history
├── cli.py           # Typer CLI
├── proxy.py         # FastAPI OpenAI-compatible proxy
├── providers/       # LLM provider adapters
│   ├── openrouter.py
│   ├── openai.py
│   ├── anthropic.py
│   ├── google.py
│   └── resolver.py  # Maps model IDs → provider instances
└── data/
    └── model_capabilities.json

Philosophy

LLM systems should be treated like distributed systems. That means clear failure semantics, explicit cost ceilings, predictable escalation, and observability. TokenWise is designed with that philosophy.

Background reading: LLM Routers Are Not Enough — the blog post that motivated TokenWise's design.

License

MIT
