
ThinkRouter


Pre-inference query routing for LLM reasoning models.
Cut thinking-token costs by 60% with one line of code.


The problem

Reasoning models (o1, DeepSeek-R1, Claude with extended thinking) apply the same 8,000-token compute budget to every query, whether it is simple arithmetic or a complex proof.

"What is 2 + 3?"                   →  8,000 thinking tokens   ← 99% wasted
"Prove that sqrt(2) is irrational"  →  8,000 thinking tokens   ← correctly used

At 100,000 queries per day, that is $192,635/month in avoidable spend.


The solution

from thinkrouter import ThinkRouter

client   = ThinkRouter(provider="openai")
response = client.chat("What is the capital of France?")
# Routed to NO_THINK → 50 tokens used, not 8,000

client.usage.print_dashboard()
  ThinkRouter — Usage Dashboard
  ──────────────────────────────────────────────
  Total calls          : 13
  Tokens saved         : 55,650
  Compute savings      : 53.5%
  Avg classifier time  : 0.02 ms

  Routing breakdown:
    no_think        :      7  (53.8%)  — Direct answer
    short_think     :      0  ( 0.0%)  — Moderate reasoning
    full_think      :      6  (46.2%)  — Full extended reasoning
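
Those figures follow directly from the routing breakdown: 7 calls at 50 tokens plus 6 at 8,000 tokens is 48,350 tokens spent, against a 13 × 8,000 = 104,000-token baseline, i.e. 55,650 tokens saved (53.5%).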

How it works

ThinkRouter intercepts each query, runs a lightweight classifier in under 1ms, and routes to the minimum compute budget:

Tier        Budget         Use case
NO_THINK    50 tokens      Arithmetic, definitions, lookups, translations
SHORT       800 tokens     Multi-step reasoning, moderate chaining
FULL        8,000 tokens   Proofs, system design, algorithm implementation
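
The routing internals are not shown here, but as a rough sketch of the idea (not ThinkRouter's actual code), a tier could translate into provider request parameters along these lines. The Tier enum and anthropic_params_for helper below are stand-ins for illustration; the thinking field follows Anthropic's public extended-thinking API, which currently requires a budget of at least 1,024 tokens.

from enum import Enum

class Tier(Enum):
    NO_THINK = 0
    SHORT = 800
    FULL = 8000

def anthropic_params_for(tier: Tier) -> dict:
    # Illustrative mapping only, not ThinkRouter internals.
    if tier is Tier.NO_THINK:
        # Trivial queries: skip extended thinking entirely.
        return {"max_tokens": 1024}
    # Anthropic requires budget_tokens >= 1024 and max_tokens > budget_tokens,
    # so small budgets are rounded up in this sketch.
    budget = max(tier.value, 1024)
    return {
        "max_tokens": budget + 1024,
        "thinking": {"type": "enabled", "budget_tokens": budget},
    }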

Installation

# Base install — works immediately, zero ML dependencies
pip install thinkrouter

# With fine-tuned DistilBERT classifier (higher accuracy)
pip install thinkrouter[classifier]

# With OpenAI client
pip install thinkrouter[openai]

# With Anthropic client
pip install thinkrouter[anthropic]

# Everything
pip install thinkrouter[all]

Quick start

Try it now in Colab (no API key needed).

OpenAI

from thinkrouter import ThinkRouter

client = ThinkRouter(
    provider="openai",
    api_key="sk-...",      # or set OPENAI_API_KEY
    model="gpt-4o",
    verbose=True,
)

response = client.chat("Explain how merge sort works.")
print(response.content)
print(response.routing)
# ClassifierResult(tier=FULL, confidence=0.87, budget=8000 tokens, latency=1.2ms)

client.usage.print_dashboard()

Anthropic

client = ThinkRouter(
    provider="anthropic",
    api_key="sk-ant-...",  # or set ANTHROPIC_API_KEY
    model="claude-haiku-4-5-20251001",
)

response = client.chat("What is 144 divided by 12?")
# Routed to NO_THINK → 50 tokens, not 8,000

Streaming

for chunk in client.stream("Explain quantum entanglement step by step."):
    print(chunk, end="", flush=True)

Classify without an API call

results = client.classify_batch([
    "What is 7 * 8?",
    "Design a distributed caching system.",
    "How many days are in a leap year?",
])

for r in results:
    print(f"{r.tier.name:<12}  budget={r.token_budget:>6} tokens  conf={r.confidence:.2f}")
NO_THINK      budget=    50 tokens  conf=0.88
FULL          budget=  8000 tokens  conf=0.85
NO_THINK      budget=    50 tokens  conf=0.80

Cost savings at scale

Volume                   Savings/day   Savings/month
10,000 queries/day       $642          $19,263
100,000 queries/day      $6,421        $192,635
1,000,000 queries/day    $64,212       $1,926,346

Based on a 53.5% savings rate and $15 per million reasoning tokens (approximate o1 rate).
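
A back-of-the-envelope reproduction of the table, assuming every query would otherwise spend the full 8,000-token thinking budget and a 30-day month:

# Reproduce the estimate: full 8,000-token baseline per query,
# $15 per million reasoning tokens, 53.5% savings rate.
BASELINE_TOKENS = 8_000
PRICE_PER_TOKEN = 15 / 1_000_000
SAVINGS_RATE = 0.535

for queries_per_day in (10_000, 100_000, 1_000_000):
    daily = queries_per_day * BASELINE_TOKENS * PRICE_PER_TOKEN * SAVINGS_RATE
    print(f"{queries_per_day:>9,} queries/day -> ${daily:,.0f}/day, ${daily * 30:,.0f}/month")
# ~$642, ~$6,420 and ~$64,200 per day, matching the table to within rounding.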


Classifier backends

Heuristic (default)

Zero dependencies. Regex patterns and word-count heuristics. Runs in under 1ms.

client = ThinkRouter(classifier_backend="heuristic")
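
As an illustration of the approach (not the library's actual rules), a regex-and-word-count heuristic for the three tiers could look roughly like this:

import re

# Illustrative heuristic only: trivial lookups and arithmetic go to NO_THINK,
# proof- or design-like prompts go to FULL, everything else to SHORT.
TRIVIAL = re.compile(r"^\s*(what is|define|translate)\b|\b\d+\s*[-+*/]\s*\d+", re.I)
HEAVY = re.compile(r"\b(prove|design|implement|optimi[sz]e|derive)\b", re.I)

def classify(query: str) -> str:
    words = len(query.split())
    if HEAVY.search(query) or words > 60:
        return "FULL"
    if TRIVIAL.search(query) and words <= 15:
        return "NO_THINK"
    return "SHORT"

print(classify("What is 7 * 8?"))                        # NO_THINK
print(classify("Design a distributed caching system."))  # FULL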

DistilBERT

Fine-tuned on GSM8K. Achieves 93%+ quality retention at 60% compute savings.
Requires pip install thinkrouter[classifier].

client = ThinkRouter(
    classifier_backend="distilbert",
    confidence_threshold=0.75,
)
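
The backend is presumably a standard three-label sequence-classification model; a generic sketch with Hugging Face transformers follows (the checkpoint name is a placeholder, not a published ThinkRouter model):

from transformers import pipeline

# Placeholder checkpoint name, shown only to illustrate the setup.
clf = pipeline("text-classification", model="your-org/distilbert-think-tier")

result = clf("Prove that sqrt(2) is irrational")[0]
print(result["label"], round(result["score"], 2))  # e.g. FULL 0.91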

Confidence threshold

Threshold   Savings   Quality retained   Use case
0.65        ~59%      ~91%               High cost sensitivity
0.75        ~55%      ~93%               Recommended
0.85        ~44%      ~96%               Quality-sensitive

Queries classified below the threshold fall back to FULL, so low-confidence routing never degrades output quality.
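
A minimal sketch of that fallback logic (not the library's code):

def effective_tier(predicted_tier: str, confidence: float, threshold: float = 0.75) -> str:
    # Conservative fallback: an uncertain classification never gets less
    # than the full reasoning budget.
    return predicted_tier if confidence >= threshold else "FULL"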


API reference

ThinkRouter

ThinkRouter(
    provider             = "openai",      # "openai" | "anthropic" | "generic"
    api_key              = None,          # falls back to OPENAI_API_KEY / ANTHROPIC_API_KEY
    model                = None,          # default model for all calls
    classifier_backend   = "heuristic",   # "heuristic" | "distilbert"
    confidence_threshold = 0.75,
    max_records          = 10_000,
    verbose              = False,
)

RouterResponse

response.content       # str — generated text
response.routing       # ClassifierResult
response.provider      # "openai" | "anthropic"
response.model         # model identifier
response.usage_tokens  # {"prompt_tokens": N, "completion_tokens": M, ...}

ClassifierResult

result.tier          # Tier.NO_THINK | Tier.SHORT | Tier.FULL
result.confidence    # float in [0, 1]
result.token_budget  # int — thinking tokens assigned
result.latency_ms    # classifier wall-clock time in ms
result.backend       # "heuristic" | "distilbert:cuda" | "distilbert:cpu"

Running tests

git clone https://github.com/saikoushiknalubola/thinkrouter.git
cd thinkrouter
pip install -e ".[dev]"
pytest tests/ -v

Roadmap

  • Heuristic classifier
  • OpenAI and Anthropic adapters
  • Streaming support
  • Thread-safe usage dashboard
  • GitHub Actions CI (Python 3.9–3.12)
  • DistilBERT model on HuggingFace Hub
  • Multi-domain training (MMLU, HumanEval, ARC-Challenge)
  • Async support (achat(), astream())
  • Continuous budget regression
  • Hosted API proxy (api.thinkrouter.ai)

Research basis

  • Zhao et al. (2025). SelfBudgeter. arXiv:2505.11274 — 74.47% savings validated
  • Wang et al. (2025). TALE-EP. ACL Findings 2025 — 67% output token reduction
  • Sanh et al. (2019). DistilBERT. arXiv:1910.01108
  • Cobbe et al. (2021). GSM8K. arXiv:2110.14168

Contributing

See CONTRIBUTING.md. Issues and pull requests welcome.


License

MIT — see LICENSE.
