ThinkRouter
Pre-inference query routing for LLM reasoning models.
Cut thinking-token costs by 60% with one line of code.
The problem
Reasoning models (o1, DeepSeek-R1, Claude thinking) apply the same 8,000-token compute budget to every query — whether it is simple arithmetic or a complex proof.
"What is 2 + 3?" → 8,000 thinking tokens ← 99% wasted
"Prove that sqrt(2) is irrational" → 8,000 thinking tokens ← correctly used
At 100,000 queries per day, that is $192,635/month in avoidable spend.
The solution
```python
from thinkrouter import ThinkRouter

client = ThinkRouter(provider="openai")
response = client.chat("What is the capital of France?")
# Routed to NO_THINK → 50 tokens used, not 8,000

client.usage.print_dashboard()
```

```
ThinkRouter — Usage Dashboard
──────────────────────────────────────────────
Total calls         : 13
Tokens saved        : 55,650
Compute savings     : 53.5%
Avg classifier time : 0.02 ms

Routing breakdown:
  no_think    : 7 (53.8%) — Direct answer
  short_think : 0 ( 0.0%) — Moderate reasoning
  full_think  : 6 (46.2%) — Full extended reasoning
```
How it works
ThinkRouter intercepts each query, runs a lightweight classifier in under 1ms, and routes to the minimum compute budget:
| Tier | Budget | Use case |
|---|---|---|
| NO_THINK | 50 tokens | Arithmetic, definitions, lookups, translations |
| SHORT | 800 tokens | Multi-step reasoning, moderate chaining |
| FULL | 8,000 tokens | Proofs, system design, algorithm implementation |
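The routing step above can be sketched as a toy heuristic classifier. The regex patterns, keyword list, and word-count cutoff below are illustrative assumptions, not ThinkRouter's actual rules:

```python
import re

# Toy heuristic in the spirit of the default backend; the patterns and
# thresholds shipped with ThinkRouter are different.
NO_THINK_PATTERNS = [
    r"^\s*what is [\d\s+*/.\-]+\??\s*$",              # plain arithmetic
    r"^\s*(define|translate|what is the capital of)\b",
]
FULL_PATTERNS = [
    r"\b(prove|design|implement|derive)\b",
]

def classify(query: str) -> str:
    q = query.lower()
    if any(re.search(p, q) for p in NO_THINK_PATTERNS):
        return "NO_THINK"   # 50-token budget
    if any(re.search(p, q) for p in FULL_PATTERNS):
        return "FULL"       # 8,000-token budget
    # Everything else: query length as a crude proxy for reasoning depth.
    return "SHORT" if len(q.split()) < 25 else "FULL"
```

Because the whole decision is a handful of regex scans, sub-millisecond latency is plausible even in pure Python.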
Installation
```shell
# Base install — works immediately, zero ML dependencies
pip install thinkrouter

# With fine-tuned DistilBERT classifier (higher accuracy)
pip install thinkrouter[classifier]

# With OpenAI client
pip install thinkrouter[openai]

# With Anthropic client
pip install thinkrouter[anthropic]

# Everything
pip install thinkrouter[all]
```
Quick start
Try it now — classification runs locally, so no API key is needed for routing decisions (see "Classify without an API call" below).
OpenAI
```python
from thinkrouter import ThinkRouter

client = ThinkRouter(
    provider="openai",
    api_key="sk-...",  # or set OPENAI_API_KEY
    model="gpt-4o",
    verbose=True,
)

response = client.chat("Explain how merge sort works.")
print(response.content)

print(response.routing)
# ClassifierResult(tier=FULL, confidence=0.87, budget=8000 tokens, latency=1.2ms)

client.usage.print_dashboard()
```
Anthropic
```python
from thinkrouter import ThinkRouter

client = ThinkRouter(
    provider="anthropic",
    api_key="sk-ant-...",  # or set ANTHROPIC_API_KEY
    model="claude-haiku-4-5-20251001",
)

response = client.chat("What is 144 divided by 12?")
# Routed to NO_THINK → 50 tokens, not 8,000
```
Streaming
```python
for chunk in client.stream("Explain quantum entanglement step by step."):
    print(chunk, end="", flush=True)
```
Classify without an API call
```python
results = client.classify_batch([
    "What is 7 * 8?",
    "Design a distributed caching system.",
    "How many days are in a leap year?",
])
for r in results:
    print(f"{r.tier.name:<12} budget={r.token_budget:>6} tokens  conf={r.confidence:.2f}")
```

```
NO_THINK     budget=    50 tokens  conf=0.88
FULL         budget=  8000 tokens  conf=0.85
NO_THINK     budget=    50 tokens  conf=0.80
```
Cost savings at scale
| Volume | Savings/day | Savings/month |
|---|---|---|
| 10,000 queries/day | $642 | $19,263 |
| 100,000 queries/day | $6,421 | $192,635 |
| 1,000,000 queries/day | $64,212 | $1,926,346 |
Based on a 53.5% savings rate and $15 per million reasoning tokens (approximate o1 pricing).
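The table can be reproduced with a back-of-envelope calculation. The constants come from the assumptions above; the result lands within a dollar of the table's figures, which were rounded from a slightly more precise savings rate:

```python
# Back-of-envelope reproduction of the savings table, assuming every
# query would otherwise consume the full 8,000-token thinking budget.
FULL_BUDGET = 8_000     # thinking tokens per unrouted query
SAVINGS_RATE = 0.535    # measured compute savings
PRICE_PER_M = 15.0      # USD per million reasoning tokens (approx. o1 rate)

def savings_per_day(queries_per_day: int) -> float:
    tokens_saved = queries_per_day * FULL_BUDGET * SAVINGS_RATE
    return tokens_saved / 1_000_000 * PRICE_PER_M

print(f"${savings_per_day(100_000):,.0f}/day")  # prints $6,420/day
```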
Classifier backends
Heuristic (default)
Zero dependencies. Regex patterns and word-count heuristics. Runs in under 1ms.
```python
client = ThinkRouter(classifier_backend="heuristic")
```
DistilBERT
Fine-tuned on GSM8K. Achieves 93%+ quality retention at 60% compute savings.
Requires pip install thinkrouter[classifier].
```python
client = ThinkRouter(
    classifier_backend="distilbert",
    confidence_threshold=0.75,
)
```
Confidence threshold
| Threshold | Savings | Quality retained | Use case |
|---|---|---|---|
| 0.65 | ~59% | ~91% | High cost sensitivity |
| 0.75 | ~55% | ~93% | Recommended |
| 0.85 | ~44% | ~96% | Quality-sensitive |
Queries classified below the threshold fall back to FULL, so low-confidence routing never degrades output quality.
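The fallback rule amounts to a one-line promotion. The `effective_tier` name below is illustrative, not part of the library's API:

```python
# A classification below the confidence threshold is promoted to the
# FULL budget rather than risking an under-resourced answer.
def effective_tier(tier: str, confidence: float, threshold: float = 0.75) -> str:
    return tier if confidence >= threshold else "FULL"

effective_tier("NO_THINK", 0.88)  # "NO_THINK": confident, stays cheap
effective_tier("NO_THINK", 0.60)  # "FULL": uncertain, falls back
```

This is why raising the threshold trades savings for quality: more queries get promoted to the expensive tier.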
API reference
ThinkRouter
```python
ThinkRouter(
    provider="openai",               # "openai" | "anthropic" | "generic"
    api_key=None,                    # falls back to OPENAI_API_KEY / ANTHROPIC_API_KEY
    model=None,                      # default model for all calls
    classifier_backend="heuristic",  # "heuristic" | "distilbert"
    confidence_threshold=0.75,
    max_records=10_000,
    verbose=False,
)
```
RouterResponse
```python
response.content        # str — generated text
response.routing        # ClassifierResult
response.provider       # "openai" | "anthropic"
response.model          # model identifier
response.usage_tokens   # {"prompt_tokens": N, "completion_tokens": M, ...}
```
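A hypothetical helper built on these fields can estimate per-call spend. The `call_cost` name and the dollar rates are assumptions for illustration, not library values:

```python
# Illustrative per-call cost estimate from a usage_tokens dict;
# the $/token rates here are assumed, not real provider pricing.
PROMPT_RATE = 15.0 / 1_000_000       # USD per prompt token (assumed)
COMPLETION_RATE = 60.0 / 1_000_000   # USD per completion token (assumed)

def call_cost(usage_tokens: dict) -> float:
    return (usage_tokens.get("prompt_tokens", 0) * PROMPT_RATE
            + usage_tokens.get("completion_tokens", 0) * COMPLETION_RATE)

call_cost({"prompt_tokens": 1_000, "completion_tokens": 500})  # ≈ $0.045
```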
ClassifierResult
```python
result.tier          # Tier.NO_THINK | Tier.SHORT | Tier.FULL
result.confidence    # float in [0, 1]
result.token_budget  # int — thinking tokens assigned
result.latency_ms    # classifier wall-clock time in ms
result.backend       # "heuristic" | "distilbert:cuda" | "distilbert:cpu"
```
Running tests
```shell
git clone https://github.com/saikoushiknalubola/thinkrouter.git
cd thinkrouter
pip install -e ".[dev]"
pytest tests/ -v
```
Roadmap
- Heuristic classifier
- OpenAI and Anthropic adapters
- Streaming support
- Thread-safe usage dashboard
- GitHub Actions CI (Python 3.9–3.12)
- DistilBERT model on HuggingFace Hub
- Multi-domain training (MMLU, HumanEval, ARC-Challenge)
- Async support (`achat()`, `astream()`)
- Continuous budget regression
- Hosted API proxy (api.thinkrouter.ai)
Research basis
- Zhao et al. (2025). SelfBudgeter. arXiv:2505.11274 — 74.47% savings validated
- Wang et al. (2025). TALE-EP. ACL Findings 2025 — 67% output token reduction
- Sanh et al. (2019). DistilBERT. arXiv:1910.01108
- Cobbe et al. (2021). GSM8K. arXiv:2110.14168
Contributing
See CONTRIBUTING.md. Issues and pull requests welcome.
License
MIT — see LICENSE.
File details
Details for the file thinkrouter-0.3.0.tar.gz.

File metadata
- Download URL: thinkrouter-0.3.0.tar.gz
- Size: 30.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `0e2bdd4b2fe2e8c74f3361527365325eb85ebf3eeca1d957bb6df28f9015396f` |
| MD5 | `192a2976c077547674949619c50c2c09` |
| BLAKE2b-256 | `5fa8e7579b10ba0510b727a8ea2305c19fd499ed4242ece53ab3be0907edfc97` |
File details
Details for the file thinkrouter-0.3.0-py3-none-any.whl.

File metadata
- Download URL: thinkrouter-0.3.0-py3-none-any.whl
- Size: 27.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `6e7c76d138f73f4b8df5a30b35949cc8d9a9ede60caf990d09d83dc6d6001523` |
| MD5 | `042ee4a39b2975f03b85ae105387fc76` |
| BLAKE2b-256 | `f2bf646447cb3bfdac1e947439622fc8cbfc39a05dac0d90d80d307fdf094e03` |