
ThinkRouter


Pre-inference query routing for LLM reasoning models.
Cut thinking-token costs by 60% with one line of code.


The problem

Reasoning models (OpenAI o1, DeepSeek-R1, Claude with extended thinking) apply the same 8,000-token thinking budget to every query, whether it is simple arithmetic or a complex proof.

"What is 2 + 3?"                   →  8,000 thinking tokens   ← 99% wasted
"Prove that sqrt(2) is irrational"  →  8,000 thinking tokens   ← correctly used

At 100,000 queries per day, that is $192,635/month in avoidable spend.


The solution

from thinkrouter import ThinkRouter

client   = ThinkRouter(provider="openai")
response = client.chat("What is the capital of France?")
# Routed to NO_THINK → 50 tokens used, not 8,000

client.usage.print_dashboard()
  ThinkRouter — Usage Dashboard
  ──────────────────────────────────────────────
  Total calls          : 13
  Tokens saved         : 55,650
  Compute savings      : 53.5%
  Avg classifier time  : 0.02 ms

  Routing breakdown:
    no_think        :      7  (53.8%)  — Direct answer
    short_think     :      0  ( 0.0%)  — Moderate reasoning
    full_think      :      6  (46.2%)  — Full extended reasoning

How it works

ThinkRouter intercepts each query, runs a lightweight classifier in under 1ms, and routes to the minimum compute budget:

Tier       Budget        Use case
NO_THINK   50 tokens     Arithmetic, definitions, lookups, translations
SHORT      800 tokens    Multi-step reasoning, moderate chaining
FULL       8,000 tokens  Proofs, system design, algorithm implementation
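The tier-to-budget mapping above can be sketched as a small enum. This layout is illustrative only, not the library's actual internals:

```python
from enum import Enum

class Tier(Enum):
    NO_THINK = 50    # direct answer: arithmetic, definitions, lookups
    SHORT = 800      # moderate multi-step reasoning
    FULL = 8_000     # proofs, system design, implementation

def token_budget(tier: Tier) -> int:
    """Thinking-token budget assigned to a routing tier."""
    return tier.value

print(token_budget(Tier.NO_THINK))  # → 50
```

Encoding the budget as the enum value keeps the tier and its cost in one place.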

Installation

# Base install — works immediately, zero ML dependencies
pip install thinkrouter

# With fine-tuned DistilBERT classifier (higher accuracy)
pip install thinkrouter[classifier]

# With OpenAI client
pip install thinkrouter[openai]

# With Anthropic client
pip install thinkrouter[anthropic]

# Everything
pip install thinkrouter[all]

Quick start

Try it now — no API key needed

Open In Colab

OpenAI

from thinkrouter import ThinkRouter

client = ThinkRouter(
    provider="openai",
    api_key="sk-...",      # or set OPENAI_API_KEY
    model="gpt-4o",
    verbose=True,
)

response = client.chat("Explain how merge sort works.")
print(response.content)
print(response.routing)
# ClassifierResult(tier=FULL, confidence=0.87, budget=8000 tokens, latency=1.2ms)

client.usage.print_dashboard()

Anthropic

client = ThinkRouter(
    provider="anthropic",
    api_key="sk-ant-...",  # or set ANTHROPIC_API_KEY
    model="claude-haiku-4-5-20251001",
)

response = client.chat("What is 144 divided by 12?")
# Routed to NO_THINK → 50 tokens, not 8,000

Streaming

for chunk in client.stream("Explain quantum entanglement step by step."):
    print(chunk, end="", flush=True)

Classify without an API call

results = client.classify_batch([
    "What is 7 * 8?",
    "Design a distributed caching system.",
    "How many days are in a leap year?",
])

for r in results:
    print(f"{r.tier.name:<12}  budget={r.token_budget:>6} tokens  conf={r.confidence:.2f}")
  NO_THINK      budget=    50 tokens  conf=0.88
  FULL          budget=  8000 tokens  conf=0.85
  NO_THINK      budget=    50 tokens  conf=0.80

Cost savings at scale

Volume                 Savings/day  Savings/month
10,000 queries/day     $642         $19,263
100,000 queries/day    $6,421       $192,635
1,000,000 queries/day  $64,212      $1,926,346

Based on a 53.5% savings rate and $15 per million reasoning tokens (approximate o1 pricing).
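As a rough sanity check of these figures, assuming every query would otherwise spend the full 8,000-token budget:

```python
queries_per_day = 100_000
full_budget = 8_000             # thinking tokens per query without routing
savings_rate = 0.535            # observed compute savings
price_per_million = 15.0        # USD per 1M reasoning tokens (approx. o1 rate)

tokens_per_day = queries_per_day * full_budget                  # 800M tokens
cost_per_day = tokens_per_day / 1_000_000 * price_per_million   # $12,000
savings_per_day = cost_per_day * savings_rate                   # ~$6,420
savings_per_month = savings_per_day * 30                        # ~$192,600

print(round(savings_per_day), round(savings_per_month))
```

This lands close to the table's $6,421/day figure; the small difference comes from rounding of the savings rate.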


Classifier backends

Heuristic (default)

Zero dependencies. Regex patterns and word-count heuristics. Runs in under 1ms.

client = ThinkRouter(classifier_backend="heuristic")
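The heuristic backend is described as regex patterns plus word-count rules. A minimal sketch of that idea follows; the patterns, keywords, and thresholds here are illustrative guesses, not the library's actual rules:

```python
import re

# Queries matching these patterns get a direct answer (hypothetical rules)
NO_THINK_PATTERNS = [
    r"^\s*what is [\d\s\+\-\*/\.]+\??\s*$",     # plain arithmetic
    r"^\s*(define|translate|what is the capital of)\b",
]
# Keywords that suggest heavyweight reasoning (hypothetical list)
FULL_KEYWORDS = {"prove", "design", "implement", "derive", "optimize"}

def route(query: str) -> str:
    q = query.lower()
    if any(re.match(p, q) for p in NO_THINK_PATTERNS):
        return "NO_THINK"
    if any(k in q for k in FULL_KEYWORDS) or len(q.split()) > 40:
        return "FULL"
    return "SHORT"

print(route("What is 2 + 3?"))                    # NO_THINK
print(route("Prove that sqrt(2) is irrational"))  # FULL
```

Rules like these run in microseconds, which is how a pre-inference router can stay under the 1ms budget quoted above.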

DistilBERT

Fine-tuned on GSM8K. Achieves 93%+ quality retention at 60% compute savings.
Requires pip install thinkrouter[classifier].

client = ThinkRouter(
    classifier_backend="distilbert",
    confidence_threshold=0.75,
)

Confidence threshold

Threshold  Savings  Quality retained  Use case
0.65       ~59%     ~91%              High cost sensitivity
0.75       ~55%     ~93%              Recommended
0.85       ~44%     ~96%              Quality-sensitive

Queries classified with confidence below the threshold fall back to FULL, so routing never degrades output quality.
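The fallback rule can be sketched as follows; the function is illustrative, not the library's API:

```python
def apply_threshold(tier: str, confidence: float, threshold: float = 0.75) -> str:
    """Keep the classifier's tier only when it is confident enough;
    otherwise fall back to the full thinking budget."""
    return tier if confidence >= threshold else "FULL"

print(apply_threshold("NO_THINK", 0.88))  # NO_THINK: confident, keep cheap tier
print(apply_threshold("NO_THINK", 0.60))  # FULL: low confidence, play it safe
```

Raising the threshold trades savings for quality, which is exactly the gradient the table above describes.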


API reference

ThinkRouter

ThinkRouter(
    provider             = "openai",      # "openai" | "anthropic" | "generic"
    api_key              = None,          # falls back to OPENAI_API_KEY / ANTHROPIC_API_KEY
    model                = None,          # default model for all calls
    classifier_backend   = "heuristic",   # "heuristic" | "distilbert"
    confidence_threshold = 0.75,
    max_records          = 10_000,
    verbose              = False,
)

RouterResponse

response.content       # str — generated text
response.routing       # ClassifierResult
response.provider      # "openai" | "anthropic"
response.model         # model identifier
response.usage_tokens  # {"prompt_tokens": N, "completion_tokens": M, ...}

ClassifierResult

result.tier          # Tier.NO_THINK | Tier.SHORT | Tier.FULL
result.confidence    # float in [0, 1]
result.token_budget  # int — thinking tokens assigned
result.latency_ms    # classifier wall-clock time in ms
result.backend       # "heuristic" | "distilbert:cuda" | "distilbert:cpu"

Running tests

git clone https://github.com/saikoushiknalubola/thinkrouter.git
cd thinkrouter
pip install -e ".[dev]"
pytest tests/ -v

Roadmap

  • Heuristic classifier
  • OpenAI and Anthropic adapters
  • Streaming support
  • Thread-safe usage dashboard
  • GitHub Actions CI (Python 3.9–3.12)
  • DistilBERT model on HuggingFace Hub
  • Multi-domain training (MMLU, HumanEval, ARC-Challenge)
  • Async support (achat(), astream())
  • Continuous budget regression
  • Hosted API proxy (api.thinkrouter.ai)

Research basis

  • Zhao et al. (2025). SelfBudgeter. arXiv:2505.11274 — 74.47% savings validated
  • Wang et al. (2025). TALE-EP. ACL Findings 2025 — 67% output token reduction
  • Sanh et al. (2019). DistilBERT. arXiv:1910.01108
  • Cobbe et al. (2021). GSM8K. arXiv:2110.14168

Contributing

See CONTRIBUTING.md. Issues and pull requests welcome.


License

MIT — see LICENSE.

Download files

Download the file for your platform.

Source Distribution

thinkrouter-0.2.0.tar.gz (24.4 kB)

Built Distribution


thinkrouter-0.2.0-py3-none-any.whl (20.7 kB)

File details

Details for the file thinkrouter-0.2.0.tar.gz.

File metadata

  • Download URL: thinkrouter-0.2.0.tar.gz
  • Upload date:
  • Size: 24.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for thinkrouter-0.2.0.tar.gz
Algorithm Hash digest
SHA256 77b07a94e4b4b389071cd5e583466aa2751242cef01ed7ba9a1baff581ca254c
MD5 cf8f93840f1e36e96fb62dabace3cc7f
BLAKE2b-256 d3f703d5293d3200b5261c6bf2fa81b534dbe4a14bcc2f375087713df748ff35

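To check a downloaded artifact against the SHA256 digest above, a standard verification snippet is:

```python
import hashlib

def sha256_of(path: str) -> str:
    """Stream the file through SHA-256 and return the hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

expected = "77b07a94e4b4b389071cd5e583466aa2751242cef01ed7ba9a1baff581ca254c"
# assert sha256_of("thinkrouter-0.2.0.tar.gz") == expected
```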

File details

Details for the file thinkrouter-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: thinkrouter-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 20.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for thinkrouter-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 65491d7b8cce68c7d5ba5dc462d5357e27b21cc8adcbae50ee2e4911cca0901e
MD5 62be483d4a84ee44796ba06549ca2798
BLAKE2b-256 26d378b3f34061b66897a27f3d1e6921bfb3227c759c0840c03d83da25a202ad

