
Adaptive inference budget controller for self-hosted LLMs. Controls thinking tokens, tracks GPU cost per query.


ThinkBudget

A proxy for self-hosted LLMs that caps how much the model thinks per query.

Reasoning models burn thousands of tokens on internal monologue before answering. A greeting gets the same GPU time as a proof. ThinkBudget fixes that.

"What is 2+2?"     → TRIVIAL  →   32 thinking tokens  → $0.000002
"Debug this race    → COMPLEX  → 2048 thinking tokens  → $0.000180
 condition..."

It classifies the query, sets a token budget, forwards to your backend, and tracks GPU cost. No LLM call for classification. Under a millisecond.

The problem

DeepSeek-R1, Qwen-QwQ, and similar models wrap their reasoning in <think> tags. Without a budget, every query gets the same treatment. "Hello" triggers 4,000 tokens of internal deliberation. You pay for all of it.

Several papers describe adaptive reasoning (Ares, AutoThink, Conformal Thinking). None ship a tool you can deploy.

How it works

┌─────────────┐     ┌──────────────────┐     ┌─────────────────┐
│ Application │────▶│   ThinkBudget    │────▶│  vLLM / SGLang  │
│ (any client)│◀────│   Proxy          │◀────│  / Ollama       │
└─────────────┘     └──────────────────┘     └─────────────────┘
  1. Classify. Heuristic scorer reads the query. No model call. Assigns a tier.
  2. Budget. Sets a thinking token cap: 32, 128, 512, 2048, or 8192.
  3. Forward. Sends the request to your backend with the cap applied.
  4. Enforce. During streaming, injects </think> when the budget runs out.
  5. Track. Samples GPU power via pynvml. Records cost per query in dollars and joules.
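Step 4 is the interesting one: the proxy watches the token stream and closes the reasoning block itself once the cap runs out. A minimal sketch of that logic, with illustrative names rather than ThinkBudget's actual internals:

```python
# Sketch of step 4: cap thinking tokens in a streamed response.
# Illustrative only -- function and state names are not ThinkBudget's API.

def enforce_budget(token_stream, budget):
    """Yield tokens, injecting </think> once `budget` thinking tokens pass."""
    thinking = False   # inside the model's <think> block
    exhausted = False  # budget spent; drop thinking tokens until model closes
    used = 0
    for tok in token_stream:
        if tok == "<think>":
            thinking = True
            yield tok
        elif tok == "</think>":
            if not exhausted:
                yield tok  # model closed its own block in time
            thinking = False
            exhausted = False
        elif thinking and exhausted:
            continue  # swallow over-budget reasoning
        elif thinking:
            used += 1
            if used > budget:
                yield "</think>"  # inject the close ourselves
                exhausted = True
            else:
                yield tok
        else:
            yield tok

stream = ["<think>", "a", "b", "c", "d", "</think>", "answer"]
print(list(enforce_budget(stream, 2)))
# → ['<think>', 'a', 'b', '</think>', 'answer']
```

Reasoning models generally produce the closing tag themselves; the injection only fires when the budget is exhausted first.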

Quick start

Install

# Install the CLI
uv tool install thinkbudget

# Or add to a project as a dependency
uv add thinkbudget

# With GPU monitoring
uv tool install 'thinkbudget[gpu]'

# One-off run without installing
uvx thinkbudget classify "Hello"

Don't have uv? Install it or use pip: pip install thinkbudget

Run

thinkbudget serve \
  --backend-url http://localhost:8000 \
  --model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
  --gpu-cost 0.39 \
  --gpu-name "RTX 4090"

Point your app at http://localhost:9100 instead of the backend:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:9100/v1", api_key="unused")

# 32-token thinking budget
client.chat.completions.create(
    model="thinkbudget",
    messages=[{"role": "user", "content": "Hello"}],
)

# 2048-token thinking budget
client.chat.completions.create(
    model="thinkbudget",
    messages=[{"role": "user", "content": "Design a distributed task queue with at-least-once delivery"}],
)

Every response carries cost headers:

X-ThinkBudget-Tier: complex
X-ThinkBudget-Budget: 2048
X-ThinkBudget-Thinking-Tokens: 1847
X-ThinkBudget-Cost: $0.00018200
X-ThinkBudget-Energy-J: 42.3100
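The headers are plain strings, so a client can fold them into its own accounting. A sketch of parsing them; the header names come from the example above, the function itself is illustrative:

```python
# Sketch: pull per-query cost data out of the response headers.
# Header names match the example above; the parsing code is illustrative.

def parse_cost_headers(headers: dict) -> dict:
    return {
        "tier": headers["X-ThinkBudget-Tier"],
        "budget": int(headers["X-ThinkBudget-Budget"]),
        "thinking_tokens": int(headers["X-ThinkBudget-Thinking-Tokens"]),
        "cost_usd": float(headers["X-ThinkBudget-Cost"].lstrip("$")),
        "energy_j": float(headers["X-ThinkBudget-Energy-J"]),
    }

headers = {
    "X-ThinkBudget-Tier": "complex",
    "X-ThinkBudget-Budget": "2048",
    "X-ThinkBudget-Thinking-Tokens": "1847",
    "X-ThinkBudget-Cost": "$0.00018200",
    "X-ThinkBudget-Energy-J": "42.3100",
}
print(parse_cost_headers(headers)["cost_usd"])  # → 0.000182
```

Recent versions of the openai Python SDK expose response headers through `client.chat.completions.with_raw_response.create(...)`.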

Dashboard

http://localhost:9100/dashboard shows a live feed: queries, tiers, budgets, tokens used, GPU power, cost per query, cumulative savings.
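The dollar and joule figures reduce to simple arithmetic. A simplified model, assuming cost is wall time billed at the configured $/hr and energy is average sampled power times duration, not ThinkBudget's exact code:

```python
# Sketch of the cost accounting: dollars from wall time at a fixed $/hr rate,
# joules from sampled GPU power. A simplified model, not the actual gpu_monitor.

GPU_COST_PER_HOUR = 0.39  # RTX 4090 example rate from the Run section

def query_cost(duration_s: float, avg_power_w: float) -> tuple[float, float]:
    dollars = GPU_COST_PER_HOUR * duration_s / 3600.0
    joules = avg_power_w * duration_s  # watts x seconds
    return dollars, joules

dollars, joules = query_cost(duration_s=2.0, avg_power_w=300.0)
print(round(dollars, 6), joules)  # → 0.000217 600.0
```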

Classify only

# With thinkbudget installed
thinkbudget classify "What is the capital of France?"
# Tier: SIMPLE | Budget: 128 tokens

# Or as a one-off
uvx thinkbudget classify "Prove the halting problem is undecidable"
# Tier: COMPLEX | Budget: 2,048 tokens

Tiers

Tier      Budget  Example
TRIVIAL   32      "Hello"
SIMPLE    128     "What is the capital of France?"
MODERATE  512     "Compare REST vs GraphQL"
COMPLEX   2,048   "Debug this race condition"
DEEP      8,192   "Prove the halting problem is undecidable"

All budgets are configurable. Per tier, per user, per session.
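For a feel of how a heuristic scorer can reach these tiers without a model call, here is an illustrative sketch. The real classifier's features are not documented here; query length and a few keyword cues stand in:

```python
# Sketch of a heuristic tier classifier in the spirit of step 1.
# Keyword lists and thresholds are illustrative, not ThinkBudget's scorer.

TIER_BUDGETS = {"TRIVIAL": 32, "SIMPLE": 128, "MODERATE": 512,
                "COMPLEX": 2048, "DEEP": 8192}

def classify(query: str) -> tuple[str, int]:
    q = query.lower()
    words = len(q.split())
    if any(k in q for k in ("prove", "undecidable", "theorem")):
        tier = "DEEP"
    elif any(k in q for k in ("debug", "design", "race condition")):
        tier = "COMPLEX"
    elif any(k in q for k in ("compare", "trade-off", " vs ")):
        tier = "MODERATE"
    elif words <= 2:
        tier = "TRIVIAL"
    else:
        tier = "SIMPLE"
    return tier, TIER_BUDGETS[tier]

print(classify("Hello"))                    # → ('TRIVIAL', 32)
print(classify("Compare REST vs GraphQL"))  # → ('MODERATE', 512)
```

A string scan like this is why classification stays under a millisecond: no tokenizer, no model, just a pass over the text.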

Deploy on CloudRift

Built for CloudRift GPU instances.

git clone https://github.com/Siddhant-K-code/ThinkBudget
cd ThinkBudget
./cloudrift/run.sh

This starts vLLM with DeepSeek-R1-Distill-Qwen-7B and ThinkBudget in front of it. One command.

For a different model or GPU:

MODEL=Qwen/QwQ-32B GPU_COST=0.65 GPU_NAME="RTX 5090" ./cloudrift/run.sh

Docker works too:

docker build -t thinkbudget .
docker run --gpus all -p 9100:9100 -p 8000:8000 thinkbudget

Benchmarks

85 queries across all five tiers. Run them against your backend with and without ThinkBudget:

python benchmarks/run_benchmark.py --mode compare \
  --backend-url http://localhost:8000 \
  --proxy-url http://localhost:9100 \
  --model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B

Generate a markdown report:

python benchmarks/compare.py \
  benchmarks/results/summary_baseline_*.json \
  benchmarks/results/summary_budgeted_*.json \
  benchmarks/results/report.md

API

Proxy

Endpoint              Method  What it does
/v1/chat/completions  POST    Proxies with budget control
/v1/models            GET     Lists backend models
/health               GET     Status and GPU info

Dashboard

Endpoint       Method  What it does
/api/stats     GET     Totals and distributions
/api/history   GET     Recent queries
/api/gpu       GET     Live GPU metrics
/api/classify  POST    Classify without forwarding

Environment variables

Variable                       Default
THINKBUDGET_BACKEND_URL        http://localhost:8000
THINKBUDGET_MODEL              default
THINKBUDGET_PORT               9100
THINKBUDGET_GPU_COST_PER_HOUR  0.39
THINKBUDGET_GPU_NAME           RTX 4090
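Resolving these variables is ordinary environment lookup with defaults. A sketch; the variable names and defaults match the table above, but the loading code itself is illustrative, not config.py:

```python
# Sketch: resolve settings from the environment with the documented defaults.
import os

def load_config(env=os.environ) -> dict:
    return {
        "backend_url": env.get("THINKBUDGET_BACKEND_URL", "http://localhost:8000"),
        "model": env.get("THINKBUDGET_MODEL", "default"),
        "port": int(env.get("THINKBUDGET_PORT", "9100")),
        "gpu_cost_per_hour": float(env.get("THINKBUDGET_GPU_COST_PER_HOUR", "0.39")),
        "gpu_name": env.get("THINKBUDGET_GPU_NAME", "RTX 4090"),
    }

print(load_config({})["port"])  # → 9100
```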

Structure

src/thinkbudget/
classifier.py       # Scores query complexity. No LLM call.
  budget.py           # Sets and enforces token budgets.
  gpu_monitor.py      # Reads GPU power via pynvml. Computes cost.
  proxy.py            # OpenAI-compatible proxy. Streams with enforcement.
  dashboard.py        # Serves the live dashboard.
  models.py           # Data types.
  config.py           # Loads config from file or env.
  cli.py              # Entry point.

benchmarks/
  datasets/           # 85 queries, 5 tiers.
  run_benchmark.py    # Runs baseline vs budgeted.
  compare.py          # Generates comparison report.

cloudrift/
  run.sh              # One-command GPU deploy.

See ARCHITECTURE.md for the full design.

Part of a stack

ThinkBudget fits between Distill and LLMTraceFX:

Distill       → fewer input tokens (clean context)
ThinkBudget   → fewer thinking tokens (adaptive budget)
LLMTraceFX    → faster inference (GPU kernel profiling)

License

AGPL-3.0

Siddhant Khare
