Intelligent LLM request routing: classify prompts, route to optimal model, cut costs 40-70%
mmrouter
Intelligent LLM request routing. A classifier analyzes each prompt and routes it to the right model: simple queries go to cheap/fast models (Haiku), complex reasoning goes to powerful ones (Opus), everything else to the balanced middle (Sonnet).
This is product-driven routing, not a "user picks a model" proxy. The system decides which model fits the task.
Why this exists
Most LLM applications send every request to the same model. That's wasteful. "What's the capital of France?" doesn't need Opus. A multi-step reasoning problem shouldn't go to Haiku.
The cost difference is real:
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| Claude Haiku | $0.80 | $4.00 |
| Claude Sonnet | $3.00 | $15.00 |
| Claude Opus | $15.00 | $75.00 |
Haiku is 3.75x cheaper than Sonnet per token, and Sonnet is 5x cheaper than Opus. If your traffic is mostly simple queries, routing saves real money.
Cost math
Assume 1,000 requests/day, 500 input + 1,000 output tokens each.
| Scenario | Monthly cost | vs All-Sonnet | vs All-Opus |
|---|---|---|---|
| All-Opus | $2,475 | +400% | baseline |
| All-Sonnet | $495 | baseline | -80% |
| Routed (60/30/10) | $475 | -4% | -81% |
| Routed (70/25/5) | $340 | -31% | -86% |
Where "60/30/10" means 60% simple (Haiku), 30% medium (Sonnet), 10% complex (Opus).
The savings depend entirely on your traffic mix. If most of your requests are simple lookups and straightforward tasks, routing pays off significantly. If your workload is mostly complex reasoning, you're sending most requests to Opus anyway and routing won't save much. The router also adds quality value: complex prompts get routed to more capable models instead of being handled by a cheaper one that might produce worse results.
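The cost table above can be reproduced with a few lines of arithmetic (prices and the 500-in / 1000-out traffic assumption are taken from this README; the helper names are illustrative):

```python
# Per-1M-token prices (input, output) from the cost table above.
PRICES = {"haiku": (0.80, 4.00), "sonnet": (3.00, 15.00), "opus": (15.00, 75.00)}

def request_cost(model: str, tokens_in: int = 500, tokens_out: int = 1000) -> float:
    """Cost of one request at the assumed 500-in / 1000-out token shape."""
    p_in, p_out = PRICES[model]
    return tokens_in * p_in / 1e6 + tokens_out * p_out / 1e6

def monthly_cost(mix: dict[str, float], requests_per_day: int = 1000) -> float:
    """Monthly cost for a traffic mix, e.g. {'haiku': 0.6, 'sonnet': 0.3, 'opus': 0.1}."""
    per_request = sum(share * request_cost(m) for m, share in mix.items())
    return per_request * requests_per_day * 30

print(round(monthly_cost({"opus": 1.0})))                               # 2475 (all-Opus)
print(round(monthly_cost({"sonnet": 1.0})))                             # 495  (all-Sonnet)
print(round(monthly_cost({"haiku": 0.6, "sonnet": 0.3, "opus": 0.1})))  # 475  (60/30/10)
```

Plug in your own traffic mix to estimate savings before deploying the router.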
Architecture
Prompt -> Classifier -> Router Engine -> Provider (LiteLLM) -> Response
                             |
                    +--------+--------+
                    |        |        |
                 Circuit  Cascade   Budget
                 Breaker  Routing  Manager
                    |        |        |
                    +--------+--------+
                             |
                  Tracker (SQLite) -> Alerts
                             |
                  +----------+----------+
                  |                     |
              Dashboard             REST API
          (FastAPI+React)    (OpenAI-compatible)
Classifier analyzes the prompt along two dimensions:
- Complexity: simple, medium, complex
- Category: factual, reasoning, creative, code
Router Engine maps (complexity, category) to a model using YAML config. Supports confidence-based escalation, cascade routing, budget constraints, adaptive reranking, and A/B testing.
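At its core, the (complexity, category) -> model mapping is a nested lookup with a confidence check; a minimal sketch (the table shape mirrors the YAML config shown later, but the lookup logic here is illustrative, not mmrouter's actual engine):

```python
# Hypothetical routing table mirroring the YAML config shape.
ROUTES = {
    ("simple", "factual"):    {"model": "claude-haiku-4-5-20251001", "fallbacks": ["claude-sonnet-4-6"]},
    ("complex", "reasoning"): {"model": "claude-opus-4-6",           "fallbacks": ["claude-sonnet-4-6"]},
}
DEFAULT = {"model": "claude-sonnet-4-6", "fallbacks": []}

def resolve(complexity: str, category: str, confidence: float, threshold: float = 0.7) -> str:
    """Pick a model; low-confidence classifications escalate to the balanced default."""
    if confidence < threshold:
        return DEFAULT["model"]  # don't trust a shaky classification with a cheap model
    return ROUTES.get((complexity, category), DEFAULT)["model"]

print(resolve("simple", "factual", 0.82))   # claude-haiku-4-5-20251001
print(resolve("simple", "factual", 0.55))   # claude-sonnet-4-6 (escalated)
```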
Circuit Breaker tracks failures per model and per provider. After consecutive failures, routes to fallback models automatically.
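A per-model consecutive-failure breaker can be sketched as follows (the threshold and class API here are assumptions for illustration, not mmrouter's actual implementation):

```python
class CircuitBreaker:
    """Opens after N consecutive failures for a model; a success resets the count."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.failures: dict[str, int] = {}

    def record(self, model: str, ok: bool) -> None:
        self.failures[model] = 0 if ok else self.failures.get(model, 0) + 1

    def available(self, model: str) -> bool:
        return self.failures.get(model, 0) < self.threshold

    def pick(self, primary: str, fallbacks: list[str]) -> str:
        """First available model in the chain; last fallback as a final resort."""
        for model in [primary, *fallbacks]:
            if self.available(model):
                return model
        return fallbacks[-1] if fallbacks else primary
```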
Cascade Routing tries the cheapest model first. If the response fails a quality gate (too short, hedging phrases), escalates to a stronger model.
Budget Manager enforces daily spending limits. Dynamically downgrades model selection as spend approaches the limit.
Tracker logs every request to SQLite (WAL mode): model used, tokens, cost, latency, classification result.
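The tracker's write path can be sketched with the stdlib sqlite3 module (the schema below is an illustration of the fields listed above; mmrouter's actual schema may differ):

```python
import sqlite3

def open_tracker(path: str = ":memory:") -> sqlite3.Connection:
    db = sqlite3.connect(path)
    db.execute("PRAGMA journal_mode=WAL")  # readers don't block the router's writes
    db.execute("""CREATE TABLE IF NOT EXISTS requests (
        id INTEGER PRIMARY KEY, model TEXT, complexity TEXT, category TEXT,
        tokens_in INTEGER, tokens_out INTEGER, cost REAL, latency_ms INTEGER)""")
    return db

db = open_tracker()
db.execute(
    "INSERT INTO requests (model, complexity, category, tokens_in, tokens_out, cost, latency_ms)"
    " VALUES (?,?,?,?,?,?,?)",
    ("claude-haiku-4-5-20251001", "simple", "factual", 12, 8, 0.000042, 245))
total, = db.execute("SELECT SUM(cost) FROM requests").fetchone()  # feeds `mmrouter stats`
```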
Three classifier strategies
| Strategy | Accuracy | Cost | Speed | Notes |
|---|---|---|---|---|
| Rules | 67% overall (78% complexity, 83% category) | Free | <1ms | Pattern matching heuristics. No dependencies. |
| Embeddings | 78% overall (84% complexity, 90% category) | Free | ~50ms | kNN on sentence-transformers (MiniLM-L6-v2). Needs pip install mmrouter[embeddings]. |
| LLM | Not yet benchmarked | API cost per request | ~1s | Uses a cheap model (Haiku) to classify before routing. Requires API key. |
The embedding classifier is the best default for production. Rules work fine for development and testing. Custom training on your own data is supported via mmrouter train.
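To make "pattern matching heuristics" concrete, here is a toy rules classifier in the same spirit (the patterns, word-count cutoffs, and fixed confidence are invented for illustration and are not mmrouter's actual rules):

```python
import re

def classify(prompt: str) -> dict:
    """Toy heuristic classifier: keyword patterns -> category, length -> complexity."""
    p = prompt.lower()
    if re.search(r"\b(write|poem|haiku|story)\b", p):
        category = "creative"
    elif re.search(r"\b(code|function|debug|implement)\b", p):
        category = "code"
    elif re.search(r"\b(analyze|trade-offs?|why|compare|explain)\b", p):
        category = "reasoning"
    else:
        category = "factual"
    words = len(p.split())
    complexity = "simple" if words < 8 else "medium" if words < 30 else "complex"
    return {"complexity": complexity, "category": category, "confidence": 0.75}

print(classify("What is the capital of France?"))  # simple / factual
```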
Quickstart
1. Install
pip install mmrouter
For embedding classifier support:
pip install "mmrouter[embeddings]"
Requires Python 3.11+.
2. Set up
mmrouter init
Walks you through provider selection (Anthropic, OpenAI, or Google), checks your API key, and generates a routing config.
Or set up manually:
export ANTHROPIC_API_KEY=your-key # or OPENAI_API_KEY, GOOGLE_API_KEY
3. Route your first prompt
mmrouter route "What is the capital of France?"
Output:
France's capital is Paris.
Model: claude-haiku-4-5-20251001
Complexity: simple
Category: factual
Confidence: 0.82
Tokens: 12 in / 8 out
Cost: $0.000042
Latency: 245ms
4. Check your savings
mmrouter stats --detailed
5. Try the REST API
mmrouter serve --port 8080
A drop-in replacement for the OpenAI API. Point your existing code at http://localhost:8080/v1 and the router handles model selection.
Common issues
Missing API key: Set the env var for your provider before routing.
export ANTHROPIC_API_KEY=your-key
export OPENAI_API_KEY=your-key # if using OpenAI
export GOOGLE_API_KEY=your-key # if using Google
Config not found: Run mmrouter init to generate a config, or pass a custom path with mmrouter -c path/to/config.yaml route "prompt".
Python version: mmrouter requires Python 3.11 or higher.
Configuration
Routing is configured in YAML. The default config is generated by mmrouter init or found at configs/default.yaml.
routes:
  simple:
    factual:
      model: claude-haiku-4-5-20251001
      fallbacks:
        - claude-sonnet-4-6
  ...

classifier:
  strategy: rules   # rules | embeddings | llm
  threshold: 0.7    # confidence below this triggers escalation

provider:
  timeout_ms: 30000
  max_retries: 2
  ...
See configs/ for examples: default.yaml (Anthropic), openai.yaml (OpenAI), cascade.yaml (cascade routing), budget.yaml (budget-constrained).
CLI usage
Route a prompt
mmrouter route "What is the capital of France?"
# Paris.
# [simple/factual] model=claude-haiku-4-5-20251001 cost=$0.000042 latency=245ms tokens=12+8
mmrouter route "Analyze the trade-offs between microservices and monoliths" -v
# (detailed response)
# [complex/reasoning] model=claude-opus-4-6 cost=$0.024675 latency=2100ms tokens=45+320
Classify only (no API call)
mmrouter classify "Explain quantum entanglement"
# {
# "complexity": "medium",
# "category": "reasoning",
# "confidence": 0.85
# }
mmrouter classify "Write a haiku about rain" --classifier embeddings
Run accuracy eval
mmrouter eval --classifier rules
# Overall: 67.0% (80/120)
# Complexity: 78.0%
# Category: 83.0%
mmrouter eval --classifier embeddings
# Overall: 78.0% (94/120)
# Complexity: 84.0%
# Category: 90.0%
Compare all classifiers
mmrouter compare
# Classifier Overall Complexity Category Time
# ---------- ------- ---------- -------- -----
# rules 67.0% 78.0% 83.0% 0.01s
# embeddings 78.0% 84.0% 90.0% 1.23s
Train custom embedding classifier
mmrouter train --data my_data.yaml --output models/custom --eval-split 0.2
# Loaded 500 examples
# Eval split: 400 train, 100 eval
# Training... done
# Eval accuracy: 82.0%
# Saved to models/custom/
# Use trained model:
mmrouter classify "prompt" --classifier embeddings --trained-model models/custom
Cost analytics
mmrouter stats
# Requests: 142
# Total cost: $0.234500
# Avg latency: 380ms
# Tokens in/out: 15200/28400
# Fallbacks: 3
mmrouter stats --detailed
# (adds daily cost breakdown, savings vs all-Sonnet baseline,
# distribution by complexity/category, cascade stats, budget status)
LLM-as-judge quality eval
mmrouter quality --sample 20
# Evaluates whether routed responses match baseline (all-Sonnet) quality.
# Reports score, relevance, accuracy, completeness deltas.
Feedback for adaptive routing
mmrouter feedback <request_id> up # thumbs up
mmrouter feedback <request_id> down # thumbs down
A/B testing
mmrouter experiment create --name "cascade-test" \
--control configs/default.yaml \
--treatment configs/cascade.yaml \
--split 0.5
mmrouter experiment status
mmrouter experiment stop
Alerting
mmrouter alerts status
mmrouter alerts test --webhook-url https://hooks.slack.com/services/...
REST API server (OpenAI-compatible)
mmrouter serve --port 8080
# Starts OpenAI-compatible API at http://localhost:8080
# Use with any OpenAI SDK client:
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "auto", "messages": [{"role": "user", "content": "Hello"}]}'
# Or with the OpenAI Python SDK:
# client = OpenAI(base_url="http://localhost:8080/v1", api_key="any")
# response = client.chat.completions.create(model="auto", messages=[...])
model: "auto" triggers intelligent routing. Explicit model names bypass classification and go directly to the provider.
Classification metadata is returned in X-MMRouter-* response headers.
Auth: set MMROUTER_API_KEY env var to require Bearer token auth. Unset = no auth (local dev).
Dashboard
mmrouter dashboard --port 8000
# Starts FastAPI backend + serves React SPA at http://localhost:8000
Key features
Multi-provider failover
Cross-provider fallback chains with provider-level circuit breaker. If Anthropic is down, routes to OpenAI/Google automatically. See configs/multi-provider.yaml.
Cascade routing
Try the cheapest model first. If the response fails a quality gate (too short, contains hedging phrases), automatically escalate to a stronger model. Saves cost when cheap models can handle the task. See configs/cascade.yaml.
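The quality gate and escalation loop can be sketched like this (the length floor, hedging-phrase list, and function names are assumptions; mmrouter's actual gate may check more):

```python
HEDGES = ("i'm not sure", "i cannot", "as an ai", "i don't know")

def passes_gate(response: str, min_chars: int = 40) -> bool:
    """Reject responses that are too short or contain hedging boilerplate."""
    text = response.strip().lower()
    return len(text) >= min_chars and not any(h in text for h in HEDGES)

def cascade(prompt: str, chain: list, call) -> str:
    """Try models cheapest-first; escalate on gate failure.
    `call(model, prompt)` is the provider hook."""
    response = ""
    for model in chain:
        response = call(model, prompt)
        if passes_gate(response):
            return response
    return response  # best effort: the last (strongest) model's answer
```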
Budget mode
Set a daily spending limit. The router dynamically downgrades model selection as spend approaches the limit:
- <75%: normal routing
- 75-90%: warn (log only)
- 90-100%: downgrade (complex->medium, medium->simple)
- 100%+: force cheapest model or reject requests
Adaptive routing
Track user feedback (thumbs up/down) via API. The router learns which models perform best for each query type and reranks the fallback chain accordingly.
A/B testing
Run two routing configs simultaneously. Traffic is split deterministically (same prompt always goes to same variant). Compare cost, latency, and error rates between strategies.
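Deterministic assignment is typically done by hashing the prompt into a bucket; a sketch (the hash function and bucket count are assumptions, not necessarily mmrouter's):

```python
import hashlib

def variant(prompt: str, split: float = 0.5) -> str:
    """Same prompt -> same variant, with ~`split` of traffic going to treatment."""
    bucket = int(hashlib.sha256(prompt.encode()).hexdigest(), 16) % 10_000
    return "treatment" if bucket < split * 10_000 else "control"

assert variant("Hello") == variant("Hello")  # stable across calls and restarts
```

Hashing (rather than random choice) keeps assignments stable without storing per-prompt state.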
Prompt caching
Automatic cache_control annotation for Anthropic models. System prompts get cached server-side, reducing cost by up to 90% on cached input tokens. OpenAI caching is automatic and requires no annotation.
Alerting
Webhook notifications (works with Slack incoming webhooks) for cost spikes, high error rates, and budget warnings. Cooldown prevents alert spam.
Project structure
src/mmrouter/
  classifier/    # RuleClassifier, EmbeddingClassifier, LLMClassifier
  router/        # Engine, Config, Cascade, Budget, Fallback, Adaptive
  providers/     # LiteLLM wrapper (ProviderBase ABC), cache annotation
  tracker/       # SQLite logger, cost analytics
  eval/          # Accuracy eval, classifier comparison, LLM-as-judge quality
  server/        # OpenAI-compatible REST API (FastAPI)
  dashboard/     # Dashboard backend (FastAPI)
  experiments/   # A/B testing engine (store, traffic splitter)
  alerts/        # Alert rules, webhook/log channels
  cli.py         # Click entry point
  api.py         # Programmatic API
  models.py      # Shared data models (Pydantic)
configs/         # YAML routing configs (default, cascade, budget, multi-provider)
eval_data/       # Labeled test queries for eval
dashboard/       # React + Vite + Recharts SPA
tests/           # 501 tests (pytest)
Stack
- Python 3.11+, Click (CLI), FastAPI (server + dashboard)
- LiteLLM for multi-provider model access (pinned version, isolated behind ProviderBase)
- sentence-transformers for embedding classifier (MiniLM-L6-v2, runs locally)
- SQLite with WAL mode for request/cost/feedback/experiment tracking
- React + Vite + Recharts for the dashboard
- strictyaml for config parsing
- pytest (501 tests)
Running tests
pytest # all tests
pytest tests/test_classifier/ # classifier tests only
pytest tests/test_router/ # router tests only
pytest tests/test_server/ # REST API tests only
Hard rules
- API keys only via environment variables. Never in code, config, or logs.
- LiteLLM is isolated behind ProviderBase. Never imported outside providers/.
- Model names live in YAML config. Never hardcoded in routing logic.
- All LLM calls go through the Router. No direct provider calls from outside router/.
License
MIT
Project details
Download files
Source Distribution
Built Distribution
File details
Details for the file mmrouter-0.1.0.tar.gz.
File metadata
- Download URL: mmrouter-0.1.0.tar.gz
- Upload date:
- Size: 146.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d6e5a61d702aca3a1bcc1bd12b33da7ee60fccc34892ec09325ad72963689f58 |
| MD5 | 41f3a9f66d552827802d0edfe1266ee6 |
| BLAKE2b-256 | bb1014452021c37e107f71ef7ed923091ef64833aa8c6758f426ad7c929e830f |
File details
Details for the file mmrouter-0.1.0-py3-none-any.whl.
File metadata
- Download URL: mmrouter-0.1.0-py3-none-any.whl
- Upload date:
- Size: 64.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6faf063fb4e56ce5573413b1690804cf7e8ca3ad3254356607bc31b964a84ef1 |
| MD5 | dbb9099870481ddd8fe9e370bcfc7394 |
| BLAKE2b-256 | b000c43262ce8b0804cc5d642ce6162a53762341788ef4b4ff67f0d5fb60475a |