
mmrouter

Intelligent LLM request routing. A classifier analyzes each prompt and routes it to the right model: simple queries go to cheap/fast models (Haiku), complex reasoning goes to powerful ones (Opus), everything else to the balanced middle (Sonnet).

This is product-driven routing, not a "user picks a model" proxy. The system decides which model fits the task.

Why this exists

Most LLM applications send every request to the same model. That's wasteful. "What's the capital of France?" doesn't need Opus. A multi-step reasoning problem shouldn't go to Haiku.

The cost difference is real:

Model           Input (per 1M tokens)   Output (per 1M tokens)
Claude Haiku    $0.80                   $4.00
Claude Sonnet   $3.00                   $15.00
Claude Opus     $15.00                  $75.00

Haiku is 3.8x cheaper than Sonnet per request. Sonnet is 5x cheaper than Opus. If your traffic is mostly simple queries, routing saves real money.

Cost math

Assume 1,000 requests/day, 500 input + 1,000 output tokens each.

Scenario            Monthly cost   vs All-Sonnet   vs All-Opus
All-Opus            $2,475         +400%           baseline
All-Sonnet          $495           baseline        -80%
Routed (60/30/10)   $475           -4%             -81%
Routed (70/25/5)    $340           -31%            -86%

Where "60/30/10" means 60% simple (Haiku), 30% medium (Sonnet), 10% complex (Opus).

The savings depend entirely on your traffic mix. If most of your requests are simple lookups and straightforward tasks, routing pays off significantly. If your workload is mostly complex reasoning, you're sending most requests to Opus anyway and routing won't save much. The router also adds quality value: complex prompts get routed to more capable models instead of being handled by a cheaper one that might produce worse results.

Architecture

Prompt -> Classifier -> Router Engine -> Provider (LiteLLM) -> Response
                             |
                    +--------+--------+
                    |        |        |
              Circuit    Cascade   Budget
              Breaker    Routing   Manager
                    |        |        |
                    +--------+--------+
                             |
                   Tracker (SQLite) -> Alerts
                             |
                 +-----------+-----------+
                 |                       |
          Dashboard               REST API
       (FastAPI+React)       (OpenAI-compatible)

Classifier analyzes the prompt along two dimensions:

  • Complexity: simple, medium, complex
  • Category: factual, reasoning, creative, code

Router Engine maps (complexity, category) to a model using YAML config. Supports confidence-based escalation, cascade routing, budget constraints, adaptive reranking, and A/B testing.

Circuit Breaker tracks failures per model and per provider. After consecutive failures, routes to fallback models automatically.

Cascade Routing tries the cheapest model first. If the response fails a quality gate (too short, hedging phrases), escalates to a stronger model.

Budget Manager enforces daily spending limits. Dynamically downgrades model selection as spend approaches the limit.

Tracker logs every request to SQLite (WAL mode): model used, tokens, cost, latency, classification result.

Three classifier strategies

Strategy     Accuracy                                     Cost                   Speed   Notes
Rules        67% overall (78% complexity, 83% category)   Free                   <1ms    Pattern-matching heuristics. No dependencies.
Embeddings   78% overall (84% complexity, 90% category)   Free                   ~50ms   kNN on sentence-transformers (MiniLM-L6-v2). Needs pip install "mmrouter[embeddings]".
LLM          Not yet benchmarked                          API cost per request   ~1s     Uses a cheap model (Haiku) to classify before routing. Requires an API key.

The embedding classifier is the best default for production. Rules work fine for development and testing. Custom training on your own data is supported via mmrouter train.

Quickstart

1. Install

pip install mmrouter

For embedding classifier support:

pip install "mmrouter[embeddings]"

Requires Python 3.11+.

2. Set up

mmrouter init

Walks you through provider selection (Anthropic, OpenAI, or Google), checks your API key, and generates a routing config.

Or set up manually:

export ANTHROPIC_API_KEY=your-key    # or OPENAI_API_KEY, GOOGLE_API_KEY

3. Route your first prompt

mmrouter route "What is the capital of France?"

Output:

France's capital is Paris.

  Model:      claude-haiku-4-5-20251001
  Complexity: simple
  Category:   factual
  Confidence: 0.82
  Tokens:     12 in / 8 out
  Cost:       $0.000042
  Latency:    245ms

4. Check your savings

mmrouter stats --detailed

5. Try the REST API

mmrouter serve --port 8080

Drop-in replacement for the OpenAI API. Point your existing code at http://localhost:8080/v1 and the router handles model selection.

Common issues

Missing API key: Set the env var for your provider before routing.

export ANTHROPIC_API_KEY=your-key
export OPENAI_API_KEY=your-key      # if using OpenAI
export GOOGLE_API_KEY=your-key      # if using Google

Config not found: Run mmrouter init to generate a config, or pass a custom path with mmrouter -c path/to/config.yaml route "prompt".

Python version: mmrouter requires Python 3.11 or higher.

Configuration

Routing is configured in YAML. The default config is generated by mmrouter init or found at configs/default.yaml.

routes:
  simple:
    factual:
      model: claude-haiku-4-5-20251001
      fallbacks:
        - claude-sonnet-4-6
  ...

classifier:
  strategy: rules        # rules | embeddings | llm
  threshold: "0.7"       # confidence below this triggers escalation

provider:
  timeout_ms: 30000
  max_retries: 2
  ...

See configs/ for examples: default.yaml (Anthropic), openai.yaml (OpenAI), cascade.yaml (cascade routing), budget.yaml (budget-constrained).

CLI usage

Route a prompt

mmrouter route "What is the capital of France?"
# Paris.
# [simple/factual] model=claude-haiku-4-5-20251001 cost=$0.000042 latency=245ms tokens=12+8

mmrouter route "Analyze the trade-offs between microservices and monoliths" -v
# (detailed response)
# [complex/reasoning] model=claude-opus-4-6 cost=$0.024675 latency=2100ms tokens=45+320

Classify only (no API call)

mmrouter classify "Explain quantum entanglement"
# {
#   "complexity": "medium",
#   "category": "reasoning",
#   "confidence": 0.85
# }

mmrouter classify "Write a haiku about rain" --classifier embeddings

Run accuracy eval

mmrouter eval --classifier rules
# Overall:    67.0%  (80/120)
# Complexity: 78.0%
# Category:   83.0%

mmrouter eval --classifier embeddings
# Overall:    78.0%  (94/120)
# Complexity: 84.0%
# Category:   90.0%

Compare all classifiers

mmrouter compare
# Classifier   Overall  Complexity  Category   Time
# ----------   -------  ----------  --------   -----
# rules          67.0%       78.0%     83.0%  0.01s
# embeddings     78.0%       84.0%     90.0%  1.23s

Train custom embedding classifier

mmrouter train --data my_data.yaml --output models/custom --eval-split 0.2
# Loaded 500 examples
# Eval split: 400 train, 100 eval
# Training... done
# Eval accuracy: 82.0%
# Saved to models/custom/

# Use trained model:
mmrouter classify "prompt" --classifier embeddings --trained-model models/custom

Cost analytics

mmrouter stats
# Requests:      142
# Total cost:    $0.234500
# Avg latency:   380ms
# Tokens in/out: 15200/28400
# Fallbacks:     3

mmrouter stats --detailed
# (adds daily cost breakdown, savings vs all-Sonnet baseline,
#  distribution by complexity/category, cascade stats, budget status)

LLM-as-judge quality eval

mmrouter quality --sample 20
# Evaluates whether routed responses match baseline (all-Sonnet) quality.
# Reports score, relevance, accuracy, completeness deltas.

Feedback for adaptive routing

mmrouter feedback <request_id> up    # thumbs up
mmrouter feedback <request_id> down  # thumbs down

A/B testing

mmrouter experiment create --name "cascade-test" \
  --control configs/default.yaml \
  --treatment configs/cascade.yaml \
  --split 0.5

mmrouter experiment status
mmrouter experiment stop

Alerting

mmrouter alerts status
mmrouter alerts test --webhook-url https://hooks.slack.com/services/...

REST API server (OpenAI-compatible)

mmrouter serve --port 8080
# Starts OpenAI-compatible API at http://localhost:8080

# Use with any OpenAI SDK client:
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "auto", "messages": [{"role": "user", "content": "Hello"}]}'

# Or with the OpenAI Python SDK:
# client = OpenAI(base_url="http://localhost:8080/v1", api_key="any")
# response = client.chat.completions.create(model="auto", messages=[...])

model: "auto" triggers intelligent routing. Explicit model names bypass classification and go directly to the provider.

Classification metadata is returned in X-MMRouter-* response headers.
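
For example, with the OpenAI Python SDK (a sketch: with_raw_response is the SDK's standard way to read response headers, and only the X-MMRouter- prefix is specified above, not the individual header names):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="any")

# with_raw_response exposes the raw HTTP response, including headers.
raw = client.chat.completions.with_raw_response.create(
    model="auto",  # "auto" triggers routing; an explicit model name would bypass it
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
completion = raw.parse()
print(completion.choices[0].message.content)

# Classification metadata arrives in X-MMRouter-* headers (exact names not listed here).
for name, value in raw.headers.items():
    if name.lower().startswith("x-mmrouter-"):
        print(f"{name}: {value}")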

Auth: set MMROUTER_API_KEY env var to require Bearer token auth. Unset = no auth (local dev).

Dashboard

mmrouter dashboard --port 8000
# Starts FastAPI backend + serves React SPA at http://localhost:8000

Key features

Multi-provider failover

Cross-provider fallback chains with provider-level circuit breaker. If Anthropic is down, routes to OpenAI/Google automatically. See configs/multi-provider.yaml.
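
A minimal sketch of the breaker pattern (illustrative only; the threshold and cooldown values are assumptions, not mmrouter's actual defaults):

import time

FAILURE_THRESHOLD = 3    # hypothetical: consecutive failures before opening
COOLDOWN_SECONDS = 60.0  # hypothetical: how long an open circuit blocks a key

class CircuitBreaker:
    """Tracks consecutive failures per key (a model name or a provider name)."""

    def __init__(self):
        self.failures: dict[str, int] = {}
        self.opened_at: dict[str, float] = {}

    def available(self, key: str) -> bool:
        opened = self.opened_at.get(key)
        if opened is None:
            return True
        if time.monotonic() - opened >= COOLDOWN_SECONDS:
            # Half-open: cooldown expired, allow one attempt through again.
            del self.opened_at[key]
            self.failures[key] = 0
            return True
        return False

    def record(self, key: str, ok: bool) -> None:
        if ok:
            self.failures[key] = 0
        else:
            self.failures[key] = self.failures.get(key, 0) + 1
            if self.failures[key] >= FAILURE_THRESHOLD:
                self.opened_at[key] = time.monotonic()

Keying by provider as well as model is what lets a single upstream outage divert an entire fallback chain at once.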

Cascade routing

Try the cheapest model first. If the response fails a quality gate (too short, contains hedging phrases), automatically escalate to a stronger model. Saves cost when cheap models can handle the task. See configs/cascade.yaml.
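
A sketch of the quality-gate idea (illustrative: the length cutoff and hedging phrases are assumptions, not mmrouter's actual gate):

# Illustrative quality gate for cascade routing: reject responses that are
# suspiciously short or lead with hedging, then escalate to the next model.
MIN_LENGTH = 50  # hypothetical character cutoff
HEDGING_PHRASES = ("i'm not sure", "i cannot", "as an ai")  # hypothetical list

def passes_quality_gate(text: str) -> bool:
    if len(text.strip()) < MIN_LENGTH:
        return False
    lowered = text.lower()
    return not any(phrase in lowered for phrase in HEDGING_PHRASES)

def cascade(prompt: str, models: list[str], call) -> str:
    """Try models cheapest-first; escalate when the gate fails."""
    for model in models[:-1]:
        response = call(model, prompt)
        if passes_quality_gate(response):
            return response
    return call(models[-1], prompt)  # the strongest model's answer is final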

Budget mode

Set a daily spending limit. The router dynamically downgrades model selection as spend approaches the limit (a sketch follows the thresholds):

  • <75%: normal routing
  • 75-90%: warn (log only)
  • 90-100%: downgrade (complex->medium, medium->simple)
  • 100%+: force cheapest model or reject requests
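
A sketch of the downgrade logic, mirroring the thresholds above (the function and names are illustrative, not mmrouter's internals):

# Illustrative budget downgrade: map spend ratio to a routing adjustment.
def budget_action(spent: float, daily_limit: float) -> str:
    ratio = spent / daily_limit
    if ratio < 0.75:
        return "normal"       # route as usual
    if ratio < 0.90:
        return "warn"         # log a warning, keep routing normally
    if ratio < 1.00:
        return "downgrade"    # complex -> medium, medium -> simple
    return "cheapest"         # force cheapest model (or reject)

DOWNGRADE = {"complex": "medium", "medium": "simple", "simple": "simple"}

def effective_complexity(complexity: str, action: str) -> str:
    return DOWNGRADE[complexity] if action == "downgrade" else complexity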

Adaptive routing

Track user feedback (thumbs up/down) via API. The router learns which models perform best for each query type and reranks the fallback chain accordingly.
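
A sketch of what feedback-driven reranking can look like (illustrative scoring; mmrouter's actual weighting is not documented here):

# Illustrative reranking: score each model by net thumbs-up for a query type,
# then order the fallback chain by score (stable sort keeps config order on ties).
from collections import defaultdict

feedback = defaultdict(int)  # (model, category) -> upvotes minus downvotes

def record_feedback(model: str, category: str, thumbs_up: bool) -> None:
    feedback[(model, category)] += 1 if thumbs_up else -1

def rerank(chain: list[str], category: str) -> list[str]:
    return sorted(chain, key=lambda m: feedback[(m, category)], reverse=True)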

A/B testing

Run two routing configs simultaneously. Traffic is split deterministically (same prompt always goes to same variant). Compare cost, latency, and error rates between strategies.
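
Deterministic assignment can be as simple as hashing the prompt (a sketch, not mmrouter's actual splitter):

# Illustrative deterministic traffic split: hash the prompt so the same
# prompt always lands in the same variant, with no per-request state.
import hashlib

def assign_variant(prompt: str, split: float = 0.5) -> str:
    digest = hashlib.sha256(prompt.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "treatment" if bucket < split else "control"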

Prompt caching

Automatic cache_control annotation for Anthropic models. System prompts get cached server-side, reducing cost by up to 90% on cached input tokens. OpenAI caching is automatic and requires no annotation.
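
For Anthropic, the annotation is the Messages API's cache_control field on a content block. A sketch of what an annotated request body looks like (the model name and system text here are just examples):

# An Anthropic request once the system prompt is annotated for caching
# (cache_control per Anthropic's prompt-caching API).
request_body = {
    "model": "claude-haiku-4-5-20251001",
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": "You are a helpful assistant...",  # long, reusable system prompt
            "cache_control": {"type": "ephemeral"},    # marks this block cacheable
        }
    ],
    "messages": [{"role": "user", "content": "Hello"}],
}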

Alerting

Webhook notifications (works with Slack incoming webhooks) for cost spikes, high error rates, and budget warnings. Cooldown prevents alert spam.

Project structure

src/mmrouter/
  classifier/          # RuleClassifier, EmbeddingClassifier, LLMClassifier
  router/              # Engine, Config, Cascade, Budget, Fallback, Adaptive
  providers/           # LiteLLM wrapper (ProviderBase ABC), cache annotation
  tracker/             # SQLite logger, cost analytics
  eval/                # Accuracy eval, classifier comparison, LLM-as-judge quality
  server/              # OpenAI-compatible REST API (FastAPI)
  dashboard/           # Dashboard backend (FastAPI)
  experiments/         # A/B testing engine (store, traffic splitter)
  alerts/              # Alert rules, webhook/log channels
  cli.py               # Click entry point
  api.py               # Programmatic API
  models.py            # Shared data models (Pydantic)
configs/               # YAML routing configs (default, cascade, budget, multi-provider)
eval_data/             # Labeled test queries for eval
dashboard/             # React + Vite + Recharts SPA
tests/                 # 501 tests (pytest)

Stack

  • Python 3.11+, Click (CLI), FastAPI (server + dashboard)
  • LiteLLM for multi-provider model access (pinned version, isolated behind ProviderBase)
  • sentence-transformers for embedding classifier (MiniLM-L6-v2, runs locally)
  • SQLite with WAL mode for request/cost/feedback/experiment tracking
  • React + Vite + Recharts for the dashboard
  • strictyaml for config parsing
  • pytest (501 tests)

Running tests

pytest                              # all tests
pytest tests/test_classifier/       # classifier tests only
pytest tests/test_router/           # router tests only
pytest tests/test_server/           # REST API tests only

Hard rules

  • API keys only via environment variables. Never in code, config, or logs.
  • LiteLLM is isolated behind ProviderBase. Never imported outside providers/.
  • Model names live in YAML config. Never hardcoded in routing logic.
  • All LLM calls go through the Router. No direct provider calls from outside router/.

License

MIT
