
mmrouter

Intelligent LLM request routing. A classifier analyzes each prompt and routes it to the right model: simple queries go to cheap/fast models (Haiku), complex reasoning goes to powerful ones (Opus), everything else to the balanced middle (Sonnet).

This is product-driven routing, not a "user picks a model" proxy. The system decides which model fits the task.

Why this exists

Most LLM applications send every request to the same model. That's wasteful. "What's the capital of France?" doesn't need Opus. A multi-step reasoning problem shouldn't go to Haiku.

The cost difference is real:

Model           Input (per 1M tokens)   Output (per 1M tokens)
Claude Haiku    $0.80                   $4.00
Claude Sonnet   $3.00                   $15.00
Claude Opus     $15.00                  $75.00

Haiku is 3.8x cheaper than Sonnet per request. Sonnet is 5x cheaper than Opus. If your traffic is mostly simple queries, routing saves real money.

Cost math

Assume 1,000 requests/day, 500 input + 1,000 output tokens each.

Scenario            Monthly cost   vs All-Sonnet   vs All-Opus
All-Opus            $2,475         +400%           baseline
All-Sonnet          $495           baseline        -80%
Routed (60/30/10)   $475           -4%             -81%
Routed (70/25/5)    $340           -31%            -86%

Where "60/30/10" means 60% simple (Haiku), 30% medium (Sonnet), 10% complex (Opus).

The savings depend entirely on your traffic mix. If most of your requests are simple lookups and straightforward tasks, routing pays off significantly. If your workload is mostly complex reasoning, you're sending most requests to Opus anyway and routing won't save much. The router also adds quality value: complex prompts get routed to more capable models instead of being handled by a cheaper one that might produce worse results.

Architecture

Prompt -> Classifier -> Router Engine -> Provider (LiteLLM) -> Response
                             |
                    +--------+--------+
                    |        |        |
              Circuit    Cascade   Budget
              Breaker    Routing   Manager
                    |        |        |
                    +--------+--------+
                             |
                   Tracker (SQLite) -> Alerts
                             |
                 +-----------+-----------+
                 |                       |
          Dashboard               REST API
       (FastAPI+React)       (OpenAI-compatible)

Classifier analyzes the prompt along two dimensions:

  • Complexity: simple, medium, complex
  • Category: factual, reasoning, creative, code

Router Engine maps (complexity, category) to a model using YAML config. Supports confidence-based escalation, cascade routing, budget constraints, adaptive reranking, and A/B testing.

Circuit Breaker tracks failures per model and per provider. After consecutive failures, routes to fallback models automatically.

Cascade Routing tries the cheapest model first. If the response fails a quality gate (too short, hedging phrases), escalates to a stronger model.

Budget Manager enforces daily spending limits. Dynamically downgrades model selection as spend approaches the limit.

Tracker logs every request to SQLite (WAL mode): model used, tokens, cost, latency, classification result.

Three classifier strategies

Strategy     Accuracy                                     Cost                   Speed   Notes
Rules        67% overall (78% complexity, 83% category)   Free                   <1ms    Pattern-matching heuristics. No dependencies.
Embeddings   78% overall (84% complexity, 90% category)   Free                   ~50ms   kNN on sentence-transformers (MiniLM-L6-v2). Needs pip install "mmrouter[embeddings]".
LLM          Not yet benchmarked                          API cost per request   ~1s     Uses a cheap model (Haiku) to classify before routing. Requires an API key.

The embedding classifier is the best default for production. Rules work fine for development and testing. Custom training on your own data is supported via mmrouter train.

Quickstart

1. Install

pip install mmrouter

For embedding classifier support:

pip install "mmrouter[embeddings]"

Requires Python 3.11+.

2. Set up

mmrouter init

Walks you through provider selection (Anthropic, OpenAI, or Google), checks your API key, and generates a routing config.

Or set up manually:

export ANTHROPIC_API_KEY=your-key    # or OPENAI_API_KEY, GOOGLE_API_KEY

3. Route your first prompt

mmrouter route "What is the capital of France?"

Output:

France's capital is Paris.

  Model:      claude-haiku-4-5-20251001
  Complexity: simple
  Category:   factual
  Confidence: 0.82
  Tokens:     12 in / 8 out
  Cost:       $0.000042
  Latency:    245ms

4. Check your savings

mmrouter stats --detailed

5. Try the REST API

mmrouter serve --port 8080

Drop-in replacement for the OpenAI API. Point your existing code at http://localhost:8080/v1 and the router handles model selection.

Common issues

Missing API key: Set the env var for your provider before routing.

export ANTHROPIC_API_KEY=your-key
export OPENAI_API_KEY=your-key      # if using OpenAI
export GOOGLE_API_KEY=your-key      # if using Google

Config not found: Run mmrouter init to generate a config, or pass a custom path with mmrouter -c path/to/config.yaml route "prompt".

Python version: mmrouter requires Python 3.11 or higher.

Configuration

Routing is configured in YAML. The default config is generated by mmrouter init or found at configs/default.yaml.

routes:
  simple:
    factual:
      model: claude-haiku-4-5-20251001
      fallbacks:
        - claude-sonnet-4-6
  ...

classifier:
  strategy: rules        # rules | embeddings | llm
  threshold: "0.7"       # confidence below this triggers escalation

provider:
  timeout_ms: 30000
  max_retries: 2
  ...

See configs/ for examples: default.yaml (Anthropic), openai.yaml (OpenAI), cascade.yaml (cascade routing), budget.yaml (budget-constrained).

CLI usage

Route a prompt

mmrouter route "What is the capital of France?"
# Paris.
# [simple/factual] model=claude-haiku-4-5-20251001 cost=$0.000042 latency=245ms tokens=12+8

mmrouter route "Analyze the trade-offs between microservices and monoliths" -v
# (detailed response)
# [complex/reasoning] model=claude-opus-4-6 cost=$0.024675 latency=2100ms tokens=45+320

Classify only (no API call)

mmrouter classify "Explain quantum entanglement"
# {
#   "complexity": "medium",
#   "category": "reasoning",
#   "confidence": 0.85
# }

mmrouter classify "Write a haiku about rain" --classifier embeddings

Run accuracy eval

mmrouter eval --classifier rules
# Overall:    67.0%  (80/120)
# Complexity: 78.0%
# Category:   83.0%

mmrouter eval --classifier embeddings
# Overall:    78.0%  (94/120)
# Complexity: 84.0%
# Category:   90.0%

Compare all classifiers

mmrouter compare
# Classifier   Overall  Complexity  Category   Time
# ----------   -------  ----------  --------   -----
# rules          67.0%       78.0%     83.0%  0.01s
# embeddings     78.0%       84.0%     90.0%  1.23s

Train custom embedding classifier

mmrouter train --data my_data.yaml --output models/custom --eval-split 0.2
# Loaded 500 examples
# Eval split: 400 train, 100 eval
# Training... done
# Eval accuracy: 82.0%
# Saved to models/custom/

# Use trained model:
mmrouter classify "prompt" --classifier embeddings --trained-model models/custom

Cost analytics

mmrouter stats
# Requests:      142
# Total cost:    $0.234500
# Avg latency:   380ms
# Tokens in/out: 15200/28400
# Fallbacks:     3

mmrouter stats --detailed
# (adds daily cost breakdown, savings vs all-Sonnet baseline,
#  distribution by complexity/category, cascade stats, budget status)

LLM-as-judge quality eval

mmrouter quality --sample 20
# Evaluates whether routed responses match baseline (all-Sonnet) quality.
# Reports score, relevance, accuracy, completeness deltas.

Feedback for adaptive routing

mmrouter feedback <request_id> up    # thumbs up
mmrouter feedback <request_id> down  # thumbs down

A/B testing

mmrouter experiment create --name "cascade-test" \
  --control configs/default.yaml \
  --treatment configs/cascade.yaml \
  --split 0.5

mmrouter experiment status
mmrouter experiment stop

Alerting

mmrouter alerts status
mmrouter alerts test --webhook-url https://hooks.slack.com/services/...

REST API server (OpenAI-compatible)

mmrouter serve --port 8080
# Starts OpenAI-compatible API at http://localhost:8080

# Use with any OpenAI SDK client:
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "auto", "messages": [{"role": "user", "content": "Hello"}]}'

# Or with the OpenAI Python SDK:
# client = OpenAI(base_url="http://localhost:8080/v1", api_key="any")
# response = client.chat.completions.create(model="auto", messages=[...])

model: "auto" triggers intelligent routing. Explicit model names bypass classification and go directly to the provider.

Classification metadata is returned in X-MMRouter-* response headers.
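
For example, with the OpenAI Python SDK (a sketch: with_raw_response is the SDK's standard way to read response headers, and only the X-MMRouter- prefix is specified above, not the individual header names):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="any")

# with_raw_response exposes the raw HTTP response, including headers.
raw = client.chat.completions.with_raw_response.create(
    model="auto",  # "auto" triggers routing; an explicit model name would bypass it
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
completion = raw.parse()
print(completion.choices[0].message.content)

# Classification metadata arrives in X-MMRouter-* headers (exact names not listed here).
for name, value in raw.headers.items():
    if name.lower().startswith("x-mmrouter-"):
        print(f"{name}: {value}")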

Auth: set MMROUTER_API_KEY env var to require Bearer token auth. Unset = no auth (local dev).

Dashboard

mmrouter dashboard --port 8000
# Starts FastAPI backend + serves React SPA at http://localhost:8000

Key features

Multi-provider failover

Cross-provider fallback chains with provider-level circuit breaker. If Anthropic is down, routes to OpenAI/Google automatically. See configs/multi-provider.yaml.
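
A minimal sketch of the breaker pattern (illustrative only; the threshold and cooldown values are assumptions, not mmrouter's actual defaults):

import time

FAILURE_THRESHOLD = 3    # hypothetical: consecutive failures before opening
COOLDOWN_SECONDS = 60.0  # hypothetical: how long an open circuit blocks a key

class CircuitBreaker:
    """Tracks consecutive failures per key (a model name or a provider name)."""

    def __init__(self):
        self.failures: dict[str, int] = {}
        self.opened_at: dict[str, float] = {}

    def available(self, key: str) -> bool:
        opened = self.opened_at.get(key)
        if opened is None:
            return True
        if time.monotonic() - opened >= COOLDOWN_SECONDS:
            # Half-open: cooldown expired, allow one attempt through again.
            del self.opened_at[key]
            self.failures[key] = 0
            return True
        return False

    def record(self, key: str, ok: bool) -> None:
        if ok:
            self.failures[key] = 0
        else:
            self.failures[key] = self.failures.get(key, 0) + 1
            if self.failures[key] >= FAILURE_THRESHOLD:
                self.opened_at[key] = time.monotonic()

Keying by provider as well as model is what lets a single upstream outage divert an entire fallback chain at once.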

Cascade routing

Try the cheapest model first. If the response fails a quality gate (too short, contains hedging phrases), automatically escalate to a stronger model. Saves cost when cheap models can handle the task. See configs/cascade.yaml.
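
A sketch of the quality-gate idea (illustrative: the length cutoff and hedging phrases are assumptions, not mmrouter's actual gate):

# Illustrative quality gate for cascade routing: reject responses that are
# suspiciously short or lead with hedging, then escalate to the next model.
MIN_LENGTH = 50  # hypothetical character cutoff
HEDGING_PHRASES = ("i'm not sure", "i cannot", "as an ai")  # hypothetical list

def passes_quality_gate(text: str) -> bool:
    if len(text.strip()) < MIN_LENGTH:
        return False
    lowered = text.lower()
    return not any(phrase in lowered for phrase in HEDGING_PHRASES)

def cascade(prompt: str, models: list[str], call) -> str:
    """Try models cheapest-first; escalate when the gate fails."""
    for model in models[:-1]:
        response = call(model, prompt)
        if passes_quality_gate(response):
            return response
    return call(models[-1], prompt)  # the strongest model's answer is final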

Budget mode

Set a daily spending limit. The router dynamically downgrades model selection as spend approaches the limit (a sketch follows the thresholds):

  • <75%: normal routing
  • 75-90%: warn (log only)
  • 90-100%: downgrade (complex->medium, medium->simple)
  • 100%+: force cheapest model or reject requests
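
A sketch of the downgrade logic, mirroring the thresholds above (the function and names are illustrative, not mmrouter's internals):

# Illustrative budget downgrade: map spend ratio to a routing adjustment.
def budget_action(spent: float, daily_limit: float) -> str:
    ratio = spent / daily_limit
    if ratio < 0.75:
        return "normal"       # route as usual
    if ratio < 0.90:
        return "warn"         # log a warning, keep routing normally
    if ratio < 1.00:
        return "downgrade"    # complex -> medium, medium -> simple
    return "cheapest"         # force cheapest model (or reject)

DOWNGRADE = {"complex": "medium", "medium": "simple", "simple": "simple"}

def effective_complexity(complexity: str, action: str) -> str:
    return DOWNGRADE[complexity] if action == "downgrade" else complexity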

Adaptive routing

Track user feedback (thumbs up/down) via API. The router learns which models perform best for each query type and reranks the fallback chain accordingly.
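
A sketch of what feedback-driven reranking can look like (illustrative scoring; mmrouter's actual weighting is not documented here):

# Illustrative reranking: score each model by net thumbs-up for a query type,
# then order the fallback chain by score (stable sort keeps config order on ties).
from collections import defaultdict

feedback = defaultdict(int)  # (model, category) -> upvotes minus downvotes

def record_feedback(model: str, category: str, thumbs_up: bool) -> None:
    feedback[(model, category)] += 1 if thumbs_up else -1

def rerank(chain: list[str], category: str) -> list[str]:
    return sorted(chain, key=lambda m: feedback[(m, category)], reverse=True)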

A/B testing

Run two routing configs simultaneously. Traffic is split deterministically (same prompt always goes to same variant). Compare cost, latency, and error rates between strategies.
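
Deterministic assignment can be as simple as hashing the prompt (a sketch, not mmrouter's actual splitter):

# Illustrative deterministic traffic split: hash the prompt so the same
# prompt always lands in the same variant, with no per-request state.
import hashlib

def assign_variant(prompt: str, split: float = 0.5) -> str:
    digest = hashlib.sha256(prompt.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "treatment" if bucket < split else "control"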

Prompt caching

Automatic cache_control annotation for Anthropic models. System prompts get cached server-side, reducing cost by up to 90% on cached input tokens. OpenAI caching is automatic and requires no annotation.
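
For Anthropic, the annotation is the Messages API's cache_control field on a content block. A sketch of what an annotated request body looks like (the model name and system text here are just examples):

# An Anthropic request once the system prompt is annotated for caching
# (cache_control per Anthropic's prompt-caching API).
request_body = {
    "model": "claude-haiku-4-5-20251001",
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": "You are a helpful assistant...",  # long, reusable system prompt
            "cache_control": {"type": "ephemeral"},    # marks this block cacheable
        }
    ],
    "messages": [{"role": "user", "content": "Hello"}],
}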

Alerting

Webhook notifications (works with Slack incoming webhooks) for cost spikes, high error rates, and budget warnings. Cooldown prevents alert spam.

Project structure

src/mmrouter/
  classifier/          # RuleClassifier, EmbeddingClassifier, LLMClassifier
  router/              # Engine, Config, Cascade, Budget, Fallback, Adaptive
  providers/           # LiteLLM wrapper (ProviderBase ABC), cache annotation
  tracker/             # SQLite logger, cost analytics
  eval/                # Accuracy eval, classifier comparison, LLM-as-judge quality
  server/              # OpenAI-compatible REST API (FastAPI)
  dashboard/           # Dashboard backend (FastAPI)
  experiments/         # A/B testing engine (store, traffic splitter)
  alerts/              # Alert rules, webhook/log channels
  cli.py               # Click entry point
  api.py               # Programmatic API
  models.py            # Shared data models (Pydantic)
configs/               # YAML routing configs (default, cascade, budget, multi-provider)
eval_data/             # Labeled test queries for eval
dashboard/             # React + Vite + Recharts SPA
tests/                 # 501 tests (pytest)

Stack

  • Python 3.11+, Click (CLI), FastAPI (server + dashboard)
  • LiteLLM for multi-provider model access (pinned version, isolated behind ProviderBase)
  • sentence-transformers for embedding classifier (MiniLM-L6-v2, runs locally)
  • SQLite with WAL mode for request/cost/feedback/experiment tracking
  • React + Vite + Recharts for the dashboard
  • strictyaml for config parsing
  • pytest (501 tests)

Running tests

pytest                              # all tests
pytest tests/test_classifier/       # classifier tests only
pytest tests/test_router/           # router tests only
pytest tests/test_server/           # REST API tests only

Hard rules

  • API keys only via environment variables. Never in code, config, or logs.
  • LiteLLM is isolated behind ProviderBase. Never imported outside providers/.
  • Model names live in YAML config. Never hardcoded in routing logic.
  • All LLM calls go through the Router. No direct provider calls from outside router/.

License

MIT
