# npc-mom-router

Plug-and-play Mixture-of-Models router. Route easy requests to cheap models, hard ones to specialists, and track real cost savings.
## Why Mixture-of-Models routing?
Most LLM workloads are not uniformly hard. Simple lookups, format conversions, and short factual questions can be answered accurately by a small, fast model at a fraction of the cost of a frontier model. Routing requests intelligently based on complexity lets you serve the same quality of answers at significantly lower cost—without changing your application's interface.
npc-mom-router sits between your application and your model backends. It classifies each incoming request as fast or heavy, dispatches to the appropriate backend, and records the token usage and dollar cost of every call. The ledger computes how much you saved compared to always routing to the heavy backend, so you can quantify the benefit in real dollars.
## Install

```bash
pip install npc-mom-router
```
## 30-second quickstart
```python
from npc_mom_router import MoMClient, BackendConfig, ZeroShotRouter

router = ZeroShotRouter(
    base_url="https://api.groq.com/openai/v1",
    api_key="YOUR_GROQ_KEY",
    model="llama-3.1-8b-instant",
)

client = MoMClient(
    router=router,
    backends={
        "fast": BackendConfig(
            kind="oai_compat",
            base_url="https://api.groq.com/openai/v1",
            api_key="YOUR_GROQ_KEY",
            model="llama-3.3-70b-versatile",
            cost_per_1m_input=0.59,
            cost_per_1m_output=0.79,
        ),
        "heavy": BackendConfig(
            kind="anthropic",
            api_key="YOUR_ANTHROPIC_KEY",
            model="claude-sonnet-4-5",
            cost_per_1m_input=3.0,
            cost_per_1m_output=15.0,
        ),
    },
)

resp = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "What's the capital of France?"}],
)

print(f"Route: {resp._mom.route} ({resp._mom.reason})")
print(f"Answer: {resp.choices[0].message.content}")
print(f"Cost: ${resp._mom.cost_usd:.6f}")
```
## NPC Fast router (local vLLM)

Run a tiny routing model locally for sub-10ms, zero-marginal-cost classification:
```python
from npc_mom_router import MoMClient, BackendConfig, NPCFastRouter

router = NPCFastRouter(
    base_url="http://localhost:8001/v1",
    model="npc-fast-1.7b",
)

client = MoMClient(
    router=router,
    backends={
        "fast": BackendConfig(
            kind="vllm",
            base_url="http://localhost:8000/v1",
            api_key="placeholder",
            model="Qwen/Qwen2.5-7B-Instruct",
            cost_per_1m_input=0.05,
            cost_per_1m_output=0.10,
        ),
        "heavy": BackendConfig(
            kind="openai",
            api_key="YOUR_OPENAI_KEY",
            model="gpt-4o",
            cost_per_1m_input=2.50,
            cost_per_1m_output=10.00,
        ),
    },
)

result = client.route_and_complete(
    [{"role": "user", "content": "Explain the transformer architecture in depth."}]
)
print(result.decision.route, result.cost_entry.usd)
```
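This example assumes a vLLM server is already serving the routing model on port 8001 (and the fast backend on port 8000). Assuming `npc-fast-1.7b` resolves to a checkpoint vLLM can load (the actual repository id isn't specified here), launching the router model might look like:

```bash
vllm serve npc-fast-1.7b --port 8001
```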
## Async client
```python
import asyncio

from npc_mom_router import AsyncMoMClient, BackendConfig, ZeroShotRouter

# ... same setup as above, just use AsyncMoMClient ...

async def main():
    resp = await client.chat.completions.create(
        model="auto",
        messages=[{"role": "user", "content": "List the G7 countries."}],
    )
    print(resp._mom.route, resp._mom.cost_usd)

asyncio.run(main())
```
## Cost tracking
Every request is logged to an in-memory ledger. The ledger re-prices fast-routed requests as if they had hit the heavy backend to compute counterfactual savings.
```python
s = client.ledger.summary()
# {
#     "total_requests": 100,
#     "fast_requests": 73,
#     "heavy_requests": 27,
#     "total_cost_usd": 0.0412,
#     "savings_vs_always_heavy_usd": 0.3891
# }

client.ledger.dump("ledger.json")  # writes full per-request JSON
```
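The savings figure is a simple counterfactual re-pricing. Here is a minimal sketch of the math, assuming the ledger stores per-request token counts and the route taken (the entry and field names below are illustrative, not the library's API):

```python
# Illustrative only: entry/field names are assumptions, not npc-mom-router's API.
def counterfactual_savings(entries, fast_cfg, heavy_cfg):
    def price(cfg, tokens_in, tokens_out):
        # Dollar cost derived from the per-1M-token rates on BackendConfig.
        return (tokens_in * cfg.cost_per_1m_input
                + tokens_out * cfg.cost_per_1m_output) / 1_000_000

    savings = 0.0
    for e in entries:
        if e.route == "fast":
            # What this request would have cost on the heavy backend,
            # minus what it actually cost on the fast one.
            savings += (price(heavy_cfg, e.tokens_in, e.tokens_out)
                        - price(fast_cfg, e.tokens_in, e.tokens_out))
    return savings
```

Note that this treats the fast route's output token count as a stand-in for what the heavy model would have generated, so the savings figure is an estimate.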
## Backend reference
| `kind` | Description | Default base URL |
|---|---|---|
| `oai_compat` | Any OpenAI-compatible API | Required |
| `openai` | OpenAI (api.openai.com) | `https://api.openai.com/v1` |
| `anthropic` | Anthropic (native SDK) | `https://api.anthropic.com` |
| `groq` | Groq (OAI-compat) | `https://api.groq.com/openai/v1` |
| `vllm` | Local vLLM server (OAI-compat) | `http://localhost:8000/v1` |
Each `BackendConfig` takes `cost_per_1m_input` and `cost_per_1m_output` (USD per million tokens) for cost tracking.
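For example, with the heavy backend above (`cost_per_1m_input=3.0`, `cost_per_1m_output=15.0`), a call that uses 1,200 input tokens and 300 output tokens is billed 1,200/1,000,000 × $3.00 + 300/1,000,000 × $15.00 = $0.0036 + $0.0045 = $0.0081.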
## Router reference
| Router | How it works |
|---|---|
| `ZeroShotRouter` | Prompts any OAI-compat model; parses JSON response |
| `NPCFastRouter` | Calls a local vLLM endpoint; sub-10ms routing |
Both routers return `RoutingDecision(route="fast"|"heavy", reason="...")`. On any failure or malformed response, they fall back to `heavy` to preserve correctness.

Custom routers: implement `route(messages) -> RoutingDecision` and `async_route(messages) -> RoutingDecision`, as in the sketch below.
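For instance, a toy length-based router might look like this (a sketch only; the `RoutingDecision` import path and keyword arguments are assumed from the signature shown above):

```python
from npc_mom_router import RoutingDecision  # import path assumed

class LengthRouter:
    """Routes on prompt length: a crude heuristic, purely illustrative."""

    def __init__(self, threshold_chars: int = 200):
        self.threshold = threshold_chars

    def route(self, messages) -> RoutingDecision:
        total = sum(len(m.get("content", "")) for m in messages)
        if total <= self.threshold:
            return RoutingDecision(route="fast", reason=f"{total} chars <= {self.threshold}")
        return RoutingDecision(route="heavy", reason=f"{total} chars > {self.threshold}")

    async def async_route(self, messages) -> RoutingDecision:
        return self.route(messages)
```

You would then pass an instance as `router=LengthRouter()` when constructing `MoMClient`.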
## Roadmap
Support for streaming responses, per-model latency tracking, a pluggable cost-model registry, and a simple CLI dashboard are planned for v0.2. Pull requests welcome.
## License
Apache 2.0 — see LICENSE. Copyright 2026 Rama Krishna Bachu.