
Project description

npc-mom-router

License: Apache 2.0 · Python 3.10+

Plug-and-play Mixture-of-Models router. Route cheap requests to cheap models, expensive ones to specialists, and track real cost savings.

Why Mixture-of-Models routing?

Most LLM workloads are not uniformly hard. Simple lookups, format conversions, and short factual questions can be answered accurately by a small, fast model at a fraction of the cost of a frontier model. Routing requests intelligently based on complexity lets you serve the same quality of answers at significantly lower cost, without changing your application's interface.

npc-mom-router sits between your application and your model backends. It classifies each incoming request as fast or heavy, dispatches to the appropriate backend, and records the token usage and dollar cost of every call. The ledger computes how much you saved compared to always routing to the heavy backend, so you can quantify the benefit in real dollars.

Install

pip install npc-mom-router

30-second quickstart

from npc_mom_router import MoMClient, BackendConfig, ZeroShotRouter

router = ZeroShotRouter(
    base_url="https://api.groq.com/openai/v1",
    api_key="YOUR_GROQ_KEY",
    model="llama-3.1-8b-instant",
)

client = MoMClient(
    router=router,
    backends={
        "fast": BackendConfig(
            kind="oai_compat",
            base_url="https://api.groq.com/openai/v1",
            api_key="YOUR_GROQ_KEY",
            model="llama-3.3-70b-versatile",
            cost_per_1m_input=0.59,
            cost_per_1m_output=0.79,
        ),
        "heavy": BackendConfig(
            kind="anthropic",
            api_key="YOUR_ANTHROPIC_KEY",
            model="claude-sonnet-4-5",
            cost_per_1m_input=3.0,
            cost_per_1m_output=15.0,
        ),
    },
)

resp = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "What's the capital of France?"}],
)
print(f"Route: {resp._mom.route} ({resp._mom.reason})")
print(f"Answer: {resp.choices[0].message.content}")
print(f"Cost: ${resp._mom.cost_usd:.6f}")

NPC Fast router (local vLLM)

Run a tiny routing model locally for low-latency, near-zero-cost classification:

from npc_mom_router import MoMClient, BackendConfig, NPCFastRouter

router = NPCFastRouter(
    base_url="http://localhost:8001/v1",
    model="npc-fast-1.7b",
)

client = MoMClient(
    router=router,
    backends={
        "fast": BackendConfig(
            kind="vllm",
            base_url="http://localhost:8000/v1",
            api_key="placeholder",
            model="Qwen/Qwen2.5-7B-Instruct",
            cost_per_1m_input=0.05,
            cost_per_1m_output=0.10,
        ),
        "heavy": BackendConfig(
            kind="openai",
            api_key="YOUR_OPENAI_KEY",
            model="gpt-4o",
            cost_per_1m_input=2.50,
            cost_per_1m_output=10.00,
        ),
    },
)

result = client.route_and_complete(
    [{"role": "user", "content": "Explain the transformer architecture in depth."}]
)
print(result.decision.route, result.cost_entry.usd)

Async client

import asyncio
from npc_mom_router import AsyncMoMClient, BackendConfig, ZeroShotRouter

# Build the router and backends exactly as in the quickstart above;
# the only change is the client class:
client = AsyncMoMClient(router=router, backends=backends)

async def main():
    resp = await client.chat.completions.create(
        model="auto",
        messages=[{"role": "user", "content": "List the G7 countries."}],
    )
    print(resp._mom.route, resp._mom.cost_usd)

asyncio.run(main())

Cost tracking

Every request is logged to an in-memory ledger. The ledger re-prices fast-routed requests as if they had hit the heavy backend to compute counterfactual savings.

s = client.ledger.summary()
# {
#   "total_requests": 100,
#   "fast_requests": 73,
#   "heavy_requests": 27,
#   "total_cost_usd": 0.0412,
#   "savings_vs_always_heavy_usd": 0.3891
# }

client.ledger.dump("ledger.json")  # writes full per-request JSON
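The counterfactual savings figure can be sketched in plain Python. This is an illustrative re-implementation, not the library's internals; the entry field names and the heavy-backend rates (taken from the quickstart) are assumptions:

```python
def savings_vs_always_heavy(entries, heavy_in_per_1m, heavy_out_per_1m):
    """Re-price fast-routed requests at the heavy backend's rates and
    sum the difference against what was actually paid."""
    saved = 0.0
    for e in entries:
        if e["route"] != "fast":
            continue  # heavy-routed requests save nothing by definition
        counterfactual = (
            e["input_tokens"] / 1e6 * heavy_in_per_1m
            + e["output_tokens"] / 1e6 * heavy_out_per_1m
        )
        saved += counterfactual - e["cost_usd"]
    return saved

# One fast-routed request: 1,000 input / 500 output tokens, actual cost $0.00098.
entries = [{"route": "fast", "input_tokens": 1000,
            "output_tokens": 500, "cost_usd": 0.00098}]
print(round(savings_vs_always_heavy(entries, 3.0, 15.0), 6))
```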

Backend reference

kind        Description                       Default base URL
oai_compat  Any OpenAI-compatible API         none (base_url required)
openai      OpenAI (api.openai.com)           https://api.openai.com/v1
anthropic   Anthropic (native SDK)            https://api.anthropic.com
groq        Groq (OpenAI-compatible)          https://api.groq.com/openai/v1
vllm        Local vLLM server (OAI-compat)    http://localhost:8000/v1

Each BackendConfig takes cost_per_1m_input and cost_per_1m_output (USD) for cost tracking.
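The per-request charge follows the usual per-million-token arithmetic; a minimal sketch (the function and parameter names here are illustrative, not the library's internals):

```python
def request_cost_usd(input_tokens, output_tokens,
                     cost_per_1m_input, cost_per_1m_output):
    # Price each direction at its own per-million-token rate.
    return (input_tokens / 1e6 * cost_per_1m_input
            + output_tokens / 1e6 * cost_per_1m_output)

# 2,000 prompt tokens + 300 completion tokens at the heavy rates above:
print(round(request_cost_usd(2000, 300, 3.0, 15.0), 6))  # 0.0105
```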

Router reference

Router          How it works
ZeroShotRouter  Prompts any OpenAI-compatible model and parses its JSON response
NPCFastRouter   Calls a local vLLM endpoint; sub-10ms routing decisions

Both routers return RoutingDecision(route="fast"|"heavy", reason="..."). On any failure or malformed response, they fall back to heavy to preserve correctness.

Custom routers: implement route(messages) -> RoutingDecision and async_route(messages) -> RoutingDecision.
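A custom router only needs those two methods. The sketch below uses a crude prompt-length heuristic; the `RoutingDecision` dataclass is a stand-in so the snippet runs standalone (in practice you would import it from npc_mom_router), and the 280-character threshold is purely illustrative:

```python
from dataclasses import dataclass

@dataclass
class RoutingDecision:  # stand-in for npc_mom_router.RoutingDecision
    route: str   # "fast" or "heavy"
    reason: str

class LengthHeuristicRouter:
    """Route short prompts to the fast backend, long ones to heavy."""

    def __init__(self, max_fast_chars=280):
        self.max_fast_chars = max_fast_chars

    def route(self, messages):
        try:
            total = sum(len(m.get("content", "")) for m in messages)
            if total <= self.max_fast_chars:
                return RoutingDecision("fast", f"{total} chars <= {self.max_fast_chars}")
            return RoutingDecision("heavy", f"{total} chars > {self.max_fast_chars}")
        except Exception as exc:
            # Mirror the built-in routers: any failure falls back to heavy.
            return RoutingDecision("heavy", f"router error: {exc}")

    async def async_route(self, messages):
        return self.route(messages)

router = LengthHeuristicRouter()
print(router.route([{"role": "user", "content": "Capital of France?"}]).route)  # fast
```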

Roadmap

Streaming responses, per-model latency tracking, a pluggable cost-model registry, and a simple CLI dashboard are planned for v0.2. Pull requests welcome.

License

Apache 2.0 — see LICENSE. Copyright 2026 Rama Krishna Bachu.

Download files

Download the file for your platform.

Source Distribution

npc_mom_router-0.1.0.tar.gz (16.6 kB)

Uploaded Source

Built Distribution


npc_mom_router-0.1.0-py3-none-any.whl (19.5 kB)

Uploaded Python 3

File details

Details for the file npc_mom_router-0.1.0.tar.gz.

File metadata

  • Download URL: npc_mom_router-0.1.0.tar.gz
  • Upload date:
  • Size: 16.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.11.9 {"installer":{"name":"uv","version":"0.11.9","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for npc_mom_router-0.1.0.tar.gz
Algorithm Hash digest
SHA256 41cc18fca4e8f4b682898fd9f5712f65ce042edd0049cde4624fd095a91de7ce
MD5 5661ecb4f9a0725488b1aafd872c3353
BLAKE2b-256 c69196f56712afbfa338a78b79793ed00992113c0c92cca11ed97243037746ed


File details

Details for the file npc_mom_router-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: npc_mom_router-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 19.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.11.9 {"installer":{"name":"uv","version":"0.11.9","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for npc_mom_router-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 5799555f51d4f5929cc265df24be54f26a9f3ddef19ebb51f14cb73569fcdbfe
MD5 124313a63a9f07913fbc9ef75e1ff589
BLAKE2b-256 c830b17f1e77ea861135a40ffc8270876a15382bc6caf3a63e0971e75e703799

