
Pluggable LLM inference routing SDK — bring any model from any provider


inference-router

A pluggable Python SDK for intelligent LLM inference routing. Route requests across model tiers based on complexity, cost, and latency — without changing your application code.

from llm_inference_router import InferenceRouter
from llm_inference_router.providers.bedrock import BedrockProvider
from llm_inference_router.strategies import ComplexityStrategy

router = InferenceRouter(
    tiers={
        "fast":     BedrockProvider("us.anthropic.claude-haiku-4-5-20251001-v1:0"),
        "balanced": BedrockProvider("us.anthropic.claude-sonnet-4-6"),
    },
    strategy=ComplexityStrategy(),
    fallback="fast"
)

response = router.complete("explain recursion in one sentence")
print(response.text)
print(response.model_used)   # "bedrock/us.anthropic.claude-haiku-4-5-20251001-v1:0"
print(response.cost_usd)     # 0.000016
print(response.latency_ms)   # 1263.4
print(response.tier_used)    # "fast"

Why inference-router?

Most LLM apps send every request to the same model, at the same cost and latency, regardless of how simple or complex the question is. That's like calling in a surgeon to apply a Band-Aid.

inference-router sits between your app and your LLM providers. It classifies each request and dispatches it to the right model automatically:

  • Simple query → small, fast, cheap model
  • Complex reasoning → large, powerful model
  • Budget exceeded → downgrade tier automatically
  • Provider slow or down → failover to backup instantly

No changes to your application code. Zero vendor lock-in. Swap providers in one line.
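
For example, moving the "fast" tier from Bedrock to a local Ollama model (served through the generic HTTPProvider described under Providers below) touches only that tier's entry:

from llm_inference_router import InferenceRouter
from llm_inference_router.providers.http import HTTPProvider
from llm_inference_router.strategies import ComplexityStrategy

router = InferenceRouter(
    tiers={
        # was: BedrockProvider("us.anthropic.claude-haiku-4-5-20251001-v1:0")
        "fast": HTTPProvider(base_url="http://localhost:11434/v1", api_key="ollama", model="llama3"),
    },
    strategy=ComplexityStrategy(),
    fallback="fast"
)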


Installation

Core SDK:

pip install llm-inference-router

With provider extras:

# AWS Bedrock
pip install "llm-inference-router[bedrock]"

# OpenAI / OpenAI-compatible APIs (Groq, DeepInfra, Together AI)
pip install "llm-inference-router[openai]"

# Anthropic direct API
pip install "llm-inference-router[anthropic]"

# Multiple providers
pip install "llm-inference-router[bedrock,openai]"

Quickstart

from dotenv import load_dotenv
load_dotenv()

from llm_inference_router import InferenceRouter
from llm_inference_router.providers.bedrock import BedrockProvider
from llm_inference_router.strategies import ChainStrategy, CostStrategy, LatencyStrategy, ComplexityStrategy

router = InferenceRouter(
    tiers={
        "fast":     BedrockProvider("us.anthropic.claude-haiku-4-5-20251001-v1:0"),
        "balanced": BedrockProvider("us.anthropic.claude-sonnet-4-6"),
    },
    strategy=ChainStrategy([
        CostStrategy(budget_usd_per_day=5.0, tiers_by_cost=["balanced", "fast"]),
        LatencyStrategy(sla_ms=3000, preferred_tier="balanced", fallback_tier="fast"),
        ComplexityStrategy(),
    ]),
    fallback="fast"
)

# simple — routes to fast automatically
response = router.complete("what is the capital of France?")
print(response.tier_used)   # fast
print(response.cost_usd)    # ~$0.000016

# complex — routes to balanced automatically
response = router.complete("explain the tradeoffs between SQL and NoSQL in detail")
print(response.tier_used)   # balanced

# async (from inside an async function)
response = await router.acomplete("explain recursion")

# streaming
for chunk in router.stream("write a haiku about distributed systems"):
    print(chunk, end="", flush=True)

# force a specific tier
response = router.complete("hello", tier="balanced")

Providers

AWS Bedrock

Credentials are loaded automatically from ~/.aws/credentials or environment variables. Newer Claude models require the cross-region inference profile prefix (us.).

from llm_inference_router.providers.bedrock import BedrockProvider

provider = BedrockProvider(
    model_id="us.anthropic.claude-haiku-4-5-20251001-v1:0",
    region="us-east-1"
)

Generic HTTP — OpenAI-compatible APIs

One provider covers Groq, DeepInfra, Together AI, Fireworks, Ollama, and any API following the OpenAI chat completions format:

from llm_inference_router.providers.http import HTTPProvider

# Groq
provider = HTTPProvider(
    base_url="https://api.groq.com/openai/v1",
    api_key="your-key",
    model="mixtral-8x7b-32768"
)

# DeepInfra
provider = HTTPProvider(
    base_url="https://api.deepinfra.com/v1/openai",
    api_key="your-key",
    model="meta-llama/Meta-Llama-3-8B-Instruct"
)

# Ollama (local, no auth needed)
provider = HTTPProvider(
    base_url="http://localhost:11434/v1",
    api_key="ollama",
    model="llama3"
)

Custom providers

Support any API in ~30 lines by subclassing BaseProvider:

from llm_inference_router.providers.base import BaseProvider
from llm_inference_router.models import RouterRequest, RouterResponse, TokenUsage, RoutingDecision
import httpx

class MyCustomProvider(BaseProvider):

    def __init__(self, api_key: str, model: str):
        self.api_key = api_key
        self.model = model

    @property
    def name(self) -> str:
        return f"mycustom/{self.model}"

    def complete(self, request: RouterRequest) -> RouterResponse:
        response = httpx.post(
            "https://api.mycustom.com/v1/chat",
            headers={"Authorization": f"Bearer {self.api_key}"},
            json={"model": self.model, "prompt": request.prompt}
        )
        response.raise_for_status()
        data = response.json()
        return RouterResponse(
            text=data["output"],
            model_used=self.name,
            tier_used="",        # placeholder; routing metadata is set by the router
            tokens=TokenUsage(input=0, output=0, total=0),  # populate if your API reports usage
            latency_ms=0.0,
            cost_usd=0.0,
            routing=RoutingDecision(tier_selected="", strategy_used="", reason=""),
        )

    async def acomplete(self, request: RouterRequest) -> RouterResponse:
        # run the blocking implementation in a worker thread
        import anyio
        return await anyio.to_thread.run_sync(self.complete, request)
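
Once defined, it plugs into a tier like any built-in provider (the API key and model name below are placeholders):

from llm_inference_router import InferenceRouter
from llm_inference_router.strategies import ComplexityStrategy

router = InferenceRouter(
    tiers={"fast": MyCustomProvider(api_key="your-key", model="my-model")},
    strategy=ComplexityStrategy(),
    fallback="fast"
)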

Routing strategies

Complexity strategy

Scores prompts 0–10 across five dimensions — token length, question count, reasoning keywords, code presence, and simple-query detection. Maps the score to a tier.

from llm_inference_router.strategies import ComplexityStrategy

strategy = ComplexityStrategy(
    rules={
        "fast":     (0, 3),    # score 0-3
        "balanced": (3, 6),    # score 3-6
        "powerful": (6, 10),   # score 6-10
    }
)

Example scores:

Prompt                                              Score   Tier
"what is the capital of France?"                     0.0    fast
"explain how transformers work"                      3.5    balanced
"compare B-trees vs LSM trees, explain tradeoffs"    7.5    powerful

Debug the scorer:

result = strategy.explain("compare B-trees vs LSM trees")
print(result["score"])          # 7.5
print(result["tier_selected"])  # powerful
print(result["breakdown"])      # per-dimension scores

Cost strategy

Tracks running spend globally and per user. Downgrades the tier as the budget is consumed.

from llm_inference_router.strategies import CostStrategy

strategy = CostStrategy(
    budget_usd_per_day=10.0,
    tiers_by_cost=["powerful", "balanced", "fast"],  # expensive → cheap
    downgrade_at=0.8,    # downgrade at 80% budget used
    floor_at=0.95,       # use cheapest tier at 95% budget used
)

# per-user budgets via metadata
response = router.complete(
    "your prompt",
    metadata={"user_id": "user_123", "budget_usd": 1.0}
)
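
The thresholds behave roughly as sketched below (an illustration of the semantics described above, not the library's internals). With budget_usd_per_day=10.0, crossing $8.00 of spend steps routing down a tier; crossing $9.50 pins it to the cheapest tier.

def pick_tier(spent_usd: float, budget_usd: float,
              tiers_by_cost: list[str],      # expensive → cheap
              downgrade_at: float = 0.8,
              floor_at: float = 0.95) -> str:
    used = spent_usd / budget_usd
    if used >= floor_at:
        return tiers_by_cost[-1]             # cheapest tier only
    if used >= downgrade_at:
        return tiers_by_cost[min(1, len(tiers_by_cost) - 1)]  # step down one tier
    return tiers_by_cost[0]                  # normal (most capable) tier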

Latency strategy

Tracks rolling p90 latency per tier. Fails over when the SLA is breached.

from llm_inference_router.strategies import LatencyStrategy

strategy = LatencyStrategy(
    sla_ms=3000,
    preferred_tier="balanced",
    fallback_tier="fast",
    window_size=50,
    min_samples=5,
)
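
The rolling-p90 bookkeeping can be pictured like this (an illustrative sketch, not the library's code; in practice the router records each completed request's latency per tier):

from collections import deque

latencies: deque[float] = deque(maxlen=50)   # window_size=50

def p90(values) -> float:
    ordered = sorted(values)
    return ordered[int(0.9 * (len(ordered) - 1))]

def tier_for_next_request(sla_ms: float = 3000, min_samples: int = 5) -> str:
    if len(latencies) >= min_samples and p90(latencies) > sla_ms:
        return "fast"        # fallback_tier: SLA breached
    return "balanced"        # preferred_tier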

Chain strategy

Combines strategies in priority order: hard constraints first, complexity last.

from llm_inference_router.strategies import ChainStrategy

strategy = ChainStrategy([
    CostStrategy(budget_usd_per_day=10.0, tiers_by_cost=["balanced", "fast"]),
    LatencyStrategy(sla_ms=3000, preferred_tier="balanced", fallback_tier="fast"),
    ComplexityStrategy(),
])

RouterResponse

Every completion returns the same normalized shape regardless of provider:

response.text                        # generated text
response.model_used                  # "bedrock/us.anthropic.claude-haiku-4-5-20251001-v1:0"
response.tier_used                   # "fast"
response.tokens.input                # 42
response.tokens.output               # 180
response.tokens.total                # 222
response.latency_ms                  # 1263.4
response.cost_usd                    # 0.000016
response.routing.tier_selected       # "fast"
response.routing.strategy_used       # "ChainStrategy"
response.routing.reason              # "" or "fallback"
response.raw                         # original provider response

FastAPI layer

Run inference-router as a REST API — any app in any language can use it over HTTP.

Start the server

python -m uvicorn app.main:app --reload

Endpoints

Method   Endpoint            Description
GET      /                   health check, server status
POST     /query              route a prompt, get response
POST     /query/stream       streaming response
GET      /tiers              list configured tiers and providers
POST     /debug/complexity   debug complexity scorer for a prompt
DELETE   /stats/reset        reset cost and latency counters

Example request

POST /query
Content-Type: application/json

{
    "prompt": "explain recursion",
    "max_tokens": 512,
    "temperature": 0.7,
    "system_prompt": "Reply concisely.",
    "tier": "balanced"
}

Example response

{
    "text": "Recursion is when a function calls itself...",
    "model_used": "bedrock/us.anthropic.claude-sonnet-4-6",
    "tier_used": "balanced",
    "tokens_total": 180,
    "cost_usd": 0.002340,
    "latency_ms": 3420.1,
    "strategy_used": "ChainStrategy",
    "was_fallback": false
}
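
For example, from Python (any HTTP client in any language works; httpx is used here since it appears earlier in these docs, and this assumes the server started above is on localhost:8000):

import httpx

resp = httpx.post(
    "http://localhost:8000/query",
    json={"prompt": "explain recursion", "max_tokens": 512, "tier": "balanced"},
)
print(resp.json()["text"])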

Interactive docs: http://localhost:8000/docs


Multi-turn conversations

from llm_inference_router.models import Message

response = router.complete(
    prompt="what did I just tell you?",
    messages=[
        Message(role="user", content="my name is Shubham"),
        Message(role="assistant", content="Nice to meet you, Shubham!"),
    ]
)

Project structure

inference-router/
├── llm_inference_router/
│   ├── __init__.py              # public API
│   ├── router.py                # core InferenceRouter class
│   ├── models.py                # Pydantic request/response models
│   ├── providers/
│   │   ├── base.py              # BaseProvider — implement for any API
│   │   ├── bedrock.py           # AWS Bedrock
│   │   └── http.py              # Generic HTTP for Groq, DeepInfra, etc.
│   └── strategies/
│       ├── base.py              # BaseStrategy interface
│       ├── complexity.py        # heuristic complexity scorer
│       ├── cost.py              # budget-based routing
│       ├── latency.py           # SLA-based routing
│       └── chain.py             # combine multiple strategies
├── app/
│   └── main.py                  # FastAPI REST API layer
├── examples/
│   └── basic_usage.py           # end-to-end usage examples
├── tests/
├── .env
├── pyproject.toml
└── README.md
