inference-router

Pluggable LLM inference routing SDK — bring any model from any provider

A pluggable Python SDK for intelligent LLM inference routing. Route requests across model tiers based on complexity, cost, and latency — without changing your application code.

from inference_router import InferenceRouter
from inference_router.providers.bedrock import BedrockProvider
from inference_router.strategies import ComplexityStrategy

router = InferenceRouter(
    tiers={
        "fast":     BedrockProvider("us.anthropic.claude-haiku-4-5-20251001-v1:0"),
        "balanced": BedrockProvider("us.anthropic.claude-sonnet-4-6"),
    },
    strategy=ComplexityStrategy(),
    fallback="fast"
)

response = router.complete("explain recursion in one sentence")
print(response.text)
print(response.model_used)   # "bedrock/us.anthropic.claude-haiku-4-5-20251001-v1:0"
print(response.cost_usd)     # 0.000016
print(response.latency_ms)   # 1263.4
print(response.tier_used)    # "fast"

Why inference-router?

Most LLM apps send every request to the same model — same cost, same latency, regardless of how simple or complex the question is. That's like calling in a surgeon to apply a Band-Aid.

inference-router sits between your app and your LLM providers. It classifies each request and dispatches it to the right model automatically:

  • Simple query → small, fast, cheap model
  • Complex reasoning → large, powerful model
  • Budget exceeded → downgrade tier automatically
  • Provider slow or down → failover to backup instantly

No changes to your application code. Zero vendor lock-in. Swap providers in one line.


Installation

Core SDK:

pip install inference-router

With provider extras:

# AWS Bedrock
pip install "inference-router[bedrock]"

# OpenAI / OpenAI-compatible APIs (Groq, DeepInfra, Together AI)
pip install "inference-router[openai]"

# Anthropic direct API
pip install "inference-router[anthropic]"

# Multiple providers
pip install "inference-router[bedrock,openai]"

Quickstart

from dotenv import load_dotenv
load_dotenv()

from inference_router import InferenceRouter
from inference_router.providers.bedrock import BedrockProvider
from inference_router.strategies import ChainStrategy, CostStrategy, LatencyStrategy, ComplexityStrategy

router = InferenceRouter(
    tiers={
        "fast":     BedrockProvider("us.anthropic.claude-haiku-4-5-20251001-v1:0"),
        "balanced": BedrockProvider("us.anthropic.claude-sonnet-4-6"),
    },
    strategy=ChainStrategy([
        CostStrategy(budget_usd_per_day=5.0, tiers_by_cost=["balanced", "fast"]),
        LatencyStrategy(sla_ms=3000, preferred_tier="balanced", fallback_tier="fast"),
        ComplexityStrategy(),
    ]),
    fallback="fast"
)

# simple — routes to fast automatically
response = router.complete("what is the capital of France?")
print(response.tier_used)   # fast
print(response.cost_usd)    # ~$0.000016

# complex — routes to balanced automatically
response = router.complete("explain the tradeoffs between SQL and NoSQL in detail")
print(response.tier_used)   # balanced

# async (await inside an async function, or drive with asyncio.run)
response = await router.acomplete("explain recursion")

# streaming
for chunk in router.stream("write a haiku about distributed systems"):
    print(chunk, end="", flush=True)

# force a specific tier
response = router.complete("hello", tier="balanced")

Providers

AWS Bedrock

Credentials are loaded automatically from ~/.aws/credentials or from environment variables. Newer Claude models require the cross-region inference profile prefix (us.).

from inference_router.providers.bedrock import BedrockProvider

provider = BedrockProvider(
    model_id="us.anthropic.claude-haiku-4-5-20251001-v1:0",
    region="us-east-1"
)

Generic HTTP — OpenAI-compatible APIs

One provider covers Groq, DeepInfra, Together AI, Fireworks, Ollama, and any API following the OpenAI chat completions format:

from inference_router.providers.http import HTTPProvider

# Groq
provider = HTTPProvider(
    base_url="https://api.groq.com/openai/v1",
    api_key="your-key",
    model="mixtral-8x7b-32768"
)

# DeepInfra
provider = HTTPProvider(
    base_url="https://api.deepinfra.com/v1/openai",
    api_key="your-key",
    model="meta-llama/Meta-Llama-3-8B-Instruct"
)

# Ollama (local, no auth needed)
provider = HTTPProvider(
    base_url="http://localhost:11434/v1",
    api_key="ollama",
    model="llama3"
)

Custom providers

Support any API in ~30 lines by subclassing BaseProvider:

from inference_router.providers.base import BaseProvider
from inference_router.models import RouterRequest, RouterResponse, TokenUsage, RoutingDecision
import httpx

class MyCustomProvider(BaseProvider):

    def __init__(self, api_key: str, model: str):
        self.api_key = api_key
        self.model = model

    @property
    def name(self) -> str:
        return f"mycustom/{self.model}"

    def complete(self, request: RouterRequest) -> RouterResponse:
        response = httpx.post(
            "https://api.mycustom.com/v1/chat",
            headers={"Authorization": f"Bearer {self.api_key}"},
            json={"model": self.model, "prompt": request.prompt}
        )
        data = response.json()
        return RouterResponse(
            text=data["output"],
            model_used=self.name,
            tier_used="",
            tokens=TokenUsage(input=0, output=0, total=0),
            latency_ms=0.0,
            cost_usd=0.0,
            routing=RoutingDecision(tier_selected="", strategy_used="", reason=""),
        )

    async def acomplete(self, request: RouterRequest) -> RouterResponse:
        import anyio
        return await anyio.to_thread.run_sync(lambda: self.complete(request))
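
Then register it like any built-in provider. A usage sketch, assuming the constructor above:

from inference_router import InferenceRouter
from inference_router.strategies import ComplexityStrategy

router = InferenceRouter(
    tiers={"custom": MyCustomProvider(api_key="your-key", model="my-model")},
    strategy=ComplexityStrategy(),
    fallback="custom",
)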

Routing strategies

Complexity strategy

Scores prompts 0–10 across five dimensions (token length, question count, reasoning keywords, code presence, and simple-query detection) and maps the score to a tier.

from inference_router.strategies import ComplexityStrategy

strategy = ComplexityStrategy(
    rules={
        "fast":     (0, 3),    # score 0-3
        "balanced": (3, 6),    # score 3-6
        "powerful": (6, 10),   # score 6-10
    }
)

Example scores:

Prompt                                              Score   Tier
"what is the capital of France?"                      0.0   fast
"explain how transformers work"                       3.5   balanced
"compare B-trees vs LSM trees, explain tradeoffs"     7.5   powerful

Debug the scorer:

result = strategy.explain("compare B-trees vs LSM trees")
print(result["score"])          # 7.5
print(result["tier_selected"])  # powerful
print(result["breakdown"])      # per-dimension scores

Cost strategy

Tracks running spend globally and per user, downgrading the tier as the budget is consumed.

from inference_router.strategies import CostStrategy

strategy = CostStrategy(
    budget_usd_per_day=10.0,
    tiers_by_cost=["powerful", "balanced", "fast"],  # expensive → cheap
    downgrade_at=0.8,    # downgrade at 80% budget used
    floor_at=0.95,       # use cheapest tier at 95% budget used
)

# per-user budgets via metadata
response = router.complete(
    "your prompt",
    metadata={"user_id": "user_123", "budget_usd": 1.0}
)
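
To make the thresholds concrete, here is an illustrative sketch (not the library's internals) of how the consumed fraction of the budget could map to a tier:

def pick_tier(spent_usd: float, budget_usd: float, tiers_by_cost: list[str],
              downgrade_at: float = 0.8, floor_at: float = 0.95) -> str:
    frac = spent_usd / budget_usd
    if frac >= floor_at:
        return tiers_by_cost[-1]             # cheapest tier
    if frac >= downgrade_at:
        return tiers_by_cost[min(1, len(tiers_by_cost) - 1)]  # one step down
    return tiers_by_cost[0]                  # most capable tier

# with a $10/day budget: under $8 -> "powerful", $8-$9.50 -> "balanced",
# above $9.50 -> "fast"
pick_tier(8.5, 10.0, ["powerful", "balanced", "fast"])   # "balanced"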

Latency strategy

Tracks rolling p90 latency per tier and fails over to the fallback tier when the SLA is breached.

from inference_router.strategies import LatencyStrategy

strategy = LatencyStrategy(
    sla_ms=3000,
    preferred_tier="balanced",
    fallback_tier="fast",
    window_size=50,
    min_samples=5,
)
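
A minimal sketch of a rolling-p90 tracker mirroring the window_size and min_samples parameters above (illustrative only, not the library's code):

from collections import deque

class RollingP90:
    def __init__(self, window_size: int = 50, min_samples: int = 5):
        self.samples: deque[float] = deque(maxlen=window_size)
        self.min_samples = min_samples

    def record(self, latency_ms: float) -> None:
        self.samples.append(latency_ms)

    def p90(self) -> float | None:
        if len(self.samples) < self.min_samples:
            return None                      # not enough data to judge the tier
        ordered = sorted(self.samples)
        return ordered[int(0.9 * (len(ordered) - 1))]

# prefer "balanced" unless its observed p90 breaches the SLA
def choose(tracker: RollingP90, sla_ms: float = 3000) -> str:
    p90 = tracker.p90()
    return "fast" if p90 is not None and p90 > sla_ms else "balanced"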

Chain strategy

Combines strategies in priority order. Hard constraints first, complexity last.

from inference_router.strategies import ChainStrategy

strategy = ChainStrategy([
    CostStrategy(budget_usd_per_day=10.0, tiers_by_cost=["balanced", "fast"]),
    LatencyStrategy(sla_ms=3000, preferred_tier="balanced", fallback_tier="fast"),
    ComplexityStrategy(),
])
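
One way to picture the chain (a sketch of plausible semantics, not the library's documented contract; select_tier is a hypothetical method name): each strategy in order gets a chance to decide, and the first one that returns a tier wins.

def chain_select(strategies, request):
    for strategy in strategies:
        tier = strategy.select_tier(request)   # hypothetical method name
        if tier is not None:
            return tier                        # earlier strategies take priority
    return None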

RouterResponse

Every completion returns the same normalized shape regardless of provider:

response.text                        # generated text
response.model_used                  # "bedrock/us.anthropic.claude-haiku-4-5-20251001-v1:0"
response.tier_used                   # "fast"
response.tokens.input                # 42
response.tokens.output               # 180
response.tokens.total                # 222
response.latency_ms                  # 1263.4
response.cost_usd                    # 0.000016
response.routing.tier_selected       # "fast"
response.routing.strategy_used       # "ChainStrategy"
response.routing.reason              # "" or "fallback"
response.raw                         # original provider response

FastAPI layer

Run inference-router as a REST API — any app in any language can use it over HTTP.

Start the server

python -m uvicorn app.main:app --reload

Endpoints

Method   Endpoint            Description
GET      /                   health check, server status
POST     /query              route a prompt, get response
POST     /query/stream       streaming response
GET      /tiers              list configured tiers and providers
POST     /debug/complexity   debug complexity scorer for a prompt
DELETE   /stats/reset        reset cost and latency counters

Example request

POST /query
Content-Type: application/json

{
    "prompt": "explain recursion",
    "max_tokens": 512,
    "temperature": 0.7,
    "system_prompt": "Reply concisely.",
    "tier": "balanced"
}
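
The same request with curl, assuming the server started above on localhost:8000:

curl -s http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"prompt": "explain recursion", "tier": "balanced"}'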

Example response

{
    "text": "Recursion is when a function calls itself...",
    "model_used": "bedrock/us.anthropic.claude-sonnet-4-6",
    "tier_used": "balanced",
    "tokens_total": 180,
    "cost_usd": 0.002340,
    "latency_ms": 3420.1,
    "strategy_used": "ChainStrategy",
    "was_fallback": false
}

Interactive docs: http://localhost:8000/docs


Multi-turn conversations

from inference_router.models import Message

response = router.complete(
    prompt="what did I just tell you?",
    messages=[
        Message(role="user", content="my name is Shubham"),
        Message(role="assistant", content="Nice to meet you, Shubham!"),
    ]
)

Project structure

inference-router/
├── inference_router/
│   ├── __init__.py              # public API
│   ├── router.py                # core InferenceRouter class
│   ├── models.py                # Pydantic request/response models
│   ├── providers/
│   │   ├── base.py              # BaseProvider — implement for any API
│   │   ├── bedrock.py           # AWS Bedrock
│   │   └── http.py              # Generic HTTP for Groq, DeepInfra, etc.
│   └── strategies/
│       ├── base.py              # BaseStrategy interface
│       ├── complexity.py        # heuristic complexity scorer
│       ├── cost.py              # budget-based routing
│       ├── latency.py           # SLA-based routing
│       └── chain.py             # combine multiple strategies
├── app/
│   └── main.py                  # FastAPI REST API layer
├── examples/
│   └── basic_usage.py           # end-to-end usage examples
├── tests/
├── .env
├── pyproject.toml
└── README.md

Roadmap

  • providers/bedrock.py — AWS Bedrock
  • providers/http.py — Generic OpenAI-compatible HTTP
  • strategies/complexity.py — heuristic scorer
  • strategies/cost.py — budget enforcer
  • strategies/latency.py — SLA routing
  • strategies/chain.py — strategy chaining
  • router.py — core router with sync/async/streaming
  • examples/basic_usage.py — working examples
  • FastAPI REST API layer with streaming
  • providers/openai.py — OpenAI native client
  • providers/anthropic.py — Anthropic direct API
  • PyPI publish — pip install inference-router

License

MIT
