# inference-router

Pluggable LLM inference routing SDK — bring any model from any provider.
A pluggable Python SDK for intelligent LLM inference routing. Route requests across model tiers based on complexity, cost, and latency — without changing your application code.
```python
from inference_router import InferenceRouter
from inference_router.providers.bedrock import BedrockProvider
from inference_router.strategies import ComplexityStrategy

router = InferenceRouter(
    tiers={
        "fast": BedrockProvider("us.anthropic.claude-haiku-4-5-20251001-v1:0"),
        "balanced": BedrockProvider("us.anthropic.claude-sonnet-4-6"),
    },
    strategy=ComplexityStrategy(),
    fallback="fast",
)

response = router.complete("explain recursion in one sentence")
print(response.text)
print(response.model_used)  # "bedrock/us.anthropic.claude-haiku-4-5-20251001-v1:0"
print(response.cost_usd)    # 0.000016
print(response.latency_ms)  # 1263.4
print(response.tier_used)   # "fast"
```
## Why inference-router?
Most LLM apps today send every request to the same model — same cost, same latency — regardless of how simple or complex the question is. That's like booking a surgeon to apply a band-aid.
inference-router sits between your app and your LLM providers. It classifies each request and dispatches it to the right model automatically:
- Simple query → small, fast, cheap model
- Complex reasoning → large, powerful model
- Budget exceeded → downgrade tier automatically
- Provider slow or down → failover to backup instantly
No changes to your application code. Zero vendor lock-in. Swap providers in one line.
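
For example, swapping the fast tier from Bedrock to a local Ollama model is a one-line change to the tier map. A sketch using the providers documented below (model IDs are illustrative):

```python
from inference_router import InferenceRouter
from inference_router.providers.bedrock import BedrockProvider
from inference_router.providers.http import HTTPProvider
from inference_router.strategies import ComplexityStrategy

router = InferenceRouter(
    tiers={
        # one-line swap: was BedrockProvider("us.anthropic.claude-haiku-4-5-20251001-v1:0")
        "fast": HTTPProvider(base_url="http://localhost:11434/v1", api_key="ollama", model="llama3"),
        "balanced": BedrockProvider("us.anthropic.claude-sonnet-4-6"),
    },
    strategy=ComplexityStrategy(),
    fallback="fast",
)
```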
## Installation

Core SDK:

```bash
pip install inference-router
```

With provider extras:

```bash
# AWS Bedrock
pip install "inference-router[bedrock]"

# OpenAI / OpenAI-compatible APIs (Groq, DeepInfra, Together AI)
pip install "inference-router[openai]"

# Anthropic direct API
pip install "inference-router[anthropic]"

# Multiple providers
pip install "inference-router[bedrock,openai]"
```
## Quickstart

```python
from dotenv import load_dotenv
load_dotenv()

from inference_router import InferenceRouter
from inference_router.providers.bedrock import BedrockProvider
from inference_router.strategies import (
    ChainStrategy,
    ComplexityStrategy,
    CostStrategy,
    LatencyStrategy,
)

router = InferenceRouter(
    tiers={
        "fast": BedrockProvider("us.anthropic.claude-haiku-4-5-20251001-v1:0"),
        "balanced": BedrockProvider("us.anthropic.claude-sonnet-4-6"),
    },
    strategy=ChainStrategy([
        CostStrategy(budget_usd_per_day=5.0, tiers_by_cost=["balanced", "fast"]),
        LatencyStrategy(sla_ms=3000, preferred_tier="balanced", fallback_tier="fast"),
        ComplexityStrategy(),
    ]),
    fallback="fast",
)

# simple — routes to fast automatically
response = router.complete("what is the capital of France?")
print(response.tier_used)  # fast
print(response.cost_usd)   # ~$0.000016

# complex — routes to balanced automatically
response = router.complete("explain the tradeoffs between SQL and NoSQL in detail")
print(response.tier_used)  # balanced

# async (inside an async function)
response = await router.acomplete("explain recursion")

# streaming
for chunk in router.stream("write a haiku about distributed systems"):
    print(chunk, end="", flush=True)

# force a specific tier
response = router.complete("hello", tier="balanced")
```
## Providers

### AWS Bedrock
Credentials are loaded automatically from `~/.aws/credentials` or environment variables. Newer Claude models require the cross-region inference profile prefix (`us.`).
```python
from inference_router.providers.bedrock import BedrockProvider

provider = BedrockProvider(
    model_id="us.anthropic.claude-haiku-4-5-20251001-v1:0",
    region="us-east-1",
)
```
### Generic HTTP — OpenAI-compatible APIs
One provider covers Groq, DeepInfra, Together AI, Fireworks, Ollama, and any API following the OpenAI chat completions format:
```python
from inference_router.providers.http import HTTPProvider

# Groq
provider = HTTPProvider(
    base_url="https://api.groq.com/openai/v1",
    api_key="your-key",
    model="mixtral-8x7b-32768",
)

# DeepInfra
provider = HTTPProvider(
    base_url="https://api.deepinfra.com/v1/openai",
    api_key="your-key",
    model="meta-llama/Meta-Llama-3-8B-Instruct",
)

# Ollama (local, no auth needed)
provider = HTTPProvider(
    base_url="http://localhost:11434/v1",
    api_key="ollama",
    model="llama3",
)
```
### Custom providers

Support any API in ~30 lines by subclassing `BaseProvider`:
```python
import httpx

from inference_router.providers.base import BaseProvider
from inference_router.models import (
    RouterRequest,
    RouterResponse,
    RoutingDecision,
    TokenUsage,
)


class MyCustomProvider(BaseProvider):
    def __init__(self, api_key: str, model: str):
        self.api_key = api_key
        self.model = model

    @property
    def name(self) -> str:
        return f"mycustom/{self.model}"

    def complete(self, request: RouterRequest) -> RouterResponse:
        response = httpx.post(
            "https://api.mycustom.com/v1/chat",
            headers={"Authorization": f"Bearer {self.api_key}"},
            json={"model": self.model, "prompt": request.prompt},
        )
        data = response.json()
        return RouterResponse(
            text=data["output"],
            model_used=self.name,
            # placeholder tier/usage/routing metadata — not known at the provider level
            tier_used="",
            tokens=TokenUsage(input=0, output=0, total=0),
            latency_ms=0.0,
            cost_usd=0.0,
            routing=RoutingDecision(tier_selected="", strategy_used="", reason=""),
        )

    async def acomplete(self, request: RouterRequest) -> RouterResponse:
        # Run the sync implementation in a worker thread.
        import anyio

        return await anyio.to_thread.run_sync(lambda: self.complete(request))
```
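
Once defined, the custom provider slots into a tier like any built-in one. A minimal wiring sketch (`my-small-model` is a hypothetical model name):

```python
from inference_router import InferenceRouter
from inference_router.providers.bedrock import BedrockProvider
from inference_router.strategies import ComplexityStrategy

router = InferenceRouter(
    tiers={
        "fast": MyCustomProvider(api_key="your-key", model="my-small-model"),  # hypothetical
        "balanced": BedrockProvider("us.anthropic.claude-sonnet-4-6"),
    },
    strategy=ComplexityStrategy(),
    fallback="fast",
)
```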
## Routing strategies

### Complexity strategy
Scores prompts 0–10 across five dimensions — token length, question count, reasoning keywords, code presence, and simple-query detection. Maps score to a tier.
```python
from inference_router.strategies import ComplexityStrategy

strategy = ComplexityStrategy(
    rules={
        "fast": (0, 3),       # score 0-3
        "balanced": (3, 6),   # score 3-6
        "powerful": (6, 10),  # score 6-10
    }
)
```
Example scores:
| Prompt | Score | Tier |
|---|---|---|
| "what is the capital of France?" | 0.0 | fast |
| "explain how transformers work" | 3.5 | balanced |
| "compare B-trees vs LSM trees, explain tradeoffs" | 7.5 | powerful |
Debug the scorer:

```python
result = strategy.explain("compare B-trees vs LSM trees")
print(result["score"])          # 7.5
print(result["tier_selected"])  # powerful
print(result["breakdown"])      # per-dimension scores
```
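
For intuition, here is a toy version of such a scorer — not the library's actual implementation in `strategies/complexity.py`, just an illustration of how the five dimensions could combine:

```python
import re

def toy_complexity_score(prompt: str) -> float:
    """Illustrative 0-10 heuristic; the real scorer's weights and rules differ."""
    text = prompt.lower()
    score = 0.0
    score += min(len(text.split()) / 40, 2.5)                 # 1. token length
    score += min(text.count("?"), 2) * 1.0                    # 2. question count
    keywords = ("explain", "compare", "tradeoff", "analyze", "design", "prove")
    score += min(sum(k in text for k in keywords), 2) * 1.5   # 3. reasoning keywords
    if re.search(r"```|\bdef\b|\bclass\b|[{};]", text):       # 4. code presence
        score += 1.5
    if re.fullmatch(r"(what|who|when|where) is .{0,40}\??", text):
        score -= 2.0                                          # 5. simple-query detection
    return max(0.0, min(score, 10.0))
```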
### Cost strategy
Tracks running spend globally and per user, downgrading the tier as the budget is consumed.
```python
from inference_router.strategies import CostStrategy

strategy = CostStrategy(
    budget_usd_per_day=10.0,
    tiers_by_cost=["powerful", "balanced", "fast"],  # expensive → cheap
    downgrade_at=0.8,  # downgrade at 80% of budget used
    floor_at=0.95,     # use cheapest tier at 95% of budget used
)

# per-user budgets via metadata
response = router.complete(
    "your prompt",
    metadata={"user_id": "user_123", "budget_usd": 1.0},
)
```
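
Concretely, with a $10/day budget the thresholds above mean: the most capable tier below $8 spent, one step cheaper from $8, and the cheapest tier from $9.50. A sketch of that mapping (illustrative, not the strategy's internals):

```python
def pick_tier(spent_usd: float, budget_usd: float, tiers_by_cost: list[str]) -> str:
    """Map budget consumption to a tier; index 0 is the most expensive/capable."""
    ratio = spent_usd / budget_usd
    if ratio >= 0.95:                                          # floor_at
        return tiers_by_cost[-1]
    if ratio >= 0.8:                                           # downgrade_at
        return tiers_by_cost[min(1, len(tiers_by_cost) - 1)]
    return tiers_by_cost[0]

tiers = ["powerful", "balanced", "fast"]
print(pick_tier(7.00, 10.0, tiers))  # powerful
print(pick_tier(8.50, 10.0, tiers))  # balanced
print(pick_tier(9.60, 10.0, tiers))  # fast
```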
### Latency strategy

Tracks rolling p90 latency per tier and fails over when the SLA is breached.
```python
from inference_router.strategies import LatencyStrategy

strategy = LatencyStrategy(
    sla_ms=3000,
    preferred_tier="balanced",
    fallback_tier="fast",
    window_size=50,
    min_samples=5,
)
```
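
A rolling p90 can be sketched as a fixed-size window of recent samples (illustrative only; the strategy keeps its own internal bookkeeping). A sensible reading of `min_samples=5` is that the estimate is only trusted once five samples exist:

```python
from collections import deque

class RollingP90:
    """Track p90 latency over the most recent `window_size` samples."""

    def __init__(self, window_size: int = 50, min_samples: int = 5):
        self.samples: deque[float] = deque(maxlen=window_size)
        self.min_samples = min_samples

    def record(self, latency_ms: float) -> None:
        self.samples.append(latency_ms)

    def p90(self) -> float | None:
        # Too few samples — the caller should keep the preferred tier.
        if len(self.samples) < self.min_samples:
            return None
        ordered = sorted(self.samples)
        return ordered[min(int(len(ordered) * 0.9), len(ordered) - 1)]
```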
### Chain strategy

Combines strategies in priority order: hard constraints first, complexity last.
```python
from inference_router.strategies import (
    ChainStrategy,
    ComplexityStrategy,
    CostStrategy,
    LatencyStrategy,
)

strategy = ChainStrategy([
    CostStrategy(budget_usd_per_day=10.0, tiers_by_cost=["balanced", "fast"]),
    LatencyStrategy(sla_ms=3000, preferred_tier="balanced", fallback_tier="fast"),
    ComplexityStrategy(),
])
```
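
A useful mental model for the chain — an assumption about the semantics, not a quote from `strategies/chain.py` — is first-match-wins: each strategy either forces a tier or defers to the next, with complexity acting as the final tiebreaker:

```python
# Hypothetical sketch; `select_tier` is an assumed interface, not the package's actual API.
def chain_select(strategies, request, default_tier: str) -> str:
    for strategy in strategies:
        tier = strategy.select_tier(request)  # returns a tier name, or None to defer
        if tier is not None:
            return tier
    return default_tier
```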
## RouterResponse
Every completion returns the same normalized shape regardless of provider:
```python
response.text                   # generated text
response.model_used             # "bedrock/us.anthropic.claude-haiku-4-5-20251001-v1:0"
response.tier_used              # "fast"
response.tokens.input           # 42
response.tokens.output          # 180
response.tokens.total           # 222
response.latency_ms             # 1263.4
response.cost_usd               # 0.000016
response.routing.tier_selected  # "fast"
response.routing.strategy_used  # "ChainStrategy"
response.routing.reason         # "" or "fallback"
response.raw                    # original provider response
```
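
Because every provider returns this same shape, cross-provider observability reduces to one small helper. A sketch:

```python
import logging

logger = logging.getLogger("inference_router.audit")

def log_completion(response) -> None:
    """Emit one structured log line per completion using the normalized fields."""
    logger.info(
        "tier=%s model=%s tokens=%d cost_usd=%.6f latency_ms=%.1f fallback=%s",
        response.tier_used,
        response.model_used,
        response.tokens.total,
        response.cost_usd,
        response.latency_ms,
        response.routing.reason == "fallback",
    )
```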
## FastAPI layer
Run inference-router as a REST API — any app in any language can use it over HTTP.
### Start the server

```bash
python -m uvicorn app.main:app --reload
```
### Endpoints

| Method | Endpoint | Description |
|---|---|---|
| GET | `/` | health check, server status |
| POST | `/query` | route a prompt, get response |
| POST | `/query/stream` | streaming response |
| GET | `/tiers` | list configured tiers and providers |
| POST | `/debug/complexity` | debug the complexity scorer for a prompt |
| DELETE | `/stats/reset` | reset cost and latency counters |
### Example request

```http
POST /query
Content-Type: application/json

{
  "prompt": "explain recursion",
  "max_tokens": 512,
  "temperature": 0.7,
  "system_prompt": "Reply concisely.",
  "tier": "balanced"
}
```
### Example response

```json
{
  "text": "Recursion is when a function calls itself...",
  "model_used": "bedrock/us.anthropic.claude-sonnet-4-6",
  "tier_used": "balanced",
  "tokens_total": 180,
  "cost_usd": 0.002340,
  "latency_ms": 3420.1,
  "strategy_used": "ChainStrategy",
  "was_fallback": false
}
```
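
The same call from Python with httpx, for example (assuming the default uvicorn host and port):

```python
import httpx

resp = httpx.post(
    "http://localhost:8000/query",
    json={"prompt": "explain recursion", "max_tokens": 512, "temperature": 0.7},
    timeout=60.0,
)
resp.raise_for_status()
body = resp.json()
print(body["text"])
print(body["tier_used"], body["cost_usd"])
```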
Interactive docs: http://localhost:8000/docs
## Multi-turn conversations

```python
from inference_router.models import Message

response = router.complete(
    prompt="what did I just tell you?",
    messages=[
        Message(role="user", content="my name is Shubham"),
        Message(role="assistant", content="Nice to meet you, Shubham!"),
    ],
)
```
## Project structure

```
inference-router/
├── inference_router/
│   ├── __init__.py        # public API
│   ├── router.py          # core InferenceRouter class
│   ├── models.py          # Pydantic request/response models
│   ├── providers/
│   │   ├── base.py        # BaseProvider — implement for any API
│   │   ├── bedrock.py     # AWS Bedrock
│   │   └── http.py        # generic HTTP for Groq, DeepInfra, etc.
│   └── strategies/
│       ├── base.py        # BaseStrategy interface
│       ├── complexity.py  # heuristic complexity scorer
│       ├── cost.py        # budget-based routing
│       ├── latency.py     # SLA-based routing
│       └── chain.py       # combine multiple strategies
├── app/
│   └── main.py            # FastAPI REST API layer
├── examples/
│   └── basic_usage.py     # end-to-end usage examples
├── tests/
├── .env
├── pyproject.toml
└── README.md
```
## Roadmap

- [x] `providers/bedrock.py` — AWS Bedrock
- [x] `providers/http.py` — Generic OpenAI-compatible HTTP
- [x] `strategies/complexity.py` — heuristic scorer
- [x] `strategies/cost.py` — budget enforcer
- [x] `strategies/latency.py` — SLA routing
- [x] `strategies/chain.py` — strategy chaining
- [x] `router.py` — core router with sync/async/streaming
- [x] `examples/basic_usage.py` — working examples
- [x] FastAPI REST API layer with streaming
- [ ] `providers/openai.py` — OpenAI native client
- [ ] `providers/anthropic.py` — Anthropic direct API
- [ ] PyPI publish — `pip install inference-router`
## License
MIT