llm-cache-router
Semantic cache, multi-provider LLM router and cost tracker (OpenAI, Anthropic, Gemini, Ollama, MiniMax, Qwen).
A lightweight, production-ready Python library that combines semantic caching, multi-provider LLM routing, and cost tracking in a single async-first API. Cut your LLM bill, ship faster, and never hardcode a single provider again.
Table of Contents
- Why llm-cache-router
- Features
- Installation
- Quickstart
- Streaming
- Cache Warmup
- Routing Strategies
- Cache Backends
- Budget and Cost Tracking
- FastAPI Integration
- Async Context Manager
- Supported Providers
- Architecture
- Development
- Roadmap
- Contributing
- License
Why llm-cache-router
Calling LLMs directly is expensive, slow, and locks you into a single vendor. This library solves all three problems at once:
- Save money — a semantic cache returns answers for near-duplicate queries without re-calling the provider (see the sketch below), typically cutting spend by 30–70% on production workloads.
- Stay resilient — swap providers on the fly, use fallback chains, and never take a full outage because one vendor is down.
- Control cost — built-in daily/monthly budget guardrails with Prometheus metrics for every request.
One dependency. Six providers. Three cache backends. Full async support.
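The semantic-cache idea above is straightforward: embed the incoming prompt, compare it against embeddings of previously answered prompts, and reuse the stored answer when cosine similarity clears a threshold. A standalone sketch of that matching step with sentence-transformers (the model name here is illustrative, not necessarily what the library ships with):
# Illustrative only: how near-duplicate matching works in principle.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model, purely for illustration
cached_prompt = "What is a semantic cache?"
new_prompt = "Explain what a semantic cache is"

emb_cached = model.encode(cached_prompt, convert_to_tensor=True)
emb_new = model.encode(new_prompt, convert_to_tensor=True)
similarity = util.cos_sim(emb_new, emb_cached).item()

if similarity >= 0.92:  # same idea as CacheConfig(threshold=0.92) below
    print("cache hit: reuse the stored answer, skip the provider call")
else:
    print("cache miss: call the provider, then store the new answer")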
Features
- Semantic cache — vector-similarity matching via sentence-transformers, not just exact string hashing.
- Multi-provider routing across OpenAI, Anthropic, Google Gemini, Ollama, MiniMax and Qwen (Dashscope).
- Three routing strategies: CHEAPEST_FIRST, FASTEST_FIRST, FALLBACK_CHAIN.
- Pluggable cache backends: in-memory (FAISS), Redis, Qdrant.
- Streaming — native async SSE streaming for every provider, transparent to the cache layer.
- Cost tracker with per-model pricing, daily/monthly budget limits and savings accounting.
- Cache warmup with controlled concurrency for pre-production pre-loading.
- FastAPI middleware + Prometheus metrics endpoint out of the box.
- Typed — Pydantic v2 models everywhere, fully typed public API.
- Tested — 10 test modules covering router, cache, providers, retry, warmup, and HTTP middleware.
Installation
pip install llm-cache-router
Optional extras:
pip install "llm-cache-router[redis]" # Redis cache backend
pip install "llm-cache-router[qdrant]" # Qdrant vector cache backend
pip install "llm-cache-router[fastapi]" # FastAPI middleware + Prometheus
pip install "llm-cache-router[all]" # everything above
pip install "llm-cache-router[dev]" # tests, ruff, mypy
Requires Python 3.11+.
Quickstart
import asyncio
from llm_cache_router import CacheConfig, LLMRouter, RoutingStrategy
async def main() -> None:
    router = LLMRouter(
        providers={
            "openai": {"api_key": "sk-...", "models": ["gpt-4o-mini"]},
            "anthropic": {"api_key": "sk-ant-...", "models": ["claude-3-5-sonnet"]},
            "gemini": {"api_key": "AIza...", "models": ["gemini-1.5-flash"]},
            "ollama": {"base_url": "http://localhost:11434", "models": ["llama3.2"]},
        },
        cache=CacheConfig(
            backend="memory",
            threshold=0.92,      # cosine similarity threshold
            ttl=3600,            # cache TTL in seconds
            max_entries=10_000,
        ),
        strategy=RoutingStrategy.CHEAPEST_FIRST,
        budget={"daily_usd": 5.0, "monthly_usd": 50.0},
    )

    response = await router.complete(
        messages=[{"role": "user", "content": "What is a semantic cache?"}],
        model="gpt-4o-mini",
    )
    print(response.content)
    print(f"cache_hit={response.cache_hit} cost=${response.cost_usd:.6f}")

asyncio.run(main())
Streaming
All providers (OpenAI, Anthropic, Gemini, Ollama, MiniMax, Qwen) support native SSE streaming. The cache layer is transparent: on a cache hit you receive a single final chunk; on a miss you get a real streaming response that is also written to the cache once complete.
async for chunk in router.stream(
    messages=[{"role": "user", "content": "Explain async/await in Python"}],
    model="gpt-4o-mini",
):
    print(chunk.delta, end="", flush=True)
    if chunk.is_final:
        print(f"\nprovider={chunk.provider_used} cost=${chunk.cost_usd:.6f}")
Cache Warmup
Pre-load the cache with known queries before traffic hits production:
from llm_cache_router.models import WarmupEntry
results = await router.warmup(
    entries=[
        WarmupEntry(
            messages=[{"role": "user", "content": "What is RAG?"}],
            model="gpt-4o-mini",
        ),
        WarmupEntry(
            messages=[{"role": "user", "content": "Explain vector databases"}],
            model="gpt-4o-mini",
        ),
    ],
    concurrency=5,
    skip_cached=True,
)
print(results) # {"warmed": 2, "skipped": 0, "failed": 0}
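In practice the warmup corpus often lives in a file. A small helper for building WarmupEntry objects from a JSON list of questions (the file name and format are your own choice, nothing the library prescribes):
import json
from llm_cache_router.models import WarmupEntry

def load_warmup_entries(path: str, model: str = "gpt-4o-mini") -> list[WarmupEntry]:
    # Expects a JSON file shaped like ["question one", "question two", ...]
    with open(path, encoding="utf-8") as fh:
        questions = json.load(fh)
    return [
        WarmupEntry(messages=[{"role": "user", "content": q}], model=model)
        for q in questions
    ]

results = await router.warmup(entries=load_warmup_entries("warmup_questions.json"), concurrency=5)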
Routing Strategies
| Strategy | Description |
|---|---|
| CHEAPEST_FIRST | Picks the cheapest provider/model by live pricing for each call. |
| FASTEST_FIRST | Picks the provider with the lowest observed latency (EMA). |
| FALLBACK_CHAIN | Tries providers in order, falls back on error/timeout. |
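FASTEST_FIRST keeps an exponentially weighted moving average of observed latency per provider. The update rule is the standard EMA; the smoothing factor below is a placeholder, not the library's actual setting:
def update_ema(prev_ema: float, observed_latency: float, alpha: float = 0.2) -> float:
    # Standard EMA: recent calls count more, but a single outlier does not dominate routing.
    return alpha * observed_latency + (1 - alpha) * prev_ema

ema = update_ema(prev_ema=0.150, observed_latency=0.400)  # 0.150 s history, one slow call -> 0.200 s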
router = LLMRouter(
    providers={
        "openai": {"api_key": "sk-...", "models": ["gpt-4o"]},
        "anthropic": {"api_key": "sk-ant-...", "models": ["claude-3-5-sonnet"]},
    },
    strategy=RoutingStrategy.FALLBACK_CHAIN,
    fallback_chain=["openai/gpt-4o", "anthropic/claude-3-5-sonnet"],
)
Cache Backends
In-memory (FAISS)
Default. Zero dependencies beyond the core install. Best for single-process apps and tests.
cache=CacheConfig(backend="memory", threshold=0.92, ttl=3600, max_entries=10_000)
Redis
Production-grade distributed cache with LRU eviction, configurable timeouts, retry/backoff and bounded candidate set for vector search.
cache=CacheConfig(
    backend="redis",
    redis_url="redis://localhost:6379/0",
    redis_namespace="llm_cache_router_prod",
    threshold=0.92,
    ttl=3600,
    max_entries=50_000,
    redis_command_timeout_sec=1.5,
    redis_retry_attempts=3,
    redis_retry_backoff_sec=0.2,
    redis_candidate_k=256,
)
Qdrant
Native vector database for very large caches (millions of entries) and cross-service deployments.
pip install "llm-cache-router[qdrant]"
cache=CacheConfig(
    backend="qdrant",
    qdrant_url="http://localhost:6333",
    qdrant_api_key=None,  # optional for Qdrant Cloud
    qdrant_collection="llm_cache",
    threshold=0.92,
    ttl=3600,
    max_entries=100_000,
)
Budget and Cost Tracking
Set per-day and per-month USD limits — requests that would exceed the budget are rejected before hitting the provider.
router = LLMRouter(
    providers={...},
    budget={"daily_usd": 5.0, "monthly_usd": 50.0},
)
stats = router.stats()
print(stats.total_cost_usd) # total spent since start
print(stats.saved_cost_usd) # saved via cache hits
print(stats.daily_spend_usd)
print(stats.budget_remaining_usd) # None if no limit is set
print(stats.cache_hit_rate) # 0.0–1.0
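If a call would push spend past a limit, it is rejected before any provider is contacted. A minimal sketch of guarding for that case; the exact exception class raised is an assumption here, so check the package's exports for the real name:
try:
    response = await router.complete(
        messages=[{"role": "user", "content": "Summarise this document"}],
        model="gpt-4o-mini",
    )
except Exception as exc:  # e.g. a budget-exceeded error raised before the provider call (assumed class)
    print(f"request rejected by budget guardrail: {exc}")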
FastAPI Integration
pip install "llm-cache-router[fastapi]"
from fastapi import FastAPI
from llm_cache_router.middleware.fastapi import (
    add_http_metrics_middleware,
    mount_metrics_endpoint,
)
app = FastAPI()
add_http_metrics_middleware(app=app)
mount_metrics_endpoint(app=app, router=router, path="/metrics")
Exposed Prometheus metrics:
- llm_router_http_requests_total{method,path,status}
- llm_router_http_request_duration_seconds_* (histogram)
- llm_router_cache_hits_total, llm_router_cache_misses_total
- llm_router_cost_usd_total, llm_router_saved_cost_usd_total
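With the middleware and metrics mounted, the router is used from route handlers like any other async client. A minimal sketch of a chat endpoint reusing the app and router objects from the snippets above (the route path and request model are illustrative, not part of the library):
from pydantic import BaseModel

class ChatRequest(BaseModel):
    prompt: str

@app.post("/chat")  # illustrative endpoint, not shipped by the library
async def chat(req: ChatRequest) -> dict:
    response = await router.complete(
        messages=[{"role": "user", "content": req.prompt}],
        model="gpt-4o-mini",
    )
    return {"answer": response.content, "cache_hit": response.cache_hit, "cost_usd": response.cost_usd}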
Async Context Manager
async with LLMRouter(providers={...}) as router:
    response = await router.complete(messages=[...], model="gpt-4o-mini")
# close() is called automatically — closes provider clients and cache connections
Supported Providers
| Provider | Streaming | Notes |
|---|---|---|
| OpenAI | yes | gpt-4o, gpt-4o-mini, o1-*, etc. |
| Anthropic | yes | Claude 3.5 Sonnet/Haiku, Opus |
| Google Gemini | yes | 1.5 Flash, 1.5 Pro |
| Ollama | yes | Any locally-served model |
| MiniMax | yes | MiniMax-Text-01 and others |
| Qwen (Dashscope) | yes | qwen-plus, qwen-max, etc. |
To add a new provider, subclass LLMProvider and register it with @register_provider("name"). See llm_cache_router/providers/base.py.
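A hedged sketch of what a custom provider could look like; the import path for register_provider and the method names and signatures below are assumptions, the authoritative contract lives in llm_cache_router/providers/base.py:
from llm_cache_router.providers.base import LLMProvider, register_provider  # import path assumed

@register_provider("my_provider")
class MyProvider(LLMProvider):
    # Sketch only: the real abstract methods are defined in base.py.
    async def complete(self, messages, model, **kwargs):
        ...  # call your backend and return the library's response type

    async def stream(self, messages, model, **kwargs):
        ...  # yield stream chunks from your backend
        yield  # placeholder so this remains an async generator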
Architecture
llm_cache_router/
    cache/          # memory (FAISS) / redis / qdrant backends
    providers/      # openai, anthropic, gemini, ollama, minimax, qwen
    strategies/     # cheapest, fastest, fallback
    embeddings/     # SentenceEncoder, HashingEncoder
    cost/           # CostTracker with daily/monthly budgets
    middleware/     # FastAPI middleware
    observability/  # Prometheus metrics
    models.py       # Pydantic models (LLMResponse, LLMStreamChunk, ...)
    router.py       # LLMRouter — public entrypoint
    retry.py        # RetryConfig + exponential backoff
    warmup.py       # async warmup helper
Development
git clone https://github.com/svalench/llm-cache-router.git
cd llm-cache-router
# using uv (recommended)
uv sync --all-extras
uv run pytest
# or plain pip
pip install -e ".[all,dev]"
pytest
Code quality is enforced in CI via:
- ruff check (lint) and ruff format --check (style)
- mypy --ignore-missing-imports (type check)
- pytest on Python 3.11, 3.12 and 3.13 with coverage
Roadmap
- v0.3 — Django helpers and middleware.
- v0.4 — Streaming retry (reconnect on SSE drop).
- v0.5 — Request tracing hooks (OpenTelemetry).
- v1.0 — Full OTel spans, pluggable pricing providers, cache invalidation API.
Contributing
Pull requests are welcome. Please:
- Open an issue first for anything larger than a small bug fix.
- Add tests for new behaviour.
- Run ruff check, ruff format, mypy and pytest before pushing.
License
MIT — see LICENSE for details.
Summary
llm-cache-router is a lightweight, production-ready Python library for semantic caching of LLM requests, multi-provider routing, and budget control. It saves 30–70% on LLM bills through a vector cache, switches between providers (OpenAI, Anthropic, Gemini, Ollama, MiniMax, Qwen) without changes to application code, and includes built-in cost tracking with daily/monthly limits. It supports three cache backends (in-memory / Redis / Qdrant), native streaming for all providers, and FastAPI middleware with Prometheus metrics.
Installation:
pip install llm-cache-router
# with optional backends
pip install "llm-cache-router[redis]"
pip install "llm-cache-router[qdrant]"
pip install "llm-cache-router[fastapi]"
pip install "llm-cache-router[all]"
Requires Python 3.11+. Full documentation and examples are above.