
Semantic cache, multi-provider LLM router and cost tracker (OpenAI, Anthropic, Gemini, Ollama, MiniMax, Qwen)

Project description

llm-cache-router


A lightweight, production-ready Python library that combines semantic caching, multi-provider LLM routing, and cost tracking in a single async-first API. Cut your LLM bill, ship faster, and never hardcode a single provider again.


Why llm-cache-router

Calling LLMs directly is expensive, slow, and locks you into a single vendor. This library solves all three problems at once:

  • Save money — a semantic cache returns answers for near-duplicate queries without re-calling the provider, typically cutting spend by 30–70% on production workloads.
  • Stay resilient — swap providers on the fly, use fallback chains, and never take a full outage because one vendor is down.
  • Control cost — built-in daily/monthly budget guardrails with Prometheus metrics for every request.

One dependency. Six providers. Three cache backends. Full async support.

Features

  • Semantic cache — vector-similarity matching via sentence-transformers, not just exact string hashing.
  • Multi-provider routing across OpenAI, Anthropic, Google Gemini, Ollama, MiniMax and Qwen (Dashscope).
  • Three routing strategies: CHEAPEST_FIRST, FASTEST_FIRST, FALLBACK_CHAIN.
  • Pluggable cache backends: in-memory (FAISS), Redis, Qdrant.
  • Streaming — native async SSE streaming for every provider, transparent to the cache layer.
  • Cost tracker with per-model pricing, daily/monthly budget limits and savings accounting.
  • Cache warmup with controlled concurrency for pre-production pre-loading.
  • FastAPI middleware + Prometheus metrics endpoint out of the box.
  • Typed — Pydantic v2 models everywhere, fully typed public API.
  • Tested — 10 test modules covering router, cache, providers, retry, warmup, and HTTP middleware.

Installation

pip install llm-cache-router

Optional extras:

pip install "llm-cache-router[redis]"     # Redis cache backend
pip install "llm-cache-router[qdrant]"    # Qdrant vector cache backend
pip install "llm-cache-router[fastapi]"   # FastAPI middleware + Prometheus
pip install "llm-cache-router[all]"       # everything above
pip install "llm-cache-router[dev]"       # tests, ruff, mypy

Requires Python 3.11+.

Quickstart

import asyncio
from llm_cache_router import CacheConfig, LLMRouter, RoutingStrategy


async def main() -> None:
    router = LLMRouter(
        providers={
            "openai":    {"api_key": "sk-...",           "models": ["gpt-4o-mini"]},
            "anthropic": {"api_key": "sk-ant-...",       "models": ["claude-3-5-sonnet"]},
            "gemini":    {"api_key": "AIza...",          "models": ["gemini-1.5-flash"]},
            "ollama":    {"base_url": "http://localhost:11434", "models": ["llama3.2"]},
        },
        cache=CacheConfig(
            backend="memory",
            threshold=0.92,       # cosine similarity threshold
            ttl=3600,             # cache TTL in seconds
            max_entries=10_000,
        ),
        strategy=RoutingStrategy.CHEAPEST_FIRST,
        budget={"daily_usd": 5.0, "monthly_usd": 50.0},
    )

    response = await router.complete(
        messages=[{"role": "user", "content": "What is a semantic cache?"}],
        model="gpt-4o-mini",
    )
    print(response.content)
    print(f"cache_hit={response.cache_hit} cost=${response.cost_usd:.6f}")


asyncio.run(main())
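
The threshold value is a cosine-similarity cutoff between query embeddings: with 0.92, a new prompt reuses a cached answer only if its embedding is very close to a stored one. A rough standalone illustration using sentence-transformers directly (the encoder model below is an arbitrary choice for this sketch, not necessarily the one the library uses internally):

from sentence_transformers import SentenceTransformer, util

# Illustration only: "all-MiniLM-L6-v2" is an arbitrary encoder picked for this sketch.
model = SentenceTransformer("all-MiniLM-L6-v2")

a = model.encode("What is a semantic cache?", convert_to_tensor=True)
b = model.encode("Can you explain what a semantic cache is?", convert_to_tensor=True)

print(f"cosine similarity: {util.cos_sim(a, b).item():.3f}")
# Paraphrases like these score close to 1.0; unrelated queries score far lower.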

Streaming

All providers (OpenAI, Anthropic, Gemini, Ollama, MiniMax, Qwen) support native SSE streaming. The cache layer stays transparent: on a cache hit you receive a single final chunk; on a miss you get a real streaming response, which is written to the cache once it completes.

async for chunk in router.stream(
    messages=[{"role": "user", "content": "Explain async/await in Python"}],
    model="gpt-4o-mini",
):
    print(chunk.delta, end="", flush=True)
    if chunk.is_final:
        print(f"\nprovider={chunk.provider_used} cost=${chunk.cost_usd:.6f}")

Cache Warmup

Pre-load the cache with known queries before traffic hits production:

from llm_cache_router.models import WarmupEntry

results = await router.warmup(
    entries=[
        WarmupEntry(
            messages=[{"role": "user", "content": "What is RAG?"}],
            model="gpt-4o-mini",
        ),
        WarmupEntry(
            messages=[{"role": "user", "content": "Explain vector databases"}],
            model="gpt-4o-mini",
        ),
    ],
    concurrency=5,
    skip_cached=True,
)
print(results)  # {"warmed": 2, "skipped": 0, "failed": 0}

Routing Strategies

  • CHEAPEST_FIRST: picks the cheapest provider/model by live pricing for each call.
  • FASTEST_FIRST: picks the provider with the lowest observed latency (exponential moving average).
  • FALLBACK_CHAIN: tries providers in order, falling back on error or timeout.

Example with an explicit fallback chain:

router = LLMRouter(
    providers={
        "openai":    {"api_key": "sk-...",     "models": ["gpt-4o"]},
        "anthropic": {"api_key": "sk-ant-...", "models": ["claude-3-5-sonnet"]},
    },
    strategy=RoutingStrategy.FALLBACK_CHAIN,
    fallback_chain=["openai/gpt-4o", "anthropic/claude-3-5-sonnet"],
)
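
CHEAPEST_FIRST (used in the quickstart) and FASTEST_FIRST take the same provider configuration; only the strategy value changes, and no fallback_chain should be needed. A latency-aware setup might look like this:

router = LLMRouter(
    providers={
        "openai":    {"api_key": "sk-...",     "models": ["gpt-4o-mini"]},
        "anthropic": {"api_key": "sk-ant-...", "models": ["claude-3-5-sonnet"]},
    },
    strategy=RoutingStrategy.FASTEST_FIRST,
)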

Cache Backends

In-memory (FAISS)

Default. Zero dependencies beyond the core install. Best for single-process apps and tests.

cache=CacheConfig(backend="memory", threshold=0.92, ttl=3600, max_entries=10_000)

Redis

Production-grade distributed cache with LRU eviction, configurable timeouts, retry with backoff, and a bounded candidate set for vector search.

cache=CacheConfig(
    backend="redis",
    redis_url="redis://localhost:6379/0",
    redis_namespace="llm_cache_router_prod",
    threshold=0.92,
    ttl=3600,
    max_entries=50_000,
    redis_command_timeout_sec=1.5,
    redis_retry_attempts=3,
    redis_retry_backoff_sec=0.2,
    redis_candidate_k=256,
)

Qdrant

Native vector database for very large caches (millions of entries) and cross-service deployments.

pip install "llm-cache-router[qdrant]"
cache=CacheConfig(
    backend="qdrant",
    qdrant_url="http://localhost:6333",
    qdrant_api_key=None,           # optional for Qdrant Cloud
    qdrant_collection="llm_cache",
    threshold=0.92,
    ttl=3600,
    max_entries=100_000,
)

Budget and Cost Tracking

Set per-day and per-month USD limits — requests that would exceed the budget are rejected before hitting the provider.

router = LLMRouter(
    providers={...},
    budget={"daily_usd": 5.0, "monthly_usd": 50.0},
)

stats = router.stats()
print(stats.total_cost_usd)           # total spent since start
print(stats.saved_cost_usd)           # saved via cache hits
print(stats.daily_spend_usd)
print(stats.budget_remaining_usd)     # None if no limit is set
print(stats.cache_hit_rate)           # 0.0–1.0
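
These fields also enable a simple pre-flight guard: check the remaining budget before kicking off a batch of requests. A minimal sketch using only the stats() fields shown above (the 0.50 USD floor is an arbitrary example value):

MIN_REMAINING_USD = 0.50  # arbitrary example floor; tune per workload

stats = router.stats()
if stats.budget_remaining_usd is not None and stats.budget_remaining_usd < MIN_REMAINING_USD:
    print("Budget nearly exhausted; deferring the batch job")
else:
    response = await router.complete(messages=[...], model="gpt-4o-mini")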

FastAPI Integration

pip install "llm-cache-router[fastapi]"
from fastapi import FastAPI
from llm_cache_router.middleware.fastapi import (
    add_http_metrics_middleware,
    mount_metrics_endpoint,
)

app = FastAPI()
add_http_metrics_middleware(app=app)
mount_metrics_endpoint(app=app, router=router, path="/metrics")

Exposed Prometheus metrics:

  • llm_router_http_requests_total{method,path,status}
  • llm_router_http_request_duration_seconds_* (histogram)
  • llm_router_cache_hits_total, llm_router_cache_misses_total
  • llm_router_cost_usd_total, llm_router_saved_cost_usd_total

Async Context Manager

async with LLMRouter(providers={...}) as router:
    response = await router.complete(messages=[...], model="gpt-4o-mini")
# close() is called automatically — closes provider clients and cache connections

Supported Providers

All six providers support streaming:

  • OpenAI: gpt-4o, gpt-4o-mini, o1-*, etc.
  • Anthropic: Claude 3.5 Sonnet/Haiku, Opus
  • Google Gemini: 1.5 Flash, 1.5 Pro
  • Ollama: any locally served model
  • MiniMax: MiniMax-Text-01 and others
  • Qwen (Dashscope): qwen-plus, qwen-max, etc.

To add a new provider, subclass LLMProvider and register it with @register_provider("name"); see llm_cache_router/providers/base.py.
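
A rough sketch of what that can look like (the import path and the abstract method names here are assumptions, mirroring the router's public complete()/stream() API; providers/base.py is the source of truth):

from llm_cache_router.providers.base import LLMProvider, register_provider  # assumed import path


@register_provider("my_backend")
class MyBackendProvider(LLMProvider):
    # Illustrative method names; check providers/base.py for the real abstract interface.
    async def complete(self, messages, model, **kwargs):
        ...  # call your backend and return an LLMResponse

    async def stream(self, messages, model, **kwargs):
        ...  # yield LLMStreamChunk objects as they arrive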

Architecture

llm_cache_router/
  cache/          # memory (FAISS) / redis / qdrant backends
  providers/      # openai, anthropic, gemini, ollama, minimax, qwen
  strategies/     # cheapest, fastest, fallback
  embeddings/     # SentenceEncoder, HashingEncoder
  cost/           # CostTracker with daily/monthly budgets
  middleware/     # FastAPI middleware
  observability/  # Prometheus metrics
  models.py       # Pydantic models (LLMResponse, LLMStreamChunk, ...)
  router.py       # LLMRouter — public entrypoint
  retry.py        # RetryConfig + exponential backoff
  warmup.py       # async warmup helper

Development

git clone https://github.com/svalench/llm-cache-router.git
cd llm-cache-router

# using uv (recommended)
uv sync --all-extras
uv run pytest

# or plain pip
pip install -e ".[all,dev]"
pytest

Code quality is enforced in CI via:

  • ruff check (lint) and ruff format --check (style)
  • mypy --ignore-missing-imports (type check)
  • pytest on Python 3.11, 3.12, 3.13 with coverage

Roadmap

  • v0.3 — Django helpers and middleware.
  • v0.4 — Streaming retry (reconnect on SSE drop).
  • v0.5 — Request tracing hooks (OpenTelemetry).
  • v1.0 — Full OTel spans, pluggable pricing providers, cache invalidation API.

Contributing

Pull requests are welcome. Please:

  1. Open an issue first for anything larger than a small bug fix.
  2. Add tests for new behaviour.
  3. Run ruff check, ruff format, mypy and pytest before pushing.

License

MIT — see LICENSE for details.


Short Description

llm-cache-router is a lightweight, production-ready Python library for semantic caching of LLM requests, multi-provider routing, and budget control. It saves 30–70% on LLM bills via a vector cache, switches between providers (OpenAI, Anthropic, Gemini, Ollama, MiniMax, Qwen) without changes to application code, and includes built-in cost tracking with daily/monthly limits. It supports three cache backends (in-memory / Redis / Qdrant), native streaming for all providers, and FastAPI middleware with Prometheus metrics.

Installation:

pip install llm-cache-router

# with optional backends
pip install "llm-cache-router[redis]"
pip install "llm-cache-router[qdrant]"
pip install "llm-cache-router[fastapi]"
pip install "llm-cache-router[all]"

Requires Python 3.11+. Full documentation and examples are above.

Download files

Download the file for your platform.

Source Distribution

llm_cache_router-0.2.3.tar.gz (34.6 kB)


Built Distribution


llm_cache_router-0.2.3-py3-none-any.whl (44.0 kB)


File details

Details for the file llm_cache_router-0.2.3.tar.gz.

File metadata

  • Download URL: llm_cache_router-0.2.3.tar.gz
  • Upload date:
  • Size: 34.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for llm_cache_router-0.2.3.tar.gz
  • SHA256: bf39dd9629aa18389efd566d4d4968bc65c0d0634db10f7334a0c2e103fe51b7
  • MD5: f4ec4fafa063b51e9d6800d6b86ed432
  • BLAKE2b-256: d664db5dfb656e3b3a920ca77820ebe76cbc0a7053a933bee8604b2edc737aee


Provenance

The following attestation bundles were made for llm_cache_router-0.2.3.tar.gz:

Publisher: publish.yml on svalench/llm-cache-router

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file llm_cache_router-0.2.3-py3-none-any.whl.

File hashes

Hashes for llm_cache_router-0.2.3-py3-none-any.whl
  • SHA256: 87f37233bf98bc61651d61469a402fba76118d211aaea7107721f100b1121510
  • MD5: dc86b5689c539077242d97eab9bad635
  • BLAKE2b-256: 04bd1eaf7c9b3e7c49249cacbc672785f0be3409d760e9771ab6de5646bbdd58


Provenance

The following attestation bundles were made for llm_cache_router-0.2.3-py3-none-any.whl:

Publisher: publish.yml on svalench/llm-cache-router

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
