
Semantic cache, multi-provider LLM router and cost tracker (OpenAI, Anthropic, Gemini, Ollama, MiniMax, Qwen)

Project description

llm-cache-router


A lightweight, production-ready Python library that combines semantic caching, multi-provider LLM routing, and cost tracking in a single async-first API. Cut your LLM bill, ship faster, and never hardcode a single provider again.


Why llm-cache-router

Calling LLMs directly is expensive, slow, and locks you into a single vendor. This library solves all three problems at once:

  • Save money — a semantic cache returns answers for near-duplicate queries without re-calling the provider, typically cutting spend by 30–70% on production workloads.
  • Stay resilient — swap providers on the fly, use fallback chains, and never take a full outage because one vendor is down.
  • Control cost — built-in daily/monthly budget guardrails with Prometheus metrics for every request.

One dependency. Six providers. Three cache backends. Full async support.

Features

  • Semantic cache — vector-similarity matching via sentence-transformers, not just exact string hashing.
  • Multi-provider routing across OpenAI, Anthropic, Google Gemini, Ollama, MiniMax and Qwen (Dashscope).
  • Three routing strategies: CHEAPEST_FIRST, FASTEST_FIRST, FALLBACK_CHAIN.
  • Pluggable cache backends: in-memory (FAISS), Redis, Qdrant.
  • Streaming — native async SSE streaming for every provider, transparent to the cache layer.
  • Cost tracker with per-model pricing, daily/monthly budget limits and savings accounting.
  • Cache warmup with controlled concurrency for pre-production pre-loading.
  • FastAPI middleware + Prometheus metrics endpoint out of the box.
  • Typed — Pydantic v2 models everywhere, fully typed public API.
  • Tested — 10 test modules covering router, cache, providers, retry, warmup, and HTTP middleware.

Installation

pip install llm-cache-router

Optional extras:

pip install "llm-cache-router[redis]"     # Redis cache backend
pip install "llm-cache-router[qdrant]"    # Qdrant vector cache backend
pip install "llm-cache-router[fastapi]"   # FastAPI middleware + Prometheus
pip install "llm-cache-router[all]"       # everything above
pip install "llm-cache-router[dev]"       # tests, ruff, mypy

Requires Python 3.11+.

Quickstart

import asyncio
from llm_cache_router import CacheConfig, LLMRouter, RoutingStrategy


async def main() -> None:
    router = LLMRouter(
        providers={
            "openai":    {"api_key": "sk-...",           "models": ["gpt-4o-mini"]},
            "anthropic": {"api_key": "sk-ant-...",       "models": ["claude-3-5-sonnet"]},
            "gemini":    {"api_key": "AIza...",          "models": ["gemini-1.5-flash"]},
            "ollama":    {"base_url": "http://localhost:11434", "models": ["llama3.2"]},
        },
        cache=CacheConfig(
            backend="memory",
            threshold=0.92,       # cosine similarity threshold
            ttl=3600,             # cache TTL in seconds
            max_entries=10_000,
        ),
        strategy=RoutingStrategy.CHEAPEST_FIRST,
        budget={"daily_usd": 5.0, "monthly_usd": 50.0},
    )

    response = await router.complete(
        messages=[{"role": "user", "content": "What is a semantic cache?"}],
        model="gpt-4o-mini",
    )
    print(response.content)
    print(f"cache_hit={response.cache_hit} cost=${response.cost_usd:.6f}")


asyncio.run(main())
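
The threshold value is a cosine-similarity cutoff between query embeddings: with 0.92, a new prompt reuses a cached answer only if its embedding is very close to a stored one. A rough standalone illustration using sentence-transformers directly (the encoder model below is an arbitrary choice for this sketch, not necessarily the one the library uses internally):

from sentence_transformers import SentenceTransformer, util

# Illustration only: "all-MiniLM-L6-v2" is an arbitrary encoder picked for this sketch.
model = SentenceTransformer("all-MiniLM-L6-v2")

a = model.encode("What is a semantic cache?", convert_to_tensor=True)
b = model.encode("Can you explain what a semantic cache is?", convert_to_tensor=True)

print(f"cosine similarity: {util.cos_sim(a, b).item():.3f}")
# Paraphrases like these score close to 1.0; unrelated queries score far lower.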

Streaming

All providers (OpenAI, Anthropic, Gemini, Ollama, MiniMax, Qwen) support native SSE streaming. The cache layer stays transparent: on a cache hit you receive a single final chunk; on a miss you get a real streaming response, which is written to the cache once it completes.

async for chunk in router.stream(
    messages=[{"role": "user", "content": "Explain async/await in Python"}],
    model="gpt-4o-mini",
):
    print(chunk.delta, end="", flush=True)
    if chunk.is_final:
        print(f"\nprovider={chunk.provider_used} cost=${chunk.cost_usd:.6f}")

Cache Warmup

Pre-load the cache with known queries before traffic hits production:

from llm_cache_router.models import WarmupEntry

results = await router.warmup(
    entries=[
        WarmupEntry(
            messages=[{"role": "user", "content": "What is RAG?"}],
            model="gpt-4o-mini",
        ),
        WarmupEntry(
            messages=[{"role": "user", "content": "Explain vector databases"}],
            model="gpt-4o-mini",
        ),
    ],
    concurrency=5,
    skip_cached=True,
)
print(results)  # {"warmed": 2, "skipped": 0, "failed": 0}

Routing Strategies

  • CHEAPEST_FIRST: picks the cheapest provider/model by live pricing for each call.
  • FASTEST_FIRST: picks the provider with the lowest observed latency (exponential moving average).
  • FALLBACK_CHAIN: tries providers in order, falling back on error or timeout.

Example with an explicit fallback chain:

router = LLMRouter(
    providers={
        "openai":    {"api_key": "sk-...",     "models": ["gpt-4o"]},
        "anthropic": {"api_key": "sk-ant-...", "models": ["claude-3-5-sonnet"]},
    },
    strategy=RoutingStrategy.FALLBACK_CHAIN,
    fallback_chain=["openai/gpt-4o", "anthropic/claude-3-5-sonnet"],
)
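
CHEAPEST_FIRST (used in the quickstart) and FASTEST_FIRST take the same provider configuration; only the strategy value changes, and no fallback_chain should be needed. A latency-aware setup might look like this:

router = LLMRouter(
    providers={
        "openai":    {"api_key": "sk-...",     "models": ["gpt-4o-mini"]},
        "anthropic": {"api_key": "sk-ant-...", "models": ["claude-3-5-sonnet"]},
    },
    strategy=RoutingStrategy.FASTEST_FIRST,
)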

Cache Backends

In-memory (FAISS)

Default. Zero dependencies beyond the core install. Best for single-process apps and tests.

cache=CacheConfig(backend="memory", threshold=0.92, ttl=3600, max_entries=10_000)

Redis

Production-grade distributed cache with LRU eviction, configurable timeouts, retry with backoff, and a bounded candidate set for vector search.

cache=CacheConfig(
    backend="redis",
    redis_url="redis://localhost:6379/0",
    redis_namespace="llm_cache_router_prod",
    threshold=0.92,
    ttl=3600,
    max_entries=50_000,
    redis_command_timeout_sec=1.5,
    redis_retry_attempts=3,
    redis_retry_backoff_sec=0.2,
    redis_candidate_k=256,
)

Qdrant

Native vector database for very large caches (millions of entries) and cross-service deployments.

pip install "llm-cache-router[qdrant]"
cache=CacheConfig(
    backend="qdrant",
    qdrant_url="http://localhost:6333",
    qdrant_api_key=None,           # optional for Qdrant Cloud
    qdrant_collection="llm_cache",
    threshold=0.92,
    ttl=3600,
    max_entries=100_000,
)

Budget and Cost Tracking

Set per-day and per-month USD limits — requests that would exceed the budget are rejected before hitting the provider.

router = LLMRouter(
    providers={...},
    budget={"daily_usd": 5.0, "monthly_usd": 50.0},
)

stats = router.stats()
print(stats.total_cost_usd)           # total spent since start
print(stats.saved_cost_usd)           # saved via cache hits
print(stats.daily_spend_usd)
print(stats.budget_remaining_usd)     # None if no limit is set
print(stats.cache_hit_rate)           # 0.0–1.0
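
These fields also enable a simple pre-flight guard: check the remaining budget before kicking off a batch of requests. A minimal sketch using only the stats() fields shown above (the 0.50 USD floor is an arbitrary example value):

MIN_REMAINING_USD = 0.50  # arbitrary example floor; tune per workload

stats = router.stats()
if stats.budget_remaining_usd is not None and stats.budget_remaining_usd < MIN_REMAINING_USD:
    print("Budget nearly exhausted; deferring the batch job")
else:
    response = await router.complete(messages=[...], model="gpt-4o-mini")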

FastAPI Integration

pip install "llm-cache-router[fastapi]"
from fastapi import FastAPI
from llm_cache_router.middleware.fastapi import (
    add_http_metrics_middleware,
    mount_metrics_endpoint,
)

app = FastAPI()
add_http_metrics_middleware(app=app)
mount_metrics_endpoint(app=app, router=router, path="/metrics")

Exposed Prometheus metrics:

  • llm_router_http_requests_total{method,path,status}
  • llm_router_http_request_duration_seconds_* (histogram)
  • llm_router_cache_hits_total, llm_router_cache_misses_total
  • llm_router_cost_usd_total, llm_router_saved_cost_usd_total

Async Context Manager

async with LLMRouter(providers={...}) as router:
    response = await router.complete(messages=[...], model="gpt-4o-mini")
# close() is called automatically — closes provider clients and cache connections

Supported Providers

All six providers support streaming:

  • OpenAI: gpt-4o, gpt-4o-mini, o1-*, etc.
  • Anthropic: Claude 3.5 Sonnet/Haiku, Opus
  • Google Gemini: 1.5 Flash, 1.5 Pro
  • Ollama: any locally served model
  • MiniMax: MiniMax-Text-01 and others
  • Qwen (Dashscope): qwen-plus, qwen-max, etc.

To add a new provider, subclass LLMProvider and register it with @register_provider("name"); see llm_cache_router/providers/base.py.
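
A rough sketch of what that can look like (the import path and the abstract method names here are assumptions, mirroring the router's public complete()/stream() API; providers/base.py is the source of truth):

from llm_cache_router.providers.base import LLMProvider, register_provider  # assumed import path


@register_provider("my_backend")
class MyBackendProvider(LLMProvider):
    # Illustrative method names; check providers/base.py for the real abstract interface.
    async def complete(self, messages, model, **kwargs):
        ...  # call your backend and return an LLMResponse

    async def stream(self, messages, model, **kwargs):
        ...  # yield LLMStreamChunk objects as they arrive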

Architecture

llm_cache_router/
  cache/          # memory (FAISS) / redis / qdrant backends
  providers/      # openai, anthropic, gemini, ollama, minimax, qwen
  strategies/     # cheapest, fastest, fallback
  embeddings/     # SentenceEncoder, HashingEncoder
  cost/           # CostTracker with daily/monthly budgets
  middleware/     # FastAPI middleware
  observability/  # Prometheus metrics
  models.py       # Pydantic models (LLMResponse, LLMStreamChunk, ...)
  router.py       # LLMRouter — public entrypoint
  retry.py        # RetryConfig + exponential backoff
  warmup.py       # async warmup helper

Development

git clone https://github.com/svalench/llm-cache-router.git
cd llm-cache-router

# using uv (recommended)
uv sync --all-extras
uv run pytest

# or plain pip
pip install -e ".[all,dev]"
pytest

Code quality is enforced in CI via:

  • ruff check (lint) and ruff format --check (style)
  • mypy --ignore-missing-imports (type check)
  • pytest on Python 3.11, 3.12, 3.13 with coverage

Roadmap

  • v0.3 — Django helpers and middleware.
  • v0.4 — Streaming retry (reconnect on SSE drop).
  • v0.5 — Request tracing hooks (OpenTelemetry).
  • v1.0 — Full OTel spans, pluggable pricing providers, cache invalidation API.

Contributing

Pull requests are welcome. Please:

  1. Open an issue first for anything larger than a small bug fix.
  2. Add tests for new behaviour.
  3. Run ruff check, ruff format, mypy and pytest before pushing.

License

MIT — see LICENSE for details.


Short Description

llm-cache-router is a lightweight, production-ready Python library for semantic caching of LLM requests, multi-provider routing, and budget control. It saves 30–70% on LLM bills via a vector cache, switches between providers (OpenAI, Anthropic, Gemini, Ollama, MiniMax, Qwen) without changes to application code, and includes built-in cost tracking with daily/monthly limits. It supports three cache backends (in-memory / Redis / Qdrant), native streaming for all providers, and FastAPI middleware with Prometheus metrics.

Installation:

pip install llm-cache-router

# with optional backends
pip install "llm-cache-router[redis]"
pip install "llm-cache-router[qdrant]"
pip install "llm-cache-router[fastapi]"
pip install "llm-cache-router[all]"

Requires Python 3.11+. Full documentation and examples are above.

Download files

Download the file for your platform.

Source Distribution

llm_cache_router-0.2.3.tar.gz (34.6 kB)


Built Distribution


llm_cache_router-0.2.3-py3-none-any.whl (44.0 kB)


File details

Details for the file llm_cache_router-0.2.3.tar.gz.

File metadata

  • Download URL: llm_cache_router-0.2.3.tar.gz
  • Upload date:
  • Size: 34.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for llm_cache_router-0.2.3.tar.gz
  • SHA256: bf39dd9629aa18389efd566d4d4968bc65c0d0634db10f7334a0c2e103fe51b7
  • MD5: f4ec4fafa063b51e9d6800d6b86ed432
  • BLAKE2b-256: d664db5dfb656e3b3a920ca77820ebe76cbc0a7053a933bee8604b2edc737aee


Provenance

The following attestation bundles were made for llm_cache_router-0.2.3.tar.gz:

Publisher: publish.yml on svalench/llm-cache-router

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file llm_cache_router-0.2.3-py3-none-any.whl.

File hashes

Hashes for llm_cache_router-0.2.3-py3-none-any.whl
  • SHA256: 87f37233bf98bc61651d61469a402fba76118d211aaea7107721f100b1121510
  • MD5: dc86b5689c539077242d97eab9bad635
  • BLAKE2b-256: 04bd1eaf7c9b3e7c49249cacbc672785f0be3409d760e9771ab6de5646bbdd58


Provenance

The following attestation bundles were made for llm_cache_router-0.2.3-py3-none-any.whl:

Publisher: publish.yml on svalench/llm-cache-router

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
