Drop-in FastAPI middleware and reverse proxy with semantic caching for APIs and LLMs

fastapi-semcache

Semantic caching middleware and reverse proxy for APIs and LLMs, with embeddings, pgvector similarity search, and Redis-backed response caching.

The PyPI distribution and the GitHub repository are both named fastapi-semcache; the import package remains semanticcache.

Why fastapi-semcache?

This package is designed to integrate directly into modern Python API stacks with minimal refactoring. It keeps the caching path simple and gives you explicit control over embeddings, vector search, and cache behavior.

It includes FastAPI middleware as a first-class integration path and can also run as a reverse proxy in front of an upstream API or LLM service. Django and Flask middleware are planned for a future release so you can hook semantic caching into those stacks the same way as FastAPI.

FastAPI middleware

Add SemanticCacheMiddleware to your app and reuse one SemanticCache instance for all requests. Configure Postgres, Redis, and the embedder with SEMANTIC_CACHE_* environment variables (see .env.example). By default only POST requests are intercepted; the middleware derives cache-key text from JSON bodies using query, prompt, input, or chat-style messages (see default_extract_query in semanticcache.middleware). Successful responses whose body parses as a JSON object are candidates for storage.

from typing import Any

from fastapi import FastAPI

from semanticcache import SemanticCache, SemanticCacheMiddleware

app = FastAPI()
cache = SemanticCache()
app.add_middleware(SemanticCacheMiddleware, cache=cache)


@app.post("/v1/chat/completions")
async def chat_completions(body: dict[str, Any]) -> dict[str, Any]:
    # Clients should send JSON with prompt, query, input, or chat messages so the
    # middleware can build the cache key (see default_extract_query). Misses run your
    # handler; hits short-circuit with a cached JSON body.
    return {"choices": [{"message": {"role": "assistant", "content": "Hello"}}]}

Run with uvicorn mymodule:app --host 0.0.0.0 --port 8000.

Custom cache key text (`extract_query`)

If your JSON body does not follow the usual query / prompt / messages patterns, pass an async callable as extract_query. It receives the Starlette Request and the raw body bytes (already buffered by the middleware). Return a non-empty string to embed and look up; return None to skip semantic caching for that request (the route still runs).

You can wrap default_extract_query and add fallbacks for your own fields, or replace it entirely.

from fastapi import FastAPI, Request

from semanticcache import SemanticCache
from semanticcache.middleware import SemanticCacheMiddleware, default_extract_query

async def extract_query(request: Request, body: bytes) -> str | None:
    base = await default_extract_query(request, body)
    if base is not None:
        return base
    # Parse ``body`` for your schema; return None to bypass the cache.
    return None

app = FastAPI()
cache = SemanticCache()
app.add_middleware(
    SemanticCacheMiddleware,
    cache=cache,
    extract_query=extract_query,
)

Use extract_model when the cache key should also vary by model id from headers or JSON (same async (request, body) -> str | None idea). For create_semantic_cache_proxy_app, pass extract_query=... (and other middleware options) as keyword arguments; they are forwarded to SemanticCacheMiddleware.

Other advanced options (path_prefix, HTTP 429 circuit breaker via cache_settings, enabled=False) are documented on SemanticCacheMiddleware in semanticcache.middleware.fastapi. On shutdown, call await cache.close() from a lifespan handler if you want pools closed cleanly.
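
The shutdown step can be sketched as a lifespan factory. The helper name make_lifespan is hypothetical; the only library call it relies on is the await cache.close() mentioned above, and the FastAPI wiring is shown as comments so the snippet runs without FastAPI installed.

```python
import asyncio
from collections.abc import AsyncIterator, Callable
from contextlib import AbstractAsyncContextManager, asynccontextmanager
from typing import Any


def make_lifespan(cache: Any) -> Callable[[Any], AbstractAsyncContextManager[None]]:
    """Build a lifespan handler that closes `cache` when the app shuts down."""

    @asynccontextmanager
    async def lifespan(app: Any) -> AsyncIterator[None]:
        try:
            yield  # the application serves requests while suspended here
        finally:
            await cache.close()  # release connection pools on shutdown

    return lifespan


# With FastAPI and fastapi-semcache installed, the wiring would look like:
#   cache = SemanticCache()
#   app = FastAPI(lifespan=make_lifespan(cache))
#   app.add_middleware(SemanticCacheMiddleware, cache=cache)
```

Building the handler from a factory keeps a single SemanticCache instance shared between the middleware and the shutdown hook.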

What is implemented

Hugging Face embeddings via Sentence Transformers (embedder_type="huggingface").
OpenAI embeddings via the official async client (embedder_type="openai"; install embed-openai and set OPENAI_API_KEY). Use OpenAIEmbedder(..., send_dimensions_to_api=False) when the model has a fixed output size and the API must not get a dimensions field.
PostgreSQL + pgvector for semantic similarity lookup. The library creates a dedicated cache table per embedder configuration (derived from model id and vector dimension) on first use, so you are not tied to a single hard-coded vector width.
Redis for response caching (keys include an embedder-specific prefix so separate models do not collide).
FastAPI middleware for in-app semantic caching.
Reverse proxy mode via create_semantic_cache_proxy_app().
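
To illustrate why deriving storage names from the model id and vector dimension keeps embedders apart, here is a small sketch. The naming scheme, the semcache_ prefix, and both function names are invented for this example; the library's actual table names and Redis prefixes may differ.

```python
import hashlib
import re


def sketch_namespace(model_id: str, dimension: int) -> str:
    """Hypothetical identifier for one embedder configuration."""
    if dimension <= 0:
        raise ValueError("dimension must be positive")
    # Reduce the model id to characters that are safe in a SQL identifier.
    slug = re.sub(r"[^a-z0-9]+", "_", model_id.lower()).strip("_")[:32]
    # A short digest keeps names distinct even when slugs collide or truncate.
    digest = hashlib.sha256(f"{model_id}:{dimension}".encode()).hexdigest()[:8]
    return f"semcache_{slug}_{dimension}_{digest}"


def sketch_redis_key(model_id: str, dimension: int, entry_id: str) -> str:
    """Hypothetical Redis key that carries the embedder-specific prefix."""
    return f"{sketch_namespace(model_id, dimension)}:{entry_id}"
```

Two configurations that differ in either model id or dimension get different namespaces, so vectors of different widths never share a table and cached responses never share a key.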

Streaming and chunked responses

Today the middleware buffers the full downstream response before sending it to the client. That applies even when your route returns a streaming-style response (for example token streaming); the bytes are collected first, then returned as one response. Cached hits are served as ordinary JSON bodies. The reverse proxy uses httpx’s full response body, not a streamed upstream read.

Chunked pass-through and streaming-friendly caching are planned so SSE and similar flows can deliver early bytes while still integrating with semantic caching where feasible.
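
The buffering behavior described above can be pictured with plain asyncio. This is a simplified sketch, not the middleware's code: token_stream stands in for a streaming route and buffer_body for the collection step, and both names are hypothetical.

```python
import asyncio
from collections.abc import AsyncIterator


async def token_stream() -> AsyncIterator[bytes]:
    """Stand-in for a route that yields a response chunk by chunk."""
    for token in (b"Hel", b"lo ", b"world"):
        await asyncio.sleep(0)  # give control back to the event loop between chunks
        yield token


async def buffer_body(chunks: AsyncIterator[bytes]) -> bytes:
    """Collect every chunk before anything is handed to the client."""
    collected = bytearray()
    async for chunk in chunks:
        collected.extend(chunk)
    return bytes(collected)


if __name__ == "__main__":
    print(asyncio.run(buffer_body(token_stream())))
```

The client sees nothing until the last chunk has been produced, which is why time to first byte for streamed routes equals the full generation time under the current design.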

Future support

Chunked / streaming responses for the middleware (and related proxy behavior): pass-through streaming instead of full buffering; see Streaming and chunked responses.
Django and Flask middleware for in-app semantic caching (not yet shipped; same role as the FastAPI middleware).

Embeddings from the following providers are planned:

Ollama (HTTP embedding API against a configurable base URL, so the server can run locally or on another host).
Cohere
Voyage

Reverse proxy

The reverse proxy mode is optional: it forwards traffic to an upstream base URL while using the same semantic cache middleware. Use it when you want a standalone hop in front of another service rather than importing routes into your FastAPI app.

Minimal programmatic setup:

from semanticcache import SemanticCache, create_semantic_cache_proxy_app

cache = SemanticCache()
app = create_semantic_cache_proxy_app(
    upstream="http://127.0.0.1:11434",
    cache=cache,
)

Run with uvicorn mymodule:app --host 0.0.0.0 --port 8080.

This repository includes a small ASGI app at app/main.py (import app for uvicorn). Set SEMANTIC_CACHE_PROXY_UPSTREAM to the backend base URL; the default is http://127.0.0.1:11434.

uv run uvicorn app.main:app --host 0.0.0.0 --port 8080
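
For a custom entry point, the upstream lookup might be sketched as follows. Only the variable name SEMANTIC_CACHE_PROXY_UPSTREAM and its default come from this README; the helper resolve_upstream, the URL validation, and the trailing-slash handling are assumptions and may not match what app/main.py does.

```python
import os
from collections.abc import Mapping
from urllib.parse import urlsplit

DEFAULT_UPSTREAM = "http://127.0.0.1:11434"


def resolve_upstream(env: Mapping[str, str] = os.environ) -> str:
    """Read the upstream base URL from the environment, with a default."""
    raw = env.get("SEMANTIC_CACHE_PROXY_UPSTREAM", "").strip() or DEFAULT_UPSTREAM
    parts = urlsplit(raw)
    # Fail early on values that cannot be an HTTP base URL.
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"upstream must be an http(s) base URL, got {raw!r}")
    return raw.rstrip("/")


# The result would then be passed on, for example:
#   app = create_semantic_cache_proxy_app(upstream=resolve_upstream(), cache=cache)
```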

See create_semantic_cache_proxy_app in semanticcache.proxy for timeout, TLS verification, httpx_client_kwargs, and middleware options such as path_prefix and extract_query.

Install

pip install fastapi-semcache

Custom embedders: subclass BaseEmbedder from semanticcache.embedders and pass it to SemanticCache(embedder=...) to skip the optional embedding extras. See docs/embedders.md.

Optional extras:

embed-huggingface / embed-huggingface-cpu: Sentence Transformers with CPU PyTorch.
embed-huggingface-gpu: Sentence Transformers with a CUDA-enabled PyTorch install.
embed-openai: OpenAI embeddings (openai, tiktoken).

CPU

pip install "fastapi-semcache[embed-huggingface-cpu]"
# or: pip install "fastapi-semcache[embed-huggingface]"

GPU

Pick a CUDA version that matches your system from the PyTorch Get Started page, then install with the matching index URL so pip selects CUDA wheels.

pip install "fastapi-semcache[embed-huggingface-gpu]" \
  --extra-index-url https://download.pytorch.org/whl/cu124

OpenAI embeddings

Install the OpenAI extra so embedder_type="openai" works (pulls openai and tiktoken). Set OPENAI_API_KEY in your environment.

pip install "fastapi-semcache[embed-openai]"

Requirements

Python 3.12+.

License

Apache-2.0. See LICENSE.

Release history

0.4.1: May 23, 2026
0.4.0: May 14, 2026
0.3.1: May 12, 2026
0.3.0: May 11, 2026
0.2.22: May 10, 2026
0.2.21: May 9, 2026
0.2.19: May 9, 2026
0.2.18: May 9, 2026
0.2.17: May 9, 2026
0.2.16: May 8, 2026
0.2.14: May 7, 2026
0.2.13: May 7, 2026
0.2.12: May 7, 2026
0.2.11: May 6, 2026
0.2.10: May 6, 2026
0.2.9: May 5, 2026
0.2.8: May 5, 2026 (this version)
0.2.7: May 4, 2026

Download files

Source Distribution

fastapi_semcache-0.2.8.tar.gz (31.8 kB)

Uploaded May 5, 2026 Source

Built Distribution

fastapi_semcache-0.2.8-py3-none-any.whl (36.6 kB)

Uploaded May 5, 2026 Python 3

File details

Details for the file fastapi_semcache-0.2.8.tar.gz.

File metadata

Download URL: fastapi_semcache-0.2.8.tar.gz
Upload date: May 5, 2026
Size: 31.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.11.6 (uv publish, Fedora Linux 42)

File hashes

Hashes for fastapi_semcache-0.2.8.tar.gz
Algorithm	Hash digest
SHA256	`9fd2cd733c050217b2b8d9c311d8359dc19774b4e6962e0214af037a9d387be8`
MD5	`b8a18a684d32201acf927bf3cd92251e`
BLAKE2b-256	`e58278e9d106f8cf5de0c331e082a706ea6fb1fc9450a5e7101432532ad515dc`

File details

Details for the file fastapi_semcache-0.2.8-py3-none-any.whl.

File metadata

Download URL: fastapi_semcache-0.2.8-py3-none-any.whl
Upload date: May 5, 2026
Size: 36.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.11.6 (uv publish, Fedora Linux 42)

File hashes

Hashes for fastapi_semcache-0.2.8-py3-none-any.whl
Algorithm	Hash digest
SHA256	`692d47b67b993c0c2edb579a7bea99d5dd2f5f065807484be6cfcf2a82d2bb66`
MD5	`dd25d71a3b35822bbbe969e8b839d086`
BLAKE2b-256	`5c6198d28aa4255a51808368b675e189e58785896fa365054e0203c9c46dc123`
