Semantic LLM answer cache — reuse paraphrased queries, cut latency and token cost.
cogcache
cogcache caches LLM responses by semantic similarity instead of exact key match. When a paraphrased query arrives, it returns the previous answer in milliseconds — zero LLM tokens spent.
"What is semantic caching?" → LLM call (4.2s, 320 tokens)
"What does semantic caching mean?" → Cache HIT (0.5ms, 0 tokens) ← 99% savings
Install
pip install cogcache # core library
pip install "cogcache[redis]" # + Redis Stack backend (HNSW vector search)
pip install "cogcache[prometheus]" # + Prometheus metrics sink
pip install "cogcache[openai-judge]" # + LLM-as-Judge quality scoring
pip install "cogcache[langchain]" # + LangChain BaseCache adapter
pip install "cogcache[all]" # everything (quotes keep shells such as zsh from globbing the brackets)
Quick start
from cogcache import CogniCache
cache = CogniCache(similarity_threshold=0.92)
def my_llm(query: str) -> str:
    # Your real LLM call here (OpenAI, Anthropic, DashScope, ...)
    return openai_client.chat.completions.create(...).choices[0].message.content
# First call → LLM
answer = cache.query("What is gradient descent?", llm_fn=my_llm)
# Second call → cache hit, zero LLM cost
answer = cache.query("Explain gradient descent.", llm_fn=my_llm)
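To make the hit/miss decision concrete, here is a minimal sketch of threshold-based cosine matching. It is not cogcache's implementation: `toy_embed`, `cosine`, and `TinySemanticCache` are hypothetical names, and the bag-of-words "embedding" is a stand-in that only captures word overlap, so it uses a much lower threshold than the 0.92 shown above. A real embedding model would score paraphrases far higher.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Toy bag-of-words vector (assumption: stands in for a real embedder)."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class TinySemanticCache:
    """Linear-scan semantic cache: return the best match above a threshold."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold
        self.entries = []  # list of (embedding, answer) pairs

    def store(self, query: str, answer: str) -> None:
        self.entries.append((toy_embed(query), answer))

    def lookup(self, query: str):
        q = toy_embed(query)
        best_score, best_answer = 0.0, None
        for emb, answer in self.entries:
            score = cosine(q, emb)
            if score > best_score:
                best_score, best_answer = score, answer
        # Only a sufficiently similar entry counts as a hit.
        return best_answer if best_score >= self.threshold else None


cache_sketch = TinySemanticCache(threshold=0.5)
cache_sketch.store("What is semantic caching?", "cached answer")
hit = cache_sketch.lookup("What does semantic caching mean?")   # similar enough
miss = cache_sketch.lookup("How do I bake bread?")              # no overlap
```

The linear scan is fine for a sketch; a vector index such as the HNSW search in the Redis backend replaces it when the cache grows.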
As a decorator
@cache.cached(threshold=0.90)
def ask_llm(query: str) -> str:
    return my_llm(query)
ask_llm("What is X?") # LLM call
ask_llm("Tell me X.") # cache hit
With LangChain
from cogcache.integrations.langchain import CogniCacheLangChain
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(cache=CogniCacheLangChain(cache))
Features
| Feature | Default |
|---|---|
| Cosine-similarity semantic matching | ✅ |
| Pluggable stores: MemoryStore / RedisStore (Redis Stack HNSW) | ✅ |
| TTL eviction on read & write paths | ✅ |
| LLM-as-Judge with "write strict, hit lenient" policy | optional |
| Prometheus + JSON metrics sink | optional |
| Route / intent isolation (multi-tenant safe) | ✅ |
| Fail-open on backend failures | ✅ |
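As an illustration of the "TTL eviction on read & write paths" row, the following sketch shows one common way to do it: drop a stale entry when it is read, and sweep stale entries before each write. `TTLStore` and its injectable `clock` are hypothetical; cogcache's stores may implement eviction differently. The `-1` convention for "no expiry" follows the `ttl` option in the Configuration section.

```python
import time


class TTLStore:
    """Exact-key store with TTL eviction on both the read and write paths."""

    def __init__(self, ttl: float, clock=time.monotonic) -> None:
        self.ttl = ttl        # seconds; -1 means entries never expire
        self.clock = clock    # injectable so the behaviour can be tested
        self._data = {}       # key -> (value, stored_at)

    def _expired(self, stored_at: float) -> bool:
        return self.ttl >= 0 and self.clock() - stored_at >= self.ttl

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        value, stored_at = item
        if self._expired(stored_at):
            del self._data[key]  # read-path eviction
            return None
        return value

    def set(self, key, value) -> None:
        # Write-path eviction: sweep stale entries before inserting.
        stale = [k for k, (_, t) in self._data.items() if self._expired(t)]
        for k in stale:
            del self._data[k]
        self._data[key] = (value, self.clock())

    def __len__(self) -> int:
        return len(self._data)
```

Evicting on both paths means a stale answer is never served, and memory is reclaimed even for keys that are never read again.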
Configuration
CogniCache(
    redis_url="redis://localhost:6379/0", # None = in-memory
    similarity_threshold=0.92, # 0.85–0.95 typical
    max_cache_size=10_000,
    ttl=3600, # -1 for no expiry
    vector_dim=512, # match your embedder
    enable_judge=True, # LLM Judge quality gate
    write_min_quality=0.8,
    judge_on_hit=False, # async hit-time warning
    embed_fn=my_custom_embedder, # or use the default
    metrics=MetricsCollector(), # observability hook
)
See tuning guide for threshold selection, embedding model comparison, and Prometheus alert thresholds.
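One way to reason about threshold selection is to sweep candidate thresholds over query pairs that you have labelled by hand. The sketch below is an assumption about how such a sweep could look, not part of cogcache: `sweep` is a hypothetical helper, and `SAMPLE_PAIRS` contains made-up similarity scores purely to show the mechanics.

```python
# Each pair: (similarity score, True if both queries deserve the same answer).
# These numbers are invented for illustration; use your own labelled traffic.
SAMPLE_PAIRS = [
    (0.97, True),
    (0.93, True),
    (0.91, False),
    (0.88, True),
    (0.86, False),
    (0.70, False),
]


def sweep(pairs, thresholds):
    """Return (threshold, hit_rate, false_hit_rate) for each threshold."""
    rows = []
    for t in thresholds:
        hits = [same for score, same in pairs if score >= t]
        hit_rate = len(hits) / len(pairs)
        false_hits = sum(1 for same in hits if not same)
        false_hit_rate = false_hits / len(hits) if hits else 0.0
        rows.append((t, hit_rate, false_hit_rate))
    return rows


for threshold, hit_rate, false_hit_rate in sweep(SAMPLE_PAIRS, [0.85, 0.90, 0.95]):
    print(f"threshold={threshold:.2f} hit_rate={hit_rate:.2f} false_hits={false_hit_rate:.2f}")
```

A lower threshold raises the hit rate but serves more wrong answers; the right trade-off depends on how costly a wrong cached answer is for your application.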
When to use cogcache
✅ High-QPS chatbots where users phrase the same question different ways
✅ RAG systems with repetitive paraphrased queries
✅ Multi-tenant LLM APIs where you bill per token
✅ Demo / dev environments where you want to skip LLM calls on repeat
❌ Personalized answers (use route=user_id isolation if you must)
❌ Real-time data (weather, prices) — set short TTL or skip caching
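The route isolation mentioned for personalized answers can be pictured as a separate key space per route. The sketch below is a hypothetical illustration (`RoutedCache` is not a cogcache class, and it uses exact-match keys to stay short); it only shows why an answer stored under one user's route is never returned for another.

```python
from typing import Optional


class RoutedCache:
    """Keeps one key space per route so entries never cross tenants."""

    def __init__(self) -> None:
        self._by_route = {}  # route -> {query: answer}

    def get(self, route: str, query: str) -> Optional[str]:
        # A route with no entries yet simply misses.
        return self._by_route.get(route, {}).get(query)

    def set(self, route: str, query: str, answer: str) -> None:
        self._by_route.setdefault(route, {})[query] = answer


routed = RoutedCache()
routed.set("user-alice", "What plan am I on?", "You are on the Pro plan.")
alice_answer = routed.get("user-alice", "What plan am I on?")  # Alice's entry
bob_answer = routed.get("user-bob", "What plan am I on?")      # miss for Bob
```

Per-user routes trade hit rate for safety: each user only benefits from their own earlier questions.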
Production readiness
- ✅ Thread-safe MemoryStore and MetricsCollector
- ✅ Fail-open: Redis disconnect / Judge crash never breaks your request path
- ✅ 49 unit tests, run with pytest -q
- ✅ Used in production at AI_Cost_Optimization reference deployment
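The fail-open behaviour listed above can be sketched as follows. This is a generic pattern under stated assumptions, not cogcache's code: `query_fail_open`, `cache_get`, and `cache_set` are hypothetical names. Any error from the cache backend is logged and treated as a miss, so the request still reaches the LLM.

```python
import logging

logger = logging.getLogger("failopen_sketch")


def query_fail_open(query, llm_fn, cache_get, cache_set):
    """Answer a query, treating any cache backend error as a cache miss."""
    try:
        cached = cache_get(query)
    except Exception:
        logger.warning("cache read failed; falling back to the LLM", exc_info=True)
        cached = None

    if cached is not None:
        return cached

    answer = llm_fn(query)

    try:
        cache_set(query, answer)
    except Exception:
        # A failed write only costs a future cache hit, never the response.
        logger.warning("cache write failed; answer not cached", exc_info=True)
    return answer
```

With a healthy backend this behaves like an ordinary read-through cache; with a broken one it degrades to calling the LLM every time.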
Try it live
For a complete demo with FastAPI backend, admin dashboard, Prometheus exporter, and Docker Compose setup, see the cogcache-playground repo:
git clone https://github.com/AaronharveyHan/cogcache-playground.git
cd cogcache-playground && docker compose up
# Open http://localhost:8000/admin
License
MIT — see LICENSE.