Semantic LLM answer cache — reuse paraphrased queries, cut latency and token cost.

cogcache

cogcache caches LLM responses by semantic similarity instead of exact key match. When a paraphrased query arrives, it returns the previous answer in milliseconds — zero LLM tokens spent.

"What is semantic caching?"      → LLM call (4.2s, 320 tokens)
"What does semantic caching mean?" → Cache HIT (0.5ms, 0 tokens)   ← 99% savings

Install

pip install cogcache                   # core library
pip install cogcache[redis]            # + Redis Stack backend (HNSW vector search)
pip install cogcache[prometheus]       # + Prometheus metrics sink
pip install cogcache[openai-judge]     # + LLM-as-Judge quality scoring
pip install cogcache[langchain]        # + LangChain BaseCache adapter
pip install cogcache[all]              # everything

Quick start

from cogcache import CogniCache

cache = CogniCache(similarity_threshold=0.92)

def my_llm(query: str) -> str:
    # Your real LLM call here (OpenAI, Anthropic, DashScope, ...)
    return openai_client.chat.completions.create(...).choices[0].message.content

# First call → LLM
answer = cache.query("What is gradient descent?", llm_fn=my_llm)

# Second call → cache hit, zero LLM cost
answer = cache.query("Explain gradient descent.", llm_fn=my_llm)
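
The two calls above can be pictured with a toy model. This is a minimal sketch of similarity-based lookup, not cogcache's implementation: `ToySemanticCache`, `toy_embed`, and `cosine` are hypothetical names, and the bag-of-words embedder is a stand-in that only recognizes shared words. A real deployment would use a sentence-embedding model so that true paraphrases score high.

```python
import math
from collections import Counter
from typing import Callable, Optional


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in embedder: bag-of-words counts over lowercase tokens."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return Counter(cleaned.split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToySemanticCache:
    """Assumed design: linear scan over stored (embedding, answer) pairs."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold
        self.entries: list[tuple[Counter, str]] = []

    def lookup(self, query: str) -> Optional[str]:
        """Return the best-matching stored answer, or None below the threshold."""
        vec = toy_embed(query)
        best_score, best_answer = 0.0, None
        for stored_vec, answer in self.entries:
            score = cosine(vec, stored_vec)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None

    def query(self, query: str, llm_fn: Callable[[str], str]) -> str:
        """Serve from the cache on a hit; otherwise call llm_fn and store the result."""
        hit = self.lookup(query)
        if hit is not None:
            return hit
        answer = llm_fn(query)
        self.entries.append((toy_embed(query), answer))
        return answer
```

The linear scan is only reasonable for small caches; an approximate nearest-neighbour index (such as the HNSW search the Redis backend is described as using) serves the same role at scale.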

As a decorator

@cache.cached(threshold=0.90)
def ask_llm(query: str) -> str:
    return my_llm(query)

ask_llm("What is X?")   # LLM call
ask_llm("Tell me X.")    # cache hit

With LangChain

from cogcache.integrations.langchain import CogniCacheLangChain
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(cache=CogniCacheLangChain(cache))

Features

Feature (on by default unless marked optional):

  • Cosine-similarity semantic matching
  • Pluggable stores: MemoryStore / RedisStore (Redis Stack HNSW)
  • TTL eviction on read and write paths
  • LLM-as-Judge with "write strict, hit lenient" policy (optional)
  • Prometheus + JSON metrics sink (optional)
  • Route / intent isolation (multi-tenant safe)
  • Fail-open on backend failures
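
To make "TTL eviction on read and write paths" concrete, here is a minimal sketch under stated assumptions. `TTLStore` is a hypothetical class, not cogcache's MemoryStore: it drops a stale entry lazily when it is read, sweeps expired entries before each write, and evicts the oldest entry when the store is full. Treating a negative TTL as "no expiry" mirrors the `ttl=-1` convention in the configuration section; the size-based eviction order is an assumption.

```python
import time
from typing import Callable, Optional


class TTLStore:
    """Hypothetical in-memory store illustrating TTL eviction; not cogcache code."""

    def __init__(self, ttl: float, max_size: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl              # seconds; a negative value disables expiry
        self.max_size = max_size
        self.clock = clock          # injectable so tests can control time
        self._data: dict[str, tuple[float, str]] = {}  # key -> (stored_at, value)

    def _expired(self, stored_at: float) -> bool:
        return self.ttl >= 0 and self.clock() - stored_at > self.ttl

    def get(self, key: str) -> Optional[str]:
        item = self._data.get(key)
        if item is None:
            return None
        stored_at, value = item
        if self._expired(stored_at):
            # Read path: drop the stale entry lazily and report a miss.
            del self._data[key]
            return None
        return value

    def set(self, key: str, value: str) -> None:
        # Write path: sweep expired entries before inserting.
        stale = [k for k, (ts, _) in self._data.items() if self._expired(ts)]
        for k in stale:
            del self._data[k]
        # Re-inserting a key moves it to the newest position.
        self._data.pop(key, None)
        # If still full, evict the oldest entry (dicts preserve insertion order).
        if self._data and len(self._data) >= self.max_size:
            del self._data[next(iter(self._data))]
        self._data[key] = (self.clock(), value)
```

Evicting on both paths means a stale answer is never served on a read, and memory is reclaimed on writes even for keys that are never read again.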

Configuration

CogniCache(
    redis_url="redis://localhost:6379/0",   # None = in-memory
    similarity_threshold=0.92,               # 0.85–0.95 typical
    max_cache_size=10_000,
    ttl=3600,                                # -1 for no expiry
    vector_dim=512,                          # match your embedder
    enable_judge=True,                       # LLM Judge quality gate
    write_min_quality=0.8,
    judge_on_hit=False,                      # async hit-time warning
    embed_fn=my_custom_embedder,             # or use the default
    metrics=MetricsCollector(),              # observability hook
)

See the tuning guide for threshold selection, embedding model comparison, and Prometheus alert thresholds.
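
One way to reason about `similarity_threshold` is to sweep candidate values over a small set of query pairs that you have labeled by hand. The sketch below is an assumption about how such a sweep could look, not the method from the tuning guide: `sweep_thresholds` is a hypothetical helper, and the scores in the usage example are made-up illustration values, not measurements.

```python
from typing import Iterable


def sweep_thresholds(
    scored_pairs: list[tuple[float, bool]],
    thresholds: Iterable[float],
) -> list[tuple[float, float, float]]:
    """For each threshold, return (threshold, hit_rate, false_hit_rate).

    scored_pairs holds (similarity, is_equivalent) for hand-labeled query pairs.
    hit_rate is the share of pairs that would be served from the cache;
    false_hit_rate is the share of those hits that are not true paraphrases.
    """
    rows = []
    for t in thresholds:
        hits = [same for score, same in scored_pairs if score >= t]
        hit_rate = len(hits) / len(scored_pairs) if scored_pairs else 0.0
        false_hits = sum(1 for same in hits if not same)
        false_hit_rate = false_hits / len(hits) if hits else 0.0
        rows.append((t, hit_rate, false_hit_rate))
    return rows


if __name__ == "__main__":
    # Made-up scores for illustration only.
    labeled = [(0.97, True), (0.93, True), (0.90, False), (0.88, True), (0.80, False)]
    for threshold, hit_rate, false_hit_rate in sweep_thresholds(labeled, [0.85, 0.92]):
        print(f"{threshold:.2f}  hit_rate={hit_rate:.2f}  false_hit_rate={false_hit_rate:.2f}")
```

Raising the threshold trades hit rate for fewer wrong answers served from the cache, so the right value depends on how costly a wrong cached answer is for your application.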

When to use cogcache

✅ High-QPS chatbots where users phrase the same question in different ways
✅ RAG systems with repetitive paraphrased queries
✅ Multi-tenant LLM APIs where you bill per token
✅ Demo / dev environments where you want to skip LLM calls on repeat

❌ Personalized answers (use route=user_id isolation if you must)
❌ Real-time data (weather, prices) — set short TTL or skip caching
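
The idea behind route isolation can be sketched as partitioning entries by a route key, so a lookup in one route never sees answers stored under another. This is an illustrative assumption, not cogcache's internals: `RoutedCache` and `similarity` are hypothetical names, and character-level matching from the standard library's `difflib` stands in for embedding similarity.

```python
from difflib import SequenceMatcher
from typing import Optional


def similarity(a: str, b: str) -> float:
    """Stand-in similarity: character-level ratio, not an embedding distance."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


class RoutedCache:
    """Hypothetical cache that keeps a separate entry list per route."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold
        self._by_route: dict[str, list[tuple[str, str]]] = {}

    def put(self, route: str, query: str, answer: str) -> None:
        self._by_route.setdefault(route, []).append((query, answer))

    def get(self, route: str, query: str) -> Optional[str]:
        # Only entries stored under the same route are candidates.
        best_score, best_answer = 0.0, None
        for stored_query, answer in self._by_route.get(route, []):
            score = similarity(query, stored_query)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None
```

Using a user or tenant identifier as the route keeps personalized answers from leaking across users, at the cost of a lower hit rate because entries are no longer shared.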

Production readiness

  • ✅ Thread-safe MemoryStore and MetricsCollector
  • ✅ Fail-open: Redis disconnect / Judge crash never breaks your request path
  • ✅ 49 unit tests, run with pytest -q
  • ✅ Used in production in the AI_Cost_Optimization reference deployment
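
The fail-open behaviour listed above can be pictured as wrapping every cache operation so that an error degrades to a cache miss instead of an exception. The following is a minimal sketch under that assumption; `answer_fail_open` and its callback parameters are hypothetical and do not reflect cogcache's actual code.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("failopen_sketch")


def answer_fail_open(
    query: str,
    llm_fn: Callable[[str], str],
    cache_get: Callable[[str], Optional[str]],
    cache_set: Callable[[str, str], None],
) -> str:
    """Answer a query, treating any cache failure as a miss."""
    try:
        cached = cache_get(query)
    except Exception:
        # Backend down or misbehaving: log it and fall through to the LLM.
        logger.warning("cache read failed; calling the LLM instead", exc_info=True)
        cached = None
    if cached is not None:
        return cached

    answer = llm_fn(query)
    try:
        cache_set(query, answer)
    except Exception:
        # A failed write only costs a future cache hit, never the current answer.
        logger.warning("cache write failed; answer returned uncached", exc_info=True)
    return answer
```

Errors raised by `llm_fn` itself are deliberately not swallowed: fail-open applies to the cache layer, while a failed LLM call still needs to surface to the caller.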

Try it live

For a complete demo with FastAPI backend, admin dashboard, Prometheus exporter, and Docker Compose setup, see the cogcache-playground repo:

git clone https://github.com/AaronharveyHan/cogcache-playground.git
cd cogcache-playground && docker compose up
# Open http://localhost:8000/admin

License

MIT — see LICENSE.
