
Project description

Recallm

Semantic caching for LLMs. Ask once, recall forever.


Exact-match caching is useless for LLMs: two users asking the same question in slightly different words both pay the full API cost. Recallm uses sentence embeddings to find near-matches and returns cached responses instantly. The result: lower API costs, reduced latency, and no changes to your existing LLM client code.
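
Under the hood the idea is simple: embed the incoming prompt, compare it against the embeddings of previously cached prompts, and serve the stored response when similarity clears a threshold. A minimal sketch of that lookup step (illustrative only, not Recallm's internals; the 0.85 threshold and the (embedding, response) store layout are placeholders):

import numpy as np

def lookup(query_vec, store, threshold=0.85):
    """Return the cached response whose prompt embedding is most similar
    to query_vec, or None if nothing clears the threshold."""
    best_score, best_response = -1.0, None
    for prompt_vec, response in store:
        # cosine similarity between the incoming prompt and a cached one
        score = float(query_vec @ prompt_vec /
                      (np.linalg.norm(query_vec) * np.linalg.norm(prompt_vec)))
        if score > best_score:
            best_score, best_response = score, response
    return best_response if best_score >= threshold else None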

Install

pip install recallm
pip install "recallm[redis]"   # persistent cache, shared across workers
pip install "recallm[torch]"   # sentence-transformers embedder (700MB, PyTorch)

Once installed, import directly from the recallm package:

from recallm import SemanticCache, CacheConfig, InMemoryStorage
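
With the redis extra installed, a Redis-backed store can replace InMemoryStorage for a persistent cache shared across workers (RedisStorage is named under known limitations below). A sketch, assuming RedisStorage is importable from the top-level package and takes a connection URL; check the full docs for the real signature:

from recallm import CacheConfig, RedisStorage, SemanticCache

# assumed constructor: a connection URL, as with most Redis clients
storage = RedisStorage(url="redis://localhost:6379/0")
cache = SemanticCache(storage=storage, config=CacheConfig(threshold="balanced"))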

Quickstart

pip install recallm

from recallm import CacheConfig, InMemoryStorage, SemanticCache

storage = InMemoryStorage()
cache = SemanticCache(storage=storage, config=CacheConfig(threshold="balanced"))

def fake_llm(**kwargs):
    return {"id": "resp-1", "choices": [{"message": {"content": "Paris"}}]}

cached = cache.wrap(fake_llm, mode="sync")

request = {
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "cache_context": {"user_id": "u-1", "document_id": "geo-v1"},
}

first = cached(**request)   # miss: calls fake_llm and stores response
second = cached(**request)  # hit: returns cached response
print(first["choices"][0]["message"]["content"], second["choices"][0]["message"]["content"])
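
Because matching is semantic, a paraphrase can hit the cache too. Whether a given rewording clears the threshold depends on the embedder and the configured preset, but with "balanced" a close paraphrase like this one is the intended case:

paraphrase = dict(request)
paraphrase["messages"] = [{"role": "user", "content": "Which city is the capital of France?"}]
third = cached(**paraphrase)  # near-match: served from cache, fake_llm is not called again
print(third["choices"][0]["message"]["content"])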

Debugging

Inspect cache behaviour during development with cache.stats():

stats = cache.stats()
print(stats.hit_rate)         # fraction of requests served from cache
print(stats.hits, stats.misses)
print(stats.avg_similarity)   # mean cosine similarity of cache hits
print(stats.namespace_sizes)  # entry counts per namespace

cache.stats() returns a CacheStats dataclass and is intended for development and debugging. Use the Prometheus metrics for production observability.

Deployment note: SemanticCache(...) loads the embedding model synchronously. In async frameworks (FastAPI, etc.), call await cache.async_warmup() during startup instead of relying on the constructor; see getting started.
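
A minimal FastAPI sketch of that startup pattern. The lifespan wiring is standard FastAPI, not Recallm API; whether the constructor itself should move inside lifespan (or can defer its model load) is covered in getting started:

from contextlib import asynccontextmanager

from fastapi import FastAPI
from recallm import CacheConfig, InMemoryStorage, SemanticCache

@asynccontextmanager
async def lifespan(app: FastAPI):
    # build the cache at startup and load the embedding model before traffic arrives
    app.state.cache = SemanticCache(storage=InMemoryStorage(),
                                    config=CacheConfig(threshold="balanced"))
    await app.state.cache.async_warmup()
    yield

app = FastAPI(lifespan=lifespan)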

Expected hit rates

Use case               | Expected hit rate | Why
FAQ / support bot      | 40–70%            | High repetition, forgiving similarity
Document summarization | 20–50%            | Same docs re-processed, template prompts
General chat assistant | 5–15%             | High diversity, dynamic context
Code generation        | 3–10%             | Exact problem statements vary, strict threshold
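
The "strict threshold" note for code generation implies presets beyond "balanced", but only "balanced" appears in this README, so treat the preset name below as an assumption and check the full docs:

from recallm import CacheConfig, InMemoryStorage, SemanticCache

# "strict" is an assumed preset name; "balanced" is the one shown in the quickstart
codegen_cache = SemanticCache(
    storage=InMemoryStorage(),
    config=CacheConfig(threshold="strict"),
)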

Known limitations

  • stream=True bypasses the cache entirely
  • Redis backend is not suitable for namespaces > 5,000 entries without partitioning
  • Sync callers using RedisStorage have no timeout protection (v0.1.0)

Full docs · Contributing · MIT License

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

recallm-0.2.0.tar.gz (28.5 kB)


Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

recallm-0.2.0-py3-none-any.whl (24.2 kB)


File details

Details for the file recallm-0.2.0.tar.gz.

File metadata

  • Download URL: recallm-0.2.0.tar.gz
  • Size: 28.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for recallm-0.2.0.tar.gz

Algorithm   | Hash digest
SHA256      | 82bf9a57b0c347581c80e27be2bfe09bb475d17de00050629d5c77c21cb09f7c
MD5         | b634496bcff5854cd97f748550476a90
BLAKE2b-256 | 4b27f49fd47eb0c6c186376b47f3372e70e5eb4c47c4834d7dc1bfe1842e9719


File details

Details for the file recallm-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: recallm-0.2.0-py3-none-any.whl
  • Size: 24.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for recallm-0.2.0-py3-none-any.whl

Algorithm   | Hash digest
SHA256      | f396f18ac4dfcc961efad709fd59fe955f8c59e197aaf9291524591efa909c8a
MD5         | 4a230415ae791d7f3259590de879cc91
BLAKE2b-256 | 6a03be0286ff46677d00b3f513a850d36f61ec512f7e21d23d8ad2a93825632f

