Recallm
Semantic caching for LLMs. Ask once, recall forever.
Exact-match caching is useless for LLMs — two users asking the same question in slightly different words both pay the full API cost. Recallm uses sentence embeddings to find near-matches and return cached responses instantly. The result: lower API costs, reduced latency, and no changes to your existing LLM client code.
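The core idea can be sketched in a few lines. The hand-rolled vectors and the `lookup` helper below are illustrative stand-ins for Recallm's internals, which use real sentence embeddings:

```python
import math

def cosine(a, b):
    # cosine similarity between two embedding vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def lookup(query_vec, cache, threshold=0.9):
    # Return the cached response whose stored embedding is most similar
    # to the query, if it clears the threshold; otherwise report a miss.
    best = max(cache, key=lambda entry: cosine(query_vec, entry[0]), default=None)
    if best is not None and cosine(query_vec, best[0]) >= threshold:
        return best[1]
    return None

# (embedding, response) pairs; a real cache stores sentence embeddings
cache = [([0.9, 0.1, 0.0], "Paris")]

print(lookup([0.88, 0.12, 0.01], cache))  # near-match, prints "Paris"
print(lookup([0.0, 0.2, 0.9], cache))     # unrelated, prints None
```

A paraphrased question lands near the original in embedding space, so it clears the similarity threshold and is served from cache; an unrelated question does not.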
Install
pip install recallm
pip install "recallm[redis]" # persistent cache, shared across workers
pip install "recallm[torch]" # sentence-transformers embedder (700MB, PyTorch)
Once installed, import directly from the recallm package:
from recallm import SemanticCache, CacheConfig, InMemoryStorage
Quickstart
pip install recallm
from recallm import CacheConfig, InMemoryStorage, SemanticCache
storage = InMemoryStorage()
cache = SemanticCache(storage=storage, config=CacheConfig(threshold="balanced"))
def fake_llm(**kwargs):
    return {"id": "resp-1", "choices": [{"message": {"content": "Paris"}}]}
cached = cache.wrap(fake_llm, mode="sync")
request = {
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "cache_context": {"user_id": "u-1", "document_id": "geo-v1"},
}
first = cached(**request) # miss: calls fake_llm and stores response
second = cached(**request) # hit: returns cached response
print(first["choices"][0]["message"]["content"], second["choices"][0]["message"]["content"])
Debugging
Inspect cache behaviour during development with cache.stats():
stats = cache.stats()
print(stats.hit_rate) # fraction of requests served from cache
print(stats.hits, stats.misses)
print(stats.avg_similarity) # mean cosine similarity of cache hits
print(stats.namespace_sizes) # entry counts per namespace
stats() returns a CacheStats dataclass and is intended for development and debugging. Use the Prometheus metrics for production observability.
Deployment note:
SemanticCache(...) loads the embedding model synchronously. In async frameworks (FastAPI, etc.), use await cache.async_warmup() during startup instead of relying on the constructor — see getting started.
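The startup pattern looks roughly like this. StubCache below is a stand-in so the sketch runs without the library installed; in a real app you would call async_warmup() on your SemanticCache instance from your framework's startup hook (e.g. a FastAPI lifespan function):

```python
import asyncio

class StubCache:
    """Stand-in for SemanticCache; async_warmup represents loading the
    embedding model without blocking the event loop."""
    def __init__(self):
        self.ready = False

    async def async_warmup(self):
        # The real warmup would load the model off the event loop;
        # here we just yield control once and mark the cache ready.
        await asyncio.sleep(0)
        self.ready = True

async def startup():
    # Call this from your framework's startup hook.
    cache = StubCache()
    await cache.async_warmup()
    return cache

cache = asyncio.run(startup())
print(cache.ready)  # → True
```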
| Use case | Expected hit rate | Why |
|---|---|---|
| FAQ / support bot | 40–70% | High repetition, forgiving similarity |
| Document summarization | 20–50% | Same docs re-processed, template prompts |
| General chat assistant | 5–15% | High diversity, dynamic context |
| Code generation | 3–10% | Exact problem statements vary, strict threshold |
Known limitations
- stream=True bypasses the cache entirely
- Redis backend is not suitable for namespaces > 5,000 entries without partitioning
- Sync callers using RedisStorage have no timeout protection (v0.1.0)
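Since streaming requests bypass the cache, one way to keep call sites uniform is a small router that sends them straight to the provider. route_request is a hypothetical helper for illustration, not part of Recallm's API:

```python
def route_request(raw_llm, cached_llm, **kwargs):
    # Streaming requests bypass the cache (a known limitation),
    # so send them directly to the provider.
    if kwargs.get("stream"):
        return raw_llm(**kwargs)
    return cached_llm(**kwargs)

# Tiny fakes to demonstrate the routing decision.
def raw_llm(**kwargs):
    return "streamed"

def cached_llm(**kwargs):
    return "cached"

print(route_request(raw_llm, cached_llm, stream=True))   # → streamed
print(route_request(raw_llm, cached_llm, stream=False))  # → cached
```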
File details
Details for the file recallm-0.2.0.tar.gz.
File metadata
- Download URL: recallm-0.2.0.tar.gz
- Upload date:
- Size: 28.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 82bf9a57b0c347581c80e27be2bfe09bb475d17de00050629d5c77c21cb09f7c |
| MD5 | b634496bcff5854cd97f748550476a90 |
| BLAKE2b-256 | 4b27f49fd47eb0c6c186376b47f3372e70e5eb4c47c4834d7dc1bfe1842e9719 |
File details
Details for the file recallm-0.2.0-py3-none-any.whl.
File metadata
- Download URL: recallm-0.2.0-py3-none-any.whl
- Upload date:
- Size: 24.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f396f18ac4dfcc961efad709fd59fe955f8c59e197aaf9291524591efa909c8a |
| MD5 | 4a230415ae791d7f3259590de879cc91 |
| BLAKE2b-256 | 6a03be0286ff46677d00b3f513a850d36f61ec512f7e21d23d8ad2a93825632f |