# semantic-llm-cache

Semantic caching for LLM API calls - reduce costs with one decorator.
## Overview

LLM API calls are expensive and slow. In production applications, 20-40% of prompts are semantically identical to earlier ones, yet each is billed as a separate API call. `semantic-llm-cache` solves this with a simple decorator that:
- ✅ Caches semantically similar prompts (not just exact matches)
- ✅ Reduces API costs by 20-40%
- ✅ Returns cached responses in <10ms
- ✅ Works with any LLM provider (OpenAI, Anthropic, local models)
- ✅ Zero behavior change - drop-in decorator
## Installation

```bash
# Core (exact match only)
pip install semantic-llm-cache

# With semantic similarity
pip install "semantic-llm-cache[semantic]"

# With Redis backend
pip install "semantic-llm-cache[redis]"

# With everything
pip install "semantic-llm-cache[all]"
```
## Quick Start

### Basic Caching (Exact Match)

```python
import openai

from semantic_llm_cache import cache

@cache()
def ask_gpt(prompt: str) -> str:
    return openai.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content

# First call - API hit
ask_gpt("What is Python?")  # $0.002

# Second call - cache hit
ask_gpt("What is Python?")  # FREE, <10ms
```
### Semantic Matching

Match semantically similar prompts (requires `pip install "semantic-llm-cache[semantic]"`):

```python
from semantic_llm_cache import cache

@cache(similarity=0.90)
def ask_gpt(prompt: str) -> str:
    return call_openai(prompt)

ask_gpt("What is Python?")  # API call
ask_gpt("What's Python?")   # Cache hit (95% similar)
ask_gpt("Explain Python")   # Cache hit (91% similar)
ask_gpt("What is Rust?")    # API call (different topic)
```
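Under the hood, semantic matching compares prompt embeddings by cosine similarity against the configured threshold. Below is a minimal sketch of that comparison, using a toy bag-of-words vectorizer in place of the package's real embedding model (`embed`, `cosine`, and `is_cache_hit` are illustrative names, not the library's API):

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": bag of lowercased words. The real package uses
    # sentence-transformers vectors; this only illustrates the mechanism.
    return Counter(text.lower().replace("?", "").split())

def cosine(a: Counter, b: Counter) -> float:
    # Standard cosine similarity between two sparse vectors.
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def is_cache_hit(new_prompt: str, cached_prompt: str, threshold: float = 0.90) -> bool:
    # A cached response is reused when similarity meets the threshold.
    return cosine(embed(new_prompt), embed(cached_prompt)) >= threshold
```

With a real embedding model the paraphrases above land well over 0.90; with this toy vectorizer only near-identical wordings do, which is exactly why the `[semantic]` extra pulls in a proper model.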
### TTL Expiration

```python
from semantic_llm_cache import cache

@cache(ttl=3600)  # 1 hour
def ask_gpt(prompt: str) -> str:
    return call_openai(prompt)
```
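TTL expiry is conceptually simple: each entry carries an expiry timestamp, and a lookup past that timestamp counts as a miss. A hedged sketch of the idea (this `TTLCache` is illustrative, not the package's actual backend; the `now` parameter exists only to make the example deterministic):

```python
import time

class TTLCache:
    """Minimal TTL cache sketch: entries expire `ttl` seconds after insertion."""

    def __init__(self, ttl: float):
        self.ttl = ttl
        self._store = {}  # key -> (value, expires_at)

    def set(self, key, value, now=None):
        now = time.monotonic() if now is None else now
        self._store[key] = (value, now + self.ttl)

    def get(self, key, now=None):
        now = time.monotonic() if now is None else now
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if now >= expires_at:
            del self._store[key]  # expired: evict and report a miss
            return None
        return value

cache_store = TTLCache(ttl=3600)
cache_store.set("What is Python?", "a response", now=0)
print(cache_store.get("What is Python?", now=10))    # fresh -> hit
print(cache_store.get("What is Python?", now=4000))  # past 1h -> None (miss)
```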
### Cache Statistics

```python
from semantic_llm_cache import get_stats

stats = get_stats()
# {
#     "hits": 1547,
#     "misses": 892,
#     "hit_rate": 0.634,
#     "estimated_savings_usd": 3.09,
#     "latency_saved_ms": 773500
# }
```
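The derived fields follow directly from the hit/miss counters. As a sketch, assuming a flat $0.002 saved per cache hit and roughly 500 ms of round-trip latency avoided per hit (both are assumptions for illustration, not the package's documented formula):

```python
hits, misses = 1547, 892

hit_rate = hits / (hits + misses)     # fraction of calls served from cache
estimated_savings_usd = hits * 0.002  # assumed flat cost per avoided API call
latency_saved_ms = hits * 500         # assumed ~500 ms avoided per hit

print(round(hit_rate, 3))               # 0.634
print(round(estimated_savings_usd, 2))  # 3.09
print(latency_saved_ms)                 # 773500
```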
### Cache Management

```python
from semantic_llm_cache import clear_cache, invalidate

# Clear all cached entries
clear_cache()

# Invalidate entries matching a pattern
invalidate(pattern="Python")
```
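Pattern invalidation amounts to dropping every cached entry whose key matches. A sketch of the idea over a plain dict (`invalidate_pattern` is a hypothetical helper, not the library's implementation, which also has to update its backend and any embedding index):

```python
def invalidate_pattern(store: dict, pattern: str) -> int:
    """Delete every cached entry whose prompt contains `pattern`; return the count."""
    doomed = [key for key in store if pattern in key]
    for key in doomed:
        del store[key]
    return len(doomed)

store = {"What is Python?": "...", "What is Rust?": "...", "Explain Python": "..."}
removed = invalidate_pattern(store, "Python")
print(removed, sorted(store))  # 2 ['What is Rust?']
```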
## Advanced Usage

### Multiple Cache Backends

```python
from semantic_llm_cache import cache
from semantic_llm_cache.backends import RedisBackend

# Use Redis for distributed caching
backend = RedisBackend(url="redis://localhost:6379")

@cache(backend=backend)
def ask_gpt(prompt: str) -> str:
    return call_openai(prompt)
```
### Context Manager

```python
from semantic_llm_cache import CacheContext

with CacheContext(similarity=0.9) as ctx:
    result1 = any_llm_call("prompt 1")
    result2 = any_llm_call("prompt 2")
    print(ctx.stats)  # {"hits": 1, "misses": 1}
```
### Wrapper Class

```python
from semantic_llm_cache import CachedLLM

llm = CachedLLM(
    provider="openai",
    similarity=0.9,
    ttl=3600,
)

response = llm.chat("What is Python?")
```
## API Reference

### @cache() Decorator

```python
@cache(
    similarity: float = 1.0,     # 1.0 = exact match, 0.9 = semantic
    ttl: int = 3600,             # seconds; None = never expires
    backend: Backend = None,     # None = in-memory
    namespace: str = "default",  # isolate different use cases
    enabled: bool = True,        # toggle for debugging
    key_func: Callable = None,   # custom cache key function
)
def my_llm_function(prompt: str) -> str:
    ...
```
### Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `similarity` | `float` | `1.0` | Cosine similarity threshold (1.0 = exact, 0.9 = semantic) |
| `ttl` | `int \| None` | `3600` | Time-to-live in seconds (`None` = never expires) |
| `backend` | `Backend` | `None` | Storage backend (`None` = in-memory) |
| `namespace` | `str` | `"default"` | Isolate different use cases |
| `enabled` | `bool` | `True` | Enable/disable caching |
| `key_func` | `Callable` | `None` | Custom cache key function |
### Utility Functions

```python
from semantic_llm_cache import (
    get_stats,     # Get cache statistics
    clear_cache,   # Clear all cached entries
    invalidate,    # Invalidate by pattern
    warm_cache,    # Pre-populate the cache
    export_cache,  # Export entries for analysis
)
```
## Backends

| Backend | Description | Installation |
|---|---|---|
| `MemoryBackend` | In-memory (default) | Built-in |
| `SQLiteBackend` | Persistent storage | Built-in |
| `RedisBackend` | Distributed caching | `pip install "semantic-llm-cache[redis]"` |
## Performance
| Metric | Value |
|---|---|
| Cache hit latency | <10ms |
| Cache miss overhead | <50ms (embedding) |
| Typical hit rate | 25-40% |
| Cost reduction | 20-40% |
## Requirements

- Python >= 3.9
- numpy >= 1.24.0

### Optional Dependencies

- `sentence-transformers >= 2.2.0` (for semantic matching)
- `redis >= 4.0.0` (for the Redis backend)
- `openai >= 1.0.0` (for OpenAI embeddings)
## License

MIT License - see LICENSE file.

## Author

Karthick Raja M (@karthyick)

## Related Packages

- distill-json - JSON compression for LLMs

Cut LLM costs 30% with one decorator. `pip install semantic-llm-cache`