A high-performance semantic caching system with FAISS vector similarity search for LLMs
Smart Semantic Cache
A high-performance, multi-layered semantic caching system that dramatically reduces LLM costs and latency through intelligent similarity-based response caching.
Why Semantic Cache?
Traditional caching requires exact query matches. Semantic caching understands that "What's the capital of France?" and "Tell me France's capital city" should return the same cached result. This can reduce your LLM API costs by 60–90% in real applications.
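The difference between the two lookup styles can be illustrated with a small, self-contained sketch. It is not the library's implementation: the bag-of-words `embed` function stands in for a real embedding model, and `min_similarity` is a hypothetical similarity cutoff (higher means closer) that need not correspond to the package's `similarity_threshold` setting.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: lowercase word counts (a stand-in for a real model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# One previously answered query
cached = {"What is the capital of France?": "Paris"}

def exact_lookup(query):
    """Traditional cache: the query string must match character for character."""
    return cached.get(query)

def semantic_lookup(query, min_similarity=0.5):
    """Return the answer of the most similar cached query, if it is close enough."""
    q = embed(query)
    best = max(cached, key=lambda k: cosine_similarity(q, embed(k)))
    if cosine_similarity(q, embed(best)) >= min_similarity:
        return cached[best]
    return None

print(exact_lookup("Tell me the capital of France"))     # exact match fails
print(semantic_lookup("Tell me the capital of France"))  # similar wording succeeds
```

A real deployment would replace the word-count vectors with dense embeddings and the linear scan with a vector index, but the hit-or-miss decision has the same shape.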
Key Features
- Multi-Layer Intelligence: Memory cache → SQLite → FAISS vector similarity
- Lightning Fast: Sub-millisecond memory lookups, <10ms semantic search
- Configurable Similarity: Fine-tune cache hit sensitivity (0.0–1.0)
- Memory Efficient: Optional FAISS quantization for large-scale deployments
- Async Ready: Full async support for high-throughput applications
- Rich Metrics: Comprehensive performance monitoring and analytics
- Smart Eviction: LRU-based cache management with intelligent cleanup
- Production Ready: Thread-safe, error-resilient, battle-tested
Installation
pip install thinkcache
Optional Dependencies
# For GPU acceleration (recommended for production)
pip install thinkcache[quantization]
# For development and testing
pip install thinkcache[dev]
Quick Start
Method 1: Global Cache Setup (Recommended)
from thinkcache import ensure_semantic_cache
from langchain_openai import OpenAI
# Initialize the semantic cache globally with a single call
ensure_semantic_cache(
    similarity_threshold=0.15,
    max_cache_size=1000
)
llm = OpenAI(temperature=0)
# The first call goes to the LLM; the rephrased queries below are candidates for cache hits
response1 = llm.invoke("What is the capital of France?")
response2 = llm.invoke("Tell me the capital city of France")
response3 = llm.invoke("France's capital is?")
Method 2: Direct Cache Usage
from thinkcache import SemanticCache
from langchain.globals import set_llm_cache
cache = SemanticCache(
    database_path="./production_cache.db",
    faiss_index_path="./vector_cache",
    similarity_threshold=0.15,
    max_cache_size=5000,
    memory_cache_size=1000,
    enable_quantization=True
)
set_llm_cache(cache)
Configuration Methods
Global Configuration (Before First Use)
from thinkcache import configure_semantic_cache
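# Set the configuration before the cache is first used
configure_semantic_cache(
    database_path="./my_cache.db",
    similarity_threshold=0.15,
    max_cache_size=2000
)
# Later, initialize the cache with the configuration above
from thinkcache import ensure_semantic_cache
ensure_semantic_cache()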
Runtime Configuration
from thinkcache import ensure_semantic_cache
cache = ensure_semantic_cache(
    similarity_threshold=0.2,
    database_path="./cache.db",
    faiss_index_path="./vectors",
    max_cache_size=1000,
    memory_cache_size=500,
    batch_size=20,
    enable_quantization=False
)
Production Configuration
from thinkcache import configure_semantic_cache
configure_semantic_cache(
    database_path="/var/cache/semantic/cache.db",
    faiss_index_path="/var/cache/semantic/vectors",
    similarity_threshold=0.15,
    max_cache_size=10000,
    memory_cache_size=2000,
    enable_quantization=True,
    batch_size=50
)
Cache Management
Getting Cache Instance
from thinkcache import get_semantic_cache
cache = get_semantic_cache()
if cache:
    print("Cache is active and ready!")
else:
    print("No cache initialized yet")
Resetting Cache
from thinkcache import configure_semantic_cache, reset_semantic_cache
# Drop the current global cache
reset_semantic_cache()
# A new configuration can now be applied
configure_semantic_cache(similarity_threshold=0.1)
Handling Already Initialized Cache
from thinkcache import configure_semantic_cache, reset_semantic_cache
try:
    configure_semantic_cache(similarity_threshold=0.1)
except ValueError:
    # Raised when the cache has already been initialized
    print("Cache already initialized!")
    # Reset first, then apply the new configuration
    reset_semantic_cache()
    configure_semantic_cache(similarity_threshold=0.1)
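The reset-then-configure pattern above can be wrapped in a small helper. This is a sketch, not part of the package: `reconfigure` is a hypothetical name, and the two functions are passed in as arguments so the helper can be exercised without the library installed. It relies only on the behavior shown above, namely that configuring an already initialized cache raises `ValueError`.

```python
def reconfigure(configure, reset, **settings):
    """Apply settings, resetting first if a cache is already initialized.

    `configure` and `reset` are callables; in real use they would be
    configure_semantic_cache and reset_semantic_cache.
    Returns True if a reset was needed, False otherwise.
    """
    try:
        configure(**settings)
        return False
    except ValueError:
        # The cache was already initialized: reset it and try again
        reset()
        configure(**settings)
        return True
```

With the library installed, a call would look like `reconfigure(configure_semantic_cache, reset_semantic_cache, similarity_threshold=0.1)`. Note that resetting discards the current global cache instance, so this is better suited to development than to a running production service.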
Performance Monitoring
Real-time Metrics
from thinkcache import get_semantic_cache
cache = get_semantic_cache()
if cache:
    metrics = cache.get_metrics()
    print(f"Cache Hit Rate: {metrics['hit_rate']:.1%}")
    print(f"Total Requests: {metrics['total_requests']:,}")
    print(f"Memory Hits: {metrics['memory_hits']:,}")
    print(f"Semantic Hits: {metrics['semantic_hits']:,}")
    print(f"Avg Embedding Time: {metrics['avg_embedding_time']:.3f}s")
    print(f"Avg Search Time: {metrics['avg_search_time']:.3f}s")
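To see which layer is doing the work, the counters can be turned into per-layer shares. The sketch below assumes that `memory_hits` and `semantic_hits` are request counts and that `total_requests` includes hits and misses; those semantics are an assumption, and `layer_shares` is a hypothetical helper, not a library function. The sample dictionary uses made-up numbers.

```python
def layer_shares(metrics):
    """Return the fraction of requests answered by each cache layer."""
    total = metrics.get("total_requests", 0)
    if total == 0:
        # Avoid dividing by zero before any request has been made
        return {"memory": 0.0, "semantic": 0.0}
    return {
        "memory": metrics.get("memory_hits", 0) / total,
        "semantic": metrics.get("semantic_hits", 0) / total,
    }

# Made-up numbers shaped like the metrics dictionary printed above
sample = {"total_requests": 200, "memory_hits": 50, "semantic_hits": 30}
shares = layer_shares(sample)
print(f"Memory layer: {shares['memory']:.1%}, semantic layer: {shares['semantic']:.1%}")
```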
Cache Cleanup
from thinkcache import get_semantic_cache
cache = get_semantic_cache()
if cache:
    cache.clear_cache()
    print("All caches cleared!")
Architecture Overview
Query → Memory Cache → SQLite Cache → Semantic Search → LLM API
  ↓          ↓              ↓                ↓             ↓
<1ms       ~1-2ms         ~2-5ms          ~5-15ms     100-2000ms
How It Works
- Memory Cache: Lightning-fast LRU cache for recently accessed queries
- SQLite Cache: Persistent exact-match cache with indexing
- Semantic Search: FAISS-powered vector similarity search
- Embedding Cache: Cached embeddings to avoid recomputation
- Smart Eviction: Automatic cleanup based on usage patterns
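The layering described above can be sketched with the standard library alone. This is an illustration of the lookup order, not the package's code: `LayeredCache` is a hypothetical class, the SQLite table layout is invented for the example, and the semantic layer is a pluggable function where the real system would use embeddings and a FAISS index.

```python
import sqlite3
from collections import OrderedDict

class LayeredCache:
    """Lookup order: in-memory LRU, then exact match in SQLite, then a semantic matcher."""

    def __init__(self, memory_size=2, semantic_match=None):
        self.memory = OrderedDict()          # most recently used entries last
        self.memory_size = memory_size
        self.db = sqlite3.connect(":memory:")
        self.db.execute("CREATE TABLE cache (prompt TEXT PRIMARY KEY, response TEXT)")
        # semantic_match(prompt, known_prompts) returns a similar stored prompt or None
        self.semantic_match = semantic_match or (lambda prompt, known: None)

    def _remember(self, prompt, response):
        """Insert into the memory layer and evict the least recently used entries."""
        self.memory[prompt] = response
        self.memory.move_to_end(prompt)
        while len(self.memory) > self.memory_size:
            self.memory.popitem(last=False)

    def _exact(self, prompt):
        row = self.db.execute(
            "SELECT response FROM cache WHERE prompt = ?", (prompt,)
        ).fetchone()
        return row[0] if row else None

    def lookup(self, prompt):
        """Return (response, layer); layer is 'miss' when the LLM must be called."""
        if prompt in self.memory:
            self.memory.move_to_end(prompt)
            return self.memory[prompt], "memory"
        response = self._exact(prompt)
        if response is not None:
            self._remember(prompt, response)
            return response, "sqlite"
        known = [r[0] for r in self.db.execute("SELECT prompt FROM cache")]
        similar = self.semantic_match(prompt, known)
        if similar is not None:
            response = self._exact(similar)
            self._remember(prompt, response)
            return response, "semantic"
        return None, "miss"

    def store(self, prompt, response):
        """Persist a fresh LLM response and keep it hot in memory."""
        self.db.execute("INSERT OR REPLACE INTO cache VALUES (?, ?)", (prompt, response))
        self._remember(prompt, response)
```

Each hit in a slower layer is promoted into the memory layer, which is why repeated or rephrased queries get cheaper over time.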
Advanced Usage
Async Operations
import asyncio
from thinkcache import ensure_semantic_cache
from langchain_openai import OpenAI
ensure_semantic_cache()
async def cached_queries():
    llm = OpenAI()
    # Issue three closely related queries concurrently
    tasks = [
        llm.ainvoke("Explain quantum computing"),
        llm.ainvoke("What is quantum computing?"),
        llm.ainvoke("Define quantum computing")
    ]
    results = await asyncio.gather(*tasks)
    return results

results = asyncio.run(cached_queries())
Custom Similarity Thresholds
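from thinkcache import configure_semantic_cache
# Three alternative settings; apply only one of them before the cache is first used.
# Lower values appear to be stricter: the troubleshooting advice below raises the
# threshold to fix low hit rates.
configure_semantic_cache(similarity_threshold=0.1)  # strict
configure_semantic_cache(similarity_threshold=0.4)  # lenient
configure_semantic_cache(similarity_threshold=0.2)  # balanced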
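One way to choose among these values is to score a handful of labeled query pairs at each threshold. The sketch below assumes the threshold acts as a maximum distance (a pair counts as a hit when its distance is at or below the threshold); that interpretation is inferred from this README, not confirmed by it. The distances are hypothetical, not measurements from the library.

```python
def hit_stats(pairs, threshold):
    """Count correct and incorrect cache hits at a given distance threshold.

    pairs: list of (distance, should_match) tuples.
    Returns (true_hits, false_hits).
    """
    true_hits = sum(1 for d, ok in pairs if d <= threshold and ok)
    false_hits = sum(1 for d, ok in pairs if d <= threshold and not ok)
    return true_hits, false_hits

# Hypothetical distances between query pairs, labeled by whether the
# two queries really should share a cached answer
pairs = [
    (0.05, True), (0.12, True), (0.18, True),
    (0.28, False), (0.35, True), (0.38, False),
]

for threshold in (0.1, 0.2, 0.4):
    true_hits, false_hits = hit_stats(pairs, threshold)
    print(f"threshold={threshold}: {true_hits} correct hits, {false_hits} wrong hits")
```

On data like this, the loosest setting recovers more correct hits but also starts returning cached answers for queries that should not match, which is the trade-off to watch when tuning.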
Multiple Cache Instances
from thinkcache import SemanticCache
qa_cache = SemanticCache(
    database_path="./qa_cache.db",
    similarity_threshold=0.15
)
summarization_cache = SemanticCache(
    database_path="./summary_cache.db",
    similarity_threshold=0.25
)
Complete Workflow Example
from thinkcache import (
    configure_semantic_cache,
    ensure_semantic_cache,
    get_semantic_cache,
    reset_semantic_cache
)
from langchain_openai import OpenAI
# 1. Configure before first use
configure_semantic_cache(
    similarity_threshold=0.2,
    max_cache_size=5000,
    enable_quantization=True
)
# 2. Initialize the cache
cache = ensure_semantic_cache()
# 3. Use the LLM as usual
llm = OpenAI(temperature=0)
response = llm.invoke("What is machine learning?")
# 4. Inspect the metrics
metrics = cache.get_metrics()
print(f"Hit rate: {metrics['hit_rate']:.1%}")
# 5. Reset and reconfigure with a different threshold
reset_semantic_cache()
configure_semantic_cache(similarity_threshold=0.1)
Configuration Reference
| Parameter | Default | Description |
|---|---|---|
| `database_path` | `.langchain.db` | SQLite database file path |
| `faiss_index_path` | `./semantic_cache_index` | FAISS vector index directory |
| `similarity_threshold` | `0.5` | Semantic similarity threshold (0.0–1.0) |
| `max_cache_size` | `1000` | Maximum entries in vector store |
| `memory_cache_size` | `100` | Maximum entries in memory cache |
| `batch_size` | `10` | Batch size for vector operations |
| `enable_quantization` | `False` | Enable FAISS quantization for efficiency |
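The parameters and defaults in the table can be mirrored in a small settings object that validates values before they reach the cache. This is a sketch: `CacheSettings` is a hypothetical class that is not shipped with the package, and the rule that sizes must be positive is an assumption. Only the 0.0–1.0 range for the threshold comes from the table.

```python
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class CacheSettings:
    """Defaults copied from the configuration table above."""
    database_path: str = ".langchain.db"
    faiss_index_path: str = "./semantic_cache_index"
    similarity_threshold: float = 0.5
    max_cache_size: int = 1000
    memory_cache_size: int = 100
    batch_size: int = 10
    enable_quantization: bool = False

    def __post_init__(self):
        # Range taken from the table
        if not 0.0 <= self.similarity_threshold <= 1.0:
            raise ValueError("similarity_threshold must be between 0.0 and 1.0")
        # Assumption: sizes only make sense as positive integers
        for name in ("max_cache_size", "memory_cache_size", "batch_size"):
            if getattr(self, name) <= 0:
                raise ValueError(f"{name} must be positive")

settings = CacheSettings(similarity_threshold=0.15, max_cache_size=5000)
kwargs = asdict(settings)  # keyword arguments matching the parameter names above
print(kwargs)
```

The resulting dictionary could then be passed along as `configure_semantic_cache(**kwargs)`, so a typo or out-of-range value fails early instead of at first cache use.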
Troubleshooting
Cache not working?
from thinkcache import get_semantic_cache
cache = get_semantic_cache()
print(f"Cache active: {cache is not None}")
Configuration errors?
from thinkcache import reset_semantic_cache, configure_semantic_cache
reset_semantic_cache()
configure_semantic_cache(similarity_threshold=0.3)
Low hit rates?
from thinkcache import reset_semantic_cache, configure_semantic_cache
reset_semantic_cache()
# Raise the threshold above the suggested 0.2 starting point to accept looser matches
configure_semantic_cache(similarity_threshold=0.3)
Memory issues?
from thinkcache import configure_semantic_cache
configure_semantic_cache(enable_quantization=True)
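Why quantization saves memory can be shown with a pure-Python sketch of 8-bit scalar quantization. It illustrates the general idea only: it does not use FAISS, and which quantizer the package enables is not stated here. The function names `quantize` and `dequantize` are hypothetical.

```python
from array import array

def quantize(vector):
    """Map floats to signed 8-bit integers using one scale per vector."""
    # Fall back to a scale of 1.0 for an all-zero vector
    scale = max(abs(x) for x in vector) / 127 or 1.0
    codes = array("b", (round(x / scale) for x in vector))
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate float values from the 8-bit codes."""
    return [c * scale for c in codes]

vector = [0.5, -1.0, 0.25, 0.0]
codes, scale = quantize(vector)
restored = dequantize(codes, scale)

full = array("d", vector)  # 64-bit floats for comparison
print(f"bytes per value: {full.itemsize} -> {codes.itemsize}")
print(f"largest error: {max(abs(a - b) for a, b in zip(vector, restored)):.4f}")
```

The stored vectors shrink considerably at the cost of a small approximation error, which can slightly change which neighbors fall inside the similarity threshold.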
Performance Tips
- Start with 0.2 similarity threshold
- Use `configure_semantic_cache()` for production
- Enable quantization for large caches
- Use a larger memory cache in production
- Monitor hit rates and adjust threshold
- Reset cache regularly during development
Requirements
- Python 3.8+
- Core: FAISS, HuggingFace Transformers, SQLite
- Optional: faiss-gpu (for GPU acceleration)
License
MIT License – see LICENSE file for details.
Changelog
v0.1.1
- Multi-layer caching system
- FAISS integration with quantization
- Comprehensive metrics and monitoring
- Full async support