A high-performance semantic caching system with FAISS vector similarity search for LLMs

Smart Semantic Cache

A high-performance, multi-layered semantic caching system that dramatically reduces LLM costs and latency through intelligent similarity-based response caching.

Why Semantic Cache?

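Traditional caching requires exact query matches. Semantic caching understands that "What's the capital of France?" and "Tell me France's capital city" should return the same cached result. Because paraphrased queries are served from the cache instead of triggering a new LLM call, this can reduce LLM API costs by 60–90% in real applications; the actual figure depends on how repetitive your traffic is.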
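
To make the difference concrete, here is a minimal, self-contained sketch that does not use thinkcache at all. The three-dimensional "embeddings" are hand-written toy values (a real system would get them from an embedding model), and treating the threshold as a maximum cosine distance is an assumption about how similarity_threshold is interpreted, not something this README states.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Hand-written toy embeddings (assumption, for illustration only).
TOY_EMBEDDINGS = {
    "What's the capital of France?": [0.90, 0.10, 0.05],
    "Tell me France's capital city": [0.85, 0.15, 0.10],
    "How do I bake sourdough bread?": [0.05, 0.20, 0.95],
}

cached_query = "What's the capital of France?"
exact_cache = {cached_query: "Paris"}

def exact_lookup(query):
    """Traditional cache: only an identical string is a hit."""
    return exact_cache.get(query)

def semantic_lookup(query, max_distance=0.15):
    """Semantic cache: a close-enough embedding is a hit."""
    distance = cosine_distance(TOY_EMBEDDINGS[query], TOY_EMBEDDINGS[cached_query])
    return exact_cache[cached_query] if distance <= max_distance else None

paraphrase = "Tell me France's capital city"
print(exact_lookup(paraphrase))     # the exact-match cache misses the paraphrase
print(semantic_lookup(paraphrase))  # the semantic lookup reuses the cached answer
print(semantic_lookup("How do I bake sourdough bread?"))  # unrelated query misses
```

The unrelated query stays a miss because its vector points in a very different direction, which is the property the threshold is meant to protect.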

Key Features

  • Multi-Layer Intelligence: Memory cache → SQLite → FAISS vector similarity
  • Lightning Fast: Sub-millisecond memory lookups, <10ms semantic search
  • Configurable Similarity: Fine-tune cache hit sensitivity (0.0–1.0)
  • Memory Efficient: Optional FAISS quantization for large-scale deployments
  • Async Ready: Full async support for high-throughput applications
  • Rich Metrics: Comprehensive performance monitoring and analytics
  • Smart Eviction: LRU-based cache management with intelligent cleanup
  • Production Ready: Thread-safe, error-resilient, battle-tested

Installation

pip install thinkcache

Optional Dependencies

# Optional "quantization" extra, for GPU acceleration (recommended for production)
pip install thinkcache[quantization]

# For development and testing
pip install thinkcache[dev]

Quick Start

Method 1: Global Cache Setup (Recommended)

from thinkcache import ensure_semantic_cache
from langchain_openai import OpenAI

# Initialize semantic cache globally - one line setup!
ensure_semantic_cache(
    similarity_threshold=0.15,
    max_cache_size=1000
)

llm = OpenAI(temperature=0)

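# First call goes to the LLM API and populates the cache
response1 = llm.invoke("What is the capital of France?")
# Semantically similar prompts should then be served from the cache
response2 = llm.invoke("Tell me the capital city of France")
response3 = llm.invoke("France's capital is?")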

Method 2: Direct Cache Usage

from thinkcache import SemanticCache
from langchain.globals import set_llm_cache

cache = SemanticCache(
    database_path="./production_cache.db",
    faiss_index_path="./vector_cache",
    similarity_threshold=0.15,
    max_cache_size=5000,
    memory_cache_size=1000,
    enable_quantization=True
)

set_llm_cache(cache)

Configuration Methods

Global Configuration (Before First Use)

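from thinkcache import configure_semantic_cache, ensure_semantic_cache

# Set the configuration once, before the cache is first used
configure_semantic_cache(
    database_path="./my_cache.db",
    similarity_threshold=0.15,
    max_cache_size=2000
)

# Later, anywhere in the application: initialize with the stored configuration
ensure_semantic_cache()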

Runtime Configuration

from thinkcache import ensure_semantic_cache

cache = ensure_semantic_cache(
    similarity_threshold=0.2,
    database_path="./cache.db",
    faiss_index_path="./vectors",
    max_cache_size=1000,
    memory_cache_size=500,
    batch_size=20,
    enable_quantization=False
)

Production Configuration

from thinkcache import configure_semantic_cache

configure_semantic_cache(
    database_path="/var/cache/semantic/cache.db",
    faiss_index_path="/var/cache/semantic/vectors",
    similarity_threshold=0.15,
    max_cache_size=10000,
    memory_cache_size=2000,
    enable_quantization=True,
    batch_size=50
)

Cache Management

Getting Cache Instance

from thinkcache import get_semantic_cache

cache = get_semantic_cache()

if cache:
    print("Cache is active and ready!")
else:
    print("No cache initialized yet")

Resetting Cache

from thinkcache import configure_semantic_cache, reset_semantic_cache

# Tear down the current global cache
reset_semantic_cache()

# The cache can now be configured again with new settings
configure_semantic_cache(similarity_threshold=0.1)

Handling Already Initialized Cache

from thinkcache import configure_semantic_cache, reset_semantic_cache

try:
    configure_semantic_cache(similarity_threshold=0.1)
except ValueError as e:
    # Raised when the cache has already been initialized
    print(f"Cache already initialized: {e}")
    reset_semantic_cache()
    configure_semantic_cache(similarity_threshold=0.1)

Performance Monitoring

Real-time Metrics

from thinkcache import get_semantic_cache

cache = get_semantic_cache()

if cache:
    metrics = cache.get_metrics()

    print(f"Cache Hit Rate: {metrics['hit_rate']:.1%}")
    print(f"Total Requests: {metrics['total_requests']:,}")
    print(f"Memory Hits: {metrics['memory_hits']:,}")
    print(f"Semantic Hits: {metrics['semantic_hits']:,}")
    print(f"Avg Embedding Time: {metrics['avg_embedding_time']:.3f}s")
    print(f"Avg Search Time: {metrics['avg_search_time']:.3f}s")
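
As a rough way to turn these numbers into something budget-related, the sketch below estimates how many LLM calls a given hit rate avoided. It assumes hit_rate is a fraction between 0 and 1 (which the `.1%` formatting above suggests) and uses a made-up per-call cost; estimate_savings, the sample dictionary, and cost_per_call_usd are all hypothetical names and values, not part of thinkcache.

```python
def estimate_savings(metrics, cost_per_call_usd):
    """Estimate avoided LLM calls from a metrics dict shaped like the one above."""
    # Assumption: hit_rate is a fraction (0.0-1.0) of total_requests.
    avoided_calls = round(metrics["hit_rate"] * metrics["total_requests"])
    return {
        "avoided_calls": avoided_calls,
        "llm_calls_made": metrics["total_requests"] - avoided_calls,
        "estimated_savings_usd": avoided_calls * cost_per_call_usd,
    }

# Invented sample values for illustration; real values come from cache.get_metrics().
sample_metrics = {
    "hit_rate": 0.75,
    "total_requests": 2000,
    "memory_hits": 900,
    "semantic_hits": 600,
}

report = estimate_savings(sample_metrics, cost_per_call_usd=0.002)
print(f"Avoided calls: {report['avoided_calls']:,}")
print(f"LLM calls made: {report['llm_calls_made']:,}")
print(f"Estimated savings: ${report['estimated_savings_usd']:.2f}")
```

Tracking this figure over time is one way to judge whether a threshold change actually paid off.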

Cache Cleanup

from thinkcache import get_semantic_cache

cache = get_semantic_cache()
if cache:
    cache.clear_cache()
    print("All caches cleared!")

Architecture Overview

Query → Memory Cache → SQLite Cache → Semantic Search → LLM API
  ↓         ↓             ↓              ↓            ↓
 <1ms    ~1-2ms        ~2-5ms        ~5-15ms      100-2000ms

How It Works

  1. Memory Cache: Lightning-fast LRU cache for recently accessed queries
  2. SQLite Cache: Persistent exact-match cache with indexing
  3. Semantic Search: FAISS-powered vector similarity search
  4. Embedding Cache: Cached embeddings to avoid recomputation
  5. Smart Eviction: Automatic cleanup based on usage patterns
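
The layered lookup can be sketched in a few dozen lines of standard-library Python. This is an illustration of the general technique, not thinkcache's implementation: a brute-force scan stands in for FAISS, a keyword counter stands in for a real embedding model, and every name here (TieredCacheSketch, toy_embed, VOCAB, max_distance) is hypothetical.

```python
import math
import sqlite3
from collections import OrderedDict

VOCAB = ["capital", "france", "bake", "bread"]  # toy vocabulary (assumption)

def toy_embed(text):
    """Stand-in for a real embedding model: counts vocabulary words."""
    lowered = text.lower()
    return [float(lowered.count(word)) for word in VOCAB]

def cosine_distance(a, b):
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        return 1.0  # treat an all-zero vector as matching nothing
    return 1.0 - sum(x * y for x, y in zip(a, b)) / norm

class TieredCacheSketch:
    def __init__(self, embed, memory_size=2, max_distance=0.15):
        self.embed = embed
        self.memory = OrderedDict()            # tier 1: in-memory LRU
        self.memory_size = memory_size
        self.max_distance = max_distance
        self.db = sqlite3.connect(":memory:")  # tier 2: exact-match store
        self.db.execute("CREATE TABLE cache (prompt TEXT PRIMARY KEY, response TEXT)")
        self.vectors = []                      # tier 3: (prompt, vector) pairs

    def _remember(self, prompt, response):
        self.memory[prompt] = response
        self.memory.move_to_end(prompt)
        while len(self.memory) > self.memory_size:
            self.memory.popitem(last=False)    # evict least recently used

    def _stored(self, prompt):
        row = self.db.execute(
            "SELECT response FROM cache WHERE prompt = ?", (prompt,)
        ).fetchone()
        return row[0] if row else None

    def put(self, prompt, response):
        self._remember(prompt, response)
        self.db.execute("INSERT OR REPLACE INTO cache VALUES (?, ?)", (prompt, response))
        self.vectors.append((prompt, self.embed(prompt)))

    def get(self, prompt):
        """Return (response, tier), trying the cheapest tier first."""
        if prompt in self.memory:
            self.memory.move_to_end(prompt)
            return self.memory[prompt], "memory"
        response = self._stored(prompt)
        if response is not None:
            self._remember(prompt, response)
            return response, "sqlite"
        query_vec = self.embed(prompt)
        best = min(self.vectors,
                   key=lambda item: cosine_distance(query_vec, item[1]),
                   default=None)
        if best is not None and cosine_distance(query_vec, best[1]) <= self.max_distance:
            response = self._stored(best[0])
            self._remember(prompt, response)
            return response, "semantic"
        return None, "miss"                    # caller would now ask the LLM

cache = TieredCacheSketch(toy_embed, memory_size=1)
cache.put("What is the capital of France?", "Paris")
cache.put("How do I bake bread?", "Mix, proof, bake.")
for question in ("How do I bake bread?",
                 "What is the capital of France?",
                 "Tell me the capital city of France",
                 "What is the weather today?"):
    print(question, "->", cache.get(question))
```

With memory_size=1 the first stored prompt is evicted from memory by the second, so the usage loop exercises each tier once: memory, SQLite, semantic search, and a full miss.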

Advanced Usage

Async Operations

import asyncio
from thinkcache import ensure_semantic_cache
from langchain_openai import OpenAI

ensure_semantic_cache()

async def cached_queries():
    llm = OpenAI()
    tasks = [
        llm.ainvoke("Explain quantum computing"),
        llm.ainvoke("What is quantum computing?"),
        llm.ainvoke("Define quantum computing")
    ]
    results = await asyncio.gather(*tasks)
    return results

results = asyncio.run(cached_queries())

Custom Similarity Thresholds

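from thinkcache import configure_semantic_cache

# Alternatives, not a sequence: pick one. Lower values match more strictly,
# higher values more loosely (raise the threshold if hit rates are low).
configure_semantic_cache(similarity_threshold=0.1)    # strict
# configure_semantic_cache(similarity_threshold=0.4)  # loose
# configure_semantic_cache(similarity_threshold=0.2)  # balanced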
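
One way to choose between thresholds such as these is to replay a small labelled sample and count how many cache hits each setting would produce, and how many of those hits would have returned the wrong answer. The sketch below assumes the threshold acts as a maximum distance to the nearest cached query; the distances and labels are invented for illustration, and evaluate_threshold is a hypothetical helper, not a thinkcache function.

```python
# Each pair: (distance from a new query to its nearest cached query,
#             whether reusing that cached answer would be correct).
# Values are invented for illustration.
LABELLED_PAIRS = [
    (0.04, True),
    (0.09, True),
    (0.18, True),
    (0.27, False),
    (0.35, True),
    (0.38, False),
]

def evaluate_threshold(threshold, pairs):
    """Count cache hits and incorrect hits at a given maximum distance."""
    hit_labels = [correct for distance, correct in pairs if distance <= threshold]
    return {
        "threshold": threshold,
        "hits": len(hit_labels),
        "wrong_hits": hit_labels.count(False),
    }

for candidate in (0.1, 0.2, 0.4):
    print(evaluate_threshold(candidate, LABELLED_PAIRS))
```

On this sample the loosest setting catches every query but also reuses two answers it should not have, which is the trade-off to weigh when tuning.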

Multiple Cache Instances

from thinkcache import SemanticCache

qa_cache = SemanticCache(
    database_path="./qa_cache.db",
    similarity_threshold=0.15
)

summarization_cache = SemanticCache(
    database_path="./summary_cache.db",
    similarity_threshold=0.25
)

Complete Workflow Example

from thinkcache import (
    configure_semantic_cache,
    ensure_semantic_cache,
    get_semantic_cache,
    reset_semantic_cache
)
from langchain_openai import OpenAI

# 1. Configure before first use
configure_semantic_cache(
    similarity_threshold=0.2,
    max_cache_size=5000,
    enable_quantization=True
)

# 2. Initialize the cache with that configuration
cache = ensure_semantic_cache()

# 3. Use the LLM as usual; responses are cached transparently
llm = OpenAI(temperature=0)
response = llm.invoke("What is machine learning?")

# 4. Inspect performance
metrics = cache.get_metrics()
print(f"Hit rate: {metrics['hit_rate']:.1%}")

# 5. Reset, then reconfigure with different settings
reset_semantic_cache()
configure_semantic_cache(similarity_threshold=0.1)

Configuration Reference

Parameter             Default                  Description
database_path         .langchain.db            SQLite database file path
faiss_index_path      ./semantic_cache_index   FAISS vector index directory
similarity_threshold  0.5                      Semantic similarity threshold (0.0–1.0)
max_cache_size        1000                     Maximum entries in vector store
memory_cache_size     100                      Maximum entries in memory cache
batch_size            10                       Batch size for vector operations
enable_quantization   False                    Enable FAISS quantization for efficiency
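
If you keep settings in your own code, one option is a small dataclass that mirrors the defaults listed above and validates the threshold range before anything is passed to the library. CacheSettings is a hypothetical helper written for this example and is not provided by thinkcache; only the field names and default values come from the table.

```python
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class CacheSettings:
    """Hypothetical container mirroring the documented defaults."""
    database_path: str = ".langchain.db"
    faiss_index_path: str = "./semantic_cache_index"
    similarity_threshold: float = 0.5
    max_cache_size: int = 1000
    memory_cache_size: int = 100
    batch_size: int = 10
    enable_quantization: bool = False

    def __post_init__(self):
        # The reference table gives 0.0-1.0 as the valid range.
        if not 0.0 <= self.similarity_threshold <= 1.0:
            raise ValueError("similarity_threshold must be between 0.0 and 1.0")

production = CacheSettings(
    database_path="/var/cache/semantic/cache.db",
    similarity_threshold=0.15,
    max_cache_size=10000,
    enable_quantization=True,
)
kwargs = asdict(production)
print(kwargs)
# The resulting dict could then be unpacked into the library call,
# e.g. configure_semantic_cache(**kwargs), assuming it accepts these keywords.
```

Keeping the values in one frozen object makes it harder for development and production settings to drift apart unnoticed.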

Troubleshooting

Cache not working?

from thinkcache import get_semantic_cache
cache = get_semantic_cache()
print(f"Cache active: {cache is not None}")

Configuration errors?

from thinkcache import reset_semantic_cache, configure_semantic_cache
# Reset first: configuring an already-initialized cache raises ValueError
reset_semantic_cache()
configure_semantic_cache(similarity_threshold=0.3)

Low hit rates?

from thinkcache import reset_semantic_cache, configure_semantic_cache
# Raise the threshold so that looser matches count as cache hits
reset_semantic_cache()
configure_semantic_cache(similarity_threshold=0.3)

Memory issues?

from thinkcache import configure_semantic_cache
configure_semantic_cache(enable_quantization=True)

Performance Tips

  1. Start with a similarity threshold of 0.2
  2. Use configure_semantic_cache() to set configuration explicitly in production
  3. Enable quantization for large caches
  4. Use a larger memory cache in production
  5. Monitor hit rates and adjust the threshold accordingly
  6. Reset the cache regularly during development

Requirements

  • Python 3.8+
  • Core: FAISS, HuggingFace Transformers, SQLite
  • Optional: faiss-gpu (for GPU acceleration)

License

MIT License – see LICENSE file for details.

Changelog

v0.1.1

  • Multi-layer caching system
  • FAISS integration with quantization
  • Comprehensive metrics and monitoring
  • Full async support
