A high-performance semantic caching system with FAISS vector similarity search for LLMs

Smart Semantic Cache

A high-performance, multi-layered semantic caching system that dramatically reduces LLM costs and latency through intelligent similarity-based response caching.

Why Semantic Cache?

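Traditional caching requires exact query matches. Semantic caching recognizes that "What's the capital of France?" and "Tell me France's capital city" ask the same question and should return the same cached result. Because paraphrased queries are served from the cache instead of triggering a new API call, this can reduce LLM API costs by 60–90% in applications with repetitive query traffic; actual savings depend on how often similar questions recur.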
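
To make the idea concrete, here is a minimal sketch of similarity-based matching. It is not thinkcache's implementation: the three-dimensional vectors are hand-written stand-ins for real embeddings, and cosine_similarity, is_semantic_match, and MATCH_THRESHOLD are illustrative names that do not appear in the library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-written stand-ins for embeddings (assumption: a real system would
# obtain these from an embedding model, with hundreds of dimensions).
embeddings = {
    "What's the capital of France?": [0.90, 0.10, 0.05],
    "Tell me France's capital city": [0.85, 0.15, 0.10],
    "How do I bake sourdough bread?": [0.05, 0.20, 0.95],
}

MATCH_THRESHOLD = 0.95  # illustrative value, not a library default

def is_semantic_match(query_a, query_b):
    """Treat two queries as equivalent when their vectors point the same way."""
    score = cosine_similarity(embeddings[query_a], embeddings[query_b])
    return score >= MATCH_THRESHOLD
```

With these toy vectors the two France questions count as a match, while the bread question does not, even though no two strings are identical.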

Key Features

  • Multi-Layer Intelligence: Memory cache → SQLite → FAISS vector similarity
  • Lightning Fast: Sub-millisecond memory lookups, <10ms semantic search
  • Configurable Similarity: Fine-tune cache hit sensitivity (0.0–1.0)
  • Memory Efficient: Optional FAISS quantization for large-scale deployments
  • Async Ready: Full async support for high-throughput applications
  • Rich Metrics: Comprehensive performance monitoring and analytics
  • Smart Eviction: LRU-based cache management with intelligent cleanup
  • Production Ready: Thread-safe, error-resilient, battle-tested
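
As an illustration of the LRU-based eviction mentioned above, the following sketch shows the general technique using only the standard library. The class name LRUCache and its methods are hypothetical; thinkcache's internal eviction logic is not shown in this document and may differ.

```python
from collections import OrderedDict

class LRUCache:
    """Fixed-capacity cache that evicts the least recently used entry."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._entries = OrderedDict()  # oldest entry first, newest last

    def get(self, key, default=None):
        if key not in self._entries:
            return default
        self._entries.move_to_end(key)  # a read marks the entry as fresh
        return self._entries[key]

    def put(self, key, value):
        self._entries[key] = value
        self._entries.move_to_end(key)
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # drop the stalest entry

    def __contains__(self, key):
        # Membership checks do not change recency.
        return key in self._entries

    def __len__(self):
        return len(self._entries)
```

Reading an entry moves it to the "fresh" end, so entries that keep getting hit survive while rarely used ones are dropped first.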

Installation

pip install thinkcache

Optional Dependencies

# For GPU acceleration (recommended for production)
pip install thinkcache[quantization]

# For development and testing
pip install thinkcache[dev]

Quick Start

Method 1: Global Cache Setup (Recommended)

from thinkcache import ensure_semantic_cache
from langchain_openai import OpenAI

# Initialize semantic cache globally - one line setup!
ensure_semantic_cache(
    similarity_threshold=0.15,
    max_cache_size=1000
)

llm = OpenAI(temperature=0)

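# The first call reaches the LLM and populates the cache
response1 = llm.invoke("What is the capital of France?")

# Semantically similar prompts can then be served from the cache
response2 = llm.invoke("Tell me the capital city of France")
response3 = llm.invoke("France's capital is?")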

Method 2: Direct Cache Usage

from thinkcache import SemanticCache
from langchain.globals import set_llm_cache

cache = SemanticCache(
    database_path="./production_cache.db",
    faiss_index_path="./vector_cache",
    similarity_threshold=0.15,
    max_cache_size=5000,
    memory_cache_size=1000,
    enable_quantization=True
)

set_llm_cache(cache)

Configuration Methods

Global Configuration (Before First Use)

from thinkcache import configure_semantic_cache, ensure_semantic_cache

# Set options before the cache is first used
configure_semantic_cache(
    database_path="./my_cache.db",
    similarity_threshold=0.15,
    max_cache_size=2000
)

# Later: initialize the cache with the settings configured above
ensure_semantic_cache()

Runtime Configuration

from thinkcache import ensure_semantic_cache

cache = ensure_semantic_cache(
    similarity_threshold=0.2,
    database_path="./cache.db",
    faiss_index_path="./vectors",
    max_cache_size=1000,
    memory_cache_size=500,
    batch_size=20,
    enable_quantization=False
)

Production Configuration

from thinkcache import configure_semantic_cache

configure_semantic_cache(
    database_path="/var/cache/semantic/cache.db",
    faiss_index_path="/var/cache/semantic/vectors",
    similarity_threshold=0.15,
    max_cache_size=10000,
    memory_cache_size=2000,
    enable_quantization=True,
    batch_size=50
)

Cache Management

Getting Cache Instance

from thinkcache import get_semantic_cache

cache = get_semantic_cache()

if cache:
    print("Cache is active and ready!")
else:
    print("No cache initialized yet")

Resetting Cache

from thinkcache import configure_semantic_cache, reset_semantic_cache

# Discard the current global cache instance
reset_semantic_cache()

# The cache can now be configured again with new settings
configure_semantic_cache(similarity_threshold=0.1)

Handling Already Initialized Cache

from thinkcache import configure_semantic_cache, reset_semantic_cache

try:
    configure_semantic_cache(similarity_threshold=0.1)
except ValueError:
    # Raised when the cache has already been initialized:
    # reset it first, then apply the new configuration.
    print("Cache already initialized; resetting before reconfiguring.")
    reset_semantic_cache()
    configure_semantic_cache(similarity_threshold=0.1)

Performance Monitoring

Real-time Metrics

from thinkcache import get_semantic_cache

cache = get_semantic_cache()

if cache:
    metrics = cache.get_metrics()

    print(f"Cache Hit Rate: {metrics['hit_rate']:.1%}")
    print(f"Total Requests: {metrics['total_requests']:,}")
    print(f"Memory Hits: {metrics['memory_hits']:,}")
    print(f"Semantic Hits: {metrics['semantic_hits']:,}")
    print(f"Avg Embedding Time: {metrics['avg_embedding_time']:.3f}s")
    print(f"Avg Search Time: {metrics['avg_search_time']:.3f}s")
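
Beyond the overall hit rate, it can help to see which layer is serving the hits. The sketch below derives per-layer shares from the metric keys shown above (total_requests, memory_hits, semantic_hits). The helper name hit_shares and the sample numbers are invented for illustration; the dictionary returned by get_metrics() may contain other keys as well.

```python
def hit_shares(metrics):
    """Return the fraction of all requests answered by each cache layer."""
    total = metrics.get("total_requests", 0)
    if total == 0:
        # Avoid dividing by zero before any request has been made.
        return {"memory": 0.0, "semantic": 0.0}
    return {
        "memory": metrics.get("memory_hits", 0) / total,
        "semantic": metrics.get("semantic_hits", 0) / total,
    }

# Invented sample values, shaped like the metrics printed above.
sample_metrics = {"total_requests": 200, "memory_hits": 50, "semantic_hits": 30}
shares = hit_shares(sample_metrics)
print(f"Memory layer: {shares['memory']:.1%}, semantic layer: {shares['semantic']:.1%}")
```

A low semantic share alongside a high memory share suggests mostly repeated identical prompts; the reverse suggests users are paraphrasing.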

Cache Cleanup

from thinkcache import get_semantic_cache

cache = get_semantic_cache()
if cache:
    cache.clear_cache()
    print("All caches cleared!")

Architecture Overview

Query → Memory Cache → SQLite Cache → Semantic Search → LLM API
  ↓         ↓             ↓              ↓            ↓
 <1ms    ~1-2ms        ~2-5ms        ~5-15ms      100-2000ms

How It Works

  1. Memory Cache: Lightning-fast LRU cache for recently accessed queries
  2. SQLite Cache: Persistent exact-match cache with indexing
  3. Semantic Search: FAISS-powered vector similarity search
  4. Embedding Cache: Cached embeddings to avoid recomputation
  5. Smart Eviction: Automatic cleanup based on usage patterns
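
The lookup order described above can be sketched as follows. This is a simplified model, not the library's code: plain dictionaries stand in for the memory and SQLite layers, a linear scan stands in for FAISS, and the names LayeredCache, cosine_distance, and max_distance are hypothetical. It also assumes the threshold acts as a maximum cosine distance, which the source does not state.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; 0.0 means identical direction (non-zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms

class LayeredCache:
    def __init__(self, max_distance=0.15):
        self.max_distance = max_distance
        self.memory = {}      # layer 1: in-process exact match
        self.persistent = {}  # layer 2: stand-in for the SQLite exact-match store
        self.vectors = []     # layer 3: (embedding, response) pairs, stand-in for FAISS

    def store(self, prompt, embedding, response):
        self.persistent[prompt] = response
        self.vectors.append((embedding, response))

    def lookup(self, prompt, embedding):
        """Return (layer_name, response); response is None on a miss."""
        if prompt in self.memory:
            return "memory", self.memory[prompt]
        if prompt in self.persistent:
            # Promote to the memory layer so the next lookup is faster.
            self.memory[prompt] = self.persistent[prompt]
            return "exact", self.persistent[prompt]
        best = min(self.vectors,
                   key=lambda entry: cosine_distance(embedding, entry[0]),
                   default=None)
        if best is not None and cosine_distance(embedding, best[0]) <= self.max_distance:
            self.memory[prompt] = best[1]
            return "semantic", best[1]
        return "miss", None  # the caller would now query the LLM and call store()
```

Each layer is cheaper than the one after it, so a request only pays for the vector search when both exact-match layers have missed.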

Advanced Usage

Async Operations

import asyncio
from thinkcache import ensure_semantic_cache
from langchain_openai import OpenAI

ensure_semantic_cache()

async def cached_queries():
    llm = OpenAI()
    tasks = [
        llm.ainvoke("Explain quantum computing"),
        llm.ainvoke("What is quantum computing?"),
        llm.ainvoke("Define quantum computing")
    ]
    results = await asyncio.gather(*tasks)
    return results

results = asyncio.run(cached_queries())

Custom Similarity Thresholds

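from thinkcache import configure_semantic_cache

# Three alternative settings; choose ONE for a given cache.
# To change the threshold on an initialized cache, call
# reset_semantic_cache() first.
configure_semantic_cache(similarity_threshold=0.1)
configure_semantic_cache(similarity_threshold=0.4)
configure_semantic_cache(similarity_threshold=0.2)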
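
To get a feel for how the threshold changes hit rates, the sketch below counts how many candidate matches would be accepted at each of the three values above. It assumes the threshold is a maximum distance between query and cached entry (smaller means stricter); the source does not state this, and if the library treats the value as a minimum similarity the comparison flips. The distances are invented sample values and count_hits is a hypothetical helper.

```python
def count_hits(distances, threshold):
    """Number of candidates whose distance falls within the threshold."""
    return sum(1 for distance in distances if distance <= threshold)

# Invented distances between incoming queries and their nearest cached entry.
sample_distances = [0.05, 0.12, 0.18, 0.27, 0.35, 0.62]

hits_by_threshold = {
    threshold: count_hits(sample_distances, threshold)
    for threshold in (0.1, 0.2, 0.4)
}
print(hits_by_threshold)
```

Under this assumption a looser threshold accepts more matches, at the risk of returning a cached answer to a question that only resembles the original.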

Multiple Cache Instances

from thinkcache import SemanticCache

qa_cache = SemanticCache(
    database_path="./qa_cache.db",
    similarity_threshold=0.15
)

summarization_cache = SemanticCache(
    database_path="./summary_cache.db",
    similarity_threshold=0.25
)

Complete Workflow Example

from thinkcache import (
    configure_semantic_cache,
    ensure_semantic_cache,
    get_semantic_cache,
    reset_semantic_cache
)
from langchain_openai import OpenAI

# 1. Configure the cache before first use
configure_semantic_cache(
    similarity_threshold=0.2,
    max_cache_size=5000,
    enable_quantization=True
)

# 2. Initialize the cache with that configuration
cache = ensure_semantic_cache()

# 3. Use the LLM as usual; responses go through the cache
llm = OpenAI(temperature=0)
response = llm.invoke("What is machine learning?")

# 4. Inspect cache performance
metrics = cache.get_metrics()
print(f"Hit rate: {metrics['hit_rate']:.1%}")

# 5. Reset before applying a different configuration
reset_semantic_cache()

configure_semantic_cache(similarity_threshold=0.1)

Configuration Reference

Parameter             Default                  Description
database_path         .langchain.db            SQLite database file path
faiss_index_path      ./semantic_cache_index   FAISS vector index directory
similarity_threshold  0.5                      Semantic similarity threshold (0.0–1.0)
max_cache_size        1000                     Maximum entries in vector store
memory_cache_size     100                      Maximum entries in memory cache
batch_size            10                       Batch size for vector operations
enable_quantization   False                    Enable FAISS quantization for efficiency
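
One way to keep these settings in one place and catch mistakes early is a small settings object that mirrors the table. The CacheSettings class below is hypothetical and not part of thinkcache; its defaults are copied from the table above, and the validation rules (threshold within 0.0–1.0, sizes at least 1) are assumptions about what counts as a sensible value.

```python
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class CacheSettings:
    """Hypothetical container mirroring the configuration table's defaults."""
    database_path: str = ".langchain.db"
    faiss_index_path: str = "./semantic_cache_index"
    similarity_threshold: float = 0.5
    max_cache_size: int = 1000
    memory_cache_size: int = 100
    batch_size: int = 10
    enable_quantization: bool = False

    def __post_init__(self):
        # Assumed sanity checks; the library may enforce different rules.
        if not 0.0 <= self.similarity_threshold <= 1.0:
            raise ValueError("similarity_threshold must be between 0.0 and 1.0")
        for name in ("max_cache_size", "memory_cache_size", "batch_size"):
            if getattr(self, name) < 1:
                raise ValueError(f"{name} must be at least 1")

settings = CacheSettings(similarity_threshold=0.2, max_cache_size=5000)
settings_kwargs = asdict(settings)  # plain dict keyed by parameter name
```

Since the field names match the parameter names in the table, settings_kwargs could presumably be unpacked into configure_semantic_cache(**settings_kwargs), though that call is not exercised here.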

Troubleshooting

Cache not working?

from thinkcache import get_semantic_cache
cache = get_semantic_cache()
print(f"Cache active: {cache is not None}")

Configuration errors?

from thinkcache import reset_semantic_cache, configure_semantic_cache
reset_semantic_cache()
configure_semantic_cache(similarity_threshold=0.3)

Low hit rates?

from thinkcache import reset_semantic_cache, configure_semantic_cache
reset_semantic_cache()
configure_semantic_cache(similarity_threshold=0.3)

Memory issues?

from thinkcache import configure_semantic_cache
configure_semantic_cache(enable_quantization=True)
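
For intuition on why quantization saves memory, here is a minimal sketch of 8-bit scalar quantization in pure Python. FAISS provides several quantization schemes and this document does not say which one thinkcache enables, so treat this as an illustration of the general trade-off rather than the library's method. The function names quantize and dequantize are illustrative.

```python
def quantize(vector):
    """Map each float to one byte (0-255) spanning the vector's value range."""
    lo, hi = min(vector), max(vector)
    scale = (hi - lo) / 255 or 1.0  # fall back to 1.0 for a constant vector
    codes = bytes(round((x - lo) / scale) for x in vector)
    return codes, lo, scale

def dequantize(codes, lo, scale):
    """Approximate reconstruction of the original floats."""
    return [lo + code * scale for code in codes]

vector = [0.12, -0.50, 0.98, 0.00]
codes, lo, scale = quantize(vector)
restored = dequantize(codes, lo, scale)

# Each value is off by at most half a quantization step.
max_error = max(abs(a - b) for a, b in zip(vector, restored))
```

A 32-bit float takes four bytes, so storing one byte per dimension cuts vector storage to roughly a quarter, in exchange for a small reconstruction error that can slightly change similarity scores.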

Performance Tips

  1. Start with 0.2 similarity threshold
  2. Use configure_semantic_cache() for production
  3. Enable quantization for large caches
  4. Use a larger memory cache in production
  5. Monitor hit rates and adjust threshold
  6. Reset cache regularly during development

Requirements

  • Python 3.8+
  • Core: FAISS, HuggingFace Transformers, SQLite
  • Optional: faiss-gpu (for GPU acceleration)

License

MIT License – see LICENSE file for details.

Changelog

v0.1.1

  • Multi-layer caching system
  • FAISS integration with quantization
  • Comprehensive metrics and monitoring
  • Full async support
