A high-performance semantic caching system with FAISS vector similarity search for LLMs

Smart Semantic Cache

A high-performance, multi-layered semantic caching system that dramatically reduces LLM costs and latency through intelligent similarity-based response caching.

Why Semantic Cache?

Traditional caching requires exact query matches. Semantic caching understands that "What's the capital of France?" and "Tell me France's capital city" should return the same cached result. This can reduce your LLM API costs by 60–90% in real applications.
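To make the difference between exact and semantic matching concrete, here is a minimal sketch using only the standard library. The `embed` function is a toy bag-of-words stand-in for a real embedding model, and none of these names come from this package; they are assumptions for illustration only.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (a stand-in for a real model)."""
    cleaned = "".join(c if c.isalnum() else " " for c in text.lower())
    return Counter(cleaned.split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


cached_query = "What is the capital of France?"
paraphrase = "Tell me the capital city of France"
unrelated = "How do I bake bread?"

# An exact-match cache misses the paraphrase entirely.
exact_hit = cached_query == paraphrase

# A similarity score still ranks the paraphrase far above the unrelated query.
paraphrase_score = cosine_similarity(embed(cached_query), embed(paraphrase))
unrelated_score = cosine_similarity(embed(cached_query), embed(unrelated))

print(exact_hit, round(paraphrase_score, 2), unrelated_score)
```

A real embedding model captures meaning rather than word overlap, so it would also relate phrasings that share few words; the toy version only shows the mechanics of comparing vectors.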

Key Features

  • Multi-Layer Intelligence: Memory cache → SQLite → FAISS vector similarity
  • Lightning Fast: Sub-millisecond memory lookups, <10ms semantic search
  • Configurable Similarity: Fine-tune cache hit sensitivity (0.0–1.0)
  • Memory Efficient: Optional FAISS quantization for large-scale deployments
  • Async Ready: Full async support for high-throughput applications
  • Rich Metrics: Comprehensive performance monitoring and analytics
  • Smart Eviction: LRU-based cache management with intelligent cleanup
  • Production Ready: Thread-safe, error-resilient, battle-tested
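As an illustration of the LRU-based eviction mentioned above, the following is a minimal sketch built on `collections.OrderedDict`. The `LRUCache` class is hypothetical and is not this package's implementation; in particular it has no locking, so it is not thread-safe as written.

```python
from collections import OrderedDict


class LRUCache:
    """Hypothetical least-recently-used cache with a fixed capacity."""

    def __init__(self, max_size):
        if max_size < 1:
            raise ValueError("max_size must be at least 1")
        self.max_size = max_size
        self._entries = OrderedDict()

    def get(self, key):
        """Return the cached value, or None on a miss."""
        if key not in self._entries:
            return None
        # Mark the entry as most recently used.
        self._entries.move_to_end(key)
        return self._entries[key]

    def put(self, key, value):
        """Insert or update an entry, evicting the oldest one if full."""
        self._entries[key] = value
        self._entries.move_to_end(key)
        if len(self._entries) > self.max_size:
            # The first item is the least recently used one.
            self._entries.popitem(last=False)


lru = LRUCache(max_size=2)
lru.put("a", 1)
lru.put("b", 2)
lru.get("a")     # "a" becomes the most recently used entry
lru.put("c", 3)  # capacity exceeded, so "b" is evicted
```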

Installation

pip install thinkcache

Optional Dependencies

# For GPU acceleration (recommended for production)
pip install thinkcache[quantization]

# For development and testing
pip install thinkcache[dev]

Quick Start

Method 1: Global Cache Setup (Recommended)

from smart_semantic_cache import ensure_semantic_cache
from langchain_openai import OpenAI

# Initialize semantic cache globally - one line setup!
ensure_semantic_cache(
    similarity_threshold=0.15,
    max_cache_size=1000
)

llm = OpenAI(temperature=0)

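# The first call goes to the LLM API and its response is cached
response1 = llm.invoke("What is the capital of France?")
# Semantically similar prompts can then be answered from the cache
response2 = llm.invoke("Tell me the capital city of France")
response3 = llm.invoke("France's capital is?")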

Method 2: Direct Cache Usage

from smart_semantic_cache import SemanticCache
from langchain.globals import set_llm_cache

cache = SemanticCache(
    database_path="./production_cache.db",
    faiss_index_path="./vector_cache",
    similarity_threshold=0.15,
    max_cache_size=5000,
    memory_cache_size=1000,
    enable_quantization=True
)

set_llm_cache(cache)

Configuration Methods

Global Configuration (Before First Use)

from smart_semantic_cache import configure_semantic_cache, ensure_semantic_cache

# Set the configuration before the cache is first used
configure_semantic_cache(
    database_path="./my_cache.db",
    similarity_threshold=0.15,
    max_cache_size=2000
)

# Initialize the cache with the configuration set above
ensure_semantic_cache()

Runtime Configuration

from smart_semantic_cache import ensure_semantic_cache

cache = ensure_semantic_cache(
    similarity_threshold=0.2,
    database_path="./cache.db",
    faiss_index_path="./vectors",
    max_cache_size=1000,
    memory_cache_size=500,
    batch_size=20,
    enable_quantization=False
)

Production Configuration

from smart_semantic_cache import configure_semantic_cache

configure_semantic_cache(
    database_path="/var/cache/semantic/cache.db",
    faiss_index_path="/var/cache/semantic/vectors",
    similarity_threshold=0.15,
    max_cache_size=10000,
    memory_cache_size=2000,
    enable_quantization=True,
    batch_size=50
)

Cache Management

Getting Cache Instance

from smart_semantic_cache import get_semantic_cache

cache = get_semantic_cache()

if cache:
    print("Cache is active and ready!")
else:
    print("No cache initialized yet")

Resetting Cache

from smart_semantic_cache import configure_semantic_cache, reset_semantic_cache

# Discard the current global cache
reset_semantic_cache()

# The cache can now be configured again with new settings
configure_semantic_cache(similarity_threshold=0.1)

Handling Already Initialized Cache

from smart_semantic_cache import configure_semantic_cache, reset_semantic_cache

try:
    configure_semantic_cache(similarity_threshold=0.1)
except ValueError as e:
    # Raised when a cache has already been initialized
    print(f"Cache already initialized: {e}")
    # Reset the existing cache, then apply the new configuration
    reset_semantic_cache()
    configure_semantic_cache(similarity_threshold=0.1)

Performance Monitoring

Real-time Metrics

from smart_semantic_cache import get_semantic_cache

cache = get_semantic_cache()

if cache:
    metrics = cache.get_metrics()

    print(f"Cache Hit Rate: {metrics['hit_rate']:.1%}")
    print(f"Total Requests: {metrics['total_requests']:,}")
    print(f"Memory Hits: {metrics['memory_hits']:,}")
    print(f"Semantic Hits: {metrics['semantic_hits']:,}")
    print(f"Avg Embedding Time: {metrics['avg_embedding_time']:.3f}s")
    print(f"Avg Search Time: {metrics['avg_search_time']:.3f}s")
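For readers who want to relate these counters to cost, here is a small worked sketch. It assumes the hit rate is total hits divided by total requests and that every hit avoids exactly one API call; the counter values and the per-call price are made up for illustration and do not come from this package.

```python
def hit_rate(hits, total_requests):
    """Fraction of requests answered from the cache (0.0 when idle)."""
    return hits / total_requests if total_requests else 0.0


def estimated_savings(hits, cost_per_call):
    """Cost avoided, assuming each hit replaces exactly one API call."""
    return hits * cost_per_call


# Hypothetical counters, shaped like the metrics printed above.
counters = {"total_requests": 1000, "memory_hits": 450, "semantic_hits": 250}

hits = counters["memory_hits"] + counters["semantic_hits"]
rate = hit_rate(hits, counters["total_requests"])
saved = estimated_savings(hits, cost_per_call=0.002)  # assumed price per call

print(f"Hit rate: {rate:.1%}")
print(f"Estimated savings: ${saved:.2f}")
```

With these made-up numbers, 700 of 1,000 requests are served from the cache, which corresponds to a 70% hit rate.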

Cache Cleanup

from smart_semantic_cache import get_semantic_cache

cache = get_semantic_cache()
if cache:
    cache.clear_cache()
    print("All caches cleared!")

Architecture Overview

Query → Memory Cache → SQLite Cache → Semantic Search → LLM API
  ↓         ↓             ↓              ↓            ↓
 <1ms    ~1-2ms        ~2-5ms        ~5-15ms      100-2000ms

How It Works

  1. Memory Cache: Lightning-fast LRU cache for recently accessed queries
  2. SQLite Cache: Persistent exact-match cache with indexing
  3. Semantic Search: FAISS-powered vector similarity search
  4. Embedding Cache: Cached embeddings to avoid recomputation
  5. Smart Eviction: Automatic cleanup based on usage patterns
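The layered lookup described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the package's code: `TieredCache` is a hypothetical class, a linear scan with Jaccard distance stands in for FAISS and a real embedding model, and the threshold is treated as a maximum allowed distance.

```python
import sqlite3


def embed(text):
    """Toy embedding: the set of lowercase words in the prompt."""
    return frozenset(text.lower().replace("?", "").split())


def jaccard_distance(a, b):
    """0.0 for identical word sets, 1.0 for disjoint ones."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


class TieredCache:
    """Hypothetical memory -> SQLite -> similarity lookup chain."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.memory = {}
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE cache (prompt TEXT PRIMARY KEY, response TEXT)"
        )
        self.vectors = []  # (embedding, response) pairs

    def store(self, prompt, response):
        self.memory[prompt] = response
        self.db.execute(
            "INSERT OR REPLACE INTO cache VALUES (?, ?)", (prompt, response)
        )
        self.vectors.append((embed(prompt), response))

    def lookup(self, prompt):
        """Return (response, layer); response is None on a miss."""
        # Layer 1: in-memory exact match.
        if prompt in self.memory:
            return self.memory[prompt], "memory"
        # Layer 2: persistent exact match.
        row = self.db.execute(
            "SELECT response FROM cache WHERE prompt = ?", (prompt,)
        ).fetchone()
        if row is not None:
            self.memory[prompt] = row[0]  # promote to the memory layer
            return row[0], "sqlite"
        # Layer 3: nearest stored prompt within the distance threshold.
        query = embed(prompt)
        best = min(
            self.vectors,
            key=lambda item: jaccard_distance(query, item[0]),
            default=None,
        )
        if best and jaccard_distance(query, best[0]) <= self.threshold:
            return best[1], "semantic"
        return None, "miss"  # the caller would now call the LLM API


tiers = TieredCache(threshold=0.3)
tiers.store("What is the capital of France?", "Paris")

first = tiers.lookup("What is the capital of France?")
tiers.memory.clear()  # simulate a restart that empties the memory layer
second = tiers.lookup("What is the capital of France?")
third = tiers.lookup("What is the capital city of France?")
fourth = tiers.lookup("How do I bake bread?")
```

Each layer is tried only after the cheaper one before it misses, which is why the latency figures in the diagram grow from left to right.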

Advanced Usage

Async Operations

import asyncio
from smart_semantic_cache import ensure_semantic_cache
from langchain_openai import OpenAI

ensure_semantic_cache()

async def cached_queries():
    llm = OpenAI()
    tasks = [
        llm.ainvoke("Explain quantum computing"),
        llm.ainvoke("What is quantum computing?"),
        llm.ainvoke("Define quantum computing")
    ]
    results = await asyncio.gather(*tasks)
    return results

results = asyncio.run(cached_queries())
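One general property of caching under concurrency is worth keeping in mind: requests that start at the same time cannot reuse each other's responses, because nothing has been stored yet when each of them checks the cache. The sketch below demonstrates this with a plain dictionary and a fake LLM call. It is not this package's code, and how the package itself behaves in this situation is not stated here.

```python
import asyncio

stats = {"backend_calls": 0}
responses = {}  # simple exact-match cache for the demonstration


async def fake_llm(prompt):
    """Stand-in for a slow LLM API call."""
    stats["backend_calls"] += 1
    await asyncio.sleep(0.01)
    return f"answer to {prompt!r}"


async def cached_call(prompt):
    """Check the cache, then fall back to the backend and store the result."""
    if prompt in responses:
        return responses[prompt]
    response = await fake_llm(prompt)
    responses[prompt] = response
    return response


async def main():
    # Concurrent: all three tasks check the cache before any has finished.
    await asyncio.gather(*(cached_call("q") for _ in range(3)))
    concurrent = stats["backend_calls"]

    # Sequential: the first call fills the cache for the next two.
    stats["backend_calls"] = 0
    responses.clear()
    for _ in range(3):
        await cached_call("q")
    sequential = stats["backend_calls"]
    return concurrent, sequential


concurrent_calls, sequential_calls = asyncio.run(main())
print(concurrent_calls, sequential_calls)
```

If the first response should benefit later requests, one option is to await a warm-up call before gathering the rest.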

Custom Similarity Thresholds

from smart_semantic_cache import configure_semantic_cache

# Each call below shows an alternative setting; choose the one that fits your workload.
configure_semantic_cache(similarity_threshold=0.1)
configure_semantic_cache(similarity_threshold=0.4)
configure_semantic_cache(similarity_threshold=0.2)
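To get a feel for how the threshold changes hit behavior, here is a small sketch. It assumes the threshold acts as a maximum allowed distance between the new query and a cached one, which is consistent with the troubleshooting advice to raise it when hit rates are low but is not confirmed by this document. The distances are invented sample values.

```python
def hits_at(threshold, distances):
    """Count candidates that would be accepted as cache hits."""
    return sum(1 for d in distances if d <= threshold)


# Hypothetical distances between incoming queries and their nearest cached query.
sample_distances = [0.05, 0.12, 0.18, 0.33, 0.71]

hit_counts = {t: hits_at(t, sample_distances) for t in (0.1, 0.2, 0.4)}

for threshold, count in hit_counts.items():
    print(f"threshold={threshold}: {count} of {len(sample_distances)} would hit")
```

Under this assumption a lower threshold yields fewer but more precise hits, and a higher one trades precision for hit rate.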

Multiple Cache Instances

from smart_semantic_cache import SemanticCache

qa_cache = SemanticCache(
    database_path="./qa_cache.db",
    similarity_threshold=0.15
)

summarization_cache = SemanticCache(
    database_path="./summary_cache.db",
    similarity_threshold=0.25
)

Complete Workflow Example

from smart_semantic_cache import (
    configure_semantic_cache,
    ensure_semantic_cache,
    reset_semantic_cache
)
from langchain_openai import OpenAI

# 1. Configure before first use
configure_semantic_cache(
    similarity_threshold=0.2,
    max_cache_size=5000,
    enable_quantization=True
)

# 2. Initialize the cache and keep a handle for metrics
cache = ensure_semantic_cache()

# 3. Use the LLM as usual
llm = OpenAI(temperature=0)
response = llm.invoke("What is machine learning?")

# 4. Inspect cache performance
metrics = cache.get_metrics()
print(f"Hit rate: {metrics['hit_rate']:.1%}")

# 5. Reset, then reconfigure with different settings
reset_semantic_cache()
configure_semantic_cache(similarity_threshold=0.1)

Configuration Reference

Parameter             Default                  Description
database_path         .langchain.db            SQLite database file path
faiss_index_path      ./semantic_cache_index   FAISS vector index directory
similarity_threshold  0.5                      Semantic similarity threshold (0.0–1.0)
max_cache_size        1000                     Maximum entries in vector store
memory_cache_size     100                      Maximum entries in memory cache
batch_size            10                       Batch size for vector operations
enable_quantization   False                    Enable FAISS quantization for efficiency
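If you build configuration in your own code before handing it to the cache, a small typed container can catch mistakes early. The sketch below mirrors the defaults from the table above; the `CacheConfig` class and its validation rules are assumptions for illustration and are not part of this package.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheConfig:
    """Hypothetical settings container using the documented defaults."""

    database_path: str = ".langchain.db"
    faiss_index_path: str = "./semantic_cache_index"
    similarity_threshold: float = 0.5
    max_cache_size: int = 1000
    memory_cache_size: int = 100
    batch_size: int = 10
    enable_quantization: bool = False

    def __post_init__(self):
        # The table gives 0.0-1.0 as the valid threshold range.
        if not 0.0 <= self.similarity_threshold <= 1.0:
            raise ValueError("similarity_threshold must be between 0.0 and 1.0")
        # Assumed rule: sizes must be positive integers.
        for name in ("max_cache_size", "memory_cache_size", "batch_size"):
            if getattr(self, name) < 1:
                raise ValueError(f"{name} must be a positive integer")


defaults = CacheConfig()
production = CacheConfig(
    similarity_threshold=0.15,
    max_cache_size=10000,
    memory_cache_size=2000,
    enable_quantization=True,
    batch_size=50,
)

try:
    CacheConfig(similarity_threshold=1.5)
    rejected = False
except ValueError:
    rejected = True
```

Since the field names match the documented parameters, `dataclasses.asdict(production)` would give a dictionary that can be unpacked as keyword arguments.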

Troubleshooting

Cache not working?

from smart_semantic_cache import get_semantic_cache
cache = get_semantic_cache()
print(f"Cache active: {cache is not None}")

Configuration errors?

from smart_semantic_cache import reset_semantic_cache, configure_semantic_cache
reset_semantic_cache()
configure_semantic_cache(similarity_threshold=0.3)

Low hit rates?

from smart_semantic_cache import reset_semantic_cache, configure_semantic_cache
reset_semantic_cache()
configure_semantic_cache(similarity_threshold=0.3)

Memory issues?

from smart_semantic_cache import configure_semantic_cache
configure_semantic_cache(enable_quantization=True)
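A back-of-the-envelope estimate shows why quantization helps with memory. The sketch assumes 384-dimensional embeddings and compares uncompressed float32 storage with 8-bit scalar quantization, one of the schemes FAISS offers. Which embedding size and quantization scheme this package actually uses is not stated here, and index overhead is ignored.

```python
def vector_storage_bytes(num_vectors, dimension, bytes_per_component):
    """Raw storage for the vectors alone, excluding index overhead."""
    return num_vectors * dimension * bytes_per_component


NUM_VECTORS = 10_000  # matches max_cache_size in the production example
DIMENSION = 384       # assumed embedding size

flat_bytes = vector_storage_bytes(NUM_VECTORS, DIMENSION, 4)       # float32
quantized_bytes = vector_storage_bytes(NUM_VECTORS, DIMENSION, 1)  # 8-bit

print(f"float32: {flat_bytes / 1_000_000:.2f} MB")
print(f"8-bit:   {quantized_bytes / 1_000_000:.2f} MB")
```

Under these assumptions the vectors shrink by a factor of four, at the price of some loss in distance precision.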

Performance Tips

  1. Start with 0.2 similarity threshold
  2. Use configure_semantic_cache() for production
  3. Enable quantization for large caches
  4. Use a larger memory cache in production
  5. Monitor hit rates and adjust threshold
  6. Reset cache regularly during development

Requirements

  • Python 3.8+
  • Core: FAISS, HuggingFace Transformers, SQLite
  • Optional: faiss-gpu (for GPU acceleration)

License

MIT License – see LICENSE file for details.

Changelog

v0.1.1

  • Multi-layer caching system
  • FAISS integration with quantization
  • Comprehensive metrics and monitoring
  • Full async support
