A high-performance semantic caching system with FAISS vector similarity search for LLMs

Smart Semantic Cache

A high-performance, multi-layered semantic caching system that dramatically reduces LLM costs and latency through intelligent similarity-based response caching.

Why Semantic Cache?

Traditional caching requires exact query matches. Semantic caching recognizes that "What's the capital of France?" and "Tell me France's capital city" ask the same question and should return the same cached result. According to the project, this can cut LLM API costs by 60–90%; the actual saving depends on how often your queries repeat or paraphrase earlier ones.
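To make the idea concrete, here is a minimal sketch of threshold-based matching on embedding vectors. It assumes the threshold is a maximum cosine distance; the source does not say which metric thinkcache uses, and the three-dimensional vectors are made-up stand-ins for real embeddings.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity; 0.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical embeddings; real models produce hundreds of dimensions.
cached_query = [0.90, 0.10, 0.05]  # "What's the capital of France?"
paraphrase = [0.88, 0.14, 0.07]    # "Tell me France's capital city"
unrelated = [0.05, 0.20, 0.95]     # a question on a different topic

THRESHOLD = 0.15  # assumption: treated as a maximum allowed distance

def is_cache_hit(query_vec, cached_vec, threshold=THRESHOLD):
    """A query reuses the cached answer when it is close enough to the cached query."""
    return cosine_distance(query_vec, cached_vec) <= threshold

print(is_cache_hit(paraphrase, cached_query))  # True
print(is_cache_hit(unrelated, cached_query))   # False
```

Under this reading, a smaller threshold demands closer matches, which agrees with the troubleshooting advice to raise the threshold when hit rates are low.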

Key Features

Multi-Layer Intelligence: Memory cache → SQLite → FAISS vector similarity
Lightning Fast: Sub-millisecond memory lookups, <10ms semantic search
Configurable Similarity: Fine-tune cache hit sensitivity (0.0–1.0)
Memory Efficient: Optional FAISS quantization for large-scale deployments
Async Ready: Full async support for high-throughput applications
Rich Metrics: Comprehensive performance monitoring and analytics
Smart Eviction: LRU-based cache management with intelligent cleanup
Production Ready: Thread-safe, error-resilient, battle-tested

Installation

pip install thinkcache

Optional Dependencies

# For GPU acceleration (recommended for production)
# The quotes keep shells such as zsh from interpreting the square brackets
pip install "thinkcache[quantization]"

# For development and testing
pip install "thinkcache[dev]"

Quick Start

Method 1: Global Cache Setup (Recommended)

from thinkcache import ensure_semantic_cache
from langchain_openai import OpenAI

# Initialize semantic cache globally - one line setup!
ensure_semantic_cache(
    similarity_threshold=0.15,
    max_cache_size=1000
)

llm = OpenAI(temperature=0)

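# The first call goes to the LLM and populates the cache
response1 = llm.invoke("What is the capital of France?")
# Paraphrases within the similarity threshold should be served from the cache
response2 = llm.invoke("Tell me the capital city of France")
response3 = llm.invoke("France's capital is?")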

Method 2: Direct Cache Usage

from thinkcache import SemanticCache
from langchain.globals import set_llm_cache

cache = SemanticCache(
    database_path="./production_cache.db",
    faiss_index_path="./vector_cache",
    similarity_threshold=0.15,
    max_cache_size=5000,
    memory_cache_size=1000,
    enable_quantization=True
)

# Register the cache so that LangChain LLM calls consult it
set_llm_cache(cache)

Configuration Methods

Global Configuration (Before First Use)

from thinkcache import configure_semantic_cache, ensure_semantic_cache

# Store the settings before the cache is first used
configure_semantic_cache(
    database_path="./my_cache.db",
    similarity_threshold=0.15,
    max_cache_size=2000
)

# Later: initialize the cache with the stored settings
ensure_semantic_cache()

Runtime Configuration

from thinkcache import ensure_semantic_cache

cache = ensure_semantic_cache(
    similarity_threshold=0.2,
    database_path="./cache.db",
    faiss_index_path="./vectors",
    max_cache_size=1000,
    memory_cache_size=500,
    batch_size=20,
    enable_quantization=False
)

Production Configuration

from thinkcache import configure_semantic_cache

configure_semantic_cache(
    database_path="/var/cache/semantic/cache.db",
    faiss_index_path="/var/cache/semantic/vectors",
    similarity_threshold=0.15,
    max_cache_size=10000,
    memory_cache_size=2000,
    enable_quantization=True,
    batch_size=50
)

Cache Management

Getting Cache Instance

from thinkcache import get_semantic_cache

cache = get_semantic_cache()

if cache:
    print("Cache is active and ready!")
else:
    print("No cache initialized yet")

Resetting Cache

from thinkcache import configure_semantic_cache, reset_semantic_cache

# Drop the current global cache instance
reset_semantic_cache()

# The cache can now be configured again with new settings
configure_semantic_cache(similarity_threshold=0.1)

Handling Already Initialized Cache

from thinkcache import configure_semantic_cache, reset_semantic_cache

try:
    configure_semantic_cache(similarity_threshold=0.1)
except ValueError as e:
    # Raised when the cache has already been initialized
    print(f"Cache already initialized: {e}")
    reset_semantic_cache()
    configure_semantic_cache(similarity_threshold=0.1)
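The reset-then-configure pattern above can be wrapped in a helper. The sketch below illustrates a "configure once" guard using stand-in functions; `configure`, `reset`, and `reconfigure` are hypothetical names and not part of thinkcache, whose internals are not described in the source.

```python
_config = None  # module-level settings, set at most once until reset

def configure(**settings):
    """Store settings; refuse to overwrite an existing configuration."""
    global _config
    if _config is not None:
        raise ValueError("cache already configured; call reset() first")
    _config = dict(settings)

def reset():
    """Forget the stored settings so configure() can be called again."""
    global _config
    _config = None

def reconfigure(**settings):
    """Apply new settings, resetting first if a configuration already exists."""
    try:
        configure(**settings)
    except ValueError:
        reset()
        configure(**settings)
    return _config

print(reconfigure(similarity_threshold=0.2))
print(reconfigure(similarity_threshold=0.1))  # resets, then applies the new value
```

A guard like this makes accidental mid-run reconfiguration fail loudly, at the cost of an explicit reset whenever settings really do need to change.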

Performance Monitoring

Real-time Metrics

from thinkcache import get_semantic_cache

cache = get_semantic_cache()

if cache:
    metrics = cache.get_metrics()

    print(f"Cache Hit Rate: {metrics['hit_rate']:.1%}")
    print(f"Total Requests: {metrics['total_requests']:,}")
    print(f"Memory Hits: {metrics['memory_hits']:,}")
    print(f"Semantic Hits: {metrics['semantic_hits']:,}")
    print(f"Avg Embedding Time: {metrics['avg_embedding_time']:.3f}s")
    print(f"Avg Search Time: {metrics['avg_search_time']:.3f}s")
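For reference, a hit rate of this kind is normally the share of requests answered from any cache layer. The sketch below shows that arithmetic with made-up counter values; how thinkcache computes `hit_rate` internally, and which hit counters it includes, is an assumption here.

```python
def hit_rate(total_requests, *hit_counts):
    """Fraction of requests served from any cache layer (0.0 when there were none)."""
    if total_requests == 0:
        return 0.0
    return sum(hit_counts) / total_requests

# Hypothetical counter values, not measured output.
sample = {"total_requests": 200, "memory_hits": 90, "semantic_hits": 50}

rate = hit_rate(
    sample["total_requests"],
    sample["memory_hits"],
    sample["semantic_hits"],
)
print(f"Cache Hit Rate: {rate:.1%}")  # Cache Hit Rate: 70.0%
```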

Cache Cleanup

from thinkcache import get_semantic_cache

cache = get_semantic_cache()
if cache:
    cache.clear_cache()
    print("All caches cleared!")

Architecture Overview

Query → Memory Cache → SQLite Cache → Semantic Search → LLM API
  ↓         ↓             ↓              ↓            ↓
 <1ms    ~1-2ms        ~2-5ms        ~5-15ms      100-2000ms

How It Works

Memory Cache: Lightning-fast LRU cache for recently accessed queries
SQLite Cache: Persistent exact-match cache with indexing
Semantic Search: FAISS-powered vector similarity search
Embedding Cache: Cached embeddings to avoid recomputation
Smart Eviction: Automatic cleanup based on usage patterns
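The layering described above can be sketched with the standard library alone. This is an illustration of the lookup order, not thinkcache's actual code: `LayeredCache` is a hypothetical class, the SQLite database lives in memory, and the semantic layer is a pluggable function that defaults to "no match".

```python
import sqlite3
from collections import OrderedDict

class LayeredCache:
    """Toy cache: LRU memory layer, SQLite exact match, pluggable semantic fallback."""

    def __init__(self, memory_size=2, semantic_lookup=None):
        self.memory = OrderedDict()            # in-process LRU layer
        self.memory_size = memory_size
        self.db = sqlite3.connect(":memory:")  # stands in for the on-disk database
        self.db.execute("CREATE TABLE cache (prompt TEXT PRIMARY KEY, response TEXT)")
        self.semantic_lookup = semantic_lookup or (lambda prompt: None)

    def _remember(self, prompt, response):
        self.memory[prompt] = response
        self.memory.move_to_end(prompt)
        if len(self.memory) > self.memory_size:
            self.memory.popitem(last=False)    # evict the least recently used entry

    def lookup(self, prompt):
        """Return (response, layer), or (None, "miss") when every layer misses."""
        if prompt in self.memory:
            self.memory.move_to_end(prompt)
            return self.memory[prompt], "memory"
        row = self.db.execute(
            "SELECT response FROM cache WHERE prompt = ?", (prompt,)
        ).fetchone()
        if row is not None:
            self._remember(prompt, row[0])     # promote to the memory layer
            return row[0], "sqlite"
        response = self.semantic_lookup(prompt)
        if response is not None:
            self._remember(prompt, response)
            return response, "semantic"
        return None, "miss"                    # caller falls back to the LLM API

    def update(self, prompt, response):
        self.db.execute("INSERT OR REPLACE INTO cache VALUES (?, ?)", (prompt, response))
        self._remember(prompt, response)

cache = LayeredCache(memory_size=1)
cache.update("capital of France?", "Paris")
cache.update("capital of Spain?", "Madrid")  # evicts the France entry from memory
print(cache.lookup("capital of Spain?"))     # ('Madrid', 'memory')
print(cache.lookup("capital of France?"))    # ('Paris', 'sqlite')
print(cache.lookup("capital of Peru?"))      # (None, 'miss')
```

Checking the cheapest layer first means the common case (a recently repeated query) never touches disk or the embedding model.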

Advanced Usage

Async Operations

import asyncio
from thinkcache import ensure_semantic_cache
from langchain_openai import OpenAI

ensure_semantic_cache()

async def cached_queries():
    llm = OpenAI()
    tasks = [
        llm.ainvoke("Explain quantum computing"),
        llm.ainvoke("What is quantum computing?"),
        llm.ainvoke("Define quantum computing")
    ]
    results = await asyncio.gather(*tasks)
    return results

results = asyncio.run(cached_queries())
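One thing to keep in mind with concurrent requests: in a simple check-then-fill cache, tasks started together all see an empty cache, so each one calls the backend. The toy below demonstrates this with a fake LLM; whether thinkcache coalesces in-flight requests is not stated in the source, so treat this as a general caution rather than a description of its behavior.

```python
import asyncio

cache = {}
calls = 0  # number of times the fake backend was actually invoked

async def fake_llm(prompt):
    """Stand-in for a slow LLM call."""
    global calls
    calls += 1
    await asyncio.sleep(0.01)
    return f"answer to {prompt!r}"

async def cached_call(prompt):
    """Naive check-then-fill cache around the fake backend."""
    if prompt in cache:
        return cache[prompt]
    response = await fake_llm(prompt)
    cache[prompt] = response
    return response

async def main():
    # Three identical prompts issued concurrently: all miss, all call the backend.
    await asyncio.gather(*(cached_call("same prompt") for _ in range(3)))
    calls_after_gather = calls
    # A later call finds the cache filled and does not call the backend again.
    await cached_call("same prompt")
    return calls_after_gather, calls

result = asyncio.run(main())
print(result)  # (3, 3)
```

If warm-cache behavior matters, issuing one request first and the paraphrases afterwards gives the cache a chance to fill.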

Custom Similarity Thresholds

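from thinkcache import configure_semantic_cache

# Choose one setting. Lower values appear to match more strictly, judging by
# the troubleshooting advice to raise the threshold when hit rates are low.
configure_semantic_cache(similarity_threshold=0.1)  # strict: fewer, closer matches
configure_semantic_cache(similarity_threshold=0.4)  # loose: more hits, higher risk of wrong matches
configure_semantic_cache(similarity_threshold=0.2)  # balanced starting point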

Multiple Cache Instances

from thinkcache import SemanticCache

qa_cache = SemanticCache(
    database_path="./qa_cache.db",
    similarity_threshold=0.15
)

summarization_cache = SemanticCache(
    database_path="./summary_cache.db",
    similarity_threshold=0.25
)

Complete Workflow Example

from thinkcache import (
    configure_semantic_cache,
    ensure_semantic_cache,
    get_semantic_cache,
    reset_semantic_cache
)
from langchain_openai import OpenAI

# 1. Configure before the cache is first used
configure_semantic_cache(
    similarity_threshold=0.2,
    max_cache_size=5000,
    enable_quantization=True
)

# 2. Initialize the cache with the stored settings
cache = ensure_semantic_cache()

# 3. Use the LLM as usual
llm = OpenAI(temperature=0)
response = llm.invoke("What is machine learning?")

# 4. Inspect cache performance
metrics = cache.get_metrics()
print(f"Hit rate: {metrics['hit_rate']:.1%}")

# 5. Reset, then reconfigure with a different threshold
reset_semantic_cache()

configure_semantic_cache(similarity_threshold=0.1)

Configuration Reference

Parameter	Default	Description
`database_path`	`.langchain.db`	SQLite database file path
`faiss_index_path`	`./semantic_cache_index`	FAISS vector index directory
`similarity_threshold`	`0.5`	Semantic similarity threshold (0.0–1.0)
`max_cache_size`	`1000`	Maximum entries in vector store
`memory_cache_size`	`100`	Maximum entries in memory cache
`batch_size`	`10`	Batch size for vector operations
`enable_quantization`	`False`	Enable FAISS quantization for efficiency
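When settings come from environment variables or config files, it can help to validate them before handing them to the cache. The sketch below mirrors the defaults in the table above in a small dataclass; `CacheSettings` is a hypothetical helper and not part of thinkcache, and the positivity checks are an assumption about sensible values.

```python
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class CacheSettings:
    """Defaults copied from the configuration reference table."""
    database_path: str = ".langchain.db"
    faiss_index_path: str = "./semantic_cache_index"
    similarity_threshold: float = 0.5
    max_cache_size: int = 1000
    memory_cache_size: int = 100
    batch_size: int = 10
    enable_quantization: bool = False

    def __post_init__(self):
        # The table documents 0.0-1.0 as the valid threshold range.
        if not 0.0 <= self.similarity_threshold <= 1.0:
            raise ValueError("similarity_threshold must be within 0.0-1.0")
        # Assumption: sizes must be positive to be meaningful.
        for name in ("max_cache_size", "memory_cache_size", "batch_size"):
            if getattr(self, name) <= 0:
                raise ValueError(f"{name} must be positive")

settings = CacheSettings(similarity_threshold=0.15, max_cache_size=5000)
print(asdict(settings))
```

Because the field names match the documented parameters, `asdict(settings)` could presumably be unpacked into `configure_semantic_cache(**asdict(settings))`.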

Troubleshooting

Cache not working?

from thinkcache import get_semantic_cache
cache = get_semantic_cache()
print(f"Cache active: {cache is not None}")

Configuration errors?

from thinkcache import reset_semantic_cache, configure_semantic_cache
# Configuring an already initialized cache raises ValueError, so reset first
reset_semantic_cache()
configure_semantic_cache(similarity_threshold=0.3)

Low hit rates?

from thinkcache import reset_semantic_cache, configure_semantic_cache
# Raise the threshold so that looser paraphrases also count as matches
reset_semantic_cache()
configure_semantic_cache(similarity_threshold=0.3)

Memory issues?

from thinkcache import configure_semantic_cache
configure_semantic_cache(enable_quantization=True)
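As a rough intuition for why quantization saves memory, the sketch below stores each vector component in one byte instead of a four-byte float. This is plain 8-bit scalar quantization written for illustration; it is not FAISS's implementation, and the source does not say which quantization scheme thinkcache enables.

```python
from array import array

def quantize(vector, lo=-1.0, hi=1.0):
    """Map floats in [lo, hi] to 8-bit codes (0-255), clamping out-of-range values."""
    scale = 255.0 / (hi - lo)
    return bytes(round((min(max(x, lo), hi) - lo) * scale) for x in vector)

def dequantize(codes, lo=-1.0, hi=1.0):
    """Recover approximate floats from 8-bit codes."""
    step = (hi - lo) / 255.0
    return [lo + code * step for code in codes]

vector = [0.12, -0.53, 0.98, -1.0]
full = array("f", vector)  # single-precision float storage
codes = quantize(vector)

print(full.itemsize * len(full), "bytes as floats")
print(len(codes), "bytes quantized")

# The price is a small reconstruction error, bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(vector, dequantize(codes)))
print(f"max reconstruction error: {max_error:.4f}")
```

The trade-off is the usual one: less memory per vector in exchange for slightly less precise distance calculations.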

Performance Tips

Start with a similarity threshold of 0.2
Use configure_semantic_cache() for production
Enable quantization for large caches
Use a larger memory cache in production
Monitor hit rates and adjust threshold
Reset cache regularly during development
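One way to act on the "monitor and adjust" tip is to label a handful of query pairs by hand and count how each candidate threshold would have behaved. The sketch below does this on made-up data, assuming the threshold is a maximum distance; the sample distances and the helper name are hypothetical.

```python
def evaluate_threshold(samples, threshold):
    """Count wrong cache hits and missed cache hits for one distance threshold."""
    false_hits = sum(1 for dist, same in samples if dist <= threshold and not same)
    missed_hits = sum(1 for dist, same in samples if dist > threshold and same)
    return false_hits, missed_hits

# Hypothetical (distance, same_meaning) pairs labeled by hand.
samples = [
    (0.05, True),
    (0.12, True),
    (0.18, True),
    (0.22, False),
    (0.28, True),
    (0.35, False),
]

for threshold in (0.1, 0.2, 0.3, 0.4):
    false_hits, missed_hits = evaluate_threshold(samples, threshold)
    print(f"threshold={threshold}: false hits={false_hits}, missed hits={missed_hits}")
```

On this toy data, 0.2 produces no wrong answers and misses one paraphrase, while 0.3 catches every paraphrase at the cost of one wrong hit. Which side to favor depends on how costly a wrong cached answer is in your application.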

Requirements

Python 3.8+
Core: FAISS, HuggingFace Transformers, SQLite
Optional: faiss-gpu (for GPU acceleration)

License

MIT License – see LICENSE file for details.

Changelog

v0.1.1

Multi-layer caching system
FAISS integration with quantization
Comprehensive metrics and monitoring
Full async support

Release history

0.1.2 (this version): Jul 24, 2025
0.1.1: Jul 24, 2025
0.1.0: Jul 24, 2025
