
Project description

semantic-llm-cache

Semantic caching for LLM API calls - reduce costs with one decorator.


Overview

LLM API calls are expensive and slow. In many production applications, 20-40% of prompts are semantically equivalent, yet each one is billed as a separate API call. semantic-llm-cache addresses this with a simple decorator that:

  • Caches semantically similar prompts (not just exact matches)
  • Reduces API costs by 20-40%
  • Returns cached responses in <10ms
  • Works with any LLM provider (OpenAI, Anthropic, local models)
  • Zero behavior change - drop-in decorator
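Conceptually, a semantic cache stores an embedding alongside each cached response and answers a new prompt from cache when the nearest stored embedding clears a similarity threshold. A minimal, illustrative sketch of that lookup in plain Python (toy vectors, not the library's actual implementation):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def lookup(cache, query_embedding, threshold=0.9):
    """Return the cached response whose prompt embedding is most similar
    to the query, if it clears the threshold; otherwise None (a miss)."""
    best_response, best_score = None, threshold
    for embedding, response in cache:
        score = cosine_similarity(query_embedding, embedding)
        if score >= best_score:
            best_response, best_score = response, score
    return best_response

# Toy embeddings: near-identical prompts map to nearby vectors.
cache = [([0.9, 0.1, 0.0], "Python is a language...")]
print(lookup(cache, [0.88, 0.12, 0.01]))  # similar prompt -> cache hit
print(lookup(cache, [0.0, 0.1, 0.9]))     # different topic -> None (miss)
```

With `similarity=1.0` the library can skip embeddings entirely and hash the prompt for an exact-match lookup, which is why semantic matching is an optional extra.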

Installation

# Core (exact match only)
pip install semantic-llm-cache

# With semantic similarity
pip install semantic-llm-cache[semantic]

# With Redis backend
pip install semantic-llm-cache[redis]

# With everything
pip install semantic-llm-cache[all]

Quick Start

Basic Caching (Exact Match)

import openai

from semantic_llm_cache import cache

@cache()
def ask_gpt(prompt: str) -> str:
    return openai.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    ).choices[0].message.content

# First call - API hit
ask_gpt("What is Python?")  # $0.002

# Second call - cache hit
ask_gpt("What is Python?")  # FREE, <10ms

Semantic Matching

Match semantically similar prompts (requires pip install semantic-llm-cache[semantic]):

from semantic_llm_cache import cache

@cache(similarity=0.90)
def ask_gpt(prompt: str) -> str:
    return call_openai(prompt)

ask_gpt("What is Python?")   # API call
ask_gpt("What's Python?")    # Cache hit (95% similar)
ask_gpt("Explain Python")    # Cache hit (91% similar)
ask_gpt("What is Rust?")     # API call (different topic)

TTL Expiration

from semantic_llm_cache import cache

@cache(ttl=3600)  # 1 hour
def ask_gpt(prompt: str) -> str:
    return call_openai(prompt)
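Under the hood, TTL caching amounts to storing a timestamp with each entry and treating entries older than `ttl` seconds as misses. A rough, self-contained sketch of that idea (hypothetical class, not the library's internals; the injectable clock just makes expiry deterministic to demonstrate):

```python
import time

class TTLCache:
    """Minimal TTL cache: entries expire `ttl` seconds after insertion."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock   # injectable for deterministic testing
        self._store = {}     # key -> (value, inserted_at)

    def set(self, key, value):
        self._store[key] = (value, self.clock())

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, inserted_at = entry
        if self.clock() - inserted_at > self.ttl:
            del self._store[key]  # expired: evict and report a miss
            return None
        return value

# Simulated clock so the expiry is visible without waiting an hour.
now = [0.0]
cache = TTLCache(ttl=3600, clock=lambda: now[0])
cache.set("What is Python?", "Python is ...")
print(cache.get("What is Python?"))  # fresh -> hit
now[0] += 4000                       # 4000 s later, past the 3600 s TTL
print(cache.get("What is Python?"))  # expired -> None
```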

Cache Statistics

from semantic_llm_cache import get_stats

stats = get_stats()
# {
#     "hits": 1547,
#     "misses": 892,
#     "hit_rate": 0.634,
#     "estimated_savings_usd": 3.09,
#     "latency_saved_ms": 773500
# }

Cache Management

from semantic_llm_cache import clear_cache, invalidate

# Clear all cached entries
clear_cache()

# Invalidate specific pattern
invalidate(pattern="Python")
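Pattern invalidation can be pictured as dropping every cached prompt whose key matches a pattern. An illustrative sketch over a plain dict, using a regex search (the library's actual matching rules may differ):

```python
import re

def invalidate(store, pattern):
    """Delete entries whose prompt matches the regex; return the count removed."""
    doomed = [key for key in store if re.search(pattern, key)]
    for key in doomed:
        del store[key]
    return len(doomed)

store = {
    "What is Python?": "...",
    "Explain Python decorators": "...",
    "What is Rust?": "...",
}
removed = invalidate(store, "Python")
print(removed)        # 2
print(sorted(store))  # ['What is Rust?']
```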

Advanced Usage

Multiple Cache Backends

from semantic_llm_cache import cache
from semantic_llm_cache.backends import RedisBackend

# Use Redis for distributed caching
backend = RedisBackend(url="redis://localhost:6379")

@cache(backend=backend)
def ask_gpt(prompt: str) -> str:
    return call_openai(prompt)

Context Manager

from semantic_llm_cache import CacheContext

with CacheContext(similarity=0.9) as ctx:
    result1 = any_llm_call("prompt 1")
    result2 = any_llm_call("prompt 2")

print(ctx.stats)  # {"hits": 1, "misses": 1}

Wrapper Class

from semantic_llm_cache import CachedLLM

llm = CachedLLM(
    provider="openai",
    similarity=0.9,
    ttl=3600
)

response = llm.chat("What is Python?")

API Reference

@cache() Decorator

@cache(
    similarity: float = 1.0,      # 1.0 = exact match, 0.9 = semantic
    ttl: int = 3600,              # seconds, None = forever
    backend: Backend = None,      # None = in-memory
    namespace: str = "default",   # isolate different use cases
    enabled: bool = True,         # toggle for debugging
    key_func: Callable = None,    # custom cache key
)
def my_llm_function(prompt: str) -> str:
    ...

Parameters

| Parameter  | Type       | Default     | Description                                                |
|------------|------------|-------------|------------------------------------------------------------|
| similarity | float      | 1.0         | Cosine similarity threshold (1.0 = exact, 0.9 = semantic)  |
| ttl        | int | None | 3600        | Time-to-live in seconds (None = never expires)             |
| backend    | Backend    | None        | Storage backend (None = in-memory)                         |
| namespace  | str        | "default"   | Isolate different use cases                                |
| enabled    | bool       | True        | Enable/disable caching                                     |
| key_func   | Callable   | None        | Custom cache key function                                  |
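A custom key function lets superficially different prompts share one cache entry. As an example, here is a hypothetical key function that normalizes case and whitespace before hashing; this is illustrative only, and the exact signature `key_func` expects should be checked against the library's documentation:

```python
import hashlib

def normalized_key(prompt: str) -> str:
    """Collapse whitespace and case so trivially different prompts collide."""
    canonical = " ".join(prompt.lower().split())
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Trivially different phrasings of the same prompt share a key.
print(normalized_key("What is Python?") == normalized_key("  what IS   python? "))  # True
```

Assuming the decorator accepts it as documented above, such a function would be wired in with `@cache(key_func=normalized_key)`.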

Utility Functions

from semantic_llm_cache import (
    get_stats,      # Get cache statistics
    clear_cache,    # Clear all cached entries
    invalidate,     # Invalidate by pattern
    warm_cache,     # Pre-populate cache
    export_cache,   # Export for analysis
)

Backends

| Backend       | Description         | Installation                           |
|---------------|---------------------|----------------------------------------|
| MemoryBackend | In-memory (default) | Built-in                               |
| SQLiteBackend | Persistent storage  | Built-in                               |
| RedisBackend  | Distributed caching | pip install semantic-llm-cache[redis]  |
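To see what a persistent backend buys you over the in-memory default, here is a minimal stand-alone sketch of a SQLite-backed key-value store built on the standard library; it shows the essence of persistence, not SQLiteBackend's actual schema or API:

```python
import sqlite3

class SQLiteKV:
    """Tiny persistent key-value store: the essence of a SQLite cache backend.
    Entries survive process restarts when backed by a file on disk."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, value TEXT)"
        )

    def set(self, key, value):
        # INSERT OR REPLACE overwrites a stale entry for the same prompt.
        self.conn.execute(
            "INSERT OR REPLACE INTO cache (key, value) VALUES (?, ?)", (key, value)
        )
        self.conn.commit()

    def get(self, key):
        row = self.conn.execute(
            "SELECT value FROM cache WHERE key = ?", (key,)
        ).fetchone()
        return row[0] if row else None

kv = SQLiteKV()  # ":memory:" for the demo; a real backend would use a file path
kv.set("What is Python?", "Python is ...")
print(kv.get("What is Python?"))  # Python is ...
print(kv.get("unknown"))          # None
```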

Performance

| Metric              | Value             |
|---------------------|-------------------|
| Cache hit latency   | <10ms             |
| Cache miss overhead | <50ms (embedding) |
| Typical hit rate    | 25-40%            |
| Cost reduction      | 20-40%            |

Requirements

  • Python >= 3.9
  • numpy >= 1.24.0

Optional Dependencies

  • sentence-transformers >= 2.2.0 (for semantic matching)
  • redis >= 4.0.0 (for Redis backend)
  • openai >= 1.0.0 (for OpenAI embeddings)

License

MIT License - see LICENSE file.

Author

Karthick Raja M (@karthyick)

Cut LLM costs 30% with one decorator. pip install semantic-llm-cache

Download files

Download the file for your platform.

Source Distribution

semantic_llm_cache-0.1.0.tar.gz (33.6 kB)


Built Distribution

semantic_llm_cache-0.1.0-py3-none-any.whl (24.9 kB)


File details

Details for the file semantic_llm_cache-0.1.0.tar.gz.

File metadata

  • Download URL: semantic_llm_cache-0.1.0.tar.gz
  • Size: 33.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for semantic_llm_cache-0.1.0.tar.gz
| Algorithm   | Hash digest                                                      |
|-------------|------------------------------------------------------------------|
| SHA256      | 0de9a7984be926d43486e1fae6577092c74dd035281b8284918d7a008b568060 |
| MD5         | f4dedb327482d83c9c8f71f011df21d1                                 |
| BLAKE2b-256 | 7dcc1a59e01a7b4f83ee0af219868250413b3007a1c08b16d3ef71b4ef8e37d1 |


File details

Details for the file semantic_llm_cache-0.1.0-py3-none-any.whl.

File hashes

Hashes for semantic_llm_cache-0.1.0-py3-none-any.whl
| Algorithm   | Hash digest                                                      |
|-------------|------------------------------------------------------------------|
| SHA256      | 573592d4c2d38a7cb42f6d615d51a61f2bc3a5375a6a02661215d2a35ff3812c |
| MD5         | ad7cf700a47181b240c7d9f1b246cd60                                 |
| BLAKE2b-256 | dce2d731be75c73cef7d6007c0de54c2abefe9357bfa275a0836fd9cddb36ac4 |

