
Project description

semantic-llm-cache

Semantic caching for LLM API calls - reduce costs with one decorator.


Overview

LLM API calls are expensive and slow. In many production applications, 20-40% of prompts are semantically equivalent, yet each one is billed as a separate API call. semantic-llm-cache addresses this with a simple decorator that:

  • Caches semantically similar prompts (not just exact matches)
  • Reduces API costs by 20-40%
  • Returns cached responses in <10ms
  • Works with any LLM provider (OpenAI, Anthropic, local models)
  • Zero behavior change - drop-in decorator
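Conceptually, a semantic cache stores an embedding alongside each cached response and answers a new prompt from cache when the nearest stored embedding clears a similarity threshold. A minimal, illustrative sketch of that lookup in plain Python (toy vectors, not the library's actual implementation):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def lookup(cache, query_embedding, threshold=0.9):
    """Return the cached response whose prompt embedding is most similar
    to the query, if it clears the threshold; otherwise None (a miss)."""
    best_response, best_score = None, threshold
    for embedding, response in cache:
        score = cosine_similarity(query_embedding, embedding)
        if score >= best_score:
            best_response, best_score = response, score
    return best_response

# Toy embeddings: near-identical prompts map to nearby vectors.
cache = [([0.9, 0.1, 0.0], "Python is a language...")]
print(lookup(cache, [0.88, 0.12, 0.01]))  # similar prompt -> cache hit
print(lookup(cache, [0.0, 0.1, 0.9]))     # different topic -> None (miss)
```

With `similarity=1.0` the library can skip embeddings entirely and hash the prompt for an exact-match lookup, which is why semantic matching is an optional extra.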

Installation

# Core (exact match only)
pip install semantic-llm-cache

# With semantic similarity
pip install semantic-llm-cache[semantic]

# With Redis backend
pip install semantic-llm-cache[redis]

# With everything
pip install semantic-llm-cache[all]

Quick Start

Basic Caching (Exact Match)

import openai

from semantic_llm_cache import cache

@cache()
def ask_gpt(prompt: str) -> str:
    return openai.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    ).choices[0].message.content

# First call - API hit
ask_gpt("What is Python?")  # $0.002

# Second call - cache hit
ask_gpt("What is Python?")  # FREE, <10ms

Semantic Matching

Match semantically similar prompts (requires pip install semantic-llm-cache[semantic]):

from semantic_llm_cache import cache

@cache(similarity=0.90)
def ask_gpt(prompt: str) -> str:
    return call_openai(prompt)

ask_gpt("What is Python?")   # API call
ask_gpt("What's Python?")    # Cache hit (95% similar)
ask_gpt("Explain Python")    # Cache hit (91% similar)
ask_gpt("What is Rust?")     # API call (different topic)

TTL Expiration

from semantic_llm_cache import cache

@cache(ttl=3600)  # 1 hour
def ask_gpt(prompt: str) -> str:
    return call_openai(prompt)
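Under the hood, TTL caching amounts to storing a timestamp with each entry and treating entries older than `ttl` seconds as misses. A rough, self-contained sketch of that idea (hypothetical class, not the library's internals; the injectable clock just makes expiry deterministic to demonstrate):

```python
import time

class TTLCache:
    """Minimal TTL cache: entries expire `ttl` seconds after insertion."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock   # injectable for deterministic testing
        self._store = {}     # key -> (value, inserted_at)

    def set(self, key, value):
        self._store[key] = (value, self.clock())

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, inserted_at = entry
        if self.clock() - inserted_at > self.ttl:
            del self._store[key]  # expired: evict and report a miss
            return None
        return value

# Simulated clock so the expiry is visible without waiting an hour.
now = [0.0]
cache = TTLCache(ttl=3600, clock=lambda: now[0])
cache.set("What is Python?", "Python is ...")
print(cache.get("What is Python?"))  # fresh -> hit
now[0] += 4000                       # 4000 s later, past the 3600 s TTL
print(cache.get("What is Python?"))  # expired -> None
```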

Cache Statistics

from semantic_llm_cache import get_stats

stats = get_stats()
# {
#     "hits": 1547,
#     "misses": 892,
#     "hit_rate": 0.634,
#     "estimated_savings_usd": 3.09,
#     "latency_saved_ms": 773500
# }

Cache Management

from semantic_llm_cache import clear_cache, invalidate

# Clear all cached entries
clear_cache()

# Invalidate specific pattern
invalidate(pattern="Python")
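Pattern invalidation can be pictured as dropping every cached prompt whose key matches a pattern. An illustrative sketch over a plain dict, using a regex search (the library's actual matching rules may differ):

```python
import re

def invalidate(store, pattern):
    """Delete entries whose prompt matches the regex; return the count removed."""
    doomed = [key for key in store if re.search(pattern, key)]
    for key in doomed:
        del store[key]
    return len(doomed)

store = {
    "What is Python?": "...",
    "Explain Python decorators": "...",
    "What is Rust?": "...",
}
removed = invalidate(store, "Python")
print(removed)        # 2
print(sorted(store))  # ['What is Rust?']
```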

Advanced Usage

Multiple Cache Backends

from semantic_llm_cache import cache
from semantic_llm_cache.backends import RedisBackend

# Use Redis for distributed caching
backend = RedisBackend(url="redis://localhost:6379")

@cache(backend=backend)
def ask_gpt(prompt: str) -> str:
    return call_openai(prompt)

Context Manager

from semantic_llm_cache import CacheContext

with CacheContext(similarity=0.9) as ctx:
    result1 = any_llm_call("prompt 1")
    result2 = any_llm_call("prompt 2")

print(ctx.stats)  # {"hits": 1, "misses": 1}

Wrapper Class

from semantic_llm_cache import CachedLLM

llm = CachedLLM(
    provider="openai",
    similarity=0.9,
    ttl=3600
)

response = llm.chat("What is Python?")

API Reference

@cache() Decorator

@cache(
    similarity: float = 1.0,      # 1.0 = exact match, 0.9 = semantic
    ttl: int = 3600,              # seconds, None = forever
    backend: Backend = None,      # None = in-memory
    namespace: str = "default",   # isolate different use cases
    enabled: bool = True,         # toggle for debugging
    key_func: Callable = None,    # custom cache key
)
def my_llm_function(prompt: str) -> str:
    ...

Parameters

| Parameter  | Type       | Default     | Description                                                |
|------------|------------|-------------|------------------------------------------------------------|
| similarity | float      | 1.0         | Cosine similarity threshold (1.0 = exact, 0.9 = semantic)  |
| ttl        | int | None | 3600        | Time-to-live in seconds (None = never expires)             |
| backend    | Backend    | None        | Storage backend (None = in-memory)                         |
| namespace  | str        | "default"   | Isolate different use cases                                |
| enabled    | bool       | True        | Enable/disable caching                                     |
| key_func   | Callable   | None        | Custom cache key function                                  |
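A custom key function lets superficially different prompts share one cache entry. As an example, here is a hypothetical key function that normalizes case and whitespace before hashing; this is illustrative only, and the exact signature `key_func` expects should be checked against the library's documentation:

```python
import hashlib

def normalized_key(prompt: str) -> str:
    """Collapse whitespace and case so trivially different prompts collide."""
    canonical = " ".join(prompt.lower().split())
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Trivially different phrasings of the same prompt share a key.
print(normalized_key("What is Python?") == normalized_key("  what IS   python? "))  # True
```

Assuming the decorator accepts it as documented above, such a function would be wired in with `@cache(key_func=normalized_key)`.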

Utility Functions

from semantic_llm_cache import (
    get_stats,      # Get cache statistics
    clear_cache,    # Clear all cached entries
    invalidate,     # Invalidate by pattern
    warm_cache,     # Pre-populate cache
    export_cache,   # Export for analysis
)

Backends

| Backend       | Description         | Installation                           |
|---------------|---------------------|----------------------------------------|
| MemoryBackend | In-memory (default) | Built-in                               |
| SQLiteBackend | Persistent storage  | Built-in                               |
| RedisBackend  | Distributed caching | pip install semantic-llm-cache[redis]  |
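To see what a persistent backend buys you over the in-memory default, here is a minimal stand-alone sketch of a SQLite-backed key-value store built on the standard library; it shows the essence of persistence, not SQLiteBackend's actual schema or API:

```python
import sqlite3

class SQLiteKV:
    """Tiny persistent key-value store: the essence of a SQLite cache backend.
    Entries survive process restarts when backed by a file on disk."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, value TEXT)"
        )

    def set(self, key, value):
        # INSERT OR REPLACE overwrites a stale entry for the same prompt.
        self.conn.execute(
            "INSERT OR REPLACE INTO cache (key, value) VALUES (?, ?)", (key, value)
        )
        self.conn.commit()

    def get(self, key):
        row = self.conn.execute(
            "SELECT value FROM cache WHERE key = ?", (key,)
        ).fetchone()
        return row[0] if row else None

kv = SQLiteKV()  # ":memory:" for the demo; a real backend would use a file path
kv.set("What is Python?", "Python is ...")
print(kv.get("What is Python?"))  # Python is ...
print(kv.get("unknown"))          # None
```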

Performance

| Metric              | Value             |
|---------------------|-------------------|
| Cache hit latency   | <10ms             |
| Cache miss overhead | <50ms (embedding) |
| Typical hit rate    | 25-40%            |
| Cost reduction      | 20-40%            |

Requirements

  • Python >= 3.9
  • numpy >= 1.24.0

Optional Dependencies

  • sentence-transformers >= 2.2.0 (for semantic matching)
  • redis >= 4.0.0 (for Redis backend)
  • openai >= 1.0.0 (for OpenAI embeddings)

License

MIT License - see LICENSE file.

Author

Karthick Raja M (@karthyick)

Cut LLM costs 30% with one decorator. pip install semantic-llm-cache

Download files

Download the file for your platform.

Source Distribution

semantic_llm_cache-0.1.0.tar.gz (33.6 kB)


Built Distribution

semantic_llm_cache-0.1.0-py3-none-any.whl (24.9 kB)


File details

Details for the file semantic_llm_cache-0.1.0.tar.gz.

File metadata

  • Download URL: semantic_llm_cache-0.1.0.tar.gz
  • Size: 33.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for semantic_llm_cache-0.1.0.tar.gz
| Algorithm   | Hash digest                                                      |
|-------------|------------------------------------------------------------------|
| SHA256      | 0de9a7984be926d43486e1fae6577092c74dd035281b8284918d7a008b568060 |
| MD5         | f4dedb327482d83c9c8f71f011df21d1                                 |
| BLAKE2b-256 | 7dcc1a59e01a7b4f83ee0af219868250413b3007a1c08b16d3ef71b4ef8e37d1 |


File details

Details for the file semantic_llm_cache-0.1.0-py3-none-any.whl.

File hashes

Hashes for semantic_llm_cache-0.1.0-py3-none-any.whl
| Algorithm   | Hash digest                                                      |
|-------------|------------------------------------------------------------------|
| SHA256      | 573592d4c2d38a7cb42f6d615d51a61f2bc3a5375a6a02661215d2a35ff3812c |
| MD5         | ad7cf700a47181b240c7d9f1b246cd60                                 |
| BLAKE2b-256 | dce2d731be75c73cef7d6007c0de54c2abefe9357bfa275a0836fd9cddb36ac4 |

