LLMCacheX
Intelligent caching layer for LLM calls with semantic understanding, async deduplication, and cost control
Why LLMCacheX?
LLM calls are expensive. A single GPT-4 call can cost $0.03 or more. In production, you:
- Pay for identical requests multiple times
- Have no way to replay production issues locally
- Can't see exactly what you're being charged for
- Lose money on prompt variations ("Explain X" vs "What is X?")
LLMCacheX solves all of this.
Without LLMCacheX: 10 identical requests → $0.30 (10 LLM calls)
With LLMCacheX: 10 identical requests → $0.03 (1 LLM call, 9 cache hits)
Semantic bonus: 10 similar requests → about $0.03 plus embedding cost (1 LLM call, 9 semantic matches)
What It Does
🎯 Exact Match Caching
Deterministic hashing prevents duplicate LLM calls. Identical prompts always hit cache.
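A minimal sketch of how a deterministic cache key could be derived. This is not LLMCacheX's actual hashing code: the `cache_key` function and the choice of fields (model, prompt, temperature, org ID) are assumptions made for illustration.

```python
import hashlib
import json


def cache_key(model: str, prompt: str, temperature: float, org_id: str = "default") -> str:
    """Build a deterministic key from the fields assumed to affect the response."""
    payload = {
        "model": model,
        "prompt": prompt,
        "temperature": temperature,
        "org_id": org_id,
    }
    # sort_keys and fixed separators keep the serialization stable across runs.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


key_a = cache_key("gpt-4o-mini", "Explain Python", 0.3)
key_b = cache_key("gpt-4o-mini", "Explain Python", 0.3)
key_c = cache_key("gpt-4o-mini", "Explain Python!", 0.3)
print(key_a == key_b)  # identical requests share a key
print(key_a == key_c)  # any change to the prompt yields a different key
```

Including the org ID in the hashed payload is one simple way to keep tenants from sharing entries.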
🧠 Semantic Caching
"Explain Python" = "What is Python?" = "Tell me about Python" → Same cached response
- 10-30x increase in cache hit rates
- Handles typos, phrasing variations, and multilingual prompts
- Configurable similarity threshold (0.85-0.97 is the suggested range; any value from 0.0 to 1.0 is accepted)
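The threshold check can be sketched with plain cosine similarity. The helper names (`cosine_similarity`, `best_semantic_match`) and the three-dimensional toy vectors are hypothetical. Real embeddings would come from an embedding model, and LLMCacheX's internal matching may differ.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_semantic_match(
    query: Sequence[float],
    entries: List[Tuple[Sequence[float], str]],
    threshold: float = 0.92,
) -> Optional[Tuple[str, float]]:
    """Return (response, similarity) for the closest entry at or above the threshold."""
    best: Optional[Tuple[str, float]] = None
    for embedding, response in entries:
        score = cosine_similarity(query, embedding)
        if score >= threshold and (best is None or score > best[1]):
            best = (response, score)
    return best


# Toy vectors standing in for real embeddings.
entries = [([1.0, 0.0, 0.0], "Python is..."), ([0.0, 1.0, 0.0], "Rust is...")]
hit = best_semantic_match([0.9, 0.1, 0.0], entries)
miss = best_semantic_match([0.6, 0.6, 0.0], entries)
print(hit, miss)
```

A higher threshold trades hit rate for a lower chance of returning an answer to a question that only looks similar.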
⚡ Async Deduplication
10 concurrent identical requests:
- First hits LLM
- Others wait for result
- All get same response
- You pay once, not 10x
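The steps above follow a pattern often called "single flight". Below is a minimal asyncio sketch of it, with a hypothetical `SingleFlight` class and a fake LLM call. It illustrates the idea and is not LLMCacheX's internal implementation.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Tuple


class SingleFlight:
    """Coalesce concurrent calls that share a key into one underlying call."""

    def __init__(self) -> None:
        self._in_flight: Dict[str, "asyncio.Future[str]"] = {}

    async def run(self, key: str, fn: Callable[[], Awaitable[str]]) -> str:
        task = self._in_flight.get(key)
        if task is None:
            # First caller: start the real call and register it under the key.
            task = asyncio.ensure_future(fn())
            self._in_flight[key] = task
            task.add_done_callback(lambda _t: self._in_flight.pop(key, None))
        # shield() keeps one cancelled waiter from cancelling the shared call.
        return await asyncio.shield(task)


async def demo() -> Tuple[int, List[str]]:
    calls = 0

    async def fake_llm_call() -> str:
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.01)  # stand-in for network latency
        return "Python is..."

    flight = SingleFlight()
    results = await asyncio.gather(*(flight.run("key", fake_llm_call) for _ in range(10)))
    return calls, list(results)


calls, results = asyncio.run(demo())
print(f"LLM calls: {calls}, distinct responses: {len(set(results))}")
```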
🔄 Replay & Time Travel
Replay production requests locally with zero API usage:
# Record (production)
export LLMCACHEX_MODE=live
# Replay (local development)
export LLMCACHEX_MODE=replay
llmcachex run "same prompt" # Returns cached response
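One way to picture the two modes is sketched below. The `resolve` function, the `ReplayMiss` exception, the `"live"` default, and the choice to raise on a replay miss are all assumptions for illustration. The source does not state what LLMCacheX does when a replayed prompt was never recorded.

```python
import os
from typing import Callable, Dict, Optional


class ReplayMiss(LookupError):
    """Raised in replay mode when no recorded response exists for a prompt."""


def resolve(
    prompt: str,
    recorded: Dict[str, str],
    call_llm: Callable[[str], str],
    mode: Optional[str] = None,
) -> str:
    # Fall back to the environment variable; "live" as the default is an assumption.
    mode = mode or os.environ.get("LLMCACHEX_MODE", "live")
    if prompt in recorded:
        return recorded[prompt]
    if mode == "replay":
        # Hypothetical behavior: fail loudly instead of spending API budget.
        raise ReplayMiss(f"no recorded response for {prompt!r}")
    response = call_llm(prompt)
    recorded[prompt] = response  # record for later replay
    return response


def fail_if_called(prompt: str) -> str:
    raise AssertionError("the LLM must not be called in replay mode")


recorded: Dict[str, str] = {}
live_answer = resolve("same prompt", recorded, lambda p: "recorded answer", mode="live")
replayed = resolve("same prompt", recorded, fail_if_called, mode="replay")
print(replayed == live_answer)
```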
🏢 Multi-Tenant Org Scoping
Complete cache isolation per organization:
curl -H "X-Org-ID: acme" http://localhost:8000/api/v1/cache
curl -H "X-Org-ID: techcorp" http://localhost:8000/api/v1/cache
# Different caches, zero cross-contamination
📊 Cost Control
See exactly what you're paying for:
{
  "content": "Python is...",
  "cost": 0.0024,
  "cache_type": "semantic",
  "similarity": 0.94
}
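Given response records shaped like the JSON above, a short script can total spend per cache type. This is a sketch: `summarize_costs` is a hypothetical helper, and the sample records are made up for illustration.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Mapping


def summarize_costs(responses: Iterable[Mapping[str, Any]]) -> Dict[str, Dict[str, float]]:
    """Group response records by cache_type and total their count and cost."""
    totals: Dict[str, Dict[str, float]] = defaultdict(lambda: {"count": 0, "cost": 0.0})
    for record in responses:
        bucket = totals[record["cache_type"]]
        bucket["count"] += 1
        bucket["cost"] += record["cost"]
    return dict(totals)


# Illustrative records, not real measurements.
sample = [
    {"content": "Python is...", "cost": 0.0024, "cache_type": "miss"},
    {"content": "Python is...", "cost": 0.0, "cache_type": "semantic", "similarity": 0.94},
    {"content": "Python is...", "cost": 0.0, "cache_type": "exact"},
]
summary = summarize_costs(sample)
for cache_type, stats in summary.items():
    print(f"{cache_type}: {stats['count']} request(s), ${stats['cost']:.4f}")
```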
Quick Start
Installation
# Basic (memory storage)
pip install llmcachex
# With gateway
pip install llmcachex[gateway]
# Full (with Redis, dev tools)
pip install llmcachex[all]
Environment Setup
export OPENAI_API_KEY="sk-..."
export LLMCACHEX_SEMANTIC=true
export LLMCACHEX_SEMANTIC_THRESHOLD=0.92
CLI Usage
# Simple request
llmcachex run "Explain quantum computing"
# With semantic tolerance
llmcachex run "What is quantum computing?" --threshold 0.90
# With org isolation
llmcachex run "Explain Python" --org acme_corp
# Disable semantic (exact only)
llmcachex run "Explain Python" --no-semantic
HTTP Gateway (SaaS Mode)
# Start server
llmcachex serve --port 8000
# Make request
curl -X POST http://localhost:8000/api/v1/cache \
  -H "Content-Type: application/json" \
  -H "X-Org-ID: acme" \
  -d '{
    "prompt": "Explain machine learning",
    "model": "gpt-4o-mini",
    "temperature": 0.3
  }'
Response:
{
  "content": "Machine learning is...",
  "cached": true,
  "cache_type": "semantic",
  "cost": 0.0,
  "similarity": 0.94,
  "org_id": "acme"
}
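The same request can be built from Python with only the standard library. This sketch mirrors the curl call above: the URL, headers, and body fields come from that example, while `build_cache_request` is a hypothetical helper. Sending it requires a gateway running on localhost:8000.

```python
import json
import urllib.request


def build_cache_request(
    prompt: str,
    org_id: str,
    base_url: str = "http://localhost:8000",
    model: str = "gpt-4o-mini",
    temperature: float = 0.3,
) -> urllib.request.Request:
    """Build the POST request for the gateway's /api/v1/cache endpoint."""
    body = json.dumps(
        {"prompt": prompt, "model": model, "temperature": temperature}
    ).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/v1/cache",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json", "X-Org-ID": org_id},
    )


request = build_cache_request("Explain machine learning", org_id="acme")
print(request.get_method(), request.full_url)

# To send it against a running gateway:
# with urllib.request.urlopen(request) as resp:
#     payload = json.load(resp)
#     print(payload["cache_type"], payload["cost"])
```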
Architecture
Request → Hash → Exact Match? ✓ Return
↓ No
Generate Embedding
↓
Semantic Match? ✓ Return
↓ No
Call LLM (with dedup lock)
↓
Store with Embedding
↓
Return + Cost
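The flow above can be sketched as a single function. Everything here is a simplified stand-in: `run_pipeline`, `Result`, the dictionary-backed store, and the paraphrase lookup that plays the role of embedding search are hypothetical. Only the `cache_type` values ("exact", "semantic", "miss") come from the Python API section.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Result:
    content: str
    cache_type: str  # "exact", "semantic", or "miss"


def run_pipeline(
    prompt: str,
    exact: Dict[str, str],
    find_similar: Callable[[str], Optional[str]],
    call_llm: Callable[[str], str],
) -> Result:
    # 1. Exact match (a real cache would hash the full request, not just the prompt).
    if prompt in exact:
        return Result(exact[prompt], "exact")
    # 2. Semantic match via a caller-supplied similarity search.
    similar = find_similar(prompt)
    if similar is not None:
        return Result(similar, "semantic")
    # 3. Miss: call the LLM and store the response for next time.
    content = call_llm(prompt)
    exact[prompt] = content
    return Result(content, "miss")


store: Dict[str, str] = {}
paraphrases = {"What is Python?": "Explain Python"}  # stand-in for embedding search


def find_similar(prompt: str) -> Optional[str]:
    canonical = paraphrases.get(prompt)
    return store.get(canonical) if canonical else None


def fake_llm(prompt: str) -> str:
    return f"answer to: {prompt}"


first = run_pipeline("Explain Python", store, find_similar, fake_llm)
second = run_pipeline("Explain Python", store, find_similar, fake_llm)
third = run_pipeline("What is Python?", store, find_similar, fake_llm)
print(first.cache_type, second.cache_type, third.cache_type)
```

Checking the exact store first keeps the cheapest path (no embedding call) for repeated prompts.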
Storage Backends
| Backend | Use Case | Latency | Cost |
|---|---|---|---|
| Memory | Development, testing | <1ms | Free |
| Redis | Production, shared cache | 5-10ms | $$ |
| Vector DB | Large scale (100K+) | 50-100ms | $$$ |
Providers
- OpenAI (default): GPT-4, GPT-3.5, embeddings API
- Custom: Implement provider interface
Configuration
Environment Variables
# Core
OPENAI_API_KEY=sk-... # OpenAI API key (required)
LLMCACHEX_MODE=live|replay # live or replay mode
LLMCACHEX_ORG=default # Default org ID
# Storage
LLMCACHEX_STORAGE=memory|redis # Storage backend
REDIS_URL=redis://localhost:6379/0 # Redis connection
# Semantic Caching
LLMCACHEX_SEMANTIC=true|false # Enable semantic matching
LLMCACHEX_SEMANTIC_THRESHOLD=0.92 # Similarity threshold (0.0-1.0)
LLMCACHEX_EMBEDDING_MODEL=text-embedding-3-small # Embedding model
# Security
LLMCACHEX_API_KEY=secret # API key for gateway
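For applications that want to validate these variables before starting, a loader could look like the sketch below. The `Settings` dataclass and `load_settings` function are hypothetical, and the defaults are taken from the example values above rather than from LLMCacheX's documented defaults.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    mode: str
    org: str
    storage: str
    redis_url: str
    semantic: bool
    threshold: float
    embedding_model: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read and validate LLMCacheX-related variables from a mapping (default: os.environ)."""
    env = os.environ if env is None else env
    mode = env.get("LLMCACHEX_MODE", "live")
    if mode not in ("live", "replay"):
        raise ValueError(f"LLMCACHEX_MODE must be 'live' or 'replay', got {mode!r}")
    storage = env.get("LLMCACHEX_STORAGE", "memory")
    if storage not in ("memory", "redis"):
        raise ValueError(f"LLMCACHEX_STORAGE must be 'memory' or 'redis', got {storage!r}")
    threshold = float(env.get("LLMCACHEX_SEMANTIC_THRESHOLD", "0.92"))
    if not 0.0 <= threshold <= 1.0:
        raise ValueError(f"threshold must be within 0.0-1.0, got {threshold}")
    return Settings(
        mode=mode,
        org=env.get("LLMCACHEX_ORG", "default"),
        storage=storage,
        redis_url=env.get("REDIS_URL", "redis://localhost:6379/0"),
        semantic=env.get("LLMCACHEX_SEMANTIC", "true").lower() == "true",
        threshold=threshold,
        embedding_model=env.get("LLMCACHEX_EMBEDDING_MODEL", "text-embedding-3-small"),
    )


settings = load_settings({"LLMCACHEX_STORAGE": "redis", "LLMCACHEX_SEMANTIC": "false"})
print(settings)
```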
Python API
Simple Usage
from llmcachex import LLMCache
from llmcachex.storage.memory import MemoryStorage
from llmcachex.providers.openai_provider import OpenAIProvider
from llmcachex.models import LLMRequest
import asyncio
# Initialize
cache = LLMCache(MemoryStorage(), OpenAIProvider())
# Make request
req = LLMRequest(
    provider="openai",
    model="gpt-4o-mini",
    prompt="Explain Python",
    temperature=0.3,
    org_id="my_org"
)
result = asyncio.run(cache.run_async(req))
print(result.content) # Response text
print(result.cache_type) # "exact", "semantic", or "miss"
print(result.cost) # Dollar cost
print(result.similarity) # 0.94 for semantic hits
With Redis
from llmcachex import LLMCache
from llmcachex.providers.openai_provider import OpenAIProvider
from llmcachex.storage.redis import RedisStorage

cache = LLMCache(
    RedisStorage("redis://localhost:6379/0"),
    OpenAIProvider()
)
Custom Threshold
# Call from inside an async function
result = await cache.run_async(
    req,
    similarity_threshold=0.95  # More conservative
)
Examples
Example 1: Cost Savings
# Monitor real savings (run inside an async function; `req` is an LLMRequest)
results = []
for _ in range(10):
    result = await cache.run_async(req)
    results.append(result)
misses = sum(1 for r in results if r.cache_type == "miss")
total_cost = sum(r.cost for r in results)
print(f"Cache hits: {10 - misses}")
print(f"Total cost: ${total_cost:.6f}")
Example 2: Replay Production
# Production (recording)
export LLMCACHEX_MODE=live
export REDIS_URL=redis://prod:6379/0
# Debug locally with zero cost
export LLMCACHEX_MODE=replay
llmcachex run "same prompt"
Example 3: Multi-Tenant SaaS
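from fastapi import FastAPI, Header

app = FastAPI()
# Assumes CacheRequest is a Pydantic model with a `prompt` field and `cache` is an LLMCache.
# The alias makes FastAPI read X-Org-ID; a bare Header() would look for "org-id".
@app.post("/api/cache")
async def handle_cache(request: CacheRequest, org_id: str = Header(alias="X-Org-ID")):
    llm_req = LLMRequest(
        prompt=request.prompt,
        org_id=org_id
    )
    return await cache.run_async(llm_req)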
Benchmarks
Speed
Exact match: < 1ms
Semantic match: 50-200ms
LLM call: 1000-5000ms
Win: semantic hits are 5-100x faster than an LLM call; exact hits are 1000x+ faster
Cost
1000 requests with 30% natural duplication + 20% semantic boost:
Exact only: 700 calls = $7.00
With semantic: 500 calls + embeddings = $5.50
Savings: 21% (compounds to $550/month)
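The arithmetic behind these figures can be reproduced as follows. The per-call price ($0.01) and embedding overhead ($0.50) are not stated directly. They are inferred from "700 calls = $7.00" and "500 calls + embeddings = $5.50", and `cost_summary` is a hypothetical helper. The monthly figure depends on request volume, which is not given here.

```python
from typing import Dict


def cost_summary(
    total_requests: int = 1000,
    exact_dup_rate: float = 0.30,
    semantic_rate: float = 0.20,
    price_per_call: float = 0.01,      # inferred from "700 calls = $7.00"
    embedding_overhead: float = 0.50,  # inferred from "500 calls + embeddings = $5.50"
) -> Dict[str, float]:
    """Compare LLM spend with no cache, exact-only caching, and semantic caching."""
    no_cache_cost = total_requests * price_per_call
    exact_calls = round(total_requests * (1 - exact_dup_rate))
    semantic_calls = round(total_requests * (1 - exact_dup_rate - semantic_rate))
    exact_cost = exact_calls * price_per_call
    semantic_cost = semantic_calls * price_per_call + embedding_overhead
    return {
        "no_cache_cost": no_cache_cost,
        "exact_calls": exact_calls,
        "exact_cost": exact_cost,
        "semantic_calls": semantic_calls,
        "semantic_cost": semantic_cost,
        "savings_vs_exact": (exact_cost - semantic_cost) / exact_cost,
        "savings_vs_no_cache": (no_cache_cost - semantic_cost) / no_cache_cost,
    }


summary = cost_summary()
print(f"No cache:      ${summary['no_cache_cost']:.2f}")
print(f"Exact only:    {summary['exact_calls']} calls = ${summary['exact_cost']:.2f}")
print(f"With semantic: {summary['semantic_calls']} calls + embeddings = ${summary['semantic_cost']:.2f}")
print(f"Savings vs exact-only: {summary['savings_vs_exact']:.0%}")
```

The 21% figure is therefore the saving relative to exact-only caching. Relative to no caching at all, the same assumptions give a larger reduction.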
Documentation
- GATEWAY.md - HTTP API reference
- SEMANTIC.md - Semantic caching guide
- API Docs - Interactive Swagger UI
Testing
pip install llmcachex[dev]
pytest tests/ -v
pytest tests/ --cov=llmcachex
mypy llmcachex/
ruff check llmcachex/
black llmcachex/
Roadmap
- Exact match caching
- Semantic caching
- Async deduplication
- Org-scoped isolation
- HTTP gateway
- Redis backend
- Vector database support
- Advanced analytics
- Streaming responses
- Custom providers
- Rate limiting
- Usage billing
Production Deployment
Quick Start
docker run -p 8000:8000 \
  -e OPENAI_API_KEY=$OPENAI_API_KEY \
  prabhnoor12/llmcachex:latest
With Docker Compose
docker-compose up -d # Redis + LLMCacheX
Contributing
- Fork the repo
- Create feature branch
- Add tests
- Format with black and ruff
- Submit PR
License
MIT License - see LICENSE
Support
- Issues: GitHub Issues
- Docs: SEMANTIC.md · GATEWAY.md
Made with ❤️ by prabhnoor12