LLMCacheX

Intelligent caching layer for LLM calls with semantic understanding, async deduplication, and cost control

License: MIT | Python 3.9+


Why LLMCacheX?

LLM calls are expensive: a single GPT-4 call can cost $0.03 or more. In production, you:

  • Pay for identical requests multiple times
  • Have no way to replay production issues locally
  • Can't see exactly what you're being charged for
  • Lose money on prompt variations ("Explain X" vs "What is X?")

LLMCacheX solves all of this.

Without LLMCacheX:  10 identical requests → $0.30
With LLMCacheX:     10 identical requests → $0.03 (1 LLM call, 9 cache hits)
Semantic bonus:     10 similar requests  → $0.03 (semantic match)

What It Does

🎯 Exact Match Caching

Deterministic hashing prevents duplicate LLM calls. Identical prompts always hit cache.
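
As an illustration of deterministic hashing, here is a minimal sketch of how a cache key could be derived from a request. The `cache_key` helper and the choice of SHA-256 over canonical JSON are assumptions made for this example, not LLMCacheX's actual key format.

```python
import hashlib
import json


def cache_key(provider: str, model: str, prompt: str,
              temperature: float, org_id: str = "default") -> str:
    """Hypothetical helper: the same inputs always produce the same digest."""
    # sort_keys and fixed separators make the JSON text canonical,
    # so field ordering can never change the hash.
    payload = json.dumps(
        {
            "provider": provider,
            "model": model,
            "prompt": prompt,
            "temperature": temperature,
            "org_id": org_id,
        },
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


k1 = cache_key("openai", "gpt-4o-mini", "Explain Python", 0.3)
k2 = cache_key("openai", "gpt-4o-mini", "Explain Python", 0.3)
k3 = cache_key("openai", "gpt-4o-mini", "What is Python?", 0.3)
print(k1 == k2)  # identical request: same key
print(k1 == k3)  # different wording: different key
```

Including the model and temperature in the key keeps responses generated under different settings from being served for one another.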

🧠 Semantic Caching

"Explain Python" = "What is Python?" = "Tell me about Python" → Same cached response

  • 10-30x increase in cache hit rates
  • Handles typos, phrasing variations, and multilingual prompts
  • Configurable similarity threshold (accepts 0.0-1.0; 0.85-0.97 is the suggested range)
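
To make the threshold concrete, the following is a rough sketch of threshold-based matching using cosine similarity over embedding vectors. The function names and the two-dimensional toy vectors are invented for illustration; the library's own matching logic may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_semantic_match(query_vec, entries, threshold=0.92):
    """Return (response, similarity) for the closest entry at or above
    the threshold, or None when nothing is similar enough.

    `entries` is a list of (embedding, response) pairs (hypothetical layout).
    """
    best = None
    for vec, response in entries:
        sim = cosine_similarity(query_vec, vec)
        if sim >= threshold and (best is None or sim > best[1]):
            best = (response, sim)
    return best


# Toy 2-D "embeddings"; real ones have hundreds of dimensions.
entries = [([1.0, 0.0], "Python is..."), ([0.0, 1.0], "Rust is...")]
hit = best_semantic_match([0.9, 0.1], entries)
miss = best_semantic_match([0.7, 0.7], entries)
print(hit, miss)
```

Raising the threshold trades hit rate for precision: fewer prompts match, but the ones that do are closer in meaning.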

⚡ Async Deduplication

10 concurrent identical requests:

  • First hits LLM
  • Others wait for result
  • All get same response
  • You pay once, not 10x
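
The behaviour above is often called "single-flight". Below is a minimal asyncio sketch of the idea; the `SingleFlight` class and `fake_llm` function are hypothetical stand-ins and are not taken from the LLMCacheX source.

```python
import asyncio


class SingleFlight:
    """Share one in-flight call among all concurrent callers with the same key."""

    def __init__(self):
        self._inflight = {}  # key -> asyncio.Task for the call in progress

    async def run(self, key, fn):
        existing = self._inflight.get(key)
        if existing is not None:
            # Someone is already fetching this key: wait for their result.
            return await existing
        task = asyncio.ensure_future(fn())
        self._inflight[key] = task
        try:
            return await task
        finally:
            # Clear the slot so later requests start a fresh call.
            self._inflight.pop(key, None)


calls = 0


async def fake_llm():
    """Stand-in for a slow, billable LLM call."""
    global calls
    calls += 1
    await asyncio.sleep(0.01)
    return "response"


async def main():
    sf = SingleFlight()
    return await asyncio.gather(*(sf.run("prompt-hash", fake_llm) for _ in range(10)))


results = asyncio.run(main())
print(calls, len(results))
```

This sketch only deduplicates within one process and does not handle cancellation of the first caller; sharing the lock across workers would need a shared store such as Redis.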

🔄 Replay & Time Travel

Replay production requests locally with zero API usage:

# Record (production)
export LLMCACHEX_MODE=live

# Replay (local development)
export LLMCACHEX_MODE=replay
llmcachex run "same prompt"  # Returns cached response
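
One way to picture the two modes is sketched below, under the assumption that replay mode never calls the provider and fails loudly on an unrecorded request. The `fetch` function and `ReplayMiss` exception are hypothetical; only the `LLMCACHEX_MODE` variable comes from this README.

```python
import os


class ReplayMiss(LookupError):
    """Raised when replay mode has no recording for a request (hypothetical)."""


def fetch(store, key, call_llm, mode=None):
    """Return a recorded response; call the LLM only in live mode."""
    mode = mode or os.environ.get("LLMCACHEX_MODE", "live")
    if key in store:
        return store[key]  # recorded earlier: no API usage
    if mode == "replay":
        raise ReplayMiss(f"no recording for {key!r}")
    store[key] = call_llm()  # live mode: call once and record
    return store[key]


recordings = {}
fetch(recordings, "same prompt", lambda: "recorded answer", mode="live")

# In replay mode the provider callable is never invoked for a recorded key,
# so even a callable that would crash is safe here.
replayed = fetch(recordings, "same prompt", lambda: 1 / 0, mode="replay")
print(replayed)
```

For replay to work locally, the recordings have to live somewhere both environments can reach, for example a shared Redis instance or an exported snapshot.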

🏢 Multi-Tenant Org Scoping

Complete cache isolation per organization:

curl -H "X-Org-ID: acme" http://localhost:8000/api/v1/cache
curl -H "X-Org-ID: techcorp" http://localhost:8000/api/v1/cache
# Different caches, zero cross-contamination
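
A small sketch of what per-organization isolation can look like in memory, assuming each org ID gets its own namespace. `OrgScopedStore` is a made-up class for illustration, not part of the package.

```python
from collections import defaultdict


class OrgScopedStore:
    """Keep a separate key space for every organization."""

    def __init__(self):
        self._by_org = defaultdict(dict)  # org_id -> {key: cached response}

    def get(self, org_id, key):
        return self._by_org[org_id].get(key)

    def set(self, org_id, key, value):
        self._by_org[org_id][key] = value


store = OrgScopedStore()
store.set("acme", "Explain Python", "acme's cached answer")

print(store.get("acme", "Explain Python"))      # acme sees its own entry
print(store.get("techcorp", "Explain Python"))  # techcorp sees nothing
```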

📊 Cost Control

See exactly what you're paying for:

{
  "content": "Python is...",
  "cost": 0.0024,
  "cache_type": "semantic",
  "similarity": 0.94
}

Quick Start

Installation

# Basic (memory storage)
pip install llmcachex

# With gateway
pip install llmcachex[gateway]

# Full (with Redis, dev tools)
pip install llmcachex[all]

Environment Setup

export OPENAI_API_KEY="sk-..."
export LLMCACHEX_SEMANTIC=true
export LLMCACHEX_SEMANTIC_THRESHOLD=0.92

CLI Usage

# Simple request
llmcachex run "Explain quantum computing"

# With semantic tolerance
llmcachex run "What is quantum computing?" --threshold 0.90

# With org isolation
llmcachex run "Explain Python" --org acme_corp

# Disable semantic (exact only)
llmcachex run "Explain Python" --no-semantic

HTTP Gateway (SaaS Mode)

# Start server
llmcachex serve --port 8000

# Make request
curl -X POST http://localhost:8000/api/v1/cache \
  -H "Content-Type: application/json" \
  -H "X-Org-ID: acme" \
  -d '{
    "prompt": "Explain machine learning",
    "model": "gpt-4o-mini",
    "temperature": 0.3
  }'

Response:

{
  "content": "Machine learning is...",
  "cached": true,
  "cache_type": "semantic",
  "cost": 0.0,
  "similarity": 0.94,
  "org_id": "acme"
}
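
The same request can be built from Python with only the standard library. This is a sketch that assumes the gateway is running at `http://localhost:8000` as in the curl example; `build_cache_request` is a hypothetical helper, and the block only constructs the request without sending it.

```python
import json
import urllib.request


def build_cache_request(prompt, org_id, base_url="http://localhost:8000",
                        model="gpt-4o-mini", temperature=0.3):
    """Build a POST to the gateway's /api/v1/cache endpoint."""
    body = json.dumps(
        {"prompt": prompt, "model": model, "temperature": temperature}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/v1/cache",
        data=body,
        headers={"Content-Type": "application/json", "X-Org-ID": org_id},
        method="POST",
    )


gateway_req = build_cache_request("Explain machine learning", "acme")
print(gateway_req.get_method(), gateway_req.full_url)

# To actually send it (requires a running gateway):
# with urllib.request.urlopen(gateway_req) as resp:
#     result = json.load(resp)
#     print(result["cache_type"], result["cost"])
```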

Architecture

Request → Hash → Exact Match? ✓ Return
                    ↓ No
              Generate Embedding
                    ↓
              Semantic Match? ✓ Return
                    ↓ No
              Call LLM (with dedup lock)
                    ↓
              Store with Embedding
                    ↓
              Return + Cost
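
The flow above can be sketched end to end in a few lines. Everything here is a simplified stand-in: `TinyCache` is hypothetical, the bag-of-words `embed` function replaces a real embeddings API, and the dedup lock and cost tracking are omitted.

```python
import hashlib
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embeddings API: bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class TinyCache:
    def __init__(self, call_llm, threshold=0.92):
        self.call_llm = call_llm
        self.threshold = threshold
        self.exact = {}     # prompt hash -> response
        self.entries = []   # (embedding, response) pairs

    def run(self, prompt):
        # 1. Hash, then try an exact match.
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key in self.exact:
            return self.exact[key], "exact"
        # 2. Generate an embedding, then try a semantic match.
        vec = embed(prompt)
        best = max(self.entries, key=lambda e: cosine(vec, e[0]), default=None)
        if best is not None and cosine(vec, best[0]) >= self.threshold:
            return best[1], "semantic"
        # 3. Call the LLM and store the response with its embedding.
        response = self.call_llm(prompt)
        self.exact[key] = response
        self.entries.append((vec, response))
        return response, "miss"


tiny = TinyCache(lambda p: f"answer to: {p}", threshold=0.8)
outcomes = [
    tiny.run("explain python decorators")[1],
    tiny.run("explain python decorators")[1],
    tiny.run("explain python decorators please")[1],
    tiny.run("what is rust")[1],
]
print(outcomes)
```

The ordering matters for cost: the exact lookup is a hash comparison, so the embedding is only computed when that cheap check fails.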

Storage Backends

Backend     Use Case                   Latency    Cost
Memory      Development, testing       <1ms       Free
Redis       Production, shared cache   5-10ms     $$
Vector DB   Large scale (100K+)        50-100ms   $$$

Providers

  • OpenAI (default): GPT-4, GPT-3.5, embeddings API
  • Custom: Implement provider interface

Configuration

Environment Variables

# Core
OPENAI_API_KEY=sk-...              # OpenAI API key (required)
LLMCACHEX_MODE=live|replay         # live or replay mode
LLMCACHEX_ORG=default              # Default org ID

# Storage
LLMCACHEX_STORAGE=memory|redis     # Storage backend
REDIS_URL=redis://localhost:6379/0 # Redis connection

# Semantic Caching
LLMCACHEX_SEMANTIC=true|false      # Enable semantic matching
LLMCACHEX_SEMANTIC_THRESHOLD=0.92  # Similarity threshold (0.0-1.0)
LLMCACHEX_EMBEDDING_MODEL=text-embedding-3-small  # Embedding model

# Security
LLMCACHEX_API_KEY=secret           # API key for gateway
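
For applications that want to validate these variables up front, here is a sketch of a loader. The `Settings` dataclass and `load_settings` function are hypothetical, and the defaults for mode, storage, and semantic matching are assumptions; only the variable names and the example values shown above come from this README.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    mode: str
    org: str
    storage: str
    redis_url: str
    semantic: bool
    threshold: float
    embedding_model: str


def load_settings(env=None) -> Settings:
    """Read LLMCacheX-style variables from a mapping (os.environ by default)."""
    env = os.environ if env is None else env

    mode = env.get("LLMCACHEX_MODE", "live")  # default is an assumption
    if mode not in ("live", "replay"):
        raise ValueError(f"LLMCACHEX_MODE must be live or replay, got {mode!r}")

    storage = env.get("LLMCACHEX_STORAGE", "memory")  # default is an assumption
    if storage not in ("memory", "redis"):
        raise ValueError(f"LLMCACHEX_STORAGE must be memory or redis, got {storage!r}")

    threshold = float(env.get("LLMCACHEX_SEMANTIC_THRESHOLD", "0.92"))
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("LLMCACHEX_SEMANTIC_THRESHOLD must be within 0.0-1.0")

    return Settings(
        mode=mode,
        org=env.get("LLMCACHEX_ORG", "default"),
        storage=storage,
        redis_url=env.get("REDIS_URL", "redis://localhost:6379/0"),
        semantic=env.get("LLMCACHEX_SEMANTIC", "true").lower() == "true",
        threshold=threshold,
        embedding_model=env.get("LLMCACHEX_EMBEDDING_MODEL", "text-embedding-3-small"),
    )


settings = load_settings({"LLMCACHEX_STORAGE": "redis",
                          "LLMCACHEX_SEMANTIC_THRESHOLD": "0.95"})
print(settings.storage, settings.threshold, settings.mode)
```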

Python API

Simple Usage

from llmcachex import LLMCache
from llmcachex.storage.memory import MemoryStorage
from llmcachex.providers.openai_provider import OpenAIProvider
from llmcachex.models import LLMRequest
import asyncio

# Initialize
cache = LLMCache(MemoryStorage(), OpenAIProvider())

# Make request
req = LLMRequest(
    provider="openai",
    model="gpt-4o-mini",
    prompt="Explain Python",
    temperature=0.3,
    org_id="my_org"
)

result = asyncio.run(cache.run_async(req))

print(result.content)         # Response text
print(result.cache_type)      # "exact", "semantic", or "miss"
print(result.cost)            # Dollar cost
print(result.similarity)      # 0.94 for semantic hits

With Redis

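from llmcachex import LLMCache
from llmcachex.providers.openai_provider import OpenAIProvider
from llmcachex.storage.redis import RedisStorage

# Same cache API as above, backed by a shared Redis instance
cache = LLMCache(
    RedisStorage("redis://localhost:6379/0"),
    OpenAIProvider()
)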

Custom Threshold

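# Run inside an async function; `cache` and `req` are defined as in Simple Usage
result = await cache.run_async(
    req,
    similarity_threshold=0.95  # More conservative: only closer matches count as hits
)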

Examples

Example 1: Cost Savings

# Monitor real savings (reuses `cache` and `req` from Simple Usage)
async def measure(n=10):
    results = [await cache.run_async(req) for _ in range(n)]

    misses = sum(1 for r in results if r.cache_type == "miss")
    total_cost = sum(r.cost for r in results)

    print(f"Cache hits: {n - misses}")
    print(f"Total cost: ${total_cost:.6f}")

asyncio.run(measure())

Example 2: Replay Production

# Production (recording)
export LLMCACHEX_MODE=live
export REDIS_URL=redis://prod:6379/0

# Debug locally with zero cost
export LLMCACHEX_MODE=replay
llmcachex run "same prompt"

Example 3: Multi-Tenant SaaS

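from fastapi import FastAPI, Header
from pydantic import BaseModel

app = FastAPI()

class CacheRequest(BaseModel):
    prompt: str

@app.post("/api/cache")
async def handle_cache(request: CacheRequest, org_id: str = Header(alias="X-Org-ID")):
    # Scope the lookup to the caller's organization (`cache` as in Simple Usage)
    llm_req = LLMRequest(prompt=request.prompt, org_id=org_id)
    return await cache.run_async(llm_req)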

Benchmarks

Speed

Exact match:    < 1ms
Semantic match: 50-200ms
LLM call:       1000-5000ms

Win: semantic hits are 5-100x faster than an LLM call; exact hits are over 1000x faster

Cost

1000 requests, of which 30% are exact duplicates and a further 20% are semantic matches:

Exact only:  700 calls = $7.00
With semantic: 500 calls + embeddings = $5.50

Savings: 21% over exact-only caching (compounds to $550/month)
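
The arithmetic behind these figures can be reproduced as follows. The per-call price of $0.01 and the $0.50 embedding overhead are not stated directly; they are inferred from the numbers above ($7.00 for 700 calls, and $5.50 minus $5.00 for 500 calls), and `cost_breakdown` is a hypothetical helper.

```python
def cost_breakdown(requests=1000, exact_rate=0.30, semantic_rate=0.20,
                   cost_per_call=0.01, embedding_overhead=0.50):
    """Compare spend with no cache, exact-only caching, and semantic caching."""
    no_cache = requests * cost_per_call

    # Exact duplicates are served from cache; everything else reaches the LLM.
    exact_calls = round(requests * (1 - exact_rate))
    exact_only = exact_calls * cost_per_call

    # Semantic matches remove a further share, at the price of embedding lookups.
    semantic_calls = round(requests * (1 - exact_rate - semantic_rate))
    with_semantic = semantic_calls * cost_per_call + embedding_overhead

    return {
        "no_cache": round(no_cache, 2),
        "exact_only": round(exact_only, 2),
        "with_semantic": round(with_semantic, 2),
        "savings_vs_exact": round((exact_only - with_semantic) / exact_only, 3),
        "savings_vs_no_cache": round((no_cache - with_semantic) / no_cache, 3),
    }


breakdown = cost_breakdown()
for name, value in breakdown.items():
    print(f"{name}: {value}")
```

Measured against no caching at all, the same inputs give a 45% reduction, so the 21% figure is the incremental gain from adding semantic matching on top of exact matching.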

Testing

pip install llmcachex[dev]
pytest tests/ -v
pytest tests/ --cov=llmcachex
mypy llmcachex/
ruff check llmcachex/
black llmcachex/

Roadmap

  • Exact match caching
  • Semantic caching
  • Async deduplication
  • Org-scoped isolation
  • HTTP gateway
  • Redis backend
  • Vector database support
  • Advanced analytics
  • Streaming responses
  • Custom providers
  • Rate limiting
  • Usage billing

Production Deployment

Quick Start

docker run -p 8000:8000 \
  -e OPENAI_API_KEY=$OPENAI_API_KEY \
  prabhnoor12/llmcachex:latest

With Docker Compose

docker-compose up -d  # Redis + LLMCacheX

Contributing

  1. Fork the repo
  2. Create feature branch
  3. Add tests
  4. Format with black and ruff
  5. Submit PR

License

MIT License - see LICENSE


Made with ❤️ by prabhnoor12
