LLMCacheX

Intelligent caching layer for LLM calls with semantic understanding, async deduplication, and cost control

Why LLMCacheX?

LLM calls are expensive. A single GPT-4 call costs $0.03+. In production, you:

Pay for identical requests multiple times
Have no way to replay production issues locally
Can't see exactly what you're being charged for
Lose money on prompt variations ("Explain X" vs "What is X?")

LLMCacheX solves all of this.

Without LLMCacheX:  10 identical requests → $0.30 (10 LLM calls)
With LLMCacheX:     10 identical requests → $0.03 (1 LLM call, 9 cache hits)
Semantic bonus:     10 similar requests   → $0.03 (1 LLM call, 9 semantic matches)

What It Does

🎯 Exact Match Caching

Deterministic hashing prevents duplicate LLM calls. Identical prompts always hit cache.
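
The README does not spell out how the cache key is computed, so the following is only a minimal sketch of deterministic hashing, not LLMCacheX's actual scheme. The helper name `cache_key` and the choice of fields are assumptions made for illustration.

```python
import hashlib
import json


def cache_key(provider: str, model: str, prompt: str, temperature: float) -> str:
    """Return a stable SHA-256 hex digest for a request (hypothetical helper)."""
    payload = {
        "provider": provider,
        "model": model,
        "prompt": prompt,
        "temperature": temperature,
    }
    # sort_keys and fixed separators make the encoding independent of dict order
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


key_a = cache_key("openai", "gpt-4o-mini", "Explain Python", 0.3)
key_b = cache_key("openai", "gpt-4o-mini", "Explain Python", 0.3)
key_c = cache_key("openai", "gpt-4o-mini", "Explain Python!", 0.3)
print(key_a == key_b, key_a == key_c)
```

Identical inputs produce the same key, while a one-character change in the prompt produces a different one. That is why exact matching alone cannot catch rephrased prompts.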

🧠 Semantic Caching

"Explain Python" = "What is Python?" = "Tell me about Python" → Same cached response

10-30x increase in cache hit rates
Handles typos, phrasing variations, and multilingual prompts
Configurable similarity threshold (typically 0.85-0.97)
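
To make the threshold concrete, here is a small sketch of threshold-based matching using cosine similarity over embedding vectors. Cosine similarity is an assumption (the README does not name the metric), and `cosine_similarity` and `best_semantic_match` are hypothetical helpers, not part of the package.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_semantic_match(query_vec, cached, threshold=0.92):
    """cached is a list of (embedding, response) pairs.

    Returns (response, similarity) for the closest entry at or above
    the threshold, or None when nothing is similar enough.
    """
    best = None
    for vec, response in cached:
        sim = cosine_similarity(query_vec, vec)
        if sim >= threshold and (best is None or sim > best[1]):
            best = (response, sim)
    return best


# Two-dimensional toy vectors stand in for real embeddings
cached = [([1.0, 0.0], "about Python"), ([0.0, 1.0], "about Rust")]
hit = best_semantic_match([0.9, 0.1], cached)
miss = best_semantic_match([0.7, 0.7], cached)
print(hit, miss)
```

Raising the threshold trades hit rate for precision: fewer prompts match, but the ones that do are closer in meaning.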

⚡ Async Deduplication

10 concurrent identical requests:

First hits LLM
Others wait for result
All get same response
You pay once, not 10x
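
The behaviour in the list above is often called "single flight". The sketch below shows one way to get it with `asyncio`, under the assumption that requests are grouped by a key. `SingleFlight` and `fake_llm` are illustrative names, not LLMCacheX classes.

```python
import asyncio


class SingleFlight:
    """Coalesce concurrent calls that share a key into one underlying call."""

    def __init__(self):
        self._in_flight = {}  # key -> asyncio.Task

    async def run(self, key, make_call):
        task = self._in_flight.get(key)
        if task is None:
            # First caller starts the real work; later callers await the same task
            task = asyncio.create_task(make_call())
            self._in_flight[key] = task
            # Forget the task once it finishes so later requests start fresh
            task.add_done_callback(lambda _t: self._in_flight.pop(key, None))
        return await task


async def demo():
    calls = 0

    async def fake_llm():
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.01)  # stand-in for network latency
        return "response"

    flight = SingleFlight()
    results = await asyncio.gather(*(flight.run("k", fake_llm) for _ in range(10)))
    return calls, results


calls, results = asyncio.run(demo())
print(calls, len(results))
```

Because there is no `await` between looking up the task and registering it, two callers cannot both start the underlying call for the same key.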

🔄 Replay & Time Travel

Replay production requests locally with zero API usage:

# Record (production)
export LLMCACHEX_MODE=live

# Replay (local development)
export LLMCACHEX_MODE=replay
llmcachex run "same prompt"  # Returns cached response

🏢 Multi-Tenant Org Scoping

Complete cache isolation per organization:

curl -H "X-Org-ID: acme" http://localhost:8000/api/v1/cache
curl -H "X-Org-ID: techcorp" http://localhost:8000/api/v1/cache
# Different caches, zero cross-contamination
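
One simple way to get this kind of isolation is to make the org ID part of every storage key. The sketch below assumes that approach. `OrgScopedStore` is a hypothetical class, and the README does not say how LLMCacheX implements scoping internally.

```python
class OrgScopedStore:
    """In-memory store whose keys are namespaced by organization."""

    def __init__(self):
        self._data = {}

    def set(self, org_id, key, value):
        # The (org_id, key) tuple keeps each organization in its own namespace
        self._data[(org_id, key)] = value

    def get(self, org_id, key):
        return self._data.get((org_id, key))


store = OrgScopedStore()
store.set("acme", "prompt-hash", "acme's answer")
acme_value = store.get("acme", "prompt-hash")
techcorp_value = store.get("techcorp", "prompt-hash")
print(acme_value, techcorp_value)
```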

📊 Cost Control

See exactly what you're paying for:

{
  "content": "Python is...",
  "cost": 0.0024,
  "cache_type": "semantic",
  "similarity": 0.94
}

Quick Start

Installation

# Basic (memory storage)
pip install llmcachex

# With gateway (quotes keep shells such as zsh from expanding the brackets)
pip install "llmcachex[gateway]"

# Full (with Redis, dev tools)
pip install "llmcachex[all]"

Environment Setup

export OPENAI_API_KEY="sk-..."
export LLMCACHEX_SEMANTIC=true
export LLMCACHEX_SEMANTIC_THRESHOLD=0.92

CLI Usage

# Simple request
llmcachex run "Explain quantum computing"

# With semantic tolerance
llmcachex run "What is quantum computing?" --threshold 0.90

# With org isolation
llmcachex run "Explain Python" --org acme_corp

# Disable semantic (exact only)
llmcachex run "Explain Python" --no-semantic

HTTP Gateway (SaaS Mode)

# Start server
llmcachex serve --port 8000

# Make request
curl -X POST http://localhost:8000/api/v1/cache \
  -H "Content-Type: application/json" \
  -H "X-Org-ID: acme" \
  -d '{
    "prompt": "Explain machine learning",
    "model": "gpt-4o-mini",
    "temperature": 0.3
  }'

Response:

{
  "content": "Machine learning is...",
  "cached": true,
  "cache_type": "semantic",
  "cost": 0.0,
  "similarity": 0.94,
  "org_id": "acme"
}
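
The same request can be made from Python with only the standard library. This sketch mirrors the curl call above; `build_cache_request` is a hypothetical helper, and actually sending the request assumes a gateway is running locally on port 8000.

```python
import json
import urllib.request


def build_cache_request(prompt, org_id, base_url="http://localhost:8000"):
    """Build (but do not send) a POST request for the gateway's cache endpoint."""
    body = json.dumps(
        {"prompt": prompt, "model": "gpt-4o-mini", "temperature": 0.3}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/v1/cache",
        data=body,
        headers={"Content-Type": "application/json", "X-Org-ID": org_id},
        method="POST",
    )


request = build_cache_request("Explain machine learning", "acme")

# Sending requires a running gateway (llmcachex serve --port 8000):
# with urllib.request.urlopen(request) as resp:
#     print(json.load(resp))
```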

Architecture

Request → Hash → Exact Match? ✓ Return
                    ↓ No
              Generate Embedding
                    ↓
              Semantic Match? ✓ Return
                    ↓ No
              Call LLM (with dedup lock)
                    ↓
              Store with Embedding
                    ↓
              Return + Cost
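
The flow above can be sketched end to end in a few lines. This is a simplified illustration, not the package's implementation: `toy_embed` is a keyword counter standing in for a real embedding model, `TinyCache` is a hypothetical class, and the dedup lock and cost tracking are left out.

```python
import hashlib
import math


def toy_embed(text):
    """Stand-in for an embedding model: counts of a few topic words."""
    words = text.lower().replace("?", "").split()
    vocab = ["python", "rust", "quantum"]
    return [float(words.count(w)) for w in vocab]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class TinyCache:
    def __init__(self, call_llm, threshold=0.92):
        self.call_llm = call_llm
        self.threshold = threshold
        self.exact = {}    # prompt hash -> response
        self.entries = []  # (embedding, response)

    def run(self, prompt):
        # Step 1: exact match on a deterministic hash
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key in self.exact:
            return self.exact[key], "exact"
        # Step 2: semantic match (first entry at or above the threshold)
        vec = toy_embed(prompt)
        for cached_vec, response in self.entries:
            if cosine(vec, cached_vec) >= self.threshold:
                return response, "semantic"
        # Step 3: call the LLM and store the result with its embedding
        response = self.call_llm(prompt)
        self.exact[key] = response
        self.entries.append((vec, response))
        return response, "miss"


tiny = TinyCache(call_llm=lambda p: f"answer to: {p}")
first = tiny.run("Explain Python")
second = tiny.run("Explain Python")
third = tiny.run("What is Python?")
fourth = tiny.run("Explain Rust")
print(first[1], second[1], third[1], fourth[1])
```

Checking the exact-match hash first means the more expensive embedding step only runs for prompts that have not been seen verbatim.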

Storage Backends

Backend	Use Case	Latency	Cost
Memory	Development, testing	<1ms	Free
Redis	Production, shared cache	5-10ms	$$
Vector DB	Large scale (100K+)	50-100ms	$$$

Providers

OpenAI (default): GPT-4, GPT-3.5, embeddings API
Custom: Implement provider interface

Configuration

Environment Variables

# Core
OPENAI_API_KEY=sk-...              # OpenAI API key (required)
LLMCACHEX_MODE=live|replay         # live or replay mode
LLMCACHEX_ORG=default              # Default org ID

# Storage
LLMCACHEX_STORAGE=memory|redis     # Storage backend
REDIS_URL=redis://localhost:6379/0 # Redis connection

# Semantic Caching
LLMCACHEX_SEMANTIC=true|false      # Enable semantic matching
LLMCACHEX_SEMANTIC_THRESHOLD=0.92  # Similarity threshold (0.0-1.0)
LLMCACHEX_EMBEDDING_MODEL=text-embedding-3-small  # Embedding model

# Security
LLMCACHEX_API_KEY=secret           # API key for gateway
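
For applications that want to read the same variables themselves, a loader might look like the sketch below. `Settings` and `load_settings` are hypothetical, and the fallback values are assumptions taken from the example values above. The README does not state the package's real defaults.

```python
import os
from dataclasses import dataclass


@dataclass
class Settings:
    mode: str
    org: str
    storage: str
    semantic: bool
    threshold: float


def load_settings(env=None):
    """Read LLMCACHEX_* variables from a mapping (os.environ by default)."""
    env = os.environ if env is None else env
    threshold = float(env.get("LLMCACHEX_SEMANTIC_THRESHOLD", "0.92"))
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("LLMCACHEX_SEMANTIC_THRESHOLD must be in [0.0, 1.0]")
    return Settings(
        mode=env.get("LLMCACHEX_MODE", "live"),        # assumed fallback
        org=env.get("LLMCACHEX_ORG", "default"),       # assumed fallback
        storage=env.get("LLMCACHEX_STORAGE", "memory"),  # assumed fallback
        semantic=env.get("LLMCACHEX_SEMANTIC", "true").lower() == "true",
        threshold=threshold,
    )


settings = load_settings(
    {"LLMCACHEX_MODE": "replay", "LLMCACHEX_SEMANTIC_THRESHOLD": "0.95"}
)
print(settings)
```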

Python API

Simple Usage

from llmcachex import LLMCache
from llmcachex.storage.memory import MemoryStorage
from llmcachex.providers.openai_provider import OpenAIProvider
from llmcachex.models import LLMRequest
import asyncio

# Initialize
cache = LLMCache(MemoryStorage(), OpenAIProvider())

# Make request
req = LLMRequest(
    provider="openai",
    model="gpt-4o-mini",
    prompt="Explain Python",
    temperature=0.3,
    org_id="my_org"
)

result = asyncio.run(cache.run_async(req))

print(result.content)         # Response text
print(result.cache_type)      # "exact", "semantic", or "miss"
print(result.cost)            # Dollar cost
print(result.similarity)      # 0.94 for semantic hits

With Redis

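from llmcachex import LLMCache
from llmcachex.providers.openai_provider import OpenAIProvider
from llmcachex.storage.redis import RedisStorage

# Same cache API as Simple Usage, backed by Redis instead of process memory
cache = LLMCache(
    RedisStorage("redis://localhost:6379/0"),
    OpenAIProvider()
)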

Custom Threshold

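# Reuses `cache` and `req` from Simple Usage; `await` needs an async context
async def run_strict():
    return await cache.run_async(
        req,
        similarity_threshold=0.95  # More conservative
    )

result = asyncio.run(run_strict())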

Examples

Example 1: Cost Savings

# Monitor real savings (reuses `cache` and `req` from Simple Usage)
async def measure_savings(n=10):
    # Send the same request n times; only the first should reach the LLM
    results = [await cache.run_async(req) for _ in range(n)]

    misses = sum(1 for r in results if r.cache_type == "miss")
    total_cost = sum(r.cost for r in results)

    print(f"Cache hits: {n - misses}")
    print(f"Total cost: ${total_cost:.6f}")

asyncio.run(measure_savings())

Example 2: Replay Production

# Production (recording)
export LLMCACHEX_MODE=live
export REDIS_URL=redis://prod:6379/0

# Debug locally with zero cost
export LLMCACHEX_MODE=replay
llmcachex run "same prompt"

Example 3: Multi-Tenant SaaS

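from fastapi import FastAPI, Header
from pydantic import BaseModel

app = FastAPI()

class CacheRequest(BaseModel):  # assumed request schema
    prompt: str

# `cache` and `LLMRequest` come from Simple Usage; the alias reads the X-Org-ID header
@app.post("/api/cache")
async def handle_cache(request: CacheRequest, org_id: str = Header(alias="X-Org-ID")):
    llm_req = LLMRequest(prompt=request.prompt, org_id=org_id)
    return await cache.run_async(llm_req)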

Benchmarks

Speed

Exact match:    < 1ms
Semantic match: 50-200ms
LLM call:       1000-5000ms

Win: semantic hits are 5-100x faster than an LLM call; exact hits are 1000x+ faster

Cost

1000 requests with 30% exact duplicates plus a further 20% caught by semantic matching:

Exact only:    700 calls = $7.00
With semantic: 500 calls + embeddings = $5.50

Savings: 21% versus exact-only caching (compounds to $550/month)
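
The percentages can be re-derived from the figures above. The sketch below assumes the per-call price implied by "700 calls = $7.00" and treats the remainder of the $5.50 figure as embedding overhead. It does not attempt to reproduce the monthly figure, which depends on request volume the README does not state.

```python
total_requests = 1000
cost_per_call = 7.00 / 700  # $0.01, implied by "700 calls = $7.00"

no_cache = total_requests * cost_per_call        # every request hits the LLM
exact_only = 700 * cost_per_call                 # 30% served from exact cache
embedding_overhead = 5.50 - 500 * cost_per_call  # implied by the $5.50 figure
with_semantic = 500 * cost_per_call + embedding_overhead

savings_vs_exact = (exact_only - with_semantic) / exact_only
savings_vs_none = (no_cache - with_semantic) / no_cache
print(f"{savings_vs_exact:.1%} vs exact-only, {savings_vs_none:.1%} vs no cache")
```

Under these assumptions the semantic layer saves about 21% relative to exact-only caching, and about 45% relative to running with no cache at all.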

Documentation

GATEWAY.md - HTTP API reference
SEMANTIC.md - Semantic caching guide
API Docs - Interactive Swagger UI

Testing

pip install "llmcachex[dev]"
pytest tests/ -v
pytest tests/ --cov=llmcachex
mypy llmcachex/
ruff check llmcachex/
black llmcachex/

Roadmap

Exact match caching
Semantic caching
Async deduplication
Org-scoped isolation
HTTP gateway
Redis backend
Vector database support
Advanced analytics
Streaming responses
Custom providers
Rate limiting
Usage billing

Production Deployment

Quick Start

docker run -p 8000:8000 \
  -e OPENAI_API_KEY=$OPENAI_API_KEY \
  prabhnoor12/llmcachex:latest

With Docker Compose

docker-compose up -d  # Redis + LLMCacheX

Contributing

Fork the repo
Create feature branch
Add tests
Format with black and ruff
Submit PR

License

MIT License - see LICENSE

Support

Issues: GitHub Issues
Docs: SEMANTIC.md · GATEWAY.md

Made with ❤️ by prabhnoor12

Version 0.2.0, released Jan 10, 2026

Download files

Source Distribution

llmcachex-0.2.0.tar.gz (24.9 kB)

Uploaded Jan 10, 2026 Source

Built Distribution

llmcachex-0.2.0-py3-none-any.whl (24.3 kB)

Uploaded Jan 10, 2026 Python 3

File details

Details for the file llmcachex-0.2.0.tar.gz.

File metadata

Download URL: llmcachex-0.2.0.tar.gz
Upload date: Jan 10, 2026
Size: 24.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.1

File hashes

Hashes for llmcachex-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`ace61577c661f5f87f4197cb399c8651c4924a469275e25585ec04b4437b5a62`
MD5	`b1df38f8fe8ecce7cfa63ba5887599dc`
BLAKE2b-256	`3cb071b1667e7b71053c399bbc92b476b554afed8a822260f3cfc9b3370f65ec`

File details

Details for the file llmcachex-0.2.0-py3-none-any.whl.

File metadata

Download URL: llmcachex-0.2.0-py3-none-any.whl
Upload date: Jan 10, 2026
Size: 24.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.1

File hashes

Hashes for llmcachex-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`d3b710348c0554b29966343d8617580df27d840ae4c4bc4c489c8da5e5e157c8`
MD5	`c2afd8dd25c29613fd3fbb7036c55530`
BLAKE2b-256	`051e465df0d4bca6fdff9f978828a96cd20954f0a408a05afa39261724d34167`
