LLM Cache

A drop-in, model-agnostic cache for Large Language Model API calls. Cache your OpenAI, Anthropic, and other LLM API responses to save costs and improve performance.

Author: Sherin Joseph Roy
Email: sherin.joseph2217@gmail.com
GitHub: @Sherin-SEF-AI

Requires Python 3.10+. License: MIT.

Features

  • 🔐 Deterministic Hashing: SHA256-based request signature hashing
  • 💾 Multiple Backends: SQLite (default) and Redis support
  • 📊 Cost Tracking: Monitor API costs and savings
  • ⚡ Streaming Support: Cache and replay streamed responses
  • 🔧 Provider Agnostic: Works with OpenAI, Anthropic, Cohere, and more
  • 🛡️ Encryption: Optional AES-256 encryption for sensitive data
  • 🗜️ Compression: Zstandard compression to reduce storage
  • 🌐 HTTP Proxy: Transparent proxy mode for existing applications
  • 📈 Metrics: Prometheus-compatible metrics endpoint
  • ⚙️ TTL Support: Configurable time-to-live for cache entries
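
To make the deterministic hashing feature concrete, here is a minimal sketch of how a request signature could be derived. It is not the library's actual implementation: the helper name `request_signature` and the set of fields it hashes are assumptions made for illustration. The idea is to serialize the request as canonical JSON (sorted keys, fixed separators) and hash that text with SHA-256, so two logically identical requests map to the same key regardless of dict ordering.

```python
import hashlib
import json


def request_signature(provider: str, model: str, endpoint: str, request_data: dict) -> str:
    """Hypothetical helper: return a stable SHA-256 hex digest for a request."""
    payload = {
        "provider": provider,
        "model": model,
        "endpoint": endpoint,
        "request": request_data,
    }
    # sort_keys and fixed separators make the JSON text independent of dict ordering.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


a = request_signature(
    "openai", "gpt-4", "/v1/chat/completions",
    {"messages": [{"role": "user", "content": "Hello"}], "temperature": 0.7},
)
b = request_signature(
    "openai", "gpt-4", "/v1/chat/completions",
    {"temperature": 0.7, "messages": [{"content": "Hello", "role": "user"}]},
)
c = request_signature(
    "openai", "gpt-4", "/v1/chat/completions",
    {"messages": [{"role": "user", "content": "Goodbye"}], "temperature": 0.7},
)
print(a == b)  # True: same request, different key order
print(a == c)  # False: different prompt
```

Unlike Python's built-in hash(), a SHA-256 digest is the same in every process, which matters when the cache is persisted to SQLite or Redis.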

Quick Start

Installation

pip install llm-cache-pro

Basic Usage

Decorator Pattern

from openai import OpenAI
from llm_cache import cached_call

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

@cached_call(provider="openai", model="gpt-4")
def ask_llm(prompt: str):
    # Your existing OpenAI call here
    return openai_client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )

# First call hits the API
response1 = ask_llm("What is Python?")
# Second call returns cached response
response2 = ask_llm("What is Python?")  # Instant!

Context Manager

from llm_cache import wrap_openai
import openai

client = openai.OpenAI()

# Wrap your client with caching
with wrap_openai(client, ttl_days=7):
    # All calls are automatically cached
    response1 = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": "Hello"}]
    )
    
    # Same request returns cached response
    response2 = client.chat.completions.create(
        model="gpt-4", 
        messages=[{"role": "user", "content": "Hello"}]
    )

Low-level API

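from openai import OpenAI
from llm_cache import LLMCache

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
cache = LLMCache()

def fetch_from_openai(prompt):
    # Your actual API call
    return openai_client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )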

# Get or set from cache
response = cache.get_or_set(
    key="unique_request_hash",
    fetch_func=lambda: fetch_from_openai("What is AI?"),
    provider="openai",
    model="gpt-4",
    endpoint="/v1/chat/completions",
    request_data={"messages": [{"role": "user", "content": "What is AI?"}]}
)
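
The get-or-set pattern used above can be illustrated with a small in-memory stand-in. This is a sketch of the general technique only: `InMemoryCache` is a hypothetical class, it ignores TTL, compression, and persistence, and it is not how `LLMCache` is implemented. It shows that the fetch function runs on a miss and is skipped on a hit.

```python
from typing import Any, Callable, Dict, List


class InMemoryCache:
    """Illustrative stand-in, not the library's LLMCache."""

    def __init__(self) -> None:
        self._store: Dict[str, Any] = {}
        self.hits = 0
        self.misses = 0

    def get_or_set(self, key: str, fetch_func: Callable[[], Any]) -> Any:
        if key in self._store:
            self.hits += 1
            return self._store[key]
        # Cache miss: run the expensive call once and remember the result.
        self.misses += 1
        value = fetch_func()
        self._store[key] = value
        return value


calls: List[int] = []


def fake_fetch() -> str:
    calls.append(1)  # record that the "API" was actually called
    return "response text"


demo_cache = InMemoryCache()
first = demo_cache.get_or_set("k1", fake_fetch)
second = demo_cache.get_or_set("k1", fake_fetch)
print(len(calls), demo_cache.hits, demo_cache.misses)  # 1 1 1
```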

HTTP Proxy Mode

Start a proxy server that intercepts and caches LLM API calls:

llm-cache serve --host 127.0.0.1 --port 8100

Then point your applications to the proxy instead of the original API:

import openai

# Use proxy instead of direct API
client = openai.OpenAI(
    base_url="http://127.0.0.1:8100",
    api_key="your-api-key"
)

# All calls are automatically cached
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}]
)

CLI Commands

View Statistics

# Basic stats
llm-cache stats

# Detailed stats with provider breakdown
llm-cache stats --verbose

List Cache Entries

# List recent entries
llm-cache list

# Filter by provider
llm-cache list --provider openai

# Filter by model
llm-cache list --model gpt-4

# Limit results
llm-cache list --limit 10

Inspect Entries

# Show entry details
llm-cache show <cache_key>

# Export entry to file
llm-cache show <cache_key> --output entry.json

Purge Cache

# Delete specific entry
llm-cache purge --key <cache_key>

# Delete expired entries
llm-cache purge --expired

# Delete entries older than 30 days
llm-cache purge --older 30

# Delete all entries for a model
llm-cache purge --model gpt-3.5-turbo

# Delete all entries (with confirmation)
llm-cache purge --all

Export Data

# Export to JSONL format
llm-cache export cache_dump.jsonl

# Export to JSON format
llm-cache export cache_dump.json --format json

# Export only OpenAI entries
llm-cache export openai_entries.jsonl --provider openai

Health Check

# Check system health
llm-cache doctor

Configuration

Environment Variables

# Cache settings
export LLMCACHE_TTL=30                    # Default TTL in days
export LLMCACHE_COMPRESSION=true          # Enable compression
export LLMCACHE_ENCRYPTION=false          # Enable encryption
export LLMCACHE_ENCRYPTION_KEY="secret"   # Encryption key

# Storage
export LLMCACHE_BACKEND=sqlite            # Backend (sqlite, redis)
export LLMCACHE_DATABASE_URL="..."        # Database URL

# Proxy settings
export LLMCACHE_PROXY_HOST=127.0.0.1
export LLMCACHE_PROXY_PORT=8100

# Logging
export LLMCACHE_LOG_LEVEL=INFO
export LLMCACHE_LOG_FILE=/path/to/logs
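
For readers who want to reuse these variables in their own tooling, the following sketch shows one way they could be read into a settings dict. The loader name `load_settings`, the accepted truthy strings, and the fallback values (taken from the examples above) are assumptions; the library's own parsing rules and defaults may differ.

```python
import os
from typing import Mapping


def parse_bool(raw: str) -> bool:
    # Assumed convention: these strings count as true, everything else as false.
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Hypothetical loader; fallbacks mirror the example values shown above."""
    return {
        "ttl_days": int(env.get("LLMCACHE_TTL", "30")),
        "compression": parse_bool(env.get("LLMCACHE_COMPRESSION", "true")),
        "encryption": parse_bool(env.get("LLMCACHE_ENCRYPTION", "false")),
        "backend": env.get("LLMCACHE_BACKEND", "sqlite"),
        "proxy_host": env.get("LLMCACHE_PROXY_HOST", "127.0.0.1"),
        "proxy_port": int(env.get("LLMCACHE_PROXY_PORT", "8100")),
    }


settings = load_settings({"LLMCACHE_TTL": "7", "LLMCACHE_BACKEND": "redis"})
print(settings["ttl_days"], settings["backend"], settings["proxy_port"])  # 7 redis 8100
```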

Configuration File

Create ~/.config/llm-cache/config.toml:

# Cache settings
backend = "sqlite"
default_ttl_days = 30
enable_compression = true
enable_encryption = false

# Proxy settings
proxy_host = "127.0.0.1"
proxy_port = 8100

# Pricing table (cost per 1K tokens)
[pricing_table]
openai.gpt-4 = { input = 0.03, output = 0.06 }
openai."gpt-3.5-turbo" = { input = 0.0015, output = 0.002 }
anthropic.claude-3 = { input = 0.015, output = 0.075 }
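
Since the pricing table is expressed per 1K tokens, the cost of a single call follows from simple arithmetic. The sketch below hard-codes the three prices from the table above; the function name `estimate_cost` and the token counts are illustrative assumptions, not part of the library.

```python
# Prices per 1K tokens, copied from the pricing table above.
PRICING = {
    ("openai", "gpt-4"): {"input": 0.03, "output": 0.06},
    ("openai", "gpt-3.5-turbo"): {"input": 0.0015, "output": 0.002},
    ("anthropic", "claude-3"): {"input": 0.015, "output": 0.075},
}


def estimate_cost(provider: str, model: str, input_tokens: int, output_tokens: int) -> float:
    """Hypothetical helper: cost in USD for one call."""
    price = PRICING[(provider, model)]
    return (input_tokens / 1000) * price["input"] + (output_tokens / 1000) * price["output"]


# 1,500 input tokens and 500 output tokens on gpt-4:
# 1.5 * 0.03 + 0.5 * 0.06 = 0.045 + 0.030 = 0.075 USD
cost = estimate_cost("openai", "gpt-4", input_tokens=1500, output_tokens=500)
print(f"{cost:.4f}")  # 0.0750
```

Every cache hit for that request would avoid this amount, which is presumably how the cost-saved figure accumulates.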

Advanced Usage

Streaming Support

@cached_call(provider="openai", model="gpt-4")
def streaming_call(messages, stream=True):
    return openai_client.chat.completions.create(
        model="gpt-4",
        messages=messages,
        stream=stream
    )

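# First call hits the API and collects the stream
response = streaming_call([{"role": "user", "content": "Hello"}], stream=True)
for chunk in response:
    print(chunk)

# Subsequent calls with the same arguments replay the cached stream
replayed = streaming_call([{"role": "user", "content": "Hello"}], stream=True)
for chunk in replayed:
    print(chunk)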

Custom TTL

@cached_call(provider="openai", model="gpt-4", ttl_days=7)
def short_lived_cache(prompt):
    return openai_client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )
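
A TTL in days boils down to comparing an entry's creation time with the current time. The following sketch shows that check; `is_expired` is a hypothetical function, and treating an entry as expired exactly at the TTL boundary is an assumption.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_expired(created_at: datetime, ttl_days: int, now: Optional[datetime] = None) -> bool:
    """Hypothetical check: an entry expires once ttl_days have passed since creation."""
    now = now or datetime.now(timezone.utc)
    return now >= created_at + timedelta(days=ttl_days)


created = datetime(2024, 1, 1, tzinfo=timezone.utc)
after_4_days = is_expired(created, ttl_days=7, now=datetime(2024, 1, 5, tzinfo=timezone.utc))
after_8_days = is_expired(created, ttl_days=7, now=datetime(2024, 1, 9, tzinfo=timezone.utc))
print(after_4_days, after_8_days)  # False True
```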

Encryption

import os

from llm_cache import LLMCache

os.environ["LLMCACHE_ENCRYPTION_KEY"] = "your-secret-key"

cache = LLMCache(enable_encryption=True)
# All cached data will be encrypted

Redis Backend

from llm_cache import LLMCache

cache = LLMCache(
    backend="redis",
    database_url="redis://localhost:6379/0"
)

Metrics

When running in proxy mode, access metrics at /metrics:

curl http://localhost:8100/metrics

Example output:

# HELP llm_cache_entries_total Total number of cache entries
# TYPE llm_cache_entries_total counter
llm_cache_entries_total 42

# HELP llm_cache_hits_total Total number of cache hits
# TYPE llm_cache_hits_total counter
llm_cache_hits_total 156

# HELP llm_cache_cost_saved_usd Total cost saved in USD
# TYPE llm_cache_cost_saved_usd counter
llm_cache_cost_saved_usd 12.34
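
If you want to read these values from a script instead of a Prometheus server, the text format is easy to parse. The sketch below handles only unlabelled samples like the ones in the example output; `parse_metrics` is a hypothetical helper, and the sample text is copied from above rather than fetched from a running proxy.

```python
from typing import Dict


def parse_metrics(text: str) -> Dict[str, float]:
    """Parse unlabelled samples from Prometheus text exposition format."""
    metrics: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the HELP / TYPE comment lines.
        if not line or line.startswith("#"):
            continue
        name, value = line.split()[:2]
        metrics[name] = float(value)
    return metrics


SAMPLE = """\
# HELP llm_cache_entries_total Total number of cache entries
# TYPE llm_cache_entries_total counter
llm_cache_entries_total 42

# HELP llm_cache_hits_total Total number of cache hits
# TYPE llm_cache_hits_total counter
llm_cache_hits_total 156

# HELP llm_cache_cost_saved_usd Total cost saved in USD
# TYPE llm_cache_cost_saved_usd counter
llm_cache_cost_saved_usd 12.34
"""

metrics = parse_metrics(SAMPLE)
print(metrics["llm_cache_hits_total"], metrics["llm_cache_cost_saved_usd"])  # 156.0 12.34
```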

Examples

OpenAI Integration

import openai
from llm_cache import wrap_openai

client = openai.OpenAI()

with wrap_openai(client):
    # All calls are cached
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": "Explain quantum computing"}],
        temperature=0.7
    )

Anthropic Integration

import anthropic
from llm_cache import cached_call

@cached_call(provider="anthropic", model="claude-3-sonnet")
def ask_claude(prompt):
    client = anthropic.Anthropic()
    return client.messages.create(
        model="claude-3-sonnet",
        max_tokens=1000,
        messages=[{"role": "user", "content": prompt}]
    )

HTTP Client Integration

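import hashlib
import os

import httpx
from llm_cache import LLMCache

cache = LLMCache()
api_key = os.environ["OPENAI_API_KEY"]

def cached_api_call(prompt):
    def fetch():
        with httpx.Client() as client:
            response = client.post(
                "https://api.openai.com/v1/chat/completions",
                headers={"Authorization": f"Bearer {api_key}"},
                json={
                    "model": "gpt-4",
                    "messages": [{"role": "user", "content": prompt}]
                }
            )
            # Raise on HTTP errors instead of returning an error body
            response.raise_for_status()
            return response.json()

    # Built-in hash() is salted per process, so use SHA-256 for a key
    # that stays the same across runs
    prompt_digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    return cache.get_or_set(
        key=f"prompt_{prompt_digest}",
        fetch_func=fetch,
        provider="openai",
        model="gpt-4",
        endpoint="/v1/chat/completions",
        request_data={"messages": [{"role": "user", "content": prompt}]}
    )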

Performance

  • Cache Hit Rate: Typically 60-80% for repeated queries
  • Cost Savings: 40-60% reduction in API costs
  • Latency: Cache hits return in <1ms
  • Storage: ~1KB per cached response (compressed)

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests
  5. Run pytest
  6. Submit a pull request

License

MIT License - see LICENSE file for details.
