LLM Cache

A drop-in, model-agnostic cache for Large Language Model API calls. Cache your OpenAI, Anthropic, and other LLM API responses to save costs and improve performance.

Author: Sherin Joseph Roy
Email: sherin.joseph2217@gmail.com
GitHub: @Sherin-SEF-AI

Python: 3.10+
License: MIT

Features

  • 🔐 Deterministic Hashing: SHA-256-based request signature hashing
  • 💾 Multiple Backends: SQLite (default) and Redis support
  • 📊 Cost Tracking: Monitor API costs and savings
  • ⚡ Streaming Support: Cache and replay streamed responses
  • 🔧 Provider Agnostic: Works with OpenAI, Anthropic, Cohere, and more
  • 🛡️ Encryption: Optional AES-256 encryption for sensitive data
  • 🗜️ Compression: Zstandard compression to reduce storage
  • 🌐 HTTP Proxy: Transparent proxy mode for existing applications
  • 📈 Metrics: Prometheus-compatible metrics endpoint
  • ⚙️ TTL Support: Configurable time-to-live for cache entries

Quick Start

Installation

pip install llm-cache-pro

Basic Usage

Decorator Pattern

import openai
from llm_cache import cached_call

openai_client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment

@cached_call(provider="openai", model="gpt-4")
def ask_llm(prompt: str):
    # Your existing OpenAI call here
    return openai_client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )

# First call hits the API
response1 = ask_llm("What is Python?")
# Second call with the same prompt returns the cached response
response2 = ask_llm("What is Python?")
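
To make the decorator pattern concrete, here is a minimal in-memory sketch of how a caching decorator of this kind can work. It is not the library's implementation: `simple_cached_call` and `fake_llm` are hypothetical names, the store is a plain dict rather than SQLite or Redis, and there is no TTL, compression, or encryption.

```python
import functools
import hashlib
import json

def simple_cached_call(func):
    """Hypothetical in-memory stand-in for a caching decorator."""
    store = {}

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # Serialize the arguments deterministically, then hash them into a key.
        payload = json.dumps(
            {"args": args, "kwargs": kwargs}, sort_keys=True, default=str
        )
        key = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        if key not in store:
            store[key] = func(*args, **kwargs)  # cache miss: call through
        return store[key]  # cache hit: return the stored result

    return wrapper

calls = []

@simple_cached_call
def fake_llm(prompt: str) -> str:
    calls.append(prompt)  # record each real invocation
    return f"answer to: {prompt}"

first = fake_llm("What is Python?")
second = fake_llm("What is Python?")  # served from the dict, no second call
```

After both calls, `calls` holds a single entry because the second call never reaches `fake_llm`.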

Context Manager

from llm_cache import wrap_openai
import openai

client = openai.OpenAI()

# Wrap your client with caching
with wrap_openai(client, ttl_days=7):
    # All calls are automatically cached
    response1 = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": "Hello"}]
    )
    
    # Same request returns cached response
    response2 = client.chat.completions.create(
        model="gpt-4", 
        messages=[{"role": "user", "content": "Hello"}]
    )

Low-level API

import openai
from llm_cache import LLMCache

openai_client = openai.OpenAI()
cache = LLMCache()

def fetch_from_openai(prompt):
    # Your actual API call
    return openai_client.chat.completions.create(...)

# Get or set from cache
response = cache.get_or_set(
    key="unique_request_hash",
    fetch_func=lambda: fetch_from_openai("What is AI?"),
    provider="openai",
    model="gpt-4",
    endpoint="/v1/chat/completions",
    request_data={"messages": [{"role": "user", "content": "What is AI?"}]}
)
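
The key in the example above is a placeholder. One way to derive a deterministic key is to hash a canonical JSON form of the request with SHA-256. The sketch below is an assumption about how such a signature could be built, not the library's actual scheme; `request_signature` is a hypothetical helper.

```python
import hashlib
import json

def request_signature(provider, model, endpoint, request_data):
    """Return a SHA-256 hex digest over a canonical JSON form of the request."""
    canonical = json.dumps(
        {
            "provider": provider,
            "model": model,
            "endpoint": endpoint,
            "request": request_data,
        },
        sort_keys=True,          # key order must not change the digest
        separators=(",", ":"),   # no incidental whitespace
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

messages = [{"role": "user", "content": "What is AI?"}]

sig_a = request_signature(
    "openai", "gpt-4", "/v1/chat/completions",
    {"messages": messages, "temperature": 0.7},
)
# Same request with the dict keys in a different order.
sig_b = request_signature(
    "openai", "gpt-4", "/v1/chat/completions",
    {"temperature": 0.7, "messages": messages},
)
# A different prompt yields a different signature.
sig_c = request_signature(
    "openai", "gpt-4", "/v1/chat/completions",
    {"messages": [{"role": "user", "content": "What is ML?"}], "temperature": 0.7},
)
```

Sorting keys and fixing the separators means two logically identical requests map to the same cache entry regardless of how the dict was built.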

HTTP Proxy Mode

Start a proxy server that intercepts and caches LLM API calls:

llm-cache serve --host 127.0.0.1 --port 8100

Then point your applications to the proxy instead of the original API:

import openai

# Use proxy instead of direct API
client = openai.OpenAI(
    base_url="http://127.0.0.1:8100",
    api_key="your-api-key"
)

# All calls are automatically cached
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}]
)

CLI Commands

View Statistics

# Basic stats
llm-cache stats

# Detailed stats with provider breakdown
llm-cache stats --verbose

List Cache Entries

# List recent entries
llm-cache list

# Filter by provider
llm-cache list --provider openai

# Filter by model
llm-cache list --model gpt-4

# Limit results
llm-cache list --limit 10

Inspect Entries

# Show entry details
llm-cache show <cache_key>

# Export entry to file
llm-cache show <cache_key> --output entry.json

Purge Cache

# Delete specific entry
llm-cache purge --key <cache_key>

# Delete expired entries
llm-cache purge --expired

# Delete entries older than 30 days
llm-cache purge --older 30

# Delete all entries for a model
llm-cache purge --model gpt-3.5-turbo

# Delete all entries (with confirmation)
llm-cache purge --all

Export Data

# Export to JSONL format
llm-cache export cache_dump.jsonl

# Export to JSON format
llm-cache export cache_dump.json --format json

# Export only OpenAI entries
llm-cache export openai_entries.jsonl --provider openai

Health Check

# Check system health
llm-cache doctor

Configuration

Environment Variables

# Cache settings
export LLMCACHE_TTL=30                    # Default TTL in days
export LLMCACHE_COMPRESSION=true          # Enable compression
export LLMCACHE_ENCRYPTION=false          # Enable encryption
export LLMCACHE_ENCRYPTION_KEY="secret"   # Encryption key

# Storage
export LLMCACHE_BACKEND=sqlite            # Backend (sqlite, redis)
export LLMCACHE_DATABASE_URL="..."        # Database URL

# Proxy settings
export LLMCACHE_PROXY_HOST=127.0.0.1
export LLMCACHE_PROXY_PORT=8100

# Logging
export LLMCACHE_LOG_LEVEL=INFO
export LLMCACHE_LOG_FILE=/path/to/logs
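
For illustration, the variables above could be read into a typed settings object as sketched below. The `Settings` class and `load_settings` function are hypothetical, and the fallback defaults simply mirror the example values shown here; the library's real defaults and parsing rules may differ.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

@dataclass(frozen=True)
class Settings:
    ttl_days: int
    compression: bool
    encryption: bool
    backend: str
    proxy_host: str
    proxy_port: int

def _as_bool(value: str) -> bool:
    # Accept a few common spellings of "true".
    return value.strip().lower() in {"1", "true", "yes", "on"}

def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from LLMCACHE_* variables (assumed defaults)."""
    env = os.environ if env is None else env
    return Settings(
        ttl_days=int(env.get("LLMCACHE_TTL", "30")),
        compression=_as_bool(env.get("LLMCACHE_COMPRESSION", "true")),
        encryption=_as_bool(env.get("LLMCACHE_ENCRYPTION", "false")),
        backend=env.get("LLMCACHE_BACKEND", "sqlite"),
        proxy_host=env.get("LLMCACHE_PROXY_HOST", "127.0.0.1"),
        proxy_port=int(env.get("LLMCACHE_PROXY_PORT", "8100")),
    )

# Pass an explicit mapping to avoid touching the real environment.
settings = load_settings({"LLMCACHE_TTL": "7", "LLMCACHE_BACKEND": "redis"})
```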

Configuration File

Create ~/.config/llm-cache/config.toml:

# Cache settings
backend = "sqlite"
default_ttl_days = 30
enable_compression = true
enable_encryption = false

# Proxy settings
proxy_host = "127.0.0.1"
proxy_port = 8100

# Pricing table (cost per 1K tokens)
[pricing_table]
openai.gpt-4 = { input = 0.03, output = 0.06 }
# Quote keys that contain a dot, otherwise TOML splits them into nested tables
openai."gpt-3.5-turbo" = { input = 0.0015, output = 0.002 }
anthropic.claude-3 = { input = 0.015, output = 0.075 }
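
As a worked example of how a per-1K-token pricing table translates into a cost, the sketch below applies the rates above to a token count. `estimate_cost` is a hypothetical helper and the token counts are made up for illustration; how the library itself counts tokens is not covered here.

```python
# Rates copied from the pricing table above (USD per 1K tokens).
PRICING = {
    ("openai", "gpt-4"): {"input": 0.03, "output": 0.06},
    ("openai", "gpt-3.5-turbo"): {"input": 0.0015, "output": 0.002},
    ("anthropic", "claude-3"): {"input": 0.015, "output": 0.075},
}

def estimate_cost(provider: str, model: str,
                  input_tokens: int, output_tokens: int) -> float:
    """Cost in USD = input_tokens/1000 * input rate + output_tokens/1000 * output rate."""
    price = PRICING[(provider, model)]
    return (input_tokens / 1000) * price["input"] + (output_tokens / 1000) * price["output"]

# 1000 input tokens and 500 output tokens on gpt-4:
# 1.0 * 0.03 + 0.5 * 0.06, about 0.06 USD.
cost = estimate_cost("openai", "gpt-4", 1000, 500)
```

Every cache hit for a request of that size would avoid roughly that amount, which is the kind of figure a cost-saved counter accumulates.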

Advanced Usage

Streaming Support

import openai
from llm_cache import cached_call

openai_client = openai.OpenAI()

@cached_call(provider="openai", model="gpt-4")
def streaming_call(messages, stream=True):
    return openai_client.chat.completions.create(
        model="gpt-4",
        messages=messages,
        stream=stream
    )

# The first call hits the API and collects the stream into the cache;
# later calls with the same arguments replay the cached stream
response = streaming_call([{"role": "user", "content": "Hello"}], stream=True)

for chunk in response:
    print(chunk)
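
Caching a stream generally means recording chunks as they pass through and yielding them again later. The sketch below shows that idea with plain generators; `record_stream`, `replay_stream`, and `fake_stream` are hypothetical names, and string chunks stand in for real API chunk objects.

```python
from typing import Iterable, Iterator, List

def record_stream(chunks: Iterable[str], sink: List[str]) -> Iterator[str]:
    """Yield chunks unchanged while appending each one to sink."""
    for chunk in chunks:
        sink.append(chunk)
        yield chunk

def replay_stream(recorded: List[str]) -> Iterator[str]:
    """Yield previously recorded chunks in their original order."""
    yield from recorded

def fake_stream() -> Iterator[str]:
    # Stand-in for a streamed API response.
    yield from ["Hel", "lo", "!"]

recorded: List[str] = []
live = "".join(record_stream(fake_stream(), recorded))   # first call: pass-through
replayed = "".join(replay_stream(recorded))              # later call: from the cache
```

Note that the recording is only complete once the consumer has read the live stream to the end.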

Custom TTL

@cached_call(provider="openai", model="gpt-4", ttl_days=7)
def short_lived_cache(prompt):
    return openai_client.chat.completions.create(...)
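
A TTL in days comes down to comparing an entry's creation time plus the TTL against the current time. The following sketch assumes entries store a timezone-aware creation timestamp; `is_expired` is a hypothetical helper, not part of the library.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

def is_expired(created_at: datetime, ttl_days: int,
               now: Optional[datetime] = None) -> bool:
    """Return True once ttl_days have elapsed since created_at."""
    now = datetime.now(timezone.utc) if now is None else now
    return now >= created_at + timedelta(days=ttl_days)

created = datetime(2024, 1, 1, tzinfo=timezone.utc)

# Four days after creation, a 7-day entry is still valid.
fresh = is_expired(created, 7, now=datetime(2024, 1, 5, tzinfo=timezone.utc))
# Eight days after creation, it has expired.
stale = is_expired(created, 7, now=datetime(2024, 1, 9, tzinfo=timezone.utc))
```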

Encryption

import os
from llm_cache import LLMCache

os.environ["LLMCACHE_ENCRYPTION_KEY"] = "your-secret-key"

cache = LLMCache(enable_encryption=True)
# All cached data will be encrypted
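
Rather than a hand-typed secret, a random key can be generated with the standard library. This sketch assumes the library accepts an arbitrary string as the key, as the examples above suggest; that is not confirmed here.

```python
import os
import secrets

# 32 random bytes, encoded as a URL-safe string (43 characters).
key = secrets.token_urlsafe(32)

# Assumption: the cache reads the key from this variable when it is created.
os.environ["LLMCACHE_ENCRYPTION_KEY"] = key
```

Generate the key once and store it somewhere safe, such as a secrets manager. A fresh key on every run would presumably leave previously encrypted entries unreadable.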

Redis Backend

from llm_cache import LLMCache

cache = LLMCache(
    backend="redis",
    database_url="redis://localhost:6379/0"
)

Metrics

When running in proxy mode, access metrics at /metrics:

curl http://localhost:8100/metrics

Example output:

# HELP llm_cache_entries_total Total number of cache entries
# TYPE llm_cache_entries_total counter
llm_cache_entries_total 42

# HELP llm_cache_hits_total Total number of cache hits
# TYPE llm_cache_hits_total counter
llm_cache_hits_total 156

# HELP llm_cache_cost_saved_usd Total cost saved in USD
# TYPE llm_cache_cost_saved_usd counter
llm_cache_cost_saved_usd 12.34
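
The output above can be read programmatically. The sketch below is a deliberately small parser that handles only unlabeled samples without timestamps, like the ones shown; `parse_metrics` is a hypothetical helper, and a real deployment would normally let Prometheus scrape the endpoint instead.

```python
def parse_metrics(text: str) -> dict:
    """Parse 'name value' sample lines, skipping comments and blank lines."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # HELP/TYPE comments and blank lines
        name, _, raw = line.rpartition(" ")
        values[name] = float(raw)
    return values

SAMPLE = """\
# HELP llm_cache_entries_total Total number of cache entries
# TYPE llm_cache_entries_total counter
llm_cache_entries_total 42

# HELP llm_cache_hits_total Total number of cache hits
# TYPE llm_cache_hits_total counter
llm_cache_hits_total 156

# HELP llm_cache_cost_saved_usd Total cost saved in USD
# TYPE llm_cache_cost_saved_usd counter
llm_cache_cost_saved_usd 12.34
"""

metrics = parse_metrics(SAMPLE)
# Average saving per cache hit, in USD.
avg_saving_per_hit = metrics["llm_cache_cost_saved_usd"] / metrics["llm_cache_hits_total"]
```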

Examples

OpenAI Integration

import openai
from llm_cache import wrap_openai

client = openai.OpenAI()

with wrap_openai(client):
    # All calls are cached
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": "Explain quantum computing"}],
        temperature=0.7
    )

Anthropic Integration

import anthropic
from llm_cache import cached_call

@cached_call(provider="anthropic", model="claude-3-sonnet")
def ask_claude(prompt):
    client = anthropic.Anthropic()
    return client.messages.create(
        model="claude-3-sonnet",
        max_tokens=1000,
        messages=[{"role": "user", "content": prompt}]
    )

HTTP Client Integration

import hashlib
import os

import httpx
from llm_cache import LLMCache

cache = LLMCache()
api_key = os.environ["OPENAI_API_KEY"]

def cached_api_call(prompt):
    def fetch():
        with httpx.Client() as client:
            response = client.post(
                "https://api.openai.com/v1/chat/completions",
                headers={"Authorization": f"Bearer {api_key}"},
                json={
                    "model": "gpt-4",
                    "messages": [{"role": "user", "content": prompt}]
                }
            )
            response.raise_for_status()  # do not cache error responses
            return response.json()

    # The built-in hash() is salted per process, so use a stable digest for the key
    prompt_digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    return cache.get_or_set(
        key=f"prompt_{prompt_digest}",
        fetch_func=fetch,
        provider="openai",
        model="gpt-4",
        endpoint="/v1/chat/completions",
        request_data={"messages": [{"role": "user", "content": prompt}]}
    )

Performance

  • Cache Hit Rate: Typically 60-80% for repeated queries
  • Cost Savings: 40-60% reduction in API costs
  • Latency: Cache hits return in <1ms
  • Storage: ~1KB per cached response (compressed)

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests
  5. Run pytest
  6. Submit a pull request

License

MIT License - see LICENSE file for details.
