Skip to main content

Universal prompt cache optimizer for Anthropic, OpenAI & Gemini

Project description

Python 3.10+ Version 0.1.0 Tests 125 passed MIT License

CacheGuardian

Stop overpaying for LLM API calls you've already made.

A drop-in Python middleware that wraps Anthropic, OpenAI, and Gemini SDKs to
automatically optimize prompt caching and show you exactly how much money you're saving.


The Problem

Every major LLM provider offers prompt caching — cached tokens cost 10-90% less than regular input tokens. But developers silently break their cache due to non-obvious pitfalls:

  • Non-deterministic tool ordering
  • System prompt mutations between requests
  • Model switches mid-session
  • Dynamic content placed before static content
  • And dozens of other subtle mistakes

The result? You pay full price for tokens the provider already computed, and you don't even know it's happening. The only signal you get is a number buried in the API response.

The Solution

CacheGuardian wraps your existing SDK client in one line of code. It detects cache breaks locally in < 1 millisecond, automatically fixes the most common mistakes, and logs exactly how much money you're saving — or wasting — on every single call.

import cacheguardian
import anthropic

# Before: client = anthropic.Anthropic()
client = cacheguardian.wrap(anthropic.Anthropic())

# Everything else stays exactly the same.
# CacheGuardian works silently in the background.
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system="You are a helpful assistant.",
    tools=[...],
    messages=[...]
)
[cacheguardian] L1 HIT | Cache hit 94.2% | Saved $0.0340 | Session total: $1.24 saved

Installation

pip install cacheguardian

Then install it alongside the provider(s) you use:

# Anthropic Claude
pip install cacheguardian anthropic

# OpenAI GPT / o-series
pip install cacheguardian openai

# Google Gemini
pip install cacheguardian google-genai

# Optional: Redis for distributed caching (L2)
pip install cacheguardian redis

Requirements: Python 3.10+


Quick Start

Anthropic

import cacheguardian
import anthropic

client = cacheguardian.wrap(anthropic.Anthropic())

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system="You are a helpful coding assistant.",
    tools=[
        {
            "name": "search_code",
            "description": "Search the codebase",
            "input_schema": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
            },
        }
    ],
    messages=[{"role": "user", "content": "Find all TODO comments"}],
)

What CacheGuardian does automatically:

Optimization What it does
Injects cache_control Adds top-level auto-caching if you forgot it
Sorts tools Alphabetical by name — prevents ordering-based misses
Stabilizes JSON keys Sorts all dict keys recursively for consistent hashing
Intermediate breakpoints Adds cache_control markers every 15 messages when you exceed 20 blocks
Smart TTL Switches from 5-minute to 1-hour TTL when your request intervals are > 5 min

OpenAI

import cacheguardian
import openai

client = cacheguardian.wrap(openai.OpenAI())

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain prompt caching."},
    ],
)

What CacheGuardian does automatically:

Optimization What it does
Derives prompt_cache_key Generates a stable routing key so your requests hit the same physical cache hardware
Reorders content Moves system messages before user messages for better prefix overlap
Sorts tools Same as Anthropic — deterministic ordering
Smart retention Sets prompt_cache_retention="24h" when your request intervals are > 10 min
1024-token threshold Suppresses false-positive cache warnings for prompts under 1024 tokens

Google Gemini

import cacheguardian
from google import genai

client = cacheguardian.wrap(genai.Client())

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Summarize this document.",
    config={"system_instruction": "You are an expert analyst."},
)

What CacheGuardian does automatically:

Optimization What it does
Implicit → Explicit promotion Creates a CachedContent object when the cost-benefit math says it saves money
TTL optimization Calculates optimal TTL from your request frequency to minimize storage costs
Zombie cache cleanup Persists a cache registry to disk — cleans up orphaned caches even after crashes
Storage cost tracking Tracks Gemini's per-hour storage fees separately so you see real ROI

Async Clients

CacheGuardian supports async clients out of the box. All diffing and metric extraction runs via asyncio.to_thread() so the event loop is never blocked.

import cacheguardian
import anthropic

client = cacheguardian.wrap(anthropic.AsyncAnthropic())

# Use await as normal
response = await client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello"}],
)

Works with AsyncAnthropic, AsyncOpenAI, and Gemini's async methods.


How It Works

CacheGuardian uses a tiered L1 / L2 / L3 architecture inspired by CPU cache hierarchies and production inference gateways:

┌───────────────────────────────────────────────────────────────┐
│                     Your Application                          │
├───────────────────────────────────────────────────────────────┤
│                    cacheguardian.wrap()                          │
├──────────┬────────────────────────────────────┬───────────────┤
│          │                                    │               │
│    L1    │  Local Python Dict                 │    < 1 ms     │
│          │  Rolling segment fingerprints      │               │
│          │  Instant divergence detection      │               │
│          │                                    │               │
├──────────┼────────────────────────────────────┼───────────────┤
│          │                                    │               │
│    L2    │  Redis (optional)                  │   2 – 10 ms   │
│          │  Cross-worker prefix sharing       │               │
│          │  Gemini CachedContent coordination │               │
│          │                                    │               │
├──────────┼────────────────────────────────────┼───────────────┤
│          │                                    │               │
│    L3    │  Provider API                      │  100 ms – 2s  │
│          │  The actual transformer KV-cache   │               │
│          │                                    │               │
└──────────┴────────────────────────────────────┴───────────────┘

L1: The Secret Sauce

Instead of hashing your entire prompt, CacheGuardian hashes it in segments:

system prompt  →  sha256  →  "a3f1..."
tools block    →  sha256  →  "b7e2..."
message[0]     →  sha256  →  "c9d4..."
message[1]     →  sha256  →  "e1a6..."

When your next request comes in, it compares segment hashes sequentially. The moment one doesn't match, it knows exactly which segment diverged — without scanning the full content. This is how it achieves < 1ms detection even on 100k-token prompts.

What L1 Enables

  • Pre-emptive warnings — Detect a 1-character typo in a 50,000-token system prompt before the request is sent and the bill is generated
  • dry_run mode — Test your prompt structure against the cache without spending a cent
  • Actionable suggestions — Not just "cache missed" but "your system prompt changed — use a system-reminder message instead"

Dry Run Mode

Test whether your prompt would hit the cache — without making an API call.

import cacheguardian

client = cacheguardian.wrap(anthropic.Anthropic())

# First, make a real call to establish the cache
client.messages.create(model="claude-sonnet-4-20250514", ...)

# Now test a new prompt against the cache for free
result = cacheguardian.dry_run(
    client,
    model="claude-sonnet-4-20250514",
    system="You are a helpful assistant.",
    messages=[{"role": "user", "content": "New question"}],
)

print(result.would_hit_cache)       # True / False
print(result.prefix_match_depth)    # "75% — 3/4 segments match (diverged at message[1])"
print(result.estimated_savings)     # 0.034
print(result.warnings)              # [CacheBreakWarning(...)]
[cacheguardian] DRY RUN — would HIT cache | Estimated savings: $0.0340

Zero cost. Instant feedback. Iterate on your prompt structure before spending a cent.


Configuration

client = cacheguardian.wrap(
    anthropic.Anthropic(),

    # Automatically fix safe issues (tool sorting, cache_control injection)
    auto_fix=True,                  # default: True

    # TTL strategy for Anthropic
    ttl_strategy="auto",            # "auto" | "5m" | "1h" — default: "auto"

    # Strict mode: raise exceptions instead of warnings on cache breaks
    strict_mode=False,              # default: False

    # Logging verbosity
    log_level="INFO",               # "DEBUG" | "INFO" | "WARNING" | "ERROR"

    # Alert when cache hit rate drops below this threshold
    min_cache_hit_rate=0.7,         # default: 0.7

    # OpenAI: custom function to derive prompt_cache_key
    cache_key_fn=lambda session: f"user_{session.session_id}",

    # L2: Redis URL for distributed environments
    l2_backend="redis://localhost:6379",

    # Privacy: add timing jitter to prevent cache-timing side-channel attacks
    privacy_mode=False,             # default: False
    privacy_jitter_ms=(50, 200),    # jitter range in ms — default: (50, 200)
)

Pricing Overrides

CacheGuardian ships with default pricing tables for all supported models. If prices change, override them:

from cacheguardian.config import PricingConfig

client = cacheguardian.wrap(
    anthropic.Anthropic(),
    pricing_overrides={
        "anthropic": {
            "claude-sonnet-4-20250514": PricingConfig(
                base_input=3.00,        # $ per million tokens
                cache_read=0.30,        # 90% discount
                cache_write_5m=3.75,    # 25% premium (5-min TTL)
                cache_write_1h=6.00,    # 100% premium (1-hour TTL)
                output=15.00,
            ),
        },
    },
)

The Promotion Formula

For Gemini (explicit CachedContent) and Anthropic (1-hour TTL), CacheGuardian uses a break-even formula to decide when upgrading is worth the cost:

N  >  (C_write + S × T) / (C_input − C_cache_read)
Symbol Meaning
N Expected number of requests reusing this content
C_write One-time cost to write the cache
S Storage cost per hour (Gemini only; 0 for Anthropic)
T TTL in hours
C_input Standard input token cost
C_cache_read Discounted cache read cost

CacheGuardian tracks your request frequency automatically and promotes when N crosses the break-even threshold.


System Prompt Templates

Dynamic content in system prompts (dates, user names, config values) is one of the most common cache killers. CacheGuardian provides a template pattern that keeps the cache-friendly static part as the system prompt and injects dynamic values as messages instead:

from cacheguardian.core.optimizer import SystemPromptTemplate

template = SystemPromptTemplate(
    "You are a helpful assistant. The current date is {date}. User: {user_name}."
)

# Use the static template as your system prompt (never changes → always cached)
system_prompt = template.static_part

# Inject dynamic values as a system-reminder message (appended, not prefixed)
reminder = template.render_dynamic(date="2026-02-20", user_name="Alice")
# → "Updated context: date=2026-02-20, user_name=Alice"

Distributed Caching with Redis (L2)

In serverless or multi-worker environments, different workers may not know about each other's sessions. The optional L2 cache solves this:

client = cacheguardian.wrap(
    anthropic.Anthropic(),
    l2_backend="redis://localhost:6379",
)

What L2 enables:

  • Cross-worker prefix sharing — Worker B knows that Worker A already warmed the cache for a given prefix
  • Gemini cache coordination — Prevents redundant CachedContent creation fees when multiple workers process the same content
  • Rate limit coordination — Shared request counting across workers
  • Graceful degradation — If Redis is unavailable, CacheGuardian falls back to L1-only with zero errors

Gemini Safety Lock

Gemini's explicit caches incur storage fees of $4.50 per million tokens per hour. If your process crashes without cleaning up, those caches keep billing.

CacheGuardian solves this with a disk-persisted cache registry:

from cacheguardian.providers.gemini import GeminiProvider

# On startup, clean up any zombie caches from previous crashes
provider = GeminiProvider(config, gemini_client=client)
cleaned = provider.cleanup_stale_caches(max_age_hours=2.0)
print(f"Cleaned {cleaned} orphaned caches")

The registry is written to ~/.cache/cacheguardian/gemini_registry.json on every cache creation. On next startup, any caches not accessed within the threshold are automatically deleted.


Architecture

cacheguardian/
├── __init__.py                  # Public API: wrap(), dry_run(), configure()
├── types.py                     # Core data types (9 dataclasses)
├── config.py                    # Configuration + pricing tables
│
├── cache/
│   ├── fingerprint.py           # Normalize → segment → rolling SHA-256 hash
│   ├── l1.py                    # Local dict: <1ms fingerprint comparison
│   └── l2.py                    # Optional Redis: cross-worker coordination
│
├── core/
│   ├── session.py               # Session state tracking across API calls
│   ├── optimizer.py             # Transforms: sort tools, stabilize JSON, templates
│   ├── differ.py                # Segment-level diff engine with cost estimation
│   ├── metrics.py               # Cost formulas for all 3 providers
│   ├── promoter.py              # Break-even promotion logic
│   └── logger.py                # Rich colored terminal output
│
├── providers/
│   ├── base.py                  # Abstract provider interface
│   ├── anthropic.py             # cache_control, breakpoints, TTL, JSON stabilization
│   ├── openai.py                # prompt_cache_key, retention, content reordering
│   └── gemini.py                # CachedContent lifecycle, promotion, safety lock
│
├── middleware/
│   ├── interceptor.py           # Sync wrapper: L1 → transform → L3 → metrics
│   └── async_interceptor.py     # Async wrapper: non-blocking via asyncio.to_thread()
│
└── persistence/
    └── cache_registry.py        # Disk-persisted Gemini cache safety lock

Design Principles

  1. Exact prefix matching only. No semantic or embedding-based caching. CacheGuardian guarantees 100% accuracy — it will never serve a "similar" cached result for a different question.

  2. Never modify the response. CacheGuardian transforms the request (sorting tools, injecting cache_control) and logs after the response. The response object you receive is identical to what the raw SDK would return.

  3. Composition over inheritance. SDK clients are wrapped, not subclassed. This makes CacheGuardian resilient to SDK version changes.

  4. Optional everything. Install only the provider SDKs you use. Redis is optional. Privacy mode is optional. Every feature degrades gracefully when its dependency is absent.


Contributing

Contributions are welcome! Here's how to set up the development environment:

git clone https://github.com/kclaka/cacheguardian.git
cd cacheguardian

python3 -m venv .venv
source .venv/bin/activate

pip install -e ".[dev]"

# Run the test suite
pytest -v

125 tests cover all modules: fingerprinting, L1 cache, transforms, session tracking, cost calculations, promotion logic, and all three providers.


License

MIT


CacheGuardian — because the best API call is the one you don't pay full price for.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

cacheguardian-0.1.0.tar.gz (47.4 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

cacheguardian-0.1.0-py3-none-any.whl (42.0 kB view details)

Uploaded Python 3

File details

Details for the file cacheguardian-0.1.0.tar.gz.

File metadata

  • Download URL: cacheguardian-0.1.0.tar.gz
  • Upload date:
  • Size: 47.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.0

File hashes

Hashes for cacheguardian-0.1.0.tar.gz
Algorithm Hash digest
SHA256 b089ac48686f057a257f0b9c5f0a245720e454170ff699c6a57c93e3cb5d4479
MD5 8663ac078828914855210ba0e7d67431
BLAKE2b-256 c09f9cf7188168fdd829f21a0afda3cbe0ef12846c95bb75bae42450db3e96a8

See more details on using hashes here.

File details

Details for the file cacheguardian-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: cacheguardian-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 42.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.0

File hashes

Hashes for cacheguardian-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 b0f5429788be89a4e9beed1da62a9e14b19686503d92b366def9a0be11c0c10b
MD5 06d5815b38c2bd0d7a2c57de49cec701
BLAKE2b-256 ef82f47386b8ed7b5305c54f716d64f7e2f361d3bc5d73f4294ec771cf79e93c

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page