Universal prompt cache optimizer for Anthropic, OpenAI & Gemini

Python 3.10+ | Version 0.1.0 | Tests: 125 passed | MIT License

CacheGuardian

Stop overpaying for LLM API calls you've already made.

A drop-in Python middleware that wraps Anthropic, OpenAI, and Gemini SDKs to
automatically optimize prompt caching and show you exactly how much money you're saving.

The Problem

Every major LLM provider offers prompt caching — cached tokens cost 10-90% less than regular input tokens. But developers silently break their cache due to non-obvious pitfalls:

Non-deterministic tool ordering
System prompt mutations between requests
Model switches mid-session
Dynamic content placed before static content
And dozens of other subtle mistakes

The result? You pay full price for tokens the provider already computed, and you don't even know it's happening. The only signal you get is a number buried in the API response.

The Solution

CacheGuardian wraps your existing SDK client in one line of code. It detects cache breaks locally in < 1 millisecond, automatically fixes the most common mistakes, and logs exactly how much money you're saving — or wasting — on every single call.

import cacheguardian
import anthropic

# Before: client = anthropic.Anthropic()
client = cacheguardian.wrap(anthropic.Anthropic())

# Everything else stays exactly the same.
# CacheGuardian works silently in the background.
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system="You are a helpful assistant.",
    tools=[...],
    messages=[...]
)

[cacheguardian] L1 HIT | Cache hit 94.2% | Saved $0.0340 | Session total: $1.24 saved

Installation

pip install cacheguardian

Then install it alongside the provider(s) you use:

# Anthropic Claude
pip install cacheguardian anthropic

# OpenAI GPT / o-series
pip install cacheguardian openai

# Google Gemini
pip install cacheguardian google-genai

# Optional: Redis for distributed caching (L2)
pip install cacheguardian redis

Requirements: Python 3.10+

Quick Start

Anthropic

import cacheguardian
import anthropic

client = cacheguardian.wrap(anthropic.Anthropic())

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system="You are a helpful coding assistant.",
    tools=[
        {
            "name": "search_code",
            "description": "Search the codebase",
            "input_schema": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
            },
        }
    ],
    messages=[{"role": "user", "content": "Find all TODO comments"}],
)

What CacheGuardian does automatically:

Optimization	What it does
Injects `cache_control`	Adds top-level auto-caching if you forgot it
Sorts tools	Alphabetical by name — prevents ordering-based misses
Stabilizes JSON keys	Sorts all dict keys recursively for consistent hashing
Intermediate breakpoints	Adds `cache_control` markers every 15 messages when you exceed 20 blocks
Smart TTL	Switches from 5-minute to 1-hour TTL when your request intervals are > 5 min
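To make the tool-sorting and key-stabilizing rows concrete, here is a minimal standalone sketch of the idea. It is not CacheGuardian's internal code; the helper names `stabilize`, `normalize_tools`, and `fingerprint` are hypothetical and only illustrate why two requests with the same tools in a different order serialize differently until they are normalized.

```python
import hashlib
import json
from typing import Any


def stabilize(value: Any) -> Any:
    """Recursively rebuild dicts with sorted keys; list order is preserved."""
    if isinstance(value, dict):
        return {key: stabilize(value[key]) for key in sorted(value)}
    if isinstance(value, list):
        return [stabilize(item) for item in value]
    return value


def normalize_tools(tools: list[dict]) -> list[dict]:
    """Sort tools alphabetically by name, then stabilize each definition."""
    return [stabilize(tool) for tool in sorted(tools, key=lambda t: t["name"])]


def fingerprint(tools: list[dict]) -> str:
    """Hash the tool list exactly as serialized (no key sorting here)."""
    return hashlib.sha256(json.dumps(tools).encode("utf-8")).hexdigest()


search = {"name": "search_code", "description": "Search the codebase"}
read = {"description": "Read a file", "name": "read_file"}

request_a = [search, read]
request_b = [read, search]  # same tools, different order

print(fingerprint(request_a) == fingerprint(request_b))  # False
print(fingerprint(normalize_tools(request_a)) == fingerprint(normalize_tools(request_b)))  # True
```

Because provider caches match on an exact prefix, any byte-level difference in the serialized tools block is enough to miss, which is why normalizing before sending matters.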

OpenAI

import cacheguardian
import openai

client = cacheguardian.wrap(openai.OpenAI())

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain prompt caching."},
    ],
)

What CacheGuardian does automatically:

Optimization	What it does
Derives `prompt_cache_key`	Generates a stable routing key so your requests hit the same physical cache hardware
Reorders content	Moves system messages before user messages for better prefix overlap
Sorts tools	Same as Anthropic — deterministic ordering
Smart retention	Sets `prompt_cache_retention="24h"` when your request intervals are > 10 min
1024-token threshold	Suppresses false-positive cache warnings for prompts under 1024 tokens
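As a rough illustration of the first two rows, the sketch below moves system messages to the front and derives a key from the shared prefix. The function names and the hashing scheme are assumptions made for this example; the README does not document how CacheGuardian actually derives `prompt_cache_key`.

```python
import hashlib
import json


def system_first(messages: list[dict]) -> list[dict]:
    """Move system messages to the front; sorted() is stable, so relative order is kept."""
    return sorted(messages, key=lambda m: m["role"] != "system")


def derive_cache_key(messages: list[dict], prefix_len: int = 1) -> str:
    """Hash the first `prefix_len` messages into a short, stable key (hypothetical scheme)."""
    prefix = json.dumps(messages[:prefix_len], sort_keys=True)
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()[:16]


conversation_a = [
    {"role": "user", "content": "Explain prompt caching."},
    {"role": "system", "content": "You are a helpful assistant."},
]
conversation_b = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is a KV-cache?"},
]

ordered_a = system_first(conversation_a)
ordered_b = system_first(conversation_b)

print(ordered_a[0]["role"])  # system
print(derive_cache_key(ordered_a) == derive_cache_key(ordered_b))  # True
```

Two conversations that share a system prompt end up with the same key, so they can be routed toward the same cached prefix even though the user questions differ.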

Google Gemini

import cacheguardian
from google import genai

client = cacheguardian.wrap(genai.Client())

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Summarize this document.",
    config={"system_instruction": "You are an expert analyst."},
)

What CacheGuardian does automatically:

Optimization	What it does
Implicit → Explicit promotion	Creates a `CachedContent` object when the cost-benefit math says it saves money
TTL optimization	Calculates optimal TTL from your request frequency to minimize storage costs
Zombie cache cleanup	Persists a cache registry to disk — cleans up orphaned caches even after crashes
Storage cost tracking	Tracks Gemini's per-hour storage fees separately so you see real ROI

Async Clients

CacheGuardian supports async clients out of the box. All diffing and metric extraction run via asyncio.to_thread(), so the event loop is never blocked.

import asyncio

import cacheguardian
import anthropic

client = cacheguardian.wrap(anthropic.AsyncAnthropic())


async def main():
    # Use await as normal, inside a coroutine
    response = await client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=[{"role": "user", "content": "Hello"}],
    )
    return response


asyncio.run(main())

Works with AsyncAnthropic, AsyncOpenAI, and Gemini's async methods.
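The offloading pattern mentioned above can be sketched with the standard library alone. This is only an illustration of `asyncio.to_thread()`; the `fingerprint` and `check_prompt` helpers are hypothetical and are not part of CacheGuardian's API.

```python
import asyncio
import hashlib


def fingerprint(text: str) -> str:
    """CPU-bound work that we do not want to run on the event loop."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


async def check_prompt(previous: str, current: str) -> bool:
    """Return True when the prompt is unchanged, hashing in worker threads."""
    old, new = await asyncio.gather(
        asyncio.to_thread(fingerprint, previous),
        asyncio.to_thread(fingerprint, current),
    )
    return old == new


unchanged = asyncio.run(check_prompt("You are helpful.", "You are helpful."))
changed = asyncio.run(check_prompt("You are helpful.", "You are helpful!"))
print(unchanged, changed)  # True False
```

While the hashes are computed in worker threads, other coroutines on the same loop remain free to make progress.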

How It Works

CacheGuardian uses a tiered L1 / L2 / L3 architecture inspired by CPU cache hierarchies and production inference gateways:

┌───────────────────────────────────────────────────────────────┐
│                     Your Application                          │
├───────────────────────────────────────────────────────────────┤
│                     cacheguardian.wrap()                      │
├──────────┬────────────────────────────────────┬───────────────┤
│          │                                    │               │
│    L1    │  Local Python Dict                 │    < 1 ms     │
│          │  Rolling segment fingerprints      │               │
│          │  Instant divergence detection      │               │
│          │                                    │               │
├──────────┼────────────────────────────────────┼───────────────┤
│          │                                    │               │
│    L2    │  Redis (optional)                  │   2 – 10 ms   │
│          │  Cross-worker prefix sharing       │               │
│          │  Gemini CachedContent coordination │               │
│          │                                    │               │
├──────────┼────────────────────────────────────┼───────────────┤
│          │                                    │               │
│    L3    │  Provider API                      │  100 ms – 2s  │
│          │  The actual transformer KV-cache   │               │
│          │                                    │               │
└──────────┴────────────────────────────────────┴───────────────┘

L1: The Secret Sauce

Instead of hashing your entire prompt, CacheGuardian hashes it in segments:

system prompt  →  sha256  →  "a3f1..."
tools block    →  sha256  →  "b7e2..."
message[0]     →  sha256  →  "c9d4..."
message[1]     →  sha256  →  "e1a6..."

When your next request comes in, it compares segment hashes sequentially. The first mismatch pinpoints exactly which segment diverged, with no character-by-character diff of the full content. This is how it achieves < 1 ms detection even on 100k-token prompts.
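A minimal sketch of segment-level comparison, assuming each segment is serialized with `json.dumps` and hashed with SHA-256. The names `segment_hashes` and `first_divergence` are made up for this example and may not match what `cache/fingerprint.py` or `core/differ.py` actually expose.

```python
import hashlib
import json


def _sha256(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def segment_hashes(system: str, tools: list[dict], messages: list[dict]) -> list[tuple[str, str]]:
    """Return (label, digest) pairs in prefix order: system, tools, then each message."""
    segments = [
        ("system prompt", system),
        ("tools block", json.dumps(tools, sort_keys=True)),
    ]
    for index, message in enumerate(messages):
        segments.append((f"message[{index}]", json.dumps(message, sort_keys=True)))
    return [(label, _sha256(text)) for label, text in segments]


def first_divergence(previous, current):
    """Return the label of the first differing segment, or None if all shared segments match."""
    for (label, old_digest), (_, new_digest) in zip(previous, current):
        if old_digest != new_digest:
            return label
    return None


system = "You are a helpful assistant."
turn_1 = [{"role": "user", "content": "Find all TODO comments"}]
turn_2 = turn_1 + [
    {"role": "assistant", "content": "Found 3."},
    {"role": "user", "content": "List them"},
]

cached = segment_hashes(system, [], turn_1)
print(first_divergence(cached, segment_hashes(system, [], turn_2)))        # None
print(first_divergence(cached, segment_hashes(system + " ", [], turn_2)))  # system prompt
```

Appending messages leaves every earlier segment untouched, so the comparison reports no divergence. A single extra space in the system prompt is caught at the very first segment.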

What L1 Enables

Pre-emptive warnings — Detect a 1-character typo in a 50,000-token system prompt before the request is sent and the bill is generated
dry_run mode — Test your prompt structure against the cache without spending a cent
Actionable suggestions — Not just "cache missed" but "your system prompt changed — use a system-reminder message instead"

Dry Run Mode

Test whether your prompt would hit the cache — without making an API call.

import cacheguardian
import anthropic

client = cacheguardian.wrap(anthropic.Anthropic())

# First, make a real call to establish the cache
client.messages.create(model="claude-sonnet-4-20250514", ...)

# Now test a new prompt against the cache for free
result = cacheguardian.dry_run(
    client,
    model="claude-sonnet-4-20250514",
    system="You are a helpful assistant.",
    messages=[{"role": "user", "content": "New question"}],
)

print(result.would_hit_cache)       # True / False
print(result.prefix_match_depth)    # "75% — 3/4 segments match (diverged at message[1])"
print(result.estimated_savings)     # 0.034
print(result.warnings)              # [CacheBreakWarning(...)]

[cacheguardian] DRY RUN — would HIT cache | Estimated savings: $0.0340

Zero cost. Instant feedback. Iterate on your prompt structure before spending a cent.

Configuration

client = cacheguardian.wrap(
    anthropic.Anthropic(),

    # Automatically fix safe issues (tool sorting, cache_control injection)
    auto_fix=True,                  # default: True

    # TTL strategy for Anthropic
    ttl_strategy="auto",            # "auto" | "5m" | "1h" — default: "auto"

    # Strict mode: raise exceptions instead of warnings on cache breaks
    strict_mode=False,              # default: False

    # Logging verbosity
    log_level="INFO",               # "DEBUG" | "INFO" | "WARNING" | "ERROR"

    # Alert when cache hit rate drops below this threshold
    min_cache_hit_rate=0.7,         # default: 0.7

    # OpenAI: custom function to derive prompt_cache_key
    cache_key_fn=lambda session: f"user_{session.session_id}",

    # L2: Redis URL for distributed environments
    l2_backend="redis://localhost:6379",

    # Privacy: add timing jitter to prevent cache-timing side-channel attacks
    privacy_mode=False,             # default: False
    privacy_jitter_ms=(50, 200),    # jitter range in ms — default: (50, 200)
)

Pricing Overrides

CacheGuardian ships with default pricing tables for all supported models. If prices change, override them:

import cacheguardian
import anthropic
from cacheguardian.config import PricingConfig

client = cacheguardian.wrap(
    anthropic.Anthropic(),
    pricing_overrides={
        "anthropic": {
            "claude-sonnet-4-20250514": PricingConfig(
                base_input=3.00,        # $ per million tokens
                cache_read=0.30,        # 90% discount
                cache_write_5m=3.75,    # 25% premium (5-min TTL)
                cache_write_1h=6.00,    # 100% premium (1-hour TTL)
                output=15.00,
            ),
        },
    },
)

The Promotion Formula

For Gemini (explicit CachedContent) and Anthropic (1-hour TTL), CacheGuardian uses a break-even formula to decide when upgrading is worth the cost:

N  >  (C_write + S × T) / (C_input − C_cache_read)

Symbol	Meaning
N	Expected number of requests reusing this content
C_write	One-time cost to write the cache
S	Storage cost per hour (Gemini only; 0 for Anthropic)
T	TTL in hours
C_input	Standard input token cost
C_cache_read	Discounted cache read cost

CacheGuardian tracks your request frequency automatically and promotes when N crosses the break-even threshold.
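As a worked example of the inequality, the sketch below plugs in the Claude Sonnet prices from the Pricing Overrides section for the 1-hour TTL. The 10,000-token prefix size and the helper names are assumptions chosen for illustration only.

```python
import math


def break_even_requests(c_write, storage_per_hour, ttl_hours, c_input, c_cache_read):
    """Return the right-hand side of the promotion inequality."""
    saving_per_request = c_input - c_cache_read
    if saving_per_request <= 0:
        raise ValueError("cache reads must be cheaper than standard input")
    return (c_write + storage_per_hour * ttl_hours) / saving_per_request


PREFIX_TOKENS = 10_000  # assumed size of the shared prefix


def cost(price_per_million: float) -> float:
    """Dollar cost of the prefix at a given $-per-million-tokens price."""
    return price_per_million * PREFIX_TOKENS / 1_000_000


threshold = break_even_requests(
    c_write=cost(6.00),       # cache_write_1h
    storage_per_hour=0.0,     # S is 0 for Anthropic
    ttl_hours=1.0,
    c_input=cost(3.00),       # base_input
    c_cache_read=cost(0.30),  # cache_read
)
min_requests = math.floor(threshold) + 1  # smallest integer N with N > threshold

print(round(threshold, 2), min_requests)  # 2.22 3
```

With these numbers the write costs $0.06 and each cached read saves $0.027, so the formula is satisfied from the third request onward. Note that the prefix size cancels out whenever S is 0, so the threshold depends only on the price ratios.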

System Prompt Templates

Dynamic content in system prompts (dates, user names, config values) is one of the most common cache killers. CacheGuardian provides a template pattern that keeps the cache-friendly static part as the system prompt and injects dynamic values as messages instead:

from cacheguardian.core.optimizer import SystemPromptTemplate

template = SystemPromptTemplate(
    "You are a helpful assistant. The current date is {date}. User: {user_name}."
)

# Use the static template as your system prompt (never changes → always cached)
system_prompt = template.static_part

# Inject dynamic values as a system-reminder message (appended, not prefixed)
reminder = template.render_dynamic(date="2026-02-20", user_name="Alice")
# → "Updated context: date=2026-02-20, user_name=Alice"
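The effect on the request payload can be sketched without the library. Here the reminder is prepended to the latest user turn, which is one possible placement and an assumption on my part; `render_reminder` and `build_request` are hypothetical helpers that only mimic the reminder format shown above.

```python
def render_reminder(**dynamic) -> str:
    """Format dynamic values like the reminder string shown above."""
    return "Updated context: " + ", ".join(f"{k}={v}" for k, v in dynamic.items())


def build_request(static_system: str, question: str, **dynamic) -> dict:
    """Keep the system prompt fixed and carry the dynamic values in the user turn."""
    content = f"{render_reminder(**dynamic)}\n\n{question}"
    return {"system": static_system, "messages": [{"role": "user", "content": content}]}


STATIC_SYSTEM = "You are a helpful assistant."

day_1 = build_request(STATIC_SYSTEM, "What's on my calendar?", date="2026-02-20", user_name="Alice")
day_2 = build_request(STATIC_SYSTEM, "What's on my calendar?", date="2026-02-21", user_name="Alice")

print(day_1["system"] == day_2["system"])  # True
print(render_reminder(date="2026-02-20", user_name="Alice"))
# Updated context: date=2026-02-20, user_name=Alice
```

The system prompt is byte-identical across both days, so only the trailing message changes and the cached prefix stays valid.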

Distributed Caching with Redis (L2)

In serverless or multi-worker environments, different workers may not know about each other's sessions. The optional L2 cache solves this:

client = cacheguardian.wrap(
    anthropic.Anthropic(),
    l2_backend="redis://localhost:6379",
)

What L2 enables:

Cross-worker prefix sharing — Worker B knows that Worker A already warmed the cache for a given prefix
Gemini cache coordination — Prevents redundant CachedContent creation fees when multiple workers process the same content
Rate limit coordination — Shared request counting across workers
Graceful degradation — If Redis is unavailable, CacheGuardian falls back to L1-only with zero errors
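The graceful-degradation behaviour can be pictured with a small sketch. Both classes below are hypothetical stand-ins written for this example; a real deployment would use a `redis` client, which raises its own exception classes rather than the built-in `ConnectionError` caught here.

```python
class UnreachableBackend:
    """Stand-in for a shared cache client whose server is down (hypothetical stub)."""

    def get(self, key):
        raise ConnectionError("L2 backend unavailable")


class TieredLookup:
    """Check the local dict first, then an optional shared backend."""

    def __init__(self, l2=None):
        self.l1 = {}
        self.l2 = l2

    def get(self, key):
        if key in self.l1:
            return self.l1[key]
        if self.l2 is None:
            return None
        try:
            value = self.l2.get(key)
        except ConnectionError:
            self.l2 = None  # degrade to L1-only for the rest of the process
            return None
        if value is not None:
            self.l1[key] = value  # remember the shared entry locally
        return value


lookup = TieredLookup(l2=UnreachableBackend())
lookup.l1["prefix:a3f1"] = "warm"

print(lookup.get("prefix:a3f1"))  # warm
print(lookup.get("prefix:b7e2"))  # None
print(lookup.l2)                  # None
```

A failed L2 lookup is treated as a plain miss, so the caller never sees an exception and later lookups skip the unreachable backend entirely.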

Gemini Safety Lock

Gemini's explicit caches incur storage fees of as much as $4.50 per million tokens per hour, depending on the model. If your process crashes without cleaning up, those caches keep billing.

CacheGuardian solves this with a disk-persisted cache registry:

from cacheguardian.providers.gemini import GeminiProvider

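# `config` and `client` are assumed to be defined earlier
# (your CacheGuardian configuration and your genai.Client()).
# On startup, clean up any zombie caches from previous crashes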
provider = GeminiProvider(config, gemini_client=client)
cleaned = provider.cleanup_stale_caches(max_age_hours=2.0)
print(f"Cleaned {cleaned} orphaned caches")

The registry is written to ~/.cache/cacheguardian/gemini_registry.json on every cache creation. On next startup, any caches not accessed within the threshold are automatically deleted.

Architecture

cacheguardian/
├── __init__.py                  # Public API: wrap(), dry_run(), configure()
├── types.py                     # Core data types (9 dataclasses)
├── config.py                    # Configuration + pricing tables
│
├── cache/
│   ├── fingerprint.py           # Normalize → segment → rolling SHA-256 hash
│   ├── l1.py                    # Local dict: <1ms fingerprint comparison
│   └── l2.py                    # Optional Redis: cross-worker coordination
│
├── core/
│   ├── session.py               # Session state tracking across API calls
│   ├── optimizer.py             # Transforms: sort tools, stabilize JSON, templates
│   ├── differ.py                # Segment-level diff engine with cost estimation
│   ├── metrics.py               # Cost formulas for all 3 providers
│   ├── promoter.py              # Break-even promotion logic
│   └── logger.py                # Rich colored terminal output
│
├── providers/
│   ├── base.py                  # Abstract provider interface
│   ├── anthropic.py             # cache_control, breakpoints, TTL, JSON stabilization
│   ├── openai.py                # prompt_cache_key, retention, content reordering
│   └── gemini.py                # CachedContent lifecycle, promotion, safety lock
│
├── middleware/
│   ├── interceptor.py           # Sync wrapper: L1 → transform → L3 → metrics
│   └── async_interceptor.py     # Async wrapper: non-blocking via asyncio.to_thread()
│
└── persistence/
    └── cache_registry.py        # Disk-persisted Gemini cache safety lock

Design Principles

Exact prefix matching only. No semantic or embedding-based caching. CacheGuardian guarantees 100% accuracy — it will never serve a "similar" cached result for a different question.
Never modify the response. CacheGuardian transforms the request (sorting tools, injecting cache_control) and logs after the response. The response object you receive is identical to what the raw SDK would return.
Composition over inheritance. SDK clients are wrapped, not subclassed. This makes CacheGuardian resilient to SDK version changes.
Optional everything. Install only the provider SDKs you use. Redis is optional. Privacy mode is optional. Every feature degrades gracefully when its dependency is absent.
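The "composition over inheritance" and "never modify the response" principles can be illustrated with a small proxy. `FakeClient`, `Proxy`, and `sort_tools` are hypothetical names for this sketch and do not reflect how `middleware/interceptor.py` is actually written.

```python
class FakeMessages:
    """Stand-in for an SDK resource such as client.messages."""

    def create(self, **kwargs):
        return {"echo": kwargs}


class FakeClient:
    """Stand-in for a provider SDK client."""

    def __init__(self):
        self.messages = FakeMessages()


class Proxy:
    """Wrap any object; run a hook before each method call, pass results through."""

    def __init__(self, target, before_call, path=""):
        self._target = target
        self._before_call = before_call
        self._path = path

    def __getattr__(self, name):
        attr = getattr(self._target, name)
        path = f"{self._path}.{name}".lstrip(".")
        if callable(attr):
            def call(*args, **kwargs):
                self._before_call(path, kwargs)  # may adjust the request in place
                return attr(*args, **kwargs)     # response is returned untouched
            return call
        if hasattr(attr, "__dict__"):
            return Proxy(attr, self._before_call, path)  # wrap nested resources
        return attr


def sort_tools(path, kwargs):
    """Request-side transform: sort tools by name when present."""
    if "tools" in kwargs:
        kwargs["tools"] = sorted(kwargs["tools"], key=lambda t: t["name"])


client = Proxy(FakeClient(), sort_tools)
response = client.messages.create(model="m", tools=[{"name": "b"}, {"name": "a"}])
print(response["echo"]["tools"])  # [{'name': 'a'}, {'name': 'b'}]
```

Because the proxy forwards attribute access dynamically, it does not need to know the SDK's class hierarchy, which is the property that makes wrapping more tolerant of SDK upgrades than subclassing.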

Contributing

Contributions are welcome! Here's how to set up the development environment:

git clone https://github.com/kclaka/cacheguardian.git
cd cacheguardian

python3 -m venv .venv
source .venv/bin/activate

pip install -e ".[dev]"

# Run the test suite
pytest -v

125 tests cover all modules: fingerprinting, L1 cache, transforms, session tracking, cost calculations, promotion logic, and all three providers.

License

MIT

CacheGuardian — because the best API call is the one you don't pay full price for.

Release history

0.3.1 (Feb 22, 2026)
0.3.0 (Feb 22, 2026)
0.2.0 (Feb 22, 2026)
0.1.0 (Feb 20, 2026), this version

Download files

Source Distribution

cacheguardian-0.1.0.tar.gz (47.4 kB)

Uploaded Feb 20, 2026 Source

Built Distribution

cacheguardian-0.1.0-py3-none-any.whl (42.0 kB)

Uploaded Feb 20, 2026 Python 3

File details

Details for the file cacheguardian-0.1.0.tar.gz.

File metadata

Download URL: cacheguardian-0.1.0.tar.gz
Upload date: Feb 20, 2026
Size: 47.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.0

File hashes

Hashes for cacheguardian-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`b089ac48686f057a257f0b9c5f0a245720e454170ff699c6a57c93e3cb5d4479`
MD5	`8663ac078828914855210ba0e7d67431`
BLAKE2b-256	`c09f9cf7188168fdd829f21a0afda3cbe0ef12846c95bb75bae42450db3e96a8`

File details

Details for the file cacheguardian-0.1.0-py3-none-any.whl.

File metadata

Download URL: cacheguardian-0.1.0-py3-none-any.whl
Upload date: Feb 20, 2026
Size: 42.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.0

File hashes

Hashes for cacheguardian-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b0f5429788be89a4e9beed1da62a9e14b19686503d92b366def9a0be11c0c10b`
MD5	`06d5815b38c2bd0d7a2c57de49cec701`
BLAKE2b-256	`ef82f47386b8ed7b5305c54f716d64f7e2f361d3bc5d73f4294ec771cf79e93c`
