Intelligent Prompt Enhancement & Token Caching Proxy

Version 1.4.0 | Python 3.11+ | 117 Tests Passing | CI | MIT License

LayerCache

A self-hosted, provider-agnostic LLM proxy that cuts costs by 30-60% and latency by 40%+ through aggressive token caching and cache-safe prompt engineering.


Overview

LayerCache sits between your application and LLM providers (Anthropic, OpenAI, Google Gemini). It is a drop-in replacement for your LLM provider's base URL — just point your OpenAI SDK at LayerCache.

In the background, LayerCache:

  1. Canonicalizes your prompts for byte-for-byte deterministic output (maximizing prefix cache hits)
  2. Injects provider-specific cache markers at stable layer boundaries
  3. Truncates long conversations to fit within a token budget (keeping recent turns, dropping old ones)
  4. Warns when your prefix is too short for provider caching to work
  5. Applies prompt enhancements (Chain of Thought, few-shot examples, etc.) without breaking the cache
  6. Caches semantically similar queries to bypass the LLM entirely on repeat requests
  7. Tracks metrics — token savings, cost reduction, cache hit rates — via Prometheus and a built-in web dashboard

Why LayerCache?

| Problem | LayerCache Solution |
| --- | --- |
| Prompt prefix cache misses due to whitespace/ordering differences | Automatic canonicalization ensures identical prompts produce byte-for-byte identical output |
| Adding prompt enhancements (CoT, few-shots) breaks provider caching | Layered architecture (L0-L4) ensures enhancements are injected after the cached prefix |
| No visibility into cache performance or cost savings | Built-in Prometheus metrics and JSON dashboard showing hit rates, tokens saved, and $ saved |
| Different providers have different caching mechanisms | Provider adapters handle Anthropic (ephemeral markers), OpenAI (auto-caching), and Gemini (CachedContent) |
| Repeated similar queries waste tokens and money | Semantic cache with embedding similarity matching bypasses the LLM for near-duplicate queries |
| Long conversations grow an unbounded prefix, reducing cache effectiveness | Automatic L2 session truncation keeps only the last N tokens of conversation history |
| Silent cache misses with no diagnostic | Runtime warning when L0+L1+L2 is below the provider caching threshold (~1024 tokens) |

Core Concept: The Layered Prompt Architecture

The key insight behind LayerCache is that prompts have naturally occurring layers with different stability profiles. By enforcing strict separation between these layers, we can optimize caching and enhance prompts without invalidating provider prefix caches.

| Layer | Content | Mutability | Cache Status |
| --- | --- | --- | --- |
| L0: System | Core persona, safety rules, output format | Immutable | Cached |
| L1: Context | Domain knowledge, tool definitions, static few-shots | Updated rarely | Cached |
| L2: Session | Conversation history, user preferences | Per session/turn | Cached (short TTL) |
| L3: Enhancement | Dynamic instructions (CoT, RAG, dynamic few-shots) | Per request | Uncached |
| L4: User Input | The actual user query | Dynamic | Uncached |

Cache breakpoints are placed at L0/L1/L2 boundaries. Enhancements are injected at L3, ensuring they never invalidate the stable prefix.
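
As a rough illustration of the layering idea, the sketch below assembles L0-L4 into Anthropic-style text blocks and puts a cache marker at the end of each stable layer. The `Layers` dataclass and `build_blocks` function are hypothetical names chosen for this example and are not LayerCache's internal API; only the `cache_control: {"type": "ephemeral"}` marker shape is taken from Anthropic's prompt caching format.

```python
from dataclasses import dataclass


@dataclass
class Layers:
    """Hypothetical container for the five layers described above."""
    l0_system: str
    l1_context: str
    l2_session: str
    l3_enhancement: str
    l4_user: str


def build_blocks(layers: Layers) -> list:
    """Return text blocks with a cache breakpoint after each stable layer."""
    blocks = []
    # L0-L2 are stable, so each one ends with an ephemeral cache marker.
    for text in (layers.l0_system, layers.l1_context, layers.l2_session):
        blocks.append({
            "type": "text",
            "text": text,
            "cache_control": {"type": "ephemeral"},
        })
    # L3 and L4 change per request and come after the last breakpoint,
    # so changing them leaves the cached prefix untouched.
    for text in (layers.l3_enhancement, layers.l4_user):
        blocks.append({"type": "text", "text": text})
    return blocks


blocks = build_blocks(Layers(
    l0_system="You are a helpful assistant.",
    l1_context="Tool definitions go here.",
    l2_session="Earlier conversation turns.",
    l3_enhancement="Think step by step.",
    l4_user="What is the time complexity of quicksort?",
))
```

Because the per-request text only ever appears after the third marker, two requests that differ in L3 or L4 still share the same cached prefix.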

Features

Cache Optimization

  • Prompt Canonicalizer — Whitespace normalization, JSON minification, tool sorting for byte-for-byte deterministic output
  • Layered Architecture (L0-L4) — Separates system, context, session, enhancement, and user content so enhancements never invalidate the cached prefix
  • Provider Cache Markers — Anthropic cache_control, OpenAI auto-prefix caching, Gemini CachedContent
  • Injection at Stable Layers — Markers placed at L0/L1/L2 boundaries; L3/L4 left uncached
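
The canonicalization steps listed above can be sketched with the standard library alone. This is a minimal illustration, not the project's `canonicalizer.py`: the function names are made up for the example, and collapsing every whitespace run is a simplification that a real canonicalizer would likely avoid inside code blocks.

```python
import json
import re


def canonicalize_text(text: str) -> str:
    """Collapse runs of whitespace into single spaces and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def canonicalize_tools(tools: list) -> str:
    """Sort tools by name and serialize as minified JSON with sorted keys."""
    ordered = sorted(tools, key=lambda tool: tool["name"])
    return json.dumps(ordered, sort_keys=True, separators=(",", ":"))


# Two logically identical tool lists that differ in ordering and key order.
first = canonicalize_tools([
    {"name": "search", "description": "x"},
    {"name": "calc", "description": "y"},
])
second = canonicalize_tools([
    {"description": "y", "name": "calc"},
    {"description": "x", "name": "search"},
])
```

Both calls produce the same string, which is the property a provider prefix cache needs: the same logical prompt always serializes to the same bytes.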

Session Management

  • L2 Session Truncation — Automatically drops old conversation turns to keep the cacheable prefix within a token budget (turn-group-aware, preserves tool-call clusters)
  • Prefix Threshold Diagnostics — Info-level warning when L0+L1+L2 is below the ~1024-token caching threshold
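
A sketch of how turn-group-aware truncation and the threshold check might work. Everything here is an assumption made for illustration: the function names, the representation of history as a list of turn groups, and the four-characters-per-token estimate (a real implementation would use a tokenizer).

```python
def estimate_tokens(text: str) -> int:
    """Very rough estimate: about four characters per token (assumption)."""
    return max(1, len(text) // 4)


def truncate_session(turn_groups: list, max_tokens: int) -> list:
    """Keep the newest turn groups whose combined size fits in max_tokens.

    Each group is kept or dropped as a whole, so a tool call is never
    separated from its result.
    """
    kept = []
    used = 0
    for group in reversed(turn_groups):  # walk from newest to oldest
        cost = sum(estimate_tokens(message["content"]) for message in group)
        if used + cost > max_tokens:
            break
        kept.append(group)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


def prefix_too_short(prefix_tokens: int, threshold: int = 1024) -> bool:
    """True when the L0+L1+L2 prefix is below the provider caching threshold."""
    return prefix_tokens < threshold


history = [
    [{"role": "user", "content": "a" * 40}, {"role": "assistant", "content": "b" * 40}],
    [{"role": "user", "content": "c" * 40}, {"role": "assistant", "content": "d" * 40}],
    [{"role": "user", "content": "e" * 40}, {"role": "assistant", "content": "f" * 40}],
]
recent = truncate_session(history, max_tokens=45)
```

With a budget of 45 estimated tokens and 20 per group, only the two most recent groups survive, still in chronological order.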

Semantic Cache

  • Local Embeddings — FastEmbed (BAAI/bge-small-en-v1.5) in ProcessPoolExecutor
  • Dual-Key Strategy — Prefix hash (exact) + query embedding (semantic similarity)
  • Configurable TTLs — Per-request and default TTLs with automatic cleanup
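
The dual-key strategy can be pictured as a two-step lookup: an exact match on a hash of the prefix, then a cosine-similarity comparison among the entries stored under that hash. The sketch below is an in-memory toy under stated assumptions: the `DualKeyCache` class is hypothetical, the embeddings are hand-written two-dimensional vectors rather than FastEmbed output, and TTL handling and SQLite storage are left out.

```python
import hashlib
import math


def prefix_hash(prefix: str) -> str:
    """Exact key: SHA-256 of the canonical prefix."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()


def cosine(a: list, b: list) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class DualKeyCache:
    """Toy cache keyed by prefix hash, then by query-embedding similarity."""

    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        self._entries = {}  # prefix hash -> list of (embedding, response)

    def store(self, prefix: str, embedding: list, response: str) -> None:
        self._entries.setdefault(prefix_hash(prefix), []).append((embedding, response))

    def lookup(self, prefix: str, embedding: list):
        """Return the best response at or above the threshold, else None."""
        best_score, best_response = 0.0, None
        for stored, response in self._entries.get(prefix_hash(prefix), []):
            score = cosine(stored, embedding)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None


cache = DualKeyCache(threshold=0.95)
cache.store("system-prefix", [1.0, 0.0], "cached answer")
near_hit = cache.lookup("system-prefix", [0.99, 0.05])   # similar query, same prefix
other_prefix = cache.lookup("other-prefix", [1.0, 0.0])  # same query, different prefix
far_query = cache.lookup("system-prefix", [0.0, 1.0])    # same prefix, unrelated query
```

Requiring the prefix hash to match first means a semantically similar question asked under a different system prompt never returns the wrong cached answer.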

Prompt Enhancements

  • Enhancement API — Composable prompt engineering via request metadata
  • Suffix Injection — Enhancements injected at L3, never breaking L0-L2 cache
  • Dynamic Few-Shot Selector — Embedding-based retrieval of relevant examples
  • Prompt Registry — Named, versioned prompt templates (YAML/JSON)
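
One way to picture composable suffix injection is a lookup table of instruction strings that are appended to the final user message, leaving every earlier message byte-identical. This is only a sketch: the instruction wording, the `ENHANCEMENT_TEXT` table, and the choice to append to the last message are assumptions, since the exact injection point is not specified here.

```python
ENHANCEMENT_TEXT = {
    # Hypothetical instruction strings; the real wording is not documented here.
    "chain_of_thought": "Think through the problem step by step before answering.",
    "self_critique": "Review your draft answer and correct any mistakes.",
}


def apply_enhancements(messages: list, names: list) -> list:
    """Return a new message list with enhancement text appended to the last message."""
    if not names:
        return list(messages)
    suffix = "\n\n".join(ENHANCEMENT_TEXT[name] for name in names)
    *prefix, last = messages
    # Copy the last message instead of mutating the caller's dict.
    enhanced_last = {**last, "content": f"{last['content']}\n\n{suffix}"}
    return [*prefix, enhanced_last]


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the time complexity of quicksort?"},
]
enhanced = apply_enhancements(messages, ["chain_of_thought", "self_critique"])
```

The system message in `enhanced` is unchanged, so whatever prefix the provider cached for it remains valid.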

Observability & Management

  • Prometheus + JSON Metrics — Token savings, cost reduction, cache hit rates
  • Web Dashboard — Overview charts, per-model breakdown, cache browser, config editor, live log viewer (Jinja2 + HTMX + Chart.js)
  • Persistent Time-Series — Metric snapshots in SQLite with background collection loop
  • Config Hot-Reload — Update log level, pipeline timeout/retries at runtime without restart
  • Universal Routing — LiteLLM-based multi-provider routing with automatic failover

Quick Start

Option 1: Docker (Recommended)

# Clone the repository
git clone https://github.com/your-org/layercache.git
cd layercache

# Set your API keys
export ANTHROPIC_API_KEY=sk-ant-...
export OPENAI_API_KEY=sk-...

# Start the proxy
docker-compose up -d

Option 2: pip install

# Install dependencies
# (run from a cloned checkout of the repository, where requirements.txt lives)
pip install -r requirements.txt

# Set environment variables
export ANTHROPIC_API_KEY=your-key
export OPENAI_API_KEY=your-key

# Run the proxy
uvicorn layercache.main:app --host 0.0.0.0 --port 8000

Verify it works

curl http://localhost:8000/health
# {"status":"healthy","version":"1.4.0","semantic_cache":true}

Open http://localhost:8000/dashboard for the web dashboard (config editor, metrics charts, logs, template CRUD).

[Screenshot: dashboard overview with live metrics]
[Screenshot: per-model breakdown with adapter column]
[Screenshot: in-browser config editor]

Usage Examples

Basic Proxy (Zero Configuration)

Just point your existing OpenAI client at LayerCache. No code changes needed — caching works automatically.

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="sk-ant-your-anthropic-key"  # Provider key passed through
)

response = client.chat.completions.create(
    model="anthropic/claude-3-5-sonnet-20241022",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain async/await in Python."}
    ]
)

With Cache-Safe Enhancements

Add Chain of Thought reasoning without breaking the cache prefix:

response = client.chat.completions.create(
    model="anthropic/claude-3-5-sonnet-20241022",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the time complexity of quicksort?"}
    ],
    extra_body={
        "lc_enhancements": ["chain_of_thought"]
    }
)

Using a Prompt Template

Reference a named template from the registry instead of sending L0/L1 with every request:

response = client.chat.completions.create(
    model="anthropic/claude-3-5-sonnet-20241022",
    messages=[
        {"role": "user", "content": "Review this code for bugs."}
    ],
    extra_body={
        "lc_template": "code-assistant"
    }
)

Controlling Semantic Cache

# Skip semantic cache for this request
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[...],
    extra_body={
        "lc_cache_ttl": 0,           # No semantic caching
        "lc_enhancements": ["self_critique"]
    }
)

# Custom TTL (10 minutes)
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[...],
    extra_body={
        "lc_cache_ttl": 600
    }
)

Checking Cache Performance

# JSON metrics
curl http://localhost:8000/v1/cache/metrics

# Prometheus metrics
curl http://localhost:8000/metrics
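
If you want to compute a hit rate from the Prometheus endpoint yourself, a small parser for the text exposition format is enough. The sketch below runs against a hard-coded sample rather than a live server, and the metric names `lc_cache_hits_total` and `lc_cache_misses_total` are invented for the example; check the real `/metrics` output for the actual names. It also assumes samples carry no trailing timestamp.

```python
def parse_counters(text: str) -> dict:
    """Sum sample values per metric name, ignoring labels and comment lines."""
    totals = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated field; labels stay on the left.
        name_part, _, value = line.rpartition(" ")
        name = name_part.split("{", 1)[0]
        totals[name] = totals.get(name, 0.0) + float(value)
    return totals


SAMPLE = """\
# HELP lc_cache_hits_total Hypothetical metric name
# TYPE lc_cache_hits_total counter
lc_cache_hits_total{model="gpt-4o"} 30
lc_cache_hits_total{model="claude"} 10
lc_cache_misses_total{model="gpt-4o"} 60
"""

totals = parse_counters(SAMPLE)
hits = totals["lc_cache_hits_total"]
misses = totals["lc_cache_misses_total"]
hit_rate = hits / (hits + misses)
```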

API Reference

OpenAI-Compatible Endpoints

| Method | Endpoint | Description |
| --- | --- | --- |
| POST | /v1/chat/completions | Chat completions (drop-in OpenAI replacement) |
| POST | /v1/messages | Anthropic Messages API (drop-in Claude Code replacement) |
| GET | /v1/models | List available models |

Management Endpoints

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /health | Health check |
| GET | /v1/cache/metrics | Cache performance metrics (JSON) |
| GET | /v1/cache/metrics/history | Bucketed time-series for charting |
| GET | /v1/cache/metrics/status | Snapshot age tracking |
| GET | /metrics | Prometheus metrics (text/plain) |
| GET | /v1/prompts/templates | List prompt templates |
| POST | /v1/prompts/templates | Create/update a template |
| DELETE | /v1/prompts/templates/{name} | Delete a template |
| POST | /v1/prompts/reload | Reload templates from disk |

Dashboard Endpoints

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /dashboard | Overview with stat cards + charts |
| GET | /dashboard/models | Provider/model table |
| GET | /dashboard/cache | Semantic cache stats + invalidation |
| GET | /dashboard/templates | Prompt template CRUD |
| GET | /dashboard/config | YAML config editor |
| POST | /dashboard/config/save | Save config (HTMX, CSRF-protected) |
| GET | /dashboard/logs | Log tail from ring buffer |
| GET | /dashboard/login | Login form (when proxy key is set) |
| POST | /dashboard/login | Login action |

LayerCache Request Extensions

These fields can be added to any POST /v1/chat/completions request:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| lc_template | string | null | Name of a prompt template to use for L0/L1 |
| lc_enhancements | string[] | [] | Enhancement names to apply at L3 |
| lc_cache_ttl | int | 300 | Semantic cache TTL in seconds (0 = skip) |
| lc_layer_hints | object | null | Explicit index -> layer mapping |
| lc_skip_semantic_cache | bool | false | Skip semantic cache lookup entirely |
| lc_bypass_cache | bool | false | Skip all caching (semantic + provider) |
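
For client code that sets these fields often, a small helper can hold the documented defaults and emit only the fields that differ from them. The `LcExtensions` dataclass below is a hypothetical client-side convenience, not something shipped with LayerCache; the field names and defaults are copied from the table above.

```python
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class LcExtensions:
    """Client-side mirror of the request extension fields and their defaults."""
    lc_template: Optional[str] = None
    lc_enhancements: list = field(default_factory=list)
    lc_cache_ttl: int = 300
    lc_layer_hints: Optional[dict] = None
    lc_skip_semantic_cache: bool = False
    lc_bypass_cache: bool = False

    def to_extra_body(self) -> dict:
        """Return only the fields that differ from the documented defaults."""
        defaults = asdict(LcExtensions())
        return {k: v for k, v in asdict(self).items() if v != defaults[k]}


extra_body = LcExtensions(lc_template="code-assistant", lc_cache_ttl=600).to_extra_body()
```

The resulting dict can be passed as `extra_body=` to `client.chat.completions.create(...)` in the OpenAI SDK, as in the usage examples earlier.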

Built-in Enhancements

| Name | Description |
| --- | --- |
| chain_of_thought | Instructs the LLM to reason step-by-step |
| structured_json | Enforces JSON output format (optional schema) |
| self_critique | Instructs the LLM to review and refine its own response |
| dynamic_few_shot | Retrieves relevant few-shot examples from a local vector store |

Configuration

All configuration is done via layercache.yaml. A JSON Schema is provided for IDE autocompletion (VS Code, PyCharm); regenerate it with the layercache-schema command. An example configuration:

# yaml-language-server: $schema=./layercache.schema.json
proxy:
  host: 0.0.0.0
  port: 8000
  proxy_api_key: "your-optional-proxy-secret"  # Protect the proxy itself

providers:
  anthropic:
    api_key_env: ANTHROPIC_API_KEY        # Env var holding the key
  openai:
    api_key_env: OPENAI_API_KEY
  gemini:
    api_key_env: GOOGLE_API_KEY
  deepseek:
    api_key_env: DEEPSEEK_API_KEY          # Any LiteLLM provider works
    # adapter: openai                      # Override cache strategy (auto-detected if unset)

caching:
  semantic:
    enabled: true
    db_path: /data/semantic_cache.db
    default_ttl: 300              # 5 minutes
    similarity_threshold: 0.95    # Cosine similarity for semantic cache
    embedder: "BAAI/bge-small-en-v1.5"
  max_session_tokens: 2000        # Optional: truncate L2 to keep within token budget
  metrics:
    db_path: /data/metrics.db     # Time-series snapshot storage
    snapshot_interval_seconds: 60  # Background snapshot interval
    snapshot_retention_days: 7     # Snapshot retention

enhancements:
  registered:
    - name: chain_of_thought
    - name: structured_json
    - name: self_critique
    - name: dynamic_few_shot
      config:
        vector_store: /data/few_shots/examples.json
        top_k: 3

Environment Variables

| Variable | Description | Required |
| --- | --- | --- |
| ANTHROPIC_API_KEY | Anthropic API key | If using Anthropic |
| OPENAI_API_KEY | OpenAI API key | If using OpenAI |
| GOOGLE_API_KEY | Google Gemini API key | If using Gemini |
| (custom) | Any env var name per providers.{name}.api_key_env in config | Depends on config |
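
The `api_key_env` indirection means the config file names an environment variable rather than holding the secret itself. A minimal sketch of that lookup follows, assuming the YAML has already been parsed into a dict; `resolve_provider_keys` is a hypothetical helper, not the project's `config.py`.

```python
import os
from typing import Mapping, Optional


def resolve_provider_keys(config: dict, env: Optional[Mapping] = None) -> dict:
    """Map provider name -> API key for providers whose env var is set."""
    env = os.environ if env is None else env
    keys = {}
    for name, settings in config.get("providers", {}).items():
        value = env.get(settings["api_key_env"])
        if value:  # skip providers whose variable is missing or empty
            keys[name] = value
    return keys


config = {
    "providers": {
        "anthropic": {"api_key_env": "ANTHROPIC_API_KEY"},
        "openai": {"api_key_env": "OPENAI_API_KEY"},
    }
}
# Passing an explicit mapping keeps the example independent of the real environment.
keys = resolve_provider_keys(config, env={"ANTHROPIC_API_KEY": "sk-ant-test"})
```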

Docker Deployment

# Build and start
docker-compose up -d

# View logs
docker-compose logs -f layercache

# Stop
docker-compose down

Docker Volumes

| Host Path | Container Path | Purpose |
| --- | --- | --- |
| ./data | /data | Persistent storage (cache DB, templates, examples) |
| ./layercache.yaml | /app/layercache.yaml | Configuration file (read-only) |

Architecture

Client Application
        │
        ▼
┌──────────────────────────────────────┐
│          LayerCache Proxy            │
│  ┌────────────────────────────────┐  │
│  │     Request Pipeline           │  │
│  │  1. Semantic Cache Lookup      │  │
│  │  2. Stratify (L0→L4)           │  │
│  │  3. Canonicalize               │  │
│  │  3b. Truncate Session          │  │
│  │  3c. Prefix Threshold Check    │  │
│  │  4. Enhance (L3 injection)     │  │
│  │  5. Inject Cache Markers       │  │
│  │  6. Route via LiteLLM          │  │
│  │  7. Handle Response            │  │
│  │  8. Store & Record Metrics     │  │
│  └────────────────────────────────┘  │
│                                      │
│  ┌──────────┐ ┌────────┐ ┌─────────┐ │
│  │ Semantic │ │ Prompt │ │ Metrics │ │
│  │  Cache   │ │Registry│ │Collector│ │
│  └──────────┘ └────────┘ └─────────┘ │
└──────────────────────────────────────┘
        │         │         │
        ▼         ▼         ▼
   Anthropic   OpenAI    Gemini

Development

Prerequisites

  • Python 3.11+
  • pip

Setup

# Create virtual environment
python -m venv .venv
source .venv/bin/activate

# Install with dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

# Run with coverage
pytest tests/ --cov=layercache --cov-report=term-missing

Running Tests

# All tests
pytest tests/ -v

# Specific test file
pytest tests/test_stratifier.py -v

# With short tracebacks
pytest tests/ -v --tb=short

Code Quality

# Lint and format
ruff check layercache/
ruff format layercache/

# Type checking
mypy layercache/

Project Structure

layercache/
├── layercache/                   # Core package
│   ├── main.py                   # FastAPI application
│   ├── pipeline.py               # Request processing pipeline
│   ├── models.py                 # Pydantic data models
│   ├── stratifier.py             # L0-L4 message classification
│   ├── canonicalizer.py          # Prompt normalization
│   ├── config.py                 # YAML configuration
│   ├── schema.py                 # JSON Schema generator for IDE autocompletion
│   ├── adapters/                 # Provider cache marker injection
│   │   ├── anthropic.py         # Anthropic cache_control
│   │   ├── anthropic_messages.py # /v1/messages wire-format shim
│   │   ├── openai.py             # OpenAI auto-caching
│   │   └── gemini.py             # Gemini CachedContent
│   ├── enhancements/             # Cache-safe prompt enhancements
│   │   ├── base.py               # BaseEnhancement ABC
│   │   ├── chain_of_thought.py   # Step-by-step reasoning
│   │   ├── structured_output.py  # JSON format enforcement
│   │   ├── self_critique.py      # Self-review injection
│   │   └── dynamic_few_shot.py   # Vector-based example retrieval
│   ├── cache/                    # Semantic caching
│   │   ├── semantic.py           # SQLite-backed cache
│   │   └── embedder.py           # FastEmbed wrapper
│   ├── dashboard/                # Web dashboard (Jinja2 + HTMX)
│   │   ├── router.py             # Dashboard routes
│   │   └── templates/            # Jinja2 templates
│   ├── metrics/                  # Observability
│   │   ├── collector.py          # Prometheus + ROI tracking
│   │   └── storage.py            # Persistent time-series snapshots
│   ├── static/                   # Dashboard assets
│   └── registry/                 # Prompt template management
│       └── prompt_registry.py    # YAML/JSON template loader
├── tests/                        # Test suite (117 tests)
├── data/                         # Sample data
│   ├── prompts/                  # Prompt templates
│   └── few_shots/                # Few-shot examples
├── docs/                         # Documentation
│   ├── PRD.md                    # Product Requirements
│   ├── TDD.md                    # Technical Design
│   ├── IMPLEMENTATION_PLAN.md    # Sprint plan
│   ├── ARCHITECTURE.md           # Architecture deep-dive
│   ├── DEPLOYMENT.md             # Deployment guide
│   ├── USER_GUIDE.md             # User guide
│   └── API.md                    # API reference
├── Dockerfile                    # Production image
├── docker-compose.yml            # Docker Compose config
├── layercache.yaml               # Default configuration
├── pyproject.toml                # Python project config
├── layercache.schema.json        # JSON Schema for IDE autocompletion
└── requirements.txt              # Dependencies

Documentation

| Document | Description |
| --- | --- |
| PRD | Product Requirements Document |
| TDD | Technical Design Document |
| Implementation Plan | 8-sprint development roadmap |
| Architecture | System architecture deep-dive |
| Roadmap | Prioritized future development plan |
| Deployment Guide | Production deployment instructions |
| User Guide | Comprehensive usage guide |
| API Reference | Full API documentation |
| Contributing | How to contribute, setup, and PR process |
| CHANGELOG | Version history and changes |

License

This project is licensed under the MIT License. See LICENSE for the full text.

What this means:

  • ✅ You can freely use, copy, modify, and distribute this software
  • ✅ You can use it for commercial and private purposes
  • ✅ You must include a copy of the license and copyright notice
  • ⚠️ The software is provided "as-is" without warranty

Third-party code: Vendored libraries (Chart.js, HTMX) are included under their own licenses in THIRD_PARTY_NOTICES.md.

For more information, visit https://opensource.org/licenses/MIT
