Intelligent Prompt Enhancement & Token Caching Proxy

Maintainers

ZeroClue

Version 1.4.0 | Python 3.11+ | 117 tests passing | MIT License

LayerCache

A self-hosted, provider-agnostic LLM proxy that cuts costs by 30-60% and latency by 40%+ through aggressive token caching and cache-safe prompt engineering.

Overview
Why LayerCache?
Core Concept: The Layered Prompt Architecture
Features
Quick Start
Usage Examples
API Reference
Configuration
Docker Deployment
Architecture
Development
Documentation
License

Overview

LayerCache sits between your application and LLM providers (Anthropic, OpenAI, Google Gemini). It is a drop-in replacement for your LLM provider's base URL — just point your OpenAI SDK at LayerCache.

In the background, LayerCache:

Canonicalizes your prompts for byte-for-byte deterministic output (maximizing prefix cache hits)
Injects provider-specific cache markers at stable layer boundaries
Truncates long conversations to fit within a token budget (keeping recent turns, dropping old ones)
Warns when your prefix is too short for provider caching to work
Applies prompt enhancements (Chain of Thought, few-shot examples, etc.) without breaking the cache
Caches semantically similar queries to bypass the LLM entirely on repeat requests
Tracks metrics — token savings, cost reduction, cache hit rates — via Prometheus and a built-in web dashboard

Why LayerCache?

Problem	LayerCache Solution
Prompt prefix cache misses due to whitespace/ordering differences	Automatic canonicalization ensures identical prompts produce byte-for-byte identical output
Adding prompt enhancements (CoT, few-shots) breaks provider caching	Layered architecture (L0-L4) ensures enhancements are injected after the cached prefix
No visibility into cache performance or cost savings	Built-in Prometheus metrics and JSON dashboard showing hit rates, tokens saved, and $ saved
Different providers have different caching mechanisms	Provider adapters handle Anthropic (ephemeral markers), OpenAI (auto-caching), and Gemini (CachedContent)
Repeated similar queries waste tokens and money	Semantic cache with embedding similarity matching bypasses the LLM for near-duplicate queries
Long conversations grow an unbounded prefix, reducing cache effectiveness	Automatic L2 session truncation keeps only the last N tokens of conversation history
Silent cache misses with no diagnostic	Runtime warning when L0+L1+L2 is below the provider caching threshold (~1024 tokens)

Core Concept: The Layered Prompt Architecture

The key insight behind LayerCache is that prompts have naturally occurring layers with different stability profiles. By enforcing strict separation between these layers, we can optimize caching and enhance prompts without invalidating provider prefix caches.

Layer	Content	Mutability	Cache Status
L0: System	Core persona, safety rules, output format	Immutable	Cached
L1: Context	Domain knowledge, tool definitions, static few-shots	Updated rarely	Cached
L2: Session	Conversation history, user preferences	Per session/turn	Cached (short TTL)
L3: Enhancement	Dynamic instructions (CoT, RAG, dynamic few-shots)	Per request	Uncached
L4: User Input	The actual user query	Dynamic	Uncached

Cache breakpoints are placed at L0/L1/L2 boundaries. Enhancements are injected at L3, ensuring they never invalidate the stable prefix.
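
The breakpoint placement described above can be sketched for Anthropic-style content blocks. This is a minimal illustration, not LayerCache's actual code: the `Layer` enum and `mark_breakpoints` function are hypothetical names, and only the `cache_control: {"type": "ephemeral"}` field is Anthropic's documented marker.

```python
from enum import IntEnum


class Layer(IntEnum):
    """Hypothetical layer labels mirroring the table above."""
    SYSTEM = 0
    CONTEXT = 1
    SESSION = 2
    ENHANCEMENT = 3
    USER_INPUT = 4


def mark_breakpoints(blocks):
    """Return copies of (layer, block) pairs with a cache marker on the
    last block of each stable layer (L0-L2); L3/L4 stay unmarked."""
    last_index = {}
    for i, (layer, _) in enumerate(blocks):
        if layer <= Layer.SESSION:
            last_index[layer] = i  # remember the final block of each stable layer
    marked = []
    for i, (layer, block) in enumerate(blocks):
        block = dict(block)  # copy so the caller's data is not mutated
        if last_index.get(layer) == i:
            block["cache_control"] = {"type": "ephemeral"}
        marked.append((layer, block))
    return marked


blocks = [
    (Layer.SYSTEM, {"type": "text", "text": "You are a helpful assistant."}),
    (Layer.CONTEXT, {"type": "text", "text": "Tool definitions..."}),
    (Layer.SESSION, {"type": "text", "text": "Earlier turns..."}),
    (Layer.ENHANCEMENT, {"type": "text", "text": "Think step by step."}),
    (Layer.USER_INPUT, {"type": "text", "text": "What is quicksort's complexity?"}),
]
marked = mark_breakpoints(blocks)
```

Because the L3 and L4 blocks come after the last marker, changing them leaves the marked prefix byte-identical across requests.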

Features

Cache Optimization

Prompt Canonicalizer — Whitespace normalization, JSON minification, tool sorting for byte-for-byte deterministic output
Layered Architecture (L0-L4) — Separates system, context, session, enhancement, and user content so enhancements never invalidate the cached prefix
Provider Cache Markers — Anthropic cache_control, OpenAI auto-prefix caching, Gemini CachedContent
Injection at Stable Layers — Markers placed at L0/L1/L2 boundaries; L3/L4 left uncached
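
As a rough illustration of the canonicalization idea, the sketch below normalizes whitespace and emits tool definitions as sorted, minified JSON. The function names and the exact normalization rules are assumptions for illustration; the real canonicalizer may apply different rules.

```python
import json
import re


def canonicalize_text(text: str) -> str:
    """Normalize line endings, strip trailing whitespace, collapse blank-line runs."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    lines = [line.rstrip() for line in text.split("\n")]
    text = "\n".join(lines).strip()
    return re.sub(r"\n{3,}", "\n\n", text)


def canonicalize_tools(tools: list[dict]) -> str:
    """Sort tools by name and emit minified JSON with sorted keys."""
    ordered = sorted(tools, key=lambda t: t.get("name", ""))
    return json.dumps(ordered, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


# Two logically identical tool lists, differing only in ordering.
a = canonicalize_tools([{"name": "search", "args": {"q": "str"}}, {"name": "calc"}])
b = canonicalize_tools([{"name": "calc"}, {"args": {"q": "str"}, "name": "search"}])
```

Here `a` and `b` are the same string, so a provider's prefix cache would see identical bytes regardless of how the client ordered its tools or dictionary keys.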

Session Management

L2 Session Truncation — Automatically drops old conversation turns to keep the cacheable prefix within a token budget (turn-group-aware, preserves tool-call clusters)
Prefix Threshold Diagnostics — Info-level warning when L0+L1+L2 is below the ~1024-token caching threshold
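
The following sketch shows one way turn-group-aware truncation and the prefix threshold check could work. It is an assumption-laden illustration: `estimate_tokens` is a crude 4-characters-per-token stand-in for a real tokenizer, and `truncate_session` and `prefix_below_threshold` are hypothetical names.

```python
def estimate_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: roughly 4 characters per token."""
    return max(1, len(text) // 4)


def truncate_session(turn_groups, max_tokens):
    """Keep the most recent turn groups whose combined size fits the budget.

    Each turn group is a list of messages that must stay together
    (for example an assistant tool call and its tool result).
    """
    kept, used = [], 0
    for group in reversed(turn_groups):
        cost = sum(estimate_tokens(m["content"]) for m in group)
        if used + cost > max_tokens:
            break  # dropping whole groups avoids orphaned tool results
        kept.append(group)
        used += cost
    kept.reverse()  # restore chronological order
    return kept, used


def prefix_below_threshold(prefix_tokens: int, threshold: int = 1024) -> bool:
    """True when the cacheable prefix is likely too short for provider caching."""
    return prefix_tokens < threshold


history = [
    [{"role": "user", "content": "a" * 400}, {"role": "assistant", "content": "b" * 400}],
    [{"role": "assistant", "content": "c" * 200}, {"role": "tool", "content": "d" * 200}],
    [{"role": "user", "content": "e" * 80}],
]
kept, used = truncate_session(history, max_tokens=150)
```

With this budget the oldest group is dropped while the tool call and its result stay together. The remaining prefix is well under 1024 estimated tokens, which is the situation the diagnostic is meant to flag.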

Semantic Cache

Local Embeddings — FastEmbed (BAAI/bge-small-en-v1.5) in ProcessPoolExecutor
Dual-Key Strategy — Prefix hash (exact) + query embedding (semantic similarity)
Configurable TTLs — Per-request and default TTLs with automatic cleanup
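
A minimal in-memory sketch of the dual-key idea: an exact hash of the prefix narrows the candidates, then cosine similarity on the query embedding decides whether to reuse a response. The `DualKeyCache` class is hypothetical, uses hand-written vectors instead of FastEmbed output, and omits TTLs and SQLite persistence.

```python
import hashlib
import math


def prefix_key(prefix: str) -> str:
    """Exact-match key: SHA-256 of the canonicalized prefix."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()


def cosine(u, v) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


class DualKeyCache:
    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        self._entries = {}  # prefix hash -> list of (embedding, response)

    def put(self, prefix, embedding, response):
        self._entries.setdefault(prefix_key(prefix), []).append((embedding, response))

    def get(self, prefix, embedding):
        """Return the best response at or above the threshold, else None."""
        best, best_score = None, self.threshold
        for stored, response in self._entries.get(prefix_key(prefix), []):
            score = cosine(stored, embedding)
            if score >= best_score:
                best, best_score = response, score
        return best


cache = DualKeyCache()
cache.put("You are a helpful assistant.", [1.0, 0.0], "cached answer")
hit = cache.get("You are a helpful assistant.", [0.99, 0.05])
miss = cache.get("You are a helpful assistant.", [0.0, 1.0])
other_prefix = cache.get("Different system prompt.", [1.0, 0.0])
```

Requiring the prefix hash to match first means a similar query asked under a different system prompt never returns a response generated for another context.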

Prompt Enhancements

Enhancement API — Composable prompt engineering via request metadata
Suffix Injection — Enhancements injected at L3, never breaking L0-L2 cache
Dynamic Few-Shot Selector — Embedding-based retrieval of relevant examples
Prompt Registry — Named, versioned prompt templates (YAML/JSON)
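
To illustrate why injecting enhancements after the stable prefix keeps it cacheable, here is a small sketch. The instruction strings, the `ENHANCEMENT_TEXT` mapping, and the choice to fold the instructions into the final user message are all assumptions; LayerCache's actual L3 placement and wording may differ.

```python
ENHANCEMENT_TEXT = {  # hypothetical instruction strings, not LayerCache's wording
    "chain_of_thought": "Think through the problem step by step before answering.",
    "self_critique": "Review your draft answer and correct any mistakes before replying.",
}


def inject_enhancements(messages, names):
    """Return a new message list with enhancement instructions placed in the
    final message only, leaving every earlier message untouched."""
    if not names or not messages:
        return list(messages)
    instructions = "\n".join(ENHANCEMENT_TEXT[name] for name in names)
    *prefix, last = messages
    enhanced_last = {**last, "content": f"{instructions}\n\n{last['content']}"}
    return [*prefix, enhanced_last]


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the time complexity of quicksort?"},
]
enhanced = inject_enhancements(messages, ["chain_of_thought"])
```

Everything before the final message is identical with or without the enhancement, so toggling `chain_of_thought` between requests does not change the cached prefix.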

Observability & Management

Prometheus + JSON Metrics — Token savings, cost reduction, cache hit rates
Web Dashboard — Overview charts, per-model breakdown, cache browser, config editor, live log viewer (Jinja2 + HTMX + Chart.js)
Persistent Time-Series — Metric snapshots in SQLite with background collection loop
Config Hot-Reload — Update log level, pipeline timeout/retries at runtime without restart
Universal Routing — LiteLLM-based multi-provider routing with automatic failover

Quick Start

Option 1: Docker (Recommended)

# Clone the repository
git clone https://github.com/ZeroClue/layercache.git
cd layercache

# Set your API keys
export ANTHROPIC_API_KEY=sk-ant-...
export OPENAI_API_KEY=sk-...

# Start the proxy
docker-compose up -d

Option 2: pip install

# Install dependencies (run from the cloned repository root)
pip install -r requirements.txt

# Set environment variables
export ANTHROPIC_API_KEY=your-key
export OPENAI_API_KEY=your-key

# Run the proxy
uvicorn layercache.main:app --host 0.0.0.0 --port 8000

Verify it works

curl http://localhost:8000/health
# {"status":"healthy","version":"1.4.0","semantic_cache":true}

Open http://localhost:8000/dashboard for the web dashboard (config editor, metrics charts, logs, template CRUD).

Dashboard screenshots: overview with live metrics; per-model breakdown with adapter column; in-browser config editor.

Usage Examples

Basic Proxy (Zero Configuration)

Point your existing OpenAI client at LayerCache by changing its base URL. No other code changes are needed; caching works automatically.

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="sk-ant-your-anthropic-key"  # Provider key passed through
)

response = client.chat.completions.create(
    model="anthropic/claude-3-5-sonnet-20241022",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain async/await in Python."}
    ]
)

With Cache-Safe Enhancements

Add Chain of Thought reasoning without breaking the cache prefix:

response = client.chat.completions.create(
    model="anthropic/claude-3-5-sonnet-20241022",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the time complexity of quicksort?"}
    ],
    extra_body={
        "lc_enhancements": ["chain_of_thought"]
    }
)

Using a Prompt Template

Reference a named template from the registry instead of sending L0/L1 with every request:

response = client.chat.completions.create(
    model="anthropic/claude-3-5-sonnet-20241022",
    messages=[
        {"role": "user", "content": "Review this code for bugs."}
    ],
    extra_body={
        "lc_template": "code-assistant"
    }
)

Controlling Semantic Cache

# Skip semantic cache for this request
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[...],
    extra_body={
        "lc_cache_ttl": 0,           # No semantic caching
        "lc_enhancements": ["self_critique"]
    }
)

# Custom TTL (10 minutes)
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[...],
    extra_body={
        "lc_cache_ttl": 600
    }
)

Checking Cache Performance

# JSON metrics
curl http://localhost:8000/v1/cache/metrics

# Prometheus metrics
curl http://localhost:8000/metrics

API Reference

OpenAI-Compatible Endpoints

Method	Endpoint	Description
`POST`	`/v1/chat/completions`	Chat completions (drop-in OpenAI replacement)
`POST`	`/v1/messages`	Anthropic Messages API (drop-in Claude Code replacement)
`GET`	`/v1/models`	List available models

Management Endpoints

Method	Endpoint	Description
`GET`	`/health`	Health check
`GET`	`/v1/cache/metrics`	Cache performance metrics (JSON)
`GET`	`/v1/cache/metrics/history`	Bucketed time-series for charting
`GET`	`/v1/cache/metrics/status`	Snapshot age tracking
`GET`	`/metrics`	Prometheus metrics (text/plain)
`GET`	`/v1/prompts/templates`	List prompt templates
`POST`	`/v1/prompts/templates`	Create/update a template
`DELETE`	`/v1/prompts/templates/{name}`	Delete a template
`POST`	`/v1/prompts/reload`	Reload templates from disk

Dashboard Endpoints

Method	Endpoint	Description
`GET`	`/dashboard`	Overview with stat cards + charts
`GET`	`/dashboard/models`	Provider/model table
`GET`	`/dashboard/cache`	Semantic cache stats + invalidation
`GET`	`/dashboard/templates`	Prompt template CRUD
`GET`	`/dashboard/config`	YAML config editor
`POST`	`/dashboard/config/save`	Save config (HTMX, CSRF-protected)
`GET`	`/dashboard/logs`	Log tail from ring buffer
`GET`	`/dashboard/login`	Login form (when proxy key is set)
`POST`	`/dashboard/login`	Login action

LayerCache Request Extensions

These fields can be added to any POST /v1/chat/completions request:

Field	Type	Default	Description
`lc_template`	`string`	`null`	Name of a prompt template to use for L0/L1
`lc_enhancements`	`string[]`	`[]`	Enhancement names to apply at L3
`lc_cache_ttl`	`int`	`300`	Semantic cache TTL in seconds (0 = skip)
`lc_layer_hints`	`object`	`null`	Explicit `index -> layer` mapping
`lc_skip_semantic_cache`	`bool`	`false`	Skip semantic cache lookup entirely
`lc_bypass_cache`	`bool`	`false`	Skip all caching (semantic + provider)

Built-in Enhancements

Name	Description
`chain_of_thought`	Instructs the LLM to reason step-by-step
`structured_json`	Enforces JSON output format (optional schema)
`self_critique`	Instructs the LLM to review and refine its own response
`dynamic_few_shot`	Retrieves relevant few-shot examples from a local vector store
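
A small helper can catch typos in extension fields before a request is sent. The sketch below is hypothetical client-side code, not part of LayerCache. It assumes only the four built-in enhancements are registered and covers a subset of the fields listed above.

```python
BUILTIN_ENHANCEMENTS = {
    "chain_of_thought",
    "structured_json",
    "self_critique",
    "dynamic_few_shot",
}


def build_extra_body(template=None, enhancements=(), cache_ttl=None, bypass_cache=False):
    """Build an `extra_body` dict, omitting fields left at their defaults."""
    unknown = set(enhancements) - BUILTIN_ENHANCEMENTS
    if unknown:
        raise ValueError(f"unknown enhancements: {sorted(unknown)}")
    if cache_ttl is not None and cache_ttl < 0:
        raise ValueError("cache_ttl must be >= 0 (0 skips the semantic cache)")
    body = {}
    if template:
        body["lc_template"] = template
    if enhancements:
        body["lc_enhancements"] = list(enhancements)
    if cache_ttl is not None:
        body["lc_cache_ttl"] = cache_ttl
    if bypass_cache:
        body["lc_bypass_cache"] = True
    return body


extra = build_extra_body(template="code-assistant", enhancements=["self_critique"], cache_ttl=600)
```

The result can be passed as `extra_body=extra` to `client.chat.completions.create(...)`, as in the usage examples.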

Configuration

All configuration is done via layercache.yaml. A JSON Schema is provided for IDE autocompletion (VS Code, PyCharm); regenerate it with the layercache-schema command. An example configuration:

# yaml-language-server: $schema=./layercache.schema.json
proxy:
  host: 0.0.0.0
  port: 8000
  proxy_api_key: "your-optional-proxy-secret"  # Protect the proxy itself

providers:
  anthropic:
    api_key_env: ANTHROPIC_API_KEY        # Env var holding the key
  openai:
    api_key_env: OPENAI_API_KEY
  gemini:
    api_key_env: GOOGLE_API_KEY
  deepseek:
    api_key_env: DEEPSEEK_API_KEY          # Any LiteLLM provider works
    # adapter: openai                      # Override cache strategy (auto-detected if unset)

caching:
  semantic:
    enabled: true
    db_path: /data/semantic_cache.db
    default_ttl: 300              # 5 minutes
    similarity_threshold: 0.95    # Cosine similarity for semantic cache
    embedder: "BAAI/bge-small-en-v1.5"
  max_session_tokens: 2000        # Optional: truncate L2 to keep within token budget
  metrics:
    db_path: /data/metrics.db     # Time-series snapshot storage
    snapshot_interval_seconds: 60  # Background snapshot interval
    snapshot_retention_days: 7     # Snapshot retention

enhancements:
  registered:
    - name: chain_of_thought
    - name: structured_json
    - name: self_critique
    - name: dynamic_few_shot
      config:
        vector_store: /data/few_shots/examples.json
        top_k: 3

Environment Variables

Variable	Description	Required
`ANTHROPIC_API_KEY`	Anthropic API key	If using Anthropic
`OPENAI_API_KEY`	OpenAI API key	If using OpenAI
`GOOGLE_API_KEY`	Google Gemini API key	If using Gemini
(custom)	Any env var name per `providers.{name}.api_key_env` in config	Depends on config
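
The `api_key_env` indirection means the config file names an environment variable rather than holding the secret itself. A minimal sketch of that lookup, using a plain dict in place of the parsed YAML; `resolve_provider_keys` is a hypothetical name, not LayerCache's loader.

```python
import os


def resolve_provider_keys(config: dict, environ=None):
    """Map each provider to the key found in its configured env var.

    Returns (keys, missing), where `missing` lists env var names that
    were unset or empty.
    """
    environ = os.environ if environ is None else environ
    keys, missing = {}, []
    for name, settings in config.get("providers", {}).items():
        env_var = settings["api_key_env"]
        value = environ.get(env_var)
        if value:
            keys[name] = value
        else:
            missing.append(env_var)
    return keys, missing


config = {
    "providers": {
        "anthropic": {"api_key_env": "ANTHROPIC_API_KEY"},
        "deepseek": {"api_key_env": "DEEPSEEK_API_KEY"},
    }
}
# A fake environment is passed in so the example does not depend on real keys.
keys, missing = resolve_provider_keys(config, environ={"ANTHROPIC_API_KEY": "sk-ant-example"})
```

Reporting the missing variable names at startup makes it easier to see which providers are unusable without logging any secret values.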

Docker Deployment

# Build and start
docker-compose up -d

# View logs
docker-compose logs -f layercache

# Stop
docker-compose down

Docker Volumes

Host Path	Container Path	Purpose
`./data`	`/data`	Persistent storage (cache DB, templates, examples)
`./layercache.yaml`	`/app/layercache.yaml`	Configuration file (read-only)

Architecture

Client Application
        │
        ▼
┌──────────────────────────────────────┐
│          LayerCache Proxy            │
│  ┌────────────────────────────────┐  │
│  │     Request Pipeline           │  │
│  │  1. Semantic Cache Lookup      │  │
│  │  2. Stratify (L0→L4)           │  │
│  │  3. Canonicalize               │  │
│  │  3b. Truncate Session          │  │
│  │  3c. Prefix Threshold Check    │  │
│  │  4. Enhance (L3 injection)     │  │
│  │  5. Inject Cache Markers       │  │
│  │  6. Route via LiteLLM          │  │
│  │  7. Handle Response            │  │
│  │  8. Store & Record Metrics     │  │
│  └────────────────────────────────┘  │
│                                      │
│  ┌──────────┐ ┌────────┐ ┌─────────┐ │
│  │ Semantic │ │ Prompt │ │ Metrics │ │
│  │  Cache   │ │Registry│ │Collector│ │
│  └──────────┘ └────────┘ └─────────┘ │
└──────────────────────────────────────┘
        │         │         │
        ▼         ▼         ▼
   Anthropic   OpenAI    Gemini

Development

Prerequisites

Python 3.11+
pip

Setup

# Create virtual environment
python -m venv .venv
source .venv/bin/activate

# Install with dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

# Run with coverage
pytest tests/ --cov=layercache --cov-report=term-missing

Running Tests

# All tests
pytest tests/ -v

# Specific test file
pytest tests/test_stratifier.py -v

# With short tracebacks
pytest tests/ -v --tb=short

Code Quality

# Lint and format
ruff check layercache/
ruff format layercache/

# Type checking
mypy layercache/

Project Structure

layercache/
├── layercache/                   # Core package
│   ├── main.py                   # FastAPI application
│   ├── pipeline.py               # Request processing pipeline
│   ├── models.py                 # Pydantic data models
│   ├── stratifier.py             # L0-L4 message classification
│   ├── canonicalizer.py          # Prompt normalization
│   ├── config.py                 # YAML configuration
│   ├── schema.py                 # JSON Schema generator for IDE autocompletion
│   ├── adapters/                 # Provider cache marker injection
│   │   ├── anthropic.py         # Anthropic cache_control
│   │   ├── anthropic_messages.py # /v1/messages wire-format shim
│   │   ├── openai.py             # OpenAI auto-caching
│   │   └── gemini.py             # Gemini CachedContent
│   ├── enhancements/             # Cache-safe prompt enhancements
│   │   ├── base.py               # BaseEnhancement ABC
│   │   ├── chain_of_thought.py   # Step-by-step reasoning
│   │   ├── structured_output.py  # JSON format enforcement
│   │   ├── self_critique.py      # Self-review injection
│   │   └── dynamic_few_shot.py   # Vector-based example retrieval
│   ├── cache/                    # Semantic caching
│   │   ├── semantic.py           # SQLite-backed cache
│   │   └── embedder.py           # FastEmbed wrapper
│   ├── dashboard/                # Web dashboard (Jinja2 + HTMX)
│   │   ├── router.py             # Dashboard routes
│   │   └── templates/            # Jinja2 templates
│   ├── metrics/                  # Observability
│   │   ├── collector.py          # Prometheus + ROI tracking
│   │   └── storage.py            # Persistent time-series snapshots
│   ├── static/                   # Dashboard assets
│   └── registry/                 # Prompt template management
│       └── prompt_registry.py    # YAML/JSON template loader
├── tests/                        # Test suite (117 tests)
├── data/                         # Sample data
│   ├── prompts/                  # Prompt templates
│   └── few_shots/                # Few-shot examples
├── docs/                         # Documentation
│   ├── PRD.md                    # Product Requirements
│   ├── TDD.md                    # Technical Design
│   ├── IMPLEMENTATION_PLAN.md    # Sprint plan
│   ├── ARCHITECTURE.md           # Architecture deep-dive
│   ├── DEPLOYMENT.md             # Deployment guide
│   ├── USER_GUIDE.md             # User guide
│   └── API.md                    # API reference
├── Dockerfile                    # Production image
├── docker-compose.yml            # Docker Compose config
├── layercache.yaml               # Default configuration
├── pyproject.toml                # Python project config
├── layercache.schema.json        # JSON Schema for IDE autocompletion
└── requirements.txt              # Dependencies

Documentation

Document	Description
PRD	Product Requirements Document
TDD	Technical Design Document
Implementation Plan	8-sprint development roadmap
Architecture	System architecture deep-dive
Roadmap	Prioritized future development plan
Deployment Guide	Production deployment instructions
User Guide	Comprehensive usage guide
API Reference	Full API documentation
Contributing	How to contribute, setup, and PR process
CHANGELOG	Version history and changes

License

Built with OpenCode Go — fork, automate, ship.

This project is licensed under the MIT License. See LICENSE for the full text.

What this means:

✅ You can freely use, copy, modify, and distribute this software
✅ You can use it for commercial and private purposes
✅ You must include a copy of the license and copyright notice
⚠️ The software is provided "as-is" without warranty

Third-party code: Vendored libraries (Chart.js, HTMX) are included under their own licenses in THIRD_PARTY_NOTICES.md.

For more information, visit https://opensource.org/licenses/MIT

Release history

1.7.0: May 28, 2026
1.6.0: May 27, 2026
1.5.0: May 27, 2026
1.4.0 (this version): May 26, 2026

Download files

Source Distribution

layercache-1.4.0.tar.gz (909.5 kB)

Uploaded May 26, 2026 Source

Built Distribution

layercache-1.4.0-py3-none-any.whl (168.9 kB)

Uploaded May 26, 2026 Python 3

File details

Details for the file layercache-1.4.0.tar.gz.

File metadata

Download URL: layercache-1.4.0.tar.gz
Upload date: May 26, 2026
Size: 909.5 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for layercache-1.4.0.tar.gz
Algorithm	Hash digest
SHA256	`e3e5de3139cc7eb5e2967a6c4343e3c440f63c3dfbc672a9b93d24251c51fae8`
MD5	`d50d8843abb32c329e28fb70777a35ee`
BLAKE2b-256	`fe5c52b4d17ee626ce4484f0e6fd03b8c10671711e672d784fe701baf262847b`

Provenance

The following attestation bundles were made for layercache-1.4.0.tar.gz:

Publisher: release.yml on ZeroClue/layercache

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: layercache-1.4.0.tar.gz
- Subject digest: e3e5de3139cc7eb5e2967a6c4343e3c440f63c3dfbc672a9b93d24251c51fae8
- Sigstore transparency entry: 1633421169
- Sigstore integration time: May 26, 2026
Source repository:
- Permalink: ZeroClue/layercache@7a2c3f94d1a7b1b209de4f83cea8ed770737be03
- Branch / Tag: refs/tags/v1.4.0
- Owner: https://github.com/ZeroClue
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@7a2c3f94d1a7b1b209de4f83cea8ed770737be03
- Trigger Event: push

File details

Details for the file layercache-1.4.0-py3-none-any.whl.

File metadata

Download URL: layercache-1.4.0-py3-none-any.whl
Upload date: May 26, 2026
Size: 168.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for layercache-1.4.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2b170b6481e126b85af51097fac8f866be6de292cd4763722f8ea8cd23c27734`
MD5	`80754dc60532cbf733593bc39989b800`
BLAKE2b-256	`0536882580a492b0da5dcd109253fa6f85c32530226d0b9e22c46fe915e70abd`

Provenance

The following attestation bundles were made for layercache-1.4.0-py3-none-any.whl:

Publisher: release.yml on ZeroClue/layercache

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: layercache-1.4.0-py3-none-any.whl
- Subject digest: 2b170b6481e126b85af51097fac8f866be6de292cd4763722f8ea8cd23c27734
- Sigstore transparency entry: 1633421176
- Sigstore integration time: May 26, 2026
Source repository:
- Permalink: ZeroClue/layercache@7a2c3f94d1a7b1b209de4f83cea8ed770737be03
- Branch / Tag: refs/tags/v1.4.0
- Owner: https://github.com/ZeroClue
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@7a2c3f94d1a7b1b209de4f83cea8ed770737be03
- Trigger Event: push
