
Framework-neutral LLM cache orchestration, prompt layout, and LMCache/vLLM helper utilities.


Prompt Cache Kit

Prompt Cache Kit is a framework-neutral Python package for LLM cache orchestration. It helps applications decide what should be cacheable, mark provider-specific prompt cache boundaries, cache final responses safely, and normalize cache/token usage telemetry.

It does not claim to implement universal KV-cache reuse around arbitrary model objects. True KV caching belongs inside inference engines such as vLLM, SGLang, TensorRT-LLM, or LMCache. This package gives your application a clean control plane around those systems.

What It Provides

  • Response caching for Python callables and model-like objects.
  • Prompt caching strategies using an extendable Strategy pattern.
  • Provider-specific cache-point compilation for Anthropic and Bedrock/LangChain.
  • Redis and in-memory response-cache backends.
  • LangChain, LangGraph, and CrewAI-friendly wrapping helpers.
  • Usage normalization for OpenAI/OpenRouter, Anthropic, and LangChain metadata.
  • OpenTelemetry GenAI-style usage attribute export.
  • vLLM prefix-caching and LMCache connector config helpers.

Install

For local development:

pip install -e ".[dev]"

Optional extras:

pip install "prompt-cache-kit[redis]"
pip install "prompt-cache-kit[langchain]"

Mental Model

Prompt Cache Kit separates four layers:

  1. Response cache: stores the final result of deterministic or acceptable-to-replay LLM calls.
  2. Prompt cache strategy: decides which message boundaries should become cache points.
  3. Provider compiler: renders cache points as Anthropic cache_control, Bedrock cachePoint, or generic metadata.
  4. Engine helper: generates vLLM/LMCache configuration and checks LMCache HTTP health/status endpoints.

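This split matters because the three kinds of caching give different guarantees. A response cache returns a stored answer without calling the model at all. Provider prompt caching still runs the model but lets the provider reuse an already-processed prompt prefix. Engine KV caching reuses attention state inside the inference engine and is invisible to application code.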
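To make "deterministic or acceptable-to-replay" in layer 1 concrete, here is a minimal sketch of a guard an application might run before storing a response. The function name `is_replay_safe`, the `allow_sampling` flag, and the assumed default temperature of 1.0 are all hypothetical; this is not the package's policy logic.

```python
def is_replay_safe(params, allow_sampling=False):
    """Return True if replaying a stored response for these call params is acceptable.

    Hypothetical helper for illustration only; not part of prompt_cache_kit.
    """
    # Streaming calls need a recorder/replay layer, so treat them as not cacheable here.
    if params.get("stream"):
        return False
    # Assumption: a missing temperature means the provider default, taken here as 1.0.
    temperature = params.get("temperature", 1.0)
    # Temperature 0 is treated as "deterministic enough"; sampled calls need an explicit opt-in.
    return temperature == 0 or allow_sampling


print(is_replay_safe({"temperature": 0}))
print(is_replay_safe({"temperature": 0.7}))
print(is_replay_safe({"temperature": 0.7}, allow_sampling=True))
```

Even at temperature 0 many providers do not promise bit-identical output, so the guard encodes a product decision about replay rather than a guarantee.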

Quick Start

from prompt_cache_kit import CachePolicy, MemoryCacheBackend, cached

backend = MemoryCacheBackend()
policy = CachePolicy(namespace="research-agent", ttl_seconds=3600)

@cached(backend=backend, policy=policy)
def call_model(messages, model="gpt-4.1-mini", temperature=0):
    return {"content": "real model response here"}

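# First call: nothing is cached yet, so the function body runs.
call_model([{"role": "user", "content": "Explain KV caching"}])
# Second call with identical arguments: expected to be served from the cache.
call_model([{"role": "user", "content": "Explain KV caching"}])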

print(backend.stats())
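How the decorator derives its cache key is not described here. A common approach, sketched below under that caveat, is to hash a canonical JSON rendering of the namespace, messages, and call parameters. `make_cache_key` is a hypothetical name, not a function exported by the package.

```python
import hashlib
import json


def make_cache_key(namespace, messages, **params):
    """Build a stable key from the namespace, messages, and call parameters.

    Hypothetical helper; the package's real key derivation may differ.
    """
    payload = {"messages": messages, "params": params}
    # sort_keys plus fixed separators gives a canonical string, so keyword
    # order and dict ordering do not change the resulting key.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"{namespace}:{digest}"


msgs = [{"role": "user", "content": "Explain KV caching"}]
key_a = make_cache_key("research-agent", msgs, model="gpt-4.1-mini", temperature=0)
key_b = make_cache_key("research-agent", msgs, temperature=0, model="gpt-4.1-mini")
key_c = make_cache_key("research-agent", msgs, model="gpt-4.1-mini", temperature=0.7)

print(key_a == key_b)  # same call, different keyword order
print(key_a == key_c)  # different temperature
```

Including every parameter that can change the output (model, temperature, tools, and so on) in the key is what keeps a hit from returning an answer produced under different settings.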

Response Cache Backends

In Memory

from prompt_cache_kit import MemoryCacheBackend

backend = MemoryCacheBackend(max_size=1000)
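The eviction policy behind max_size is not stated above. As an illustration of what a size-bounded in-memory cache involves, the sketch below assumes least-recently-used (LRU) eviction; `TinyLRUCache` is a hypothetical class and not the package's backend.

```python
from collections import OrderedDict


class TinyLRUCache:
    """Size-bounded cache with LRU eviction (illustrative assumption, not MemoryCacheBackend)."""

    def __init__(self, max_size):
        self.max_size = max_size
        self._items = OrderedDict()

    def get(self, key, default=None):
        if key not in self._items:
            return default
        # A read marks the entry as most recently used.
        self._items.move_to_end(key)
        return self._items[key]

    def set(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        # Drop the least recently used entries once the bound is exceeded.
        while len(self._items) > self.max_size:
            self._items.popitem(last=False)


cache = TinyLRUCache(max_size=2)
cache.set("a", 1)
cache.set("b", 2)
cache.get("a")     # "a" becomes the most recently used entry
cache.set("c", 3)  # exceeds max_size, so "b" is evicted
print(cache.get("b"))
```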

Redis

from openai import OpenAI  # assumption: `client` below is an OpenAI SDK client
from prompt_cache_kit import CachePolicy, RedisCacheBackend, cached

client = OpenAI()  # reads OPENAI_API_KEY from the environment

backend = RedisCacheBackend(
    url="redis://localhost:6379/0",
    namespace="my-agent-cache",
)
policy = CachePolicy(namespace="invoice-agent", ttl_seconds=300)

@cached(backend=backend, policy=policy)
def call_llm(messages, model="gpt-4.1-mini"):
    return client.chat.completions.create(model=model, messages=messages)

RedisCacheBackend stores arbitrary Python responses with pickle. Unpickling can execute code chosen by whoever wrote the value, so use a dedicated Redis namespace/database and never share it across trust boundaries.
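The reason pickle and trust boundaries do not mix can be shown with the standard library alone. In this sketch the class `Innocuous` is hypothetical and the callable it smuggles in is the harmless built-in `sorted`; an attacker with write access to the cache could name any importable callable instead.

```python
import pickle


class Innocuous:
    def __reduce__(self):
        # Instead of object state, the pickle records "call sorted(['c', 'a', 'b'])".
        # pickle.loads will perform that call when the value is read back.
        return (sorted, (["c", "a", "b"],))


blob = pickle.dumps(Innocuous())

# The reader only calls loads(), yet a callable chosen by the writer runs,
# and what comes back is that call's result rather than an Innocuous instance.
restored = pickle.loads(blob)
print(type(restored).__name__, restored)
```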

Wrapping Model Objects

Prompt Cache Kit uses duck typing. It can wrap common model methods without importing the framework.

from prompt_cache_kit import CachedModel, MemoryCacheBackend

cached_model = CachedModel(
    model=some_model,
    backend=MemoryCacheBackend(),
    methods=("invoke", "ainvoke", "__call__"),
)

result = cached_model.invoke("hello")
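The duck-typing idea can be sketched without any framework import: intercept attribute lookup, wrap only the listed method names, and pass everything else through. `DuckCachedModel` and `FakeModel` are hypothetical names, and the key scheme (method name plus `repr` of the arguments) is a simplification, not how CachedModel is implemented.

```python
class DuckCachedModel:
    """Wrap any object and cache calls to the named methods (illustrative sketch)."""

    def __init__(self, model, methods=("invoke",)):
        self._model = model
        self._methods = set(methods)
        self._store = {}

    def __getattr__(self, name):
        # Only reached for attributes the wrapper itself does not define.
        target = getattr(self._model, name)
        if name not in self._methods or not callable(target):
            return target  # pass through untouched

        def cached_call(*args):
            key = (name, repr(args))
            if key not in self._store:
                self._store[key] = target(*args)
            return self._store[key]

        return cached_call


class FakeModel:
    """Stand-in for a framework model; counts how often it is really called."""

    name = "fake-model"

    def __init__(self):
        self.calls = 0

    def invoke(self, prompt):
        self.calls += 1
        return f"echo: {prompt}"


model = FakeModel()
wrapped = DuckCachedModel(model, methods=("invoke",))
first = wrapped.invoke("hello")
second = wrapped.invoke("hello")
print(first, model.calls, wrapped.name)
```

Because the wrapper never checks the model's type, the same approach works for any object that exposes the listed method names.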

Convenience wrappers:

from prompt_cache_kit import wrap_crewai_llm, wrap_langchain_model

cached_langchain_model = wrap_langchain_model(chat_model)
cached_crewai_llm = wrap_crewai_llm(llm)

Streaming/generator methods pass through uncached by default. Caching a streamed response correctly requires a recorder/replay layer, and the wrappers intentionally do not add one behind your back.
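For readers who do need streamed-response caching, a recorder/replay layer can be sketched as two generators. All names below (`record_stream`, `replay_stream`, `fake_stream`) are hypothetical and the store is a plain dict; this is not functionality the package provides.

```python
def record_stream(chunks, store, key):
    """Yield chunks to the caller while recording them; commit only on full completion."""
    recorded = []
    for chunk in chunks:
        recorded.append(chunk)
        yield chunk
    # Reached only if the consumer drained the stream, so partial output is never cached.
    store[key] = recorded


def replay_stream(store, key):
    """Replay a previously recorded stream chunk by chunk."""
    yield from store[key]


def fake_stream():
    # Stand-in for a model's streaming method.
    yield from ["KV ", "caching ", "reuses ", "attention state."]


store = {}
first = "".join(record_stream(fake_stream(), store, "q1"))
second = "".join(replay_stream(store, "q1"))

# A consumer that abandons the stream early leaves nothing in the cache.
partial = record_stream(fake_stream(), store, "q2")
next(partial)
partial.close()

print(first == second, "q2" in store)
```

The commit-on-completion step is the part that is easy to get wrong: storing chunks as they arrive would let an interrupted stream be replayed later as if it were a complete answer.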

Prompt Caching Strategy Pattern

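The main extension point in the low-level design (LLD) is PromptCachingStrategy. Subclass it and implement select(), which receives the prompt context and returns the cache directives to apply.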

from prompt_cache_kit import CacheDirective, PromptCachingStrategy

class LastAssistantStrategy(PromptCachingStrategy):
    def select(self, context):
        for index in range(len(context.messages) - 1, -1, -1):
            if context.messages[index]["role"] == "assistant":
                return [CacheDirective(message_index=index, id="last-assistant-prefix")]
        return []

Built-in strategies:

  • CacheUntilPromptCachingStrategy
  • ManualPromptCachingStrategy
  • RollingPromptCachingStrategy
  • StablePrefixPromptCachingStrategy

Convenience factories:

from prompt_cache_kit import cache_at, cache_until, rolling_cache

manual = cache_at(0, 2, 4, ids={0: "system-v3", 4: "old-thread-v1"}).excluding(2)
boundary = cache_until(1, id="system-and-document-v1")
rolling = rolling_cache(every_messages=2, min_tokens=1024, max_points=4)

Compile a strategy into provider-specific messages:

from prompt_cache_kit import apply_cache_points

messages = [
    {"role": "system", "content": "Long stable system prompt...", "stable": True},
    {"role": "user", "content": "Long stable uploaded document...", "stable": True},
    {"role": "user", "content": "What changed in section 4?", "stable": False},
]

anthropic_messages = apply_cache_points(messages, plan=boundary, provider="anthropic")
bedrock_messages = apply_cache_points(messages, plan=boundary, provider="bedrock")

Anthropic output marks the text block with:

{"cache_control": {"type": "ephemeral"}}

Bedrock/LangChain output appends:

{"cachePoint": {"type": "default"}}

Prompt Layout

PromptLayout helps keep stable content before dynamic content.

from prompt_cache_kit import PromptLayout

layout = (
    PromptLayout()
    .stable_system("You are a precise research assistant.")
    .cache_point("system-v1", provider_hint="anthropic")
    .stable_context("Long product manual or policy document...")
    .cache_point("manual-v1")
    .dynamic_user("What changed in section 4?")
)

messages = layout.to_anthropic_messages()
instrumented = layout.to_instrumented_messages()
issues = layout.lint()
usage_before_call = layout.usage()

Provider payloads contain only fields the provider understands. Internal metadata such as stable and cache_point appears only in the output of to_instrumented_messages().
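The rule that stable content should precede dynamic content is easy to check mechanically. The sketch below shows one check a linter of this kind could perform; `lint_stable_order` is a hypothetical function operating on instrumented-style dicts with a `stable` flag, and the real lint() may report different or additional issues.

```python
def lint_stable_order(messages):
    """Report stable messages that appear after the first dynamic message (sketch)."""
    issues = []
    first_dynamic = None
    for index, message in enumerate(messages):
        if not message.get("stable", False):
            if first_dynamic is None:
                first_dynamic = index
        elif first_dynamic is not None:
            # A stable message placed here cannot be part of a cacheable prefix.
            issues.append(
                f"message {index} is stable but follows dynamic message {first_dynamic}"
            )
    return issues


good = [
    {"role": "system", "content": "You are a precise research assistant.", "stable": True},
    {"role": "user", "content": "Long product manual...", "stable": True},
    {"role": "user", "content": "What changed in section 4?", "stable": False},
]
# Same messages with the manual moved after the question.
bad = [good[0], good[2], good[1]]

print(lint_stable_order(good))
print(lint_stable_order(bad))
```

Prefix-based prompt caches match from the start of the prompt, so a single dynamic message in the middle ends the reusable prefix for everything after it.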

Usage And Telemetry

Normalize provider usage objects:

from prompt_cache_kit import normalize_usage

stats = normalize_usage({
    "usage": {
        "prompt_tokens": 10339,
        "completion_tokens": 60,
        "total_tokens": 10399,
        "prompt_tokens_details": {
            "cached_tokens": 10318,
            "cache_write_tokens": 0,
        },
    }
}, provider="openrouter")

print(stats.cache_read_input_tokens)
print(stats.to_openai_usage())
print(stats.to_otel_attributes())
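What normalization buys you can be seen by doing the arithmetic by hand on the same payload. The sketch below assumes the OpenAI-style convention in which prompt_tokens already includes cached tokens; `summarize_usage` and its output field names are hypothetical and are not the attributes returned by normalize_usage.

```python
def summarize_usage(usage):
    """Derive cache figures from an OpenAI/OpenRouter-style usage dict (sketch)."""
    details = usage.get("prompt_tokens_details") or {}
    prompt = usage.get("prompt_tokens", 0)
    cached = details.get("cached_tokens", 0)
    return {
        "input_tokens": prompt,
        "output_tokens": usage.get("completion_tokens", 0),
        "cache_read_input_tokens": cached,
        # Assumption: prompt_tokens includes the cached portion, so subtract it.
        "uncached_input_tokens": prompt - cached,
        "cache_hit_ratio": cached / prompt if prompt else 0.0,
    }


summary = summarize_usage({
    "prompt_tokens": 10339,
    "completion_tokens": 60,
    "total_tokens": 10399,
    "prompt_tokens_details": {"cached_tokens": 10318, "cache_write_tokens": 0},
})

print(summary["uncached_input_tokens"])
print(round(summary["cache_hit_ratio"], 4))
```

Providers disagree on whether the headline input-token count includes cached tokens, which is why a single normalized shape is useful before exporting metrics.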

LangChain usage extraction:

from prompt_cache_kit import extract_langchain_usage

response = cached_langchain_model.invoke(messages)
usage = extract_langchain_usage(response)
print(usage.to_otel_attributes())

LangChain And LangGraph

Use wrap_langchain_model() for response caching and apply_cache_points() for prompt cache markers:

from prompt_cache_kit import (
    CachePolicy,
    MemoryCacheBackend,
    StablePrefixPromptCachingStrategy,
    apply_cache_points,
    wrap_langchain_model,
)

cached_model = wrap_langchain_model(
    chat_model,
    backend=MemoryCacheBackend(),
    policy=CachePolicy(namespace="langgraph-node", ttl_seconds=300),
)

strategy = StablePrefixPromptCachingStrategy(min_tokens=1024, max_points=4)
messages = apply_cache_points(raw_messages, plan=strategy, provider="bedrock")
response = cached_model.invoke(messages)

For Bedrock via LangChain, the message content follows the documented pattern of text blocks plus a cachePoint block.

CrewAI

from prompt_cache_kit import (
    CachePolicy,
    MemoryCacheBackend,
    apply_cache_points,
    cache_until,
    wrap_crewai_llm,
)

cached_llm = wrap_crewai_llm(
    llm,
    backend=MemoryCacheBackend(),
    policy=CachePolicy(namespace="crew", ttl_seconds=300),
)

messages = apply_cache_points(
    [
        {"role": "system", "content": "Long crew instructions...", "stable": True},
        {"role": "user", "content": "Current task", "stable": False},
    ],
    plan=cache_until(0, id="crew-system-v1"),
    provider="anthropic",
)

result = cached_llm.call(messages)

vLLM And LMCache

For plain vLLM prefix caching:

from prompt_cache_kit import VLLMConfig

cfg = VLLMConfig(model="Qwen/Qwen3-8B", enable_prefix_caching=True)
print("vllm serve " + " ".join(cfg.to_cli_args()))

For current LMCache multiprocess mode:

mp = cfg.with_lmcache_mp(host="127.0.0.1", port=5555)
print("vllm serve " + " ".join(mp.to_cli_args()))
print(mp.to_engine_kwargs())

This emits a kv_transfer_config using LMCacheMPConnector and kv_connector_extra_config with lmcache.mp.host and lmcache.mp.port.
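For orientation, the configuration described above can be written out by hand as the JSON value passed to vLLM's --kv-transfer-config flag. This is a sketch built from the field names in the sentence above; the kv_role value "kv_both" is an assumption, and the exact output of to_cli_args() may differ.

```python
import json
import shlex

# Shape assembled from the description above; kv_role is an assumption.
kv_transfer_config = {
    "kv_connector": "LMCacheMPConnector",
    "kv_role": "kv_both",
    "kv_connector_extra_config": {
        "lmcache.mp.host": "127.0.0.1",
        "lmcache.mp.port": 5555,
    },
}

cli_args = ["--kv-transfer-config", json.dumps(kv_transfer_config)]

# shlex.quote keeps the JSON intact as a single shell argument.
command = "vllm serve Qwen/Qwen3-8B " + " ".join(shlex.quote(arg) for arg in cli_args)
print(command)
```

If you paste a generated command into a shell, quoting matters: an unquoted JSON argument contains spaces and braces that the shell would split or expand.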

For deployments using the simpler LMCacheConnectorV1 shape:

classic = cfg.with_lmcache_v1()
print(classic.kv_transfer_config)

LMCache HTTP control:

from prompt_cache_kit import LMCacheClient

client = LMCacheClient("http://localhost:8080")
print(client.health())
print(client.status())

Examples

See the examples folder.

Package Boundaries

Prompt Cache Kit can:

  • Decide where cache points should go.
  • Render cache points for known provider formats.
  • Cache complete LLM responses.
  • Normalize token/cache usage.
  • Generate vLLM/LMCache configuration snippets.

Prompt Cache Kit cannot:

  • Read or write arbitrary model KV tensors.
  • Make non-deterministic model calls deterministic.
  • Guarantee provider-side prompt cache hits if the provider silently rejects a breakpoint.
  • Replace LMCache, vLLM, SGLang, or provider-native prompt caches.

Development

pip install -e ".[dev]"
python -m pytest

See docs/architecture.md for the internal module layout and extension-point design.
