Framework-neutral LLM cache orchestration, prompt layout, and LMCache/vLLM helper utilities.
Prompt Cache Kit
Prompt Cache Kit is a framework-neutral Python package for LLM cache orchestration. It helps applications decide what should be cacheable, mark provider-specific prompt cache boundaries, cache final responses safely, and normalize cache/token usage telemetry.
It does not claim to implement universal KV-cache reuse around arbitrary model objects. True KV caching belongs inside inference engines such as vLLM, SGLang, TensorRT-LLM, or LMCache. This package gives your application a clean control plane around those systems.
What It Provides
- Response caching for Python callables and model-like objects.
- Prompt caching strategies using an extendable Strategy pattern.
- Provider-specific cache-point compilation for Anthropic and Bedrock/LangChain.
- Redis and in-memory response-cache backends.
- LangChain, LangGraph, and CrewAI-friendly wrapping helpers.
- Usage normalization for OpenAI/OpenRouter, Anthropic, and LangChain metadata.
- OpenTelemetry GenAI-style usage attribute export.
- vLLM prefix-caching and LMCache connector config helpers.
Install
For local development:
pip install -e ".[dev]"
Optional extras:
pip install "prompt-cache-kit[redis]"
pip install "prompt-cache-kit[langchain]"
Mental Model
Prompt Cache Kit separates four layers:
- Response cache: stores the final result of deterministic or acceptable-to-replay LLM calls.
- Prompt cache strategy: decides which message boundaries should become cache points.
- Provider compiler: renders cache points as Anthropic cache_control, Bedrock cachePoint, or generic metadata.
- Engine helper: generates vLLM/LMCache configuration and checks LMCache HTTP health/status endpoints.
This split matters because response caching, provider prompt caching, and engine KV caching have different guarantees.
Quick Start
from prompt_cache_kit import CachePolicy, MemoryCacheBackend, cached
backend = MemoryCacheBackend()
policy = CachePolicy(namespace="research-agent", ttl_seconds=3600)
@cached(backend=backend, policy=policy)
def call_model(messages, model="gpt-4.1-mini", temperature=0):
    return {"content": "real model response here"}
call_model([{"role": "user", "content": "Explain KV caching"}])
call_model([{"role": "user", "content": "Explain KV caching"}])
print(backend.stats())
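The second call above repeats the first call's arguments, so it is presumably answered from the cache. The following is a minimal sketch of how such a cache key could be derived, assuming a hash over canonical JSON of the namespace and call arguments. `make_cache_key` is a hypothetical helper for illustration, not the package's actual key function.

```python
import hashlib
import json


def make_cache_key(namespace, messages, **params):
    """Hypothetical helper: derive a stable key from the call arguments."""
    payload = {"messages": messages, "params": params}
    # sort_keys makes the JSON canonical, so keyword order does not change the key.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"{namespace}:{digest}"


msgs = [{"role": "user", "content": "Explain KV caching"}]
key_a = make_cache_key("research-agent", msgs, model="gpt-4.1-mini", temperature=0)
key_b = make_cache_key("research-agent", msgs, temperature=0, model="gpt-4.1-mini")
key_c = make_cache_key("research-agent", msgs, model="gpt-4.1-mini", temperature=0.7)
```

Here `key_a` and `key_b` match because only the keyword order differs, while `key_c` differs because the sampling temperature changed. Including sampling parameters in the key is a design choice that keeps responses generated under different settings from being mixed.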
Response Cache Backends
In Memory
from prompt_cache_kit import MemoryCacheBackend
backend = MemoryCacheBackend(max_size=1000)
Redis
from prompt_cache_kit import CachePolicy, RedisCacheBackend, cached
backend = RedisCacheBackend(
    url="redis://localhost:6379/0",
    namespace="my-agent-cache",
)
policy = CachePolicy(namespace="invoice-agent", ttl_seconds=300)
@cached(backend=backend, policy=policy)
def call_llm(messages, model="gpt-4.1-mini"):
    # `client` is assumed to be an OpenAI-compatible client created elsewhere.
    return client.chat.completions.create(model=model, messages=messages)
RedisCacheBackend stores arbitrary Python responses with pickle. Use a dedicated
Redis namespace/database and avoid sharing it across trust boundaries.
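The trust-boundary warning exists because unpickling data can execute arbitrary code, so anyone who can write to the store can run code in the reader. Below is a sketch of one possible mitigation if you handle serialization yourself: sign each payload with HMAC and verify the signature before unpickling. `SECRET`, `sign_payload`, and `load_verified` are hypothetical names; nothing here is part of RedisCacheBackend.

```python
import hashlib
import hmac
import pickle

SECRET = b"replace-with-a-real-secret"  # hypothetical key; load from config in practice
_SIG_LEN = hashlib.sha256().digest_size  # 32 bytes


def sign_payload(obj):
    """Pickle obj and prepend an HMAC-SHA256 signature."""
    blob = pickle.dumps(obj)
    sig = hmac.new(SECRET, blob, hashlib.sha256).digest()
    return sig + blob


def load_verified(data):
    """Verify the signature before unpickling; raise ValueError on mismatch."""
    sig, blob = data[:_SIG_LEN], data[_SIG_LEN:]
    expected = hmac.new(SECRET, blob, hashlib.sha256).digest()
    if not hmac.compare_digest(sig, expected):
        raise ValueError("cache entry failed signature check")
    return pickle.loads(blob)


stored = sign_payload({"content": "cached response"})
restored = load_verified(stored)
# Flip one bit in the stored bytes to simulate tampering.
tampered = stored[:-1] + bytes([stored[-1] ^ 0x01])
```

Calling `load_verified(tampered)` raises `ValueError` before any unpickling happens. Signing only helps while the secret stays private, so it complements a dedicated Redis database rather than replacing it.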
Wrapping Model Objects
Prompt Cache Kit uses duck typing. It can wrap common model methods without importing the framework.
from prompt_cache_kit import CachedModel, MemoryCacheBackend
cached_model = CachedModel(
    model=some_model,
    backend=MemoryCacheBackend(),
    methods=("invoke", "ainvoke", "__call__"),
)
result = cached_model.invoke("hello")
Convenience wrappers:
from prompt_cache_kit import wrap_crewai_llm, wrap_langchain_model
cached_langchain_model = wrap_langchain_model(chat_model)
cached_crewai_llm = wrap_crewai_llm(llm)
Streaming/generator methods pass through by default. Correct streamed-response caching needs a recorder/replay layer and is intentionally not hidden here.
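To illustrate what a recorder/replay layer involves, here is a minimal sketch using a plain dict as the store. `record_stream` and `replay_stream` are hypothetical helpers, not part of the package; a real layer would also need to handle async streams, errors, and TTLs.

```python
def record_stream(chunks, store, key):
    """Yield chunks from a stream while recording them; commit only on completion."""
    recorded = []
    for chunk in chunks:
        recorded.append(chunk)
        yield chunk
    # Reached only if the consumer drained the stream without error.
    store[key] = recorded


def replay_stream(store, key):
    """Yield previously recorded chunks in their original order."""
    yield from store[key]


store = {}
live = list(record_stream(iter(["KV ", "caching ", "reuses..."]), store, "k1"))
replayed = list(replay_stream(store, "k1"))

# A partially consumed stream is never committed to the store.
partial = record_stream(iter(["a", "b"]), store, "k2")
next(partial)
partial.close()
```

Committing only after the stream is fully drained avoids caching a truncated response when the consumer disconnects partway through.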
Prompt Caching Strategy Pattern
The main low-level design (LLD) extension point is PromptCachingStrategy.
from prompt_cache_kit import CacheDirective, PromptCachingStrategy
class LastAssistantStrategy(PromptCachingStrategy):
    def select(self, context):
        # Walk backwards to find the most recent assistant message.
        for index in range(len(context.messages) - 1, -1, -1):
            if context.messages[index]["role"] == "assistant":
                return [CacheDirective(message_index=index, id="last-assistant-prefix")]
        return []
Built-in strategies:
- CacheUntilPromptCachingStrategy
- ManualPromptCachingStrategy
- RollingPromptCachingStrategy
- StablePrefixPromptCachingStrategy
Convenience factories:
from prompt_cache_kit import cache_at, cache_until, rolling_cache
manual = cache_at(0, 2, 4, ids={0: "system-v3", 4: "old-thread-v1"}).excluding(2)
boundary = cache_until(1, id="system-and-document-v1")
rolling = rolling_cache(every_messages=2, min_tokens=1024, max_points=4)
Compile a strategy into provider-specific messages:
from prompt_cache_kit import apply_cache_points
messages = [
    {"role": "system", "content": "Long stable system prompt...", "stable": True},
    {"role": "user", "content": "Long stable uploaded document...", "stable": True},
    {"role": "user", "content": "What changed in section 4?", "stable": False},
]
anthropic_messages = apply_cache_points(messages, plan=boundary, provider="anthropic")
bedrock_messages = apply_cache_points(messages, plan=boundary, provider="bedrock")
Anthropic output marks the text block with:
{"cache_control": {"type": "ephemeral"}}
Bedrock/LangChain output appends:
{"cachePoint": {"type": "default"}}
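For orientation, the sketch below shows what a single marked message might look like in each format. `mark_anthropic` and `mark_bedrock` are hypothetical stand-ins, not the package's compiler, and the surrounding text-block shape (`{"type": "text", ...}`) is an assumption; only the two marker objects come from the description above.

```python
import copy


def mark_anthropic(message):
    """Return a copy whose last text block carries an Anthropic cache_control marker."""
    msg = copy.deepcopy(message)
    if isinstance(msg["content"], str):
        # Promote a plain string to a list of content blocks.
        msg["content"] = [{"type": "text", "text": msg["content"]}]
    msg["content"][-1]["cache_control"] = {"type": "ephemeral"}
    return msg


def mark_bedrock(message):
    """Return a copy with a Bedrock-style cachePoint block appended after the text."""
    msg = copy.deepcopy(message)
    if isinstance(msg["content"], str):
        msg["content"] = [{"type": "text", "text": msg["content"]}]
    msg["content"].append({"cachePoint": {"type": "default"}})
    return msg


doc = {"role": "user", "content": "Long stable uploaded document..."}
anthropic_msg = mark_anthropic(doc)
bedrock_msg = mark_bedrock(doc)
```

The difference to notice is placement: the Anthropic marker is a field on the text block itself, while the Bedrock marker is a separate block that follows the text.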
Prompt Layout
PromptLayout helps keep stable content before dynamic content.
from prompt_cache_kit import PromptLayout
layout = (
    PromptLayout()
    .stable_system("You are a precise research assistant.")
    .cache_point("system-v1", provider_hint="anthropic")
    .stable_context("Long product manual or policy document...")
    .cache_point("manual-v1")
    .dynamic_user("What changed in section 4?")
)
messages = layout.to_anthropic_messages()
instrumented = layout.to_instrumented_messages()
issues = layout.lint()
usage_before_call = layout.usage()
Provider payloads are clean. Internal metadata such as stable and cache_point
only appears in to_instrumented_messages().
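Ordering matters because prefix-based caches match from the start of the prompt, so dynamic content placed early invalidates everything after it. The sketch below shows the kind of check a layout lint could perform, using the `stable` flag from the earlier message examples. `find_layout_issues` is a hypothetical function and does not describe what `layout.lint()` actually checks.

```python
def find_layout_issues(messages):
    """Hypothetical lint: report stable messages that follow dynamic ones."""
    issues = []
    seen_dynamic = False
    for index, message in enumerate(messages):
        if not message.get("stable", False):
            seen_dynamic = True
        elif seen_dynamic:
            issues.append(f"message {index} is stable but follows dynamic content")
    return issues


good = [
    {"role": "system", "content": "Instructions", "stable": True},
    {"role": "user", "content": "Question", "stable": False},
]
bad = [
    {"role": "user", "content": "Question", "stable": False},
    {"role": "user", "content": "Manual", "stable": True},
]
good_issues = find_layout_issues(good)
bad_issues = find_layout_issues(bad)
```

The `good` layout yields no issues, while `bad` is flagged because the stable manual follows the dynamic question and so cannot form a reusable prefix.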
Usage And Telemetry
Normalize provider usage objects:
from prompt_cache_kit import normalize_usage
stats = normalize_usage({
    "usage": {
        "prompt_tokens": 10339,
        "completion_tokens": 60,
        "total_tokens": 10399,
        "prompt_tokens_details": {
            "cached_tokens": 10318,
            "cache_write_tokens": 0,
        },
    }
}, provider="openrouter")
print(stats.cache_read_input_tokens)
print(stats.to_openai_usage())
print(stats.to_otel_attributes())
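One quantity such normalized numbers make easy to derive is the share of prompt tokens read from cache. The sketch below computes it directly from the raw usage dict shown above; `cache_hit_ratio` is a hypothetical helper, not a method of the normalized stats object.

```python
def cache_hit_ratio(usage):
    """Fraction of prompt tokens served from cache, from an OpenAI-style usage dict."""
    prompt_tokens = usage.get("prompt_tokens", 0)
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens", 0)
    return cached / prompt_tokens if prompt_tokens else 0.0


usage = {
    "prompt_tokens": 10339,
    "completion_tokens": 60,
    "total_tokens": 10399,
    "prompt_tokens_details": {"cached_tokens": 10318, "cache_write_tokens": 0},
}
ratio = cache_hit_ratio(usage)
# Tokens that still had to be processed as fresh input.
uncached = usage["prompt_tokens"] - usage["prompt_tokens_details"]["cached_tokens"]
```

For this payload only 21 of 10339 prompt tokens were not read from cache, a ratio of roughly 0.998.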
LangChain usage extraction:
from prompt_cache_kit import extract_langchain_usage
response = cached_langchain_model.invoke(messages)
usage = extract_langchain_usage(response)
print(usage.to_otel_attributes())
LangChain And LangGraph
Use wrap_langchain_model() for response caching and apply_cache_points() for
prompt cache markers:
from prompt_cache_kit import (
    CachePolicy,
    MemoryCacheBackend,
    StablePrefixPromptCachingStrategy,
    apply_cache_points,
    wrap_langchain_model,
)
cached_model = wrap_langchain_model(
    chat_model,
    backend=MemoryCacheBackend(),
    policy=CachePolicy(namespace="langgraph-node", ttl_seconds=300),
)
strategy = StablePrefixPromptCachingStrategy(min_tokens=1024, max_points=4)
messages = apply_cache_points(raw_messages, plan=strategy, provider="bedrock")
response = cached_model.invoke(messages)
For Bedrock via LangChain, the message content follows the documented pattern of
text blocks plus a cachePoint block.
CrewAI
from prompt_cache_kit import CachePolicy, MemoryCacheBackend, cache_until, apply_cache_points, wrap_crewai_llm
cached_llm = wrap_crewai_llm(
    llm,
    backend=MemoryCacheBackend(),
    policy=CachePolicy(namespace="crew", ttl_seconds=300),
)
messages = apply_cache_points(
    [
        {"role": "system", "content": "Long crew instructions...", "stable": True},
        {"role": "user", "content": "Current task", "stable": False},
    ],
    plan=cache_until(0, id="crew-system-v1"),
    provider="anthropic",
)
result = cached_llm.call(messages)
vLLM And LMCache
For plain vLLM prefix caching:
from prompt_cache_kit import VLLMConfig
cfg = VLLMConfig(model="Qwen/Qwen3-8B", enable_prefix_caching=True)
print("vllm serve " + " ".join(cfg.to_cli_args()))
For current LMCache multiprocess mode:
mp = cfg.with_lmcache_mp(host="127.0.0.1", port=5555)
print("vllm serve " + " ".join(mp.to_cli_args()))
print(mp.to_engine_kwargs())
This emits a kv_transfer_config using LMCacheMPConnector and
kv_connector_extra_config with lmcache.mp.host and lmcache.mp.port.
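As a rough picture of that structure, the sketch below builds a dict with the connector name and extra-config keys named above and serializes it for vLLM's `--kv-transfer-config` flag. The `kv_role` value and the exact nesting are assumptions, and `lmcache_mp_transfer_config` is a hypothetical helper; check the output of `to_cli_args()` for the real shape.

```python
import json
import shlex


def lmcache_mp_transfer_config(host, port):
    """Assumed shape of the connector config described above."""
    return {
        "kv_connector": "LMCacheMPConnector",
        "kv_role": "kv_both",  # assumption: the role value is not stated in this README
        "kv_connector_extra_config": {
            "lmcache.mp.host": host,
            "lmcache.mp.port": port,
        },
    }


config = lmcache_mp_transfer_config("127.0.0.1", 5555)
# Quote the JSON so it survives as a single shell argument.
cli_arg = "--kv-transfer-config " + shlex.quote(json.dumps(config))
```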
For deployments using the simpler LMCacheConnectorV1 shape:
classic = cfg.with_lmcache_v1()
print(classic.kv_transfer_config)
LMCache HTTP control:
from prompt_cache_kit import LMCacheClient
client = LMCacheClient("http://localhost:8080")
print(client.health())
print(client.status())
Examples
See the examples folder:
- custom_inference.py: response caching for a custom model function.
- custom_strategy.py: extend PromptCachingStrategy.
- redis_backend.py: Redis response-cache backend.
- langgraph_usage.py: LangGraph-style node integration.
- crewai_usage.py: CrewAI LLM wrapping.
- langchain_bedrock_cachepoint.py: LangChain Bedrock cachePoint content.
- vllm_lmcache_config.py: vLLM prefix caching and LMCache connector config.
Package Boundaries
Prompt Cache Kit can:
- Decide where cache points should go.
- Render cache points for known provider formats.
- Cache complete LLM responses.
- Normalize token/cache usage.
- Generate vLLM/LMCache configuration snippets.
Prompt Cache Kit cannot:
- Read or write arbitrary model KV tensors.
- Make non-deterministic model calls deterministic.
- Guarantee provider-side prompt cache hits if the provider silently rejects a breakpoint.
- Replace LMCache, vLLM, SGLang, or provider-native prompt caches.
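Because a provider-side hit cannot be guaranteed, the practical check is to inspect the usage returned with each response. The sketch below classifies an Anthropic-style usage dict using its `cache_read_input_tokens` and `cache_creation_input_tokens` counters; `classify_prompt_cache` is a hypothetical helper, not part of the package.

```python
def classify_prompt_cache(usage):
    """Classify an Anthropic-style usage dict as 'hit', 'write', or 'none'."""
    if usage.get("cache_read_input_tokens", 0) > 0:
        return "hit"
    if usage.get("cache_creation_input_tokens", 0) > 0:
        return "write"
    # Neither counter moved: the breakpoint was likely ignored, for example
    # because the prefix was shorter than the provider's minimum cacheable length.
    return "none"


first_call = classify_prompt_cache(
    {"input_tokens": 21, "cache_creation_input_tokens": 10318, "cache_read_input_tokens": 0}
)
second_call = classify_prompt_cache(
    {"input_tokens": 21, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 10318}
)
ignored = classify_prompt_cache({"input_tokens": 500})
```

A repeated "none" on requests that were expected to hit is a signal to revisit where the cache points are placed.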
Development
pip install -e ".[dev]"
python -m pytest
See docs/architecture.md for the internal module layout and extension-point design.