Framework-neutral LLM cache orchestration, prompt layout, and LMCache/vLLM helper utilities.

Prompt Cache Kit

Prompt Cache Kit is a framework-neutral Python package for LLM cache orchestration. It helps applications decide what should be cacheable, mark provider-specific prompt cache boundaries, cache final responses safely, and normalize cache/token usage telemetry.

It does not claim to implement universal KV-cache reuse around arbitrary model objects. True KV caching belongs inside inference engines such as vLLM, SGLang, and TensorRT-LLM, or in KV-cache layers such as LMCache. This package gives your application a clean control plane around those systems.

What It Provides

Response caching for Python callables and model-like objects.
Prompt caching strategies using an extendable Strategy pattern.
Provider-specific cache-point compilation for Anthropic and Bedrock/LangChain.
Redis and in-memory response-cache backends.
LangChain, LangGraph, and CrewAI-friendly wrapping helpers.
Usage normalization for OpenAI/OpenRouter, Anthropic, and LangChain metadata.
OpenTelemetry GenAI-style usage attribute export.
vLLM prefix-caching and LMCache connector config helpers.

Install

For local development:

pip install -e ".[dev]"

Optional extras:

pip install "prompt-cache-kit[redis]"
pip install "prompt-cache-kit[langchain]"

Mental Model

Prompt Cache Kit separates four layers:

Response cache: stores the final result of deterministic or acceptable-to-replay LLM calls.
Prompt cache strategy: decides which message boundaries should become cache points.
Provider compiler: renders cache points as Anthropic cache_control, Bedrock cachePoint, or generic metadata.
Engine helper: generates vLLM/LMCache configuration and checks LMCache HTTP health/status endpoints.

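This split matters because the three kinds of caching have different guarantees. A response-cache hit skips the model call entirely and replays a stored result. A provider prompt-cache hit still runs the model and only reduces the cost and latency of processing the cached prefix. Engine KV caching reuses attention state inside the inference server and is invisible to the application apart from latency and usage counters.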

Quick Start

from prompt_cache_kit import CachePolicy, MemoryCacheBackend, cached

backend = MemoryCacheBackend()
policy = CachePolicy(namespace="research-agent", ttl_seconds=3600)

@cached(backend=backend, policy=policy)
def call_model(messages, model="gpt-4.1-mini", temperature=0):
    return {"content": "real model response here"}

call_model([{"role": "user", "content": "Explain KV caching"}])
call_model([{"role": "user", "content": "Explain KV caching"}])

print(backend.stats())
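
The second call above can only hit the cache if both calls map to the same key. How Prompt Cache Kit derives its keys is not described here, so the following is only a sketch of one common approach, assuming the key is a hash of the namespace plus the canonicalized call arguments. `make_cache_key` is a hypothetical helper and not part of the package.

```python
import hashlib
import json


def make_cache_key(namespace, messages, **params):
    """Hypothetical helper: derive a stable key from namespace + call arguments."""
    payload = {"messages": messages, "params": params}
    # sort_keys gives a canonical form, so dict ordering does not change the key.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"{namespace}:{digest}"


msgs = [{"role": "user", "content": "Explain KV caching"}]
key_a = make_cache_key("research-agent", msgs, model="gpt-4.1-mini", temperature=0)
key_b = make_cache_key("research-agent", msgs, temperature=0, model="gpt-4.1-mini")
key_c = make_cache_key("research-agent", msgs, model="gpt-4.1-mini", temperature=0.7)

print(key_a == key_b)  # True: same arguments, different keyword order
print(key_a == key_c)  # False: temperature differs
```

Including sampling parameters such as temperature in the key keeps a response generated under one setting from being replayed for a call made under another.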

Response Cache Backends

In Memory

from prompt_cache_kit import MemoryCacheBackend

backend = MemoryCacheBackend(max_size=1000)
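
A max_size bound implies that something is evicted once the cache is full. The eviction policy of MemoryCacheBackend is not stated here; the sketch below assumes least-recently-used eviction purely for illustration. `TinyLRUCache` is a hypothetical class, not the package's implementation.

```python
from collections import OrderedDict


class TinyLRUCache:
    """Illustrative bounded cache; not the package's MemoryCacheBackend."""

    def __init__(self, max_size):
        self.max_size = max_size
        self._items = OrderedDict()

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def set(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.max_size:
            self._items.popitem(last=False)  # evict the least recently used entry


cache = TinyLRUCache(max_size=2)
cache.set("a", 1)
cache.set("b", 2)
cache.get("a")      # touch "a" so "b" becomes the eviction candidate
cache.set("c", 3)   # exceeds max_size, evicts "b"
print(cache.get("b"))  # None
print(cache.get("a"))  # 1
```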

Redis

from openai import OpenAI  # assumed client; any SDK with the same call shape works
from prompt_cache_kit import CachePolicy, RedisCacheBackend, cached

client = OpenAI()  # reads OPENAI_API_KEY from the environment

backend = RedisCacheBackend(
    url="redis://localhost:6379/0",
    namespace="my-agent-cache",
)
policy = CachePolicy(namespace="invoice-agent", ttl_seconds=300)

@cached(backend=backend, policy=policy)
def call_llm(messages, model="gpt-4.1-mini"):
    return client.chat.completions.create(model=model, messages=messages)

RedisCacheBackend serializes responses with pickle, so it can store arbitrary Python objects. Because unpickling untrusted bytes can execute arbitrary code, use a dedicated Redis namespace or database and do not share it across trust boundaries.
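
To make the trust-boundary warning concrete, here is a small standard-library demonstration that loading a pickle calls whatever callable the bytes name. It uses the harmless built-in len; the `Payload` class is a stand-in invented for this example and has nothing to do with the package.

```python
import pickle


class Payload:
    """Stands in for attacker-controlled bytes written to a shared cache."""

    def __reduce__(self):
        # pickle will call len("abc") at load time; a hostile payload
        # could name any importable callable here instead.
        return (len, ("abc",))


blob = pickle.dumps(Payload())
loaded = pickle.loads(blob)

print(type(loaded).__name__, loaded)  # int 3
```

The loaded value is the result of the call, not a Payload instance, which is why anyone who can write to the Redis database can run code in every process that reads from it.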

Wrapping Model Objects

Prompt Cache Kit uses duck typing. It can wrap common model methods without importing the framework.

from prompt_cache_kit import CachedModel, MemoryCacheBackend

cached_model = CachedModel(
    model=some_model,
    backend=MemoryCacheBackend(),
    methods=("invoke", "ainvoke", "__call__"),
)

result = cached_model.invoke("hello")

Convenience wrappers:

from prompt_cache_kit import wrap_crewai_llm, wrap_langchain_model

cached_langchain_model = wrap_langchain_model(chat_model)
cached_crewai_llm = wrap_crewai_llm(llm)

Streaming/generator methods pass through uncached by default. Caching a streamed response correctly needs a recorder/replay layer, and that work is intentionally left to the caller rather than hidden inside the wrapper.
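
A recorder/replay layer can be sketched with plain generators. This is a minimal illustration under the assumption that chunks are strings and that a partially consumed stream should never be stored. `record_stream`, `replay_stream`, and `fake_model_stream` are hypothetical names, not package APIs.

```python
def record_stream(chunks, sink):
    """Yield chunks to the caller while appending each one to `sink`."""
    for chunk in chunks:
        sink.append(chunk)
        yield chunk


def replay_stream(recorded):
    """Yield previously recorded chunks in their original order."""
    yield from recorded


def fake_model_stream():
    # Stand-in for a real streaming model call.
    yield from ["KV ", "caching ", "reuses ", "attention state."]


store = {}
recorded = []
first = "".join(record_stream(fake_model_stream(), recorded))
store["prompt-key"] = recorded  # commit only after the stream finished cleanly

second = "".join(replay_stream(store["prompt-key"]))
print(first == second)  # True
```

Committing the recording only after the stream is exhausted avoids replaying a truncated answer when a consumer disconnects midway.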

Prompt Caching Strategy Pattern

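The main extension point in the low-level design is PromptCachingStrategy. Subclass it and implement select(), which receives the call context and returns a list of CacheDirective objects.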

from prompt_cache_kit import CacheDirective, PromptCachingStrategy

class LastAssistantStrategy(PromptCachingStrategy):
    def select(self, context):
        for index in range(len(context.messages) - 1, -1, -1):
            if context.messages[index]["role"] == "assistant":
                return [CacheDirective(message_index=index, id="last-assistant-prefix")]
        return []

Built-in strategies:

CacheUntilPromptCachingStrategy
ManualPromptCachingStrategy
RollingPromptCachingStrategy
StablePrefixPromptCachingStrategy

Convenience factories:

from prompt_cache_kit import cache_at, cache_until, rolling_cache

manual = cache_at(0, 2, 4, ids={0: "system-v3", 4: "old-thread-v1"}).excluding(2)
boundary = cache_until(1, id="system-and-document-v1")
rolling = rolling_cache(every_messages=2, min_tokens=1024, max_points=4)

Compile a strategy into provider-specific messages:

from prompt_cache_kit import apply_cache_points

messages = [
    {"role": "system", "content": "Long stable system prompt...", "stable": True},
    {"role": "user", "content": "Long stable uploaded document...", "stable": True},
    {"role": "user", "content": "What changed in section 4?", "stable": False},
]

anthropic_messages = apply_cache_points(messages, plan=boundary, provider="anthropic")
bedrock_messages = apply_cache_points(messages, plan=boundary, provider="bedrock")

Anthropic output marks the text block with:

{"cache_control": {"type": "ephemeral"}}

Bedrock/LangChain output appends:

{"cachePoint": {"type": "default"}}
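
The two marker shapes can be shown on a single message. The helpers below are hypothetical and only sketch the transformation, assuming Anthropic-style typed text blocks and Bedrock Converse-style content lists; the output of apply_cache_points may differ in details such as how system messages are handled.

```python
import copy


def mark_anthropic(message):
    """Return a copy whose last text block carries cache_control."""
    msg = copy.deepcopy(message)
    if isinstance(msg["content"], str):
        msg["content"] = [{"type": "text", "text": msg["content"]}]
    msg["content"][-1]["cache_control"] = {"type": "ephemeral"}
    return msg


def mark_bedrock(message):
    """Return a copy with a cachePoint block appended after the text."""
    msg = copy.deepcopy(message)
    if isinstance(msg["content"], str):
        msg["content"] = [{"text": msg["content"]}]
    msg["content"].append({"cachePoint": {"type": "default"}})
    return msg


doc = {"role": "user", "content": "Long stable uploaded document..."}
print(mark_anthropic(doc))
print(mark_bedrock(doc))
```

In the Anthropic shape the marker is a field on an existing block, while in the Bedrock shape it is a separate block that follows the text it covers.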

Prompt Layout

PromptLayout helps keep stable content before dynamic content.

from prompt_cache_kit import PromptLayout

layout = (
    PromptLayout()
    .stable_system("You are a precise research assistant.")
    .cache_point("system-v1", provider_hint="anthropic")
    .stable_context("Long product manual or policy document...")
    .cache_point("manual-v1")
    .dynamic_user("What changed in section 4?")
)

messages = layout.to_anthropic_messages()
instrumented = layout.to_instrumented_messages()
issues = layout.lint()
usage_before_call = layout.usage()

Provider payloads contain no internal metadata. Fields such as stable and cache_point appear only in the output of to_instrumented_messages().
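
Ordering matters because provider prompt caches match on a prefix: any dynamic content placed before stable content changes the prefix and defeats reuse of everything after it. What layout.lint() actually checks is not listed here; the sketch below shows one plausible check on instrumented-style messages. `find_layout_issues` is a hypothetical function.

```python
def find_layout_issues(messages):
    """Flag stable messages that appear after the first dynamic message."""
    issues = []
    first_dynamic = None
    for index, message in enumerate(messages):
        if not message.get("stable", False):
            if first_dynamic is None:
                first_dynamic = index
        elif first_dynamic is not None:
            issues.append(
                f"message {index} is stable but follows dynamic message {first_dynamic}"
            )
    return issues


good = [
    {"role": "system", "content": "Rules", "stable": True},
    {"role": "user", "content": "Question", "stable": False},
]
bad = [
    {"role": "user", "content": "Question", "stable": False},
    {"role": "user", "content": "Manual", "stable": True},
]
print(find_layout_issues(good))  # []
print(find_layout_issues(bad))
```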

Usage And Telemetry

Normalize provider usage objects:

from prompt_cache_kit import normalize_usage

stats = normalize_usage({
    "usage": {
        "prompt_tokens": 10339,
        "completion_tokens": 60,
        "total_tokens": 10399,
        "prompt_tokens_details": {
            "cached_tokens": 10318,
            "cache_write_tokens": 0,
        },
    }
}, provider="openrouter")

print(stats.cache_read_input_tokens)
print(stats.to_openai_usage())
print(stats.to_otel_attributes())
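
To see what the numbers in that payload mean, the cache hit ratio can be computed directly from the raw OpenAI/OpenRouter-style usage dict. This is a standalone sketch that does not use the package; `cache_hit_ratio` is a hypothetical helper.

```python
def cache_hit_ratio(usage):
    """Fraction of prompt tokens served from the provider's prompt cache."""
    prompt_tokens = usage.get("prompt_tokens", 0)
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens", 0)
    return cached / prompt_tokens if prompt_tokens else 0.0


usage = {
    "prompt_tokens": 10339,
    "completion_tokens": 60,
    "total_tokens": 10399,
    "prompt_tokens_details": {"cached_tokens": 10318, "cache_write_tokens": 0},
}
ratio = cache_hit_ratio(usage)
print(f"{ratio:.3f}")  # 0.998
print(10339 - 10318)   # 21 uncached prompt tokens
```

Here all but 21 of the 10339 prompt tokens were read from the cache, which is the pattern to expect when a long stable prefix is followed by a short dynamic question.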

LangChain usage extraction:

from prompt_cache_kit import extract_langchain_usage

response = cached_langchain_model.invoke(messages)
usage = extract_langchain_usage(response)
print(usage.to_otel_attributes())

LangChain And LangGraph

Use wrap_langchain_model() for response caching and apply_cache_points() for prompt cache markers:

from prompt_cache_kit import (
    CachePolicy,
    MemoryCacheBackend,
    StablePrefixPromptCachingStrategy,
    apply_cache_points,
    wrap_langchain_model,
)

cached_model = wrap_langchain_model(
    chat_model,
    backend=MemoryCacheBackend(),
    policy=CachePolicy(namespace="langgraph-node", ttl_seconds=300),
)

strategy = StablePrefixPromptCachingStrategy(min_tokens=1024, max_points=4)
messages = apply_cache_points(raw_messages, plan=strategy, provider="bedrock")
response = cached_model.invoke(messages)

For Bedrock via LangChain, the message content follows the documented pattern of text blocks plus a cachePoint block.

CrewAI

from prompt_cache_kit import CachePolicy, MemoryCacheBackend, cache_until, apply_cache_points, wrap_crewai_llm

cached_llm = wrap_crewai_llm(
    llm,
    backend=MemoryCacheBackend(),
    policy=CachePolicy(namespace="crew", ttl_seconds=300),
)

messages = apply_cache_points(
    [
        {"role": "system", "content": "Long crew instructions...", "stable": True},
        {"role": "user", "content": "Current task", "stable": False},
    ],
    plan=cache_until(0, id="crew-system-v1"),
    provider="anthropic",
)

result = cached_llm.call(messages)

vLLM And LMCache

For plain vLLM prefix caching:

from prompt_cache_kit import VLLMConfig

cfg = VLLMConfig(model="Qwen/Qwen3-8B", enable_prefix_caching=True)
print("vllm serve " + " ".join(cfg.to_cli_args()))

For current LMCache multiprocess mode:

mp = cfg.with_lmcache_mp(host="127.0.0.1", port=5555)
print("vllm serve " + " ".join(mp.to_cli_args()))
print(mp.to_engine_kwargs())

This emits a kv_transfer_config using LMCacheMPConnector and kv_connector_extra_config with lmcache.mp.host and lmcache.mp.port.
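
For orientation, the emitted configuration might look roughly like the following. The connector name and the lmcache.mp.host / lmcache.mp.port keys come from the description above; the kv_role value and the flat dotted-key layout are assumptions, so compare against the real output of to_cli_args() before relying on it.

```python
import json
import shlex

# Assumed layout: connector name and extra-config keys are taken from the
# description above; kv_role and the exact nesting are guesses.
kv_transfer_config = {
    "kv_connector": "LMCacheMPConnector",
    "kv_role": "kv_both",
    "kv_connector_extra_config": {
        "lmcache.mp.host": "127.0.0.1",
        "lmcache.mp.port": 5555,
    },
}

args = [
    "vllm", "serve", "Qwen/Qwen3-8B",
    "--enable-prefix-caching",
    "--kv-transfer-config", json.dumps(kv_transfer_config),
]
print(shlex.join(args))  # shell-quoted command line
```

Using shlex.join keeps the JSON argument correctly quoted when the command is pasted into a shell.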

For deployments using the simpler LMCacheConnectorV1 shape:

classic = cfg.with_lmcache_v1()
print(classic.kv_transfer_config)

LMCache HTTP control:

from prompt_cache_kit import LMCacheClient

client = LMCacheClient("http://localhost:8080")
print(client.health())
print(client.status())

Examples

See the examples folder:

custom_inference.py: response caching for a custom model function.
custom_strategy.py: extend PromptCachingStrategy.
redis_backend.py: Redis response-cache backend.
langgraph_usage.py: LangGraph-style node integration.
crewai_usage.py: CrewAI LLM wrapping.
langchain_bedrock_cachepoint.py: LangChain Bedrock cachePoint content.
vllm_lmcache_config.py: vLLM prefix caching and LMCache connector config.

Package Boundaries

Prompt Cache Kit can:

Decide where cache points should go.
Render cache points for known provider formats.
Cache complete LLM responses.
Normalize token/cache usage.
Generate vLLM/LMCache configuration snippets.

Prompt Cache Kit cannot:

Read or write arbitrary model KV tensors.
Make non-deterministic model calls deterministic.
Guarantee provider-side prompt cache hits if the provider silently rejects a breakpoint.
Replace LMCache, vLLM, SGLang, or provider-native prompt caches.

Development

pip install -e ".[dev]"
python -m pytest

See docs/architecture.md for the internal module layout and extension-point design.

Version 0.1.0, released May 10, 2026.
