Cache-optimized LLM conversation history management with static/dynamic system prompts, transition modes, and compaction hooks.
llmbuffer
Cache-optimized LLM conversation history management.
Most LLM applications naively concatenate their system prompt, conversation history, and any dynamic context into a single message list — and rebuild it from scratch every turn. This works, but it leaves significant money and latency on the table by constantly invalidating the provider's prompt cache.
llmbuffer assembles your messages in the order that maximises cache reuse, manages the boundary between stable and changing content, and handles compaction when history grows too long — all without you having to think about it.
[Static System Prompt: cached ✓] → [Long-Lived History: cached ✓] → [Dynamic Context: not cached] → [Recent Messages: not cached]
The static system prompt and committed conversation history form a byte-stable prefix that is never mutated or re-ordered across turns. The frequently-changing parts — RAG results, timestamps, in-flight tool calls — live at the end where they can't invalidate the prefix.
Install
pip install llmbuffer
Optional extras for live benchmarking:
pip install "llmbuffer[anthropic]" # Anthropic prompt caching
pip install "llmbuffer[openai]" # OpenAI prefix caching
llmbuffer has zero required dependencies — just Python 3.9+.
Quickstart
Stateful (in-process)
from llmbuffer import PromptManager, PromptConfig, AnthropicAdapter
manager = PromptManager(PromptConfig(
static_system_prompt="You are a senior software engineering assistant...",
transition_mode="agent_cycle", # auto-commit turns to the stable prefix
adapter=AnthropicAdapter(), # inject cache_control markers
max_tokens=8_000, # compact long-lived history beyond this
))
# Each turn:
manager.append({"role": "user", "content": user_message})
messages = manager.build_messages(dynamic_system_prompt=rag_context)
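response = anthropic_client.messages.create(messages=messages, ...)
# Store the reply text rather than the SDK response object (assumes a single text block):
manager.append({"role": "assistant", "content": response.content[0].text})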
Stateless (web app / serverless)
Pure functions over a JSON-serializable state dict — persist it anywhere between requests:
from llmbuffer import functional, new_state, dumps, loads, PromptConfig
config = PromptConfig(
static_system_prompt="You are a senior software engineering assistant...",
transition_mode="manual",
)
# Load state from DB / session
state = loads(row.conversation_json) if row else new_state()
# Build messages, call LLM, store updated state
state = functional.append_message(state, {"role": "user", "content": text}, config)
messages = functional.build_messages(state, config, dynamic_system_prompt=rag_context)
# ... call your LLM ...
state = functional.append_message(state, reply, config)
row.conversation_json = dumps(state)
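The stateless pattern relies on two properties: each operation returns a new state rather than mutating the old one, and the state survives a JSON round trip. The sketch below illustrates both with the standard library only. The names `new_state_sketch` and `append_message_sketch` and the two-key state layout are assumptions for illustration, not llmbuffer's actual state format.

```python
import json
from typing import Any, Dict, List

State = Dict[str, List[Dict[str, Any]]]


def new_state_sketch() -> State:
    """Create an empty, JSON-serializable conversation state (hypothetical layout)."""
    return {"long_lived": [], "short_term": []}


def append_message_sketch(state: State, message: Dict[str, Any]) -> State:
    """Return a new state with the message added; the input dict is left untouched."""
    return {
        "long_lived": list(state["long_lived"]),
        "short_term": [*state["short_term"], message],
    }


state = new_state_sketch()
updated = append_message_sketch(state, {"role": "user", "content": "hello"})

stored = json.dumps(updated)      # what would be written to a DB row or session
restored = json.loads(stored)     # what the next request would load

print(restored == updated)        # True
print(state["short_term"])        # []
```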
How it works
Message ordering
build_messages() always emits messages in this exact order:
| Position | Content | Cache behaviour |
|---|---|---|
| 1 | Static system prompt | Cached — never changes |
| 2 | Long-lived history | Cached — stable, grows slowly |
| 3 | Dynamic context | Not cached — RAG results, timestamps, etc. |
| 4 | Short-term history | Not cached — current turn, tool calls |
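The effect of this ordering can be illustrated with a minimal, library-independent sketch. The helper names `assemble` and `shared_prefix_len` and the message shapes are assumptions for illustration, not llmbuffer internals. The example changes only the dynamic context between two turns and counts how many leading messages stay identical.

```python
from typing import Dict, List, Optional

Message = Dict[str, str]


def assemble(
    static_system: str,
    long_lived: List[Message],
    dynamic_context: Optional[str],
    short_term: List[Message],
) -> List[Message]:
    """Return messages in stable-first order (hypothetical helper)."""
    messages: List[Message] = [{"role": "system", "content": static_system}]  # position 1
    messages.extend(long_lived)                                               # position 2
    if dynamic_context:                                                       # position 3
        messages.append({"role": "system", "content": dynamic_context})
    messages.extend(short_term)                                               # position 4
    return messages


def shared_prefix_len(a: List[Message], b: List[Message]) -> int:
    """Count the leading messages that are identical in both lists."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


history = [
    {"role": "user", "content": "hi"},
    {"role": "assistant", "content": "hello"},
]
question = [{"role": "user", "content": "q1"}]

turn_a = assemble("You are helpful.", history, "time=10:00", question)
turn_b = assemble("You are helpful.", history, "time=10:05", question)

print(shared_prefix_len(turn_a, turn_b))  # 3: static prompt plus two history messages
```

Only the messages from the dynamic context onward differ, so a prefix cache can still reuse everything before it.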
Transition modes
Control when messages graduate from short-term into the stable long-lived history:
| Mode | Behaviour |
|---|---|
| none | Every message goes straight into long-lived history |
| manual | Messages stay short-term until you call transition() |
| agent_cycle | Commits automatically when a non-tool-call assistant message ends the turn |
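The three modes can be modelled with a small toy class. This is a sketch under stated assumptions, not the library's implementation: the class name `TwoTierHistory` is hypothetical, and it assumes a tool-calling assistant message carries a non-empty `tool_calls` key.

```python
from typing import Any, Dict, List

Message = Dict[str, Any]


class TwoTierHistory:
    """Toy model of short-term vs long-lived history (not llmbuffer's class)."""

    def __init__(self, mode: str) -> None:
        if mode not in {"none", "manual", "agent_cycle"}:
            raise ValueError(f"unknown mode: {mode}")
        self.mode = mode
        self.long_lived: List[Message] = []
        self.short_term: List[Message] = []

    def transition(self) -> None:
        """Move everything in short-term onto the end of long-lived history."""
        self.long_lived.extend(self.short_term)
        self.short_term.clear()

    def append(self, message: Message) -> None:
        self.short_term.append(message)
        if self.mode == "none":
            self.transition()
        elif self.mode == "agent_cycle":
            # Assumption: tool-calling assistant messages have a non-empty "tool_calls" key.
            ends_turn = message["role"] == "assistant" and not message.get("tool_calls")
            if ends_turn:
                self.transition()


h = TwoTierHistory("agent_cycle")
h.append({"role": "user", "content": "What is in main.py?"})
h.append({"role": "assistant", "content": "", "tool_calls": [{"name": "read_file"}]})
h.append({"role": "tool", "content": "print('hi')"})
print(len(h.long_lived), len(h.short_term))  # 0 3: the turn is still in flight

h.append({"role": "assistant", "content": "It prints hi."})
print(len(h.long_lived), len(h.short_term))  # 4 0: the whole turn was committed
```

In `manual` mode the same class would leave every message in `short_term` until `transition()` is called explicitly.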
Compaction
When the long-lived history exceeds max_tokens, a compaction hook reduces it to max_tokens // 2 (configurable). The default hook truncates oldest-first; supply your own to summarise instead:
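from llmbuffer import PromptConfig

def summarise(messages, target_tokens, adapter):
    # call_llm_to_summarise is a placeholder for your own summarisation call
    summary = call_llm_to_summarise(messages)
    # Replace the compacted history with a single summary message
    return [{"role": "system", "content": summary}]

config = PromptConfig(max_tokens=8_000, compaction_hook=summarise)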
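For comparison, oldest-first truncation can be sketched in a few lines. This is an illustrative approximation rather than the library's default hook: `estimate_tokens` and `truncate_oldest_first` are hypothetical names, the adapter argument is accepted but ignored, and token counts use the rough 4-characters-per-token heuristic.

```python
from typing import Dict, List

Message = Dict[str, str]


def estimate_tokens(messages: List[Message]) -> int:
    """Rough estimate at about 4 characters per token."""
    return sum(len(m["content"]) for m in messages) // 4


def truncate_oldest_first(
    messages: List[Message], target_tokens: int, adapter=None
) -> List[Message]:
    """Drop messages from the front until the estimate fits within target_tokens."""
    kept = list(messages)
    while kept and estimate_tokens(kept) > target_tokens:
        kept.pop(0)  # discard the oldest remaining message
    return kept


# Ten messages of 400 characters each: about 1,000 tokens in total.
history = [{"role": "user", "content": "x" * 400} for _ in range(10)]
max_tokens = 600

if estimate_tokens(history) > max_tokens:
    history = truncate_oldest_first(history, max_tokens // 2)

print(len(history), estimate_tokens(history))  # 3 300
```

Any compaction rewrites the start of the long-lived history, so the cached prefix has to be rebuilt afterwards. Compacting well below the limit, rather than just under it, keeps those rebuilds infrequent.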
Provider adapters
| Adapter | Cache markers | Token counting |
|---|---|---|
| OpenAIAdapter (default) | None needed (automatic prefix caching) | ~4 chars/token |
| AnthropicAdapter | cache_control: {type: ephemeral} injected at prefix boundaries | ~4 chars/token |
| TransformersAdapter(tok) | None | Exact via HF tokenizer |
Subclass ProviderAdapter to add a new provider — override count_tokens() and/or apply_cache_markers().
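The two responsibilities of an adapter can be sketched as standalone functions. These are assumptions about the general approach, not the bodies of llmbuffer's `count_tokens()` or `apply_cache_markers()`, whose exact signatures are not shown here. The function names `count_tokens_approx` and `mark_prefix_end` are hypothetical. The marker uses Anthropic's content-block form, where `cache_control` is attached to a block.

```python
import copy
from typing import Any, Dict, List

Message = Dict[str, Any]


def count_tokens_approx(text: str) -> int:
    """Heuristic count: about 4 characters per token, rounded up."""
    return (len(text) + 3) // 4


def mark_prefix_end(messages: List[Message], prefix_len: int) -> List[Message]:
    """Return a copy with an ephemeral cache marker on the last prefix message."""
    out = copy.deepcopy(messages)  # never mutate the caller's messages
    if 0 < prefix_len <= len(out):
        msg = out[prefix_len - 1]
        if isinstance(msg["content"], str):
            # Convert plain-string content into content-block form.
            msg["content"] = [{"type": "text", "text": msg["content"]}]
        if msg["content"]:
            msg["content"][-1]["cache_control"] = {"type": "ephemeral"}
    return out


msgs = [
    {"role": "user", "content": "hi"},
    {"role": "assistant", "content": "hello"},
    {"role": "user", "content": "latest question"},
]
marked = mark_prefix_end(msgs, prefix_len=2)

print(marked[1]["content"])
print(count_tokens_approx("abcdefgh"))  # 2
```

The original list is left unchanged, which matters here because the stable prefix must stay byte-identical between turns.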
Benchmark
The benchmark suite runs a multi-turn conversation through both llmbuffer and a naive approach, and reports cache hits from the provider's own usage metadata.
The naive approach puts the static and dynamic system prompts together at the start of every message list and drops the oldest messages when the context limit is hit, a common pattern in LLM applications.
Results (simulated, 15 turns, Anthropic pricing)
The simulated provider models provider prefix caching exactly: a turn is a cache hit when its message list shares a prefix with a previously-seen turn. Run with --provider anthropic or --provider openai for live numbers.
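The hit rule described above can be approximated with a toy cache. This is a sketch, not the benchmark's simulated provider: the class name `SimulatedPrefixCache` is hypothetical, matching is done per message rather than per token, and token counts use the 4-characters-per-token heuristic.

```python
import json
from typing import Dict, List

Message = Dict[str, str]


class SimulatedPrefixCache:
    """Toy prefix cache: reports how many leading tokens were seen before."""

    def __init__(self) -> None:
        self._seen: List[List[str]] = []

    @staticmethod
    def _tokens(message: Message) -> int:
        return max(1, len(message["content"]) // 4)  # about 4 chars/token

    def request(self, messages: List[Message]) -> int:
        """Record a request and return its cached token count."""
        keys = [json.dumps(m, sort_keys=True) for m in messages]
        best = 0
        for prev in self._seen:
            n = 0
            for a, b in zip(prev, keys):
                if a != b:
                    break
                n += 1
            best = max(best, n)  # longest prefix shared with any earlier request
        self._seen.append(keys)
        return sum(self._tokens(m) for m in messages[:best])


cache = SimulatedPrefixCache()
static = {"role": "system", "content": "S" * 400}  # 100 tokens
hist = {"role": "user", "content": "H" * 200}      # 50 tokens
dyn_a = {"role": "system", "content": "A" * 40}
dyn_b = {"role": "system", "content": "B" * 40}

print(cache.request([static, hist, dyn_a]))  # 0: nothing seen yet
print(cache.request([static, hist, dyn_b]))  # 150: static prompt and history reused
print(cache.request([dyn_b, static, hist]))  # 0: dynamic-first ordering shares no prefix
```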
| Turn | Dynamic changed | llmbuffer cached | naive cached |
|---|---|---|---|
| 1 | yes | ✗ 0 | ✗ 0 |
| 2 | — | ✓ 1,213 | ✓ 1,340 |
| 3 | — | ✓ 1,245 | ✓ 1,368 |
| 4 | yes | ✓ 1,274 | ✗ 0 |
| 5 | — | ✓ 1,297 | ✓ 1,416 |
| 6 | — | ✓ 1,325 | ✓ 1,443 |
| 7 | yes | ✓ 1,351 | ✗ 0 |
| 8 | — | ✓ 1,379 | ✓ 1,497 |
| 9 | — | ✓ 1,403 | ✓ 1,525 |
| 10 | yes | ✓ 1,430 | ✗ 0 |
| 11 | — | ✓ 1,458 | ✓ 1,568 |
| 12 | — | ✓ 1,479 | ✓ 1,597 |
| 13 | yes | ✓ 1,507 | ✗ 0 |
| 14 | — | ✓ 1,535 | ✓ 1,651 |
| 15 | — | ✓ 1,561 | ✓ 1,677 |
| Metric | llmbuffer | naive |
|---|---|---|
| Cache hit ratio | 85.3% | 66.1% |
| Total cached tokens | 19,457 | 15,082 |
| Est. cost (Anthropic, with caching) | $0.016 | $0.028 |
| Est. savings vs no caching | 76.7% | 59.5% |
Every time the dynamic context rotates (turns 4, 7, 10, 13) the naive approach suffers a full cache miss — the changed system prompt invalidates the entire prefix. llmbuffer keeps the static system and long-lived history stable, so only the new suffix is uncached regardless of what the dynamic context does.
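The totals in the summary table can be re-derived from the per-turn column. The short check below copies the cached-token counts from the results table and sums them. The list names are illustrative.

```python
# Per-turn cached-token counts copied from the results table (turns 1 to 15).
llmbuffer_cached = [0, 1213, 1245, 1274, 1297, 1325, 1351, 1379,
                    1403, 1430, 1458, 1479, 1507, 1535, 1561]
naive_cached = [0, 1340, 1368, 0, 1416, 1443, 0, 1497,
                1525, 0, 1568, 1597, 0, 1651, 1677]

llmbuffer_total = sum(llmbuffer_cached)
naive_total = sum(naive_cached)

# Turns on which no tokens at all were served from the cache.
llmbuffer_misses = sum(1 for n in llmbuffer_cached if n == 0)
naive_misses = sum(1 for n in naive_cached if n == 0)

print(llmbuffer_total, naive_total)    # 19457 15082
print(llmbuffer_misses, naive_misses)  # 1 5
```

The single llmbuffer miss is the cold first turn. The naive approach misses on that turn and on each of the four turns where the dynamic context rotates.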
Run it yourself
# No API key needed:
uv run python -m llmbuffer.benchmark --provider simulated --compare --turns 15
# Live providers (needs API key):
uv run python -m llmbuffer.benchmark --provider anthropic --compare --turns 15
uv run python -m llmbuffer.benchmark --provider openai --compare --turns 15
uv run python -m llmbuffer.benchmark --provider gemini --compare --turns 15
# Ollama (local, needs server log access):
uv run python -m llmbuffer.benchmark --provider ollama \
--ollama-log ~/.ollama/logs/server.log --compare
# JSON output:
uv run python -m llmbuffer.benchmark --provider anthropic --compare --format json
Development
# Clone and set up:
git clone https://github.com/scottpurdy/llmbuffer
cd llmbuffer
uv sync
# Run tests:
uv run pytest
# Run benchmark (simulated, no API key needed):
uv run python -m llmbuffer.benchmark --provider simulated --compare
The test suite includes explicit cache-stability tests asserting that the static system prompt and long-lived history are byte-identical across turns — verifying the cache prefix is never accidentally mutated.
License
MIT