Cache-optimized LLM conversation history management with static/dynamic system prompts, transition modes, and compaction hooks.

llmbuffer

Cache-optimized LLM conversation history management.

Most LLM applications naively concatenate their system prompt, conversation history, and any dynamic context into a single message list, then rebuild it from scratch every turn. This works, but it wastes money and adds latency, because any change near the start of the list invalidates the provider's prompt cache for everything after it.

llmbuffer assembles your messages in the order that maximises cache reuse, manages the boundary between stable and changing content, and handles compaction when history grows too long — all without you having to think about it.

[Static System Prompt] → [Long-Lived History] → [Dynamic Context] → [Recent Messages]
       cached ✓                  cached ✓             not cached          not cached

The static system prompt and committed conversation history form a byte-stable prefix that is never mutated or re-ordered across turns. The frequently-changing parts — RAG results, timestamps, in-flight tool calls — live at the end where they can't invalidate the prefix.

Install

pip install llmbuffer

Optional extras for live benchmarking:

pip install "llmbuffer[anthropic]"    # Anthropic prompt caching
pip install "llmbuffer[openai]"       # OpenAI prefix caching

llmbuffer has zero required dependencies — just Python 3.9+.

Quickstart

Stateful (in-process)

from llmbuffer import PromptManager, AnthropicAdapter

manager = PromptManager(
    static_system_prompt="You are a senior software engineering assistant...",
    transition_mode="agent_cycle",   # auto-commit turns to the stable prefix
    adapter=AnthropicAdapter(),      # inject cache_control markers
    max_tokens=8_000,                # compact long-lived history beyond this
)

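# Each turn:
manager.append({"role": "user", "content": user_message})
messages = manager.build_messages(dynamic_system_prompt=rag_context)
# request_kwargs is a placeholder for your model, max_tokens, and other settings
response = anthropic_client.messages.create(messages=messages, **request_kwargs)
# Assumes the first content block of the response is text
manager.append({"role": "assistant", "content": response.content[0].text})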

Stateless (web app / serverless)

Pure functions over a JSON-serializable state dict — persist it anywhere between requests:

from llmbuffer import functional, new_state, dumps, loads

SYSTEM = "You are a senior software engineering assistant..."

# Load state from DB / session
state = loads(row.conversation_json) if row else new_state()

# Build messages, call LLM, store updated state
state = functional.append_message(state, {"role": "user", "content": text},
                                  transition_mode="manual")
messages = functional.build_messages(state, static_system_prompt=SYSTEM,
                                     dynamic_system_prompt=rag_context)
# ... call your LLM ...
state = functional.append_message(state, reply, transition_mode="manual")
state = functional.compact(state, max_tokens=8_000)   # explicit in the functional API
row.conversation_json = dumps(state)

Each function takes only the settings it uses — there's no config object to thread through. Compaction is an explicit compact() call in the functional API (the stateful PromptManager runs it automatically).

How it works

Message ordering

build_messages() always emits messages in this exact order:

Position	Content	Cache behaviour
1	Static system prompt	Cached — never changes
2	Long-lived history	Cached — stable, grows slowly
3	Dynamic context	Not cached — RAG results, timestamps, etc.
4	Short-term history	Not cached — current turn, tool calls
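
The ordering in the table can be illustrated with a small stand-alone sketch. It does not use or reproduce llmbuffer's internals: `assemble` and `shared_prefix_len` are hypothetical helpers, and representing the dynamic context as a second system message is an assumption made only for illustration. The point is that two turns which differ only in their dynamic context still share positions 1 and 2 unchanged.

```python
import json

def assemble(static_prompt, long_lived, dynamic_context, short_term):
    """Hypothetical helper: emit messages in the four-position order from the table."""
    messages = [{"role": "system", "content": static_prompt}]  # 1. static system prompt
    messages += long_lived                                     # 2. long-lived history
    if dynamic_context:
        # 3. dynamic context (assumed here to be a system message)
        messages.append({"role": "system", "content": dynamic_context})
    messages += short_term                                     # 4. short-term history
    return messages

def shared_prefix_len(a, b):
    """Count the leading messages that serialize identically in both lists."""
    count = 0
    for x, y in zip(a, b):
        if json.dumps(x, sort_keys=True) != json.dumps(y, sort_keys=True):
            break
        count += 1
    return count

history = [
    {"role": "user", "content": "How do I reverse a list?"},
    {"role": "assistant", "content": "Use reversed() or slicing."},
]
pending = [{"role": "user", "content": "And a string?"}]

turn_a = assemble("You are a helpful assistant.", history, "time: 10:00", pending)
turn_b = assemble("You are a helpful assistant.", history, "time: 10:05", pending)

# The static prompt plus both history messages survive the context change.
stable = shared_prefix_len(turn_a, turn_b)
print(stable)
```

Only the messages from the dynamic context onwards differ between the two turns, which is the part the table marks as not cached.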

Transition modes

Control when messages graduate from short-term into the stable long-lived history:

Mode	Behaviour
`none`	Every message goes straight into long-lived history
`manual`	Messages stay short-term until you call `transition()`
`agent_cycle`	Commits automatically when a non-tool-call assistant message ends the turn
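
As a rough illustration of the three modes, the sketch below models the two histories as plain lists. It is not llmbuffer's code: `append_message` and `transition` here are hypothetical stand-ins, and detecting a tool call through a `tool_calls` key is an assumption borrowed from OpenAI-style message dicts.

```python
def append_message(long_lived, short_term, message, mode):
    """Hypothetical stand-in: return the new (long_lived, short_term) pair."""
    short_term = short_term + [message]
    if mode == "none":
        commit = True    # every message goes straight to long-lived history
    elif mode == "manual":
        commit = False   # wait for an explicit transition() call
    elif mode == "agent_cycle":
        # Commit once an assistant message without tool calls ends the turn.
        commit = message.get("role") == "assistant" and not message.get("tool_calls")
    else:
        raise ValueError(f"unknown transition mode: {mode!r}")
    if commit:
        return long_lived + short_term, []
    return long_lived, short_term

def transition(long_lived, short_term):
    """Manually move everything from short-term into long-lived history."""
    return long_lived + short_term, []

turn = [
    {"role": "user", "content": "List the files."},
    {"role": "assistant", "content": None, "tool_calls": [{"id": "call_1"}]},
    {"role": "tool", "content": "a.py\nb.py"},
    {"role": "assistant", "content": "There are two files."},
]

long_lived, short_term = [], []
sizes = []
for msg in turn:
    long_lived, short_term = append_message(long_lived, short_term, msg, "agent_cycle")
    sizes.append((len(long_lived), len(short_term)))

print(sizes)  # [(0, 1), (0, 2), (0, 3), (4, 0)]
```

In this sketch the tool call and its result stay short-term until the final assistant reply arrives, at which point the whole turn is committed at once.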

Transition hooks

Before messages move from short-term into the long-lived (cached) history, an optional transition_hook can rewrite them — useful for trimming verbose tool outputs or stripping content you don't want locked into the stable prefix forever.

def trim_tool_outputs(messages):
    """Keep only the last 20 lines of any tool output before it enters long-lived history."""
    result = []
    for msg in messages:
        if msg.get("role") == "tool":
            content = msg.get("content", "")
            lines = content.splitlines()
            if len(lines) > 20:
                kept = "\n".join(lines[-20:])
                msg = {**msg, "content": f"[…{len(lines) - 20} lines truncated]\n{kept}"}
        result.append(msg)
    return result

manager = PromptManager(
    transition_mode="agent_cycle",
    transition_hook=trim_tool_outputs,
)
# Functional API: pass the hook directly
# state = functional.append_message(state, msg, transition_mode="agent_cycle",
#                                   transition_hook=trim_tool_outputs)

The hook receives the list of short-term messages being committed and returns whatever should actually land in long-lived history. Drop messages entirely, summarise them, replace binary blobs with descriptions — the returned list is what gets cached.
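
Because a hook is just a function from a list of messages to a list of messages, several small hooks can be chained. The sketch below assumes only that contract; `drop_tool_traffic`, `cap_content`, and `chain` are hypothetical names, and the `tool_calls` key is an assumption based on OpenAI-style message dicts.

```python
def drop_tool_traffic(messages):
    """Drop tool results and assistant messages that only carry tool calls."""
    kept = []
    for msg in messages:
        if msg.get("role") == "tool":
            continue
        if msg.get("role") == "assistant" and msg.get("tool_calls") and not msg.get("content"):
            continue
        kept.append(msg)
    return kept

def cap_content(limit):
    """Build a hook that truncates string content longer than `limit` characters."""
    def hook(messages):
        out = []
        for msg in messages:
            content = msg.get("content")
            if isinstance(content, str) and len(content) > limit:
                # Copy instead of mutating the caller's message.
                msg = {**msg, "content": content[:limit] + " [truncated]"}
            out.append(msg)
        return out
    return hook

def chain(*hooks):
    """Combine hooks so each one receives the previous hook's output."""
    def combined(messages):
        for hook in hooks:
            messages = hook(messages)
        return messages
    return combined

transition_hook = chain(drop_tool_traffic, cap_content(200))

turn = [
    {"role": "user", "content": "Run the tests."},
    {"role": "assistant", "content": "", "tool_calls": [{"id": "call_1"}]},
    {"role": "tool", "content": "x" * 5000},
    {"role": "assistant", "content": "All tests passed. " + "Details. " * 50},
]
committed = transition_hook(turn)
print([m["role"] for m in committed])  # ['user', 'assistant']
```

The combined function could then be passed wherever a single `transition_hook` is expected.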

Compaction

When the long-lived history exceeds max_tokens, a compaction hook reduces it to max_tokens // 2 (configurable). The default hook truncates oldest-first; supply your own to summarise instead:

def summarise(messages, target_tokens, adapter):
    summary = call_llm_to_summarise(messages)
    return [{"role": "system", "content": summary}]

manager = PromptManager(max_tokens=8_000, compaction_hook=summarise)
# Functional API: compaction is an explicit call
# state = functional.compact(state, max_tokens=8_000, compaction_hook=summarise)
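
For intuition, here is a minimal sketch of what oldest-first truncation could look like. It is an illustration under stated assumptions, not the library's default hook: `estimate_tokens`, `truncate_oldest_first`, and `maybe_compact` are hypothetical, the token estimate uses the ~4 characters per token heuristic, and the `adapter` argument is accepted only to mirror the hook signature shown above.

```python
def estimate_tokens(messages):
    """Approximate token count with the ~4 characters per token heuristic."""
    return sum(len(str(m.get("content", ""))) for m in messages) // 4

def truncate_oldest_first(messages, target_tokens, adapter=None):
    """Drop messages from the front until the estimate fits the target."""
    kept = list(messages)
    while kept and estimate_tokens(kept) > target_tokens:
        kept.pop(0)
    return kept

def maybe_compact(long_lived, max_tokens, hook=truncate_oldest_first):
    """Compact only when the history exceeds max_tokens, aiming for half of it."""
    if estimate_tokens(long_lived) <= max_tokens:
        return long_lived
    return hook(long_lived, max_tokens // 2, None)

# Ten messages of 400 characters each, roughly 1,000 tokens in total.
history = [{"role": "user", "content": str(i) * 400} for i in range(10)]

compacted = maybe_compact(history, max_tokens=800)    # over budget: cut to <= 400
untouched = maybe_compact(history, max_tokens=1_000)  # within budget: unchanged

print(len(compacted), len(untouched))  # 4 10
```

Any compaction rewrites the start of the long-lived history, so the first turn after it presumably cannot reuse the previously cached prefix beyond the static system prompt.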

Provider adapters

Adapter	Cache markers	Token counting
`OpenAIAdapter` (default)	None needed — automatic prefix caching	~4 chars/token
`AnthropicAdapter`	`cache_control: {type: ephemeral}` injected at prefix boundaries	~4 chars/token
`TransformersAdapter(tok)`	None	Exact via HF tokenizer

Subclass ProviderAdapter to add a new provider — override count_tokens() and/or apply_cache_markers().
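
The exact signatures of `count_tokens()` and `apply_cache_markers()` are not shown here, so the sketch below deliberately avoids subclassing and only illustrates the two ideas from the table as free functions. `approx_tokens` and `mark_cache_breakpoint` are hypothetical names; the `cache_control` block shape follows Anthropic's Messages API, and where AnthropicAdapter actually places its markers is not something this example claims to reproduce.

```python
import copy

def approx_tokens(text):
    """Estimate tokens with the ~4 characters per token heuristic."""
    return len(text) // 4

def mark_cache_breakpoint(messages, index):
    """Return a copy of `messages` with an ephemeral cache marker on messages[index].

    Assumes the message content is a string or a non-empty list of content blocks.
    """
    marked = copy.deepcopy(messages)
    content = marked[index]["content"]
    if isinstance(content, str):
        # Anthropic attaches cache_control to content blocks, not bare strings.
        content = [{"type": "text", "text": content}]
    content[-1]["cache_control"] = {"type": "ephemeral"}
    marked[index]["content"] = content
    return marked

messages = [
    {"role": "user", "content": "Summarise the design doc."},
    {"role": "assistant", "content": "It describes a cache-friendly message layout."},
    {"role": "user", "content": "What changed today?"},
]

# Mark the end of the stable prefix (here, the second message).
marked = mark_cache_breakpoint(messages, 1)
print(marked[1]["content"])
print(approx_tokens("a" * 40))  # 10
```

The input list is deep-copied so that adding markers never mutates the stored history.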

Benchmark

The benchmark suite runs a multi-turn conversation through both llmbuffer and a naive approach, and reports cache hits from the provider's own usage metadata.

The naive approach puts the static and dynamic system prompts together at the start of every message list and drops the oldest messages when the context limit is hit — this is the default pattern in most LLM applications today.

Results (simulated, 15 turns, Anthropic pricing)

The simulated provider applies an idealised prefix-caching rule: a turn counts as a cache hit when its message list shares a leading prefix with a previously seen turn. Run the benchmark with --provider anthropic or --provider openai for live numbers.
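
The prefix rule described above can be sketched in a few lines. This is a toy model written for illustration, not the benchmark's simulated provider: `SimulatedPrefixCache` and `naive_layout` are hypothetical, and hits are counted in messages rather than tokens to keep the example short.

```python
import json

class SimulatedPrefixCache:
    """Toy cache that remembers every message list it has seen."""

    def __init__(self):
        self._seen = []

    def submit(self, messages):
        """Return how many leading messages match an earlier turn, then record this turn."""
        keys = [json.dumps(m, sort_keys=True) for m in messages]
        best = 0
        for previous in self._seen:
            matched = 0
            for current_key, previous_key in zip(keys, previous):
                if current_key != previous_key:
                    break
                matched += 1
            best = max(best, matched)
        self._seen.append(keys)
        return best

STATIC = "You are a helpful assistant."

def naive_layout(dynamic_context, history):
    """Naive approach: static and dynamic prompts merged into the first message."""
    return [{"role": "system", "content": f"{STATIC}\n{dynamic_context}"}] + history

u1 = {"role": "user", "content": "What is a prefix cache?"}
a1 = {"role": "assistant", "content": "It reuses work for a repeated prompt prefix."}
u2 = {"role": "user", "content": "When does it miss?"}
a2 = {"role": "assistant", "content": "When the prefix changes."}
u3 = {"role": "user", "content": "Show me an example."}

cache = SimulatedPrefixCache()
hits = [
    cache.submit(naive_layout("doc A", [u1])),                  # first turn: nothing cached
    cache.submit(naive_layout("doc A", [u1, a1, u2])),          # same context: prefix reused
    cache.submit(naive_layout("doc B", [u1, a1, u2, a2, u3])),  # context rotated: full miss
]
print(hits)  # [0, 2, 0]
```

The third turn shares no prefix with any earlier turn because its very first message changed, which is the pattern the naive column shows on the turns where the dynamic context rotates.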

Turn	Dynamic changed	llmbuffer cached	naive cached
1	yes	✗ 0	✗ 0
2	—	✓ 1,213	✓ 1,340
3	—	✓ 1,245	✓ 1,368
4	yes	✓ 1,274	✗ 0
5	—	✓ 1,297	✓ 1,416
6	—	✓ 1,325	✓ 1,443
7	yes	✓ 1,351	✗ 0
8	—	✓ 1,379	✓ 1,497
9	—	✓ 1,403	✓ 1,525
10	yes	✓ 1,430	✗ 0
11	—	✓ 1,458	✓ 1,568
12	—	✓ 1,479	✓ 1,597
13	yes	✓ 1,507	✗ 0
14	—	✓ 1,535	✓ 1,651
15	—	✓ 1,561	✓ 1,677

Metric	llmbuffer	naive
Cache hit ratio	85.3%	66.1%
Total cached tokens	19,457	15,082
Est. cost (Anthropic, with caching)	$0.016	$0.028
Est. savings vs no caching	76.7%	59.5%

Every time the dynamic context rotates (turns 4, 7, 10, 13), the naive approach suffers a full cache miss because the changed system prompt invalidates the entire prefix. llmbuffer keeps the static system prompt and long-lived history stable, so only the new suffix is uncached, regardless of what the dynamic context does.
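
The totals in the summary table can be re-derived from the per-turn rows. The snippet below only re-adds the numbers already listed above; it makes no claim about how the benchmark computes the hit ratio or the cost figures.

```python
# Cached tokens per turn, copied from the results table (turns 1 to 15).
llmbuffer_cached = [0, 1213, 1245, 1274, 1297, 1325, 1351, 1379,
                    1403, 1430, 1458, 1479, 1507, 1535, 1561]
naive_cached = [0, 1340, 1368, 0, 1416, 1443, 0, 1497,
                1525, 0, 1568, 1597, 0, 1651, 1677]

total_llmbuffer = sum(llmbuffer_cached)
total_naive = sum(naive_cached)

# Turns after the first where the naive approach cached nothing.
naive_misses = [turn for turn, cached in enumerate(naive_cached, start=1)
                if cached == 0 and turn > 1]

print(total_llmbuffer)                # 19457
print(total_naive)                    # 15082
print(total_llmbuffer - total_naive)  # 4375
print(naive_misses)                   # [4, 7, 10, 13]
```

On turns where the dynamic context did not change, the naive column is slightly higher than the llmbuffer column; the overall difference comes from the four full misses.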

Run it yourself

# No API key needed:
uv run python -m llmbuffer.benchmark --provider simulated --compare --turns 15

# Live providers (needs API key):
uv run python -m llmbuffer.benchmark --provider anthropic --compare --turns 15
uv run python -m llmbuffer.benchmark --provider openai --compare --turns 15
uv run python -m llmbuffer.benchmark --provider gemini --compare --turns 15

# Ollama (local, needs server log access):
uv run python -m llmbuffer.benchmark --provider ollama \
    --ollama-log ~/.ollama/logs/server.log --compare

# JSON output:
uv run python -m llmbuffer.benchmark --provider anthropic --compare --format json

Development

# Clone and set up:
git clone https://github.com/scottpurdy/llmbuffer
cd llmbuffer
uv sync

# Run tests:
uv run pytest

# Run benchmark (simulated, no API key needed):
uv run python -m llmbuffer.benchmark --provider simulated --compare

The test suite includes explicit cache-stability tests asserting that the static system prompt and long-lived history are byte-identical across turns — verifying the cache prefix is never accidentally mutated.

License

MIT

Release history

0.3.0 (Jun 10, 2026)
0.2.0 (Jun 10, 2026, this version)
0.1.0 (Jun 9, 2026)

Download files

Source Distribution

llmbuffer-0.2.0.tar.gz (321.2 kB)

Uploaded Jun 10, 2026 Source

Built Distribution

llmbuffer-0.2.0-py3-none-any.whl (26.6 kB)

Uploaded Jun 10, 2026 Python 3

File details

Details for the file llmbuffer-0.2.0.tar.gz.

File metadata

Download URL: llmbuffer-0.2.0.tar.gz
Upload date: Jun 10, 2026
Size: 321.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.11.14

File hashes

Hashes for llmbuffer-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`1db002eb384943e921a140a3cc22a4f06c9ae0721b444d13c0ffd4de28540890`
MD5	`ea717707917a09e4da64a7bd7bbcd0bd`
BLAKE2b-256	`515051eb0da99edf3d5fa72868b801fcba903b2a371d348344521acd65acc138`

File details

Details for the file llmbuffer-0.2.0-py3-none-any.whl.

File metadata

Download URL: llmbuffer-0.2.0-py3-none-any.whl
Upload date: Jun 10, 2026
Size: 26.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.11.14

File hashes

Hashes for llmbuffer-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`55b6c9c657d80a99f350966662ed8d318191054a3197871c8d591643f53b1e7e`
MD5	`fa55ae09c7c3a7b26e743e145b330c34`
BLAKE2b-256	`e1fdf5e42e1d303cb3c38f064f02358e0b6b6bdb70d618a9986731fab6b27e8c`
