LLM context management — token counting, rolling summarization, dollar savings reports
ctxmgr
LLM context management for Python — token counting, rolling summarization, and dollar savings reports.
LLM APIs charge per token. A 20-turn support conversation can exceed 800 tokens before the user even asks their real question. ctxmgr compresses the history with Claude Haiku (cheap and fast) so you send fewer tokens to your more expensive model, and it reports the estimated dollar amount saved per call.
pip install ctxmgr
Quick start — 2 lines
from ctxmgr import compress
result = compress(messages, max_tokens=4000, model="claude-sonnet-4-6")
print(f"Saved {result.saved_tokens} tokens — ${result.estimated_savings_usd:.4f} per call")
Before / After — real numbers
The following uses tests/test_20turn_review.py, a 20-turn conversation about token economics (871 tokens). With a 600-token budget, ctxmgr compresses it end-to-end via a live Claude Haiku call:
ORIGINAL : 31 messages, 871 tokens
COMPRESSED: 9 messages, 544 tokens
BUDGET : 600 tokens
REDUCTION : 327 tokens (37.5%)
The compressed result keeps:
- the system prompt (assistant persona, always pinned)
- a single summary message covering the 14 oldest turns
- the last 3 user/assistant pairs verbatim (most recent context, always pinned)
Savings per call at different model tiers:
| Model | Tokens saved | $/call saved |
|---|---|---|
| claude-haiku-4-5 | 327 | $0.000327 |
| claude-sonnet-4-6 | 327 | $0.000981 |
| claude-opus-4-8 | 327 | $0.001635 |
At 10,000 calls/day on Sonnet 4.6 that is about $9.81/day, or roughly $294 over a 30-day month.
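The per-call figures in the table are the tokens saved multiplied by each model's input price. Here is a minimal sketch of that arithmetic, assuming the per-million-token prices implied by the table ($1, $3 and $5). The names `INPUT_PRICE_PER_MTOK` and `savings_usd` are illustrative and not part of ctxmgr, and real prices should be checked against current provider pricing.

```python
# Illustrative only: prices are back-derived from the table above
# (USD per million input tokens), not read from ctxmgr itself.
INPUT_PRICE_PER_MTOK = {
    "claude-haiku-4-5": 1.00,
    "claude-sonnet-4-6": 3.00,
    "claude-opus-4-8": 5.00,
}


def savings_usd(saved_tokens: int, model: str, calls: int = 1) -> float:
    """Dollar savings for `calls` requests that each avoid `saved_tokens` input tokens."""
    price_per_token = INPUT_PRICE_PER_MTOK[model] / 1_000_000
    return saved_tokens * price_per_token * calls


for model in INPUT_PRICE_PER_MTOK:
    print(f"{model}: ${savings_usd(327, model):.6f} per call")

# Scale the Sonnet figure to one day of traffic.
daily = savings_usd(327, "claude-sonnet-4-6", calls=10_000)
print(f"10,000 calls/day on Sonnet: ${daily:.2f} per day")
```

This counts only the input tokens avoided on the target model. The cost of the Haiku summarization call is not subtracted in this sketch.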
Three conversation types
Benchmarked on realistic fixtures with medium aggressiveness (3 pinned pairs):
| Type | Original | Compressed | Reduction | $/call (Sonnet) |
|---|---|---|---|---|
| Support chat | 426 tok | 237 tok | 44.4% | $0.000567 |
| Coding assistant | 726 tok | 515 tok | 29.1% | $0.000633 |
| RAG Q&A | 541 tok | 251 tok | 53.6% | $0.000870 |
API
compress(messages, max_tokens, model, aggressiveness)
from ctxmgr import compress, CompressionResult
result: CompressionResult = compress(
    messages,                   # list of {"role": ..., "content": ...}
    max_tokens=4000,            # token budget for the result
    model="claude-sonnet-4-6",  # used only to calculate dollar savings
    aggressiveness="medium",    # "light" | "medium" | "aggressive"
)
CompressionResult fields:
| Field | Type | Description |
|---|---|---|
| `messages` | `list[dict]` | Compressed conversation |
| `original_tokens` | `int` | Token count before compression |
| `compressed_tokens` | `int` | Token count after compression |
| `saved_tokens` | `int` | original - compressed |
| `ratio` | `float` | compressed / original (lower = more compression) |
| `estimated_savings_usd` | `float` | saved_tokens × model_input_price |
| `aggressiveness` | `str` | Level used |
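The derived fields above are simple functions of the two token counts and the model's input price. The sketch below shows how they relate. `ResultSketch` is a hypothetical stand-in rather than ctxmgr's actual `CompressionResult`, and the $3 per million tokens price is an assumption taken from the Sonnet row of the savings table.

```python
from dataclasses import dataclass


@dataclass
class ResultSketch:
    """Hypothetical stand-in mirroring the documented fields; not ctxmgr's class."""
    original_tokens: int
    compressed_tokens: int
    price_per_token_usd: float  # assumed input price of the target model

    @property
    def saved_tokens(self) -> int:
        return self.original_tokens - self.compressed_tokens

    @property
    def ratio(self) -> float:
        # Assumption: report 1.0 (no compression) for an empty history.
        if self.original_tokens == 0:
            return 1.0
        return self.compressed_tokens / self.original_tokens

    @property
    def estimated_savings_usd(self) -> float:
        return self.saved_tokens * self.price_per_token_usd


# Numbers from the 20-turn example, priced at an assumed $3 per million tokens.
r = ResultSketch(original_tokens=871, compressed_tokens=544,
                 price_per_token_usd=3 / 1_000_000)
print(r.saved_tokens)                    # 327
print(f"{r.ratio:.3f}")                  # 0.625
print(f"{r.estimated_savings_usd:.6f}")  # 0.000981
```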
Aggressiveness levels — controls how many recent user/assistant pairs are pinned (never summarized):
| Level | Pinned pairs | Use when |
|---|---|---|
| `"light"` | 5 pairs | Long coding sessions, high coherence needed |
| `"medium"` | 3 pairs | General-purpose (default) |
| `"aggressive"` | 1 pair | Support chats, RAG lookups, cost is critical |
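To make the effect of each level concrete, the sketch below splits a history into system messages, older turns eligible for summarization, and the pinned tail. The pair counts come from the table. `PINNED_PAIRS` and `split_history` are hypothetical names, and the code assumes strictly alternating user/assistant messages, which ctxmgr itself may not require.

```python
# Pinned-pair counts taken from the table above; the function is a
# hypothetical illustration, not ctxmgr's internal code.
PINNED_PAIRS = {"light": 5, "medium": 3, "aggressive": 1}


def split_history(messages: list[dict], aggressiveness: str = "medium"):
    """Return (system, old, pinned): system messages, turns eligible for
    summarization, and the most recent turns kept verbatim."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    # One user message plus one assistant message per pinned pair.
    n_old = max(len(turns) - PINNED_PAIRS[aggressiveness] * 2, 0)
    return system, turns[:n_old], turns[n_old:]


# A 6-pair conversation with a system prompt.
history = [{"role": "system", "content": "You are a support agent."}]
for i in range(6):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

for level in PINNED_PAIRS:
    _, old, pinned = split_history(history, level)
    print(f"{level}: summarize {len(old)} messages, pin {len(pinned)}")
```

With this 6-pair history, "light" leaves only 2 messages to summarize while "aggressive" leaves 10, which is why the more aggressive levels save more tokens.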
TokenCounter
from ctxmgr import TokenCounter
counter = TokenCounter("claude-sonnet-4-6")
print(counter.count("Hello, world!")) # 4
print(counter.count_messages(messages)) # full conversation estimate
Supported models: all claude-* and gpt-* variants. Unknown models fall back to cl100k_base.
Accepts both plain-string content and list-of-blocks format (OpenAI tool calls, Anthropic multi-modal).
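The fallback rule can be pictured as a small lookup from model name to tiktoken encoding name. This is a sketch under assumptions: `encoding_name_for` is a hypothetical helper, and the text only states the mapping for Claude, GPT-4o, and unknown models, so sending every other `gpt-*` name to `cl100k_base` is a guess.

```python
def encoding_name_for(model: str) -> str:
    """Pick a tiktoken encoding name for a model string (illustrative rule)."""
    if model.startswith("gpt-4o"):
        return "o200k_base"
    # Claude models and anything unrecognized use cl100k_base.
    # Assumption: other gpt-* names also land here in this sketch.
    return "cl100k_base"


for name in ("claude-sonnet-4-6", "gpt-4o-mini", "some-unknown-model"):
    print(name, "->", encoding_name_for(name))
```

The returned name could then be passed to `tiktoken.get_encoding(name)` to obtain an encoder for local counting.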
RollingSummarizer
Lower-level class if you need more control:
from ctxmgr import RollingSummarizer
summarizer = RollingSummarizer(
    model="claude-haiku-4-5-20251001",  # summarization model
    token_budget=4000,
    pin_last_pairs=3,
)
compressed = summarizer.compress(messages)
Message format support
Both Anthropic and OpenAI message formats work:
# Plain strings (both APIs)
{"role": "user", "content": "What is a token?"}
# OpenAI list-of-blocks (vision, tool calls)
{"role": "user", "content": [
    {"type": "text", "text": "Describe this image."},
    {"type": "image_url", "image_url": {"url": "https://..."}},
]}
# Anthropic list-of-blocks (tool use)
{"role": "assistant", "content": [
    {"type": "text", "text": "I'll look that up."},
    {"type": "tool_use", "id": "tu_01", "name": "search", "input": {"q": "tokens"}},
]}
Images and tool-use blocks are counted as short placeholders ([image], [tool:name]) so token estimates stay meaningful.
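One way to picture the placeholder substitution is a helper that flattens any supported content shape into plain text before counting. The sketch below is an assumption about how this could work, not ctxmgr's code. `content_to_text` is a hypothetical name, and joining blocks with a single space is an arbitrary choice.

```python
def content_to_text(content) -> str:
    """Flatten message content to plain text for token estimation.

    Hypothetical helper: handles None, plain strings, and list-of-blocks content.
    """
    if content is None:
        return ""
    if isinstance(content, str):
        return content
    parts = []
    for block in content:
        kind = block.get("type")
        if kind == "text":
            parts.append(block.get("text", ""))
        elif kind in ("image_url", "image"):
            parts.append("[image]")
        elif kind == "tool_use":
            parts.append(f"[tool:{block.get('name', 'unknown')}]")
        # Other block types are skipped in this sketch.
    return " ".join(parts)


print(content_to_text([
    {"type": "text", "text": "Describe this image."},
    {"type": "image_url", "image_url": {"url": "https://..."}},
]))
# Describe this image. [image]

print(content_to_text([
    {"type": "text", "text": "I'll look that up."},
    {"type": "tool_use", "id": "tu_01", "name": "search", "input": {"q": "tokens"}},
]))
# I'll look that up. [tool:search]
```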
Edge cases
| Scenario | Behaviour |
|---|---|
| Empty history `[]` | Returns `[]`, `saved_tokens=0` |
| Single-turn (no assistant reply) | Returns unchanged — nothing to summarize |
| Single message larger than budget | Returns unchanged — cannot split a single message |
| Already under budget | Returns unchanged, no API call made |
| `content=None` | Treated as empty string |
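Most of these cases amount to guard checks that run before any summarization call. The sketch below expresses them as one function. `skip_reason` and the word-count stand-in `word_count` are hypothetical, and the order of the checks is an assumption.

```python
from typing import Callable, Optional


def skip_reason(messages: list[dict], budget: int,
                count: Callable[[list[dict]], int]) -> Optional[str]:
    """Return why compression would be skipped, or None if it should run."""
    if not messages:
        return "empty history"
    if not any(m["role"] == "assistant" for m in messages):
        return "single turn, nothing to summarize"
    if count(messages) <= budget:
        return "already under budget"
    return None


def word_count(msgs: list[dict]) -> int:
    # Crude stand-in for a real token counter; None content counts as "".
    return sum(len((m.get("content") or "").split()) for m in msgs)


chat = [{"role": "user", "content": "hi"},
        {"role": "assistant", "content": None}]

print(skip_reason([], 100, word_count))
print(skip_reason([{"role": "user", "content": "hi"}], 100, word_count))
print(skip_reason(chat, 100, word_count))
print(skip_reason(chat, 0, word_count))
```

Returning a reason string rather than a boolean makes it easy to log why a call was left unchanged.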
How it works
- Count: `TokenCounter` uses tiktoken (`cl100k_base` for Claude, `o200k_base` for GPT-4o) to estimate the token count of the full conversation.
- Split: the system prompt and the last N user/assistant pairs are pinned. Everything older is passed to the summarizer.
- Summarize: Claude Haiku receives the old turns and returns a single summary message of under 300 words.
- Reassemble: `[system prompt] + [summary] + [pinned tail]` replaces the original history.
- Report: `CompressionResult` returns the token counts before and after compression and the estimated dollar savings at the target model's input price.
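The five steps above can be sketched end to end with the summarizer injected as a plain function, so the flow runs without an API call. Everything here is illustrative: `compress_sketch`, `rough_tokens` (about one token per four characters), and `fake_summarize` are hypothetical stand-ins, and giving the summary message the `user` role is an arbitrary choice for the sketch.

```python
from typing import Callable

Message = dict  # {"role": ..., "content": ...}


def rough_tokens(messages: list[Message]) -> int:
    """Crude stand-in for a tokenizer: about one token per 4 characters."""
    return sum(len(m.get("content") or "") for m in messages) // 4


def compress_sketch(
    messages: list[Message],
    budget: int,
    pin_pairs: int,
    summarize: Callable[[list[Message]], str],
    count: Callable[[list[Message]], int] = rough_tokens,
) -> tuple[list[Message], int, int]:
    """Return (new_messages, original_tokens, compressed_tokens)."""
    original = count(messages)                          # 1. Count
    if original <= budget:
        return messages, original, original             # no summarizer call

    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    n_old = max(len(turns) - pin_pairs * 2, 0)          # 2. Split
    old, pinned = turns[:n_old], turns[n_old:]
    if not old:
        return messages, original, original             # nothing to summarize

    # 3. Summarize (role chosen arbitrarily for this sketch)
    summary = {"role": "user", "content": summarize(old)}
    result = system + [summary] + pinned                # 4. Reassemble
    return result, original, count(result)              # 5. Report


def fake_summarize(old: list[Message]) -> str:
    # Stand-in for the Claude Haiku call described above.
    return f"Summary of {len(old)} earlier messages."


history = [{"role": "system", "content": "You are a helpful assistant."}]
for i in range(10):
    history.append({"role": "user", "content": f"Question {i}: " + "details " * 20})
    history.append({"role": "assistant", "content": f"Answer {i}: " + "explanation " * 20})

new, before, after = compress_sketch(history, budget=400, pin_pairs=3,
                                     summarize=fake_summarize)
print(len(history), "->", len(new), "messages;", before, "->", after, "tokens")
```

Injecting `summarize` and `count` keeps the control flow testable offline. A real setup would replace them with a model call and a tokenizer-backed counter.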
Requirements
- Python 3.10+
- `anthropic >= 0.100.0`
- `tiktoken >= 0.7.0`
- `ANTHROPIC_API_KEY` env variable (used only when compression actually runs; token counting is fully local)
License
MIT