LLM context management — token counting, rolling summarization, dollar savings reports

ctxmgr

LLM context management for Python — token counting, rolling summarization, and dollar savings reports.

LLM APIs charge per token. A 20-turn support conversation can reach 800+ tokens before the user even asks their real question. ctxmgr compresses the history with Claude Haiku (cheap and fast) so you send fewer tokens to your expensive model, and it reports an estimated dollar amount saved per call.

pip install ctxmgr

Quick start — 2 lines

from ctxmgr import compress

result = compress(messages, max_tokens=4000, model="claude-sonnet-4-6")
print(f"Saved {result.saved_tokens} tokens — ${result.estimated_savings_usd:.4f} per call")

Before / After — real numbers

The following uses tests/test_20turn_review.py, a 20-turn conversation about token economics (871 tokens). With a 600-token budget, ctxmgr compresses it end-to-end via a live Claude Haiku call:

ORIGINAL  : 31 messages, 871 tokens
COMPRESSED:  9 messages, 544 tokens
BUDGET    : 600 tokens
REDUCTION : 327 tokens (37.5%)

The compressed result keeps:

the system prompt (assistant persona, always pinned)
a single summary message covering the 14 oldest turns
the last 3 user/assistant pairs verbatim (most recent context, always pinned)

Savings per call at different model tiers:

Model	Tokens saved	$/call saved
claude-haiku-4-5	327	$0.000327
claude-sonnet-4-6	327	$0.000981
claude-opus-4-8	327	$0.001635

At 10,000 calls/day on Sonnet 4.6, that is about $9.81/day, or roughly $294 over a 30-day month.
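The per-call figures in the table are the tokens saved multiplied by a per-token input price. Here is a minimal sketch of that arithmetic, assuming the per-million-token input prices that the table implies ($1, $3, and $5). `PRICE_PER_MTOK_USD`, `savings_per_call`, and `monthly_savings` are illustrative names, not part of ctxmgr, and the prices should be checked against current provider pricing.

```python
# Input prices in USD per million tokens, back-derived from the table above.
# Assumption: these are not confirmed by the library or by a pricing page.
PRICE_PER_MTOK_USD = {
    "claude-haiku-4-5": 1.00,
    "claude-sonnet-4-6": 3.00,
    "claude-opus-4-8": 5.00,
}


def savings_per_call(saved_tokens: int, model: str) -> float:
    """Dollars saved on one call: tokens saved times the per-token input price."""
    return saved_tokens * PRICE_PER_MTOK_USD[model] / 1_000_000


def monthly_savings(saved_tokens: int, model: str, calls_per_day: int, days: int = 30) -> float:
    """Project the per-call savings over a month of `days` days."""
    return savings_per_call(saved_tokens, model) * calls_per_day * days


for name in PRICE_PER_MTOK_USD:
    print(f"{name}: ${savings_per_call(327, name):.6f} per call")

# Monthly projection for Sonnet at 10,000 calls/day, assuming a 30-day month.
print(f"${monthly_savings(327, 'claude-sonnet-4-6', 10_000):.2f} per month")
```

The monthly figure scales linearly with the number of days assumed, so the `days` parameter is worth stating whenever a projection is quoted.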

Three conversation types

Benchmarked on realistic fixtures with medium aggressiveness (3 pinned pairs):

Type	Original	Compressed	Reduction	$/call (Sonnet)
Support chat	426 tok	237 tok	44.4%	$0.000567
Coding assistant	726 tok	515 tok	29.1%	$0.000633
RAG Q&A	541 tok	251 tok	53.6%	$0.000870

API

`compress(messages, max_tokens, model, aggressiveness)`

from ctxmgr import compress, CompressionResult

result: CompressionResult = compress(
    messages,                      # list of {"role": ..., "content": ...}
    max_tokens=4000,               # token budget for the result
    model="claude-sonnet-4-6",     # used only to calculate dollar savings
    aggressiveness="medium",       # "light" | "medium" | "aggressive"
)

CompressionResult fields:

Field	Type	Description
`messages`	`list[dict]`	Compressed conversation
`original_tokens`	`int`	Token count before compression
`compressed_tokens`	`int`	Token count after compression
`saved_tokens`	`int`	`original_tokens - compressed_tokens`
`ratio`	`float`	`compressed_tokens / original_tokens` (lower = more compression)
`estimated_savings_usd`	`float`	`saved_tokens × model_input_price` (price per input token)
`aggressiveness`	`str`	Level used
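To make the relationships between these fields concrete, here is a small stand-in built from the formulas in the table. `ResultSketch` and `price_per_token_usd` are hypothetical names for illustration only; this is not the library's `CompressionResult`, and the handling of an empty history (ratio of 1.0) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class ResultSketch:
    """Hypothetical stand-in for CompressionResult, not the library's class."""

    messages: list[dict]
    original_tokens: int
    compressed_tokens: int
    price_per_token_usd: float  # assumed: the target model's input price per token
    aggressiveness: str = "medium"

    @property
    def saved_tokens(self) -> int:
        return self.original_tokens - self.compressed_tokens

    @property
    def ratio(self) -> float:
        # Lower means more compression. An empty history is treated as
        # "no compression" here, which is an assumption of this sketch.
        if self.original_tokens == 0:
            return 1.0
        return self.compressed_tokens / self.original_tokens

    @property
    def estimated_savings_usd(self) -> float:
        return self.saved_tokens * self.price_per_token_usd


# Numbers from the 20-turn example, priced at an assumed $3 per million tokens.
r = ResultSketch(messages=[], original_tokens=871, compressed_tokens=544,
                 price_per_token_usd=3e-6)
print(r.saved_tokens, round(r.ratio, 3), f"{r.estimated_savings_usd:.6f}")
# 327 0.625 0.000981
```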

Aggressiveness levels control how many recent user/assistant pairs are pinned (never summarized):

Level	Pinned pairs	Use when
`"light"`	5 pairs	Long coding sessions, high coherence needed
`"medium"`	3 pairs	General-purpose (default)
`"aggressive"`	1 pair	Support chats, RAG lookups, cost is critical
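As a rough illustration of what each level leaves for the summarizer, the sketch below maps the levels to the pinned-pair counts from the table and counts the messages that remain eligible for summarization. It assumes one pair is exactly two messages and that the system prompt is excluded from the count. `PINNED_PAIRS` and `summarizable_count` are hypothetical names, not ctxmgr internals.

```python
# Pinned-pair counts taken from the table above.
PINNED_PAIRS = {"light": 5, "medium": 3, "aggressive": 1}


def summarizable_count(n_messages: int, aggressiveness: str) -> int:
    """Messages older than the pinned tail (system prompt not counted)."""
    pinned_messages = 2 * PINNED_PAIRS[aggressiveness]  # one pair = user + assistant
    return max(0, n_messages - pinned_messages)


for level in PINNED_PAIRS:
    print(level, summarizable_count(30, level))
# light 20
# medium 24
# aggressive 28
```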

`TokenCounter`

from ctxmgr import TokenCounter

counter = TokenCounter("claude-sonnet-4-6")
print(counter.count("Hello, world!"))        # 4
print(counter.count_messages(messages))      # full conversation estimate

Supported models: all claude-* and gpt-* variants. Unknown models fall back to cl100k_base.
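The fallback rule can be pictured as a simple prefix check. The sketch below is an assumption about how such a lookup might be written, not the library's code. The README only names `o200k_base` for GPT-4o and `cl100k_base` for Claude and unknown models, so routing every other `gpt-*` variant to `cl100k_base` is a guess, and `encoding_name_for` is a hypothetical helper. Because `cl100k_base` is an OpenAI encoding, counts for Claude models are approximations.

```python
def encoding_name_for(model: str) -> str:
    """Pick a tiktoken encoding name for a model ID (illustrative rule only)."""
    if model.startswith("gpt-4o"):
        return "o200k_base"
    # claude-* models, other gpt-* variants (assumed), and unknown models.
    return "cl100k_base"


print(encoding_name_for("gpt-4o-mini"))        # o200k_base
print(encoding_name_for("claude-sonnet-4-6"))  # cl100k_base
print(encoding_name_for("some-other-model"))   # cl100k_base
```

The returned name can be passed to `tiktoken.get_encoding` to obtain an encoder.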

Accepts both plain-string content and list-of-blocks format (OpenAI tool calls, Anthropic multi-modal).

`RollingSummarizer`

Lower-level class if you need more control:

from ctxmgr import RollingSummarizer

summarizer = RollingSummarizer(
    model="claude-haiku-4-5-20251001",  # summarization model
    token_budget=4000,
    pin_last_pairs=3,
)
compressed = summarizer.compress(messages)

Message format support

Both Anthropic and OpenAI message formats work:

# Plain strings (both APIs)
{"role": "user", "content": "What is a token?"}

# OpenAI list-of-blocks (vision, tool calls)
{"role": "user", "content": [
    {"type": "text", "text": "Describe this image."},
    {"type": "image_url", "image_url": {"url": "https://..."}},
]}

# Anthropic list-of-blocks (tool use)
{"role": "assistant", "content": [
    {"type": "text", "text": "I'll look that up."},
    {"type": "tool_use", "id": "tu_01", "name": "search", "input": {"q": "tokens"}},
]}

Images and tool-use blocks are counted as short placeholders ([image], [tool:name]) so token estimates stay meaningful.
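One way to picture the placeholder substitution is a helper that flattens any supported content shape into plain text before counting. This is a sketch under assumptions, not ctxmgr's implementation: `content_to_text` is a hypothetical name, joining parts with a single space is a guess, and block types other than the ones shown above are simply skipped.

```python
def content_to_text(content) -> str:
    """Flatten message content into the text a token counter would see."""
    if content is None:
        return ""  # matches the documented content=None behaviour
    if isinstance(content, str):
        return content
    parts = []
    for block in content:
        kind = block.get("type")
        if kind == "text":
            parts.append(block.get("text", ""))
        elif kind in ("image_url", "image"):
            parts.append("[image]")
        elif kind == "tool_use":
            parts.append(f"[tool:{block.get('name', 'unknown')}]")
        # Other block types are ignored in this sketch.
    return " ".join(parts)


openai_msg = [
    {"type": "text", "text": "Describe this image."},
    {"type": "image_url", "image_url": {"url": "https://..."}},
]
anthropic_msg = [
    {"type": "text", "text": "I'll look that up."},
    {"type": "tool_use", "id": "tu_01", "name": "search", "input": {"q": "tokens"}},
]
print(content_to_text(openai_msg))     # Describe this image. [image]
print(content_to_text(anthropic_msg))  # I'll look that up. [tool:search]
```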

Edge cases

Scenario	Behaviour
Empty history `[]`	Returns `[]`, `saved_tokens=0`
Single-turn (no assistant reply)	Returns unchanged — nothing to summarize
Single message larger than budget	Returns unchanged — cannot split a single message
Already under budget	Returns unchanged, no API call made
`content=None`	Treated as empty string
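The "returns unchanged" rows in the table amount to a handful of early exits before any API call is made. The sketch below expresses them as a single predicate. `needs_compression` and `word_count` are hypothetical names, and the word-based counter is a crude stand-in for a real tokenizer that only handles string or `None` content.

```python
def word_count(messages: list[dict]) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return sum(len((m.get("content") or "").split()) for m in messages)


def needs_compression(messages: list[dict], max_tokens: int, count=word_count) -> bool:
    """Return True only when summarization could actually help."""
    if len(messages) < 2:
        return False  # empty history, or a lone message that cannot be split
    if not any(m["role"] == "assistant" for m in messages):
        return False  # single-turn: no assistant reply yet
    return count(messages) > max_tokens  # already under budget: skip the API call


chat = [
    {"role": "user", "content": "What is a token?"},
    {"role": "assistant", "content": "A token is a chunk of text."},
]
print(needs_compression([], 10))     # False
print(needs_compression(chat, 100))  # False
print(needs_compression(chat, 5))    # True
```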

How it works

1. Count: TokenCounter uses tiktoken (cl100k_base for Claude, o200k_base for GPT-4o) to estimate the token count of the full conversation.
2. Split: the system prompt and the last N user/assistant pairs are pinned. Everything older is passed to the summarizer.
3. Summarize: Claude Haiku receives the old turns and returns a single summary message of under 300 words.
4. Reassemble: [system prompt] + [summary] + [pinned tail] replaces the original history.
5. Report: CompressionResult returns the estimated token counts and the estimated dollar savings at the target model's input price.
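The steps above can be sketched end to end with a stubbed summarizer. This is an illustration under assumptions, not the library's implementation: `compress_sketch`, `rough_tokens`, and `stub_summarize` are hypothetical names, the word-based counter stands in for tiktoken, the stub stands in for the Claude Haiku call, and giving the summary message the `user` role is a guess.

```python
from typing import Callable


def rough_tokens(messages: list[dict]) -> int:
    """Crude stand-in for a tokenizer: whitespace-separated words."""
    return sum(len(str(m.get("content") or "").split()) for m in messages)


def stub_summarize(old: list[dict]) -> str:
    """Placeholder for the Claude Haiku call; returns a fixed-format string."""
    return f"Summary of {len(old)} earlier messages."


def compress_sketch(
    messages: list[dict],
    pin_last_pairs: int = 3,
    summarize: Callable[[list[dict]], str] = stub_summarize,
) -> dict:
    # Split: pin the system prompt and the last N pairs (two messages per pair).
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    cut = max(0, len(turns) - 2 * pin_last_pairs)
    old, tail = turns[:cut], turns[cut:]

    if old:
        # Summarize and reassemble. The summary's role is an assumption here.
        summary = {"role": "user", "content": summarize(old)}
        result = system + [summary] + tail
    else:
        result = list(messages)  # nothing older than the pinned tail

    # Count and report, using the stand-in counter.
    before, after = rough_tokens(messages), rough_tokens(result)
    return {
        "messages": result,
        "original_tokens": before,
        "compressed_tokens": after,
        "saved_tokens": before - after,
    }


history = [{"role": "system", "content": "Be brief."}] + [
    {"role": "user" if i % 2 == 0 else "assistant",
     "content": f"message number {i} with some extra words"}
    for i in range(10)
]
out = compress_sketch(history)
print(len(history), "->", len(out["messages"]), "messages")
```

Passing the summarizer in as a callable keeps the split and reassemble logic testable without network access. A real summarizer would replace `stub_summarize` with a model call.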

Requirements

Python 3.10+
anthropic >= 0.100.0
tiktoken >= 0.7.0
ANTHROPIC_API_KEY env variable (used only when compression actually runs; token counting is fully local)

License

MIT

This version

0.1.0

Jun 16, 2026

Download files

Source Distribution

tokenai-0.1.0.tar.gz (21.4 kB)

Uploaded Jun 16, 2026 Source

Built Distribution

tokenai-0.1.0-py3-none-any.whl (10.6 kB)

Uploaded Jun 16, 2026 Python 3

File details

Details for the file tokenai-0.1.0.tar.gz.

File metadata

Download URL: tokenai-0.1.0.tar.gz
Upload date: Jun 16, 2026
Size: 21.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.4

File hashes

Hashes for tokenai-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`53a0194241c38d6657c7f65d5b4fa881f30bed8d514aee3b497d8bd2b333e58c`
MD5	`befaf2d39861d65e217d2c73346687f5`
BLAKE2b-256	`6921359e93e1c1855e912cdc4bb091eda4353170d84b8ea8c02e95fcc3508889`

File details

Details for the file tokenai-0.1.0-py3-none-any.whl.

File metadata

Download URL: tokenai-0.1.0-py3-none-any.whl
Upload date: Jun 16, 2026
Size: 10.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.4

File hashes

Hashes for tokenai-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`6bc500102cca5460fe85e8f131ad715c963ff3b74eb4a84f0fdb089c9a72b069`
MD5	`f4cb04982455db25278d2082f83bfe06`
BLAKE2b-256	`6e5150ebdf9254ad24a97a037fe4a7e8d576318ba7ca7cb757e03d148226122d`
