LLM context management — token counting, rolling summarization, dollar savings reports

ctxmgr

LLM context management for Python — token counting, rolling summarization, and dollar savings reports.

LLM APIs charge per token. A 20-turn support conversation can reach 800+ tokens before the user even asks their real question. ctxmgr compresses the history with Claude Haiku (cheap and fast) so you send fewer tokens to your expensive model, and it reports an estimated dollar amount saved per call.

pip install ctxmgr

Quick start — 2 lines

from ctxmgr import compress

result = compress(messages, max_tokens=4000, model="claude-sonnet-4-6")
print(f"Saved {result.saved_tokens} tokens — ${result.estimated_savings_usd:.4f} per call")

Before / After — real numbers

The following uses tests/test_20turn_review.py, a 20-turn conversation about token economics (871 tokens). With a 600-token budget, ctxmgr compresses it end-to-end via a live Claude Haiku call:

ORIGINAL  : 31 messages, 871 tokens
COMPRESSED:  9 messages, 544 tokens
BUDGET    : 600 tokens
REDUCTION : 327 tokens (37.5%)

The compressed result keeps:

  • the system prompt (assistant persona, always pinned)
  • a single summary message covering the 14 oldest turns
  • the last 3 user/assistant pairs verbatim (most recent context, always pinned)

Savings per call at different model tiers:

Model               Tokens saved   $/call saved
claude-haiku-4-5    327            $0.000327
claude-sonnet-4-6   327            $0.000981
claude-opus-4-8     327            $0.001635

At 10,000 calls/day on Sonnet 4.6 that is $9.81/day, or about $294 saved per 30-day month.
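
The per-call figures in the table are consistent with input prices of $1, $3, and $5 per million tokens for the three tiers. Those prices are inferred from the table rather than taken from a price list, so treat them as assumptions. A minimal sketch of the arithmetic, with hypothetical helper names that are not part of ctxmgr:

```python
# Input prices in USD per million tokens, inferred from the table above.
# Assumption: check current pricing before relying on these numbers.
ASSUMED_INPUT_PRICE_PER_MTOK = {
    "claude-haiku-4-5": 1.00,
    "claude-sonnet-4-6": 3.00,
    "claude-opus-4-8": 5.00,
}


def savings_per_call(saved_tokens: int, model: str) -> float:
    """Dollars saved on one call: saved tokens times the per-token input price."""
    return saved_tokens * ASSUMED_INPUT_PRICE_PER_MTOK[model] / 1_000_000


def monthly_savings(saved_tokens: int, model: str,
                    calls_per_day: int, days: int = 30) -> float:
    """Project the per-call saving over a month of `days` days."""
    return savings_per_call(saved_tokens, model) * calls_per_day * days


if __name__ == "__main__":
    for model in ASSUMED_INPUT_PRICE_PER_MTOK:
        print(f"{model}: ${savings_per_call(327, model):.6f} per call")
    print(f"${monthly_savings(327, 'claude-sonnet-4-6', 10_000):.2f} per month")
```

The projection scales linearly, so it only holds if the average saving per call stays close to the benchmarked 327 tokens.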


Three conversation types

Benchmarked on realistic fixtures with medium aggressiveness (3 pinned pairs):

Type               Original   Compressed   Reduction   $/call saved (Sonnet)
Support chat       426 tok    237 tok      44.4%       $0.000567
Coding assistant   726 tok    515 tok      29.1%       $0.000633
RAG Q&A            541 tok    251 tok      53.6%       $0.000870

API

compress(messages, max_tokens, model, aggressiveness)

from ctxmgr import compress, CompressionResult

result: CompressionResult = compress(
    messages,                      # list of {"role": ..., "content": ...}
    max_tokens=4000,               # token budget for the result
    model="claude-sonnet-4-6",     # used only to calculate dollar savings
    aggressiveness="medium",       # "light" | "medium" | "aggressive"
)

CompressionResult fields:

Field                   Type         Description
messages                list[dict]   Compressed conversation
original_tokens         int          Token count before compression
compressed_tokens       int          Token count after compression
saved_tokens            int          original - compressed
ratio                   float        compressed / original (lower = more compression)
estimated_savings_usd   float        saved_tokens × model_input_price
aggressiveness          str          Level used
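
To make the relationships between these fields concrete, here is a small stand-in dataclass. It is an illustration only, not the library's CompressionResult: the class name, the `input_price_per_token` attribute, and the choice to report a ratio of 1.0 for an empty history are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class ResultSketch:
    """Illustrative stand-in for CompressionResult (not the real class)."""

    messages: list[dict]
    original_tokens: int
    compressed_tokens: int
    input_price_per_token: float  # hypothetical: USD per input token
    aggressiveness: str = "medium"

    @property
    def saved_tokens(self) -> int:
        return self.original_tokens - self.compressed_tokens

    @property
    def ratio(self) -> float:
        # Assumption: an empty history (0 tokens) reports "no compression".
        if self.original_tokens == 0:
            return 1.0
        return self.compressed_tokens / self.original_tokens

    @property
    def estimated_savings_usd(self) -> float:
        return self.saved_tokens * self.input_price_per_token


if __name__ == "__main__":
    # Numbers from the 20-turn example; $3 per million tokens is assumed.
    r = ResultSketch([], 871, 544, 3 / 1_000_000)
    print(r.saved_tokens, f"{r.ratio:.3f}", f"${r.estimated_savings_usd:.6f}")
```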

Aggressiveness levels — controls how many recent user/assistant pairs are pinned (never summarized):

Level          Pinned pairs   Use when
"light"        5 pairs        Long coding sessions, high coherence needed
"medium"       3 pairs        General-purpose (default)
"aggressive"   1 pair         Support chats, RAG lookups, cost is critical

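One way to picture what the aggressiveness level does is as a split of the history into a pinned tail and an older part that is eligible for summarization. The sketch below is an assumption about the behaviour, not ctxmgr's code: `PINNED_PAIRS` and `split_history` are hypothetical names, and it assumes non-system messages alternate user/assistant so that one pair is two messages.

```python
# Pinned pair counts taken from the table above.
PINNED_PAIRS = {"light": 5, "medium": 3, "aggressive": 1}


def split_history(messages: list[dict], aggressiveness: str = "medium"):
    """Return (system, old, pinned) lists of role/content dicts."""
    keep = PINNED_PAIRS[aggressiveness] * 2  # two messages per pair
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    if len(turns) <= keep:
        # Everything fits in the pinned tail; nothing is left to summarize.
        return system, [], turns
    return system, turns[:-keep], turns[-keep:]


if __name__ == "__main__":
    history = [{"role": "system", "content": "You are helpful."}]
    for i in range(10):
        history.append({"role": "user", "content": f"question {i}"})
        history.append({"role": "assistant", "content": f"answer {i}"})
    for level in PINNED_PAIRS:
        _, old, pinned = split_history(history, level)
        print(level, len(old), len(pinned))
```
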
TokenCounter

from ctxmgr import TokenCounter

counter = TokenCounter("claude-sonnet-4-6")
print(counter.count("Hello, world!"))        # 4
print(counter.count_messages(messages))      # full conversation estimate

Supported models: all claude-* and gpt-* variants. Unknown models fall back to cl100k_base.
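
The fallback rule can be sketched as a simple lookup on the model name. This is a guess at the shape of the logic, not the library's implementation: `encoding_name_for` and `estimate_tokens` are hypothetical helpers, and mapping every gpt-* model other than GPT-4o to cl100k_base is an assumption. Since tiktoken encodings are OpenAI tokenizers, counts for Claude models are approximations.

```python
def encoding_name_for(model: str) -> str:
    """Pick a tiktoken encoding name for a model string (illustrative only)."""
    name = model.lower()
    if name.startswith("gpt-4o"):
        return "o200k_base"
    # Claude models, other gpt-* models (assumption), and unknown names
    # all fall back to cl100k_base.
    return "cl100k_base"


def estimate_tokens(text: str, model: str) -> int:
    """Estimate the token count of `text` with the encoding chosen above."""
    import tiktoken  # imported lazily; requires `pip install tiktoken`

    encoding = tiktoken.get_encoding(encoding_name_for(model))
    return len(encoding.encode(text))
```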

Accepts both plain-string content and list-of-blocks format (OpenAI tool calls, Anthropic multi-modal).

RollingSummarizer

Lower-level class if you need more control:

from ctxmgr import RollingSummarizer

summarizer = RollingSummarizer(
    model="claude-haiku-4-5-20251001",  # summarization model
    token_budget=4000,
    pin_last_pairs=3,
)
compressed = summarizer.compress(messages)

Message format support

Both Anthropic and OpenAI message formats work:

# Plain strings (both APIs)
{"role": "user", "content": "What is a token?"}

# OpenAI list-of-blocks (vision, tool calls)
{"role": "user", "content": [
    {"type": "text", "text": "Describe this image."},
    {"type": "image_url", "image_url": {"url": "https://..."}},
]}

# Anthropic list-of-blocks (tool use)
{"role": "assistant", "content": [
    {"type": "text", "text": "I'll look that up."},
    {"type": "tool_use", "id": "tu_01", "name": "search", "input": {"q": "tokens"}},
]}

Images and tool-use blocks are counted as short placeholders ([image], [tool:name]) so token estimates stay meaningful.
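
A minimal sketch of how mixed content might be flattened to text before counting, using the placeholders described above. `content_to_text` is a hypothetical helper, not ctxmgr's API; joining blocks with a single space and skipping unrecognized block types are assumptions.

```python
def content_to_text(content) -> str:
    """Flatten message content to a string suitable for token counting."""
    if content is None:
        return ""
    if isinstance(content, str):
        return content
    parts = []
    for block in content:
        kind = block.get("type")
        if kind == "text":
            parts.append(block.get("text", ""))
        elif kind in ("image_url", "image"):
            # OpenAI uses "image_url" blocks, Anthropic uses "image" blocks.
            parts.append("[image]")
        elif kind == "tool_use":
            parts.append(f"[tool:{block.get('name', 'unknown')}]")
        # Other block types are skipped in this sketch.
    return " ".join(parts)


if __name__ == "__main__":
    print(content_to_text([
        {"type": "text", "text": "I'll look that up."},
        {"type": "tool_use", "id": "tu_01", "name": "search",
         "input": {"q": "tokens"}},
    ]))
```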


Edge cases

Scenario                            Behaviour
Empty history []                    Returns [], saved_tokens=0
Single-turn (no assistant reply)    Returns unchanged; nothing to summarize
Single message larger than budget   Returns unchanged; cannot split a single message
Already under budget                Returns unchanged, no API call made
content=None                        Treated as empty string
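
The rows above amount to a set of early exits before any summarization call is made. A sketch of such a guard follows, under stated assumptions: `needs_compression` and `rough_count` are hypothetical names, the word-based counter is a stand-in for a real tokenizer, and content is assumed to be a string or None.

```python
from typing import Callable


def rough_count(messages: list[dict]) -> int:
    """Stand-in token counter: whitespace-separated words in string content."""
    # content=None is treated as an empty string, as in the table above.
    return sum(len((m.get("content") or "").split()) for m in messages)


def needs_compression(
    messages: list[dict],
    max_tokens: int,
    count: Callable[[list[dict]], int] = rough_count,
) -> bool:
    """Return True only when summarizing older turns could help."""
    if not messages:
        return False  # empty history
    if len(messages) == 1:
        return False  # a single message cannot be split
    if not any(m["role"] == "assistant" for m in messages):
        return False  # single turn: nothing to summarize
    if count(messages) <= max_tokens:
        return False  # already under budget, so no API call is needed
    return True
```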

How it works

  1. Count: TokenCounter uses tiktoken (cl100k_base for Claude, o200k_base for GPT-4o) to estimate the token count of the full conversation.
  2. Split: the system prompt and last N user/assistant pairs are pinned. Everything older is passed to the summarizer.
  3. Summarize: Claude Haiku receives the old turns and returns a single summary message of under 300 words.
  4. Reassemble: [system prompt] + [summary] + [pinned tail] replaces the original history.
  5. Report: CompressionResult returns the token counts and estimated dollar savings at the target model's input price.
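
The steps above can be sketched end to end with the counter and summarizer passed in as plain functions. This is an illustration of the flow, not ctxmgr's implementation: `compress_sketch`, `word_count`, and `fake_summarize` are hypothetical, the stub summarizer stands in for the Claude Haiku call, and emitting the summary as a "user" message is an assumption.

```python
from typing import Callable

Message = dict


def word_count(messages: list[Message]) -> int:
    """Stand-in counter: words instead of real tokens."""
    return sum(len(str(m.get("content") or "").split()) for m in messages)


def fake_summarize(old: list[Message]) -> str:
    """Stub for the summarization model; a real one would call an LLM."""
    return f"{len(old)} earlier messages"


def compress_sketch(
    messages: list[Message],
    max_tokens: int,
    count: Callable[[list[Message]], int] = word_count,
    summarize: Callable[[list[Message]], str] = fake_summarize,
    pin_last_pairs: int = 3,
) -> list[Message]:
    # 1. Count: return early when the history already fits the budget.
    if count(messages) <= max_tokens:
        return messages
    # 2. Split: pin the system prompt and the last N pairs.
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    cut = max(len(turns) - pin_last_pairs * 2, 0)
    old, pinned = turns[:cut], turns[cut:]
    if not old:
        return messages  # nothing older than the pinned tail
    # 3. Summarize: collapse the old turns into one message.
    summary = {
        "role": "user",  # assumption: the role of the summary message
        "content": "Summary of earlier conversation: " + summarize(old),
    }
    # 4. Reassemble.
    return system + [summary] + pinned


if __name__ == "__main__":
    history = [{"role": "system", "content": "be brief"}]
    for _ in range(10):
        history.append({"role": "user", "content": "w w w w"})
        history.append({"role": "assistant", "content": "w w w w"})
    out = compress_sketch(history, max_tokens=40)
    # 5. Report: compare counts before and after.
    print(word_count(history), "->", word_count(out))
```

Note that this sketch does not check that the reassembled history fits the budget; a pinned tail that is itself over budget would pass through unchanged.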

Requirements

  • Python 3.10+
  • anthropic >= 0.100.0
  • tiktoken >= 0.7.0
  • ANTHROPIC_API_KEY env variable (used only when compression actually runs; token counting is fully local)

License

MIT
