contextcram

Fit anything into an LLM context window. A tiny, zero-dependency, priority-aware token-budget packer for RAG pipelines and agents.

Every RAG or agent app has the same problem: you have too much stuff — a system prompt, chat history, retrieved documents, tool output — and a fixed token budget. contextcram packs it all in by priority, truncating, trimming, or dropping the least important pieces so the important ones always make it.

from contextcram import Packer

packer = Packer(budget=8000)  # token budget

packer.add(system_prompt, priority="required")                 # never dropped
packer.add(chat_history, priority="high", strategy="trim")     # drop oldest turns
packer.add(retrieved_docs, priority="medium", strategy="drop") # all-or-nothing
packer.add(tool_output, priority="low", strategy="truncate")   # cut to fit

result = packer.fit()
print(result.text)            # assembled, in-budget context
print(result.used_tokens)     # e.g. 7840
print(result.dropped_names)   # what didn't make the cut

Why

  • Zero dependencies. Pure Python. Works out of the box with a fast characters-per-token heuristic; plug in tiktoken or any tokenizer when you need exact counts.
  • Framework-agnostic. Use it with LangChain, LlamaIndex, the raw provider SDKs, or nothing at all.
  • Priority-aware. You decide what survives a tight budget, not a blind truncate at the end.
  • Observable. Every result tells you what was kept, truncated, and dropped.
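The characters-per-token fallback mentioned in the first bullet can be pictured with a sketch like the one below. This is not contextcram's actual heuristic: the function name estimate_tokens and the ratio of 4 characters per token are assumptions chosen purely for illustration.

```python
import math

# Assumed ratio for illustration only; contextcram's real default is not
# documented in this README.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str, chars_per_token: float = CHARS_PER_TOKEN) -> int:
    """Estimate a token count from character length alone (no tokenizer)."""
    if not text:
        return 0
    # Round up so short strings never count as zero tokens.
    return math.ceil(len(text) / chars_per_token)


print(estimate_tokens("Fit anything into an LLM context window."))
```

An estimate like this is cheap but approximate, which is why swapping in a real tokenizer matters when the budget is tight.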

Installation

pip install contextcram
# optional: exact token counts via tiktoken
pip install "contextcram[tiktoken]"

Strategies

When an optional item doesn't fully fit, its strategy decides what happens:

| Strategy | Behavior |
| --- | --- |
| drop | Include the item whole, or not at all |
| truncate | Cut from the end, keeping the head (default) |
| truncate_head | Cut from the start, keeping the tail |
| trim | For list content: drop oldest segments first |

Items with priority "required" are always kept. If they alone exceed the budget, a BudgetExceeded error is raised.
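To make the four behaviors concrete, here is a minimal standalone sketch. It does not use contextcram's internals: the helper names, the room parameter (tokens still available), and the one-token-per-word counter are all assumptions for illustration.

```python
from typing import List, Optional


def count(text: str) -> int:
    """Stand-in token counter: one token per whitespace-separated word."""
    return len(text.split())


def drop(text: str, room: int) -> Optional[str]:
    """All-or-nothing: return the text if it fits, otherwise None."""
    return text if count(text) <= room else None


def truncate(text: str, room: int) -> str:
    """Keep the head: cut words from the end until the text fits."""
    return " ".join(text.split()[: max(room, 0)])


def truncate_head(text: str, room: int) -> str:
    """Keep the tail: cut words from the start until the text fits."""
    words = text.split()
    return " ".join(words[max(len(words) - max(room, 0), 0):])


def trim(segments: List[str], room: int) -> List[str]:
    """For list content: discard the oldest (first) segments until the rest fits."""
    kept = list(segments)
    while kept and sum(count(s) for s in kept) > room:
        kept.pop(0)
    return kept


text = "a b c d e"
print(drop(text, 3), truncate(text, 3), truncate_head(text, 3))
print(trim(["one two", "three four", "five"], 3))
```

Note that trim works on whole segments, so a chat history loses complete turns rather than being cut mid-message.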

Model-aware budgets

Skip the magic number — set the budget from the model, and reserve room for the response in one go:

from contextcram import Packer

# 128k window for gpt-4o, holding back 2k tokens for the model's reply
packer = Packer(model="gpt-4o", reserve=2000)
print(packer.full_budget)  # 128000
print(packer.budget)       # 126000  (effective budget you pack into)

reserve is the easy way to avoid the classic "prompt fit, but no room left to answer" failure. Unknown model? Pass budget= explicitly or register it:

from contextcram import register_model

register_model("my-internal-llm", 32000)
packer = Packer(model="my-internal-llm", reserve=1000)
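The arithmetic behind model and reserve is simple enough to sketch without the library. In the snippet below, MODEL_WINDOWS, register_window, and effective_budget are hypothetical names, and raising KeyError for an unknown model is an assumption about reasonable behavior, not a statement of what contextcram does.

```python
from typing import Dict

# Hypothetical registry of context-window sizes. The 128000 figure for
# gpt-4o is the one quoted in the example above.
MODEL_WINDOWS: Dict[str, int] = {"gpt-4o": 128_000}


def register_window(name: str, window: int) -> None:
    """Record the context-window size for a model name."""
    if window <= 0:
        raise ValueError("window must be positive")
    MODEL_WINDOWS[name] = window


def effective_budget(model: str, reserve: int = 0) -> int:
    """Return the tokens left for the prompt after holding back `reserve`."""
    if model not in MODEL_WINDOWS:
        raise KeyError(f"unknown model {model!r}; register it first")
    full = MODEL_WINDOWS[model]
    if not 0 <= reserve < full:
        raise ValueError("reserve must be between 0 and the window size")
    return full - reserve


register_window("my-internal-llm", 32_000)
print(effective_budget("gpt-4o", reserve=2000))           # 126000
print(effective_budget("my-internal-llm", reserve=1000))  # 31000
```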

Exact token counts

from contextcram import Packer, tiktoken_tokenizer

packer = Packer(budget=8000, tokenizer=tiktoken_tokenizer("gpt-4o"))

Or wrap any tokenizer with CallableTokenizer(lambda s: len(my_encode(s))), where my_encode is your own encoding function.
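The callable in that one-liner only has to map a string to an integer count. As a hedged illustration of that shape, the sketch below defines a stand-in my_encode (a hypothetical regex splitter, not a real model tokenizer) and a count_tokens function equivalent to the lambda.

```python
import re
from typing import List


def my_encode(text: str) -> List[str]:
    """Hypothetical encoder: split text into words and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def count_tokens(text: str) -> int:
    """Same shape as the lambda above: string in, token count out."""
    return len(my_encode(text))


print(count_tokens("Hello, world!"))  # 4: "Hello", ",", "world", "!"
```

In real use, my_encode would be the encode method of whatever tokenizer matches your model.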

Priorities

Use the named levels "required", "high", "medium", "low", or pass any integer (higher is kept first):

packer.add(text, priority=42, strategy="truncate")
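How named levels and integers might interact can be sketched as a greedy pack. Everything here is an assumption for illustration: the numeric values in LEVELS, the OverBudgetError class (standing in for the BudgetExceeded error described earlier), the word-based counter, and the choice to apply drop semantics to every optional item.

```python
from typing import List, Tuple, Union

Priority = Union[str, int]
Item = Tuple[str, str, Priority]  # (name, text, priority)

# Hypothetical numeric ranks for the named levels; the real values may differ.
LEVELS = {"high": 75, "medium": 50, "low": 25}


class OverBudgetError(Exception):
    """Raised in this sketch when required items alone exceed the budget."""


def count(text: str) -> int:
    """Stand-in token counter: one token per whitespace-separated word."""
    return len(text.split())


def rank(priority: Priority) -> int:
    """Map a named level to its number; integers pass through unchanged."""
    return LEVELS[priority] if isinstance(priority, str) else priority


def pack(items: List[Item], budget: int) -> Tuple[List[str], List[str], int]:
    """Keep required items, then add optional ones from highest rank down."""
    required = [it for it in items if it[2] == "required"]
    optional = [it for it in items if it[2] != "required"]
    used = sum(count(text) for _, text, _ in required)
    if used > budget:
        raise OverBudgetError(f"required items need {used} tokens, budget is {budget}")
    kept = [name for name, _, _ in required]
    dropped: List[str] = []
    for name, text, _ in sorted(optional, key=lambda it: rank(it[2]), reverse=True):
        cost = count(text)
        if used + cost <= budget:
            kept.append(name)
            used += cost
        else:
            dropped.append(name)  # all-or-nothing in this sketch
    return kept, dropped, used


items: List[Item] = [
    ("sys", "a b", "required"),
    ("docs", "c d e", "medium"),
    ("tool", "f g h i", "low"),
    ("hist", "j k", "high"),
    ("extra", "l", 42),
]
print(pack(items, budget=7))
```

With a budget of 7, the required item and the two highest-ranked optional items fill the budget exactly, so the integer-priority item and the low-priority one are reported as dropped.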

Real-world usage

With LangChain

Pack a system prompt, retrieved docs, and chat history into a gpt-4o budget — leaving room for the answer — then hand the result to the model:

from contextcram import Packer
from langchain_openai import ChatOpenAI
from langchain_core.messages import SystemMessage, HumanMessage

llm = ChatOpenAI(model="gpt-4o")

# retriever, memory, question, and SYSTEM_PROMPT come from your own application
docs = [d.page_content for d in retriever.invoke(question)]
history = [f"{m.type}: {m.content}" for m in memory.messages]

ctx = (
    Packer(model="gpt-4o", reserve=1500)                          # room for the reply
    .add(SYSTEM_PROMPT, priority="required")                      # never dropped
    .add(history, priority="high", strategy="trim")               # drop oldest turns
    .add("\n\n".join(docs), priority="medium", strategy="drop")   # whole docs only
    .fit()
)

response = llm.invoke([SystemMessage(ctx.text), HumanMessage(question)])

With the raw Anthropic SDK

Tie reserve to max_tokens so the input can never crowd out the response:

import anthropic
from contextcram import Packer

client = anthropic.Anthropic()
REPLY_TOKENS = 4000

ctx = (
    Packer(model="claude-opus-4-8", reserve=REPLY_TOKENS)
    .add(SYSTEM_PROMPT, priority="required")
    .add(chat_history, priority="high", strategy="trim")
    .add(retrieved_docs, priority="medium", strategy="drop")
    .fit()
)

msg = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=REPLY_TOKENS,        # matches reserve above
    system=ctx.text,
    messages=[{"role": "user", "content": question}],
)
print(f"packed {ctx.used_tokens} tokens; dropped {ctx.dropped_names}")

Alternatives

Priority-based context assembly isn't a new idea, and depending on your needs one of these may fit better — contextcram deliberately trades features for simplicity and zero dependencies:

| Library | Approach | When to prefer it over contextcram |
| --- | --- | --- |
| Priompt / PriomptiPy | Component/JSX-style priority rendering | You want fine-grained, composable prompt components and don't mind a learning curve |
| Prompt Poet | YAML + Jinja2 templating with cache-aware, priority truncation | You need templating and production GPU prefix-cache optimization |
| LLMLingua | Model-based prompt compression | You want to shrink text rather than drop/truncate whole pieces |

Choose contextcram when you want a tiny, zero-dependency, framework-agnostic helper with a 3-line API (Packer(...).add(...).fit()) that does one thing — fit prioritized pieces into a budget — and gets out of your way.

Development

git clone https://github.com/Waelr1985/contextcram.git
cd contextcram
uv sync
uv run pytest
uv run ruff check .
uv run mypy

License

MIT
