contextpress
Deterministic context compression for LLM chat, RAG, and agent pipelines. Created and maintained by Taha Azizi.
Project Status
Status: Stable for its original use case — maintained at a low cadence.
- Built for a specific use case and provided as-is. I will review bug fixes when time permits, but I am not actively developing new features.
- PRs are welcome, but please expect a review cycle of 2–4 weeks. If you need a feature immediately, fork the repository and iterate on your own timeline.
- License: Apache 2.0 — no warranty, no liability. See §7 (Disclaimer of Warranty) and §8 (Limitation of Liability) of the license for the legal text.
For bug reports, please open a GitHub issue with a minimal reproduction. Feature requests may be closed with a pointer to fork.
Install
pip install contextpress
If you cloned this repository:
pip install -e .
30-second quickstart
from contextpress import ContextManager
# Default compression is "medium" (filler + repetition + recency); see below.
cm = ContextManager(type="chat")
messages = [{"role": "user", "content": "Hello!"}]
compressed = cm.compress(messages, token_budget=2000)
No API keys are required for Tier 1. Passing token_budget turns on the budget stage; other stages follow the chosen compression preset (low / medium / high).
Minimal examples
from contextpress import ContextManager
messages = [{"role": "user", "content": "Hello"}, {"role": "assistant", "content": "Hi!"}]
# Shortest useful call (default compression=medium; budget stage on because token_budget is set)
out = ContextManager().compress(messages, token_budget=500)
# Lighter pass: only filler + repetition (+ budget if token_budget set)
out = ContextManager(compression="low").compress(messages, token_budget=500)
# Full NLP pipeline for this call (+ budget if token_budget set)
out = ContextManager().compress(messages, token_budget=500, compression="high")
# Exact stages only (preset ignored); include "budget" if you pass token_budget and want enforcement
out = ContextManager().compress(
    messages,
    token_budget=500,
    stages=["filler", "repetition", "budget"],
)
Runnable demo in this repo
After pip install -e ., run:
python try_compress.py
That script builds a long history and a tight token_budget so you can see turn and token counts drop (see comments at the top of try_compress.py).
Context types
- chat — Typical back-and-forth dialogue. Filler removal, repetition deduplication, resolution collapsing, recency weighting, and token budgets are tuned for conversational flow.
- rag_doc — Document chunks or RAG context. Resolution is off; repetition compares all chunks; recency uses relevance to the latest user query instead of chat recency.
- agent — Tool-using or task-oriented threads. Resolution can trigger on a single high-confidence completion signal; filler rules preserve tool-related turns when markers are present.
ContextManager(type="chat")
ContextManager(type="rag_doc")
ContextManager(type="agent")
Pipeline stages
- Filler — Removes low-semantic filler words and (in chat/agent) drops acknowledgement-only assistant turns.
- Repetition — TF-IDF cosine similarity; keeps the more recent of similar turns.
- Resolution — Collapses agreed threads into a single RESOLVED: synthetic system turn (chat/agent only).
- Recency — Extractively compresses older turns (or low-relevance chunks in rag_doc) while preserving the latest context.
- Budget — Enforces a hard token limit with tiktoken, removing oldest turns first while protecting system prompts and recent turns.
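As a rough illustration of the repetition stage described above, the sketch below computes TF-IDF cosine similarity between turns in plain Python and keeps the more recent turn of each similar pair. The function names, the 0.8 threshold, and the whitespace tokenizer are assumptions made for illustration; the package itself relies on scikit-learn, and its actual thresholds are not documented here.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Build simple TF-IDF vectors (as dicts) for a list of texts."""
    docs = [Counter(t.lower().split()) for t in texts]
    n = len(docs)
    df = Counter(term for doc in docs for term in doc)
    # Smoothed IDF so terms shared by every document keep a non-zero weight.
    idf = {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}
    return [{term: tf * idf[term] for term, tf in doc.items()} for doc in docs]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def drop_repeats(messages, threshold=0.8):
    """Drop an earlier turn when any later turn is at least `threshold` similar."""
    vectors = tfidf_vectors([m["content"] for m in messages])
    kept = []
    for i, vec in enumerate(vectors):
        repeated_later = any(cosine(vec, later) >= threshold for later in vectors[i + 1:])
        if not repeated_later:
            kept.append(messages[i])
    return kept


history = [
    {"role": "user", "content": "please reset my password"},
    {"role": "assistant", "content": "Which account is this for?"},
    {"role": "user", "content": "please reset my password now"},
]
deduped = drop_repeats(history)  # the first user turn is dropped, the later one survives
```

Comparing every turn against all later turns is quadratic in the number of turns, which is acceptable for a sketch but may need a vectorized implementation for long histories.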
Tier 1 vs Tier 2 (classical NLP vs LLM)
| | Tier 1 (always available) | Tier 2 (optional) |
|---|---|---|
| What | Pipeline stages: filler, repetition, resolution, recency, budget | LLMBackend: semantic deduplicate + summarize after Tier 1 |
| Where in code | contextpress/strategies/, orchestrated by pipeline.py | contextpress/llm/ (base.py, adapters.py) |
| Techniques | Rules, TF-IDF, cosine similarity, NLTK, Sumy extractive summarization, tiktoken | Your provider’s chat/completions API (you supply the client) |
| API key | None | Required for your chosen provider (OpenAI, Anthropic, …) |
| Determinism | Deterministic for a fixed input and settings | Non-deterministic (model sampling) |
| How to enable | Default: ContextManager() runs Tier 1 only | Pass llm_backend= (OpenAIBackend, AnthropicBackend, OllamaBackend, or custom LLMBackend) |
Note: ContextManager(model="gpt-4") is only for tiktoken encoding when counting tokens in the budget stage. It does not call that model unless you also pass llm_backend.
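To make the budget stage and the role of the token counter concrete, here is a minimal sketch, not the package's implementation. It removes the oldest unprotected turns until the total fits. The `enforce_budget` function, its `keep_recent` parameter, and the whitespace counter are hypothetical; with tiktoken installed, a counter built on `tiktoken.encoding_for_model("gpt-4").encode` could be passed as `count_tokens` instead.

```python
def count_words(text):
    # Stand-in token counter (assumption); swap in a tiktoken-based counter for real use.
    return len(text.split())


def enforce_budget(messages, token_budget, count_tokens=count_words, keep_recent=2):
    """Remove oldest turns first; never remove system prompts or the last `keep_recent` turns."""
    kept = list(messages)  # shallow copy so the caller's list is not modified
    total = sum(count_tokens(m["content"]) for m in kept)
    index = 0
    while total > token_budget and index < len(kept) - keep_recent:
        if kept[index]["role"] == "system":
            index += 1  # protected: skip over system prompts
            continue
        total -= count_tokens(kept[index]["content"])
        del kept[index]
    return kept


history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "First question about billing details"},
    {"role": "assistant", "content": "Here is a long answer about billing"},
    {"role": "user", "content": "Second question"},
    {"role": "assistant", "content": "Short answer"},
]
trimmed = enforce_budget(history, token_budget=10)
```

In this sketch the result can still exceed the budget when the protected turns alone are larger than `token_budget`; how the package handles that case is not covered here.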
Compression presets and custom stages
Presets (low / medium / high, default medium) control how many NLP stages run. Aliases: light→low, med/mid→medium, max→high.
| Preset | Non-budget stages enabled |
|---|---|
| low | filler, repetition |
| medium | filler, repetition, recency |
| high | filler, repetition, resolution, recency |
The budget stage is separate: if you pass token_budget=<int>, the budget stage runs as well (unless you opt out with disable=["budget"] or omit "budget" from an explicit stages= list). If token_budget is None, the budget stage does not run.
Presets are merged with the context profile (for example, resolution stays off for rag_doc even on high, unless you pass an explicit stages= list that includes resolution).
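The rules above (preset, aliases, context profile, token_budget, disable, explicit stages) can be summarized in a small sketch. `resolve_stages` and the three lookup tables are hypothetical names written for this illustration; they mirror the documented behavior but are not the package's internals, and the choice not to apply `disable` to an explicit `stages` list is an assumption.

```python
PRESETS = {
    "low": ["filler", "repetition"],
    "medium": ["filler", "repetition", "recency"],
    "high": ["filler", "repetition", "resolution", "recency"],
}
ALIASES = {"light": "low", "med": "medium", "mid": "medium", "max": "high"}
PROFILE_OFF = {"rag_doc": {"resolution"}}  # stages a context profile switches off


def resolve_stages(context_type="chat", compression="medium", token_budget=None,
                   stages=None, disable=()):
    """Return the list of stage names that would run for one call."""
    if stages is not None:
        # An explicit list wins over both the preset and the profile.
        selected = list(stages)
        if token_budget is None:
            selected = [s for s in selected if s != "budget"]  # nothing to enforce
        return selected
    preset = ALIASES.get(compression, compression)
    off = PROFILE_OFF.get(context_type, set())
    selected = [s for s in PRESETS[preset] if s not in off]
    if token_budget is not None:
        selected.append("budget")
    return [s for s in selected if s not in disable]
```

For example, `resolve_stages("rag_doc", "high")` leaves out resolution, while `resolve_stages("rag_doc", stages=["resolution"])` keeps it because the explicit list takes precedence.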
from contextpress import ContextManager
# Default strength is medium
cm = ContextManager(type="chat", compression="medium")
# Per-call preset
out = cm.compress(messages, token_budget=4000, compression="high")
# Full control: exact stages for this call (preset ignored)
out = cm.compress(
    messages,
    token_budget=4000,
    stages=["filler", "repetition", "budget"],
)
# Preset + skip one stage
out = cm.compress(messages, compression="high", disable=["resolution"])
# Change default for future calls
cm.set_compression("low")
Optional LLM tier (Tier 2)
After Tier 1 finishes, you can attach an LLMBackend for semantic compression.
What it does
- Calls deduplicate(turn_texts) on non-system turns (your backend returns indices to keep; default adapters keep all).
- If the combined transcript is long enough (default 1500 characters; set llm_min_input_chars=0 to always run), calls summarize(transcript, max_tokens).
- System turns are unchanged in order and content. All other turns are replaced by a single assistant message whose content is the LLM summary (metadata includes source: contextpress_llm_tier). If the LLM call fails, the Tier 1 conversation is returned and a warning is emitted.
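The flow described above can be sketched with plain dictionaries and a toy backend. Everything here is illustrative: `EchoBackend`, `BrokenBackend`, and `apply_llm_tier` are hypothetical names, placing the summary turn after the system turns is an assumption, and so is returning the deduplicated turns unchanged when the transcript is too short.

```python
import warnings


class EchoBackend:
    """Hypothetical backend exposing the two methods named above."""

    def deduplicate(self, turn_texts):
        return list(range(len(turn_texts)))  # keep every turn, like the default adapters

    def summarize(self, transcript, max_tokens):
        return "SUMMARY: " + transcript[:40]


class BrokenBackend(EchoBackend):
    def summarize(self, transcript, max_tokens):
        raise RuntimeError("provider unavailable")


def apply_llm_tier(messages, backend, min_input_chars=1500, max_summary_tokens=1024):
    non_system = [i for i, m in enumerate(messages) if m["role"] != "system"]
    try:
        keep = backend.deduplicate([messages[i]["content"] for i in non_system])
        kept = {non_system[k] for k in keep}
        transcript = "\n".join(
            f'{messages[i]["role"]}: {messages[i]["content"]}' for i in sorted(kept)
        )
        if len(transcript) < min_input_chars:
            # Too short to summarize: system turns plus surviving turns, original order.
            return [m for i, m in enumerate(messages) if m["role"] == "system" or i in kept]
        summary = backend.summarize(transcript, max_summary_tokens)
    except Exception as exc:
        warnings.warn(f"LLM tier failed, returning Tier 1 output: {exc}")
        return list(messages)
    system = [m for m in messages if m["role"] == "system"]
    summary_turn = {
        "role": "assistant",
        "content": summary,
        "metadata": {"source": "contextpress_llm_tier"},
    }
    return system + [summary_turn]


messages = [
    {"role": "system", "content": "Be brief."},
    {"role": "user", "content": "How do I rotate my API key?"},
    {"role": "assistant", "content": "Open settings, then choose Rotate key."},
]
summarized = apply_llm_tier(messages, EchoBackend(), min_input_chars=0)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    fallback = apply_llm_tier(messages, BrokenBackend(), min_input_chars=0)
```

Catching the exception inside the tier keeps a provider outage from breaking the caller, which matches the documented fallback to the Tier 1 conversation.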
Optional constructor knobs: llm_min_input_chars, llm_max_summary_tokens.
Install SDKs (not bundled): run pip install openai, pip install anthropic, and/or pip install ollama (for local Ollama). Alternatively, pip install "contextpress[llm]" pulls all optional LLM clients through the llm extra defined in this repo’s pyproject.toml.
from openai import OpenAI
from contextpress import ContextManager
from contextpress.llm.adapters import OpenAIBackend
backend = OpenAIBackend(client=OpenAI(), model="gpt-4o-mini")
cm = ContextManager(
    type="chat",
    llm_backend=backend,
    llm_min_input_chars=1000,
    llm_max_summary_tokens=1024,
)
out = cm.compress(messages, token_budget=4000)
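Runnable example (requires OPENAI_API_KEY): examples/llm_tier_openai.py. The commands below use Windows cmd syntax; on Unix shells, use export OPENAI_API_KEY=sk-... instead of set. Keep the set command on its own line, because cmd treats a trailing comment as part of the value.
pip install openai
set OPENAI_API_KEY=sk-...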
python examples/llm_tier_openai.py
Local Ollama (no cloud API key) — install Ollama, run ollama serve, pull a model (ollama pull llama3.2), then:
from contextpress import ContextManager
from contextpress.llm.adapters import OllamaBackend
backend = OllamaBackend(model="llama3.2") # optional: host="http://localhost:11434"
cm = ContextManager(type="chat", llm_backend=backend, llm_min_input_chars=500)
out = cm.compress(messages, token_budget=4000)
Runnable script: examples/llm_tier_ollama.py.
pip install ollama
ollama pull llama3.2
python examples/llm_tier_ollama.py
Custom strategies
Subclass contextpress.strategies.base.BaseStrategy and implement process(self, conversation) -> Conversation. To register it, override Pipeline._build_strategy in a local Pipeline subclass, or contribute a factory that returns your strategy for a custom stage name. Stages must not mutate input turns; return new Conversation and Turn objects.
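A minimal sketch of such a strategy follows. Because the real classes are not reproduced here, `Turn`, `Conversation`, and `BaseStrategy` below are simplified stand-ins with assumed fields; only the `process(self, conversation) -> Conversation` shape and the no-mutation rule come from the description above. `CollapseWhitespace` is a hypothetical stage.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Turn:  # stand-in for the package's Turn (fields are assumptions)
    role: str
    content: str


@dataclass(frozen=True)
class Conversation:  # stand-in for the package's Conversation
    turns: Tuple[Turn, ...]


class BaseStrategy(ABC):  # stand-in for contextpress.strategies.base.BaseStrategy
    @abstractmethod
    def process(self, conversation: Conversation) -> Conversation:
        ...


class CollapseWhitespace(BaseStrategy):
    """Hypothetical stage: normalize runs of whitespace in every turn."""

    def process(self, conversation: Conversation) -> Conversation:
        # Build new Turn objects instead of editing the input in place.
        new_turns = tuple(
            replace(turn, content=" ".join(turn.content.split()))
            for turn in conversation.turns
        )
        return Conversation(turns=new_turns)


original = Conversation(turns=(Turn("user", "hello   \n  world"),))
result = CollapseWhitespace().process(original)
```

Using frozen dataclasses in the stand-ins makes accidental mutation raise an error, which is one way to enforce the rule during development.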
Why contextpress
Long chat histories inflate token usage, bury important facts (lost-in-the-middle), and repeat stale or redundant content. contextpress trims noise, merges resolved threads, and enforces budgets with deterministic Tier 1 NLP so applications stay within context limits without extra services.
Dependencies
- nltk — Tokenization, tagging, and light parsing for resolution and NLP helpers.
- scikit-learn — TF-IDF vectors and cosine similarity for repetition and RAG relevance.
- sumy — Extractive summarization for the recency stage.
- tiktoken — Token-accurate budgeting aligned with common model encodings.
Research and citing
For academic use, cite this package in your paper’s software or methods section. A machine-readable citation file is provided as CITATION.cff (GitHub and Zenodo can ingest it). Replace the placeholder repository URL in that file with your fork’s URL when you publish.
Extension and growth
- Custom stages — Subclass contextpress.strategies.base.BaseStrategy and plug in via a custom Pipeline subclass or future registry hooks.
- Tier 2 — Implement LLMBackend (summarize, deduplicate) for provider-specific semantic compression; failures fall back to Tier 1.
- Presets API — from contextpress.compression import VALID_STAGES, STAGE_ORDER for tooling and experiments.
- Profiles — configure(stage, ...) adjusts aggressiveness per stage; type="rag_doc" vs chat changes dedup and recency behavior.
Invalid inputs are rejected early where practical: for example, token_budget must be a positive int or None (booleans are not accepted).
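The boolean case is worth spelling out, since `bool` is a subclass of `int` in Python and `True` would otherwise pass as a budget of 1. The check below is a sketch of that rule; the function name and the exception types are assumptions, not the package's documented behavior.

```python
def validate_token_budget(token_budget):
    """Return the budget unchanged if it is None or a positive int; raise otherwise."""
    if token_budget is None:
        return None
    # bool is a subclass of int, so it has to be rejected explicitly.
    if isinstance(token_budget, bool) or not isinstance(token_budget, int):
        raise TypeError("token_budget must be a positive int or None")
    if token_budget <= 0:
        raise ValueError("token_budget must be a positive int or None")
    return token_budget


def rejects(value):
    """Helper for the examples: True when the value is refused."""
    try:
        validate_token_budget(value)
    except (TypeError, ValueError):
        return True
    return False
```

With this check, `rejects(True)`, `rejects(0)`, and `rejects("100")` all hold, while `validate_token_budget(2000)` returns 2000.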
Typing
The package includes py.typed (PEP 561) for static analysis in downstream projects.