Negotiated session codebook compression for LLMs — cut 20-60% of tokens losslessly
ContextPack
Negotiated session codebook compression for LLMs — cut tokens, keep answers.
ContextPack is an OpenAI-compatible proxy + library that compresses the context you send to any LLM (OpenAI, Anthropic, or any OpenAI-compatible API). Unlike one-sided compressors that simply throw bytes away, its signature feature is a negotiated session codebook: it negotiates a shared abbreviation dictionary with the model, so compression is lossless — the model confirms each symbol before it's used. Pure Python, no Rust or ML binaries, works out of the box.
Why ContextPack
- Negotiated codebook (lossless). ContextPack proposes `[CP_1] = <big chunk of context>`, the model acknowledges it, and every later turn sends the symbol instead of the chunk. Because the model confirmed the mapping, nothing is lost. Nothing else does this.
- Content-aware compression. Separate, format-specific compressors for JSON, code, logs, stacktraces, and query-aware prose; each strips redundancy the way that format allows.
- Lazy references. Huge blobs (over a configurable token threshold) are replaced with a reference; the model retrieves the full content on demand instead of re-sending it every turn.
- Token budget optimizer + semantic dedup. Fit a conversation into a target budget and drop near-duplicate content automatically.
- 4 ways to use it: HTTP proxy, Python library, CLI, or MCP server.
- Live analytics dashboard. Watch token savings accumulate in real time at `/dashboard`.
- Bring-your-own-key (BYOK). Each request can carry its own upstream key, so every user pays their own bill.
- Pure Python. No native binaries, no downloaded ML models, no GPU. `pip install -e .` and go.
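The lazy-reference bullet above can be pictured with a small sketch. This is not ContextPack's implementation: the class name `LazyRefStore`, the `[REF:...]` tag format, and the whitespace-based token estimate are all assumptions made for illustration. Only the default threshold of 100 tokens comes from the configuration table.

```python
import hashlib

# Default taken from the REF_THRESHOLD_TOKENS setting; everything else here is hypothetical.
REF_THRESHOLD_TOKENS = 100


def rough_token_count(text: str) -> int:
    """Crude token estimate: whitespace-separated words (an assumption, not a real tokenizer)."""
    return len(text.split())


class LazyRefStore:
    """Hypothetical store that swaps large blobs for short reference tags."""

    def __init__(self, threshold: int = REF_THRESHOLD_TOKENS) -> None:
        self.threshold = threshold
        self._blobs: dict[str, str] = {}

    def maybe_reference(self, content: str) -> str:
        """Return small content unchanged; replace large content with a reference tag."""
        if rough_token_count(content) <= self.threshold:
            return content
        ref_id = hashlib.sha256(content.encode("utf-8")).hexdigest()[:12]
        self._blobs[ref_id] = content
        return f"[REF:{ref_id}]"

    def resolve(self, ref_id: str) -> str:
        """Return the full content for a reference that was requested on demand."""
        return self._blobs[ref_id]


store = LazyRefStore()
small = "short note"
big = "log line " * 200  # 400 words, well over the threshold
small_out = store.maybe_reference(small)
big_out = store.maybe_reference(big)
print(small_out, big_out)
```

Hashing the content gives the same reference for the same blob, so a blob that recurs across turns would be stored once in this sketch.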
Benchmarks
The core claim — compression doesn't change the model's answers — is tested two ways against real datasets (GSM8K, SQuAD v2, TruthfulQA), with deterministic sampling (seed=42) and Wilson/normal confidence intervals. The two runs answer different questions and are reported separately (never blended):
- Scale: does it hold across thousands of diverse inputs? (`gpt-4o-mini`, full datasets)
- Strength: does it hold on a stronger model? (`gpt-4o`, N=100)
Scale — gpt-4o-mini, 6,557 cases (full datasets)
| Benchmark | N | Baseline | Compressed | Δ | Compression | Tokens saved |
|---|---|---|---|---|---|---|
| Codebook | 21 | 100% | 100% | ±0.0% | 57% | 6,852 |
| Workload (code/JSON/log) | 26 | 100% | 100% | ±0.0% | 24% | 499 |
| SQuAD v2 (prose) | 5,236 | 46.2% | 46.7% | +0.6% | 24% | 228,976 |
| GSM8K | 1,029 | 79.6% | 79.6% | ±0.0% | 0%¹ | 0 |
| TruthfulQA | 245 | 48.6% | 48.6% | ±0.0% | 0%¹ | 0 |
Strength — gpt-4o, N=100
| Benchmark | N | Baseline | Compressed | Δ | Compression |
|---|---|---|---|---|---|
| Codebook | 21 | 100% | 100% | ±0.0% | 57% |
| Workload | 26 | 100% | 100% | ±0.0% | 24% |
| SQuAD v2 | 100 | 70.7% | 70.5% | -0.2% | 20% |
| GSM8K | 100 | 88.0% | 88.0% | ±0.0% | 0%¹ |
| TruthfulQA | 100 | 56.0% | 56.0% | ±0.0% | 0%¹ |
Codebook, per scenario (the unique angle — lossless by construction, the model confirms every symbol):
| Scenario | Turns | Accuracy | Tokens saved |
|---|---|---|---|
| `auth_service` | 6 | 100% | 41–44% |
| `data_schema` | 7 | 100% | 58–59% |
| `api_spec` | 8 | 100% | 60% |
Verdict: compression preserves accuracy — every delta is within ±0.6%, and the codebook path is exactly lossless on both models. On gpt-4o, GSM8K (88%) lands in the same league as published baselines (~87%), confirming the setup is sound.
¹ GSM8K/TruthfulQA are short prose with nothing to compress, so compression is a deliberate no-op — those rows prove non-interference, not savings.
Honest notes: the scale run hit the account's rate limit at full concurrency, so 1,527 cases exhausted retries (SQuAD landed at 5,236 of 5,928, TruthfulQA at 245 of 790) — the completed cases are valid and the SQuAD CI is still tight [46–48%]. The two runs use different N by design (scale vs. strength); they are reported as separate tables with their own confidence intervals and never averaged together.
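For readers who want to sanity-check the confidence intervals, here is a minimal sketch of the standard Wilson score interval. It is not taken from the eval suite: the function name `wilson_interval` and the choice of z = 1.96 (roughly 95%) are assumptions. The example input is the gpt-4o GSM8K row, where 88.0% of N=100 corresponds to exactly 88 correct answers.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# gpt-4o on GSM8K: 88 correct out of 100 (from the strength table).
low, high = wilson_interval(successes=88, n=100)
print(f"{low:.1%} to {high:.1%}")
```

Running it on an N=100 row shows why the strength run and the scale run are kept apart: at that sample size the interval spans many percentage points, far wider than the sub-1% deltas being measured.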
Reproduce:
```bash
python -m evals.suite --tier 3 --n 200                  # scale-ish, mini, cheap
python -m evals.suite --tier 3 --n 100 --model gpt-4o   # strength
```
Quick start
```bash
git clone https://github.com/surya16122114/contextpack
cd contextpack
pip install -e .
cp .env.example .env   # add your upstream key (or use bring-your-own-key per request)
contextpack serve      # starts the proxy on :8000
```
Then point the OpenAI SDK at the proxy — no other code changes:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="sk-...",  # your real upstream key; passed through as BYOK
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize the attached spec..."}],
    extra_headers={"x-session-id": "my-session"},  # reuse a session to build a codebook
)
print(resp.choices[0].message.content)
```
Every response includes X-ContextPack-* headers (Original-Tokens, Compressed-Tokens, Savings, Codebook-Size) so you can see exactly what was saved.
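A small helper can pull those headers out of any response. The sketch below is illustrative only: the function name `contextpack_stats` is hypothetical, the full header names are inferred from the `X-ContextPack-*` pattern and suffixes listed above, and the example values are made up (the real value formats may differ).

```python
from collections.abc import Mapping

PREFIX = "x-contextpack-"


def contextpack_stats(headers: Mapping[str, str]) -> dict[str, str]:
    """Collect X-ContextPack-* headers into a dict keyed by the lowercased suffix."""
    stats = {}
    for name, value in headers.items():
        lowered = name.lower()  # HTTP header names are case-insensitive
        if lowered.startswith(PREFIX):
            stats[lowered[len(PREFIX):]] = value
    return stats


# Made-up example values, assuming token counts arrive as plain integers.
example_headers = {
    "Content-Type": "application/json",
    "X-ContextPack-Original-Tokens": "1200",
    "X-ContextPack-Compressed-Tokens": "700",
    "X-ContextPack-Codebook-Size": "3",
}
stats = contextpack_stats(example_headers)
saved = int(stats["original-tokens"]) - int(stats["compressed-tokens"])
print(stats, saved)
```

With the OpenAI Python SDK, response headers should be reachable through `client.chat.completions.with_raw_response.create(...)`, whose result exposes `.headers` and `.parse()`; the helper accepts any mapping, so it is not tied to that client.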
The 4 usage modes
1. HTTP Proxy
Drop-in OpenAI-compatible endpoint. Change base_url and you're done — works with Cursor, the OpenAI SDK, LangChain, or any OpenAI client.
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="sk-your-upstream-key",  # BYOK: this key is used for the upstream call
)
```
The Authorization: Bearer <key> header is treated as bring-your-own-key — ContextPack forwards it to the upstream provider instead of using the server's own key. Pass x-session-id to keep building the same codebook across calls.
2. Python library
Use the compression pipeline in-process, no server required:
```python
from contextpack import ContextPackClient

client = ContextPackClient(
    upstream_provider="openai",  # or "anthropic"
    upstream_api_key="sk-...",
    session_id="my-session",  # optional; auto-generated if omitted
)

response = client.chat(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "..."}],
)
print("Response:", response.content)
print("Tokens saved:", response.tokens_saved)
print("Codebook size:", response.codebook_size)
```
3. CLI
```bash
contextpack serve                   # start the proxy (--port, --host, --reload)
contextpack stats                   # global compression stats
contextpack stats <session-id>      # per-session stats
contextpack codebook <session-id>   # show the negotiated codebook for a session
contextpack mcp-install             # auto-configure the MCP server in your clients
```
4. MCP server
Expose ContextPack's compression as tools (compress_text, analyze_tokens, get_stats) to any MCP client.
Let ContextPack configure it for you:
```bash
contextpack mcp-install                   # configures Claude Desktop, Cursor, and Claude Code
contextpack mcp-install --client cursor   # just one client
contextpack mcp-install --dry-run         # preview without writing anything
```
Or add it by hand to your client's MCP config (e.g. Claude Desktop / Cursor):
```json
{
  "mcpServers": {
    "contextpack": {
      "command": "python",
      "args": ["-m", "contextpack.mcp_server"]
    }
  }
}
```
mcp-install writes exactly this block (using the active Python interpreter) to:
- Claude Desktop: `~/Library/Application Support/Claude/claude_desktop_config.json` (macOS), `%APPDATA%/Claude/claude_desktop_config.json` (Windows), `~/.config/Claude/claude_desktop_config.json` (Linux)
- Cursor: `~/.cursor/mcp.json`
- Claude Code: via `claude mcp add contextpack -- <python> -m contextpack.mcp_server` (if the `claude` CLI is on your PATH)
How it works
```text
┌────────┐      ┌──────────────────────────────────────────────────────────┐      ┌──────────────┐
│        │      │                       ContextPack                        │      │              │
│ Client │─────▶│ ContentRouter ─▶ Compressors ─▶ Codebook Negotiator      │─────▶│ Upstream LLM │
│ (SDK / │      │ (JSON/code/log/  (format-aware)  (negotiates [CP_n] with │      │ (OpenAI /    │
│  Cursor│      │  stacktrace/                      the model)             │      │  Anthropic)  │
│  /any) │◀─────│  prose)       ─▶ Lazy Refs ─▶ Budget Optimizer           │◀─────│              │
│        │      │                                                          │      │              │
└────────┘      │  ◀── response decompress (symbols → original content)    │      └──────────────┘
                └──────────────────────────────────────────────────────────┘
```
The negotiated codebook. When a chunk of context recurs (or is large enough to be worth it), ContextPack injects a one-time system message establishing a mapping — [CP_1] = <the full content> — and asks the model to confirm it. Once the model acknowledges, every subsequent turn sends just [CP_1] instead of the full chunk. The mapping lives for the session, so the savings compound the longer the conversation runs. On the way back, any symbols in the response are expanded to their original content before the client sees them. Because the dictionary is agreed with the model, this is lossless — the model knows exactly what each symbol means.
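The flow described above (propose, confirm, substitute, expand) can be sketched in a few lines. This is a toy model under stated assumptions, not ContextPack's negotiator: the class `SessionCodebook` and its method names are hypothetical, the model's acknowledgement is simulated by calling `confirm` directly, and matching is plain exact-string replacement.

```python
import re


class SessionCodebook:
    """Hypothetical in-memory codebook mapping [CP_n] symbols to chunks."""

    def __init__(self) -> None:
        self._pending: dict[str, str] = {}    # proposed, not yet acknowledged
        self._confirmed: dict[str, str] = {}  # acknowledged by the model
        self._next_id = 1

    def propose(self, chunk: str) -> str:
        """Reserve a symbol for a chunk; it is not used until confirmed."""
        symbol = f"[CP_{self._next_id}]"
        self._next_id += 1
        self._pending[symbol] = chunk
        return symbol

    def confirm(self, symbol: str) -> None:
        """Mark a symbol as acknowledged so later turns may use it."""
        self._confirmed[symbol] = self._pending.pop(symbol)

    def compress(self, text: str) -> str:
        """Replace every confirmed chunk with its symbol."""
        # Longest chunks first, so a chunk contained in another does not pre-empt it.
        for symbol, chunk in sorted(self._confirmed.items(), key=lambda kv: -len(kv[1])):
            text = text.replace(chunk, symbol)
        return text

    def expand(self, text: str) -> str:
        """Replace confirmed symbols with their original content."""
        return re.sub(
            r"\[CP_\d+\]",
            lambda m: self._confirmed.get(m.group(0), m.group(0)),
            text,
        )


book = SessionCodebook()
spec = "POST /v1/orders requires an idempotency key and a JSON body."
symbol = book.propose(spec)
turn = f"Given this spec: {spec} What status code signals a duplicate?"
before = book.compress(turn)  # unchanged: the symbol has not been confirmed yet
book.confirm(symbol)          # stands in for the model's acknowledgement
after = book.compress(turn)
restored = book.expand(after)
print(after)
```

Keeping proposed and confirmed entries in separate dictionaries mirrors the rule that a symbol is only sent once the model has acknowledged it. A real implementation would also need to handle text that already contains a literal `[CP_n]` string, which this sketch ignores.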
Configuration
Set these in .env (see .env.example) or as environment variables.
| Variable | Default | Description |
|---|---|---|
| `UPSTREAM_PROVIDER` | `anthropic` | `anthropic` or `openai` |
| `UPSTREAM_API_KEY` | `""` | Default upstream key (overridable per-request via BYOK) |
| `UPSTREAM_BASE_URL` | `https://api.anthropic.com` | Upstream API base URL |
| `PROXY_PORT` | `8000` | Port the proxy listens on |
| `PROXY_HOST` | `0.0.0.0` | Host the proxy binds to |
| `DB_PATH` | `~/.contextpack/contextpack.db` | SQLite store for sessions, codebooks, analytics |
| `CODEBOOK_MIN_FREQ` | `3` | Times a pattern must recur before it's a codebook candidate |
| `CODEBOOK_NEGOTIATE_AFTER` | `2` | Recurrences after which negotiation is triggered |
| `CODEBOOK_MAX_ENTRIES` | `50` | Max codebook entries per session |
| `CODEBOOK_MIN_TOKEN_SAVINGS` | `10` | Minimum net token savings for an entry to be worth it |
| `ENABLE_CROSS_SESSION` | `true` | Allow codebook reuse across sessions |
| `REF_THRESHOLD_TOKENS` | `100` | Token size above which content becomes a lazy reference |
| `ENABLE_LAZY_REFS` | `true` | Enable lazy reference loading |
| `ENABLE_SUMMARIZER` | `false` | Auto-summarize long content (off by default; costs upstream tokens) |
| `SUMMARIZE_THRESHOLD` | `500` | Token size above which auto-summarize kicks in |
| `TOKEN_BUDGET` | `8000` | Default token budget for the optimizer |
| `LOG_LEVEL` | `INFO` | Logging level |
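To make the variable types concrete, here is a rough sketch of reading a few of these settings from the environment with the documented defaults. It is not ContextPack's own loader: the function name `load_settings`, the output key names, and the rule for parsing booleans are assumptions; only the variable names and default values come from the table.

```python
import os
from collections.abc import Mapping

# Defaults copied from the configuration table; only a subset is shown.
DEFAULTS = {
    "UPSTREAM_PROVIDER": "anthropic",
    "PROXY_PORT": "8000",
    "CODEBOOK_MAX_ENTRIES": "50",
    "ENABLE_LAZY_REFS": "true",
    "REF_THRESHOLD_TOKENS": "100",
    "TOKEN_BUDGET": "8000",
}


def parse_bool(value: str) -> bool:
    """Assumed convention: 1/true/yes (any case) mean True, anything else False."""
    return value.strip().lower() in {"1", "true", "yes"}


def load_settings(env: Mapping[str, str] = os.environ) -> dict[str, object]:
    """Read settings from a mapping, falling back to the documented defaults."""
    raw = {key: env.get(key, default) for key, default in DEFAULTS.items()}
    return {
        "upstream_provider": raw["UPSTREAM_PROVIDER"],
        "proxy_port": int(raw["PROXY_PORT"]),
        "codebook_max_entries": int(raw["CODEBOOK_MAX_ENTRIES"]),
        "enable_lazy_refs": parse_bool(raw["ENABLE_LAZY_REFS"]),
        "ref_threshold_tokens": int(raw["REF_THRESHOLD_TOKENS"]),
        "token_budget": int(raw["TOKEN_BUDGET"]),
    }


# Override two values, as a .env file or shell export might.
settings = load_settings({"PROXY_PORT": "9000", "ENABLE_LAZY_REFS": "false"})
print(settings)
```

Passing the environment in as a mapping keeps the function easy to test without touching the real process environment.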
Dashboard
While the proxy is running, open http://localhost:8000/dashboard for a live view of token savings, per-session compression ratios, and active codebooks. Append ?session_id=<id> to focus on a single session.
How it compares
| | Doing nothing | Generic compressor | ContextPack |
|---|---|---|---|
| Token savings on repeated context | 0% | partial | 41–60% (codebook) |
| Lossless | n/a | no — drops bytes one-sidedly | yes — model confirms the dictionary |
| Content-aware (JSON/code/logs) | no | sometimes | yes |
| Lazy references for huge blobs | no | rarely | yes |
| Drop-in OpenAI-compatible proxy | n/a | varies | yes |
| Library / CLI / MCP server | n/a | varies | all three |
| Runtime footprint | none | often Rust/ML deps | pure Python |
The honest summary: generic compressors decide unilaterally what to throw away. ContextPack's negotiated codebook is the unique angle — it reaches an explicit agreement with the model about what each symbol means, which is why accuracy stays at 100% on the codebook benchmark while still saving up to 60% of tokens.
Development / running evals
```bash
pip install -e .            # core install
pip install -e ".[evals]"   # optional: pull datasets via HuggingFace instead of canonical URLs

# Tiered benchmark suite (datasets auto-download and cache under ~/.contextpack/eval_cache/)
python -m evals.suite --tier 1            # workload + codebook (fast)
python -m evals.suite --tier 2 --n 50     # + SQuAD context compression
python -m evals.suite --tier 3 --n 100    # full suite (SQuAD + GSM8K + TruthfulQA)
```
Results are printed as a rich table and written to evals/RESULTS.md.
License
MIT © 2026 Surya Vulavala. See LICENSE.