Negotiated session codebook compression for LLMs — cut 20-60% of tokens losslessly

ContextPack

Negotiated session codebook compression for LLMs — cut tokens, keep answers.

ContextPack is an OpenAI-compatible proxy + library that compresses the context you send to any LLM (OpenAI, Anthropic, or any OpenAI-compatible API). Unlike one-sided compressors that simply throw bytes away, its signature feature is a negotiated session codebook: it negotiates a shared abbreviation dictionary with the model, so compression is lossless — the model confirms each symbol before it's used. Pure Python, no Rust or ML binaries, works out of the box.


Why ContextPack

  • Negotiated codebook (lossless). ContextPack proposes [CP_1] = <big chunk of context>, the model acknowledges it, and every later turn sends the symbol instead of the chunk. Because the model confirmed the mapping, nothing is lost. Nothing else does this.
  • Content-aware compression. Separate, format-specific compressors for JSON, code, logs, stacktraces, and query-aware prose — each strips redundancy the way that format allows.
  • Lazy references. Huge blobs (over a configurable token threshold) are replaced with a reference; the model retrieves the full content on demand instead of re-sending it every turn.
  • Token budget optimizer + semantic dedup. Fit a conversation into a target budget and drop near-duplicate content automatically.
  • 4 ways to use it: HTTP proxy, Python library, CLI, or MCP server.
  • Live analytics dashboard. Watch token savings accumulate in real time at /dashboard.
  • Bring-your-own-key (BYOK). Each request can carry its own upstream key, so every user pays their own bill.
  • Pure Python. No native binaries, no downloaded ML models, no GPU. pip install -e . and go.

Benchmarks

The core claim — compression doesn't change the model's answers — is tested two ways against real datasets (GSM8K, SQuAD v2, TruthfulQA), with deterministic sampling (seed=42) and Wilson/normal confidence intervals. The two runs answer different questions and are reported separately (never blended):

  • Scale — does it hold across thousands of diverse inputs? (gpt-4o-mini, full datasets)
  • Strength — does it hold on a stronger model? (gpt-4o, N=100)

Scale: gpt-4o-mini, 6,557 completed cases (full datasets attempted)

| Benchmark | N | Baseline | Compressed | Δ | Compression | Tokens saved |
| --- | --- | --- | --- | --- | --- | --- |
| Codebook | 21 | 100% | 100% | ±0.0% | 57% | 6,852 |
| Workload (code/JSON/log) | 26 | 100% | 100% | ±0.0% | 24% | 499 |
| SQuAD v2 (prose) | 5,236 | 46.2% | 46.7% | +0.6% | 24% | 228,976 |
| GSM8K | 1,029 | 79.6% | 79.6% | ±0.0% | 0%¹ | 0 |
| TruthfulQA | 245 | 48.6% | 48.6% | ±0.0% | 0%¹ | 0 |

Strength: gpt-4o, N=100

| Benchmark | N | Baseline | Compressed | Δ | Compression |
| --- | --- | --- | --- | --- | --- |
| Codebook | 21 | 100% | 100% | ±0.0% | 57% |
| Workload | 26 | 100% | 100% | ±0.0% | 24% |
| SQuAD v2 | 100 | 70.7% | 70.5% | -0.2% | 20% |
| GSM8K | 100 | 88.0% | 88.0% | ±0.0% | 0%¹ |
| TruthfulQA | 100 | 56.0% | 56.0% | ±0.0% | 0%¹ |

Codebook, per scenario (the unique angle — lossless by construction, the model confirms every symbol):

| Scenario | Turns | Accuracy | Tokens saved (%) |
| --- | --- | --- | --- |
| auth_service | 6 | 100% | 41–44% |
| data_schema | 7 | 100% | 58–59% |
| api_spec | 8 | 100% | 60% |

Verdict: compression preserves accuracy. Every delta is within ±0.6 percentage points, and the codebook path scores 100% before and after compression on both models. On gpt-4o, the GSM8K baseline (88%) is in line with published results (~87%), which suggests the evaluation setup is sound.

¹ GSM8K/TruthfulQA are short prose with nothing to compress, so compression is a deliberate no-op — those rows prove non-interference, not savings.

Honest notes: the scale run hit the account's rate limit at full concurrency, so 1,527 cases exhausted retries (SQuAD landed at 5,236 of 5,928, GSM8K at 1,029 of 1,319, TruthfulQA at 245 of 790). The completed cases are valid and the SQuAD CI is still tight [46–48%]. The two runs use different N by design (scale vs. strength); they are reported as separate tables with their own confidence intervals and never averaged together.
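
For readers who want to check an interval themselves, here is a minimal sketch of the Wilson score interval named above. It is a generic textbook implementation, not the code in evals.suite, and the function name wilson_interval is an assumption made for illustration. The example treats the codebook row as 21 correct out of 21.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


# Codebook row: 100% accuracy on N=21, read here as 21 of 21 correct.
low, high = wilson_interval(21, 21)
print(f"{low:.3f} to {high:.3f}")
```

A perfect score on 21 cases still leaves a lower bound of roughly 85%, which is one reason the small-N rows are best read next to the larger runs.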

Reproduce:

python -m evals.suite --tier 3 --n 200                 # scale-ish, mini, cheap
python -m evals.suite --tier 3 --n 100 --model gpt-4o  # strength

Quick start

git clone https://github.com/surya16122114/contextpack
cd contextpack
pip install -e .
cp .env.example .env        # add your upstream key (or use bring-your-own-key per request)
contextpack serve           # starts the proxy on :8000

Then point the OpenAI SDK at the proxy — no other code changes:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="sk-...",   # your real upstream key; passed through as BYOK
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize the attached spec..."}],
    extra_headers={"x-session-id": "my-session"},   # reuse a session to build a codebook
)
print(resp.choices[0].message.content)

Every response includes X-ContextPack-* headers (Original-Tokens, Compressed-Tokens, Savings, Codebook-Size) so you can see exactly what was saved.
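
A minimal sketch of reading those headers, assuming the full names are X-ContextPack- followed by the suffixes listed above and that the token counts arrive as integer strings. The helper name contextpack_stats and the example values are hypothetical.

```python
from typing import Mapping

PREFIX = "x-contextpack-"  # assumed prefix, matched case-insensitively


def contextpack_stats(headers: Mapping[str, str]) -> dict[str, str]:
    """Collect X-ContextPack-* headers into a dict keyed by lowercase suffix."""
    stats = {}
    for name, value in headers.items():
        lowered = name.lower()
        if lowered.startswith(PREFIX):
            stats[lowered[len(PREFIX):]] = value
    return stats


# Example header values are made up for illustration.
example = {
    "Content-Type": "application/json",
    "X-ContextPack-Original-Tokens": "1200",
    "X-ContextPack-Compressed-Tokens": "700",
}
stats = contextpack_stats(example)
saved = int(stats["original-tokens"]) - int(stats["compressed-tokens"])
print(saved)
```

With the OpenAI Python SDK, the raw HTTP response is available through client.chat.completions.with_raw_response.create(...); its result exposes .headers for a helper like this and .parse() for the usual completion object.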


The 4 usage modes

1. HTTP Proxy

Drop-in OpenAI-compatible endpoint. Change base_url and you're done — works with Cursor, the OpenAI SDK, LangChain, or any OpenAI client.

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="sk-your-upstream-key",   # BYOK: this key is used for the upstream call
)

The Authorization: Bearer <key> header is treated as bring-your-own-key — ContextPack forwards it to the upstream provider instead of using the server's own key. Pass x-session-id to keep building the same codebook across calls.
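
For clients that do not use the OpenAI SDK, the same two headers can be set on a plain HTTP request. The sketch below only assembles the request and does not send it. The /chat/completions path follows the OpenAI convention implied by the base_url above, and build_chat_request is a hypothetical helper, not part of ContextPack.

```python
import json


def build_chat_request(base_url: str, api_key: str, session_id: str,
                       model: str, messages: list[dict]) -> tuple[str, dict, bytes]:
    """Return (url, headers, body) for an OpenAI-style chat completion call."""
    url = base_url.rstrip("/") + "/chat/completions"
    headers = {
        "Authorization": f"Bearer {api_key}",  # forwarded upstream as BYOK
        "x-session-id": session_id,            # reuse to keep one codebook
        "Content-Type": "application/json",
    }
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return url, headers, body


url, headers, body = build_chat_request(
    "http://localhost:8000/v1", "sk-your-upstream-key", "my-session",
    "gpt-4o-mini", [{"role": "user", "content": "hello"}],
)
print(url)
```

The three values can be handed to any HTTP client, for example urllib.request.Request(url, data=body, headers=headers, method="POST").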

2. Python library

Use the compression pipeline in-process, no server required:

from contextpack import ContextPackClient

client = ContextPackClient(
    upstream_provider="openai",       # or "anthropic"
    upstream_api_key="sk-...",
    session_id="my-session",          # optional; auto-generated if omitted
)

response = client.chat(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "..."}],
)

print("Response:", response.content)
print("Tokens saved:", response.tokens_saved)
print("Codebook size:", response.codebook_size)

3. CLI

contextpack serve                    # start the proxy (--port, --host, --reload)
contextpack stats                    # global compression stats
contextpack stats <session-id>       # per-session stats
contextpack codebook <session-id>    # show the negotiated codebook for a session
contextpack mcp-install              # auto-configure the MCP server in your clients

4. MCP server

Expose ContextPack's compression as tools (compress_text, analyze_tokens, get_stats) to any MCP client.

Let ContextPack configure it for you:

contextpack mcp-install                       # configures Claude Desktop, Cursor, and Claude Code
contextpack mcp-install --client cursor       # just one client
contextpack mcp-install --dry-run             # preview without writing anything

Or add it by hand to your client's MCP config (e.g. Claude Desktop / Cursor):

{
  "mcpServers": {
    "contextpack": {
      "command": "python",
      "args": ["-m", "contextpack.mcp_server"]
    }
  }
}

mcp-install writes this block, with command set to the active Python interpreter, to:

  • Claude Desktop: ~/Library/Application Support/Claude/claude_desktop_config.json (macOS), %APPDATA%/Claude/claude_desktop_config.json (Windows), ~/.config/Claude/claude_desktop_config.json (Linux)
  • Cursor: ~/.cursor/mcp.json
  • Claude Code — via claude mcp add contextpack -- <python> -m contextpack.mcp_server (if the claude CLI is on your PATH)
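
When editing one of these files by hand, the main risk is overwriting servers that are already configured. A minimal sketch of a non-destructive merge follows. It is not the code mcp-install runs, and add_contextpack_server is a hypothetical helper; only the JSON shape comes from the block above.

```python
import json
import sys


def add_contextpack_server(config: dict, python: str = sys.executable) -> dict:
    """Return a copy of an MCP client config with the contextpack entry set."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))  # keep existing servers
    servers["contextpack"] = {
        "command": python,
        "args": ["-m", "contextpack.mcp_server"],
    }
    merged["mcpServers"] = servers
    return merged


# An existing config with one unrelated (made-up) server entry.
existing = {"mcpServers": {"other": {"command": "node", "args": ["server.js"]}}}
result = add_contextpack_server(existing, python="python")
print(json.dumps(result, indent=2))
```

The function copies the top-level dict and the mcpServers dict, so the input is left unchanged and can be compared against the result before anything is written to disk.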

How it works

┌────────┐      ┌──────────────────────────────────────────────────────────┐      ┌──────────────┐
│        │      │                       ContextPack                         │      │              │
│ Client │─────▶│  ContentRouter ─▶ Compressors ─▶ Codebook Negotiator      │─────▶│ Upstream LLM │
│ (SDK / │      │  (JSON/code/log/  (format-aware)  (negotiates [CP_n] with  │      │ (OpenAI /    │
│  Cursor│      │   stacktrace/                       the model)             │      │  Anthropic)  │
│  /any) │◀─────│   prose)        ─▶ Lazy Refs ─▶ Budget Optimizer          │◀─────│              │
│        │      │                                                           │      │              │
└────────┘      │   ◀── response decompress (symbols → original content)    │      └──────────────┘
                └──────────────────────────────────────────────────────────┘

The negotiated codebook. When a chunk of context recurs (or is large enough to be worth it), ContextPack injects a one-time system message establishing a mapping — [CP_1] = <the full content> — and asks the model to confirm it. Once the model acknowledges, every subsequent turn sends just [CP_1] instead of the full chunk. The mapping lives for the session, so the savings compound the longer the conversation runs. On the way back, any symbols in the response are expanded to their original content before the client sees them. Because the dictionary is agreed with the model, this is lossless — the model knows exactly what each symbol means.
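
The propose, acknowledge, substitute, expand cycle described above can be sketched with a toy class. This is an illustration under simplifying assumptions, not ContextPack's implementation: the class and method names are hypothetical, matching is exact-string only, and acknowledgement is a plain method call rather than a model reply.

```python
import re


class ToyCodebook:
    """Toy session codebook: symbols are usable only after acknowledgement."""

    def __init__(self) -> None:
        self.pending: dict[str, str] = {}    # proposed, not yet confirmed
        self.confirmed: dict[str, str] = {}  # symbol -> original content
        self._next = 1

    def propose(self, content: str) -> str:
        symbol = f"[CP_{self._next}]"
        self._next += 1
        self.pending[symbol] = content
        return symbol

    def acknowledge(self, symbol: str) -> None:
        self.confirmed[symbol] = self.pending.pop(symbol)

    def compress(self, text: str) -> str:
        # Only confirmed mappings are ever substituted.
        for symbol, content in self.confirmed.items():
            text = text.replace(content, symbol)
        return text

    def expand(self, text: str) -> str:
        # Unknown symbols are left as they are.
        return re.sub(r"\[CP_\d+\]",
                      lambda m: self.confirmed.get(m.group(0), m.group(0)), text)


book = ToyCodebook()
spec = "POST /login returns a JWT valid for 15 minutes"
prompt = f"Given: {spec}. What is the TTL?"

symbol = book.propose(spec)
unconfirmed = book.compress(prompt)  # no change: mapping not acknowledged yet
book.acknowledge(symbol)
compressed = book.compress(prompt)
print(compressed)  # Given: [CP_1]. What is the TTL?
```

Calling book.expand(compressed) returns the original prompt, which is the round-trip property the lossless claim rests on.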


Configuration

Set these in .env (see .env.example) or as environment variables.

| Variable | Default | Description |
| --- | --- | --- |
| UPSTREAM_PROVIDER | anthropic | anthropic or openai |
| UPSTREAM_API_KEY | "" | Default upstream key (overridable per-request via BYOK) |
| UPSTREAM_BASE_URL | https://api.anthropic.com | Upstream API base URL |
| PROXY_PORT | 8000 | Port the proxy listens on |
| PROXY_HOST | 0.0.0.0 | Host the proxy binds to |
| DB_PATH | ~/.contextpack/contextpack.db | SQLite store for sessions, codebooks, analytics |
| CODEBOOK_MIN_FREQ | 3 | Times a pattern must recur before it's a codebook candidate |
| CODEBOOK_NEGOTIATE_AFTER | 2 | Recurrences after which negotiation is triggered |
| CODEBOOK_MAX_ENTRIES | 50 | Max codebook entries per session |
| CODEBOOK_MIN_TOKEN_SAVINGS | 10 | Minimum net token savings for an entry to be worth it |
| ENABLE_CROSS_SESSION | true | Allow codebook reuse across sessions |
| REF_THRESHOLD_TOKENS | 100 | Token size above which content becomes a lazy reference |
| ENABLE_LAZY_REFS | true | Enable lazy reference loading |
| ENABLE_SUMMARIZER | false | Auto-summarize long content (off by default; costs upstream tokens) |
| SUMMARIZE_THRESHOLD | 500 | Token size above which auto-summarize kicks in |
| TOKEN_BUDGET | 8000 | Default token budget for the optimizer |
| LOG_LEVEL | INFO | Logging level |
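
As an illustration of how these variables and defaults fit together, the sketch below reads the codebook-related ones from a mapping such as os.environ. The variable names and defaults come from the table; the dataclass, the function name, and the accepted boolean spellings are assumptions, and ContextPack's own parsing may differ.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class CodebookSettings:
    min_freq: int
    negotiate_after: int
    max_entries: int
    min_token_savings: int
    enable_cross_session: bool


def _as_bool(value: str) -> bool:
    # Accepted spellings are an assumption for this sketch.
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_codebook_settings(env: Mapping[str, str] = os.environ) -> CodebookSettings:
    """Read codebook settings, falling back to the documented defaults."""
    return CodebookSettings(
        min_freq=int(env.get("CODEBOOK_MIN_FREQ", "3")),
        negotiate_after=int(env.get("CODEBOOK_NEGOTIATE_AFTER", "2")),
        max_entries=int(env.get("CODEBOOK_MAX_ENTRIES", "50")),
        min_token_savings=int(env.get("CODEBOOK_MIN_TOKEN_SAVINGS", "10")),
        enable_cross_session=_as_bool(env.get("ENABLE_CROSS_SESSION", "true")),
    )


# Override one value and keep the documented defaults for the rest.
settings = load_codebook_settings({"CODEBOOK_MAX_ENTRIES": "100"})
print(settings.max_entries, settings.min_freq)
```

Passing a plain dict instead of os.environ keeps the function easy to test without touching the real environment.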

Dashboard

While the proxy is running, open http://localhost:8000/dashboard for a live view of token savings, per-session compression ratios, and active codebooks. Append ?session_id=<id> to focus on a single session.


How it compares

| Capability | Doing nothing | Generic compressor | ContextPack |
| --- | --- | --- | --- |
| Token savings on repeated context | 0% | partial | 41–60% (codebook) |
| Lossless | n/a | no (drops bytes one-sidedly) | yes (model confirms the dictionary) |
| Content-aware (JSON/code/logs) | no | sometimes | yes |
| Lazy references for huge blobs | no | rarely | yes |
| Drop-in OpenAI-compatible proxy | n/a | varies | yes |
| Library / CLI / MCP server | n/a | varies | all three |
| Runtime footprint | none | often Rust/ML deps | pure Python |

The honest summary: generic compressors decide unilaterally what to throw away. ContextPack's negotiated codebook is the unique angle — it reaches an explicit agreement with the model about what each symbol means, which is why accuracy stays at 100% on the codebook benchmark while still saving up to 60% of tokens.


Development / running evals

pip install -e .                  # core install
pip install -e ".[evals]"         # optional: pull datasets via HuggingFace instead of canonical URLs

# Tiered benchmark suite (datasets auto-download and cache under ~/.contextpack/eval_cache/)
python -m evals.suite --tier 1            # workload + codebook (fast)
python -m evals.suite --tier 2 --n 50     # + SQuAD context compression
python -m evals.suite --tier 3 --n 100    # full suite (SQuAD + GSM8K + TruthfulQA)

Results are printed as a rich table and written to evals/RESULTS.md.


License

MIT © 2026 Surya Vulavala. See LICENSE.
