Skip to main content

LLM-agnostic context optimization proxy — reduce session token usage without losing accuracy

Project description

ctx-gate

LLM-agnostic context optimization proxy. Sits between your IDE/tool and any LLM, automatically reducing session token consumption without losing the facts your next prompt depends on — and without changes to your workflow.

How much it saves depends entirely on the session: short Q&A compresses very little, while long sessions with verbose tool output compress heavily. ctx-gate doesn't ask you to trust that number — it ships a faithfulness harness that measures savings and information retention on every change.

Speaks two APIs:

  • POST /v1/messages — native Anthropic Messages API, with streaming. This is what Claude Code uses.
  • POST /v1/chat/completions — OpenAI-compatible, for Cursor, Continue.dev, Copilot Chat, and any OpenAI SDK.

Why

Claude Code (and most LLM coding tools) burn tokens faster than expected because:

  1. Context compounds — every message re-sends the entire conversation history
  2. Tool outputs are verbose — stack traces, grep results, and build logs dump thousands of tokens per call
  3. Files are re-injected in full — even when only one line changed
  4. Tasks bleed into each other — no automatic /clear between unrelated tasks
  5. Model overkill — Opus-level model used for trivial rename tasks

ctx-gate addresses all five at the proxy layer, transparently.


Architecture

Your IDE / Claude Code / CLI
        │
        ▼
┌─────────────────────────┐
│       ctx-gate          │  ← localhost:8080  (/v1/messages + /v1/chat/completions)
│                         │
│  ① Task Shift Detector  │  auto-clears context on new tasks
│  ② Context Compressor   │  summarizes old turns, keeps prompt-relevant ones,
│                         │  diffs files, strips noise, honors a token budget
│  ③ Model Router         │  fast/standard/advanced based on prompt complexity
│  ④ Checkpoint Writer    │  saves session state for restart recovery
└─────────────────────────┘
        │
        ▼
  Real LLM (Claude / OpenAI / Gemini / Ollama)

Quick Start

Install

git clone https://github.com/your-org/ctx-gate
cd ctx-gate
pip install fastapi uvicorn httpx

# Optional extras:
pip install "tiktoken"                  # exact token counts (else char/4 estimate)
pip install "lancedb sentence-transformers"   # RAG retrieval (else TF-IDF fallback)

Start the proxy

# Claude (default)
ANTHROPIC_API_KEY=sk-ant-... python ctx_gate.py serve --verbose

# OpenAI
OPENAI_API_KEY=sk-... python ctx_gate.py serve --provider=openai

# Local Ollama (no key needed)
python ctx_gate.py serve --provider=ollama

# Custom port
python ctx_gate.py serve --port=9000

Point your tool at ctx-gate

Claude Code (~/.claude/settings.json) — routed through the native /v1/messages endpoint, streaming included:

{
  "env": {
    "ANTHROPIC_BASE_URL": "http://127.0.0.1:8080"
  }
}

Claude Code sends its own API key/token through; ctx-gate forwards it upstream and falls back to the server's ANTHROPIC_API_KEY if none is present.

Cursor / Continue.dev / VS Code: Change the API base URL to http://127.0.0.1:8080/v1

Any OpenAI SDK:

from openai import OpenAI
client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="any")

MCP Integration (Claude Code native)

For tighter integration, register ctx-gate as a Claude Code MCP server:

// ~/.claude/claude_desktop_config.json
{
  "mcpServers": {
    "ctx-gate": {
      "command": "python",
      "args": ["/path/to/ctx-gate/mcp_server.py"]
    }
  }
}

This exposes ctx-gate's tools directly to Claude:

  • compress_context — compress a message list before sending
  • detect_task_shift — check if a new prompt is a new task
  • write_checkpoint — save session state
  • load_checkpoint — restore from last checkpoint
  • route_model — get recommended model tier for a prompt
  • get_stats — token savings for current session

What Each Module Does

① Task Shift Detector

Detects when you've moved to a new task using signal scoring:

  • Explicit language: "now let's work on X", "switch to", "start fresh"
  • File domain change (auth.ts → payments.go)
  • Topic keyword cluster divergence

On shift: clears conversation history, extracts key facts (language, frameworks, files) to carry forward into the new session's system prompt.

② Context Compressor

Applied on every request. Savings are per-strategy ceilings on the content they apply to, not whole-session numbers — actual savings depend on how much of your session is old turns / verbose output (see Faithfulness & Evaluation for measured results):

Strategy Savings (on applicable content)
Rolling summary of turns older than recency_window 40–70%
Relevance-scored retention (keep prompt-relevant old turns verbatim) preserves facts the summary would drop
File diff injection (only send changed lines) 80–95% on repeated file loads
Tool output truncation (keep first+last N lines) 60–90% on verbose outputs
Large code block compression 50–80% on pasted files
Adaptive token budget (--token-budget) drops least-relevant turns to fit a hard cap

Relevance-scored retention is what keeps compression honest: before summarizing old turns, the compressor keeps the few most relevant to your current prompt verbatim, so a fact you're now asking about isn't summarized away. This is why the bundled eval reports 100% fact-retention.

By default relevance is lexical (fast, zero-dependency keyword overlap). Pass --relevance embedding (needs the rag extra) to score by semantic similarity instead, which also catches paraphrases — e.g. a fact stored as "we standardized on PostgreSQL" is still kept when you later ask "which datastore did we pick?", despite sharing no words.

③ Model Router

Auto-selects model tier based on prompt complexity:

Signal Tier Example
"typo", "rename", "comment", "what is" fast (Haiku 4.5 / GPT-4o-mini) Fix the spelling in line 42
Default standard (Sonnet 4.6 / GPT-4o) Add a new API endpoint
"architecture", "cross-cutting", "refactor entire", "root cause" advanced (Opus 4.8 / o1) Redesign the auth system
Long context (>60k tokens) at least standard Any long session

The routed tier is authoritative — the proxy rewrites the upstream model accordingly. Per-tier model IDs are overridable in ModelRouter. Force one tier for everything with --model=advanced.

④ Checkpoint Writer

A proxy can't observe individual tool-call events, so checkpoints are derived from the conversation snapshot on each request and written every N requests (configurable) to .ctx-gate/:

{
  "task_description": "Build REST API",
  "decisions": ["Used FastAPI over Flask for async support"],
  "files_touched": ["src/main.py", "src/auth.py"],
  "next_steps": ["Add authentication middleware"],
  "turn_count": 23,
  "tool_call_count": 45
}

On session restart, the checkpoint is injected into the new system prompt automatically.


CLI Reference

ctx-gate serve                          # proxy on :8080, Claude provider
ctx-gate serve --provider=openai        # use OpenAI (also: gemini, ollama)
ctx-gate serve --model=advanced         # force Opus/o1 for everything
ctx-gate serve --recency-window=10      # keep 10 recent turns verbatim
ctx-gate serve --token-budget=20000     # hard cap; drop least-relevant turns to fit
ctx-gate serve --llm-summary            # summarize old turns with the fast tier
ctx-gate serve --rag --project-root=.   # inject only semantically-relevant code chunks
ctx-gate serve --relevance=embedding    # semantic relevance retention (needs rag extra)
ctx-gate serve --retries=3              # retry transient upstream failures (default 2)
ctx-gate serve --max-sessions=256       # cap isolated client sessions (LRU, default 128)
ctx-gate serve --verbose                # log compression stats per request

ctx-gate status                         # show stats (proxy must be running)
ctx-gate compress "some long text"      # test compressor
ctx-gate detect-shift "new prompt"      # test task shift detection

ctx-gate eval                           # run the faithfulness harness (savings + retention)
ctx-gate eval --json                    # machine-readable report
ctx-gate eval --min-retention 1.0       # exit nonzero if any fact is dropped (CI gate)
ctx-gate eval --llm                     # also score answer accuracy (needs ANTHROPIC_API_KEY)

Tuning

--recency-window (default: 6): Number of recent turns kept verbatim. Increase if you find Claude losing recent context; decrease for more aggressive compression.

--model: Force a tier. Use advanced for architecture sessions, fast for bulk scripting.

--token-budget (default: off): Hard cap on tokens per request. When set, ctx-gate drops the least prompt-relevant turns (never a system message or your current prompt) until the request fits.

--llm-summary (default: off): Summarize old turns with the fast-tier model instead of the local extractive summarizer. Higher-quality summaries at the cost of one extra fast-model call per request; falls back to extractive automatically if that call fails.

.claudeignore: Add to your project to prevent large directories from being indexed:

node_modules/
dist/
.git/
*.lock
__pycache__/

CLAUDE.md size: Every line in CLAUDE.md is prepended to every turn. Keep it under 2KB. ctx-gate will warn if it's larger.


Optional: Accurate Token Counting

Token counts (and therefore reported savings) use tiktoken when it's installed, and fall back to a char/4 estimate otherwise. The ctx-gate eval report tells you which backed a given run (token counts: accurate (tiktoken) vs estimated (char/4)).

pip install tiktoken                    # auto-detected and used when present

Optional: RAG-based Retrieval

For very large codebases, install the RAG extras to store file chunks in a vector DB:

pip install "ctx-gate[rag]"            # adds lancedb + sentence-transformers

Then pass --rag --project-root=<dir> to ctx-gate serve. The model receives only the chunks semantically relevant to each prompt, instead of full files. Without the extras, RAG still runs on a built-in TF-IDF + in-memory fallback (lower quality, zero extra dependencies).


Faithfulness & Evaluation

"Reduce tokens without losing accuracy" is only credible if it's measured, so ctx-gate ships a harness that does exactly that. Run it any time:

$ ctx-gate eval
ctx-gate faithfulness report
============================================================
scenario                      savings  retention
------------------------------------------------------------
database-choice                  0.7%      100%
auth-mechanism                   0.4%      100%
rate-limit-gap                   0.4%      100%
long-session-logs               90.9%      100%
recent-constraint                0.7%      100%
------------------------------------------------------------
MEAN                            18.6%      100%
token counts: accurate (tiktoken)

Each scenario buries a fact in an early turn (the kind compression summarizes), pads the history, then probes for that fact. The harness measures two things:

  • Layer A — fact retention (deterministic, no API key, CI-safe): after compression, do the facts the answer depends on still appear in the context the model would receive? This directly tests the compressor and is fully reproducible.
  • Layer B — answer accuracy (--llm, opt-in): ask the model the probe with full vs. compressed context and score each answer. The delta is the real signal — ~0 means compression didn't change the answer.

The numbers tell an honest story rather than a marketing one: short Q&A barely compresses (0.4–0.7%), while a long session full of verbose logs compresses 90.9% — and retention stays at 100% across the board because relevance-scored retention keeps the probed fact verbatim. Wire it into CI as a regression gate:

ctx-gate eval --min-retention 1.0   # exit nonzero if compression drops any required fact

The harness is designed to be able to fail the product's own claim — the test suite includes a case (relevance disabled) where a fact is dropped and the report flags it, so a passing report means something.


Stats

Aggregate counters are persisted to .ctx-gate/stats.json, so they survive proxy restarts:

$ ctx-gate status
{
  "instance_id": "a3f8c1d2",
  "active_sessions": 3,
  "requests_proxied": 47,
  "tokens_saved_estimate": 83400,
  "task_shifts_detected": 3
}

Multiple clients

Each client conversation gets its own isolated state (file-diff memory, task-shift history, checkpoint counters), keyed by an x-ctx-gate-session request header. Without the header everything shares a single default session, so single-client use is unchanged. Sessions are LRU-capped (--max-sessions, default 128).


Status & Known Limitations

ctx-gate is early but real — every feature documented above is wired into the request path and covered by tests (pytest -q → 113 passing). What that does and doesn't mean:

Working today

  • Native Anthropic /v1/messages (with streaming) and OpenAI /v1/chat/completions.
  • Streaming in both directions, including OpenAI-format clients streaming against a Claude backend — the Anthropic SSE is translated to OpenAI chat.completion.chunk SSE on the fly (text and tool-call deltas).
  • Task-shift clearing, context compression, relevance-scored retention (lexical or embedding/semantic), file-diff injection, tool-output truncation, model routing (routed model is applied upstream), token budgeting, snapshot checkpoints, optional RAG, optional LLM summary.
  • Per-session isolation (per-client state keyed by header), persistent stats (survive restarts), and upstream resilience (retries with backoff on transient errors, clean 502s instead of crashes).
  • A faithfulness harness with a CI gate.

Known limitations / rough edges

  • Checkpoints are snapshot-based, not event-based — a proxy can't see individual tool calls, so checkpoint counts are derived from conversation history, not a live tool-call stream.
  • Embedding relevance and --llm-summary add latency (an embedding pass / a fast-model call per compressed request) and are off by default; lexical relevance and extractive summary are the zero-dependency defaults.
  • Streaming retries are connection-only — a transient failure before the first byte is handled gracefully, but a drop mid-stream isn't retried (retrying would duplicate already-sent tokens).
  • Model IDs in ModelRouter are sensible defaults, not guaranteed current for every provider — override per tier as needed.

Roadmap

  • Publish to PyPI; expand the provider matrix and eval scenario coverage.

Development

pip install -e ".[dev,tokenizer]"      # editable install with test + tokenizer deps
pytest -q                              # run the suite (113 tests)
ctx-gate eval --min-retention 1.0      # run the faithfulness gate locally

CI (GitHub Actions, .github/workflows/ci.yml) runs the test suite and the faithfulness gate on every push and PR across Python 3.11–3.13, and uploads the faithfulness report as a build artifact. A change that drops a required fact fails CI — savings can't silently regress accuracy.


License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ctx_gate-0.1.0.tar.gz (61.7 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

ctx_gate-0.1.0-py3-none-any.whl (54.9 kB view details)

Uploaded Python 3

File details

Details for the file ctx_gate-0.1.0.tar.gz.

File metadata

  • Download URL: ctx_gate-0.1.0.tar.gz
  • Upload date:
  • Size: 61.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for ctx_gate-0.1.0.tar.gz
Algorithm Hash digest
SHA256 d8b136598d89edd92003812e603b929fce22dbb44fe9eedd8b1a279133fce521
MD5 997bc89a28c20fa226dd9e807b2670b8
BLAKE2b-256 d0fce68327444c471fce30b3d3191a3848013f34a4bd504d194dcf49d4ff3327

See more details on using hashes here.

File details

Details for the file ctx_gate-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: ctx_gate-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 54.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for ctx_gate-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 5307ff048271fc69e0caad33edf18574ce4790c2610ec3a2e8ca7b4136606a83
MD5 63dc5c3877c42d328afbd70e6f7543f0
BLAKE2b-256 bd145eaa9e4ea73e90b69fe651f547aacc403d5acaca8f7ee81dc8d135924dcd

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page