The context compression protocol for LLM inference. Eliminate 93% token redundancy in one line.
ContextLens
In the production system we measured, 93% of the characters sent to the LLM were identical, repeated data. ContextLens eliminates that waste.
The Problem
Every LLM application wastes tokens. Not 10%. Not 20%. In our measurement, 93%.
We measured a real production AI system (a live financial intelligence platform making thousands of decisions per day) and found:
| Metric | Value |
|---|---|
| Total messages sent | 8,584 |
| Unique messages | 599 |
| Total characters sent | 282,981 |
| Wasted characters | 263,234 |
| Redundancy | 93.0% |
The same reasoning, same context, same instructions — sent hundreds of times. Your LLM reads it fresh every single time. You pay for every single token.
This gets worse with agents. A 50-step agent loop can consume 800,000 tokens to complete a task that needs 50,000 tokens of actual information. That is 94% waste on a single complex task.
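The redundancy figure in the table above can be reproduced on your own logs with a few lines of Python. This is a minimal sketch, not the ContextLens benchmark code: it assumes "wasted" means every character beyond the first copy of each distinct message, and the function name `redundancy_report` is made up for illustration.

```python
from collections import Counter

def redundancy_report(messages):
    """Count total vs. unique messages and the share of repeated characters."""
    counts = Counter(messages)
    total_chars = sum(len(m) for m in messages)
    # Each distinct message has to be sent once; every further copy is waste.
    unique_chars = sum(len(m) for m in counts)
    wasted_chars = total_chars - unique_chars
    return {
        "total_messages": len(messages),
        "unique_messages": len(counts),
        "total_chars": total_chars,
        "wasted_chars": wasted_chars,
        "redundancy": wasted_chars / total_chars if total_chars else 0.0,
    }

# Toy log: one status line repeated 8 times plus two one-off messages.
log = ["Regime:CRISIS Score:-45.4 Top:GLD"] * 8 + ["rebalance", "hold"]
report = redundancy_report(log)
print(f"{report['redundancy']:.1%} of characters are repeats")
```

With the table's numbers the same ratio is 263,234 / 282,981, which rounds to 93.0%.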
The Solution
ContextLens is a context compression protocol that sits between your code and any LLM API. It intercepts every request, eliminates redundancy, and forwards only what the model actually needs.
One line. Zero configuration. Works with your existing code.
# Before
import anthropic
client = anthropic.Anthropic(api_key="...")
# After — one line change
import contextlens as cx
client = cx.wrap(anthropic.Anthropic(api_key="..."))
# Everything else stays identical
# Your costs drop immediately
What It Does
Layer 1 — Semantic Triage
Scores every message in your conversation history for relevance to the current prompt. Irrelevant history is compressed or archived. The model only sees what matters right now.
Score > 0.8 → Sent to model (Hot)
Score 0.3-0.8 → Compressed to summary (Warm)
Score < 0.3 → Archived locally (Cold)
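The three tiers can be sketched as a small classifier. This is an illustration only: the `relevance` function below uses a bag-of-words cosine similarity as a stand-in for a real relevance model, and both function names are hypothetical rather than part of the ContextLens API. Scores of exactly 0.8 or 0.3 are treated as Warm, which is one reading of the ranges above.

```python
import math
from collections import Counter

def relevance(message, prompt):
    """Toy relevance score: cosine similarity of word counts, 0.0 to 1.0."""
    a = Counter(message.lower().split())
    b = Counter(prompt.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def triage(score, hot=0.8, cold=0.3):
    """Map a relevance score to the Hot / Warm / Cold tiers."""
    if score > hot:
        return "hot"    # sent to the model as-is
    if score < cold:
        return "cold"   # archived locally
    return "warm"       # compressed to a summary

prompt = "rebalance the gold position"
history = [
    "rebalance the gold position",
    "the gold position looks stable",
    "what is the weather today",
]
for message in history:
    print(triage(relevance(message, prompt)), "|", message)
```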
Layer 2 — Deduplication Engine
Identical or near-identical content is stored once and referenced. If your agent re-reads the same system context 200 times, it is sent once and cached.
"Regime:CRISIS Score:-45.4 Top:GLD" × 202 times
→ "Regime:CRISIS [stable, 202 cycles, Score:-45.4]" × 1 time
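A rough sketch of the two ideas at work here: store each distinct message once under a content hash, and collapse runs of identical messages into a single annotated line. It handles exact duplicates only (near-duplicate matching is out of scope), its output format is simpler than the field-aware example above, and `DedupStore` and `collapse_runs` are illustrative names, not ContextLens classes.

```python
import hashlib
from itertools import groupby

class DedupStore:
    """Content-addressed store: identical text is kept once and referenced by key."""

    def __init__(self):
        self.blobs = {}

    def put(self, text):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
        self.blobs.setdefault(key, text)  # no-op if the text is already stored
        return key

    def get(self, key):
        return self.blobs[key]

def collapse_runs(messages):
    """Replace each run of consecutive identical messages with one annotated line."""
    collapsed = []
    for text, run in groupby(messages):
        count = sum(1 for _ in run)
        if count == 1:
            collapsed.append(text)
        else:
            collapsed.append(f"{text} [stable, {count} cycles]")
    return collapsed

store = DedupStore()
keys = [store.put("Regime:CRISIS Score:-45.4 Top:GLD") for _ in range(202)]
print(len(set(keys)), "stored copy for", len(keys), "sends")
print(collapse_runs(["Regime:CRISIS Score:-45.4 Top:GLD"] * 202))
```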
Layer 3 — Agent State Machine
Agent loops are the worst offenders. ContextLens understands agent-specific message types and applies intelligent rules automatically.
GOAL → Always kept (never removed)
TOOL_RESULT → Kept if referenced in last 3 steps, else summarised
TOOL_CALL → Only most recent per tool type kept
REASONING → Last 5 steps kept, rest archived
ERROR → Count + last error only ("Failed 3x: last=X")
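The five rules above can be sketched as a single pass over the agent history. Everything about the message schema here is an assumption made for the example: each message is a dict with `type`, `step` and `content`, tool calls carry a `tool` name, tool results carry an `id`, and any message may list the result ids it uses in `refs`. Truncation stands in for a real summariser.

```python
def prune_agent_history(history, current_step):
    """Apply the GOAL / TOOL_RESULT / TOOL_CALL / REASONING / ERROR rules."""
    # Tool-result ids referenced by any message in the last 3 steps.
    recent_refs = {
        ref
        for m in history
        if current_step - m["step"] < 3
        for ref in m.get("refs", ())
    }
    # Index of the most recent TOOL_CALL for each tool name.
    latest_call = {}
    for i, m in enumerate(history):
        if m["type"] == "TOOL_CALL":
            latest_call[m["tool"]] = i
    errors = [m for m in history if m["type"] == "ERROR"]

    kept = []
    for i, m in enumerate(history):
        kind = m["type"]
        if kind == "GOAL":
            kept.append(m)                       # always kept
        elif kind == "TOOL_RESULT":
            if m["id"] in recent_refs:
                kept.append(m)
            else:                                # stand-in for a real summary
                kept.append({**m, "content": f"[summary] {m['content'][:40]}"})
        elif kind == "TOOL_CALL":
            if latest_call[m["tool"]] == i:      # only the newest call per tool
                kept.append(m)
        elif kind == "REASONING":
            if current_step - m["step"] < 5:     # last 5 steps only
                kept.append(m)
        elif kind == "ERROR":
            if m is errors[-1]:                  # count + last error only
                text = f"Failed {len(errors)}x: last={m['content']}"
                kept.append({**m, "content": text})
        else:
            kept.append(m)                       # unknown types pass through
    return kept

history = [
    {"type": "GOAL", "step": 0, "content": "fix bug"},
    {"type": "TOOL_CALL", "step": 1, "tool": "read", "content": "read a.py"},
    {"type": "TOOL_RESULT", "step": 1, "id": "r1", "content": "contents of a.py"},
    {"type": "REASONING", "step": 2, "content": "old thought"},
    {"type": "ERROR", "step": 3, "content": "timeout"},
    {"type": "TOOL_CALL", "step": 7, "tool": "read", "content": "read b.py"},
    {"type": "TOOL_RESULT", "step": 7, "id": "r2", "content": "contents of b.py"},
    {"type": "REASONING", "step": 8, "content": "use r2", "refs": ["r2"]},
    {"type": "ERROR", "step": 8, "content": "boom"},
]
pruned = prune_agent_history(history, current_step=8)
for m in pruned:
    print(m["type"], "|", m["content"])
```

In this run the old `read` call and the stale reasoning step are dropped, the unreferenced result `r1` is summarised, and the two errors collapse into one line.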
Layer 4 — Prompt Cache Integration
Stable context blocks are automatically flagged for provider-side caching. You pay full price once. Every repeat is 90% cheaper.
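As an illustration of what "flagged for provider-side caching" can look like, the sketch below builds an Anthropic Messages API request in which the stable context is sent as system blocks and the last one carries a `cache_control` marker. The helper `build_request` is hypothetical and only assembles a dict; how ContextLens does this internally is not shown in this document.

```python
def build_request(stable_blocks, messages, model, max_tokens=1024):
    """Build Messages API keyword arguments with the stable prefix marked cacheable."""
    system = [{"type": "text", "text": text} for text in stable_blocks]
    if system:
        # A cache breakpoint on the last stable block covers the prefix up to it.
        system[-1]["cache_control"] = {"type": "ephemeral"}
    return {
        "model": model,
        "max_tokens": max_tokens,
        "system": system,
        "messages": messages,
    }

request = build_request(
    stable_blocks=["You are a trading assistant.", "Glossary: GLD = gold ETF."],
    messages=[{"role": "user", "content": "Hello"}],
    model="claude-sonnet-4-20250514",
)
print(request["system"][-1])
```

The resulting dict can be passed as `client.messages.create(**request)`. Providers typically enforce a minimum cacheable block size and may bill cache writes differently from cache reads, so check the current pricing page before relying on a specific discount.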
Results
| Use Case | Tokens Before | Tokens After | Reduction |
|---|---|---|---|
| Long conversation | 40,000 | 8,000 | 80% |
| 50-step agent loop | 800,000 | 120,000 | 85% |
| Code assistant session | 60,000 | 9,000 | 85% |
| Production AI system* | 282,981 chars | 19,747 chars | 93% |
*Measured on real production data
Installation
pip install contextlens
Usage
Basic — Anthropic
import anthropic
import contextlens as cx
client = cx.wrap(anthropic.Anthropic(api_key="..."))
# Identical to normal usage — compression is automatic
response = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=1024,
messages=[{"role": "user", "content": "Hello"}]
)
Basic — OpenAI
import openai
import contextlens as cx
client = cx.wrap(openai.OpenAI(api_key="..."))
# Works identically
Agent Loops
import contextlens as cx
# Wrap your existing agent — zero other changes
agent = cx.wrap_agent(your_langchain_agent)
# Agent now never hits context limits
# Long tasks cost same as short tasks
result = agent.run("Analyse this 10,000 line codebase and fix all bugs")
Token Budget Dial
# Economic — aggressive compression, maximum savings
client = cx.wrap(client, budget="economic")
# Balanced — smart compression, preserves nuance
client = cx.wrap(client, budget="balanced")
# Precise — minimal compression, maximum accuracy
client = cx.wrap(client, budget="precise")
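One way to picture the dial is as a set of presets that move the triage thresholds. The mapping below is purely hypothetical: only the `balanced` values (0.8 and 0.3) come from the Semantic Triage section, while the `economic` and `precise` numbers are invented for the example and are not ContextLens defaults.

```python
# Hypothetical presets: a higher "hot" cutoff means less history reaches the model.
BUDGET_PRESETS = {
    "economic": {"hot": 0.9, "cold": 0.5},  # invented values
    "balanced": {"hot": 0.8, "cold": 0.3},  # thresholds from the triage section
    "precise": {"hot": 0.6, "cold": 0.1},   # invented values
}

def tier_for(score, budget="balanced"):
    """Return the tier a relevance score falls into under a given budget."""
    try:
        preset = BUDGET_PRESETS[budget]
    except KeyError:
        raise ValueError(f"unknown budget: {budget!r}") from None
    if score > preset["hot"]:
        return "hot"
    if score < preset["cold"]:
        return "cold"
    return "warm"

for budget in BUDGET_PRESETS:
    print(budget, "->", tier_for(0.7, budget))
```

Under these made-up presets the same message (score 0.7) is sent in full with `precise` but only summarised with `balanced` or `economic`.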
See Your Savings
import contextlens as cx
client = cx.wrap(client, show_savings=True)
# After each call:
# ─────────────────────────────────────
# ContextLens | This request
# Sent: 4,847 tokens (↓ from 12,203)
# Saved: 7,356 tokens (~£0.018)
# Session: £1.24 saved | 147g CO₂ avoided
# ─────────────────────────────────────
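The arithmetic behind a per-request line like the one above is straightforward. The sketch below is not the ContextLens reporting code: `savings_summary` is a made-up helper, and the price of £2.45 per million input tokens is an assumption back-solved from the sample output, not a quoted rate.

```python
def savings_summary(tokens_before, tokens_after, price_per_million):
    """Return (tokens saved, cost saved, percent reduction) for one request."""
    saved = tokens_before - tokens_after
    cost_saved = saved / 1_000_000 * price_per_million
    percent = saved / tokens_before * 100 if tokens_before else 0.0
    return saved, cost_saved, percent

# Assumed price: 2.45 (in GBP) per million input tokens.
saved, cost_saved, percent = savings_summary(12_203, 4_847, price_per_million=2.45)
print(f"Sent:  {4_847:,} tokens (↓ from {12_203:,})")
print(f"Saved: {saved:,} tokens (~£{cost_saved:.3f}, {percent:.1f}% smaller)")
```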
For Infrastructure / Data Centers
ContextLens runs as a drop-in proxy in front of any inference server.
docker run -p 8080:8080 \
-e UPSTREAM=http://your-vllm-instance:8000 \
contextlens/proxy:latest
Change one line in your application:
ANTHROPIC_BASE_URL=http://your-proxy:8080
Result: Every GPU on your cluster handles more concurrent users. Same hardware. Less energy. More revenue per chip.
vLLM Plugin (planned for v2.0, see Roadmap)
vllm serve meta-llama/Llama-3-70b --contextlens-plugin enabled
Context is compressed before the KV cache is allocated. The GPU never processes redundant tokens.
The Carbon Impact
Every redundant token burns real energy. At scale:
- 10 million API calls/day × 60% average compression = 6 million fewer GPU-seconds per day (assuming roughly one GPU-second per call)
- Equivalent to powering a city block for a year — eliminated entirely
- ContextLens tracks your CO₂ avoided in real time
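The GPU-seconds figure in the list above is a back-of-envelope estimate, and the sketch below makes its assumptions explicit. It assumes about one GPU-second per call and that GPU time shrinks in proportion to the tokens removed, which is a simplification. The 700 W power draw used for the energy conversion is a placeholder, not a measurement.

```python
def gpu_seconds_saved(calls_per_day, compression, gpu_seconds_per_call=1.0):
    """GPU-seconds avoided per day, assuming time scales with tokens processed."""
    return calls_per_day * compression * gpu_seconds_per_call

def kwh_saved(gpu_seconds, gpu_watts):
    """Convert GPU-seconds at a given power draw to kilowatt-hours."""
    return gpu_seconds * gpu_watts / 3_600_000  # watt-seconds per kWh

seconds = gpu_seconds_saved(10_000_000, 0.60)
energy = kwh_saved(seconds, gpu_watts=700)  # assumed power draw per GPU
print(f"{seconds:,.0f} GPU-seconds/day, about {energy:,.0f} kWh/day")
```

Swap in your own per-call GPU time and hardware power draw to get a figure that reflects your deployment.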
AI inference is one of the fastest-growing sources of electricity demand. Cutting context waste is one of the fastest fixes.
The Open Protocol: CXP
ContextLens is built on the Context Exchange Protocol (CXP) — an open specification for how context moves between applications and language models.
The spec is free. Anyone can implement it. Any model provider can support it.
→ Read the CXP v0.1 Specification
Benchmarks
All benchmarks are reproducible. The methodology is open source.
→ See benchmark methodology
→ Run benchmarks on your own data
Roadmap
| Version | Feature | Status |
|---|---|---|
| v0.1 | Deduplication engine + basic proxy | 🔨 Building |
| v0.2 | Semantic triage (MiniLM embeddings) | Planned |
| v0.3 | Agent state machine | Planned |
| v0.4 | Prompt cache integration | Planned |
| v1.0 | Accuracy guarantee + CXP spec final | Planned |
| v2.0 | vLLM hardware plugin | Planned |
Why Free
Context waste is an infrastructure problem that affects every developer, every company, and the planet. A solution locked behind a paywall does not fix the infrastructure.
ContextLens is free for developers. Forever.
Infrastructure deployments (data centers, GPU clouds, enterprise on-premise) are the paid tier. They save millions. They can pay.
Contributing
The protocol is open. Implementations are welcome.
git clone https://github.com/Usama1909/contextlens
cd contextlens
pip install -e ".[dev]"
pytest tests/
License
MIT — use it, fork it, build on it.
Built by @Usama1909
Founding benchmark measured on ARIA — a live autonomous financial intelligence system