The Context Optimization Layer for LLM Applications - Cut costs by 50-90%
Headroom
Tool outputs are 70-95% redundant boilerplate. Headroom compresses that away.
Demo
Does It Actually Work? A Real Test
The setup: 100 production log entries. One critical error buried at position 67.
BEFORE: 100 log entries (18,952 chars):
[
{"timestamp": "2024-12-15T00:00:00Z", "level": "INFO", "service": "api-gateway", "message": "Request processed successfully - latency=50ms", "request_id": "req-000000", "status_code": 200},
{"timestamp": "2024-12-15T01:01:00Z", "level": "INFO", "service": "user-service", "message": "Request processed successfully - latency=51ms", "request_id": "req-000001", "status_code": 200},
{"timestamp": "2024-12-15T02:02:00Z", "level": "INFO", "service": "inventory", "message": "Request processed successfully - latency=52ms", "request_id": "req-000002", "status_code": 200},
// ... 64 more INFO entries ...
{"timestamp": "2024-12-15T03:47:23Z", "level": "FATAL", "service": "payment-gateway", "message": "Connection pool exhausted", "error_code": "PG-5523", "resolution": "Increase max_connections to 500 in config/database.yml", "affected_transactions": 1847},
// ... 32 more INFO entries ...
]
AFTER: Headroom compresses to 6 entries (1,155 chars):
[
{"timestamp": "2024-12-15T00:00:00Z", "level": "INFO", "service": "api-gateway", ...},
{"timestamp": "2024-12-15T01:01:00Z", "level": "INFO", "service": "user-service", ...},
{"timestamp": "2024-12-15T02:02:00Z", "level": "INFO", "service": "inventory", ...},
{"timestamp": "2024-12-15T03:47:23Z", "level": "FATAL", "service": "payment-gateway", "error_code": "PG-5523", "resolution": "Increase max_connections to 500 in config/database.yml", "affected_transactions": 1847},
{"timestamp": "2024-12-15T02:38:00Z", "level": "INFO", "service": "inventory", ...},
{"timestamp": "2024-12-15T03:39:00Z", "level": "INFO", "service": "auth", ...}
]
What happened: First 3 items + the FATAL error + last 2 items. The critical error at position 67 was automatically preserved.
The question we asked Claude: "What caused the outage? What's the error code? What's the fix?"
| | Baseline | Headroom |
|---|---|---|
| Input tokens | 10,144 | 1,260 |
| Correct answers | 4/4 | 4/4 |
Both responses: "payment-gateway service, error PG-5523, fix: Increase max_connections to 500, 1,847 transactions affected"
87.6% fewer tokens. Same answer.
Run it yourself: python examples/needle_in_haystack_test.py
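The keep-first/keep-anomalies/keep-last behavior above can be sketched in a few lines. This is a simplified illustration, not Headroom's actual SmartCrusher implementation; the level-based anomaly check and the `head`/`tail` parameters are assumptions for the sketch:

```python
def crush(entries, head=3, tail=2):
    """Keep the first `head` items, the last `tail` items, and any
    anomalous items in between (here: entries whose log level is not
    routine). Everything else is dropped."""
    if len(entries) <= head + tail:
        return entries
    middle = entries[head:len(entries) - tail]
    anomalies = [e for e in middle if e.get("level") not in ("INFO", "DEBUG")]
    return entries[:head] + anomalies + entries[-tail:]

logs = [{"level": "INFO", "id": i} for i in range(100)]
logs[67] = {"level": "FATAL", "id": 67, "error_code": "PG-5523"}
kept = crush(logs)
# 6 entries survive: 3 head + 1 FATAL (position 67) + 2 tail
```

The point of the heuristic is that position never hides an anomaly: the FATAL entry is kept because it is unusual, not because of where it sits.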
Accuracy Benchmarks
Headroom's guarantee: compress without losing accuracy.
We validate against established open-source benchmarks. Full methodology and reproducible tests: Benchmarks Documentation
| Benchmark | Metric | Result | Status |
|---|---|---|---|
| Scrapinghub Article Extraction | F1 Score | 0.919 (baseline: 0.958) | :white_check_mark: |
| Scrapinghub Article Extraction | Recall | 98.2% | :white_check_mark: |
| Scrapinghub Article Extraction | Compression | 94.9% | :white_check_mark: |
| SmartCrusher (JSON) | Accuracy | 100% (4/4 correct) | :white_check_mark: |
| SmartCrusher (JSON) | Compression | 87.6% | :white_check_mark: |
| Multi-Tool Agent | Accuracy | 100% (all findings) | :white_check_mark: |
| Multi-Tool Agent | Compression | 76.3% | :white_check_mark: |
Why recall matters most: For LLM applications, capturing all relevant information is critical. 98.2% recall means nearly all content is preserved — LLMs can answer questions accurately from compressed context.
Run benchmarks yourself
# Install with benchmark dependencies
pip install "headroom-ai[evals,html]" datasets
# Run HTML extraction benchmark (no API key needed)
pytest tests/test_evals/test_html_oss_benchmarks.py::TestExtractionBenchmark -v -s
# Run QA accuracy tests (requires OPENAI_API_KEY)
pytest tests/test_evals/test_html_oss_benchmarks.py::TestQAAccuracyPreservation -v -s
Multi-Tool Agent Test: Real Function Calling
The setup: An Agno agent with 4 tools (GitHub Issues, ArXiv Papers, Code Search, Database Logs) investigating a memory leak. Total tool output: 62,323 chars (~15,580 tokens).
from agno.agent import Agent
from agno.models.anthropic import Claude
from headroom.integrations.agno import HeadroomAgnoModel
# Wrap your model - that's it!
base_model = Claude(id="claude-sonnet-4-20250514")
model = HeadroomAgnoModel(wrapped_model=base_model)
agent = Agent(model=model, tools=[search_github, search_arxiv, search_code, query_db])
response = agent.run("Investigate the memory leak and recommend a fix")
Results with Claude Sonnet:
| | Baseline | Headroom |
|---|---|---|
| Tokens sent to API | 15,662 | 6,100 |
| API requests | 2 | 2 |
| Tool calls | 4 | 4 |
| Duration | 26.5s | 27.0s |
76.3% fewer tokens. Same comprehensive answer.
Both found: Issue #42 (memory leak), the cleanup_worker() fix, OutOfMemoryError logs (7.8GB/8GB, 847 threads), and relevant research papers.
Run it yourself: python examples/multi_tool_agent_test.py
How It Works
Headroom optimizes LLM context before it hits the provider — without changing your agent logic or tools.
flowchart LR
User["Your App"]
Entry["Headroom"]
Transform["Context<br/>Optimization"]
LLM["LLM Provider"]
Response["Response"]
User --> Entry --> Transform --> LLM --> Response
Inside Headroom
flowchart TB
subgraph Pipeline["Transform Pipeline"]
CA["Cache Aligner<br/><i>Stabilizes dynamic tokens</i>"]
SC["Smart Crusher<br/><i>Removes redundant tool output</i>"]
CM["Intelligent Context<br/><i>Score-based token fitting</i>"]
CA --> SC --> CM
end
subgraph CCR["CCR: Compress-Cache-Retrieve"]
Store[("Compressed<br/>Store")]
Tool["Retrieve Tool"]
Tool <--> Store
end
LLM["LLM Provider"]
CM --> LLM
SC -. "Stores originals" .-> Store
LLM -. "Requests full context<br/>if needed" .-> Tool
Headroom never throws data away. It compresses aggressively and retrieves precisely.
What actually happens
- Headroom intercepts context — Tool outputs, logs, search results, and intermediate agent steps.
- Dynamic content is stabilized — Timestamps, UUIDs, and request IDs are normalized so prompts cache cleanly.
- Low-signal content is removed — Repetitive or redundant data is crushed, not truncated.
- Original data is preserved — Full content is stored separately and retrieved only if the LLM asks.
- Provider caches finally work — Headroom aligns prompts so OpenAI, Anthropic, and Google caches actually hit.
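The stabilization step can be illustrated with simple regex normalization. This is an illustrative sketch only; Headroom's actual CacheAligner logic is more involved, and the placeholder names (`<TS>`, `<UUID>`, `<REQ>`) are assumptions:

```python
import re

def stabilize(text: str) -> str:
    """Replace volatile tokens with stable placeholders so two
    otherwise-identical prompts compare (and cache) the same."""
    # ISO-8601 timestamps -> <TS>
    text = re.sub(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?Z?", "<TS>", text)
    # UUIDs -> <UUID>
    text = re.sub(
        r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
        "<UUID>", text, flags=re.I,
    )
    # Request IDs like req-000123 -> <REQ>
    text = re.sub(r"req-\d+", "<REQ>", text)
    return text

a = stabilize('{"timestamp": "2024-12-15T00:00:00Z", "request_id": "req-000000"}')
b = stabilize('{"timestamp": "2024-12-15T01:01:00Z", "request_id": "req-000001"}')
# a == b: identical prefixes, so provider prompt caches can hit
```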
For deep technical details, see Architecture Documentation.
Why Headroom?
- Zero code changes - works as a transparent proxy
- 47-92% savings - depends on your workload (tool-heavy = more savings)
- Image compression - 40-90% reduction via trained ML router (OpenAI, Anthropic, Google)
- Reversible compression - LLM retrieves original data via CCR
- Content-aware - code, logs, JSON, images each handled optimally
- Provider caching - automatic prefix optimization for cache hits
- Framework native - LangChain, Agno, MCP, agents supported
30-Second Quickstart
Option 1: Proxy (Zero Code Changes)
pip install "headroom-ai[all]" # Recommended for best performance
headroom proxy --port 8787
Note: First startup downloads ML models (~500MB) for optimal compression. This is a one-time download.
Dashboard: Open http://localhost:8787/dashboard to see real-time stats, token savings, and request history.
Point your tools at the proxy:
# Claude Code
ANTHROPIC_BASE_URL=http://localhost:8787 claude
# Any OpenAI-compatible client
OPENAI_BASE_URL=http://localhost:8787/v1 cursor
Enable Persistent Memory - Claude remembers across conversations:
headroom proxy --memory
Memory auto-detects your provider (Anthropic, OpenAI, Gemini) and uses the appropriate format:
- Anthropic: Uses native memory tool (memory_20250818) - works with Claude Code subscriptions
- OpenAI/Gemini/Others: Uses function calling format
- All providers share the same semantic vector store for search
Set x-headroom-user-id header for per-user memory isolation (defaults to 'default').
Claude Code Subscription Users - Use MCP for CCR (Compress-Cache-Retrieve):
If you use Claude Code with a subscription (not API key), you need MCP to enable the headroom_retrieve tool:
# One-time setup
pip install "headroom-ai[mcp]"
headroom mcp install
# Every time you code
headroom proxy # Terminal 1
claude # Terminal 2 - now has headroom_retrieve!
What this does:
- Configures Claude Code to use Headroom's MCP server (~/.claude/mcp.json)
- When the proxy compresses large tool outputs, Claude sees markers like [47 items compressed... hash=abc123]
- Claude can call headroom_retrieve to get the full original content when needed
Check your setup:
headroom mcp status
Why MCP for subscriptions?
- API users can inject custom tools directly via the Messages API
- Subscription users use Claude Code's built-in tool set and can't inject tools programmatically
- MCP (Model Context Protocol) is Claude's official way to extend tools - it works with subscriptions
The MCP server exposes headroom_retrieve so Claude can request uncompressed content when the compressed summary isn't enough.
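The compress-cache-retrieve loop can be sketched as a toy store: compression replaces the payload with a marker, and the original stays retrievable by hash. The marker format and function names here are assumptions for illustration, not Headroom's actual API:

```python
import hashlib
import json

# Toy CCR store: marker out, original kept on the side.
STORE: dict[str, str] = {}

def compress(payload: str, kept_items: int, total_items: int) -> str:
    """Store the full payload and return a compact marker for the LLM."""
    digest = hashlib.sha256(payload.encode()).hexdigest()[:6]
    STORE[digest] = payload
    return f"[{total_items - kept_items} items compressed... hash={digest}]"

def retrieve(digest: str) -> str:
    """What a retrieve tool would do when the LLM asks for the original."""
    return STORE[digest]

original = json.dumps([{"i": i} for i in range(50)])
marker = compress(original, kept_items=3, total_items=50)
# marker looks like "[47 items compressed... hash=a1b2c3]"
```

Because the original is stored, not discarded, compression can be aggressive: anything the model turns out to need is one tool call away.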
Using AWS Bedrock, Google Vertex, or Azure? Route through Headroom:
# AWS Bedrock - Terminal 1: Start proxy
export AWS_ACCESS_KEY_ID="AKIA..."
export AWS_SECRET_ACCESS_KEY="..."
export AWS_REGION="us-east-1"
headroom proxy --backend bedrock --region us-east-1
# AWS Bedrock - Terminal 2: Run Claude Code
export ANTHROPIC_API_KEY="sk-ant-dummy" # Any value works! Headroom ignores it.
export ANTHROPIC_BASE_URL="http://localhost:8787"
# IMPORTANT: Do NOT set CLAUDE_CODE_USE_BEDROCK=1 (Headroom handles Bedrock routing)
claude
VS Code settings.json for Bedrock:
{
"claudeCode.environmentVariables": [
{ "name": "ANTHROPIC_API_KEY", "value": "sk-ant-dummy" },
{ "name": "ANTHROPIC_BASE_URL", "value": "http://localhost:8787" },
{ "name": "AWS_ACCESS_KEY_ID", "value": "AKIA..." },
{ "name": "AWS_SECRET_ACCESS_KEY", "value": "..." },
{ "name": "AWS_REGION", "value": "us-east-1" }
]
}
Do NOT include CLAUDE_CODE_USE_BEDROCK - Headroom handles the Bedrock routing.
Using OpenRouter? Access 400+ models through a single API:
# OpenRouter - Terminal 1: Start proxy
export OPENROUTER_API_KEY="sk-or-v1-..."
headroom proxy --backend openrouter
# OpenRouter - Terminal 2: Run your client
export ANTHROPIC_API_KEY="sk-ant-dummy" # Any value works! Headroom ignores it.
export ANTHROPIC_BASE_URL="http://localhost:8787"
# Use OpenRouter model names in your requests:
# - anthropic/claude-3.5-sonnet
# - openai/gpt-4o
# - google/gemini-pro
# - meta-llama/llama-3-70b-instruct
# See all models: https://openrouter.ai/models
# Google Vertex AI
headroom proxy --backend vertex_ai --region us-central1
# Azure OpenAI
headroom proxy --backend azure --region eastus
Option 2: LangChain Integration
pip install "headroom-ai[langchain]"
from langchain_openai import ChatOpenAI
from headroom.integrations import HeadroomChatModel
# Wrap your model - that's it!
llm = HeadroomChatModel(ChatOpenAI(model="gpt-4o"))
# Use exactly like before
response = llm.invoke("Hello!")
See the full LangChain Integration Guide for memory, retrievers, agents, and more.
Option 3: Agno Integration
pip install "headroom-ai[agno]"
from agno.agent import Agent
from agno.models.openai import OpenAIChat
from headroom.integrations.agno import HeadroomAgnoModel
# Wrap your model - that's it!
model = HeadroomAgnoModel(OpenAIChat(id="gpt-4o"))
agent = Agent(model=model)
# Use exactly like before
response = agent.run("Hello!")
# Check savings
print(f"Tokens saved: {model.total_tokens_saved}")
See the full Agno Integration Guide for hooks, multi-provider support, and more.
Framework Integrations
| Framework | Integration | Docs |
|---|---|---|
| LangChain | HeadroomChatModel, memory, retrievers, agents | Guide |
| Agno | HeadroomAgnoModel, hooks, multi-provider | Guide |
| MCP | Claude Code subscription support via headroom mcp install | Guide |
| Any OpenAI Client | Proxy server | Guide |
Features
| Feature | Description | Docs |
|---|---|---|
| Image Compression | 40-90% token reduction for images via trained ML router | Image Compression |
| Memory | Persistent memory across conversations (zero-latency inline extraction) | Memory |
| Universal Compression | ML-based content detection + structure-preserving compression | Compression |
| SmartCrusher | Compresses JSON tool outputs statistically | Transforms |
| CacheAligner | Stabilizes prefixes for provider caching | Transforms |
| IntelligentContext | Score-based context dropping with TOIN-learned importance | Transforms |
| CCR | Reversible compression with automatic retrieval | CCR Guide |
| MCP Server | Claude Code subscription support via headroom mcp install | MCP Guide |
| LangChain | Memory, retrievers, agents, streaming | LangChain |
| Agno | Agent framework integration with hooks | Agno |
| Text Utilities | Opt-in compression for search/logs | Text Compression |
| LLMLingua-2 | ML-based 20x compression (opt-in) | LLMLingua |
| Code-Aware | AST-based code compression (tree-sitter) | Transforms |
| Evals Framework | Prove compression preserves accuracy (12+ datasets) | Evals |
Evaluation Framework: Prove It Works
Skeptical? Good. We built a comprehensive evaluation framework to prove compression preserves accuracy.
# Install evals
pip install "headroom-ai[evals]"
# Quick sanity check (5 samples)
python -m headroom.evals quick
# Run on real datasets
python -m headroom.evals benchmark --dataset hotpotqa -n 100
How Evals Work
Original Context ───► LLM ───► Response A
Compressed Context ─► LLM ───► Response B

Compare A vs B:
─────────────────
F1 Score: 0.95
Semantic Similarity: 0.97
Ground Truth Match: ✓
─────────────────
PASS: Accuracy preserved
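The F1 comparison can be sketched as token-level overlap between the two responses. This is a simplified illustration of how QA evals typically score answers, not necessarily Headroom's exact metric:

```python
from collections import Counter

def token_f1(pred: str, gold: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over the
    multiset of whitespace tokens shared by the two answers."""
    p, g = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)

a = "error PG-5523 fix increase max_connections to 500"
b = "the error PG-5523 fix is increase max_connections to 500"
score = token_f1(a, b)  # ≈ 0.875: all of a's tokens appear in b
```

High recall on the compressed-context answer is the key signal: it means the facts needed for a correct answer survived compression.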
Available Datasets (12+)
| Category | Datasets |
|---|---|
| RAG | HotpotQA, Natural Questions, TriviaQA, MS MARCO, SQuAD |
| Long Context | LongBench (4K-128K tokens), NarrativeQA |
| Tool Use | BFCL (function calling), ToolBench, Built-in samples |
| Code | CodeSearchNet, HumanEval |
CI Integration
# GitHub Actions
- name: Run Compression Evals
run: python -m headroom.evals quick -n 20
env:
ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
Exit code 0 if accuracy ≥ 90%, 1 otherwise.
See the full Evals Documentation for datasets, metrics, and programmatic API.
Verified Performance
These numbers are from actual API calls, not estimates:
| Scenario | Before | After | Savings | Verified |
|---|---|---|---|---|
| Code search (100 results) | 17,765 tokens | 1,408 tokens | 92% | Claude Sonnet |
| SRE incident debugging | 65,694 tokens | 5,118 tokens | 92% | GPT-4o |
| Codebase exploration | 78,502 tokens | 41,254 tokens | 47% | GPT-4o |
| GitHub issue triage | 54,174 tokens | 14,761 tokens | 73% | GPT-4o |
Overhead: ~1-5ms compression latency
When savings are highest: tool-heavy workloads (search, logs, database queries).
When savings are lowest: conversation-heavy workloads with minimal tool use.
Providers
| Provider | Token Counting | Cache Optimization |
|---|---|---|
| OpenAI | tiktoken (exact) | Automatic prefix caching |
| Anthropic | Official API | cache_control blocks |
| Google | Official API | Context caching |
| Cohere | Official API | - |
| Mistral | Official tokenizer | - |
New models auto-supported via naming pattern detection.
Safety Guarantees
- Never removes human content - user/assistant messages preserved
- Never breaks tool ordering - tool calls and responses stay paired
- Parse failures are no-ops - malformed content passes through unchanged
- Compression is reversible - LLM retrieves original data via CCR
Installation
# Recommended: Install everything for best compression performance
pip install "headroom-ai[all]"
# Or install specific components
pip install headroom-ai # SDK only
pip install "headroom-ai[proxy]" # Proxy server
pip install "headroom-ai[mcp]" # MCP server for Claude Code subscriptions
pip install "headroom-ai[langchain]" # LangChain integration
pip install "headroom-ai[agno]" # Agno agent framework
pip install "headroom-ai[evals]" # Evaluation framework
pip install "headroom-ai[code]" # AST-based code compression
pip install "headroom-ai[llmlingua]" # ML-based compression
Requirements: Python 3.10+
First-time startup: Headroom downloads ML models (~500MB) on first run for optimal compression. This is cached locally and only happens once.
Documentation
| Guide | Description |
|---|---|
| Memory Guide | Persistent memory for LLMs |
| Compression Guide | Universal compression with ML detection |
| Evals Framework | Prove compression preserves accuracy |
| LangChain Integration | Full LangChain support |
| Agno Integration | Full Agno agent framework support |
| SDK Guide | Fine-grained control |
| Proxy Guide | Production deployment |
| Configuration | All options |
| CCR Guide | Reversible compression |
| MCP Guide | Claude Code subscription support |
| Metrics | Monitoring |
| Troubleshooting | Common issues |
Who's Using Headroom?
Add your project here! Open a PR or start a discussion.
Contributing
git clone https://github.com/chopratejas/headroom.git
cd headroom
pip install -e ".[dev]"
pytest
See CONTRIBUTING.md for details.
License
Apache License 2.0 - see LICENSE.
Built for the AI developer community