
The Context Optimization Layer for LLM Applications - Cut costs by 50-90%

Project description

Headroom

Compress everything your AI agent reads. Same answers, fraction of the tokens.

Every tool call, DB query, file read, and RAG retrieval your agent makes is 70-95% boilerplate.
Headroom compresses it away before it hits the model.



Where Headroom Fits

Your Agent / App
      │
      │  tool calls, logs, DB reads, RAG results, file reads, API responses
      ▼
   Headroom  ← transparent proxy, no code changes needed
      │
      ▼
 LLM Provider  (OpenAI, Anthropic, Google, Bedrock, 100+ via LiteLLM)

Headroom sits between your application and the LLM provider. It intercepts requests, compresses the context, and forwards an optimized prompt. Your app doesn't change — just point it at Headroom.

What gets compressed

Headroom optimizes any data your agent injects into a prompt:

  • Tool outputs — shell commands, API calls, search results
  • Database queries — SQL results, key-value lookups
  • RAG retrievals — document chunks, embedding results
  • File reads — code, logs, configs, CSVs
  • API responses — JSON, XML, HTML
  • Conversation history — long agent sessions with repetitive context

Quick Start

pip install "headroom-ai[all]"

Proxy (zero code changes)

headroom proxy --port 8787
# Claude Code — just set the base URL
ANTHROPIC_BASE_URL=http://localhost:8787 claude

# Cursor, Continue, any OpenAI-compatible tool
OPENAI_BASE_URL=http://localhost:8787/v1 cursor

Works with any language, any tool, any framework. One env var. Proxy docs

Python: One function

from headroom import compress

result = compress(messages, model="claude-sonnet-4-5-20250929")
response = client.messages.create(model="claude-sonnet-4-5-20250929", messages=result.messages)
print(f"Saved {result.tokens_saved} tokens ({result.compression_ratio:.0%})")

Works with any Python LLM client — Anthropic, OpenAI, LiteLLM, httpx, anything.

Already have a proxy or gateway?

You don't need to replace it. Drop Headroom into your existing stack:

Your setup         Add Headroom      One-liner
LiteLLM            Callback          litellm.callbacks = [HeadroomCallback()]
Any Python proxy   ASGI middleware   app.add_middleware(CompressionMiddleware)
Any Python app     compress()        result = compress(messages, model="gpt-4o")
Agno agents        Wrap model        HeadroomAgnoModel(your_model)
LangChain          Wrap model        HeadroomChatModel(your_llm) (experimental)

Full Integration Guide — detailed setup for LiteLLM, ASGI middleware, compress(), and every framework.


Demo

Headroom Demo


Does It Actually Work?

100 production log entries. One critical error buried at position 67.

                  Baseline   Headroom
Input tokens      10,144     1,260
Correct answers   4/4        4/4

Both responses: "payment-gateway, error PG-5523, fix: Increase max_connections to 500, 1,847 transactions affected."

87.6% fewer tokens. Same answer. Run it: python examples/needle_in_haystack_test.py

What Headroom kept

From 100 log entries, SmartCrusher kept 6: first 3 (boundary), the FATAL error at position 67 (anomaly detection), and last 2 (recency). The error was automatically preserved — not by keyword matching, but by statistical analysis of field variance.
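
The selection idea can be sketched in a few lines of Python. This toy version (the function name and the 5% rarity threshold are illustrative, not SmartCrusher's actual implementation) keeps boundary entries plus any entry whose field values are statistically rare:

```python
from collections import Counter

def select_entries(entries, head=3, tail=2):
    """Keep boundary entries plus statistical outliers.

    A toy approximation of anomaly-based selection: an entry is kept
    when one of its field values is rare relative to the distribution
    across all entries -- no keyword list involved.
    """
    kept = set(range(head)) | set(range(len(entries) - tail, len(entries)))
    # Count how often each (field, value) pair occurs across all entries.
    freq = Counter((k, v) for e in entries for k, v in e.items())
    for i, e in enumerate(entries):
        # A field value seen in <5% of entries marks the entry as anomalous.
        if any(freq[(k, v)] < 0.05 * len(entries) for k, v in e.items()):
            kept.add(i)
    return [entries[i] for i in sorted(kept)]

logs = [{"level": "INFO", "service": "payment-gateway"} for _ in range(100)]
logs[67] = {"level": "FATAL", "service": "payment-gateway"}
picked = select_entries(logs)
# 6 entries survive: first 3, the FATAL outlier at position 67, last 2.
```

The point is that the FATAL entry is kept because its "level" value is rare, not because "FATAL" is on a keyword list.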

Real Workloads

Scenario                    Before   After    Savings
Code search (100 results)   17,765   1,408    92%
SRE incident debugging      65,694   5,118    92%
Codebase exploration        78,502   41,254   47%
GitHub issue triage         54,174   14,761   73%

Accuracy Benchmarks

Compression preserves accuracy — tested on real OSS benchmarks.

Standard Benchmarks — Baseline (direct to API) vs Headroom (through proxy):

Benchmark    Category   N     Baseline   Headroom   Delta
GSM8K        Math       100   0.870      0.870      0.000
TruthfulQA   Factual    100   0.530      0.560      +0.030

Compression Benchmarks — Accuracy after full compression stack:

Benchmark                 Category        N     Accuracy   Compression   Method
SQuAD v2                  QA              100   97%        19%           Before/After
BFCL                      Tool/Function   100   97%        32%           LLM-as-Judge
Tool Outputs (built-in)   Agent           8     100%       20%           Before/After
CCR Needle Retention      Lossless        50    100%       77%           Exact Match

Run it yourself:

# Quick smoke test (8 cases, ~10s)
python -m headroom.evals quick -n 8 --provider openai --model gpt-4o-mini

# Full Tier 1 suite (~$3, ~15 min)
python -m headroom.evals suite --tier 1 -o eval_results/

# CI mode (exit 1 on regression)
python -m headroom.evals suite --tier 1 --ci

Full methodology: Benchmarks | Evals Framework


Key Capabilities

Lossless Compression

Headroom never throws data away. It compresses aggressively, stores the originals, and gives the LLM a tool to retrieve full details when needed. When it compresses 500 items to 20, it tells the model what was omitted ("87 passed, 2 failed, 1 error") so the model knows when to ask for more.
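
As a rough sketch of the idea (the store layout, ID scheme, and helper names below are invented for illustration, not Headroom's real API):

```python
STORE = {}

def compress_items(items, keep=2):
    """Summarize a list but keep the originals retrievable by ID.

    A toy version of reversible compression: the prompt gets a few
    items plus an omission summary; the full list stays in a local
    store that a retrieval tool can query later.
    """
    ccr_id = f"ccr-{len(STORE)}"
    STORE[ccr_id] = items
    passed = sum(1 for it in items if it["status"] == "passed")
    failed = len(items) - passed
    summary = (f"[{ccr_id}] showing {keep}/{len(items)}; "
               f"omitted: {passed} passed, {failed} failed")
    return items[:keep], summary

def retrieve(ccr_id):
    """What a retrieval tool call would hand back to the model."""
    return STORE[ccr_id]

results = [{"id": i, "status": "passed" if i != 7 else "failed"}
           for i in range(90)]
shown, summary = compress_items(results)
```

Because the summary names what was omitted, the model knows there is a failure worth asking about, and the store guarantees the answer is still available.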

Smart Content Detection

Auto-detects what's in your context — JSON arrays, code, logs, plain text — and routes each to the best compressor. JSON goes to SmartCrusher, code goes through AST-aware compression (Python, JS, Go, Rust, Java, C++), prose goes to LLMLingua-2.

Cache Optimization

Stabilizes message prefixes so your provider's KV cache actually works. Claude offers a 90% read discount on cached prefixes — but almost no framework takes advantage of it. Headroom does.
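
Why prefix stability matters can be shown with a toy cache key. Providers key their KV cache on the exact bytes of the message prefix; the hashing below is illustrative, not how any provider actually implements it:

```python
import hashlib
import json

def prefix_key(messages, stable_count):
    """Hash the stable message prefix -- the unit a KV cache keys on."""
    blob = json.dumps(messages[:stable_count], sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()

turn1 = [{"role": "system", "content": "You are a helpful agent."},
         {"role": "user", "content": "list files"}]
turn2 = turn1 + [{"role": "assistant", "content": "ls output here"},
                 {"role": "user", "content": "read config.py"}]

# Because the prefix is byte-identical across turns, its key is stable,
# so the provider can serve it from cache (the 90% read discount on Claude).
assert prefix_key(turn1, 2) == prefix_key(turn2, 2)
```

Any edit to an earlier message, however small, changes the hash and forfeits the cached prefix, which is why the aligner avoids rewriting it.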

Failure Learning

headroom learn                   # Analyze past Claude Code sessions, show recommendations
headroom learn --apply           # Write learnings to CLAUDE.md and MEMORY.md
headroom learn --all --apply     # Learn across all your projects

Reads your conversation history, finds every failed tool call, correlates it with what eventually succeeded, and writes specific corrections into your project files. Next session starts smarter. Learn docs


Image Compression

40-90% token reduction via trained ML router. Automatically selects the right resize/quality tradeoff per image.

All features
Feature                 What it does
Content Router          Auto-detects content type, routes to optimal compressor
SmartCrusher            Universal JSON compression — arrays of dicts, strings, numbers, mixed types, nested objects
CodeCompressor          AST-aware compression for Python, JS, Go, Rust, Java, C++
LLMLingua-2             ML-based 20x text compression
CCR                     Reversible compression — LLM retrieves originals when needed
Compression Summaries   Tells the LLM what was omitted ("3 errors, 12 failures")
CacheAligner            Stabilizes prefixes for provider KV cache hits
IntelligentContext      Score-based context management with learned importance
Image Compression       40-90% token reduction via trained ML router
Memory                  Persistent memory across conversations
Compression Hooks       Customize compression with pre/post hooks
Read Lifecycle          Detects stale/superseded Read outputs, replaces with CCR markers
headroom learn          Analyzes past failures, writes project-specific learnings to CLAUDE.md/MEMORY.md

Headroom vs Alternatives

Context compression is a new space. Here's how the approaches differ:

  • Headroom (multi-algorithm compression): covers all context (tool outputs, DB reads, RAG, files, logs, history); deploys as a proxy, Python library, ASGI middleware, or callback; integrates with LangChain, Agno, LiteLLM, Strands, and MCP; data stays local (OSS); reversible (CCR).
  • RTK (CLI command rewriter): covers shell command outputs only; deploys as a CLI wrapper; no framework integrations; data stays local (OSS); not reversible.
  • Compresr and Token Company (cloud compression APIs): cover text sent to their API; deploy as an API call; no framework integrations; data leaves your machine; not reversible.

Use it however you want. Headroom works as a standalone proxy (headroom proxy), a one-function Python library (compress()), ASGI middleware, or a LiteLLM callback. Already using LiteLLM, LangChain, or Agno? Drop Headroom in without replacing anything.

Headroom + RTK work well together. RTK rewrites CLI commands (git show → git show --short), Headroom compresses everything else (JSON arrays, code, logs, RAG results, conversation history). Use both.

Headroom vs cloud APIs. Compresr and Token Company are hosted services — you send your context to their servers, they compress and return it. Headroom runs locally. Your data never leaves your machine. You also get lossless compression (CCR): the LLM can retrieve the full original when it needs more detail.


How It Works Inside

  Your prompt
      │
      ▼
  1. CacheAligner            Stabilize prefix for KV cache
      │
      ▼
  2. ContentRouter           Route each content type:
      │                         → SmartCrusher    (JSON)
      │                         → CodeCompressor  (code)
      │                         → LLMLingua       (text)
      ▼
  3. IntelligentContext      Score-based token fitting
      │
      ▼
  LLM Provider

  Needs full details? LLM calls headroom_retrieve.
  Originals are in the Compressed Store — nothing is thrown away.

Overhead: 15-200ms compression latency (net positive for Sonnet/Opus). Full data: Latency Benchmarks
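
The routing step above can be approximated in a few lines. The sniffing heuristics and compressor names in this sketch are simplified stand-ins, not ContentRouter's actual detection logic:

```python
import json

def route(content: str) -> str:
    """Toy content router: pick a compression path by sniffing the payload."""
    # Structured JSON (arrays or objects) goes to the JSON compressor.
    try:
        parsed = json.loads(content)
        if isinstance(parsed, (list, dict)):
            return "smart_crusher"
    except (ValueError, TypeError):
        pass
    # Crude source-code sniff: common definition keywords.
    if any(tok in content for tok in ("def ", "class ", "function ", "fn ")):
        return "code_compressor"
    # Everything else is treated as prose.
    return "llmlingua"

assert route('[{"a": 1}]') == "smart_crusher"
assert route("def main():\n    pass") == "code_compressor"
assert route("The deploy finished without incident.") == "llmlingua"
```

The real router presumably uses stronger signals than keyword sniffing; the point is only that each payload type gets its own compressor.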


Integrations

Integration                 Status         Docs
compress() — one function   Stable         Integration Guide
LiteLLM callback            Stable         Integration Guide
ASGI middleware             Stable         Integration Guide
Proxy server                Stable         Proxy Docs
Agno                        Stable         Agno Guide
MCP (Claude Code)           Stable         MCP Guide
Strands                     Stable         Strands Guide
LangChain                   Experimental   LangChain Guide

Cloud Providers

headroom proxy --backend bedrock --region us-east-1     # AWS Bedrock
headroom proxy --backend vertex_ai --region us-central1 # Google Vertex
headroom proxy --backend azure                          # Azure OpenAI
headroom proxy --backend openrouter                     # OpenRouter (400+ models)

Installation

pip install headroom-ai                # Core library
pip install "headroom-ai[all]"         # Everything including evals (recommended)
pip install "headroom-ai[proxy]"       # Proxy server
pip install "headroom-ai[mcp]"         # MCP for Claude Code
pip install "headroom-ai[agno]"        # Agno integration
pip install "headroom-ai[langchain]"   # LangChain (experimental)
pip install "headroom-ai[evals]"       # Evaluation framework only

Python 3.10+


Documentation

  • Integration Guide: LiteLLM, ASGI, compress(), proxy
  • Proxy Docs: proxy server configuration
  • Architecture: how the pipeline works
  • CCR Guide: reversible compression
  • Benchmarks: accuracy validation
  • Latency Benchmarks: compression overhead and cost-benefit analysis
  • Limitations: when compression helps, when it doesn't
  • Evals Framework: prove compression preserves accuracy
  • Memory: persistent memory
  • Agno: Agno agent framework
  • MCP: Claude Code subscriptions
  • Learn: offline failure learning for coding agents
  • Configuration: all options

Community

Questions, feedback, or just want to follow along? Join us on Discord


Contributing

git clone https://github.com/chopratejas/headroom.git && cd headroom
pip install -e ".[dev]" && pytest

License

Apache License 2.0 — see LICENSE.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

headroom_ai-0.4.2.tar.gz (1.2 MB)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

headroom_ai-0.4.2-py3-none-any.whl (914.3 kB)

Uploaded Python 3

File details

Details for the file headroom_ai-0.4.2.tar.gz.

File metadata

  • Download URL: headroom_ai-0.4.2.tar.gz
  • Upload date:
  • Size: 1.2 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.6

File hashes

Hashes for headroom_ai-0.4.2.tar.gz
Algorithm Hash digest
SHA256 6bad76c5c75292b6491729203db72c07728526d4f34800a4002a2a767030765c
MD5 77c31f009e79cf3fff9d5a7590b46216
BLAKE2b-256 5e8347238452395808b00f64ed9678b247799c0d5c72336054bfedfd83bd912c

See more details on using hashes here.

File details

Details for the file headroom_ai-0.4.2-py3-none-any.whl.

File metadata

  • Download URL: headroom_ai-0.4.2-py3-none-any.whl
  • Upload date:
  • Size: 914.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.6

File hashes

Hashes for headroom_ai-0.4.2-py3-none-any.whl
Algorithm Hash digest
SHA256 1adaa06549f019d3214d3f720530e11ee1339d8c020ed0a2109b97ae2e2b83bf
MD5 bf29ccf90ef5c7f337809c475b518a8d
BLAKE2b-256 864fc6192d1bacf3e38cc5a9ea0725bfe1bf2fc3f54f19c8e3f3ca2f2b2bc182

See more details on using hashes here.
