The Context Optimization Layer for LLM Applications - Cut costs by 50-90%
Headroom
Tool outputs are 70-95% redundant boilerplate. Headroom compresses that away.
Demo
Does It Actually Work? A Real Test
The setup: 100 production log entries. One critical error buried at position 67.
BEFORE: 100 log entries (18,952 chars):
[
{"timestamp": "2024-12-15T00:00:00Z", "level": "INFO", "service": "api-gateway", "message": "Request processed successfully - latency=50ms", "request_id": "req-000000", "status_code": 200},
{"timestamp": "2024-12-15T01:01:00Z", "level": "INFO", "service": "user-service", "message": "Request processed successfully - latency=51ms", "request_id": "req-000001", "status_code": 200},
{"timestamp": "2024-12-15T02:02:00Z", "level": "INFO", "service": "inventory", "message": "Request processed successfully - latency=52ms", "request_id": "req-000002", "status_code": 200},
// ... 64 more INFO entries ...
{"timestamp": "2024-12-15T03:47:23Z", "level": "FATAL", "service": "payment-gateway", "message": "Connection pool exhausted", "error_code": "PG-5523", "resolution": "Increase max_connections to 500 in config/database.yml", "affected_transactions": 1847},
// ... 32 more INFO entries ...
]
AFTER: Headroom compresses to 6 entries (1,155 chars):
[
{"timestamp": "2024-12-15T00:00:00Z", "level": "INFO", "service": "api-gateway", ...},
{"timestamp": "2024-12-15T01:01:00Z", "level": "INFO", "service": "user-service", ...},
{"timestamp": "2024-12-15T02:02:00Z", "level": "INFO", "service": "inventory", ...},
{"timestamp": "2024-12-15T03:47:23Z", "level": "FATAL", "service": "payment-gateway", "error_code": "PG-5523", "resolution": "Increase max_connections to 500 in config/database.yml", "affected_transactions": 1847},
{"timestamp": "2024-12-15T02:38:00Z", "level": "INFO", "service": "inventory", ...},
{"timestamp": "2024-12-15T03:39:00Z", "level": "INFO", "service": "auth", ...}
]
What happened: First 3 items + the FATAL error + last 2 items. The critical error at position 67 was automatically preserved.
The question we asked Claude: "What caused the outage? What's the error code? What's the fix?"
| | Baseline | Headroom |
|---|---|---|
| Input tokens | 10,144 | 1,260 |
| Correct answers | 4/4 | 4/4 |
Both responses: "payment-gateway service, error PG-5523, fix: Increase max_connections to 500, 1,847 transactions affected"
87.6% fewer tokens. Same answer.
Run it yourself: python examples/needle_in_haystack_test.py
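The keep-first/keep-anomalies/keep-last behavior above can be sketched in a few lines. This is a simplified illustration, not Headroom's actual SmartCrusher implementation; the level-based anomaly check and the `head`/`tail` parameters are assumptions for the sketch:

```python
def crush(entries, head=3, tail=2):
    """Keep the first `head` items, the last `tail` items, and any
    anomalous items in between (here: entries whose log level is not
    routine). Everything else is dropped."""
    if len(entries) <= head + tail:
        return entries
    middle = entries[head:len(entries) - tail]
    anomalies = [e for e in middle if e.get("level") not in ("INFO", "DEBUG")]
    return entries[:head] + anomalies + entries[-tail:]

logs = [{"level": "INFO", "id": i} for i in range(100)]
logs[67] = {"level": "FATAL", "id": 67, "error_code": "PG-5523"}
kept = crush(logs)
# 6 entries survive: 3 head + 1 FATAL (position 67) + 2 tail
```

The point of the heuristic is that position never hides an anomaly: the FATAL entry is kept because it is unusual, not because of where it sits.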
Accuracy Benchmarks
Headroom's guarantee: compress without losing accuracy.
We validate against established open-source benchmarks. Full methodology and reproducible tests: Benchmarks Documentation
| Benchmark | Metric | Result | Status |
|---|---|---|---|
| Scrapinghub Article Extraction | F1 Score | 0.919 (baseline: 0.958) | :white_check_mark: |
| Scrapinghub Article Extraction | Recall | 98.2% | :white_check_mark: |
| Scrapinghub Article Extraction | Compression | 94.9% | :white_check_mark: |
| SmartCrusher (JSON) | Accuracy | 100% (4/4 correct) | :white_check_mark: |
| SmartCrusher (JSON) | Compression | 87.6% | :white_check_mark: |
| Multi-Tool Agent | Accuracy | 100% (all findings) | :white_check_mark: |
| Multi-Tool Agent | Compression | 76.3% | :white_check_mark: |
Why recall matters most: For LLM applications, capturing all relevant information is critical. 98.2% recall means nearly all content is preserved — LLMs can answer questions accurately from compressed context.
Run benchmarks yourself
# Install with benchmark dependencies
pip install "headroom-ai[evals,html]" datasets
# Run HTML extraction benchmark (no API key needed)
pytest tests/test_evals/test_html_oss_benchmarks.py::TestExtractionBenchmark -v -s
# Run QA accuracy tests (requires OPENAI_API_KEY)
pytest tests/test_evals/test_html_oss_benchmarks.py::TestQAAccuracyPreservation -v -s
Multi-Tool Agent Test: Real Function Calling
The setup: An Agno agent with 4 tools (GitHub Issues, ArXiv Papers, Code Search, Database Logs) investigating a memory leak. Total tool output: 62,323 chars (~15,580 tokens).
from agno.agent import Agent
from agno.models.anthropic import Claude
from headroom.integrations.agno import HeadroomAgnoModel
# Wrap your model - that's it!
base_model = Claude(id="claude-sonnet-4-20250514")
model = HeadroomAgnoModel(wrapped_model=base_model)
agent = Agent(model=model, tools=[search_github, search_arxiv, search_code, query_db])
response = agent.run("Investigate the memory leak and recommend a fix")
Results with Claude Sonnet:
| | Baseline | Headroom |
|---|---|---|
| Tokens sent to API | 15,662 | 6,100 |
| API requests | 2 | 2 |
| Tool calls | 4 | 4 |
| Duration | 26.5s | 27.0s |
76.3% fewer tokens. Same comprehensive answer.
Both found: Issue #42 (memory leak), the cleanup_worker() fix, OutOfMemoryError logs (7.8GB/8GB, 847 threads), and relevant research papers.
Run it yourself: python examples/multi_tool_agent_test.py
How It Works
Headroom optimizes LLM context before it hits the provider — without changing your agent logic or tools.
flowchart LR
User["Your App"]
Entry["Headroom"]
Transform["Context<br/>Optimization"]
LLM["LLM Provider"]
Response["Response"]
User --> Entry --> Transform --> LLM --> Response
Inside Headroom
flowchart TB
subgraph Pipeline["Transform Pipeline"]
CA["Cache Aligner<br/><i>Stabilizes dynamic tokens</i>"]
SC["Smart Crusher<br/><i>Removes redundant tool output</i>"]
CM["Intelligent Context<br/><i>Score-based token fitting</i>"]
CA --> SC --> CM
end
subgraph CCR["CCR: Compress-Cache-Retrieve"]
Store[("Compressed<br/>Store")]
Tool["Retrieve Tool"]
Tool <--> Store
end
LLM["LLM Provider"]
CM --> LLM
SC -. "Stores originals" .-> Store
LLM -. "Requests full context<br/>if needed" .-> Tool
Headroom never throws data away. It compresses aggressively and retrieves precisely.
What actually happens
- Headroom intercepts context — Tool outputs, logs, search results, and intermediate agent steps.
- Dynamic content is stabilized — Timestamps, UUIDs, and request IDs are normalized so prompts cache cleanly.
- Low-signal content is removed — Repetitive or redundant data is crushed, not truncated.
- Original data is preserved — Full content is stored separately and retrieved only if the LLM asks.
- Provider caches finally work — Headroom aligns prompts so OpenAI, Anthropic, and Google caches actually hit.
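The stabilization step can be illustrated with simple regex normalization. This is an illustrative sketch only; Headroom's actual CacheAligner logic is more involved, and the placeholder names (`<TS>`, `<UUID>`, `<REQ>`) are assumptions:

```python
import re

def stabilize(text: str) -> str:
    """Replace volatile tokens with stable placeholders so two
    otherwise-identical prompts compare (and cache) the same."""
    # ISO-8601 timestamps -> <TS>
    text = re.sub(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?Z?", "<TS>", text)
    # UUIDs -> <UUID>
    text = re.sub(
        r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
        "<UUID>", text, flags=re.I,
    )
    # Request IDs like req-000123 -> <REQ>
    text = re.sub(r"req-\d+", "<REQ>", text)
    return text

a = stabilize('{"timestamp": "2024-12-15T00:00:00Z", "request_id": "req-000000"}')
b = stabilize('{"timestamp": "2024-12-15T01:01:00Z", "request_id": "req-000001"}')
# a == b: identical prefixes, so provider prompt caches can hit
```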
For deep technical details, see Architecture Documentation.
Why Headroom?
- Zero code changes - works as a transparent proxy
- 47-92% savings - depends on your workload (tool-heavy = more savings)
- Image compression - 40-90% reduction via trained ML router (OpenAI, Anthropic, Google)
- Reversible compression - LLM retrieves original data via CCR
- Content-aware - code, logs, JSON, images each handled optimally
- Provider caching - automatic prefix optimization for cache hits
- Framework native - LangChain, Agno, MCP, agents supported
30-Second Quickstart
Option 1: Proxy (Zero Code Changes)
pip install "headroom-ai[all]" # Recommended for best performance
headroom proxy --port 8787
Note: First startup downloads ML models (~500MB) for optimal compression. This is a one-time download.
Dashboard: Open http://localhost:8787/dashboard to see real-time stats, token savings, and request history.
Point your tools at the proxy:
# Claude Code
ANTHROPIC_BASE_URL=http://localhost:8787 claude
# Any OpenAI-compatible client
OPENAI_BASE_URL=http://localhost:8787/v1 cursor
Enable Persistent Memory - Claude remembers across conversations:
headroom proxy --memory
Memory auto-detects your provider (Anthropic, OpenAI, Gemini) and uses the appropriate format:
- Anthropic: Uses native memory tool (memory_20250818) - works with Claude Code subscriptions
- OpenAI/Gemini/Others: Uses function calling format
- All providers share the same semantic vector store for search
Set x-headroom-user-id header for per-user memory isolation (defaults to 'default').
Claude Code Subscription Users - Use MCP for CCR (Compress-Cache-Retrieve):
If you use Claude Code with a subscription (not API key), you need MCP to enable the headroom_retrieve tool:
# One-time setup
pip install "headroom-ai[mcp]"
headroom mcp install
# Every time you code
headroom proxy # Terminal 1
claude # Terminal 2 - now has headroom_retrieve!
What this does:
- Configures Claude Code to use Headroom's MCP server (~/.claude/mcp.json)
- When the proxy compresses large tool outputs, Claude sees markers like [47 items compressed... hash=abc123]
- Claude can call headroom_retrieve to get the full original content when needed
Check your setup:
headroom mcp status
Why MCP for subscriptions?
- API users can inject custom tools directly via the Messages API
- Subscription users use Claude Code's built-in tool set and can't inject tools programmatically
- MCP (Model Context Protocol) is Claude's official way to extend tools - it works with subscriptions
The MCP server exposes headroom_retrieve so Claude can request uncompressed content when the compressed summary isn't enough.
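The compress-cache-retrieve loop can be sketched as a toy store: compression replaces the payload with a marker, and the original stays retrievable by hash. The marker format and function names here are assumptions for illustration, not Headroom's actual API:

```python
import hashlib
import json

# Toy CCR store: marker out, original kept on the side.
STORE: dict[str, str] = {}

def compress(payload: str, kept_items: int, total_items: int) -> str:
    """Store the full payload and return a compact marker for the LLM."""
    digest = hashlib.sha256(payload.encode()).hexdigest()[:6]
    STORE[digest] = payload
    return f"[{total_items - kept_items} items compressed... hash={digest}]"

def retrieve(digest: str) -> str:
    """What a retrieve tool would do when the LLM asks for the original."""
    return STORE[digest]

original = json.dumps([{"i": i} for i in range(50)])
marker = compress(original, kept_items=3, total_items=50)
# marker looks like "[47 items compressed... hash=a1b2c3]"
```

Because the original is stored, not discarded, compression can be aggressive: anything the model turns out to need is one tool call away.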
Using AWS Bedrock, Google Vertex, or Azure? Route through Headroom:
# AWS Bedrock - Terminal 1: Start proxy
export AWS_ACCESS_KEY_ID="AKIA..."
export AWS_SECRET_ACCESS_KEY="..."
export AWS_REGION="us-east-1"
headroom proxy --backend bedrock --region us-east-1
# AWS Bedrock - Terminal 2: Run Claude Code
export ANTHROPIC_API_KEY="sk-ant-dummy" # Any value works! Headroom ignores it.
export ANTHROPIC_BASE_URL="http://localhost:8787"
# IMPORTANT: Do NOT set CLAUDE_CODE_USE_BEDROCK=1 (Headroom handles Bedrock routing)
claude
VS Code settings.json for Bedrock:
{
"claudeCode.environmentVariables": [
{ "name": "ANTHROPIC_API_KEY", "value": "sk-ant-dummy" },
{ "name": "ANTHROPIC_BASE_URL", "value": "http://localhost:8787" },
{ "name": "AWS_ACCESS_KEY_ID", "value": "AKIA..." },
{ "name": "AWS_SECRET_ACCESS_KEY", "value": "..." },
{ "name": "AWS_REGION", "value": "us-east-1" }
]
}
Do NOT include CLAUDE_CODE_USE_BEDROCK - Headroom handles the Bedrock routing.
Using OpenRouter? Access 400+ models through a single API:
# OpenRouter - Terminal 1: Start proxy
export OPENROUTER_API_KEY="sk-or-v1-..."
headroom proxy --backend openrouter
# OpenRouter - Terminal 2: Run your client
export ANTHROPIC_API_KEY="sk-ant-dummy" # Any value works! Headroom ignores it.
export ANTHROPIC_BASE_URL="http://localhost:8787"
# Use OpenRouter model names in your requests:
# - anthropic/claude-3.5-sonnet
# - openai/gpt-4o
# - google/gemini-pro
# - meta-llama/llama-3-70b-instruct
# See all models: https://openrouter.ai/models
# Google Vertex AI
headroom proxy --backend vertex_ai --region us-central1
# Azure OpenAI
headroom proxy --backend azure --region eastus
Option 2: LangChain Integration
pip install "headroom-ai[langchain]"
from langchain_openai import ChatOpenAI
from headroom.integrations import HeadroomChatModel
# Wrap your model - that's it!
llm = HeadroomChatModel(ChatOpenAI(model="gpt-4o"))
# Use exactly like before
response = llm.invoke("Hello!")
See the full LangChain Integration Guide for memory, retrievers, agents, and more.
Option 3: Agno Integration
pip install "headroom-ai[agno]"
from agno.agent import Agent
from agno.models.openai import OpenAIChat
from headroom.integrations.agno import HeadroomAgnoModel
# Wrap your model - that's it!
model = HeadroomAgnoModel(OpenAIChat(id="gpt-4o"))
agent = Agent(model=model)
# Use exactly like before
response = agent.run("Hello!")
# Check savings
print(f"Tokens saved: {model.total_tokens_saved}")
See the full Agno Integration Guide for hooks, multi-provider support, and more.
Framework Integrations
| Framework | Integration | Docs |
|---|---|---|
| LangChain | HeadroomChatModel, memory, retrievers, agents | Guide |
| Agno | HeadroomAgnoModel, hooks, multi-provider | Guide |
| MCP | Claude Code subscription support via headroom mcp install | Guide |
| Any OpenAI Client | Proxy server | Guide |
Features
| Feature | Description | Docs |
|---|---|---|
| Image Compression | 40-90% token reduction for images via trained ML router | Image Compression |
| Memory | Persistent memory across conversations (zero-latency inline extraction) | Memory |
| Universal Compression | ML-based content detection + structure-preserving compression | Compression |
| SmartCrusher | Compresses JSON tool outputs statistically | Transforms |
| CacheAligner | Stabilizes prefixes for provider caching | Transforms |
| IntelligentContext | Score-based context dropping with TOIN-learned importance | Transforms |
| CCR | Reversible compression with automatic retrieval | CCR Guide |
| MCP Server | Claude Code subscription support via headroom mcp install | MCP Guide |
| LangChain | Memory, retrievers, agents, streaming | LangChain |
| Agno | Agent framework integration with hooks | Agno |
| Text Utilities | Opt-in compression for search/logs | Text Compression |
| LLMLingua-2 | ML-based 20x compression (opt-in) | LLMLingua |
| Code-Aware | AST-based code compression (tree-sitter) | Transforms |
| Evals Framework | Prove compression preserves accuracy (12+ datasets) | Evals |
Evaluation Framework: Prove It Works
Skeptical? Good. We built a comprehensive evaluation framework to prove compression preserves accuracy.
# Install evals
pip install "headroom-ai[evals]"
# Quick sanity check (5 samples)
python -m headroom.evals quick
# Run on real datasets
python -m headroom.evals benchmark --dataset hotpotqa -n 100
How Evals Work
Original Context ───► LLM ───► Response A
Compressed Context ─► LLM ───► Response B

Compare A vs B:
─────────────────
F1 Score: 0.95
Semantic Similarity: 0.97
Ground Truth Match: ✓
─────────────────
PASS: Accuracy preserved
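The F1 comparison can be sketched as token-level overlap between the two responses. This is a simplified illustration of how QA evals typically score answers, not necessarily Headroom's exact metric:

```python
from collections import Counter

def token_f1(pred: str, gold: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over the
    multiset of whitespace tokens shared by the two answers."""
    p, g = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)

a = "error PG-5523 fix increase max_connections to 500"
b = "the error PG-5523 fix is increase max_connections to 500"
score = token_f1(a, b)  # ≈ 0.875: all of a's tokens appear in b
```

High recall on the compressed-context answer is the key signal: it means the facts needed for a correct answer survived compression.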
Available Datasets (12+)
| Category | Datasets |
|---|---|
| RAG | HotpotQA, Natural Questions, TriviaQA, MS MARCO, SQuAD |
| Long Context | LongBench (4K-128K tokens), NarrativeQA |
| Tool Use | BFCL (function calling), ToolBench, Built-in samples |
| Code | CodeSearchNet, HumanEval |
CI Integration
# GitHub Actions
- name: Run Compression Evals
run: python -m headroom.evals quick -n 20
env:
ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
Exit code 0 if accuracy ≥ 90%, 1 otherwise.
See the full Evals Documentation for datasets, metrics, and programmatic API.
Verified Performance
These numbers are from actual API calls, not estimates:
| Scenario | Before | After | Savings | Verified |
|---|---|---|---|---|
| Code search (100 results) | 17,765 tokens | 1,408 tokens | 92% | Claude Sonnet |
| SRE incident debugging | 65,694 tokens | 5,118 tokens | 92% | GPT-4o |
| Codebase exploration | 78,502 tokens | 41,254 tokens | 47% | GPT-4o |
| GitHub issue triage | 54,174 tokens | 14,761 tokens | 73% | GPT-4o |
Overhead: ~1-5ms compression latency
When savings are highest: tool-heavy workloads (search, logs, database queries).
When savings are lowest: conversation-heavy workloads with minimal tool use.
Providers
| Provider | Token Counting | Cache Optimization |
|---|---|---|
| OpenAI | tiktoken (exact) | Automatic prefix caching |
| Anthropic | Official API | cache_control blocks |
| Google | Official API | Context caching |
| Cohere | Official API | - |
| Mistral | Official tokenizer | - |
New models auto-supported via naming pattern detection.
Safety Guarantees
- Never removes human content - user/assistant messages preserved
- Never breaks tool ordering - tool calls and responses stay paired
- Parse failures are no-ops - malformed content passes through unchanged
- Compression is reversible - LLM retrieves original data via CCR
Installation
# Recommended: Install everything for best compression performance
pip install "headroom-ai[all]"
# Or install specific components
pip install headroom-ai # SDK only
pip install "headroom-ai[proxy]" # Proxy server
pip install "headroom-ai[mcp]" # MCP server for Claude Code subscriptions
pip install "headroom-ai[langchain]" # LangChain integration
pip install "headroom-ai[agno]" # Agno agent framework
pip install "headroom-ai[evals]" # Evaluation framework
pip install "headroom-ai[code]" # AST-based code compression
pip install "headroom-ai[llmlingua]" # ML-based compression
Requirements: Python 3.10+
First-time startup: Headroom downloads ML models (~500MB) on first run for optimal compression. This is cached locally and only happens once.
Documentation
| Guide | Description |
|---|---|
| Memory Guide | Persistent memory for LLMs |
| Compression Guide | Universal compression with ML detection |
| Evals Framework | Prove compression preserves accuracy |
| LangChain Integration | Full LangChain support |
| Agno Integration | Full Agno agent framework support |
| SDK Guide | Fine-grained control |
| Proxy Guide | Production deployment |
| Configuration | All options |
| CCR Guide | Reversible compression |
| MCP Guide | Claude Code subscription support |
| Metrics | Monitoring |
| Troubleshooting | Common issues |
Who's Using Headroom?
Add your project here! Open a PR or start a discussion.
Contributing
git clone https://github.com/chopratejas/headroom.git
cd headroom
pip install -e ".[dev]"
pytest
See CONTRIBUTING.md for details.
License
Apache License 2.0 - see LICENSE.
Built for the AI developer community