Headroom
The Context Optimization Layer for LLM Applications
Cut your LLM costs by 50-90% without losing accuracy
Why Headroom?
AI coding agents and tool-using applications generate massive contexts:
- Tool outputs with thousands of search results, log entries, and API responses
- Long conversation histories that hit token limits
- System prompts with dynamic dates that break provider caching
Result: You pay for tokens you don't need, and cache hits are rare.
Headroom is a smart compression layer that sits between your app and LLM providers:
| Transform | What It Does | Savings |
|---|---|---|
| SmartCrusher | Compresses tool outputs statistically (keeps errors, anomalies, relevant items) | 70-90% |
| CacheAligner | Stabilizes prefixes so provider caching works | Up to 10x |
| RollingWindow | Manages context within limits without breaking tool calls | Prevents failures |
Zero accuracy loss - we keep what matters: errors, anomalies, relevant items.
5-Minute Quickstart
Option 1: Proxy Server (Recommended)
Works with any OpenAI-compatible client without code changes:
# Install
pip install "headroom[proxy]"
# Start the proxy
headroom proxy --port 8787
# Verify it's running
curl http://localhost:8787/health
# Expected: {"status": "healthy", ...}
Use with your tools:
# Claude Code
ANTHROPIC_BASE_URL=http://localhost:8787 claude
# Cursor / Continue / any OpenAI client
OPENAI_BASE_URL=http://localhost:8787/v1 your-app
# Python OpenAI SDK
export OPENAI_BASE_URL=http://localhost:8787/v1
python your_script.py
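If you would rather not rely on environment variables, the OpenAI Python SDK also accepts the proxy address via its base_url argument; a minimal sketch pointing it at the proxy started above:

```python
from openai import OpenAI

# Route requests through the local Headroom proxy instead of api.openai.com.
client = OpenAI(base_url="http://localhost:8787/v1")

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```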
Option 2: Python SDK
Wrap your existing client for fine-grained control:
pip install headroom openai
from headroom import HeadroomClient, OpenAIProvider
from openai import OpenAI
# Create wrapped client
client = HeadroomClient(
original_client=OpenAI(),
provider=OpenAIProvider(),
default_mode="optimize", # or "audit" to observe only
)
# Use exactly like the original client
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "user", "content": "Hello!"},
],
)
print(response.choices[0].message.content)
# Check what happened
stats = client.get_stats()
print(f"Tokens saved this session: {stats['session']['tokens_saved_total']}")
With tool outputs (where real savings happen):
import json
# Conversation with large tool output
messages = [
{"role": "user", "content": "Search for Python tutorials"},
{
"role": "assistant",
"content": None,
"tool_calls": [{
"id": "call_123",
"type": "function",
"function": {"name": "search", "arguments": '{"q": "python"}'},
}],
},
{
"role": "tool",
"tool_call_id": "call_123",
"content": json.dumps({
"results": [{"title": f"Tutorial {i}", "score": 100-i} for i in range(500)]
}),
},
{"role": "user", "content": "What are the top 3?"},
]
# Headroom compresses 500 results to ~15, keeping highest-scoring items
response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
print(f"Tokens saved: {client.get_stats()['session']['tokens_saved_total']}")
# Typical output: "Tokens saved: 3500"
Option 3: LangChain Integration (Coming Soon)
# Coming soon - use proxy server for now
# OPENAI_BASE_URL=http://localhost:8787/v1 python your_langchain_app.py
Verify It's Working
Check Proxy Stats
curl http://localhost:8787/stats
{
"requests": {"total": 42, "cached": 5, "rate_limited": 0, "failed": 0},
"tokens": {"input": 50000, "output": 8000, "saved": 12500, "savings_percent": 25.0},
"cost": {"total_cost_usd": 0.15, "total_savings_usd": 0.04},
"cache": {"entries": 10, "total_hits": 5}
}
Check SDK Stats
# Quick session stats (no database query)
stats = client.get_stats()
print(stats)
# {
# "session": {"requests_total": 10, "tokens_saved_total": 5000, ...},
# "config": {"mode": "optimize", "provider": "openai", ...},
# "transforms": {"smart_crusher_enabled": True, ...}
# }
# Validate setup is correct
result = client.validate_setup()
if not result["valid"]:
print("Setup issues:", result)
Enable Logging
import logging
logging.basicConfig(level=logging.INFO)
# Now you'll see:
# INFO:headroom.transforms.pipeline:Pipeline complete: 45000 -> 4500 tokens (saved 40500, 90.0% reduction)
# INFO:headroom.transforms.smart_crusher:SmartCrusher applied top_n strategy: kept 15 of 1000 items
Installation
# Core only (minimal dependencies: tiktoken, pydantic)
pip install headroom
# With semantic relevance scoring (adds sentence-transformers)
pip install "headroom[relevance]"
# With proxy server (adds fastapi, uvicorn)
pip install "headroom[proxy]"
# With HTML reports (adds jinja2)
pip install "headroom[reports]"
# Everything
pip install "headroom[all]"
Requirements: Python 3.10+
Configuration
SDK Configuration
from headroom import HeadroomClient, OpenAIProvider
from openai import OpenAI
# Full configuration example
client = HeadroomClient(
original_client=OpenAI(),
provider=OpenAIProvider(),
default_mode="optimize", # "audit" (observe only) or "optimize" (apply transforms)
enable_cache_optimizer=True, # Enable provider-specific cache optimization
enable_semantic_cache=False, # Enable query-level semantic caching
model_context_limits={ # Override default context limits
"gpt-4o": 128000,
"gpt-4o-mini": 128000,
},
# store_url defaults to temp directory; override with absolute path if needed:
# store_url="sqlite:////absolute/path/to/headroom.db",
)
Proxy Configuration
# Via command line
headroom proxy \
--port 8787 \
--budget 10.00 \
--log-file headroom.jsonl
# Disable optimization (passthrough mode)
headroom proxy --no-optimize
# Disable semantic caching
headroom proxy --no-cache
# See all options
headroom proxy --help
Per-Request Overrides
# Override mode for specific requests
response = client.chat.completions.create(
model="gpt-4o",
messages=[...],
headroom_mode="audit", # Just observe, don't optimize
headroom_output_buffer_tokens=8000, # Reserve more for output
headroom_keep_turns=5, # Keep last 5 turns
)
Modes
| Mode | Behavior | Use Case |
|---|---|---|
| audit | Observes and logs, no modifications | Production monitoring, baseline measurement |
| optimize | Applies safe, deterministic transforms | Production optimization |
| simulate | Returns plan without API call | Testing, cost estimation |
# Simulate to see what would happen
plan = client.chat.completions.simulate(
model="gpt-4o",
messages=large_conversation,
)
print(f"Would save {plan.tokens_saved} tokens")
print(f"Transforms: {plan.transforms}")
print(f"Estimated savings: {plan.estimated_savings}")
Error Handling
Headroom provides explicit exceptions for debugging:
from headroom import (
HeadroomClient,
HeadroomError, # Base class - catch all Headroom errors
ConfigurationError, # Invalid configuration
ProviderError, # Provider issues (unknown model, etc.)
StorageError, # Database/storage failures
CompressionError, # Compression failures (rare - we fail safe)
ValidationError, # Setup validation failures
)
try:
client = HeadroomClient(...)
response = client.chat.completions.create(...)
except ConfigurationError as e:
print(f"Config issue: {e}")
print(f"Details: {e.details}") # Additional context
except StorageError as e:
print(f"Storage issue: {e}")
# Headroom continues to work, just without metrics persistence
except HeadroomError as e:
print(f"Headroom error: {e}")
Safety guarantee: If compression fails, the original content passes through unchanged. Your LLM calls never fail due to Headroom.
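In other words, every transform is wrapped in a guard that falls back to the untouched content. A minimal sketch of the pattern (illustrative only, not Headroom's internal code):

```python
from typing import Callable

def apply_safely(content: str, transform: Callable[[str], str]) -> str:
    """Apply a transform, returning the original content if anything goes wrong."""
    try:
        return transform(content)
    except Exception:
        # Malformed or unexpected input: pass through unchanged instead of failing the LLM call.
        return content
```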
How It Works
SmartCrusher: Statistical Compression
# Before: 50KB tool response with 1000 items
{"results": [{"id": 1, "status": "ok", ...}, ... 1000 items ...]}
# After: ~2KB with important items preserved
# Headroom keeps:
# - First 3 items (context)
# - Last 2 items (recency)
# - All error items (status != "ok")
# - Statistical anomalies (values > 2 std dev from mean)
# - Items matching user's query (BM25/embedding similarity)
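A rough sketch of that selection logic over a list of result dicts; the status and score fields are assumptions for illustration, and query-relevance scoring (BM25/embeddings) is omitted. This is not SmartCrusher's actual implementation:

```python
import statistics

def crush(items: list[dict], head: int = 3, tail: int = 2) -> list[dict]:
    """Keep leading/trailing items, error items, and statistical outliers; drop everything else."""
    scores = [item.get("score", 0) for item in items]
    mean, stdev = statistics.mean(scores), statistics.pstdev(scores)

    keep = set(range(min(head, len(items)))) | set(range(max(0, len(items) - tail), len(items)))
    for i, item in enumerate(items):
        if item.get("status") not in (None, "ok"):                    # always keep errors
            keep.add(i)
        elif stdev and abs(item.get("score", 0) - mean) > 2 * stdev:  # keep anomalies
            keep.add(i)
    return [items[i] for i in sorted(keep)]
```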
CacheAligner: Prefix Stabilization
# Before: Cache miss every day due to changing date
"You are helpful. Today is January 7, 2025."
# After: Stable prefix (cache hit!) + dynamic context moved to end
"You are helpful."
# Dynamic content: "Current date: January 7, 2025"
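Conceptually, the system prompt is split into a stable prefix and a dynamic suffix that is re-appended after the cacheable part. A minimal sketch, assuming the dynamic piece can be found with a simple date pattern (the real CacheAligner heuristics are more general):

```python
import re

# Hypothetical pattern for the dynamic sentence; real detection is more general.
DATE_SENTENCE = re.compile(r"\s*Today is [^.]+\.")

def align(system_prompt: str) -> tuple[str, str]:
    """Split a system prompt into (stable_prefix, dynamic_suffix) so the prefix stays cacheable."""
    match = DATE_SENTENCE.search(system_prompt)
    if not match:
        return system_prompt, ""
    stable = system_prompt.replace(match.group(0), "").strip()
    dynamic = match.group(0).strip().replace("Today is", "Current date:")
    return stable, dynamic

# align("You are helpful. Today is January 7, 2025.")
# -> ("You are helpful.", "Current date: January 7, 2025.")
```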
RollingWindow: Context Management
# When context exceeds limit:
# 1. Drop oldest tool outputs first (as atomic units with their calls)
# 2. Drop oldest conversation turns
# 3. NEVER drop: system prompt, last N turns, orphaned tool responses
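A simplified sketch of that eviction order, assuming OpenAI-style message dicts and a hypothetical count_tokens helper; this is the shape of the algorithm, not the actual RollingWindow implementation:

```python
def trim(messages: list[dict], count_tokens, limit: int, keep_turns: int = 5) -> list[dict]:
    """Evict context until it fits: oldest tool-call units first, then oldest turns."""
    def droppable() -> list[dict]:
        # Never touch the system prompt (index 0) or the last `keep_turns` messages.
        return messages[1:len(messages) - keep_turns]

    def drop(msg: dict) -> None:
        start = messages.index(msg)
        end = start + 1
        # A tool call and its tool responses are removed together, as an atomic unit.
        if msg.get("tool_calls"):
            while end < len(messages) and messages[end].get("role") == "tool":
                end += 1
        del messages[start:end]

    # 1. Drop the oldest tool-call/tool-output units first.
    while count_tokens(messages) > limit:
        tool_units = [m for m in droppable() if m.get("tool_calls")]
        if not tool_units:
            break
        drop(tool_units[0])

    # 2. Then drop the oldest remaining conversation turns.
    while count_tokens(messages) > limit and droppable():
        drop(droppable()[0])

    return messages
```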
Metrics & Monitoring
Prometheus Metrics (Proxy)
curl http://localhost:8787/metrics
# HELP headroom_requests_total Total requests processed
headroom_requests_total{mode="optimize"} 1234
# HELP headroom_tokens_saved_total Total tokens saved
headroom_tokens_saved_total 5678900
# HELP headroom_compression_ratio Compression ratio histogram
headroom_compression_ratio_bucket{le="0.5"} 890
Query Stored Metrics (SDK)
from datetime import datetime, timedelta
# Get recent metrics
metrics = client.get_metrics(
start_time=datetime.utcnow() - timedelta(hours=1),
limit=100,
)
for m in metrics:
print(f"{m.timestamp}: {m.tokens_input_before} -> {m.tokens_input_after}")
# Get summary statistics
summary = client.get_summary()
print(f"Total requests: {summary['total_requests']}")
print(f"Total tokens saved: {summary['total_tokens_saved']}")
Troubleshooting
"Proxy won't start"
# Check if port is in use
lsof -i :8787
# Try a different port
headroom proxy --port 8788
# Check logs
headroom proxy --log-level debug
"No token savings"
# 1. Verify mode is "optimize"
stats = client.get_stats()
print(stats["config"]["mode"]) # Should be "optimize"
# 2. Check if transforms are enabled
print(stats["transforms"]) # smart_crusher_enabled should be True
# 3. Enable logging to see what's happening
import logging
logging.basicConfig(level=logging.DEBUG)
# 4. Use simulate to see what WOULD happen
plan = client.chat.completions.simulate(model="gpt-4o", messages=msgs)
print(f"Transforms that would apply: {plan.transforms}")
"High latency"
# Headroom adds ~1-5ms overhead. If you see more:
# 1. Check if embedding scorer is enabled (slower but better relevance)
# Switch to BM25 for faster scoring:
config.smart_crusher.relevance.tier = "bm25"
# 2. Disable transforms you don't need
config.cache_aligner.enabled = False # If you don't need cache alignment
# 3. Increase min_tokens_to_crush to skip small payloads
config.smart_crusher.min_tokens_to_crush = 500
"Compression too aggressive"
# Keep more items
config.smart_crusher.max_items_after_crush = 50 # Default is 15
# Or disable compression for specific tools
response = client.chat.completions.create(
model="gpt-4o",
messages=[...],
headroom_tool_profiles={
"important_tool": {"skip_compression": True}
}
)
Supported Providers
| Provider | Token Counting | Cache Optimization | Status |
|---|---|---|---|
| OpenAI | tiktoken (exact) | Automatic prefix caching | Full |
| Anthropic | Official API | cache_control blocks | Full |
| Google Gemini | Official API | Context caching | Full |
| Cohere | Official API | - | Full |
| Mistral | Official tokenizer | - | Full |
| LiteLLM | Via underlying provider | - | Full |
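For example, an Anthropic client wraps the same way as the OpenAI client shown earlier. This sketch assumes an AnthropicProvider is exported alongside OpenAIProvider; check the API Reference for the exact class name:

```python
from anthropic import Anthropic
from headroom import HeadroomClient, AnthropicProvider  # AnthropicProvider name is an assumption

client = HeadroomClient(
    original_client=Anthropic(),
    provider=AnthropicProvider(),
    default_mode="optimize",
)
```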
Safety Guarantees
Headroom follows strict safety rules:
- Never removes human content - User/assistant messages are never compressed
- Never breaks tool ordering - Tool calls and responses stay paired as atomic units
- Parse failures are no-ops - Malformed content passes through unchanged
- Preserves recency - Last N turns are always kept
- Errors surface, don't hide - Explicit exceptions with context
Performance
| Scenario | Before | After | Savings | Overhead |
|---|---|---|---|---|
| Search results (1000 items) | 45,000 tokens | 4,500 tokens | 90% | ~2ms |
| Log analysis (500 entries) | 22,000 tokens | 3,300 tokens | 85% | ~1ms |
| API response (nested JSON) | 15,000 tokens | 2,250 tokens | 85% | ~1ms |
| Long conversation (50 turns) | 80,000 tokens | 32,000 tokens | 60% | ~3ms |
Documentation
- Quickstart Guide - Complete working examples
- Proxy Documentation - Production deployment
- Transform Reference - How each transform works
- API Reference - Complete API documentation
- Troubleshooting - Common issues and solutions
- Architecture - How Headroom works internally
Examples
See the examples/ directory for complete, runnable examples:
- basic_usage.py - Simple SDK usage
- proxy_integration.py - Using the proxy with different clients
- custom_compression.py - Advanced compression configuration
- metrics_dashboard.py - Building a metrics dashboard
Contributing
We welcome contributions!
# Development setup
git clone https://github.com/headroom-sdk/headroom.git
cd headroom
pip install -e ".[dev]"
# Run tests
pytest
# Run linting
ruff check .
mypy headroom
See CONTRIBUTING.md for details.
License
Apache License 2.0 - see LICENSE for details.
Built for the AI developer community
Download files
File details
Details for the file headroom_ai-0.2.0.tar.gz.
File metadata
- Download URL: headroom_ai-0.2.0.tar.gz
- Upload date:
- Size: 316.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | eca1d312b85f86744701274eec8ef8cf449f231c462d2d2ba0e9a2e9697adf25 |
| MD5 | 27b7ee94945dbc02f575b3783da3079b |
| BLAKE2b-256 | edf9a965097f937094e7b2ff7a8c26100de348b3ff27067b86571f426b3878dc |
File details
Details for the file headroom_ai-0.2.0-py3-none-any.whl.
File metadata
- Download URL: headroom_ai-0.2.0-py3-none-any.whl
- Upload date:
- Size: 248.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a0d5901d40705cd79bf460b5d33455e7b15817cee22446e2eed1f1b4d4bc377d |
| MD5 | d09700c2de7a34b3d505f9c5bbe420bf |
| BLAKE2b-256 | ac13c6017cb61e1cf54654aa339ef2f056c7b2dfd4c8e0c5bea8bceb264327bf |