Context Assembly Layer — intelligent, cache-aware context management for LLM applications
Project description
CAL — Context Assembly Layer
Intelligent, cache-aware context management for LLM applications. CAL dynamically selects, compresses, and assembles context chunks so your AI agent gets exactly the right information for each request — nothing more, nothing less — while preserving provider cache coherence.
Works with Anthropic, OpenAI, and Google Gemini out of the box.
The Problem
You have an AI agent with tools, documents, conversation history, and user preferences. Stuffing everything into every request wastes tokens, increases latency, and costs money. But naively reducing tokens can bust your provider's prefix cache — making the "optimized" version cost MORE than the unoptimized one. We learned this the hard way in production.
What CAL Does
Selector — Scores chunks by relevance using IDF-weighted triggers, summary matching, and conversation inheritance. Two-zone architecture: stable prefix (always cached) + dynamic chunks (deterministically ordered to maximize cache hits). Includes poison trigger suppression and require-any gates to prevent noise loading.
Chunker — Splits large documents into coherent pieces. Compresses intelligently when a chunk is relevant but too large.
Tool Stubs — Three-tier lazy tool loading with conversation history awareness. Provides lightweight stubs until the model signals intent to use a specific tool. Automatically preserves full schemas for tools already used in conversation history, preventing provider validation errors.
Cost Engine — Provider-aware savings calculator. Knows that Anthropic has 4 input tiers, OpenAI has automatic 90% cache discounts, and Google charges for cache storage. No more wrong math.
Telemetry — Logs token counts, cache hit rates, chunk overlap, and cost estimates per request. Trust production data, not benchmarks.
Quick Start
pip install cal-context
from cal import Selector, Chunker
selector = Selector(chunks_dir='./my_chunks', provider='anthropic')
chunker = Chunker(max_tokens=4096)
query = 'What is the status of Project Alpha?'
selected = selector.select(query, max_chunks=5)
compressed = chunker.process(selected)
# Context is assembled with cache-stable ordering
prompt = build_prompt(system=compressed, user=query)
response = call_llm(prompt)
Why Cache-Stable Ordering Matters
Every major LLM provider caches by prefix. If the first N tokens of your request match the previous request, you get cheap cache-read pricing. If you sort chunks by relevance score, the same chunk can land at different positions between requests, breaking the prefix match. In our production test, this made the "optimized" version cost more than no optimization at all.
CAL fixes this by using scores only for selection (which chunks to include) and using deterministic alphabetical ordering for position. Overlapping chunks between requests stay in identical positions, maximizing cache hits.
Two-Zone Architecture
from cal import Selector
from cal.cache_hints import get_hint_provider
selector = Selector(chunks_dir='./my_chunks')
hints = get_hint_provider('anthropic')
assembled = selector.assemble('What is Project Alpha status?')
system = hints.build_system_message(assembled['zone1'], assembled['zone2'])
# Zone 1: Mandatory chunks (identity, rules) — always cached
# Zone 2: Dynamic chunks — alphabetically ordered for prefix stability
| Zone | Content | Order | Cache Behavior |
|---|---|---|---|
| Zone 1 (Stable) | Mandatory chunks: identity, tools, rules | Fixed — never changes | Always cache hit (prefix match) |
| Zone 2 (Dynamic) | Selected chunks: project data, recent context | Alphabetical by chunk ID | Cache hit when overlapping selections share prefix |
| User Message | Current user query | Always last | Never cached (always unique) |
Noise Suppression (v1.2)
Real-world indexes have "poison triggers" — common words like dates, names, or generic terms that appear in dozens of unrelated chunks. Without suppression, these cause irrelevant chunks to load on nearly every request.
CAL v1.2 adds two defenses:
IDF Floor — Triggers appearing in 10+ chunks automatically get zero weight. Configurable via idf_floor_doc_freq.
# Default: triggers in 10+ chunks get zero weight
selector = Selector(chunks_dir='./chunks')
# Custom threshold
selector = Selector(chunks_dir='./chunks', idf_floor_doc_freq=15)
# Disable entirely
selector = Selector(chunks_dir='./chunks', idf_floor_doc_freq=0)
Require-Any Gates — Lock a chunk behind specific topic keywords. The chunk only loads when at least one gate keyword appears in the query.
{
"chunk_id": "brookes_agent_setup",
"triggers": ["agent", "setup", "server", "openclaw"],
"negative_triggers": {
"require_any": ["brooke"],
"penalty": [],
"hard_exclude": []
}
}
Without the gate, this chunk loads on any query mentioning "agent" or "setup". With require_any: ["brooke"], it only loads when Brooke is specifically mentioned.
In production, these two features moved us from 65% to 83% average token savings — a +18 percentage point improvement from suppressing 4 noise chunks that were loading on 78-96% of all requests.
History-Aware Tool Stubs (v1.3)
When an LLM conversation includes tool_use blocks (e.g. read({path: "..."})) in the message history, providers like Anthropic validate those historical calls against the current request's tool schemas. If you stub a tool that was already used, the schema mismatch causes a 400 error.
CAL v1.3 adds history-aware tool stub selection:
from cal.tool_stubs import ToolStubs
stubs = ToolStubs(my_tool_schemas)
# Pass conversation messages for history awareness
schemas, meta = stubs.select_tools(
"what about the second result?",
messages=conversation_history, # scans for tool_use blocks
)
# meta["history_protected"] shows which tools kept full schemas
# meta["tier"] shows which tier was selected (0, 1, or 2)
Three tiers:
| Tier | When | Behavior |
|---|---|---|
| 0 (No Tools) | Short conversational message, no history tools | Strip all tools — retry if model needs one |
| 1 (Shortlist) | No specific tool detected, but might need tools | Core tools as stubs only |
| 2 (Targeted) | Specific tool intent detected via triggers | Full schemas for detected tools, stubs for rest |
History-protected tools always keep full schemas regardless of tier.
Provider Support
| Provider | Cache Type | CAL Behavior |
|---|---|---|
| Anthropic | Prefix + cache_control hint | Emits cache_control: ephemeral on Zone 1 |
| OpenAI | Automatic prefix + prompt_cache_key | Sets stable prompt_cache_key per workspace |
| Google Gemini | Implicit (auto) or Explicit (manual) | Stable prefix for implicit; explicit cache API optional |
Cost Engine
from cal.cost import estimate_savings, google_cache_breakeven
# How much does CAL actually save?
savings = estimate_savings(
tokens_without_cal=23000,
tokens_with_cal=5500,
provider='anthropic',
model='opus',
cache_hit_rate=0.85,
requests_per_day=100,
)
print(savings['note'])
# "76% token reduction. Saves ~$100/month at 100 req/day (anthropic/opus, 85% cache hit rate)."
# Is Google explicit caching worth it?
breakeven = google_cache_breakeven(
cached_tokens=2000,
uncached_rate=3.50,
cache_read_rate=0.35,
)
print(f"Need {breakeven['breakeven_requests_per_hour']:.0f} req/hr to break even")
Telemetry
from cal.telemetry import Telemetry
telemetry = Telemetry(log_path='./cal_telemetry.jsonl', provider='anthropic', model='opus')
# Log each request
telemetry.record(
original_tokens=23000,
optimized_tokens=5500,
chunks_selected=['identity', 'project_alpha', 'tools'],
cached_tokens=4800, # from provider response headers
)
# Get aggregate stats
stats = telemetry.get_stats()
print(f"Avg reduction: {stats['avg_reduction_pct']}%")
print(f"Avg cache overlap: {stats['avg_overlap_pct']}%")
Production Benchmarks
Measured on Claude Opus 4, 103 chunks indexed, 250+ production requests:
| Metric | Without CAL | With CAL (v1.1) | With CAL (v1.2) | With CAL (v1.3) |
|---|---|---|---|---|
| Tokens per request | ~23,000 | ~7,800 | ~4,100 | ~4,100 |
| Chunks per request | 103 (all) | ~20 | ~6 | ~6 |
| Avg savings | — | 65% | 83% | 83% |
| Tool schema errors | N/A | Possible on multi-turn | Possible on multi-turn | 0 (history-aware) |
| Cost per request (cached) | $0.043 | $0.015 | $0.008 | $0.008 |
| Failsafe errors | N/A | 0 | 0 | 0 |
Important: The primary value is context quality, not cost. 4K relevant tokens produce better model responses than 23K tokens with noise. Cost savings are the bonus.
Configuration
| Variable | Default | Description |
|---|---|---|
| CAL_PROVIDER | anthropic | Provider: anthropic, openai, or google |
| CAL_CHUNKS_DIR | ./chunks | Path to your chunks directory |
| CAL_MAX_TOKENS | 100000 | Max token budget for assembled context |
| CAL_COMPRESSION_THRESHOLD | 0.8 | Compress chunks above this % of budget |
| CAL_MODEL | (provider default) | Model for compression if needed |
| CAL_TELEMETRY_ENABLED | true | Enable/disable request logging |
Contributing
PRs welcome. Open an issue first so we can discuss the approach. One feature or fix per PR.
License
MIT — do whatever you want with it. See LICENSE.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file cal_context-1.3.0.tar.gz.
File metadata
- Download URL: cal_context-1.3.0.tar.gz
- Upload date:
- Size: 36.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
5c03a9509c0b3c74cbfc3d3cff6a7ae6c124e5a938143129a5210d85c6e4b2da
|
|
| MD5 |
3015b2d65e89e8aeb58956c8473e07eb
|
|
| BLAKE2b-256 |
bdc1c2423ccef3e0d2ab556808c298edf6265dc21bdd96bad7613a480984907d
|
File details
Details for the file cal_context-1.3.0-py3-none-any.whl.
File metadata
- Download URL: cal_context-1.3.0-py3-none-any.whl
- Upload date:
- Size: 28.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
3de15474a3fdda1b49152623b29abdb3dca091940df3fdd20524a10faefd218c
|
|
| MD5 |
b65b52f0bb047c915f5a06ee253ff4a4
|
|
| BLAKE2b-256 |
2d0cdebf904447c7ca36415f247b453a8d5557ba0f797fb9e6903ec5a88b667c
|