Skip to main content

Context Assembly Layer — intelligent, cache-aware context management for LLM applications

Project description

CAL — Context Assembly Layer

Intelligent, cache-aware context management for LLM applications. CAL dynamically selects, compresses, and assembles context chunks so your AI agent gets exactly the right information for each request — nothing more, nothing less — while preserving provider cache coherence.

Works with Anthropic, OpenAI, and Google Gemini out of the box.

The Problem

You have an AI agent with tools, documents, conversation history, and user preferences. Stuffing everything into every request wastes tokens, increases latency, and costs money. But naively reducing tokens can bust your provider's prefix cache — making the "optimized" version cost MORE than the unoptimized one. We learned this the hard way in production.

What CAL Does

Selector — Scores chunks by relevance using IDF-weighted triggers, summary matching, and conversation inheritance. Two-zone architecture: stable prefix (always cached) + dynamic chunks (deterministically ordered to maximize cache hits). Includes poison trigger suppression and require-any gates to prevent noise loading.

Chunker — Splits large documents into coherent pieces. Compresses intelligently when a chunk is relevant but too large.

Tool Stubs — Three-tier lazy tool loading with conversation history awareness. Provides lightweight stubs until the model signals intent to use a specific tool. Automatically preserves full schemas for tools already used in conversation history, preventing provider validation errors.

Cost Engine — Provider-aware savings calculator. Knows that Anthropic has 4 input tiers, OpenAI has automatic 90% cache discounts, and Google charges for cache storage. No more wrong math.

Telemetry — Logs token counts, cache hit rates, chunk overlap, and cost estimates per request. Trust production data, not benchmarks.

Quick Start

pip install cal-context

from cal import Selector, Chunker

selector = Selector(chunks_dir='./my_chunks', provider='anthropic')
chunker = Chunker(max_tokens=4096)

query = 'What is the status of Project Alpha?'
selected = selector.select(query, max_chunks=5)
compressed = chunker.process(selected)

# Context is assembled with cache-stable ordering
prompt = build_prompt(system=compressed, user=query)
response = call_llm(prompt)

Why Cache-Stable Ordering Matters

Every major LLM provider caches by prefix. If the first N tokens of your request match the previous request, you get cheap cache-read pricing. If you sort chunks by relevance score, the same chunk can land at different positions between requests, breaking the prefix match. In our production test, this made the "optimized" version cost more than no optimization at all.

CAL fixes this by using scores only for selection (which chunks to include) and using deterministic alphabetical ordering for position. Overlapping chunks between requests stay in identical positions, maximizing cache hits.

Two-Zone Architecture

from cal import Selector
from cal.cache_hints import get_hint_provider

selector = Selector(chunks_dir='./my_chunks')
hints = get_hint_provider('anthropic')

assembled = selector.assemble('What is Project Alpha status?')
system = hints.build_system_message(assembled['zone1'], assembled['zone2'])

# Zone 1: Mandatory chunks (identity, rules) — always cached
# Zone 2: Dynamic chunks — alphabetically ordered for prefix stability
Zone Content Order Cache Behavior
Zone 1 (Stable) Mandatory chunks: identity, tools, rules Fixed — never changes Always cache hit (prefix match)
Zone 2 (Dynamic) Selected chunks: project data, recent context Alphabetical by chunk ID Cache hit when overlapping selections share prefix
User Message Current user query Always last Never cached (always unique)

Noise Suppression (v1.2)

Real-world indexes have "poison triggers" — common words like dates, names, or generic terms that appear in dozens of unrelated chunks. Without suppression, these cause irrelevant chunks to load on nearly every request.

CAL v1.2 adds two defenses:

IDF Floor — Triggers appearing in 10+ chunks automatically get zero weight. Configurable via idf_floor_doc_freq.

# Default: triggers in 10+ chunks get zero weight
selector = Selector(chunks_dir='./chunks')

# Custom threshold
selector = Selector(chunks_dir='./chunks', idf_floor_doc_freq=15)

# Disable entirely
selector = Selector(chunks_dir='./chunks', idf_floor_doc_freq=0)

Require-Any Gates — Lock a chunk behind specific topic keywords. The chunk only loads when at least one gate keyword appears in the query.

{
  "chunk_id": "brookes_agent_setup",
  "triggers": ["agent", "setup", "server", "openclaw"],
  "negative_triggers": {
    "require_any": ["brooke"],
    "penalty": [],
    "hard_exclude": []
  }
}

Without the gate, this chunk loads on any query mentioning "agent" or "setup". With require_any: ["brooke"], it only loads when Brooke is specifically mentioned.

In production, these two features moved us from 65% to 83% average token savings — a +18 percentage point improvement from suppressing 4 noise chunks that were loading on 78-96% of all requests.

History-Aware Tool Stubs (v1.3)

When an LLM conversation includes tool_use blocks (e.g. read({path: "..."})) in the message history, providers like Anthropic validate those historical calls against the current request's tool schemas. If you stub a tool that was already used, the schema mismatch causes a 400 error.

CAL v1.3 adds history-aware tool stub selection:

from cal.tool_stubs import ToolStubs

stubs = ToolStubs(my_tool_schemas)

# Pass conversation messages for history awareness
schemas, meta = stubs.select_tools(
    "what about the second result?",
    messages=conversation_history,  # scans for tool_use blocks
)

# meta["history_protected"] shows which tools kept full schemas
# meta["tier"] shows which tier was selected (0, 1, or 2)

Three tiers:

Tier When Behavior
0 (No Tools) Short conversational message, no history tools Strip all tools — retry if model needs one
1 (Shortlist) No specific tool detected, but might need tools Core tools as stubs only
2 (Targeted) Specific tool intent detected via triggers Full schemas for detected tools, stubs for rest

History-protected tools always keep full schemas regardless of tier.

Provider Support

Provider Cache Type CAL Behavior
Anthropic Prefix + cache_control hint Emits cache_control: ephemeral on Zone 1
OpenAI Automatic prefix + prompt_cache_key Sets stable prompt_cache_key per workspace
Google Gemini Implicit (auto) or Explicit (manual) Stable prefix for implicit; explicit cache API optional

Cost Engine

from cal.cost import estimate_savings, google_cache_breakeven

# How much does CAL actually save?
savings = estimate_savings(
    tokens_without_cal=23000,
    tokens_with_cal=5500,
    provider='anthropic',
    model='opus',
    cache_hit_rate=0.85,
    requests_per_day=100,
)
print(savings['note'])
# "76% token reduction. Saves ~$100/month at 100 req/day (anthropic/opus, 85% cache hit rate)."

# Is Google explicit caching worth it?
breakeven = google_cache_breakeven(
    cached_tokens=2000,
    uncached_rate=3.50,
    cache_read_rate=0.35,
)
print(f"Need {breakeven['breakeven_requests_per_hour']:.0f} req/hr to break even")

Telemetry

from cal.telemetry import Telemetry

telemetry = Telemetry(log_path='./cal_telemetry.jsonl', provider='anthropic', model='opus')

# Log each request
telemetry.record(
    original_tokens=23000,
    optimized_tokens=5500,
    chunks_selected=['identity', 'project_alpha', 'tools'],
    cached_tokens=4800,  # from provider response headers
)

# Get aggregate stats
stats = telemetry.get_stats()
print(f"Avg reduction: {stats['avg_reduction_pct']}%")
print(f"Avg cache overlap: {stats['avg_overlap_pct']}%")

Production Benchmarks

Measured on Claude Opus 4, 103 chunks indexed, 250+ production requests:

Metric Without CAL With CAL (v1.1) With CAL (v1.2) With CAL (v1.3)
Tokens per request ~23,000 ~7,800 ~4,100 ~4,100
Chunks per request 103 (all) ~20 ~6 ~6
Avg savings 65% 83% 83%
Tool schema errors N/A Possible on multi-turn Possible on multi-turn 0 (history-aware)
Cost per request (cached) $0.043 $0.015 $0.008 $0.008
Failsafe errors N/A 0 0 0

Important: The primary value is context quality, not cost. 4K relevant tokens produce better model responses than 23K tokens with noise. Cost savings are the bonus.

Configuration

Variable Default Description
CAL_PROVIDER anthropic Provider: anthropic, openai, or google
CAL_CHUNKS_DIR ./chunks Path to your chunks directory
CAL_MAX_TOKENS 100000 Max token budget for assembled context
CAL_COMPRESSION_THRESHOLD 0.8 Compress chunks above this % of budget
CAL_MODEL (provider default) Model for compression if needed
CAL_TELEMETRY_ENABLED true Enable/disable request logging

Contributing

PRs welcome. Open an issue first so we can discuss the approach. One feature or fix per PR.

License

MIT — do whatever you want with it. See LICENSE.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

cal_context-1.3.0.tar.gz (36.6 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

cal_context-1.3.0-py3-none-any.whl (28.1 kB view details)

Uploaded Python 3

File details

Details for the file cal_context-1.3.0.tar.gz.

File metadata

  • Download URL: cal_context-1.3.0.tar.gz
  • Upload date:
  • Size: 36.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for cal_context-1.3.0.tar.gz
Algorithm Hash digest
SHA256 5c03a9509c0b3c74cbfc3d3cff6a7ae6c124e5a938143129a5210d85c6e4b2da
MD5 3015b2d65e89e8aeb58956c8473e07eb
BLAKE2b-256 bdc1c2423ccef3e0d2ab556808c298edf6265dc21bdd96bad7613a480984907d

See more details on using hashes here.

File details

Details for the file cal_context-1.3.0-py3-none-any.whl.

File metadata

  • Download URL: cal_context-1.3.0-py3-none-any.whl
  • Upload date:
  • Size: 28.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for cal_context-1.3.0-py3-none-any.whl
Algorithm Hash digest
SHA256 3de15474a3fdda1b49152623b29abdb3dca091940df3fdd20524a10faefd218c
MD5 b65b52f0bb047c915f5a06ee253ff4a4
BLAKE2b-256 2d0cdebf904447c7ca36415f247b453a8d5557ba0f797fb9e6903ec5a88b667c

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page