Context Assembly Layer — intelligent, cache-aware context management for LLM applications

These details have not been verified by PyPI

Project links

Project description

CAL — Context Assembly Layer

Intelligent, cache-aware context management for LLM applications. CAL dynamically selects, compresses, and assembles context chunks so your AI agent gets exactly the right information for each request — nothing more, nothing less — while preserving provider cache coherence.

Works with Anthropic, OpenAI, and Google Gemini out of the box.

The Problem

You have an AI agent with tools, documents, conversation history, and user preferences. Stuffing everything into every request wastes tokens, increases latency, and costs money. But naively reducing tokens can bust your provider's prefix cache — making the "optimized" version cost MORE than the unoptimized one. We learned this the hard way in production.

What CAL Does

Selector — Scores chunks by relevance using IDF-weighted triggers, summary matching, and conversation inheritance. Two-zone architecture: stable prefix (always cached) + dynamic chunks (deterministically ordered to maximize cache hits). Includes poison trigger suppression and require-any gates to prevent noise loading.

Chunker — Splits large documents into coherent pieces. Compresses intelligently when a chunk is relevant but too large.

Tool Stubs — Three-tier lazy tool loading with conversation history awareness. Provides lightweight stubs until the model signals intent to use a specific tool. Automatically preserves full schemas for tools already used in conversation history, preventing provider validation errors.

Cost Engine — Provider-aware savings calculator. Knows that Anthropic has 4 input tiers, OpenAI has automatic 90% cache discounts, and Google charges for cache storage. No more wrong math.

Telemetry — Logs token counts, cache hit rates, chunk overlap, and cost estimates per request. Trust production data, not benchmarks.

Quick Start

pip install cal-context

from cal import Selector, Chunker

selector = Selector(chunks_dir='./my_chunks', provider='anthropic')
chunker = Chunker(max_tokens=4096)

query = 'What is the status of Project Alpha?'
selected = selector.select(query, max_chunks=5)
compressed = chunker.process(selected)

# Context is assembled with cache-stable ordering
prompt = build_prompt(system=compressed, user=query)
response = call_llm(prompt)

Why Cache-Stable Ordering Matters

Every major LLM provider caches by prefix. If the first N tokens of your request match the previous request, you get cheap cache-read pricing. If you sort chunks by relevance score, the same chunk can land at different positions between requests, breaking the prefix match. In our production test, this made the "optimized" version cost more than no optimization at all.

CAL fixes this by using scores only for selection (which chunks to include) and using deterministic alphabetical ordering for position. Overlapping chunks between requests stay in identical positions, maximizing cache hits.

Two-Zone Architecture

from cal import Selector
from cal.cache_hints import get_hint_provider

selector = Selector(chunks_dir='./my_chunks')
hints = get_hint_provider('anthropic')

assembled = selector.assemble('What is Project Alpha status?')
system = hints.build_system_message(assembled['zone1'], assembled['zone2'])

# Zone 1: Mandatory chunks (identity, rules) — always cached
# Zone 2: Dynamic chunks — alphabetically ordered for prefix stability

Zone	Content	Order	Cache Behavior
Zone 1 (Stable)	Mandatory chunks: identity, tools, rules	Fixed — never changes	Always cache hit (prefix match)
Zone 2 (Dynamic)	Selected chunks: project data, recent context	Alphabetical by chunk ID	Cache hit when overlapping selections share prefix
User Message	Current user query	Always last	Never cached (always unique)

Noise Suppression (v1.2)

Real-world indexes have "poison triggers" — common words like dates, names, or generic terms that appear in dozens of unrelated chunks. Without suppression, these cause irrelevant chunks to load on nearly every request.

CAL v1.2 adds two defenses:

IDF Floor — Triggers appearing in 10+ chunks automatically get zero weight. Configurable via idf_floor_doc_freq.

# Default: triggers in 10+ chunks get zero weight
selector = Selector(chunks_dir='./chunks')

# Custom threshold
selector = Selector(chunks_dir='./chunks', idf_floor_doc_freq=15)

# Disable entirely
selector = Selector(chunks_dir='./chunks', idf_floor_doc_freq=0)

Require-Any Gates — Lock a chunk behind specific topic keywords. The chunk only loads when at least one gate keyword appears in the query.

{
  "chunk_id": "brookes_agent_setup",
  "triggers": ["agent", "setup", "server", "openclaw"],
  "negative_triggers": {
    "require_any": ["brooke"],
    "penalty": [],
    "hard_exclude": []
  }
}

Without the gate, this chunk loads on any query mentioning "agent" or "setup". With require_any: ["brooke"], it only loads when Brooke is specifically mentioned.

In production, these two features moved us from 65% to 83% average token savings — a +18 percentage point improvement from suppressing 4 noise chunks that were loading on 78-96% of all requests.

History-Aware Tool Stubs (v1.3)

When an LLM conversation includes tool_use blocks (e.g. read({path: "..."})) in the message history, providers like Anthropic validate those historical calls against the current request's tool schemas. If you stub a tool that was already used, the schema mismatch causes a 400 error.

CAL v1.3 adds history-aware tool stub selection:

from cal.tool_stubs import ToolStubs

stubs = ToolStubs(my_tool_schemas)

# Pass conversation messages for history awareness
schemas, meta = stubs.select_tools(
    "what about the second result?",
    messages=conversation_history,  # scans for tool_use blocks
)

# meta["history_protected"] shows which tools kept full schemas
# meta["tier"] shows which tier was selected (0, 1, or 2)

Three tiers:

Tier	When	Behavior
0 (No Tools)	Short conversational message, no history tools	Strip all tools — retry if model needs one
1 (Shortlist)	No specific tool detected, but might need tools	Core tools as stubs only
2 (Targeted)	Specific tool intent detected via triggers	Full schemas for detected tools, stubs for rest

History-protected tools always keep full schemas regardless of tier.

Provider Support

Provider	Cache Type	CAL Behavior
Anthropic	Prefix + cache_control hint	Emits `cache_control: ephemeral` on Zone 1
OpenAI	Automatic prefix + prompt_cache_key	Sets stable `prompt_cache_key` per workspace
Google Gemini	Implicit (auto) or Explicit (manual)	Stable prefix for implicit; explicit cache API optional

Cost Engine

from cal.cost import estimate_savings, google_cache_breakeven

# How much does CAL actually save?
savings = estimate_savings(
    tokens_without_cal=23000,
    tokens_with_cal=5500,
    provider='anthropic',
    model='opus',
    cache_hit_rate=0.85,
    requests_per_day=100,
)
print(savings['note'])
# "76% token reduction. Saves ~$100/month at 100 req/day (anthropic/opus, 85% cache hit rate)."

# Is Google explicit caching worth it?
breakeven = google_cache_breakeven(
    cached_tokens=2000,
    uncached_rate=3.50,
    cache_read_rate=0.35,
)
print(f"Need {breakeven['breakeven_requests_per_hour']:.0f} req/hr to break even")

Telemetry

from cal.telemetry import Telemetry

telemetry = Telemetry(log_path='./cal_telemetry.jsonl', provider='anthropic', model='opus')

# Log each request
telemetry.record(
    original_tokens=23000,
    optimized_tokens=5500,
    chunks_selected=['identity', 'project_alpha', 'tools'],
    cached_tokens=4800,  # from provider response headers
)

# Get aggregate stats
stats = telemetry.get_stats()
print(f"Avg reduction: {stats['avg_reduction_pct']}%")
print(f"Avg cache overlap: {stats['avg_overlap_pct']}%")

Production Benchmarks

Measured on Claude Opus 4, 103 chunks indexed, 250+ production requests:

Metric	Without CAL	With CAL (v1.1)	With CAL (v1.2)	With CAL (v1.3)
Tokens per request	~23,000	~7,800	~4,100	~4,100
Chunks per request	103 (all)	~20	~6	~6
Avg savings	—	65%	83%	83%
Tool schema errors	N/A	Possible on multi-turn	Possible on multi-turn	0 (history-aware)
Cost per request (cached)	$0.043	$0.015	$0.008	$0.008
Failsafe errors	N/A	0	0	0

Important: The primary value is context quality, not cost. 4K relevant tokens produce better model responses than 23K tokens with noise. Cost savings are the bonus.

Configuration

Variable	Default	Description
CAL_PROVIDER	anthropic	Provider: anthropic, openai, or google
CAL_CHUNKS_DIR	./chunks	Path to your chunks directory
CAL_MAX_TOKENS	100000	Max token budget for assembled context
CAL_COMPRESSION_THRESHOLD	0.8	Compress chunks above this % of budget
CAL_MODEL	(provider default)	Model for compression if needed
CAL_TELEMETRY_ENABLED	true	Enable/disable request logging

Contributing

PRs welcome. Open an issue first so we can discuss the approach. One feature or fix per PR.

License

MIT — do whatever you want with it. See LICENSE.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

1.3.0

Apr 5, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

cal_context-1.3.0.tar.gz (36.6 kB view details)

Uploaded Apr 5, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

cal_context-1.3.0-py3-none-any.whl (28.1 kB view details)

Uploaded Apr 5, 2026 Python 3

File details

Details for the file cal_context-1.3.0.tar.gz.

File metadata

Download URL: cal_context-1.3.0.tar.gz
Upload date: Apr 5, 2026
Size: 36.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for cal_context-1.3.0.tar.gz
Algorithm	Hash digest
SHA256	`5c03a9509c0b3c74cbfc3d3cff6a7ae6c124e5a938143129a5210d85c6e4b2da`
MD5	`3015b2d65e89e8aeb58956c8473e07eb`
BLAKE2b-256	`bdc1c2423ccef3e0d2ab556808c298edf6265dc21bdd96bad7613a480984907d`

See more details on using hashes here.

File details

Details for the file cal_context-1.3.0-py3-none-any.whl.

File metadata

Download URL: cal_context-1.3.0-py3-none-any.whl
Upload date: Apr 5, 2026
Size: 28.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for cal_context-1.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`3de15474a3fdda1b49152623b29abdb3dca091940df3fdd20524a10faefd218c`
MD5	`b65b52f0bb047c915f5a06ee253ff4a4`
BLAKE2b-256	`2d0cdebf904447c7ca36415f247b453a8d5557ba0f797fb9e6903ec5a88b667c`

See more details on using hashes here.

cal-context 1.3.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

CAL — Context Assembly Layer

The Problem

What CAL Does

Quick Start

Why Cache-Stable Ordering Matters

Two-Zone Architecture

Noise Suppression (v1.2)

History-Aware Tool Stubs (v1.3)

Provider Support

Cost Engine

Telemetry

Production Benchmarks

Configuration

Contributing

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes