Token-waste observability for AI coding agents — measure orientation tax, context bloat, and cache engagement from real transcripts
cram-ai
cram prevents expensive exploratory file reads by pre-loading focused project, task, symbol, decision, and gotcha context before coding starts. Instead of spending the first few exchanges re-discovering your codebase, your AI tool arrives oriented.
Works with Claude Code, Cursor, Windsurf, Zed, Codex, GitHub Copilot, Gemini CLI, and any tool that reads a file on startup. Custom tool targets are supported via config.
Install
# Standard — includes MCP server for Claude Code / Cursor / Windsurf / Zed
pip install 'cram-ai[mcp]'
# With TUI dashboard
pip install 'cram-ai[mcp,tui]'
# With additional model providers (OpenAI, Gemini, Bedrock, Ollama …)
pip install 'cram-ai[mcp,multi-provider]'
Quick start
cd your-repo
# 1. One-time setup
cram init
# → scans your repo, generates ARCHITECTURE.md + SYMBOLS.md via a cheap model
# → scaffolds DECISIONS.md + GOTCHAS.md for you to fill in
# → installs a git post-commit hook to keep context fresh
# 2. Fill in the manual files — this is where cram's real value lives
vim .ai-context/DECISIONS.md # architectural invariants, naming conventions
vim .ai-context/GOTCHAS.md # non-obvious traps that burned your team
# Or mine your git history for decisions automatically:
cram decisions --mine
# 3. Commit so teammates get the context layer too
git add .ai-context/ CLAUDE.md
git commit -m "chore: init cram-ai context layer"
Then wire up your tool of choice — MCP or file-based delivery.
The context layer
cram maintains five files in .ai-context/. Two are auto-generated. Two are manual. One is
generated per task.
your-repo/
└── .ai-context/
    ├── ARCHITECTURE.md   ← auto · repo structure, tech stack, key files
    ├── SYMBOLS.md        ← auto · every source file mapped to its public identifiers
    ├── DECISIONS.md      ← manual · architectural commitments your team has made
    ├── GOTCHAS.md        ← manual · non-obvious traps, foot-guns, things that burn people
    └── CURRENT_TASK.md   ← per-task · focused excerpts for the current work
Auto-generated (ARCHITECTURE.md, SYMBOLS.md):
- Generated by cram init, refreshed automatically via the git post-commit hook after each commit
- SYMBOLS.md is the candidate pool for file selection: every source file mapped to its public identifiers via regex. Deterministic, no LLM cost, byte-stable across runs. The LLM picks from this index rather than scanning raw source files, so it stays grounded in real symbols.
- ARCHITECTURE.md uses a cheap model (Haiku / Gemini Flash / GPT-4o Mini)
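A regex-built, byte-stable symbol index can be sketched as follows. This is an illustration only: the pattern, the function names (public_identifiers, render_symbols), and the output layout are assumptions, not cram's actual implementation, and the pattern covers only top-level Python definitions.

```python
import re

# Assumed pattern: top-level Python def/class names that do not start with "_".
PUBLIC_DEF = re.compile(
    r"^(?:async\s+def|def|class)\s+([A-Za-z][A-Za-z0-9_]*)", re.MULTILINE
)


def public_identifiers(source: str) -> list[str]:
    """Return top-level public names in first-seen order, without duplicates."""
    seen: dict[str, None] = {}
    for name in PUBLIC_DEF.findall(source):
        seen.setdefault(name, None)
    return list(seen)


def render_symbols(files: dict[str, str]) -> str:
    """Render a path-sorted index so identical input always yields identical bytes."""
    lines = []
    for path in sorted(files):
        names = public_identifiers(files[path])
        if names:
            lines.append(f"- {path}: {', '.join(names)}")
    return "\n".join(lines) + "\n"
```

Sorting by path and avoiding any model call is what makes the output deterministic: the same repository state produces the same bytes on every run.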
Manual (DECISIONS.md, GOTCHAS.md):
- Scaffolded by cram init; you fill them in over time
- DECISIONS.md: "we use X", "never do Y", naming conventions, non-obvious invariants
- GOTCHAS.md: silent side effects, middleware gaps, surprising nulls, things grep can't tell you
- Append entries with cram decide "...", cram gotcha "...", or mine git history with cram decisions --mine
- Agents can propose decisions mid-session using the propose_decision MCP tool
Output protection by default: Every file cram generates for file-based targets includes command output protection rules. Your agent won't accidentally read 80 KB of build output mid-session:
- Unknown commands are byte-capped to 6,000 bytes by default (COMMAND 2>&1 | head -c 6000)
- File inspection uses head/tail, never raw cat
- Large outputs go to a temp file you inspect in ranges
- Configurable in .ai-context/config.toml under [output] (byte_cap, line_cap, temp_file)
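For readers who want the same guard in a script, here is a minimal Python sketch of the byte cap. The function names are hypothetical. Unlike head -c, which stops reading once the cap is reached, this version collects the full output and then truncates it, so it limits what the agent sees rather than what the command produces.

```python
import subprocess

DEFAULT_BYTE_CAP = 6000  # the default cap described above


def cap_bytes(data: bytes, byte_cap: int = DEFAULT_BYTE_CAP) -> bytes:
    """Keep at most byte_cap bytes, similar in effect to `head -c`."""
    return data[:byte_cap]


def run_capped(argv: list[str], byte_cap: int = DEFAULT_BYTE_CAP) -> str:
    """Run a command with stderr merged into stdout and return the capped output."""
    proc = subprocess.run(
        argv, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, check=False
    )
    # A multi-byte character may be cut at the cap; replace it instead of failing.
    return cap_bytes(proc.stdout, byte_cap).decode("utf-8", errors="replace")
```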
Per-task (CURRENT_TASK.md):
When you call get_context("task description"), cram runs a four-stage pipeline:
- Symbol index: reads SYMBOLS.md (every public identifier, by file) as the candidate pool
- LLM selection: a cheap model picks relevant files from the symbol index and names key identifiers
- Excerpt extraction: pulls identifier-focused excerpts from selected files (not full files)
- Context write: assembles into CURRENT_TASK.md (~800–1,500 tokens)
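The excerpt-extraction stage can be pictured with the sketch below. It is not cram's code: extract_excerpt, the context window, and the "..." gap marker are assumptions made for illustration. The 300-line default mirrors AICONTEXT_MAX_LINES.

```python
import re


def extract_excerpt(source, identifiers, context=2, max_lines=300):
    """Return numbered lines around each identifier mention, not the whole file."""
    lines = source.splitlines()
    patterns = [re.compile(rf"\b{re.escape(name)}\b") for name in identifiers]
    keep = set()
    for i, line in enumerate(lines):
        if any(p.search(line) for p in patterns):
            lo = max(0, i - context)
            hi = min(len(lines), i + context + 1)
            keep.update(range(lo, hi))
    out = []
    prev = None
    for i in sorted(keep)[:max_lines]:
        if prev is not None and i != prev + 1:
            out.append("...")  # marks skipped lines between two excerpts
        out.append(f"{i + 1}: {lines[i]}")
        prev = i
    return "\n".join(out)
```

Keeping line numbers in the excerpt lets the agent jump to the right place if it later needs the full file.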
What this replaces: the agent spending 3–5 tool calls grep-ing and reading files to orient itself at the start of every session. With cram the context arrives in one call and includes knowledge — decisions, gotchas — that the agent can't discover by searching.
MCP delivery
If your tool supports MCP (Claude Code, Cursor, Windsurf, Zed, Codex CLI), wire up the cram MCP server once and the tool can call context tools directly.
One-time server config (same format for all MCP clients):
{
  "mcpServers": {
    "cram-ai": {
      "command": "cram",
      "args": ["mcp", "--repo", "/absolute/path/to/your-repo"]
    }
  }
}
| Client | Config file |
|---|---|
| Claude Code | .mcp.json at repo root |
| Cursor | .cursor/mcp.json or Cursor Settings → MCP |
| Windsurf | Windsurf MCP settings |
| Zed | Zed assistant settings → context servers |
| Codex CLI | ~/.codex/config.yaml → mcpServers |
Available MCP tools:
| Tool | What it returns | When to call it |
|---|---|---|
| get_context(task='') | Runs symbol lookup → file selection → excerpt extraction. No-arg: returns last CURRENT_TASK.md without re-running the LLM. Prepends a staleness warning when context is stale or critical. | First thing every session |
| get_architecture() | ARCHITECTURE.md: repo structure, tech stack, key files | Orientation in an unfamiliar area |
| get_symbols(query='') | SYMBOLS.md: source files mapped to public identifiers, optionally filtered | Finding where a function is defined |
| get_decisions() | DECISIONS.md: architectural commitments | Before making a design choice |
| get_gotchas() | GOTCHAS.md: non-obvious traps and foot-guns | Before touching an unfamiliar area |
| propose_decision(text, reason='', alternatives='') | Appends a [PENDING] entry to DECISIONS.md for owner review. Logs to suggestions.jsonl for cram ui. | When you make an architectural choice worth recording |
| add_file(path, identifiers='') | Appends a file's excerpts to CURRENT_TASK.md | When a mid-task discovery needs new context |
| get_health() | Deterministic markdown: staleness score (0–10), commits since last sync, per-file token counts vs soft budgets. Safe to cache. | Before trusting loaded context on a long-running branch |
Agent write-back
Agents can propose decisions directly to DECISIONS.md without leaving their session:
propose_decision(
text="use JWT over session cookies",
reason="stateless — scales horizontally without a session store",
alternatives="redis session store, cookie-based sessions"
)
This appends a [PENDING] entry. The entry is surfaced in cram ui where you approve or
discard it with a single keystroke. Nothing is written to the canonical record without owner
review.
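The write-back flow can be sketched like this. The entry layout, the record fields, and the location of suggestions.jsonl inside the context directory are all assumptions; only the [PENDING] tag and the two file names come from the description above.

```python
import json
from datetime import date
from pathlib import Path


def propose_decision(context_dir: Path, text: str, reason: str = "",
                     alternatives: str = "") -> str:
    """Append a pending entry to DECISIONS.md and log it for later review."""
    entry = f"- [PENDING] {date.today().isoformat()} {text}"  # assumed layout
    if reason:
        entry += f" (reason: {reason})"
    if alternatives:
        entry += f" (alternatives: {alternatives})"
    context_dir.mkdir(parents=True, exist_ok=True)
    with (context_dir / "DECISIONS.md").open("a", encoding="utf-8") as fh:
        fh.write(entry + "\n")
    # Assumed record shape; one JSON object per line.
    record = {"text": text, "reason": reason,
              "alternatives": alternatives, "status": "pending"}
    with (context_dir / "suggestions.jsonl").open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return entry
```

Appending rather than rewriting keeps existing accepted entries untouched, which fits the rule that nothing becomes canonical without owner review.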
File-based delivery
For tools that don't support MCP, run cram task "..." --target <tool> before your session.
cram writes focused context into the file the tool auto-loads at startup.
# GitHub Copilot
cram task "add pagination to the users endpoint" --target copilot
# → writes to .github/cram-task.md (one-time: add an include line to copilot-instructions.md)
# Cursor (no-MCP fallback)
cram task "add pagination to the users endpoint" --target cursor
# → writes to .cursor/rules/cram-task.md
# Gemini CLI
cram task "refactor the auth module" --target gemini
# → writes to GEMINI.md (marker-based, preserves your content outside cram's section)
# All targets at once
cram task "add pagination to the users endpoint" --target all
| Target | File written |
|---|---|
| cursor | .cursor/rules/cram-task.md |
| windsurf | .windsurf/rules/cram-task.md |
| copilot | .github/cram-task.md |
| codex | AGENTS.md (repo root) |
| gemini | GEMINI.md (repo root, marker-based upsert) |
| claude | CLAUDE.md (escape hatch for Claude Code; prefer MCP) |
| all | All detected targets |
Custom targets — for tools with non-standard instruction files (e.g., an enterprise IDE with
ACME.md), add a section to .ai-context/config.toml:
[targets.acme]
file = "ACME.md"
indicator = "acme.config.json" # optional: file/dir that signals tool is active
upsert = true # optional: use cram markers to preserve your content
Then cram task "..." --target acme works like any built-in target.
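Marker-based upsert, as used for GEMINI.md and for custom targets with upsert = true, can be sketched as below. The marker strings and the function name are hypothetical; cram's real markers are not documented here.

```python
BEGIN = "<!-- cram:begin -->"  # hypothetical marker, not cram's actual string
END = "<!-- cram:end -->"      # hypothetical marker, not cram's actual string


def upsert_section(existing: str, task_context: str) -> str:
    """Replace the managed section if present, otherwise append it."""
    block = f"{BEGIN}\n{task_context.strip()}\n{END}"
    start = existing.find(BEGIN)
    end = existing.find(END, start + len(BEGIN)) if start != -1 else -1
    if start != -1 and end != -1:
        # Everything outside the markers is copied through unchanged.
        return existing[:start] + block + existing[end + len(END):]
    sep = "" if not existing else ("\n" if existing.endswith("\n") else "\n\n")
    return existing + sep + block + "\n"
```

Running the function twice with different task text leaves one managed section and keeps the user's own instructions byte-for-byte intact.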
Enterprise gateways — for internal model proxies that use SSO tokens instead of API keys,
add to ~/.config/cram-ai/settings.json:
{
  "proxy": {
    "base_url": "https://gateway.corp/v1",
    "headers": { "X-Corp-Token": "your-sso-token" }
  }
}
cram passes base_url as api_base, headers as extra_headers, and uses a dummy
api_key so litellm doesn't abort. No API key needed.
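The mapping from settings.json to litellm arguments might look like the sketch below. The helper name and the placeholder key value are assumptions; the block only builds the keyword arguments and does not import or call litellm.

```python
import json


def litellm_kwargs(settings: dict) -> dict:
    """Translate the proxy section of settings.json into litellm keyword arguments."""
    proxy = settings.get("proxy") or {}
    kwargs = {}
    if proxy.get("base_url"):
        kwargs["api_base"] = proxy["base_url"]
        kwargs["extra_headers"] = dict(proxy.get("headers", {}))
        # Placeholder only: the gateway authenticates via headers, not this key.
        kwargs["api_key"] = "dummy"
    return kwargs


settings = json.loads(
    '{"proxy": {"base_url": "https://gateway.corp/v1",'
    ' "headers": {"X-Corp-Token": "your-sso-token"}}}'
)
gateway_kwargs = litellm_kwargs(settings)
```

The resulting dictionary would then be passed along with the model and messages, for example as litellm.completion(model=..., messages=..., **gateway_kwargs).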
Decisions mining
DECISIONS.md is the hardest file to keep current — it depends entirely on human discipline.
cram decisions --mine automates the first draft:
cram decisions --mine # scan last 90 days of git history
cram decisions --mine --days 180
It scans git log for decision-shaped language, runs a cheap model to extract structured
entries, then walks you through them one at a time (git add -p style):
── Draft 1/3 ──────────────────────────────────────
Decision: use JWT over session cookies
Reason: reduces server-side state, scales horizontally
[a]ccept [s]kip [e]dit [q]uit >
Accepted entries are appended to DECISIONS.md immediately.
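The first pass, finding decision-shaped language, could be as simple as a keyword filter over commit subjects. The hint phrases below are guesses for illustration and are not cram's real heuristics; the model-based extraction step is not shown.

```python
import re

# Assumed hint phrases; a real list would be tuned against actual history.
DECISION_HINTS = re.compile(
    r"\b(instead of|switch(?:ed)? to|migrat(?:e|ed) to|replace[sd]? .+ with"
    r"|never|always|deprecate[sd]?)\b",
    re.IGNORECASE,
)


def decision_candidates(messages: list[str]) -> list[str]:
    """Keep commit messages that read like a recorded choice between options."""
    return [m for m in messages if DECISION_HINTS.search(m)]
```

Input for this filter could come from git log --since="90 days ago" --pretty=format:%s, one subject per line.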
TUI dashboard
cram ui opens a Textual terminal dashboard:
pip install 'cram-ai[tui]'
cram ui
Six tabs:
- Audit (default): the orientation-tax numbers for the last 30 days: reads before first edit, read-to-edit ratio with band, cache writes/reads per session, a cache-engagement check (sessions that wrote cache but never read it back), context-bloat metrics (context per request, read-cost tail share, carried cost of oversized tool results, redundant re-reads), and a weekly trend of the primary metric. The dashboard opens on the number, not the knobs.
- Decisions: pending agent proposals at top, accepted history below. Press a to approve the focused entry, d to delete it. Badge shows pending count.
- Sessions: recent Claude Code sessions with reads, edits, read-to-edit ratio, and the task that was active during each session. Task names are inferred from TASK_HISTORY.jsonl by matching each session's timestamp against task time windows. Ratio > 5× is flagged: context isn't landing for those sessions.
- Health: staleness score, commits since last sync, per-file token budgets.
- History: recent cram task invocations with timestamps. The active session task is shown at the top in green (it lives in session.json and hasn't been archived yet). This is a recall aid ("what was I working on last Tuesday?"), not a project management tool.
- Actions: run cram sync, cram task, cram benchmark, or cram doctor from inside the TUI. An animated progress bar shows while a command is running.
Each tab refreshes its data when you switch to it, and the dashboard auto-refreshes every 30 seconds.
Press r to force a full refresh or q to quit.
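The Sessions tab's task inference, matching a session timestamp against task time windows, can be sketched with a binary search. The data shape is an assumption: each task is a (start_timestamp, name) pair, and a window is taken to run until the next task starts. The real TASK_HISTORY.jsonl schema is not documented here.

```python
from bisect import bisect_right
from typing import Optional


def task_for_session(session_ts: float,
                     tasks: list[tuple[float, str]]) -> Optional[str]:
    """Return the task whose window contains session_ts.

    tasks must be sorted by start time; a session that began before the
    first task has no match.
    """
    starts = [start for start, _ in tasks]
    idx = bisect_right(starts, session_ts) - 1
    return tasks[idx][1] if idx >= 0 else None
```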
Orientation tax measurement
cram audit measures how much of each session is spent on navigation vs. actual work:
cram audit # last 30 days for this repo
cram audit --days 7 # tighter window
cram audit --all # all projects
cram audit --json # structured output for dashboards / scripts
cram audit --compare PATH_A PATH_B # side-by-side A/B of two checkouts
--compare prints both checkouts' metrics with deltas — built for attribution
experiments: keep one checkout wired with cram, one plain, alternate tasks between
them, and compare after a couple of weeks.
Output includes:
Avg reads before first edit: 8.2 ← primary metric
Avg edits/session: 3.1
Avg read-to-edit ratio: 2.6× ~ normal
Cache engagement: 18/24 sessions read from cache
Ratio guide: < 2× good · 2–5× normal · > 5× context isn't landing
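The ratio and its band follow directly from the guide above. In this sketch the ratio is reads before first edit divided by total edits, and the sample averages (8.2 reads, 3.1 edits) stand in for a single session; the function names are illustrative.

```python
def read_to_edit_ratio(reads_before_first_edit: float, edits: float) -> float:
    """Reads before the first edit divided by total edits."""
    if edits <= 0:
        raise ValueError("ratio is undefined for a session with no edits")
    return reads_before_first_edit / edits


def ratio_band(ratio: float) -> str:
    """Apply the ratio guide: < 2 good, 2 to 5 normal, > 5 context isn't landing."""
    if ratio < 2:
        return "good"
    if ratio <= 5:
        return "normal"
    return "context isn't landing"


sample = read_to_edit_ratio(8.2, 3.1)  # about 2.6, which falls in the normal band
```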
The cache-engagement line is the silent-failure check: a session that wrote cache but never read it back paid the 1.25× write price for nothing — a signal that prompt caching isn't engaging (prefix instability, sub-floor prefix, or a misconfigured proxy).
The report also measures context bloat — usually the largest waste bucket: average
context re-read per request, the share of read-cost in the final third of turns
(33% = flat; higher means the context is growing), the carried cost of oversized
tool results (a big result entering at turn k is re-read by every later turn), and
redundant same-file reads. Thresholds: CRAM_AUDIT_BIG_RESULT_BYTES (default 20000).
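Two of these bloat metrics reduce to short arithmetic. The sketch below is an assumption about how they could be computed, not cram's implementation: the tail share uses the final third of turns rounded down, and the carried cost counts one re-read per later turn.

```python
BIG_RESULT_BYTES = 20000  # default of CRAM_AUDIT_BIG_RESULT_BYTES


def tail_share(read_cost_per_turn: list[float]) -> float:
    """Share of total read-cost that falls in the final third of turns."""
    total = sum(read_cost_per_turn)
    if total == 0:
        return 0.0
    n = len(read_cost_per_turn)
    tail_start = n - n // 3  # final third, rounded down
    return sum(read_cost_per_turn[tail_start:]) / total


def carried_tokens(result_tokens: int, entered_at_turn: int, total_turns: int) -> int:
    """Tokens re-read because a result entering at turn k rides along in later turns."""
    later_turns = max(0, total_turns - entered_at_turn)
    return result_tokens * later_turns


def is_oversized(result_bytes: int, threshold: int = BIG_RESULT_BYTES) -> bool:
    """A tool result counts as oversized when it is above the threshold."""
    return result_bytes > threshold
```

A flat session such as nine turns of equal cost gives a tail share of one third, while a session whose context keeps growing pushes the share higher.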
Retry loops are reported when present: failed tool calls per session (is_error
tool results — each usually means a retry follows) and same-file re-edits per session
(a couple is normal; sustained churn means the agent is thrashing).
Dollar attribution is provider-pluggable: set CRAM_PROVIDER to anthropic
(default), openai, gemini, or local (zero-dollar — the cost is latency). Prices
are representative defaults; override per field with CRAM_PRICE_INPUT_PER_MTOK,
CRAM_CACHE_WRITE_MULT, CRAM_CACHE_READ_MULT for billing-grade numbers.
Run before and after adopting cram to measure the reduction.
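How the price overrides feed a cost estimate can be sketched as follows. The environment variable names come from the text above; the 3.0 base price is a placeholder assumption, the 1.25 and 0.10 multipliers are the Anthropic values quoted in this document, and the formula itself is an illustration rather than cram's exact model.

```python
import os


def price_table(env=os.environ) -> dict:
    """Read price overrides from the environment, falling back to defaults."""
    return {
        # 3.0 is a placeholder default, not a quoted price.
        "input_per_mtok": float(env.get("CRAM_PRICE_INPUT_PER_MTOK", "3.0")),
        "cache_write_mult": float(env.get("CRAM_CACHE_WRITE_MULT", "1.25")),
        "cache_read_mult": float(env.get("CRAM_CACHE_READ_MULT", "0.10")),
    }


def session_cost(input_tokens: int, cache_write_tokens: int,
                 cache_read_tokens: int, prices: dict) -> float:
    """Dollar estimate: uncached input plus weighted cache writes and reads."""
    per_token = prices["input_per_mtok"] / 1_000_000
    return per_token * (
        input_tokens
        + cache_write_tokens * prices["cache_write_mult"]
        + cache_read_tokens * prices["cache_read_mult"]
    )
```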
Daily workflow
# Before a session — MCP path (Claude Code / Cursor / Windsurf / Zed)
# Nothing to run. The agent calls get_context() itself.
# Before a session — file-based delivery (Copilot / no-MCP tools)
cram task "fix the rate limiter" --target copilot
# Log a decision while working
cram decide "use cursor-based pagination, not offset — offset breaks under concurrent writes"
# Mine git history for past decisions
cram decisions --mine
# Log a gotcha you just found
cram gotcha "the users.email column is nullable in prod despite NOT NULL in schema.prisma"
# Extend grace period if you commit mid-task (prevents context reset)
cram continue
# Check context freshness
cram status
# Review pending agent proposals + session efficiency
cram ui
After every commit the git post-commit hook runs cram sync automatically to refresh
ARCHITECTURE.md and SYMBOLS.md. A session grace period prevents sync from firing while
you're mid-task.
Context health
cram tracks how stale your context is with a 0–10 staleness score derived from git — the number of commits on HEAD since ARCHITECTURE.md was last regenerated. No new state files; the score is always correct after a teammate pull.
The post-commit hook writes ARCHITECTURE.md to disk but does not commit it. The health check
detects this correctly: if the file has uncommitted changes (i.e., it was rewritten by cram sync
after the last commit), the score is reported as 0 — not stale.
| Score | Band | Meaning |
|---|---|---|
| 0–2 | fresh | Up to date, work freely |
| 3–5 | acceptable | Drifting slightly, fine to continue |
| 6–7 | stale | Update before next session |
| 8–10 | critical | Sync now, context may mislead |
The score falls back to an mtime check when git is unavailable. The critical threshold defaults to
10 commits and is tunable via CRAM_STALE_CRITICAL_COMMITS.
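One plausible way to turn a commit count into the 0–10 score is a linear scale that reaches 10 at the critical threshold. The linear mapping is an assumption; only the threshold default, the score range, and the band boundaries come from this section.

```python
def staleness_score(commits_since_sync: int, critical_commits: int = 10) -> int:
    """Scale commits linearly so that critical_commits maps to a score of 10."""
    if critical_commits <= 0:
        raise ValueError("critical_commits must be positive")
    scaled = (10 * max(0, commits_since_sync)) // critical_commits
    return min(10, scaled)


def staleness_band(score: int) -> str:
    """Map a 0-10 score to the bands in the table above."""
    if score <= 2:
        return "fresh"
    if score <= 5:
        return "acceptable"
    if score <= 7:
        return "stale"
    return "critical"
```

Lowering CRAM_STALE_CRITICAL_COMMITS makes the same commit count score higher, which matches the "lower = more sensitive" behavior described for that variable.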
Where health surfaces:
- cram status: per-file age table + health line with score, band, and commit count
- cram ui → Health tab: staleness score + per-file token budgets + active task slots
- Tray badge: shows band + score (stale 6/10) with band-appropriate color
- get_health() MCP tool: deterministic markdown block the agent can call before trusting context
- get_context(): prepends a one-line staleness warning when band is stale or critical
- cram sync: warns to stderr after regenerating if any frozen file exceeds its soft token budget
Soft token budgets (warnings only — nothing is ever truncated):
| File | Default budget | Override |
|---|---|---|
| ARCHITECTURE.md | 2,000 tok | CRAM_BUDGET_ARCHITECTURE |
| DECISIONS.md | 600 tok | CRAM_BUDGET_DECISIONS |
| GOTCHAS.md | 400 tok | CRAM_BUDGET_GOTCHAS |
| CURRENT_TASK.md | 800 tok | CRAM_BUDGET_TASK |
| SYMBOLS.md | no budget | scales with repo size |
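A soft-budget check that only warns can be sketched as below. The function name and the warning wording are assumptions, and the token counts are taken as input because the tokenizer cram uses is not stated.

```python
import os

# File name -> (override variable, default budget), as listed in the table above.
DEFAULT_BUDGETS = {
    "ARCHITECTURE.md": ("CRAM_BUDGET_ARCHITECTURE", 2000),
    "DECISIONS.md": ("CRAM_BUDGET_DECISIONS", 600),
    "GOTCHAS.md": ("CRAM_BUDGET_GOTCHAS", 400),
    "CURRENT_TASK.md": ("CRAM_BUDGET_TASK", 800),
}


def over_budget(token_counts: dict[str, int], env=os.environ) -> list[str]:
    """Return one warning per file above its soft budget; nothing is truncated."""
    warnings = []
    for name, count in sorted(token_counts.items()):
        if name not in DEFAULT_BUDGETS:
            continue  # for example SYMBOLS.md, which has no budget
        var, default = DEFAULT_BUDGETS[name]
        budget = int(env.get(var, default))
        if count > budget:
            warnings.append(f"{name}: {count} tok exceeds soft budget {budget}")
    return warnings
```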
Team and concurrency
cram is designed for one developer, one repo checkout. Context files live in
.ai-context/ which is typically gitignored, so each developer works with their own
independent context.
| Scenario | Works? |
|---|---|
| One developer, one agent | Yes — designed for this |
| One developer, parallel agents (same session) | Yes — each get_context() call gets its own slot under .ai-context/tasks/ |
| Multiple developers, separate checkouts | Independent — no sharing or coordination |
| Multiple developers wanting shared context | Not supported |
The slot system protects against concurrent MCP calls within a single server process. It is not a collaboration feature — there is no shared state, sync, or conflict resolution across different checkouts or machines.
CLI reference
| Command | What it does |
|---|---|
| cram init [path] [--team] | One-time setup: scans repo, generates context files, installs git hook |
| cram mcp [--repo PATH] | Start MCP server (stdio). Wire into your tool's settings once; clients launch it automatically. |
| cram task "..." [--target T] | Run context pipeline, write CURRENT_TASK.md, optionally inject into tool's auto-loaded file |
| cram decisions [--mine] [--days N] | Show DECISIONS.md, or mine git history for decision-shaped commits and review interactively |
| cram sync [path] | Refresh ARCHITECTURE.md + SYMBOLS.md from current repo state. If the session grace period has expired, archives the current task to TASK_HISTORY.jsonl and resets the task context in all target files (your instructions are untouched; only the cram-managed task section is cleared). |
| cram decide "..." [path] | Append a dated architectural decision to DECISIONS.md |
| cram gotcha "..." [path] | Append a non-obvious trap to GOTCHAS.md |
| cram continue [path] | Extend grace period: keep context across a mid-task commit |
| cram status [path] | Show each context file with age, line count, and token budget status |
| cram audit [--days N] [--all] [--json] [--compare PATH_A PATH_B] | Measure reads-before-edit, read-to-edit ratio, and cache engagement from Claude Code transcripts |
| cram ui [path] | TUI dashboard: pending decisions, session efficiency, context health (requires cram-ai[tui]) |
| cram benchmark [path] | Show token and cost comparison across delivery strategies |
| cram doctor [path] | Health check: models, hooks, git, context files |
| cram hook install\|uninstall | Manage the git post-commit hook manually |
Model providers
cram uses a cheap model for its maintenance calls (generating ARCHITECTURE.md, selecting files,
extracting excerpts). Set AICONTEXT_MODEL to any provider:
# Inside Claude Code — zero config, uses session credentials
cram init
# Anthropic API key
export ANTHROPIC_API_KEY=sk-...
export AICONTEXT_MODEL=anthropic/claude-haiku-4-5
# Google Gemini
export GEMINI_API_KEY=...
export AICONTEXT_MODEL=gemini/gemini-2.0-flash
# OpenAI
export OPENAI_API_KEY=sk-...
export AICONTEXT_MODEL=openai/gpt-4o-mini
# Ollama (local, free, no key needed)
export AICONTEXT_MODEL=ollama/mistral
cram init
Also supports: AWS Bedrock, GCP Vertex AI, Azure OpenAI, custom LiteLLM proxies with
proxy.base_url + proxy.headers (install cram-ai[multi-provider]).
Environment variables
| Variable | Default | Description |
|---|---|---|
| AICONTEXT_MODEL | auto-detected | Model for context tasks: bare alias (haiku) or provider/model |
| ANTHROPIC_API_KEY | (none) | Optional inside Claude Code (uses session credentials) |
| AICONTEXT_MAX_FILES | 5 | Max files included in CURRENT_TASK.md per task |
| AICONTEXT_MAX_LINES | 300 | Max lines per file when extracting excerpts |
| AICONTEXT_TASKS_PER_SESSION | 4 | Assumed tasks per cache window (used by cram benchmark) |
| CRAM_TASK_GRACE_SECONDS | 600 | Seconds after cram task before a commit resets context |
| CRAM_STALE_CRITICAL_COMMITS | 10 | Commits since last sync that map to staleness score 10 (critical). Lower = more sensitive. |
| CRAM_BUDGET_ARCHITECTURE | 2000 | Soft token budget for ARCHITECTURE.md; warns in cram status and cram sync when exceeded |
| CRAM_BUDGET_DECISIONS | 600 | Soft token budget for DECISIONS.md |
| CRAM_BUDGET_GOTCHAS | 400 | Soft token budget for GOTCHAS.md |
| CRAM_BUDGET_TASK | 800 | Soft token budget for CURRENT_TASK.md |
| CRAM_PROVIDER | anthropic | Pricing table for audit dollar attribution: anthropic / openai / gemini / local |
| CRAM_PRICE_INPUT_PER_MTOK | per provider | Override base input price ($/1M tokens) for audit cost estimates |
| CRAM_CACHE_WRITE_MULT | per provider | Override cache-write multiplier (1.25 on Anthropic) |
| CRAM_CACHE_READ_MULT | per provider | Override cache-read multiplier (0.10 on Anthropic) |
| CRAM_AUDIT_TOK_PER_FILE | 2500 | Assumed tokens per orientation file read in cram audit cost modeling |
| CRAM_AUDIT_BIG_RESULT_BYTES | 20000 | Serialized size above which a tool result counts as oversized in cram audit |
💰 Real-world token consumption
Without context pre-loading, an agent spends the first few exchanges of every session re-discovering the codebase — reading files, running searches, building orientation from scratch.
What a typical session consumes (no cram):
| Phase | What happens | Tokens |
|---|---|---|
| Session start | System prompt + tool definitions + rules files | 3–8K |
| Orientation | find / grep / read calls to discover relevant files cold | 20–60K |
| Active work | Conversation, edits, test runs | 20–50K |
| Output | Code written, explanations | 5–15K |
| Per task total | | 50–130K |
The orientation phase is pure re-discovery overhead — the agent reads roughly 8 files cold
each session to figure out where to work. cram replaces that with one get_context() call
returning ~1–2K tokens of targeted excerpts.
You can measure your actual overhead with cram audit. The read-to-edit ratio — reads before
first edit divided by total edits — tells you how much of each session is navigation vs. work.
Ratio > 5× means context isn't landing.
What cram removes:
cram eliminates the cold-start orientation overhead per session. It does not replace the agent's productive reads (edits, tests, active work). The savings are orientation-only, not total-session savings.
Run cram benchmark for a full token and cost breakdown across all three delivery strategies
and model tiers. Run cram audit to measure your actual read-to-edit ratio.
💸 Claude Code users: cache-write bonus
This section is specific to Claude Code + Anthropic. The context layer is useful for any tool, but Claude's prompt caching gives MCP delivery an additional cost advantage.
Anthropic's prompt cache has a 5-minute TTL by default. Content in the conversation prefix is cache-written at 1.25× the base input price on every new session and after every TTL expiry. Content that doesn't touch the prefix, such as MCP tool results, isn't cache-written this way.
File-based delivery vs MCP:
| | File-based delivery (--target claude) | MCP (get_context()) |
|---|---|---|
| Where context lands | CLAUDE.md → front of prefix | Conversation tail (tool result) |
| Cache writes per session | N × task context tokens | 1 × tool definitions (~1–2K tokens) |
| Per-task context cost | 1.25× write per task change | 0.1× read after first session write |
| 10K-token context, 4 tasks | ~$0.09–0.15 in cache writes | ~$0.01 in cache writes |
The larger your context and the more tasks per session, the more the MCP path saves.
Run cram benchmark to model the exact numbers for your repo.
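The last row of the comparison can be approximated with simple arithmetic. This sketch assumes a base input price of $3 per million tokens, which is an illustrative figure and not one taken from this document, together with the 1.25× write multiplier. It reproduces the upper end of the file-based range and the rounded MCP figure.

```python
def cache_write_cost(tokens: int, writes: int, input_per_mtok: float,
                     write_mult: float = 1.25) -> float:
    """Dollar cost of writing `tokens` to the cache `writes` times."""
    return tokens * writes * write_mult * input_per_mtok / 1_000_000


ASSUMED_PRICE = 3.0  # $/1M input tokens, an assumption for illustration

# File-based: the 10K-token context is rewritten on each of 4 task changes.
file_based = cache_write_cost(10_000, 4, ASSUMED_PRICE)

# MCP: only about 2K tokens of tool definitions are written, once.
mcp = cache_write_cost(2_000, 1, ASSUMED_PRICE)
```

Under these assumptions the file-based path costs about $0.15 in cache writes and the MCP path under $0.01, so the gap widens with context size and with tasks per session.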
The floor check: the frozen prefix must exceed 2,048 tokens (Sonnet 4.6) or 4,096 tokens
(Opus 4.8 / Haiku 4.5) to cache at all. cram benchmark flags this if your context files are
below the threshold.
Running tests
pip install pytest
pytest
No API key required — all model calls are mocked.
License
MIT