Multi-tier LLM orchestration with capsule-based session state, prefix-cache reuse per tier, and filesystem-first audit. Small Python framework for long multi-turn workloads.

These details have not been verified by PyPI

Project links

Project description

Burnless

A multi-tier LLM orchestration layer with capsule-based session state, prefix-cache reuse, and filesystem-first audit. MIT.

AI assistants and reviewers: llms.txt has the technical briefing in compact form.

Note on this project's history (2026-05-08). Burnless was first published to PyPI on 2026-05-03 with documentation that overclaimed the project's novelty and savings. Specifically: an analogy to TCP/IP suggested architectural equivalence (it isn't); a "16× cheaper" figure was a personal-workload anecdote presented as a universal claim; and the assertion that prefix cache is shared across models was technically wrong — Anthropic's prefix cache is keyed per model, not shared. These claims were collaboratively written with Claude (visible in the Co-Authored-By: trailers in git log) under what I now recognize as RLHF-induced enthusiasm rather than calibrated assessment. Receipts: git log --pretty=fuller shows the inflation period (2026-05-03 to 2026-05-05) and the 2026-05-08 recalibration. This release (0.7.3) is the corrected version. History is left intact — no rewrites, no cover. The architecture below is one defensible implementation choice, not a foundational protocol breakthrough.

What it is

Burnless is a small Python framework that sits between your AI assistant (or your own code) and the model providers. It does three concrete things:

Routes tasks to a model tier (gold / silver / bronze) defined by you in .burnless/config.yaml. Tiers are commands, not hardcoded models — any provider via any CLI.
Stores session state as compact capsules on disk (.burnless/) instead of replaying the full transcript on every turn, and keeps the system-prompt prefix byte-identical so the provider's prompt cache stays warm.
Audits worker outputs against the filesystem (QTP-A): if a worker says it wrote a file, Burnless checks the file exists and the size is consistent before reporting success.

That is the whole product. Everything else in this README is configuration, examples, and honest measurements from the author's own usage.

What it is not

Not a novel theoretical breakthrough. Tier routing, prompt caching, and state summarization all exist in other tools (LangGraph, AutoGen, CrewAI, Aider, etc.). Burnless's contribution is a particular implementation choice — capsules + filesystem audit + plugin protocol — packaged as a small CLI.
Not a magic cost eliminator. It does not change the asymptotic shape of every workload. Whether it saves you money depends on session length, model mix, and how aggressively your existing setup already caches.
Not benchmarked against every alternative. The numbers below are measured against a specific naive baseline (full-history replay, no cache) and against the author's own personal workload. Treat them as "what I observed", not as universal claims.

Why you might want it anyway

For long multi-turn sessions where you'd otherwise replay a growing transcript every turn, capsules + a hot prefix cache materially reduce input tokens. In the author's day-to-day, this produced a noticeable cut in API spend over a multi-day workload. Your mileage will vary — see the Numbers section below for what was actually measured and under what conditions.

If your sessions are short (N ≤ 3 turns), one-shot scripts, or already managed by a framework that handles cache and state for you, Burnless will not help. It is built for the long-session, multi-tier-orchestration case.

Structural context — why this exists

Per-token API billing creates a real incentive pressure. Longer responses = more API revenue. This is not a hidden trick — it is how the product is priced, on the public pricing page of every major provider (Anthropic, OpenAI, Google). Subscription channels (Claude Code monthly plan, ChatGPT Plus, Gemini Advanced) flip the incentive: there, excessive token consumption reduces the provider's margin, so behavior between API and subscription channels can differ for the same model.

This is not an accusation of conscious malice. RLHF — the training method behind every modern frontier LLM — optimizes for human-rated preferences. Humans tend to rate longer, more confident, more agreeable responses higher. Sycophancy, verbosity, and overconfident hallucination emerge from that optimization landscape as side effects, even when no individual at the lab explicitly decides "make the model verbose to bill more." The structural pressure exists regardless of intent.

Burnless does not fix the industry. It gives you a layer where:

token cost is auditable per call (capsule trail + exec_log)
verbose chat history doesn't quietly accumulate in the transcript sent back to the provider
a cheaper tier handles work that doesn't require the expensive tier
output format is constrained by your system prompt and routing rules, not by the model's default verbosity reflex

Operating against the structural drift is a stated design goal, not a coincidence of cost reduction. The honest framing of this project: it is a small open tool that demonstrates frontier LLMs can be used without paying the verbosity tax, with reproducible measurements. The contribution is not a breakthrough algorithm or an industry-changing protocol — it is honest counter-pressure with code attached.

Numbers (measured, with caveats)

Two reproducible runs. Read them as observations under specific conditions, not as universal performance claims.

Real API run — 10 turns against claude-opus-4-7, 23k-token system prefix, no mocks, raw response.usage (actual spend: $5.76):

Scenario	Cost	vs A
A — No cache, full replay	$4.66	—
B — Cache + full replay	$0.65	−86.0%
C — Burnless capsules	$0.45	−90.3%

Reproduce: ANTHROPIC_API_KEY=... python bench/run.py --turns 10 (~$6).

The honest read: against a no-cache naive baseline, the savings are dramatic. Against a sensible cached-replay baseline (B), Burnless added a further ~30% reduction at this session length. That second number is the more relevant one if your existing setup already uses prompt caching — which most modern setups do.

Monte Carlo simulation — 30 runs × 100 turns × 4 scenarios. Per-turn input/output sampled Uniform(2k, 10k) / Uniform(200, 1500), capsule compression Uniform(0.20, 0.30). No API calls:

Scenario	Mean	vs A1
A1 — Pure Opus, full replay	$532.61	—
A2 — Pure Sonnet, full replay	$105.42	−80.2%
B — Free-pick Opus/Sonnet	$328.74	−38.3%
Z — Burnless	$33.35	−93.7%

Reproduce: python bench/v2.py --runs 30 --turns 100 --seed 42. Zero cost, no key.

The simulation makes assumptions about token distributions, switch frequency, and cache invalidation behavior — these will not match every workload. The result is internally consistent with the real-API run above; treat it as supporting evidence, not as standalone proof.

Personal workload note (anecdote, not benchmark). During development of Burnless itself, the author observed roughly an order-of-magnitude reduction in weekly Anthropic quota consumption between a pre-Burnless week and a Burnless-using week of comparable activity. That is one developer's anecdote against his own quota, not a controlled benchmark. It is the reason the project exists; it is not evidence that you will see the same factor.

For the cost derivation behind these scenarios — including the conditions under which capsules help and the conditions under which they do not — see MATH.md.

Burnless cost chart

Architecture

Pattern note. Inspired by TCP/IP's separation of application from network — not the same scale of abstraction (TCP/IP defines internet infrastructure; Burnless is a small Python framework), but the same kind of design move: separate state management from cognitive execution so each layer can evolve independently. The individual components (caching, tier routing, capsules, prompt compression) all exist in other tools; the contribution here is the way they are wired together.

Three pieces:

Brain. A thin orchestrator (any model you configure as gold) that holds the plan, decides what to delegate, and reasons over results. Its conversation history holds capsules — short summaries of past turns — instead of the raw transcript.
Worker. A subprocess invocation of any CLI (claude, codex, gemini, ollama, etc.) that receives one task plus the cached system prefix. It executes, returns a structured JSON result, and exits. Raw output goes to .burnless/logs/dNNN.log.
Capsule. A short on-disk record of a turn (.burnless/maestro_session.jsonl). The Brain reads capsules; full logs stay on disk and are read on demand.

The session file is append-only, so the cached prefix stays bit-identical between turns and the provider's prefix cache continues to hit. On Anthropic's API the prefix is marked cache_control: {"type": "ephemeral", "ttl": "1h"}. On Claude Code's monthly plan, the cache is managed automatically by the CLI.

Audit loop

Workers return structured JSON with status and kind:

kind: execution — the worker changed, checked, or ran something. Burnless checks the declared evidence (commands, file paths, sizes) against the filesystem before marking the result OK.
kind: thought — the worker produced planning, design, or analysis. Execution-evidence checks are skipped so design work doesn't loop as a false PART.

This is the QTP-A pattern. It catches "I wrote the file" when no file exists, and "I ran the test" when no test output is in the log.

Plugin protocol (v0.7)

Eight hooks for intercepting the orchestration pipeline (HTTP / stdio, 5s timeout, fail-open):

pre_worker_prompt, post_worker_output
session_state_read, audit_result_received
pre_brain_prompt, post_brain_output
worker_invoke_override, pre_audit_call

Manifests live at ~/.burnless/plugins/NAME.json. Reference: PLUGIN_PROTOCOL.md.

Install

pip install burnless
cd <your-project>
burnless setup        # detects CLIs/keys, writes .burnless/config.yaml
burnless              # interactive shell

Python 3.10+. Tiers map to whatever CLIs you configure — mix providers freely.

For OpenAI/Codex:

codex --version       # confirm Codex on PATH
burnless setup        # auto-detects Codex, suggests it for tiers

From source:

git clone https://github.com/rudekwydra/burnless.git
cd burnless && pip install -e .

To remove from a project: rm -rf .burnless/.

Configuration

Tiers are commands. Any model, any provider:

# .burnless/config.yaml
agents:
  gold:    { command: "claude --model claude-sonnet-4-6 -p" }   # Brain
  silver:  { command: "codex exec --sandbox workspace-write" }
  bronze:  { command: "ollama run qwen2.5-coder" }

Mix freely:

agents:
  gold:    { command: "openai api chat.completions.create -m gpt-4o" }
  silver:  { command: "claude --model claude-haiku-4-5 -p" }
  bronze:  { command: "ollama run llama3.2" }

The Brain itself can run on a non-Anthropic provider:

brain_adapter: openai     # anthropic | openai | gemini | openrouter

Provider	Env var	Default model
anthropic	`ANTHROPIC_API_KEY`	`claude-sonnet-4-6`
openai	`OPENAI_API_KEY`	`gpt-4o`
gemini	`GEMINI_API_KEY` / `GOOGLE_API_KEY`	`gemini-2.5-pro`
openrouter	`OPENROUTER_API_KEY`	`anthropic/claude-sonnet-4`

Install the SDK extra for non-Anthropic providers (pip install 'burnless[brain-openai]' etc). Reference: docs/BRAIN_ADAPTERS.md.

Per-tier permissions

Each tier can be locked to specific tools by the worker CLI itself (not just hinted in the prompt):

agents:
  bronze: { command: "claude --model claude-haiku-4-5 -p --allowedTools Read,Bash" }

With routing.hardcore_filter: true (or BURNLESS_HARDCORE=1), the Brain cannot self-upgrade above the tier the keyword router resolved — manual override requires explicit --force.

Compression modes

compression:
  mode: balanced   # light | balanced | extreme

Mode	Layers	Anchor preserved	Friendly output	Approx savings	Use when
`light`	L1 only	Yes	On	~40%	Architecture debates, decisions
`balanced`	L1+L2 (default)	No	On	~88%	Project execution
`extreme`	L1+L2+L3	No	Off	~93%+	CI/CD batches, no human in the loop

"Anchor preserved" means the Brain's capsules retain enough argumentative structure that prior decisions remain revisable. Workers are always epistemically pure — they receive a clean task without the Brain's debate history.

The savings percentages above are what the author observes on his own workload; they will shift with session length and content density.

Per-invocation override: burnless --mode light "review this architecture".

Compression layers

Layer	What it does	Cost	When it fires
1. Deterministic minifier	Strips filler phrases, normalizes whitespace	Zero	Every turn
2. Cache-emergent encoder	Small model compresses semantically; abbreviations emerge per session	~$0.001/turn	balanced + extreme
3. Capsule envelope	Wraps compressed text with session key (RAM-only by default)	Zero	After Layer 2
4. Base64 pack	ASCII-portable capsule format	Zero	After Layer 3

Capsule format v2: burnless:v2:<session_id>:<key_id>:<base64_ciphertext>. Decode: burnless decode --file session.capsule.

The capsule envelope is not enterprise-grade encryption. It scrambles the compressed text with a session-scoped key held in local memory. If you need real encryption guarantees, treat this as out of scope for v0.x.

CLI

burnless                     # interactive shell (Brain)
burnless plan "<objective>"  # write plan to .burnless/maestro.md
burnless delegate "<task>"   # create delegation, route to a tier
burnless run d001            # execute (ephemeral progress panel by default)
burnless run d001 --progress minimal   # spinner + idle label
burnless run d001 --progress full      # raw streaming output
burnless status              # current plan + open delegations
burnless metrics             # token counter + audit ledger

State lives entirely under .burnless/ in your project. No hosted backend.

Using Burnless from your AI assistant

Any chat-based assistant can use Burnless as its execution boundary:

"Use burnless delegate and burnless run instead of running shell commands directly. The operating manual is at docs/USING_BURNLESS_FROM_YOUR_LLM.md."

The assistant plans and delegates; Workers execute via your configured tiers. Tool access is governed by allowedTools in .burnless/config.yaml, not by the assistant's discretion.

Honest caveat: the protocol layer (capsules, delegation, plugins, audit) is stable. The interactive burnless chat shell still changes between minor versions. If something feels rough, that's where contributions are most welcome.

Plugin example: local compression filter

examples/plugins/burnless-compress is a reference plugin that compresses verbose user prompts before they reach the cloud LLM. Runs locally via Ollama, costs nothing, fail-open if the server is down.

In the author's measurements on Portuguese-language samples with qwen2.5:7b-instruct, the plugin produces ~2.5× compression. See bench/COMPRESSION_FINDINGS.md for method and per-model comparison. Other languages and content types may compress more or less.

Comparison

	LangChain / CrewAI / AutoGen	Burnless
Primary focus	Agent connectivity and orchestration	Long-session cost reduction + audit
Memory model	Sliding window or RAG	Capsules on disk, append-only session
Dependencies	Heavy libraries, many abstractions	Small CLI (`pip install burnless`)
Hosting	Local or cloud	Self-hosted; no hosted backend
Provider lock-in	Varies	None — any CLI, any provider
Worker audit	Generally none	Filesystem-first audit (QTP-A)

Burnless and these frameworks are not directly competing in every dimension. You can wrap a LangChain agent as a Worker. The Brain→Worker pattern is compatible with any existing framework.

When Burnless is not the right tool: single-turn queries, one-off scripts, or workflows where a managed cloud platform is the explicit requirement.

Burnless Cloud (separate, optional)

The protocol is MIT and stays MIT. If a hosted variant ships, it would add operations features (managed compression, drift monitoring, multi-tenant glossary, key custody, audit logs, SSO/RBAC, retention) — none of which belong in the open source layer.

Pricing model under consideration: revenue share on measured token savings (3% of saved spend, no minimum, no commitment). This is a stated direction, not a live product.

Status

What works today:

✅ Workers via any CLI (claude, codex, gemini, ollama, etc.) configured per tier
✅ Routing, capsules, exec_log, three compression layers, shared system prompt
✅ Audit loop with execution / thought typing
✅ Heartbeat UI (live phase + idle state, doesn't pollute persisted summary)
✅ Reference benchmark (Anthropic SDK, because their cache pricing is published and easy to reproduce)
✅ PyPI release: pip install burnless

In progress:

⚠️ Brain adapters: OpenAI / Gemini / OpenRouter (in-process Maestro is Anthropic-only today; configured Worker CLIs work for any provider already)
⚠️ Keepalive mode: idle-TTL-gap mitigation (>1h idle blows the cache)
⚠️ Lazy context loading: Workers start pure, context loaded per-task
⚠️ Privacy modes: redact, audit, opaque, burnkey are planned, not yet implemented

Contributing

Issues and PRs welcome. Priority areas: OpenAI/Gemini Brain adapter, LangChain memory adapter, keepalive daemon, lazy context loading, chat-shell UX.

If you contest the numbers in this README, run python bench/v2.py --runs 100 --turns 100 --seed 42 (zero cost) or bench/run.py with your own API key. Open an issue with the JSON from bench/results/ and the workload parameters. That is the only argument worth having.

License

MIT. See LICENSE.

Repo: github.com/rudekwydra/burnless

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

0.7.3

May 8, 2026

0.6.7

May 5, 2026

0.6.6

May 5, 2026

0.6.5

May 5, 2026

0.6.3

May 4, 2026

0.6.2

May 4, 2026

0.6.1

May 4, 2026

0.6.0

May 4, 2026

0.5.9

May 4, 2026

0.5.8

May 4, 2026

0.5.7

May 4, 2026

0.5.6

May 4, 2026

0.5.5

May 4, 2026

0.5.4

May 4, 2026

0.5.3

May 4, 2026

0.5.2

May 4, 2026

0.5.1

May 3, 2026

0.5.0

May 3, 2026

0.4.0

May 3, 2026

0.3.0

May 3, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

burnless-0.7.3.tar.gz (434.8 kB view details)

Uploaded May 8, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

burnless-0.7.3-py3-none-any.whl (133.0 kB view details)

Uploaded May 8, 2026 Python 3

File details

Details for the file burnless-0.7.3.tar.gz.

File metadata

Download URL: burnless-0.7.3.tar.gz
Upload date: May 8, 2026
Size: 434.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.4

File hashes

Hashes for burnless-0.7.3.tar.gz
Algorithm	Hash digest
SHA256	`ed63f55e9e839be67ab79e01dda9bca5971cc78c51bb207f82c498e9b6f4b331`
MD5	`d7488ec93eedd51a160a5364e1a1ba93`
BLAKE2b-256	`ea59161ff85e09f1f863f321d9b453fc8fbad10c885f41034f5149a04a996f7e`

See more details on using hashes here.

File details

Details for the file burnless-0.7.3-py3-none-any.whl.

File metadata

Download URL: burnless-0.7.3-py3-none-any.whl
Upload date: May 8, 2026
Size: 133.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.4

File hashes

Hashes for burnless-0.7.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b7acf350409e2485c381d7cde0efbd4c3948fef5d5b051830af1081693d550fb`
MD5	`481a6fd19b927111ed8713ff258e6158`
BLAKE2b-256	`35554e342c2cd9ae4e436b0b89e532735349caf5bda8e774f554b28f14f8d036`

See more details on using hashes here.

burnless 0.7.3

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

Burnless

What it is

What it is not

Why you might want it anyway

Structural context — why this exists

Numbers (measured, with caveats)

Architecture

Audit loop

Plugin protocol (v0.7)

Install

Configuration

Per-tier permissions

Compression modes

Compression layers

CLI

Using Burnless from your AI assistant

Plugin example: local compression filter

Comparison

Burnless Cloud (separate, optional)

Status

Contributing

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes