Skip to main content

Scan any codebase, see which files are burning your Claude/GPT tokens, and the exact dollar cost — with a CI budget gate and drop-in adapters for every major LLM.

Project description

 ██████╗ ██╗███████╗████████╗██╗██╗     ██╗
 ██╔══██╗██║██╔════╝╚══██╔══╝██║██║     ██║
 ██║  ██║██║███████╗   ██║   ██║██║     ██║
 ██║  ██║██║╚════██║   ██║   ██║██║     ██║
 ██████╔╝██║███████║   ██║   ██║███████╗███████╗
 ╚═════╝ ╚═╝╚══════╝   ╚═╝   ╚═╝╚══════╝╚══════╝

Stop burning tokens. Start shipping faster.

Universal token optimization toolkit — Claude, OpenAI, Gemini, Ollama, any LLM.

CI Python 3.10+ License: MIT PRs Welcome Zero hard deps

Quick Start · Try Online · Benchmarks · How It Works · Python API · CLI Reference

distill demo


The Problem

Every LLM re-reads your entire conversation history on every single turn. Token costs grow quadratically, not linearly — and most of that cost is lock files, generated code, and bloated config files that should never have been there.

Turn  1 →    ~500 tokens    ($0.001)
Turn  5 →  ~2,500 tokens    ($0.005)
Turn 10 → ~12,000 tokens    ($0.024)   ← 24× what turn 1 cost
Turn 20 → ~60,000 tokens    ($0.120)   ← 120× what turn 1 cost

A typical 20-turn Claude Code session burns 40,000–100,000 tokens. Distill fixes all of it.


✦ What Distill Does

Tool What it does
distill scan Scan any repo — tokens and dollar cost per file
distill analyze Detect waste: lock files, generated code, bloated configs
distill check CI budget gate — exits 1 if over context threshold
distill generate Auto-generate .llmignore, CLAUDE.md, and LLM configs
ClaudeAdapter Prompt caching + subagents + auto-compact for Anthropic's API
OpenAIAdapter History trimming + lean system prompts for GPT-4o and friends
GeminiAdapter 1M context window, token counting via native API
OllamaAdapter Context window management for local models
BaseLLMAdapter Extend for any LLM in ~30 lines

🚀 Quick Start

git clone https://github.com/bb1nfosec/distill
cd distill
pip install -e ".[tiktoken]"     # zero hard deps — tiktoken is optional but recommended

Audit your project's token cost and dollar spend in 30 seconds:

distill scan --path ./my-project
──────────────────────────────────────────────────────────────
  distill — Context Audit
──────────────────────────────────────────────────────────────
  Model         : claude  ($3.00 / 1M input tokens)
  Context limit : 200k tokens
  Files scanned : 247
  Total tokens  : 38.4k  (19.2% of context)
  Per-session $ : $0.1152  (input cost, context loaded once)
  Sessions / $1 : 8

  File                                        Tokens   Lines     Cost
  ──────────────────────────────────────────  ──────  ──────  ──────
  package-lock.json                            18.2k   4821   $0.0547  ← ignore
  src/generated/schema.ts                       4.1k    892   $0.0123  ← ignore
  src/api/routes.ts                             2.3k    412   $0.0069
  src/auth/middleware.ts                        1.8k    310   $0.0054

  Recommendations:
  → Lock files: 18.2k tokens ($0.0547) — add to .llmignore
  → Generated files: 4.1k tokens — ignore them
──────────────────────────────────────────────────────────────

Find waste and set a CI budget gate:

distill analyze --path ./my-project
distill check   --path . --max-pct 30           # exits 1 if over 30% of context
distill check   --path . --max-pct 30 --fail-on-waste   # also fail on lock files etc.

Generate all ignore files and configs:

distill generate --output . --model all         # .llmignore, CLAUDE.md, Modelfile, …
distill generate --output . --model all --dry-run

🧠 How It Works

Why costs are quadratic

Input tokens per turn = system_prompt + all_previous_history + new_message

Turn  1:   500 (system) +       0 (history) + 200 (msg) =     700
Turn  5:   500          +   4,000            + 200       =   4,700
Turn 10:   500          +  18,000            + 200       =  18,700
Turn 20:   500          +  76,000            + 200       =  76,700

The five root causes — and their fixes

  ① Conversation history  ████████████████████████  42%  →  auto_compact + /compact
  ② Large file reads      ████████████████████      35%  →  .llmignore + lazy loading
  ③ Bloated config files  ██████████                12%  →  generated CLAUDE.md ≤ 80 lines
  ④ Tool call overhead    ██████                     8%  →  batching guidance
  ⑤ Lock / build files    ████                       3%  →  context_analyzer.py

🔌 Supported Providers

Provider Config generated Adapter Key optimizations
Claude API System prompt ClaudeAdapter Prompt caching — up to 90% cost reduction on static context
Claude Code CLAUDE.md + .claudeignore Subagents, /compact, /btw, lean config
OpenAI GPT-4o openai_system.md OpenAIAdapter Lean system prompt, automatic history trimming
OpenAI GPT-4o-mini openai_system.md OpenAIAdapter 15× cheaper — use for tasks that don't need full GPT-4o
Google Gemini gemini_system.md GeminiAdapter 1M ctx window, native token counting, context caching
Ollama (local) Modelfile OllamaAdapter num_ctx tuning, task-based model selection
LiteLLM / Groq OpenAI-compat OpenAIAdapter(base_url=...) Works with any OpenAI-compatible proxy

🐍 Python API

Drop-in interface across all providers

from adapters import ClaudeAdapter, OpenAIAdapter, GeminiAdapter, OllamaAdapter

# Same interface — swap your LLM without changing any other code
llm = ClaudeAdapter(model="claude-sonnet-4-5", enable_caching=True)
# llm = OpenAIAdapter(model="gpt-4o")
# llm = GeminiAdapter(model="gemini-2.0-flash")   # $0.10/1M, 1M ctx window
# llm = OllamaAdapter(model="llama3.2", num_ctx=8192)

response = llm.chat("Refactor the auth module to use JWT")

# Compact when you finish a task phase
llm.compact()

# Session stats
llm.print_stats()
# ──────────────────────────────────────────────
#   Session stats — claude-sonnet-4-5
# ──────────────────────────────────────────────
#   Turns         : 8
#   Total tokens  : 14,230
#   Input tokens  : 12,100
#   Cached tokens : 8,400  (69.4% hit rate)
#   Elapsed       : 42.1s

Subagents — research without polluting your context

claude = ClaudeAdapter(model="claude-sonnet-4-5")

# Runs in a separate context window — only the summary lands in yours
summary = claude.run_subagent(
    task="How does our auth handle token refresh? Any edge cases?",
    context_files=["src/auth/jwt.ts", "src/middleware/authGuard.ts"]
)
# [Subagent] Research complete — 4,200 tokens used in separate context

response = claude.chat(f"Context: {summary}\n\nNow add refresh token rotation.")

Lazy file loading

# Loads file, warns if too large, truncates at line limit
content = llm.load_file_lazy("src/api/routes.ts", max_lines=150)
# [TokenOptimizer] Loaded src/api/routes.ts: 820 tokens

response = llm.chat(f"Add rate limiting:\n{content}")

Generate a lean CLAUDE.md

config = ClaudeAdapter.generate_claude_md(
    project_type="nextjs",
    pkg_manager="pnpm",
    test_cmd="pnpm test",
    lint_cmd="pnpm typecheck",
    forbidden_dirs=["node_modules", ".next", "dist", "coverage"],
    custom_notes="Use Zod for all validation. API handlers in src/app/api/.",
)
# Result: ~65 lines, ~320 tokens — lean by design

Add any LLM in ~30 lines

from adapters.base_adapter import BaseLLMAdapter, CompletionResult

class MyLLMAdapter(BaseLLMAdapter):
    def count_tokens(self, text: str) -> int:
        return len(text) // 4  # or use your provider's tokenizer

    def _call_api(self, messages: list[dict], **kwargs) -> CompletionResult:
        response = my_client.complete(messages)
        return CompletionResult(
            content=response.text,
            input_tokens=response.usage.input,
            output_tokens=response.usage.output,
            total_tokens=response.usage.total,
            model=self.model,
            latency_ms=response.latency_ms,
        )

# Instantly gets: auto-compact, history management, stats, lazy loading
llm = MyLLMAdapter(model="my-model-v1", auto_compact_threshold=0.70)

🖥️ CLI Reference

Install once and use distill everywhere:

pip install -e ".[tiktoken]"   # from repo root

distill scan

distill scan --path ./my-project              # tokens + cost per file
distill scan --path . --model gpt-4o          # OpenAI pricing
distill scan --path . --top 30                # top 30 files
distill scan --file src/api/routes.ts         # single file
distill scan --path . --json | jq '.[:5]'     # pipe to jq
distill scan --path . --no-ignore             # skip .llmignore

distill analyze

distill analyze --path ./my-project           # find waste patterns
distill analyze --path . --json               # JSON output for scripting

distill check ← use in CI

distill check --path . --max-pct 30           # fail if > 30% of context
distill check --path . --max-pct 50 --model gpt-4o
distill check --path . --max-pct 30 --fail-on-waste   # also fail on lock files
distill check --path . --json                 # machine-readable exit + report

GitHub Actions:

- name: Token budget check
  run: distill check --path . --max-pct 30

distill generate

distill generate --output . --model all          # .llmignore, CLAUDE.md, Modelfile
distill generate --output . --model claude        # Claude only
distill generate --output . --model all --dry-run # preview without writing

Direct scripts (no install required)

python3 core/token_counter.py --path .
python3 core/context_analyzer.py --path .
python3 core/check.py --path . --max-pct 30
python3 scripts/generate_config.py --output . --model all

📊 Benchmarks

Real measurements. No mocks. Full results and reproduction steps in benchmarks/results.md.

Token estimation accuracy

Distill uses tiktoken's cl100k_base encoder. Across every file type tested — inline comments, full adapters, 50 KB lock file slices, large Python files — error vs ground truth is 0.00%.

Sample Chars Error Time
Inline comment 41 0.0% 8.5 ms
Full adapter file (~8 KB) 7,768 0.0% 1.1 ms
Lock file slice (50 KB) 50,000 0.0% 9.7 ms
Large Python file (~35 KB) 35,000 0.0% 5.0 ms

Scan throughput

Project Files Tokens Time Throughput
distill (this repo, small) 26 33,374 21 ms 1,264 files/s · 1.62 M tok/s
TradingAgents (Python, medium) 85 85,412 51 ms 1,655 files/s · 1.66 M tok/s
vaathi-main (Next.js, large) 520 1,876,732 1,028 ms 506 files/s · 1.83 M tok/s

.llmignore waste elimination — vaathi-main (real Next.js project)

Tokens % of Claude 200k context
Before .llmignore 1,876,732 938.4% (9× over limit)
After .llmignore 1,315,353 657.7%
Eliminated 561,379 29.9%

Top waste found: package-lock.json (122k tokens), tsconfig.tsbuildinfo (103k), XML schema files (160k+).

Compaction — input tokens per turn (10-turn session)

Compaction applied at turn 4, compressing history to ~18%:

Input tokens
10 turns without compaction 37,760
10 turns with compaction 21,572
Saved 16,188 (42.9%)
# Reproduce all benchmarks yourself
python3 benchmarks/run_benchmarks.py
python3 benchmarks/run_benchmarks.py --path /your/project

📁 Project Structure

distill/
├── core/
│   ├── token_counter.py        # Token estimation + per-file cost breakdown
│   └── context_analyzer.py     # Waste pattern detection with actionable fixes
│
├── adapters/
│   ├── base_adapter.py         # Abstract base — extend for any LLM
│   ├── claude_adapter.py       # Claude: prompt caching, subagents, compaction
│   ├── openai_adapter.py       # OpenAI / any OpenAI-compatible endpoint
│   └── ollama_adapter.py       # Local models: context tuning, model selection
│
├── scripts/
│   ├── generate_config.py      # Auto-generate all LLM configs
│   └── example_usage.py        # Working examples for all providers
│
├── docs/
│   ├── UNIVERSAL_TIPS.md       # Optimization tips for every LLM
│   ├── CLAUDE_CODE.md          # Claude Code deep guide
│   └── OLLAMA.md               # Local model guide
│
├── tests/
│   └── test_core.py
│
├── setup.sh                    # One-command project setup
└── requirements.txt

💡 The Rules That Matter Most

1. Batch your prompts — single biggest win

❌  5 separate turns                    ✅  1 batched turn
──────────────────────────────────      ────────────────────────────────────
"Add validation to login"               "In one pass:
"Now add it to register too"              1. Add input validation to login,
"Also fix password reset"                    register, and password reset
"Update the error messages"               2. Standardize error message format
"And update the tests"                    3. Update all affected tests"

2. .llmignore is free money

Lock files alone are often 15,000+ tokens per session. One command generates everything:

python3 scripts/generate_config.py --output . --model all

3. Config files are a per-session tax

CLAUDE.md size        Per-session cost   Over 100 sessions
────────────────────  ─────────────────  ──────────────────
 50 lines  (~250t)          250 tokens        25,000 tokens
200 lines (~1,000t)       1,000 tokens       100,000 tokens
500 lines (~2,500t)       2,500 tokens       250,000 tokens

Keep CLAUDE.md under 80 lines. Use subdirectory files in monorepos.

4. Research in isolation

# ❌ Files enter your main context forever
claude.chat("Read src/auth/ and explain JWT refresh")

# ✅ Only the summary enters your context
summary = claude.run_subagent("How does JWT refresh work?", ["src/auth/"])
claude.chat(f"Given: {summary}\nNow add refresh token rotation.")

5. Start fresh between unrelated tasks

History never gets cheaper. Use /compact in Claude Code or llm.compact() / llm.clear() when switching tasks.


🤝 Contributing

PRs welcome. See CONTRIBUTING.md.

Priority areas:

  • adapters/gemini_adapter.py — Google Gemini adapter
  • adapters/litellm_adapter.py — LiteLLM unified proxy adapter
  • VS Code extension — real-time token counter in the status bar
  • More tests in tests/

📄 License

MIT — free to use, modify, and distribute. See LICENSE.


If Distill saved you tokens, drop a ⭐

Built with frustration after one too many Claude usage limit reached messages at 2am.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

distill_llm-0.1.0.tar.gz (39.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

distill_llm-0.1.0-py3-none-any.whl (38.9 kB view details)

Uploaded Python 3

File details

Details for the file distill_llm-0.1.0.tar.gz.

File metadata

  • Download URL: distill_llm-0.1.0.tar.gz
  • Upload date:
  • Size: 39.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for distill_llm-0.1.0.tar.gz
Algorithm Hash digest
SHA256 c47af779b86c6b3819bce9e6ae1af8985e38edd2d839e9d8492938cb3552fec3
MD5 7ed064af9e4905b264ec5b2e197a88e6
BLAKE2b-256 9b35ec3448ab2e4c5798e8879e63b93693275b029178b07403e41de07169d004

See more details on using hashes here.

File details

Details for the file distill_llm-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: distill_llm-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 38.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for distill_llm-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 aedd1b58321af7a8905f3b3d7797b1aa8f2c4f157952b7ac51cd0f05707b925f
MD5 1b02725bfd367ae6b799f36c3305b18c
BLAKE2b-256 8abc1f3850df9115f9b89e982f7039be0099e6cc1500f0e8840b684327649060

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page