ContextBuddy
From raw PDFs to compressed prompts in 3 lines. Cut your LLM bill by 60%.
______ __ __ ____ __ __
/ ____/___ ____ / /____ _ __/ /_/ __ )__ ______/ /___/ /_ __
/ / / __ \/ __ \/ __/ _ \| |/_/ __/ __ / / / / __ / __ / / / /
/ /___/ /_/ / / / / /_/ __/> </ /_/ /_/ / /_/ / /_/ / /_/ / /_/ /
\____/\____/_/ /_/\__/\___/_/|_|\__/_____/\__,_/\__,_/\__,_/\__, /
/____/
The missing compression layer for every LLM stack.
One line. 60% cheaper. Zero core dependencies. ContextBuddy sits between your raw data and your LLM call, strips the noise, keeps every entity, and shows you the savings -- in tokens and dollars -- on every single request.
Install
PyPI
pip install contextbuddy
TestPyPI (for pre-release testing)
pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple contextbuddy==0.4.0
Optional extras
# MCP server tools
pip install "contextbuddy[mcp]==0.4.0"
# Python codegraph (tree-sitter call edges)
pip install "contextbuddy[codegraph]==0.4.0"
┌──────────────────────────────────────────────────────┐
│ ContextBuddy │
├──────────────────────────────────────────────────────┤
│ Tokens before 15000 after 3000 │
│ Saved -80.0% Est. $0.0600 │
│ [████████████████████████░░░░░░] 12000 tokens freed │
├──────────────────────────────────────────────────────┤
│ Chunks total 12 kept 4 pruned 8 │
│ Entities INV-92831, 2026-04-01, acct_12345 │
└──────────────────────────────────────────────────────┘
What is ContextBuddy?
ContextBuddy is a lightweight, open-source Python library that acts as a context middleware between your raw data (PDFs, web pages, documents, databases) and your LLM call. Its entire job is to take a massive, messy prompt -- like 20 pages of scraped text -- compress it, filter out the noise, preserve critical entities, and pass a clean, token-efficient prompt to any LLM.
Think of it as the missing layer in every AI stack: the part that makes sure you're not paying for 15,000 tokens when only 3,000 actually matter.
The Problem
Every developer building with LLMs hits the same wall:
- You're overpaying. You send 15,000 tokens of scraped text to GPT-4 when only 3,000 tokens actually matter. That's 5x the cost for worse results.
- Context is noisy. Raw PDFs, web scrapes, and database dumps are full of irrelevant paragraphs, boilerplate, and filler. Your LLM wastes attention on noise.
- Critical details get lost. When you manually truncate context to save tokens, you accidentally drop the one invoice ID or date the user asked about.
- Existing frameworks are bloated. LangChain has 100+ dependencies. LlamaIndex has 50+. You just want to load a PDF and ask a question.
The Solution
ContextBuddy solves all four problems in a single library:
- Semantic pruning -- scores every paragraph against your question and drops irrelevant content before it hits the expensive model.
- Entity preservation -- automatically extracts IDs, dates, URLs, phone numbers, and other critical data points, ensuring they are never accidentally pruned.
- Token budgeting -- enforces a strict token limit so your context always fits the window you set.
- ROI telemetry -- prints exactly how many tokens (and dollars) you saved on every call. Developers screenshot this and share it.
It works with any LLM -- OpenAI, Anthropic, Google, or local models -- because it only touches the prompt, not the model.
Who is this for?
ContextBuddy works at every scale. The value just shows up differently:
| Scale | How they use it | Why it matters |
|---|---|---|
| Solo dev / hobbyist | Drop-in middleware, skip LangChain entirely | Zero deps, 3 lines, no infrastructure to manage |
| Startup (seed to Series A) | Full pipeline replacing LangChain stack | Cut API bill from $10k to $3k/month, ship in days not weeks |
| Mid-size company | Compression layer inside their existing LangChain/LlamaIndex stack | Bolt on to existing code, save 60% without rewriting anything |
| Enterprise | Cost governance + smart routing across teams | ROI telemetry for budgeting, model routing to manage spend at scale |
The bigger the company, the more they overpay on tokens. A team running 1M LLM calls/day is burning $30k+/month in unnecessary tokens. A compression middleware that saves 60% is worth $18k/month to them -- and it plugs in with 3 lines.
Specifically built for:
- AI engineers building RAG pipelines who want to cut API costs without sacrificing answer quality.
- Startups shipping LLM-powered products who need to keep their OpenAI/Anthropic bill under control.
- Solo developers who want multi-doc RAG without installing LangChain and 100 transitive dependencies.
- Platform teams who need cost visibility and governance over LLM spend across the organization.
- Agent builders who need their tools to pass compressed, high-signal context to function calls.
- LangChain users who want to drop in a compression layer without rewriting a single retriever -- just install the [langchain] extra.
- Anyone already using LangChain/LlamaIndex who wants to cut costs without rewriting -- just drop ContextBuddy into your existing pipeline as a compression step.
Why ContextBuddy over the alternatives?
| Feature | LangChain | LlamaIndex | LightRAG | ContextBuddy |
|---|---|---|---|---|
| Install size | 100+ deps | 50+ deps | 20+ deps | 0 core deps |
| Lines to first RAG | ~30 | ~15 | ~10 | 3 |
| Cost optimization | None | None | None | Built-in |
| ROI telemetry | None | None | None | Every call |
| Vector DB required | Yes | Yes | Yes | No |
| Context compression | None | None | None | Semantic pruning + budgeting |
| PDF/URL/DOCX loading | Separate install | Built-in | Separate | Built-in (optional deps) |
| LangChain compatible | N/A (is LangChain) | Adapter needed | No | Native ([langchain] extra) |
ContextBuddy does 80% of what LangChain does in 10% of the code. Zero dependencies for the core. Optional extras for PDFs, web scraping, accurate tokenizers, and native LangChain integration.
Install
pip install contextbuddy
Optional extras:
pip install "contextbuddy[pdf]" # PDF loading (pymupdf)
pip install "contextbuddy[web]" # URL/web scraping (httpx + bs4)
pip install "contextbuddy[tiktoken]" # Accurate OpenAI token counts
pip install "contextbuddy[openai]" # OpenAI embeddings
pip install "contextbuddy[ollama]" # Free local semantic embeddings (requires Ollama)
pip install "contextbuddy[sbert]" # Free local semantic embeddings (sentence-transformers)
pip install "contextbuddy[loaders]" # All document loaders
pip install "contextbuddy[langchain]" # LangChain integration (langchain-core)
pip install "contextbuddy[mcp]" # MCP server for Cursor / Claude Desktop
pip install "contextbuddy[all]" # Everything (including MCP + LangChain)
MCP Server (Cursor / Claude Desktop)
ContextBuddy ships an MCP server that gives AI assistants direct access to codebase search and context compression — no manual copy-paste needed.
pip install "contextbuddy[mcp]"
Setup in Cursor
1. Copy .cursor/mcp.json.example to .cursor/mcp.json
2. Replace /path/to/your/repo with the absolute path to your project
3. Restart Cursor — the server starts automatically
{
  "mcpServers": {
    "contextbuddy": {
      "command": "python",
      "args": ["-m", "contextbuddy.mcp.server"],
      "env": {
        "CONTEXTBUDDY_ALLOWED_ROOTS": "/absolute/path/to/your/repo"
      }
    }
  }
}
Note: .cursor/mcp.json is gitignored (it contains your local path). Commit .cursor/mcp.json.example instead.
Slash commands (in Cursor chat)
| Command | What it does |
|---|---|
| /cb <question> | Quick codebase search + compression |
| /cb_deep <question> | Semantic + graph search (best quality, requires indexes) |
| /cb_index | Build vector + graph indexes for the repo |
The server exposes 12 tools. The LLM picks them automatically based on your question — no explicit invocation needed.
LangChain Integration
ContextBuddy plugs directly into LangChain as a native compression layer. No glue code, no adapters -- just install the extra and use the two provided classes.
pip install "contextbuddy[langchain]"
Requires langchain-core>=0.1.0. If it is missing, importing ContextBuddyCompressor or ContextBuddyRetriever will raise a helpful ImportError telling you exactly what to install.
ContextBuddyCompressor
A drop-in base_compressor for LangChain's ContextualCompressionRetriever. It scores retrieved documents against the query, prunes irrelevant ones, preserves entities, and enforces a token budget -- all before the LLM sees a single token.
ContextBuddyRetriever
Wraps any MemoryStore (or any object with a .search(query, top_k) method). Runs semantic search, compresses the results, and returns standard LangChain Document objects. Use it anywhere LangChain expects a BaseRetriever.
Example: both classes in action
from langchain.retrievers import ContextualCompressionRetriever
from langchain_openai import ChatOpenAI
from contextbuddy import (
ContextBuddyCompressor,
ContextBuddyRetriever,
MemoryStore,
load,
)
# --- Option A: Compress results from an existing LangChain retriever ---
compressor = ContextBuddyCompressor(max_context_tokens=3000, min_relevance=0.15)
compression_retriever = ContextualCompressionRetriever(
base_compressor=compressor,
base_retriever=your_existing_retriever, # any LangChain BaseRetriever
)
docs = compression_retriever.invoke("What are the payment terms?")
# `docs` contains only the chunks that survived pruning + budgeting
# --- Option B: Full retrieval + compression from a ContextBuddy store ---
store = MemoryStore()
store.add(load("./contracts/"))
retriever = ContextBuddyRetriever(store=store, max_context_tokens=2000, top_k=20)
docs = retriever.invoke("What is the late penalty clause?")
# Returns Document objects -- plug straight into any LangChain chain
# Use with any LangChain chain
llm = ChatOpenAI(model="gpt-4o-mini")
for doc in docs:
    print(doc.page_content[:120], "...")
| Class | Purpose | Key params |
|---|---|---|
| ContextBuddyCompressor | Prune docs from any retriever | max_context_tokens, min_relevance, conservative_mode |
| ContextBuddyRetriever | Search a MemoryStore + compress | store, max_context_tokens, min_relevance, top_k |
Both classes are exported from the top-level package: from contextbuddy import ContextBuddyCompressor, ContextBuddyRetriever.
Embedding Levels (what to use)
ContextBuddy is compression-first. Embeddings are optional -- you only upgrade when you need more semantic accuracy.
| Level | What you get | Cost | Dependencies | When to use |
|---|---|---|---|---|
| Level 0 (default) | Hash/BM25-style relevance (fast, decent) | $0 | None (core) | Most business/technical docs with shared vocabulary |
| Level 1 (free semantic, local) | True semantic similarity (offline) | $0 | Optional | Synonyms/paraphrases matter; you want better recall without paying APIs |
| Level 2 (paid semantic) | Best-in-class embeddings | $$ | Optional | Multilingual / high-stakes accuracy / heavy paraphrasing |
Level 0 (default): zero-dependency
- Works out of the box, no setup.
- Best when the question and answer share some vocabulary.
from contextbuddy import ContextEngine, ContextEngineConfig
engine = ContextEngine(ContextEngineConfig(max_context_tokens=4000))
Level 1 (free semantic): local embeddings (recommended upgrade)
Pick one:
- Ollama (best DX, keeps your Python deps light):
pip install "contextbuddy[ollama]"
Requires Ollama installed and a local embedding model pulled.
- Sentence Transformers (in-process, heavier install):
pip install "contextbuddy[sbert]"
from contextbuddy import ContextEngine, ContextEngineConfig, OllamaEmbedder
engine = ContextEngine(
ContextEngineConfig(max_context_tokens=4000),
embedder=OllamaEmbedder(model="nomic-embed-text"), # local + free
)
from contextbuddy import ContextEngine, ContextEngineConfig, SentenceTransformersEmbedder
engine = ContextEngine(
ContextEngineConfig(max_context_tokens=4000),
embedder=SentenceTransformersEmbedder(model="sentence-transformers/all-MiniLM-L6-v2"),
)
Level 2 (paid semantic): OpenAI / Gemini
- Use when you want the highest semantic accuracy and you're okay with external API calls.
- Install: pip install "contextbuddy[openai]" (and similarly for Gemini when you enable it).
from contextbuddy import ContextEngine, ContextEngineConfig, OpenAIEmbedder
engine = ContextEngine(
ContextEngineConfig(max_context_tokens=4000),
embedder=OpenAIEmbedder(model="text-embedding-3-small"),
)
from contextbuddy import ContextEngine, ContextEngineConfig, GeminiEmbedder
engine = ContextEngine(
ContextEngineConfig(max_context_tokens=4000),
embedder=GeminiEmbedder(model="text-embedding-004"),
)
90-second quickstart (the only path you need)
Compress a huge, noisy context into a budgeted prompt before the LLM call -- in three lines.
from contextbuddy import ContextEngine, ContextEngineConfig
engine = ContextEngine(ContextEngineConfig(dev_mode=True, max_context_tokens=4000))
huge_context = """
Invoice INV-92831 issued 2026-04-01 for account_id=acct_12345.
Amount: $4,500.00 USD. Payment due within 30 days.
... 20 pages of unrelated notes, meeting transcripts, old emails ...
Ticket ACME-2041: chargebacks for user_id=usr_9z8y7x6w.
"""
final_prompt, report = engine.build_prompt(
user_prompt="Summarize the invoice and ticket. Include all IDs and dates.",
context=huge_context,
)
print(report.reduction_pct, "% smaller, $", report.estimated_savings, "saved per call")
# Pass `final_prompt` to any LLM (OpenAI, Anthropic, Gemini, local -- ContextBuddy doesn't care).
When you're ready to call an LLM, use engine.run(...) (sync) or engine.arun(...) (async) and pass any llm_call callable. See 5 Ways to Use It for loaders, full RAG, LangChain, and pipeline patterns.
Benchmarks (quality gate)
ContextBuddy includes a small benchmark harness so "more compression" doesn't silently break correctness.
python -m pip install -e .
python -m contextbuddy bench --gate --json bench-report.json
See docs/benchmarks/benchmarks.md and benchmarks/datasets/v0.sample.json.
Docs
Start at docs/index.md.
What ContextBuddy guarantees
- Entity survival. Any regex-matched entity (IDs, emails, URLs, dates, money, tickets, phones, UUIDs, version strings) always survives compression.
- Never larger. Output is always shorter than input -- or unchanged if input already fits the budget.
- Never empty. If input has content, output is non-empty. Empty output is treated as a bug, not a valid result.
- Deterministic core. Same input + same config = same output. No randomness in the core pipeline.
- Zero core dependencies. Works on a fresh Python 3.9+ install. pip install contextbuddy -> done.
- Budget respected. Final prompt always fits max_context_tokens. No mid-sentence cuts.
What ContextBuddy does not do
- Not an agent framework. It compresses context; it doesn't orchestrate tools, memory, or loops. Pair with LangGraph/CrewAI if you need that.
- Not a vector database. The in-memory store is great up to ~100k chunks. Above that, use Pinecone/Weaviate and plug ContextBuddy in as the compression layer.
- Doesn't call LLMs itself. You always pass llm_call=.... Works with OpenAI, Anthropic, Gemini, Ollama, anything.
- Doesn't learn. Scoring is algorithmic (BM25 + stemmer + synonyms + n-grams). No training, no drift.
- Doesn't ship a UI. It's a library, not a product.
5 Ways to Use It (pick your level)
Path 1: Compress raw text (3 lines)
from contextbuddy import ContextEngine, ContextEngineConfig
engine = ContextEngine(ContextEngineConfig(dev_mode=True, max_context_tokens=4000))
result = engine.run(
user_prompt="Summarize the key points.",
context=huge_raw_text,
llm_call=lambda p: client.responses.create(model="gpt-4o-mini", input=p),
)
Path 2: Load files + compress (3 lines)
from contextbuddy import ContextEngine, load
engine = ContextEngine(dev_mode=True, max_context_tokens=4000)
result = engine.run(
user_prompt="What are the payment terms?",
context=load("contract.pdf"),
llm_call=lambda p: client.responses.create(model="gpt-4o-mini", input=p),
)
Path 3: Multi-document RAG (3 lines)
from contextbuddy import Retriever, MemoryStore, load
store = MemoryStore().add(load("./docs/"))
result = Retriever(store, dev_mode=True).query(
"What are the payment terms?",
llm_call=lambda p: client.responses.create(model="gpt-4o-mini", input=p),
)
Path 4: Full pipeline (one-liner setup)
from contextbuddy import Pipeline
pipeline = Pipeline.from_directory("./docs/", dev_mode=True)
result = pipeline.query("Summarize the contract", llm_call=my_llm)
Path 5: LangChain pipeline
Drop ContextBuddy into any existing LangChain retriever as a ContextualCompressionRetriever. Your retriever stays the same -- ContextBuddy just compresses what it returns.
from langchain.retrievers import ContextualCompressionRetriever
from contextbuddy import ContextBuddyCompressor
compressor = ContextBuddyCompressor(max_context_tokens=3000)
retriever = ContextualCompressionRetriever(
base_compressor=compressor,
base_retriever=your_existing_retriever,
)
docs = retriever.invoke("What is the refund policy?")
# Only high-relevance, budget-fitting chunks survive. Entities always kept.
No rewrites required. Install contextbuddy[langchain], add 4 lines, and your pipeline is 60% cheaper.
Architecture
Your Files (PDFs, URLs, DOCX, TXT, CSV, directories)
|
v
+--------------+
| Loaders | load("file.pdf") / load("https://...") / load("./dir/")
+------+-------+
|
v
+--------------+
| Store | In-memory vector index (auto-dedup, metadata, persistence)
+------+-------+
|
v
+--------------+
| Retriever | Semantic search -> top-k chunks
+------+-------+
|
v
+--------------+
| Compressor | Prune -> entity keep-list -> token budget -> compose
+------+-------+
|
v
+--------------+
| Router | Score query complexity -> pick cheap or expensive model
+------+-------+
|
v
+--------------+
| Cache | Embedding cache + response cache (skip redundant work)
+------+-------+
|
v
Your LLM (OpenAI / Anthropic / Google / Local)
Every layer is optional. Use one, use all, or use any combination.
How Compression Actually Works (No ML, No NumPy)
ContextBuddy doesn't use a neural network to compress your context. The entire pipeline is algorithmic, using techniques that predate deep learning by decades -- but combined in a way that delivers results competitive with embedding-based approaches. Here's exactly what happens when you call engine.run():
Step 1: Chunking
Your raw text (PDF/web/code/plain text) is chunked into coherent units using a document-aware chunker:
- Generic text: paragraph/sentence-aware merging (avoids tiny orphan fragments)
- PDF: normalizes line-break artifacts and avoids page-wise chunking
- Contracts: groups clause/section headers with their bodies (keeps related content together)
- Python code: keeps imports + functions/classes intact (no mid-function splits)
The goal is not "more chunks" -- it's better chunk boundaries, so relevance scoring and budgeting keep the right information with fewer tokens.
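As a concrete illustration, the generic-text case (blank-line splits plus orphan merging) can be sketched in a few lines of pure Python. The function name `chunk_paragraphs` and the 80-character threshold are illustrative assumptions, not the library's actual implementation:

```python
def chunk_paragraphs(text: str, min_chars: int = 80) -> list[str]:
    """Split on blank lines, then merge fragments shorter than
    min_chars into the previous chunk so no tiny orphans survive."""
    raw = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    for para in raw:
        if chunks and len(para) < min_chars:
            chunks[-1] = chunks[-1] + "\n" + para  # merge orphan fragment
        else:
            chunks.append(para)
    return chunks
```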
Step 2: Relevance Scoring (HybridScorer -- the secret sauce)
This is where ContextBuddy is different from every other compression library. Instead of relying on a single signal, the default HybridScorer combines four independent scoring signals into one relevance score:
Signal 1: BM25 (70% weight) -- The same algorithm that powers Elasticsearch and Lucene. It handles term-frequency saturation (saying "payment" 10 times isn't 10x more relevant than once), document-length normalization (longer paragraphs don't cheat the ranking), and inverse-document-frequency weighting (rare words matter more than common ones). This alone is a massive upgrade over naive keyword matching.
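For intuition, here is a minimal Okapi BM25 sketch in pure Python with the common k1/b defaults -- an illustration of the algorithm, not ContextBuddy's internal code:

```python
import math
from collections import Counter

def bm25_scores(query: str, chunks: list[str],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Okapi BM25: IDF weighting, term-frequency saturation (k1),
    and document-length normalization (b)."""
    docs = [c.lower().split() for c in chunks]
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    scores = []
    for doc in docs:
        tf = Counter(doc)
        dl = len(doc)
        s = 0.0
        for term in query.lower().split():
            df = sum(1 for d in docs if term in d)
            if df == 0:
                continue  # term appears nowhere; contributes nothing
            idf = math.log(1 + (n - df + 0.5) / (df + 0.5))  # rare terms weigh more
            f = tf[term]
            s += idf * (f * (k1 + 1)) / (f + k1 * (1 - b + b * dl / avgdl))
        scores.append(s)
    return scores
```

Note how repeating a term inflates the score sublinearly (the k1 saturation) and how longer documents are penalized via the dl/avgdl ratio.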
Signal 2: Stemming (built into BM25) -- A lightweight suffix-stripping stemmer normalizes word forms before scoring. "payments" matches "payment". "running" matches "run". "organized" matches "organizing". No NLTK, no spaCy -- just 120 lines of pure Python implementing the most impactful Porter stemmer rules.
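A toy version of the idea -- a few suffix rules plus trailing-consonant undoubling -- looks like this (the real stemmer implements many more Porter rules than this sketch):

```python
def light_stem(word: str) -> str:
    """Toy suffix stripper in the Porter spirit: strip one common
    suffix, then undouble a trailing consonant ("runn" -> "run")."""
    w = word.lower()
    for suffix in ("ingly", "edly", "ing", "ed", "es", "s"):
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            w = w[: -len(suffix)]
            break
    if len(w) >= 4 and w[-1] == w[-2] and w[-1] not in "aeiou":
        w = w[:-1]  # undouble trailing consonant
    return w
```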
Signal 3: Synonym Expansion (15% weight) -- A built-in thesaurus of ~200 word groups covering business, legal, tech, medical, and general vocabulary. When you ask about "car insurance," the scorer automatically expands "car" to also check for "automobile," "vehicle," and "auto" in every paragraph. "Buy" matches "purchase." "Salary" matches "compensation." "Error" matches "bug." All offline, zero API calls.
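A minimal sketch of the mechanism, with a hypothetical four-group thesaurus standing in for the real ~200-group table:

```python
# Hypothetical mini-thesaurus; the real table covers ~200 word groups.
SYNONYM_GROUPS = [
    {"car", "automobile", "vehicle", "auto"},
    {"buy", "purchase"},
    {"salary", "compensation"},
    {"error", "bug", "defect"},
]

def expand_terms(query_terms: set[str]) -> set[str]:
    """Any query term that belongs to a group pulls in the whole group."""
    expanded = set(query_terms)
    for group in SYNONYM_GROUPS:
        if group & query_terms:
            expanded |= group
    return expanded
```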
Signal 4: Character N-gram Fuzzy Matching (15% weight) -- Catches morphological variants and typos that stemming misses. "optimise" matches "optimize." "colour" matches "color." Works by computing Jaccard similarity over character trigrams -- if two words share enough 3-character substrings, they're treated as partial matches.
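The trigram Jaccard computation itself fits in a few lines (the matching threshold ContextBuddy applies is not documented here, so none is hardcoded below):

```python
def trigram_jaccard(a: str, b: str) -> float:
    """Jaccard similarity over character trigrams: shared 3-char
    substrings divided by total distinct 3-char substrings."""
    grams = lambda w: {w[i:i + 3] for i in range(len(w) - 2)}
    ta, tb = grams(a.lower()), grams(b.lower())
    if not ta or not tb:  # words shorter than 3 characters
        return 1.0 if a.lower() == b.lower() else 0.0
    return len(ta & tb) / len(ta | tb)
```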
The four signals are normalized to [0, 1] and combined with configurable weights. The result: paragraphs that are genuinely relevant to your question score high, even when they use completely different words.
from contextbuddy import HybridScorer
scorer = HybridScorer()
scores = scorer.score(
query="What is the car insurance policy?",
chunks=[
"The automobile coverage plan includes collision and liability.", # scores HIGH (synonym match)
"Employee cafeteria hours are 12pm to 2pm.", # scores LOW (irrelevant)
],
)
Step 3: Entity Extraction
Regex patterns scan every paragraph for critical data: emails, URLs, dates, dollar amounts, IDs, phone numbers, ticket numbers, etc. Any paragraph containing a detected entity is force-kept regardless of its relevance score, so you never accidentally drop the invoice ID the user asked about.
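Illustratively, with a few simplified patterns (ContextBuddy's actual pattern set is broader and more robust than these stand-ins):

```python
import re

# Simplified stand-ins for a handful of the documented entity types.
ENTITY_PATTERNS = [
    re.compile(r"\b[A-Z]{2,}-\d+\b"),          # tickets/invoices: INV-92831, ACME-2041
    re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),      # ISO dates: 2026-04-01
    re.compile(r"\$\d[\d,]*(?:\.\d{2})?"),     # money: $4,500.00
    re.compile(r"\b\w+_id=\w+\b"),             # ID-like values: account_id=acct_12345
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),    # emails
]

def must_keep(chunk: str) -> bool:
    """A chunk containing any detected entity is force-kept,
    regardless of its relevance score."""
    return any(p.search(chunk) for p in ENTITY_PATTERNS)
```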
Step 4: Budget Enforcement
The surviving paragraphs are sorted by importance (entity-containing chunks first, then by relevance score) and greedily packed into the token budget. If even a single chunk won't fit, it's extractively summarized (leading sentences kept until the limit). The final prompt always fits the budget you set.
Scorer Comparison
| Feature | HybridScorer (default) | SemanticScorer + LocalHashEmbedder | SemanticScorer + OpenAIEmbedder |
|---|---|---|---|
| Understands synonyms | Yes (built-in thesaurus) | No | Yes |
| Handles word forms | Yes (stemming) | No | Yes |
| Fuzzy matching | Yes (n-grams) | No | No |
| IDF weighting | Yes (BM25) | No | Yes |
| Needs API key | No | No | Yes |
| Needs internet | No | No | Yes |
| Dependencies | Zero | Zero | openai package |
| Cost | Free | Free | ~$0.0002/doc |
| Latency | <5ms | <2ms | ~200ms |
The HybridScorer is the default because it gives the best results for zero cost and zero dependencies. For production use cases with highly specialized vocabulary (niche medical terms, non-English content), you can still swap in OpenAIEmbedder for true neural semantic matching:
from contextbuddy import ContextEngine, ContextEngineConfig
from contextbuddy.embedder import OpenAIEmbedder
engine = ContextEngine(
    ContextEngineConfig(max_context_tokens=4000, dev_mode=True),
    embedder=OpenAIEmbedder(),  # neural embeddings for edge cases
)
Or bring your own scorer -- any object with a score(query=..., chunks=...) -> List[float] method works.
Document Loaders
from contextbuddy.loaders import load
load("report.pdf") # PDF (pip install contextbuddy[pdf])
load("https://docs.example.com") # Web page (pip install contextbuddy[web])
load("notes.docx") # Word doc (pip install contextbuddy[docx])
load("data.csv") # CSV (rows as chunks)
load("config.json") # JSON (keys/items as chunks)
load("./documents/") # Entire directory (recursive)
load(["a.pdf", "b.txt", "c.docx"]) # Batch load
Zero-dep formats: .txt, .md, .csv, .json, .log, .xml, .yaml, .html
Vector Store
from contextbuddy import MemoryStore, PersistentStore, load
# In-memory (default)
store = MemoryStore()
store.add(load("report.pdf"), metadata={"source": "report.pdf"})
store.add(load("notes.txt"), metadata={"source": "notes.txt"})
results = store.search("payment terms", top_k=10)
# Persistent (survives restarts)
store = PersistentStore("./my_index.json")
store.add(load("./docs/"))
# Auto-saves to disk. Reloads on next init.
Features: auto-deduplication, metadata tracking, serialization, pure-Python cosine search.
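The core ranking primitive is easy to picture -- pure-Python cosine similarity over (text, vector) pairs. This is a hypothetical sketch of the idea, not the store's actual code:

```python
import math

def cosine(a, b):
    """Pure-Python cosine similarity -- no NumPy required."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def search(query_vec, index, top_k=10):
    """index: list of (chunk_text, vector) pairs; rank by similarity."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:top_k]]
```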
Smart Model Router
Route simple queries to cheap models. Route complex ones to expensive models. All offline.
from contextbuddy import Router, Pipeline
router = Router([
{"max_complexity": 0.3, "model": "gpt-4o-mini"},
{"max_complexity": 1.0, "model": "gpt-4o"},
])
pipeline = Pipeline.from_directory("./docs/", router=router, dev_mode=True)
result = pipeline.query(
"Summarize the contract",
llm_calls={
"gpt-4o-mini": lambda p: cheap_client.responses.create(model="gpt-4o-mini", input=p),
"gpt-4o": lambda p: expensive_client.responses.create(model="gpt-4o", input=p),
},
)
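The docs don't spell out how Router scores complexity, so here is a purely hypothetical heuristic (query length plus analysis keywords) that reproduces the cheap/expensive split in the config above -- the real scoring logic may differ:

```python
def query_complexity(query: str) -> float:
    """Hypothetical heuristic: long queries and analysis/comparison
    keywords push the score toward 1.0."""
    words = query.lower().split()
    score = min(len(words) / 40, 0.5)  # length signal, capped at 0.5
    hard = {"compare", "analyze", "architecture", "tradeoffs", "design", "why"}
    hits = sum(w.strip("?.,!") in hard for w in words)
    score += 0.5 * min(hits, 2) / 2    # keyword signal, capped at 0.5
    return min(score, 1.0)

def route(query: str, tiers):
    """tiers: [{"max_complexity": float, "model": str}, ...] ascending."""
    c = query_complexity(query)
    for tier in tiers:
        if c <= tier["max_complexity"]:
            return tier["model"]
    return tiers[-1]["model"]
```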
Caching
from contextbuddy import Pipeline, EmbeddingCache, ResponseCache
pipeline = Pipeline.from_directory(
"./docs/",
embedding_cache=EmbeddingCache(persist_path="./cache/embeddings.json"),
response_cache=ResponseCache(ttl_seconds=3600),
)
# First query embeds + calls LLM. Second identical query: instant.
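The response-cache half can be pictured as a small TTL map keyed by prompt -- a hypothetical sketch, not ResponseCache's actual implementation:

```python
import time

class TTLResponseCache:
    """Identical prompts seen within ttl_seconds return the stored
    response instead of triggering another LLM call."""
    def __init__(self, ttl_seconds: float = 3600):
        self.ttl = ttl_seconds
        self._store = {}  # prompt -> (stored_at, response)

    def get(self, prompt: str):
        hit = self._store.get(prompt)
        if hit is not None and time.monotonic() - hit[0] < self.ttl:
            return hit[1]
        return None  # miss or expired

    def put(self, prompt: str, response: str) -> None:
        self._store[prompt] = (time.monotonic(), response)
```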
Agent Tools
ContextBuddy generates OpenAI-compatible function/tool schemas for agents:
from contextbuddy.tools import make_search_tool, make_compress_tool, handle_tool_call
tools = [make_search_tool(store), make_compress_tool(engine)]
response = client.chat.completions.create(
model="gpt-4o",
messages=messages,
tools=tools,
)
# Dispatch tool calls
for tc in response.choices[0].message.tool_calls:
    result = handle_tool_call(tc, tools)
Streaming
for chunk in engine.run(
    user_prompt="Summarize",
    context=load("report.pdf"),
    llm_call=lambda p: client.responses.create(model="gpt-4o-mini", input=p, stream=True),
    stream=True,
):
    print(chunk, end="")
OpenAI Drop-in Wrapper
Zero code changes to your existing app:
from contextbuddy import wrap_openai
client = wrap_openai(openai.OpenAI(), max_context_tokens=4000, dev_mode=True)
# Use client.chat.completions.create() exactly as before.
# System messages are automatically compressed.
CLI (no API key needed)
echo "Your huge context..." | python -m contextbuddy compress \
--prompt "What are the key points?" \
--max-tokens 2000 \
--show-prompt
python -m contextbuddy compress \
--file report.txt \
--prompt "Extract action items" \
--model gpt-4o
Entity Types Preserved
| Category | Examples |
|---|---|
| Emails | alice@example.com |
| URLs | https://api.example.com/v2/users |
| Dates | 2026-04-13, 04/13/2026, 2026-04-13T10:30 |
| UUIDs | 550e8400-e29b-41d4-a716-446655440000 |
| Tickets | JIRA-1234, ACME-2041 |
| Phone numbers | +1-555-867-5309 |
| Money | $4,500.00, 1000 USD |
| IPs | 192.168.1.100 |
| ID-like values | account_id=acct_12345 |
| Versions | v2.1.0 |
Pre-built Model Pricing
from contextbuddy.pricing import (
OPENAI_GPT4O, OPENAI_GPT4O_MINI, OPENAI_GPT41, OPENAI_GPT41_MINI,
OPENAI_O3, OPENAI_O4_MINI,
CLAUDE_OPUS_4, CLAUDE_SONNET_4, CLAUDE_HAIKU_35,
GEMINI_25_PRO, GEMINI_25_FLASH,
LOCAL_FREE,
get_pricing, # get_pricing("gpt-4o") -> ModelPricing
)
Programmatic Report
report = engine.last_report
report.original_prompt_tokens # 15000
report.final_prompt_tokens # 3000
report.reduction_pct # 80.0
report.estimated_savings # 0.06
report.kept_chunks # 4
report.total_chunks # 12
report.entities # ["INV-92831", "2026-04-01", ...]
Public API Reference
| Export | Module | Description |
|---|---|---|
| ContextEngine | contextbuddy.engine | Core compression engine |
| ContextEngineConfig | contextbuddy.engine | Configuration dataclass |
| ContextReport | contextbuddy.engine | Compression telemetry / ROI report |
| HybridScorer | contextbuddy.hybrid_scorer | BM25 + stemming + synonyms + n-grams scorer |
| SemanticScorer | contextbuddy.scoring | Embedding-based cosine scorer |
| MemoryStore | contextbuddy.store.memory | In-memory vector store |
| PersistentStore | contextbuddy.store.persistent | Disk-backed vector store |
| Retriever | contextbuddy.retriever | Search + compress pipeline |
| Pipeline | contextbuddy.pipeline | Full end-to-end pipeline |
| Router | contextbuddy.router | Complexity-based model router |
| EmbeddingCache | contextbuddy.cache | Persistent embedding cache |
| ResponseCache | contextbuddy.cache | TTL response cache |
| ContextBuddyCompressor | contextbuddy.langchain | LangChain BaseDocumentCompressor |
| ContextBuddyRetriever | contextbuddy.langchain | LangChain BaseRetriever with compression |
| wrap_openai | contextbuddy.wrappers | OpenAI client drop-in wrapper |
| load | contextbuddy.loaders | Universal file/URL/directory loader |
| get_pricing | contextbuddy.pricing | Model pricing lookup |
| Embedder | contextbuddy.types | Protocol for custom embedders |
| Tokenizer | contextbuddy.types | Protocol for custom tokenizers |
Real-World Use Cases
Customer Support Bot
Your chatbot pulls a customer's full history (invoices, tickets, emails, notes) for every query -- ~15,000 tokens. Most of it is irrelevant.
from contextbuddy import Pipeline
pipeline = Pipeline.from_directory("./customer_data/acct_12345/", dev_mode=True, max_context_tokens=3000)
answer = pipeline.query("What was my last invoice amount?", llm_call=my_llm)
# [ContextBuddy] 15000 -> 2800 tokens (81.3% reduction). Est. savings: $0.0305
# Entity keep-list preserved: INV-92831, $4,500.00, 2026-04-01, acct_12345
At 10,000 queries/day: $11,250/month without ContextBuddy vs $2,250/month with it.
Legal Document Review
A law firm uploads a 50-page contract. Lawyers ask questions about specific clauses.
from contextbuddy import Pipeline
pipeline = Pipeline.from_directory("./contracts/", dev_mode=True, max_context_tokens=4000)
answer = pipeline.query("What are the payment terms and late penalties?", llm_call=my_llm)
ContextBuddy loads the PDF, indexes 200+ paragraphs, retrieves the relevant ones, prunes to the 5 that matter, and preserves all clause numbers, dates, and dollar amounts. Without it, you'd need LangChain + a vector database + 50 lines of glue code.
Internal Knowledge Base
500 internal docs (Confluence exports, PDFs, Markdown). Engineers ask questions via Slack bot.
from contextbuddy import Pipeline, PersistentStore, Router
pipeline = Pipeline(
store=PersistentStore("./index.json"),
router=Router([
{"max_complexity": 0.3, "model": "gpt-4o-mini"},
{"max_complexity": 1.0, "model": "gpt-4o"},
]),
dev_mode=True,
)
pipeline.add("./company_docs/")
answer = pipeline.query(slack_message, llm_calls={"gpt-4o-mini": cheap_fn, "gpt-4o": expensive_fn})
Simple questions ("What's the WiFi password?") route to the cheap model. Complex questions ("Compare our auth architecture options") route to the expensive one. Router alone saves 60-70% on top of compression.
When NOT to Use ContextBuddy
Being honest:
- Full agent orchestration (multi-step reasoning, tool chains, long-term memory) -- use LangGraph or CrewAI instead. ContextBuddy compresses context, it doesn't orchestrate agents.
- Billion-scale vector search -- if you have 100M+ documents and need sub-millisecond search, use Pinecone or Weaviate directly. ContextBuddy's in-memory store is designed for <100k chunks.
- Already deep in LangChain and it's working -- don't rewrite. Instead, add ContextBuddy as a compression layer with zero disruption:
from langchain.retrievers import ContextualCompressionRetriever
from contextbuddy import ContextBuddyCompressor
# 4 lines. Your existing retriever stays untouched.
compressor = ContextBuddyCompressor(max_context_tokens=4000)
compressed_retriever = ContextualCompressionRetriever(
base_compressor=compressor,
base_retriever=your_existing_langchain_retriever,
)
docs = compressed_retriever.invoke(user_question)
# Irrelevant chunks pruned, entities preserved, token budget enforced.
# Pass `docs` to your chain exactly as before -- just cheaper.
Or, if you prefer the lower-level approach:
from contextbuddy import ContextEngine
engine = ContextEngine(max_context_tokens=4000)
# Inside your LangChain chain, after retrieval but before the LLM call:
compressed_prompt, report = engine.build_prompt(
user_prompt=user_question,
context=retrieved_documents, # from your existing LangChain retriever
)
# Pass compressed_prompt to your LLM instead of the raw retrieved docs
FAQ
Will this hurt answer quality?
It can if you prune too aggressively. Start with min_relevance=0.10 and inspect the compressed prompt in dev mode. The entity keep-list ensures critical data points survive.
Does it send my data anywhere?
Not by default. The built-in embedder and vector store run 100% locally with zero dependencies. Only if you explicitly plug in OpenAIEmbedder does it call an external API.
Does it work with async frameworks (FastAPI, etc.)?
engine.arun() is async-compatible -- the LLM call is awaited. Note: the compression step (chunking + scoring) runs synchronously inside the coroutine. For high-concurrency workloads, wrap compression with asyncio.to_thread(engine.build_prompt, ...). True async compression is planned for a future release.
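The to_thread pattern looks like this. `compress_sync` below is a stand-in for the synchronous compression step (in a real app you would pass `engine.build_prompt` instead):

```python
import asyncio

def compress_sync(user_prompt: str, context: list) -> tuple:
    # Stand-in for engine.build_prompt: CPU-bound, synchronous work.
    return f"{user_prompt}\n\n" + "\n".join(context), {"chunks": len(context)}

async def handler(user_prompt: str, context: list):
    # Off-load compression to a worker thread so the event loop stays
    # responsive under high concurrency.
    prompt, report = await asyncio.to_thread(compress_sync, user_prompt, context)
    return prompt, report

prompt, report = asyncio.run(handler("What changed?", ["chunk a", "chunk b"]))
print(report)  # {'chunks': 2}
```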
Does it work with streaming?
Yes. Pass stream=True to engine.run(). ContextBuddy emits the ROI report, then yields LLM chunks.
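The report-first-then-chunks ordering can be illustrated with a generator. This is only the shape of the contract, not the real implementation -- engine.run(stream=True) is the actual entry point:

```python
def stream_events(report: dict, llm_chunks):
    # Emit the ROI report first, then yield LLM chunks as they arrive.
    yield {"type": "report", **report}
    for text in llm_chunks:
        yield {"type": "chunk", "text": text}

events = list(stream_events({"tokens_saved": 12000}, ["Hel", "lo"]))
print(events[0]["type"])                       # report
print("".join(e["text"] for e in events[1:]))  # Hello
```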
How accurate is the token count?
The default HeuristicTokenizer uses a 4-chars-per-token rule. For exact counts: pip install contextbuddy[tiktoken].
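The rule of thumb is trivial to reproduce. Whether the real HeuristicTokenizer rounds up is an assumption here; install the tiktoken extra for exact counts:

```python
import math

def heuristic_token_count(text: str) -> int:
    # 4-chars-per-token estimate; rounding up is this sketch's choice.
    return math.ceil(len(text) / 4)

print(heuristic_token_count("Strip the noise, keep every entity."))  # 35 chars -> 9
```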
Can I use this in production?
Yes. The core pipeline is deterministic, dependency-free, and fast (<10ms for typical payloads). Set dev_mode=False to disable telemetry.
How is this different from LangChain?
ContextBuddy is compression-first. LangChain retrieves context but sends it all to the LLM. ContextBuddy retrieves, compresses, preserves entities, and shows you exactly how much you're saving. Zero core dependencies vs 100+. And with the [langchain] extra, the two work together -- ContextBuddy plugs in as the compression layer LangChain never had.
Does it work with LangChain?
Yes, natively. Install contextbuddy[langchain] and use ContextBuddyCompressor as a drop-in base_compressor for ContextualCompressionRetriever, or use ContextBuddyRetriever to wrap a MemoryStore. See the LangChain Integration section.
How does compression work without an LLM?
It doesn't need one. The pipeline has four stages:
1. Document-aware chunking.
2. Relevance scoring via BM25 + stemming + synonym expansion + character n-gram fuzzy matching.
3. Entity force-keep -- any chunk containing an ID, date, dollar amount, UUID, etc. is kept regardless of score.
4. Greedy budget packing.
No neural network, no API calls, no randomness. Sub-5ms on a typical payload.
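Stages (3) and (4) can be sketched in a few lines. The regex and the 4-chars-per-token costing below are simplified stand-ins for the real patterns, not ContextBuddy's implementation:

```python
import re

# Simplified entity patterns: invoice IDs, ISO dates, dollar amounts.
ENTITY_RE = re.compile(r"INV-\d+|\d{4}-\d{2}-\d{2}|\$\d[\d,]*")

def pack(scored_chunks, budget_tokens, chars_per_token=4):
    # Stage 3: force-keep any chunk containing an entity, regardless of score.
    kept = [c for c, _ in scored_chunks if ENTITY_RE.search(c)]
    used = sum(len(c) // chars_per_token for c in kept)
    # Stage 4: greedily pack the rest, highest score first, until budget is hit.
    for chunk, _ in sorted(scored_chunks, key=lambda p: -p[1]):
        cost = len(chunk) // chars_per_token
        if chunk not in kept and used + cost <= budget_tokens:
            kept.append(chunk)
            used += cost
    return kept

chunks = [("Invoice INV-92831 is due 2026-04-01.", 0.1),  # low score, has entities
          ("Our about page has a company history.", 0.9),
          ("Total owed: $4,200 on acct_12345.", 0.2),     # low score, has entities
          ("Lorem ipsum filler paragraph. " * 4, 0.5)]    # long, no entities
result = pack(chunks, budget_tokens=30)
# Both entity chunks survive despite low scores; the long filler is pruned.
```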
How do you guarantee compression quality without an LLM?
Two ways. First, the entity keep-list is a hard guarantee — regex-matched entities (IDs, dates, money, tickets) always survive, no matter what the scorer says. Second, every release must pass a benchmark gate: 100% entity survival rate and a minimum answer coverage threshold. If a code change breaks either, it doesn't ship. You can run the gate yourself: python -m contextbuddy bench --gate.
Do I need to set up OpenAI/Gemini/Meta embeddings manually?
No. Each provider is a one-line install:
pip install "contextbuddy[openai]" # OpenAI
pip install "contextbuddy[gemini]" # Google Gemini
pip install "contextbuddy[ollama]" # Meta Llama / any local model via Ollama (no API key)
pip install "contextbuddy[sbert]" # sentence-transformers (fully local, no API key)
Then pass the embedder as a single argument to ContextEngine. Your API key goes in the environment (OPENAI_API_KEY, GOOGLE_API_KEY). Nothing else to configure.
What about Meta / Llama embeddings specifically?
Meta doesn't offer a hosted embedding API, so the practical path is Ollama — install Ollama, pull a model (ollama pull nomic-embed-text), and use OllamaEmbedder. Runs fully local, no API key, no data leaving your machine, zero cost.
Why use this over other tools?
Retrieval frameworks solve fetching the right documents. None of them compress what they retrieve: they send all 20 chunks to the LLM regardless of relevance. ContextBuddy cuts that down to the 4 that actually matter, preserves every entity, enforces a token budget, and shows you the dollar savings on every call. It's not a replacement for those frameworks -- it's the compression layer they're all missing. And it plugs into all of them with 3 lines.
Why I Built This
I'm a recent CS grad. I was deep in the rabbit hole of context engineering -- reading papers, watching talks, experimenting with how LLMs actually use the context you feed them. And I kept hitting the same wall.
I had a project that needed RAG. Load some PDFs, ask questions, get answers. Simple, right? So I reached for LangChain. And then I spent two days wrestling with 100+ dependencies, cryptic abstractions, and a codebase that felt like it was designed for a different problem. I just wanted to load a PDF and compress the context before sending it to an LLM. I didn't need an agent framework. I didn't need a plugin ecosystem. I needed maybe 200 lines of focused code.
So I closed my laptop, went for a walk, and thought: what if the entire layer between "raw data" and "LLM call" was just... simple?
That's what ContextBuddy is. It's the library I wished existed when I started.
The core insight was that most LLM applications are sending 5-10x more context than they need to. You scrape a 50-page contract, dump the whole thing into GPT-4, and pay for 15,000 tokens when only 3,000 matter. The LLM doesn't even perform better with the extra noise -- it performs worse. Context engineering isn't about stuffing more tokens in. It's about sending the right tokens.
I built ContextBuddy with a few principles:
- Zero dependencies for the core. If you just want to compress text, you shouldn't need to install anything else. No numpy. No torch. No tiktoken. Just Python.
- Three lines to integrate. If it takes more than that, developers will bounce. I know because I bounced.
- Show the ROI. Every call prints exactly how many tokens and dollars you saved. Not because it's a gimmick -- because developers need to justify tool choices to their managers, and a screenshot of "$0.12 saved per call" does that instantly.
- Grow with you. Start with 3 lines. When you need PDF loading, add it. When you need a vector store, add it. When you need model routing, add it. You should never have to rip out ContextBuddy and replace it with LangChain because you outgrew it.
I'm not claiming this replaces LangChain for every use case. If you need multi-step agent orchestration with tool chains and long-term memory, LangChain/LangGraph is the right call. But for the 80% of LLM applications that just need to load data, compress context, and call a model? ContextBuddy does it in a fraction of the code, with zero bloat, and it shows you exactly how much money you're saving.
This started as a side project born out of frustration. I'm sharing it because I think every developer building with LLMs deserves a simpler option.
If it saves you time or money, star the repo. That's all I ask.
Contributing
git clone https://github.com/mohithgowdak/ContextBuddy.git
cd ContextBuddy
pip install -e ".[dev]"
pytest
License
MIT License. See LICENSE.
File details
Details for the file contextbuddy-0.4.2.tar.gz.
File metadata
- Download URL: contextbuddy-0.4.2.tar.gz
- Size: 131.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 674eb6a991f0c508214407741db487d4e27bf907715fe7705684905c05dfb5b9 |
| MD5 | 5e9ccc20f337594d9589cc601fdca2a9 |
| BLAKE2b-256 | ac25556ff2f764f4947620d9b5bf9c2f0c42e478f7202aee0991229c12648248 |
Provenance
The following attestation bundles were made for contextbuddy-0.4.2.tar.gz:
Publisher: pypi.yml on mohithgowdak/ContextBuddy
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: contextbuddy-0.4.2.tar.gz
- Subject digest: 674eb6a991f0c508214407741db487d4e27bf907715fe7705684905c05dfb5b9
- Sigstore transparency entry: 1420268990
- Permalink: mohithgowdak/ContextBuddy@c7f71919e3ac107c8ae0d3a665f5e412b11c5f92
- Branch / Tag: refs/tags/v0.4.2
- Owner: https://github.com/mohithgowdak
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pypi.yml@c7f71919e3ac107c8ae0d3a665f5e412b11c5f92
- Trigger Event: push
File details
Details for the file contextbuddy-0.4.2-py3-none-any.whl.
File metadata
- Download URL: contextbuddy-0.4.2-py3-none-any.whl
- Size: 99.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 4a573bd04f5622aaccc0f9e4587ce27b5d8c077e8e08dd161c9ad4897157398c |
| MD5 | 6cd87a04440ea744756298920f217735 |
| BLAKE2b-256 | 45b76a5a4ece024dd9eff785dcdfe46071793893a2bcdb02255d3dee35f69e27 |
Provenance
The following attestation bundles were made for contextbuddy-0.4.2-py3-none-any.whl:
Publisher: pypi.yml on mohithgowdak/ContextBuddy
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: contextbuddy-0.4.2-py3-none-any.whl
- Subject digest: 4a573bd04f5622aaccc0f9e4587ce27b5d8c077e8e08dd161c9ad4897157398c
- Sigstore transparency entry: 1420269149
- Permalink: mohithgowdak/ContextBuddy@c7f71919e3ac107c8ae0d3a665f5e412b11c5f92
- Branch / Tag: refs/tags/v0.4.2
- Owner: https://github.com/mohithgowdak
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pypi.yml@c7f71919e3ac107c8ae0d3a665f5e412b11c5f92
- Trigger Event: push