
ContextBuddy

From raw PDFs to compressed prompts in 3 lines. Cut your LLM bill by 60%.


   ______            __            __  ____            __    __
  / ____/___  ____  / /____  _  __/ /_/ __ )__  ______/ /___/ /_  __
 / /   / __ \/ __ \/ __/ _ \| |/_/ __/ __  / / / / __  / __  / / / /
/ /___/ /_/ / / / / /_/  __/>  </ /_/ /_/ / /_/ / /_/ / /_/ / /_/ /
\____/\____/_/ /_/\__/\___/_/|_|\__/_____/\__,_/\__,_/\__,_/\__, /
                                                            /____/
        The missing compression layer for every LLM stack.

One line. 60% cheaper. Zero core dependencies. ContextBuddy sits between your raw data and your LLM call, strips the noise, keeps every entity, and shows you the savings -- in tokens and dollars -- on every single request.

Install

PyPI

pip install contextbuddy

TestPyPI (for pre-release testing)

pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple contextbuddy==0.4.0

Optional extras

# MCP server tools
pip install "contextbuddy[mcp]==0.4.0"

# Python codegraph (tree-sitter call edges)
pip install "contextbuddy[codegraph]==0.4.0"

With dev mode on, every call prints an ROI report like this:

┌──────────────────────────────────────────────────────┐
│                   ContextBuddy                       │
├──────────────────────────────────────────────────────┤
│  Tokens   before    15000   after      3000          │
│  Saved    -80.0%              Est. $0.0600           │
│  [████████████████████████░░░░░░] 12000 tokens freed │
├──────────────────────────────────────────────────────┤
│  Chunks   total 12    kept 4    pruned 8             │
│  Entities INV-92831, 2026-04-01, acct_12345          │
└──────────────────────────────────────────────────────┘

What is ContextBuddy?

ContextBuddy is a lightweight, open-source Python library that acts as a context middleware between your raw data (PDFs, web pages, documents, databases) and your LLM call. Its entire job is to take a massive, messy prompt -- like 20 pages of scraped text -- compress it, filter out the noise, preserve critical entities, and pass a clean, token-efficient prompt to any LLM.

Think of it as the missing layer in every AI stack: the part that makes sure you're not paying for 15,000 tokens when only 3,000 actually matter.


The Problem

Every developer building with LLMs hits the same wall:

  1. You're overpaying. You send 15,000 tokens of scraped text to GPT-4 when only 3,000 tokens actually matter. That's 5x the cost for worse results.
  2. Context is noisy. Raw PDFs, web scrapes, and database dumps are full of irrelevant paragraphs, boilerplate, and filler. Your LLM wastes attention on noise.
  3. Critical details get lost. When you manually truncate context to save tokens, you accidentally drop the one invoice ID or date the user asked about.
  4. Existing frameworks are bloated. LangChain has 100+ dependencies. LlamaIndex has 50+. You just want to load a PDF and ask a question.

The Solution

ContextBuddy solves all four problems in a single library:

  • Semantic pruning -- scores every paragraph against your question and drops irrelevant content before it hits the expensive model.
  • Entity preservation -- automatically extracts IDs, dates, URLs, phone numbers, and other critical data points, ensuring they are never accidentally pruned.
  • Token budgeting -- enforces a strict token limit so your context always fits the window you set.
  • ROI telemetry -- prints exactly how many tokens (and dollars) you saved on every call. Developers screenshot this and share it.

It works with any LLM -- OpenAI, Anthropic, Google, or local models -- because it only touches the prompt, not the model.


Who is this for?

ContextBuddy works at every scale. The value just shows up differently:

| Scale | How they use it | Why it matters |
| --- | --- | --- |
| Solo dev / hobbyist | Drop-in middleware, skip LangChain entirely | Zero deps, 3 lines, no infrastructure to manage |
| Startup (seed to Series A) | Full pipeline replacing LangChain stack | Cut API bill from $10k to $3k/month, ship in days not weeks |
| Mid-size company | Compression layer inside their existing LangChain/LlamaIndex stack | Bolt on to existing code, save 60% without rewriting anything |
| Enterprise | Cost governance + smart routing across teams | ROI telemetry for budgeting, model routing to manage spend at scale |

The bigger the company, the more they overpay on tokens. A team running 1M LLM calls/day is burning $30k+/month in unnecessary tokens. A compression middleware that saves 60% is worth $18k/month to them -- and it plugs in with 3 lines.
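
To make that concrete, a back-of-the-envelope sketch (every input below is an assumption for illustration, not a measurement):

calls_per_day = 1_000_000
avg_input_tokens_per_call = 10_000       # assumed average context size per call
price_per_million_input_tokens = 0.10    # assumed blended input price in USD

monthly_input_spend = calls_per_day * 30 * avg_input_tokens_per_call / 1e6 * price_per_million_input_tokens
# -> $30,000/month spent on input tokens

savings_at_60_pct = monthly_input_spend * 0.60
# -> $18,000/month back, from a 3-line integration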

Specifically built for:

  • AI engineers building RAG pipelines who want to cut API costs without sacrificing answer quality.
  • Startups shipping LLM-powered products who need to keep their OpenAI/Anthropic bill under control.
  • Solo developers who want multi-doc RAG without installing LangChain and 100 transitive dependencies.
  • Platform teams who need cost visibility and governance over LLM spend across the organization.
  • Agent builders who need their tools to pass compressed, high-signal context to function calls.
  • LangChain/LlamaIndex users who want to cut costs without rewriting a single retriever -- just install the [langchain] extra and drop ContextBuddy into the existing pipeline as a compression step.

Why ContextBuddy over the alternatives?

| Feature | LangChain | LlamaIndex | LightRAG | ContextBuddy |
| --- | --- | --- | --- | --- |
| Install size | 100+ deps | 50+ deps | 20+ deps | 0 core deps |
| Lines to first RAG | ~30 | ~15 | ~10 | 3 |
| Cost optimization | None | None | None | Built-in |
| ROI telemetry | None | None | None | Every call |
| Vector DB required | Yes | Yes | Yes | No |
| Context compression | None | None | None | Semantic pruning + budgeting |
| PDF/URL/DOCX loading | Separate install | Built-in | Separate | Built-in (optional deps) |
| LangChain compatible | N/A (is LangChain) | Adapter needed | No | Native ([langchain] extra) |

ContextBuddy does 80% of what LangChain does in 10% of the code. Zero dependencies for the core. Optional extras for PDFs, web scraping, accurate tokenizers, and native LangChain integration.


Install

pip install contextbuddy

Optional extras:

pip install "contextbuddy[pdf]"         # PDF loading (pymupdf)
pip install "contextbuddy[web]"         # URL/web scraping (httpx + bs4)
pip install "contextbuddy[tiktoken]"    # Accurate OpenAI token counts
pip install "contextbuddy[openai]"      # OpenAI embeddings
pip install "contextbuddy[ollama]"      # Free local semantic embeddings (requires Ollama)
pip install "contextbuddy[sbert]"       # Free local semantic embeddings (sentence-transformers)
pip install "contextbuddy[loaders]"     # All document loaders
pip install "contextbuddy[langchain]"   # LangChain integration (langchain-core)
pip install "contextbuddy[mcp]"         # MCP server for Cursor / Claude Desktop
pip install "contextbuddy[all]"         # Everything (including MCP + LangChain)

MCP Server (Cursor / Claude Desktop)

ContextBuddy ships an MCP server that gives AI assistants direct access to codebase search and context compression — no manual copy-paste needed.

pip install "contextbuddy[mcp]"

Setup in Cursor

  1. Copy .cursor/mcp.json.example to .cursor/mcp.json
  2. Replace /path/to/your/repo with the absolute path to your project
  3. Restart Cursor — the server starts automatically
{
  "mcpServers": {
    "contextbuddy": {
      "command": "python",
      "args": ["-m", "contextbuddy.mcp.server"],
      "env": {
        "CONTEXTBUDDY_ALLOWED_ROOTS": "/absolute/path/to/your/repo"
      }
    }
  }
}

Note: .cursor/mcp.json is gitignored (it contains your local path). Commit .cursor/mcp.json.example instead.

Slash commands (in Cursor chat)

| Command | What it does |
| --- | --- |
| /cb <question> | Quick codebase search + compression |
| /cb_deep <question> | Semantic + graph search (best quality, requires indexes) |
| /cb_index | Build vector + graph indexes for the repo |

The server exposes 12 tools. The LLM picks them automatically based on your question — no explicit invocation needed.


LangChain Integration

ContextBuddy plugs directly into LangChain as a native compression layer. No glue code, no adapters -- just install the extra and use the two provided classes.

pip install "contextbuddy[langchain]"

Requires langchain-core>=0.1.0. If it is missing, importing ContextBuddyCompressor or ContextBuddyRetriever will raise a helpful ImportError telling you exactly what to install.

ContextBuddyCompressor

A drop-in base_compressor for LangChain's ContextualCompressionRetriever. It scores retrieved documents against the query, prunes irrelevant ones, preserves entities, and enforces a token budget -- all before the LLM sees a single token.

ContextBuddyRetriever

Wraps any MemoryStore (or any object with a .search(query, top_k) method). Runs semantic search, compresses the results, and returns standard LangChain Document objects. Use it anywhere LangChain expects a BaseRetriever.

Example: both classes in action

from langchain.retrievers import ContextualCompressionRetriever
from langchain_openai import ChatOpenAI
from contextbuddy import (
    ContextBuddyCompressor,
    ContextBuddyRetriever,
    MemoryStore,
    load,
)

# --- Option A: Compress results from an existing LangChain retriever ---
compressor = ContextBuddyCompressor(max_context_tokens=3000, min_relevance=0.15)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=your_existing_retriever,  # any LangChain BaseRetriever
)
docs = compression_retriever.invoke("What are the payment terms?")
# `docs` contains only the chunks that survived pruning + budgeting

# --- Option B: Full retrieval + compression from a ContextBuddy store ---
store = MemoryStore()
store.add(load("./contracts/"))

retriever = ContextBuddyRetriever(store=store, max_context_tokens=2000, top_k=20)
docs = retriever.invoke("What is the late penalty clause?")
# Returns Document objects -- plug straight into any LangChain chain

# Feed the compressed docs into any LangChain chain or model call
llm = ChatOpenAI(model="gpt-4o-mini")
context = "\n\n".join(doc.page_content for doc in docs)
answer = llm.invoke(f"Answer from this context only:\n\n{context}\n\nWhat is the late penalty clause?")
print(answer.content)

| Class | Purpose | Key params |
| --- | --- | --- |
| ContextBuddyCompressor | Prune docs from any retriever | max_context_tokens, min_relevance, conservative_mode |
| ContextBuddyRetriever | Search a MemoryStore + compress | store, max_context_tokens, min_relevance, top_k |

Both classes are exported from the top-level package: from contextbuddy import ContextBuddyCompressor, ContextBuddyRetriever.


Embedding Levels (what to use)

ContextBuddy is compression-first. Embeddings are optional -- you only upgrade when you need more semantic accuracy.

| Level | What you get | Cost | Dependencies | When to use |
| --- | --- | --- | --- | --- |
| Level 0 (default) | Hash/BM25-style relevance (fast, decent) | $0 | None (core) | Most business/technical docs with shared vocabulary |
| Level 1 (free semantic, local) | True semantic similarity (offline) | $0 | Optional | Synonyms/paraphrases matter; you want better recall without paying for APIs |
| Level 2 (paid semantic) | Best-in-class embeddings | $$ | Optional | Multilingual / high-stakes accuracy / heavy paraphrasing |

Level 0 (default): zero-dependency

  • Works out of the box, no setup.
  • Best when the question and answer share some vocabulary.
from contextbuddy import ContextEngine, ContextEngineConfig

engine = ContextEngine(ContextEngineConfig(max_context_tokens=4000))

Level 1 (free semantic): local embeddings (recommended upgrade)

Pick one:

  • Ollama (best DX, keeps your Python deps light): pip install "contextbuddy[ollama]"
    Requires Ollama installed and a local embedding model pulled.
  • Sentence Transformers (in-process, heavier install): pip install "contextbuddy[sbert]"

With Ollama:

from contextbuddy import ContextEngine, ContextEngineConfig, OllamaEmbedder

engine = ContextEngine(
    ContextEngineConfig(max_context_tokens=4000),
    embedder=OllamaEmbedder(model="nomic-embed-text"),  # local + free
)

Or with Sentence Transformers:

from contextbuddy import ContextEngine, ContextEngineConfig, SentenceTransformersEmbedder

engine = ContextEngine(
    ContextEngineConfig(max_context_tokens=4000),
    embedder=SentenceTransformersEmbedder(model="sentence-transformers/all-MiniLM-L6-v2"),
)

Level 2 (paid semantic): OpenAI / Gemini

  • Use when you want the highest semantic accuracy and you're okay with external API calls.
  • Install: pip install "contextbuddy[openai]" for OpenAI, or pip install "contextbuddy[gemini]" for Gemini.

With OpenAI:

from contextbuddy import ContextEngine, ContextEngineConfig, OpenAIEmbedder

engine = ContextEngine(
    ContextEngineConfig(max_context_tokens=4000),
    embedder=OpenAIEmbedder(model="text-embedding-3-small"),
)

Or with Gemini:

from contextbuddy import ContextEngine, ContextEngineConfig, GeminiEmbedder

engine = ContextEngine(
    ContextEngineConfig(max_context_tokens=4000),
    embedder=GeminiEmbedder(model="text-embedding-004"),
)

90-second quickstart (the only path you need)

Compress a huge, noisy context into a budgeted prompt before the LLM call -- in three lines.

from contextbuddy import ContextEngine, ContextEngineConfig

engine = ContextEngine(ContextEngineConfig(dev_mode=True, max_context_tokens=4000))

huge_context = """
Invoice INV-92831 issued 2026-04-01 for account_id=acct_12345.
Amount: $4,500.00 USD. Payment due within 30 days.

... 20 pages of unrelated notes, meeting transcripts, old emails ...

Ticket ACME-2041: chargebacks for user_id=usr_9z8y7x6w.
"""

final_prompt, report = engine.build_prompt(
    user_prompt="Summarize the invoice and ticket. Include all IDs and dates.",
    context=huge_context,
)

print(f"{report.reduction_pct}% smaller, ${report.estimated_savings} saved per call")
# Pass `final_prompt` to any LLM (OpenAI, Anthropic, Gemini, local -- ContextBuddy doesn't care).

When you're ready to call an LLM, use engine.run(...) (sync) or engine.arun(...) (async) and pass any llm_call callable. See 5 Ways to Use It for loaders, full RAG, LangChain, and pipeline patterns.
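
For example, a minimal async sketch reusing the engine and huge_context from the quickstart above (assumptions: arun takes the same arguments as run and awaits an async llm_call; async_client stands in for something like openai.AsyncOpenAI()):

import asyncio

async def main():
    result = await engine.arun(
        user_prompt="Summarize the invoice and ticket. Include all IDs and dates.",
        context=huge_context,
        llm_call=lambda p: async_client.responses.create(model="gpt-4o-mini", input=p),
    )
    print(result)

asyncio.run(main())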


Benchmarks (quality gate)

ContextBuddy includes a small benchmark harness so "more compression" doesn't silently break correctness.

python -m pip install -e .
python -m contextbuddy bench --gate --json bench-report.json

See docs/benchmarks/benchmarks.md and benchmarks/datasets/v0.sample.json.


Docs

Start at docs/index.md.


What ContextBuddy guarantees

  • Entity survival. Any regex-matched entity (IDs, emails, URLs, dates, money, tickets, phones, UUIDs, version strings) always survives compression.
  • Never larger. Output is always shorter than input -- or unchanged if input already fits the budget.
  • Never empty. If input has content, output is non-empty. Empty output is treated as a bug, not a valid result.
  • Deterministic core. Same input + same config = same output. No randomness in the core pipeline.
  • Zero core dependencies. Works on a fresh Python 3.9+ install. pip install contextbuddy -> done.
  • Budget respected. Final prompt always fits max_context_tokens. No mid-sentence cuts.
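
A quick way to sanity-check the entity-survival, never-larger, and budget guarantees with the quickstart API (a sketch, not the library's test suite; noisy_text is your own large context containing e.g. INV-92831):

from contextbuddy import ContextEngine, ContextEngineConfig

engine = ContextEngine(ContextEngineConfig(max_context_tokens=2000))
final_prompt, report = engine.build_prompt(
    user_prompt="Summarize the invoice.",
    context=noisy_text,
)

assert "INV-92831" in final_prompt                                    # entity survival
assert report.final_prompt_tokens <= report.original_prompt_tokens   # never larger
assert report.final_prompt_tokens <= 2000                             # budget respected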

What ContextBuddy does not do

  • Not an agent framework. It compresses context; it doesn't orchestrate tools, memory, or loops. Pair with LangGraph/CrewAI if you need that.
  • Not a vector database. The in-memory store is great up to ~100k chunks. Above that, use Pinecone/Weaviate and plug ContextBuddy in as the compression layer.
  • Doesn't call LLMs itself. You always pass llm_call=.... Works with OpenAI, Anthropic, Gemini, Ollama, anything.
  • Doesn't learn. Scoring is algorithmic (BM25 + stemmer + synonyms + n-grams). No training, no drift.
  • Doesn't ship a UI. It's a library, not a product.

5 Ways to Use It (pick your level)

Path 1: Compress raw text (3 lines)

from contextbuddy import ContextEngine, ContextEngineConfig

engine = ContextEngine(ContextEngineConfig(dev_mode=True, max_context_tokens=4000))
result = engine.run(
    user_prompt="Summarize the key points.",
    context=huge_raw_text,
    llm_call=lambda p: client.responses.create(model="gpt-4o-mini", input=p),
)

Path 2: Load files + compress (3 lines)

from contextbuddy import ContextEngine, load

engine = ContextEngine(dev_mode=True, max_context_tokens=4000)
result = engine.run(
    user_prompt="What are the payment terms?",
    context=load("contract.pdf"),
    llm_call=lambda p: client.responses.create(model="gpt-4o-mini", input=p),
)

Path 3: Multi-document RAG (3 lines)

from contextbuddy import Retriever, MemoryStore, load

store = MemoryStore().add(load("./docs/"))
result = Retriever(store, dev_mode=True).query(
    "What are the payment terms?",
    llm_call=lambda p: client.responses.create(model="gpt-4o-mini", input=p),
)

Path 4: Full pipeline (one-liner setup)

from contextbuddy import Pipeline

pipeline = Pipeline.from_directory("./docs/", dev_mode=True)
result = pipeline.query("Summarize the contract", llm_call=my_llm)

Path 5: LangChain pipeline

Drop ContextBuddy into any existing LangChain retriever as a ContextualCompressionRetriever. Your retriever stays the same -- ContextBuddy just compresses what it returns.

from langchain.retrievers import ContextualCompressionRetriever
from contextbuddy import ContextBuddyCompressor

compressor = ContextBuddyCompressor(max_context_tokens=3000)
retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=your_existing_retriever,
)
docs = retriever.invoke("What is the refund policy?")
# Only high-relevance, budget-fitting chunks survive. Entities always kept.

No rewrites required. Install contextbuddy[langchain], add 4 lines, and your pipeline is 60% cheaper.


Architecture

Your Files (PDFs, URLs, DOCX, TXT, CSV, directories)
    |
    v
+--------------+
|  Loaders     |  load("file.pdf") / load("https://...") / load("./dir/")
+------+-------+
       |
       v
+--------------+
|  Store       |  In-memory vector index (auto-dedup, metadata, persistence)
+------+-------+
       |
       v
+--------------+
|  Retriever   |  Semantic search -> top-k chunks
+------+-------+
       |
       v
+--------------+
|  Compressor  |  Prune -> entity keep-list -> token budget -> compose
+------+-------+
       |
       v
+--------------+
|  Router      |  Score query complexity -> pick cheap or expensive model
+------+-------+
       |
       v
+--------------+
|  Cache       |  Embedding cache + response cache (skip redundant work)
+------+-------+
       |
       v
  Your LLM (OpenAI / Anthropic / Google / Local)

Every layer is optional. Use one, use all, or use any combination.


How Compression Actually Works (No ML, No NumPy)

ContextBuddy doesn't use a neural network to compress your context. The entire pipeline is algorithmic, using techniques that predate deep learning by decades -- but combined in a way that delivers results competitive with embedding-based approaches. Here's exactly what happens when you call engine.run():

Step 1: Chunking

Your raw text (PDF/web/code/plain text) is chunked into coherent units using a document-aware chunker:

  • Generic text: paragraph/sentence-aware merging (avoids tiny orphan fragments)
  • PDF: normalizes line-break artifacts and avoids page-wise chunking
  • Contracts: groups clause/section headers with their bodies (keeps related content together)
  • Python code: keeps imports + functions/classes intact (no mid-function splits)

The goal is not "more chunks" -- it's better chunk boundaries, so relevance scoring and budgeting keep the right information with fewer tokens.
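
To illustrate the "avoid tiny orphan fragments" idea for generic text, here is a simplified paragraph-merging sketch (illustrative only, not the library's document-aware chunker):

def chunk_paragraphs(text: str, min_chars: int = 200) -> list:
    """Split on blank lines, then merge tiny fragments into their neighbour
    so relevance scoring never sees orphan one-line chunks."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks = []
    for p in paragraphs:
        if chunks and len(chunks[-1]) < min_chars:
            chunks[-1] = chunks[-1] + "\n\n" + p   # fold into the previous small chunk
        else:
            chunks.append(p)
    return chunks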

Step 2: Relevance Scoring (HybridScorer -- the secret sauce)

This is where ContextBuddy is different from every other compression library. Instead of relying on a single signal, the default HybridScorer combines four independent scoring signals into one relevance score:

Signal 1: BM25 (70% weight) -- The same algorithm that powers Elasticsearch and Lucene. It handles term-frequency saturation (saying "payment" 10 times isn't 10x more relevant than once), document-length normalization (longer paragraphs don't cheat the ranking), and inverse-document-frequency weighting (rare words matter more than common ones). This alone is a massive upgrade over naive keyword matching.
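
For intuition, a compact sketch of the classic Okapi BM25 formula (illustrative only; ContextBuddy's scorer layers stemming and the weighting described below on top):

import math

def bm25_score(query_terms, doc_terms, all_docs, k1=1.5, b=0.75):
    """Okapi BM25 for one tokenized document against a query."""
    avg_len = sum(len(d) for d in all_docs) / len(all_docs)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in all_docs if term in d)                   # document frequency
        idf = math.log(1 + (len(all_docs) - df + 0.5) / (df + 0.5))  # rare terms weigh more
        tf = doc_terms.count(term)                                   # term frequency (saturates below)
        denom = tf + k1 * (1 - b + b * len(doc_terms) / avg_len)     # length normalization
        score += idf * (tf * (k1 + 1)) / denom if denom else 0.0
    return score

docs = [["payment", "due", "within", "30", "days"], ["cafeteria", "hours"]]
print(bm25_score(["payment", "terms"], docs[0], docs))  # higher: "payment" present
print(bm25_score(["payment", "terms"], docs[1], docs))  # lower: no query terms present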

Signal 2: Stemming (built into BM25) -- A lightweight suffix-stripping stemmer normalizes word forms before scoring. "payments" matches "payment". "running" matches "run". "organized" matches "organizing". No NLTK, no spaCy -- just 120 lines of pure Python implementing the most impactful Porter stemmer rules.

Signal 3: Synonym Expansion (15% weight) -- A built-in thesaurus of ~200 word groups covering business, legal, tech, medical, and general vocabulary. When you ask about "car insurance," the scorer automatically expands "car" to also check for "automobile," "vehicle," and "auto" in every paragraph. "Buy" matches "purchase." "Salary" matches "compensation." "Error" matches "bug." All offline, zero API calls.

Signal 4: Character N-gram Fuzzy Matching (15% weight) -- Catches morphological variants and typos that stemming misses. "optimise" matches "optimize." "colour" matches "color." Works by computing Jaccard similarity over character trigrams -- if two words share enough 3-character substrings, they're treated as partial matches.
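
To make the fuzzy-matching idea concrete, here is a standalone sketch of trigram Jaccard similarity (illustrative only, not ContextBuddy's internal implementation):

def trigrams(word: str) -> set:
    """Character trigrams of a lowercased word, e.g. 'color' -> {'col', 'olo', 'lor'}."""
    w = word.lower()
    return {w[i:i + 3] for i in range(len(w) - 2)} if len(w) >= 3 else {w}

def trigram_jaccard(a: str, b: str) -> float:
    """Jaccard similarity over character trigrams: |A ∩ B| / |A ∪ B|."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

print(trigram_jaccard("optimise", "optimize"))  # high overlap -> treated as a partial match
print(trigram_jaccard("colour", "payment"))     # no overlap -> not a match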

The four signals are normalized to [0, 1] and combined with configurable weights. The result: paragraphs that are genuinely relevant to your question score high, even when they use completely different words.

from contextbuddy import HybridScorer

scorer = HybridScorer()
scores = scorer.score(
    query="What is the car insurance policy?",
    chunks=[
        "The automobile coverage plan includes collision and liability.",  # scores HIGH (synonym match)
        "Employee cafeteria hours are 12pm to 2pm.",                      # scores LOW (irrelevant)
    ],
)

Step 3: Entity Extraction

Regex patterns scan every paragraph for critical data: emails, URLs, dates, dollar amounts, IDs, phone numbers, ticket numbers, etc. Any paragraph containing a detected entity is force-kept regardless of its relevance score, so you never accidentally drop the invoice ID the user asked about.
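
As an illustration of the idea (simplified patterns, not the library's actual regex set):

import re

# Simplified, illustrative patterns -- the real keep-list covers many more entity types.
ENTITY_PATTERNS = [
    re.compile(r"\bINV-\d+\b"),               # invoice-style IDs
    re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),     # ISO dates
    re.compile(r"\$\d[\d,]*(?:\.\d{2})?"),    # dollar amounts
    re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b"),    # ticket numbers like ACME-2041
]

def must_keep(paragraph: str) -> bool:
    """Force-keep any paragraph that contains a detected entity."""
    return any(p.search(paragraph) for p in ENTITY_PATTERNS)

print(must_keep("Invoice INV-92831 issued 2026-04-01 for $4,500.00"))  # True
print(must_keep("Employee cafeteria hours are 12pm to 2pm."))          # False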

Step 4: Budget Enforcement

The surviving paragraphs are sorted by importance (entity-containing chunks first, then by relevance score) and greedily packed into the token budget. If even a single chunk won't fit, it's extractively summarized (leading sentences kept until the limit). The final prompt always fits the budget you set.
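
A minimal sketch of the greedy-packing idea (illustrative; it borrows the 4-chars-per-token heuristic mentioned in the FAQ rather than a real tokenizer):

def estimate_tokens(text: str) -> int:
    """Rough heuristic: ~4 characters per token."""
    return max(1, len(text) // 4)

def pack_chunks(chunks, budget_tokens):
    """Greedily keep the most important chunks that still fit the token budget.
    `chunks` is a list of (text, has_entity, relevance) tuples."""
    # Entity-bearing chunks first, then by relevance score, as described above.
    ordered = sorted(chunks, key=lambda c: (not c[1], -c[2]))
    kept, used = [], 0
    for text, _, _ in ordered:
        cost = estimate_tokens(text)
        if used + cost <= budget_tokens:
            kept.append(text)
            used += cost
    return kept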

Scorer Comparison

| | HybridScorer (default) | SemanticScorer + LocalHashEmbedder | SemanticScorer + OpenAIEmbedder |
| --- | --- | --- | --- |
| Understands synonyms | Yes (built-in thesaurus) | No | Yes |
| Handles word forms | Yes (stemming) | No | Yes |
| Fuzzy matching | Yes (n-grams) | No | No |
| IDF weighting | Yes (BM25) | No | Yes |
| Needs API key | No | No | Yes |
| Needs internet | No | No | Yes |
| Dependencies | Zero | Zero | openai package |
| Cost | Free | Free | ~$0.0002/doc |
| Latency | <5ms | <2ms | ~200ms |

The HybridScorer is the default because it gives the best results for zero cost and zero dependencies. For production use cases with highly specialized vocabulary (niche medical terms, non-English content), you can still swap in OpenAIEmbedder for true neural semantic matching:

from contextbuddy import ContextEngine, ContextEngineConfig
from contextbuddy.embedder import OpenAIEmbedder

engine = ContextEngine(
    ContextEngineConfig(max_context_tokens=4000, dev_mode=True),
    embedder=OpenAIEmbedder(),  # neural embeddings for edge cases
)

Or bring your own scorer -- any object with a score(query=..., chunks=...) -> List[float] method works.
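
For instance, a toy scorer that implements that protocol (a sketch; the exact way to pass it to the engine isn't shown in this README, so the wiring line at the end is hypothetical):

from typing import List

class KeywordBoostScorer:
    """Custom scorer implementing score(query=..., chunks=...) -> List[float]."""
    def __init__(self, boost_terms: List[str]):
        self.boost_terms = [t.lower() for t in boost_terms]

    def score(self, query: str, chunks: List[str]) -> List[float]:
        q_terms = set(query.lower().split())
        scores = []
        for chunk in chunks:
            words = set(chunk.lower().split())
            overlap = len(q_terms & words) / max(1, len(q_terms))
            boost = 0.2 if any(t in chunk.lower() for t in self.boost_terms) else 0.0
            scores.append(min(1.0, overlap + boost))
        return scores

# Hypothetical wiring -- check the docs for the actual parameter name:
# engine = ContextEngine(ContextEngineConfig(...), scorer=KeywordBoostScorer(["invoice"]))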


Document Loaders

from contextbuddy.loaders import load

load("report.pdf")                    # PDF (pip install contextbuddy[pdf])
load("https://docs.example.com")      # Web page (pip install contextbuddy[web])
load("notes.docx")                    # Word doc (pip install contextbuddy[docx])
load("data.csv")                      # CSV (rows as chunks)
load("config.json")                   # JSON (keys/items as chunks)
load("./documents/")                  # Entire directory (recursive)
load(["a.pdf", "b.txt", "c.docx"])   # Batch load

Zero-dep formats: .txt, .md, .csv, .json, .log, .xml, .yaml, .html


Vector Store

from contextbuddy import MemoryStore, PersistentStore, load

# In-memory (default)
store = MemoryStore()
store.add(load("report.pdf"), metadata={"source": "report.pdf"})
store.add(load("notes.txt"), metadata={"source": "notes.txt"})
results = store.search("payment terms", top_k=10)

# Persistent (survives restarts)
store = PersistentStore("./my_index.json")
store.add(load("./docs/"))
# Auto-saves to disk. Reloads on next init.

Features: auto-deduplication, metadata tracking, serialization, pure-Python cosine search.
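
For reference, "pure-Python cosine search" just means ranking stored chunks by the cosine similarity of their embedding vectors. A dependency-free sketch of the metric (illustrative, not the library's internal code):

from typing import Sequence

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

print(cosine_similarity([1.0, 0.0, 1.0], [1.0, 1.0, 0.0]))  # 0.5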


Smart Model Router

Route simple queries to cheap models. Route complex ones to expensive models. All offline.

from contextbuddy import Router, Pipeline

router = Router([
    {"max_complexity": 0.3, "model": "gpt-4o-mini"},
    {"max_complexity": 1.0, "model": "gpt-4o"},
])

pipeline = Pipeline.from_directory("./docs/", router=router, dev_mode=True)
result = pipeline.query(
    "Summarize the contract",
    llm_calls={
        "gpt-4o-mini": lambda p: cheap_client.responses.create(model="gpt-4o-mini", input=p),
        "gpt-4o": lambda p: expensive_client.responses.create(model="gpt-4o", input=p),
    },
)

Caching

from contextbuddy import Pipeline, EmbeddingCache, ResponseCache

pipeline = Pipeline.from_directory(
    "./docs/",
    embedding_cache=EmbeddingCache(persist_path="./cache/embeddings.json"),
    response_cache=ResponseCache(ttl_seconds=3600),
)
# First query embeds + calls LLM. Second identical query: instant.

Agent Tools

ContextBuddy generates OpenAI-compatible function/tool schemas for agents:

from contextbuddy.tools import make_search_tool, make_compress_tool, handle_tool_call

tools = [make_search_tool(store), make_compress_tool(engine)]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    tools=tools,
)

# Dispatch tool calls
for tc in response.choices[0].message.tool_calls:
    result = handle_tool_call(tc, tools)

Streaming

for chunk in engine.run(
    user_prompt="Summarize",
    context=load("report.pdf"),
    llm_call=lambda p: client.responses.create(model="gpt-4o-mini", input=p, stream=True),
    stream=True,
):
    print(chunk, end="")

OpenAI Drop-in Wrapper

Zero code changes to your existing app:

import openai
from contextbuddy import wrap_openai

client = wrap_openai(openai.OpenAI(), max_context_tokens=4000, dev_mode=True)
# Use client.chat.completions.create() exactly as before.
# System messages are automatically compressed.

CLI (no API key needed)

echo "Your huge context..." | python -m contextbuddy compress \
    --prompt "What are the key points?" \
    --max-tokens 2000 \
    --show-prompt

python -m contextbuddy compress \
    --file report.txt \
    --prompt "Extract action items" \
    --model gpt-4o

Entity Types Preserved

| Category | Examples |
| --- | --- |
| Emails | alice@example.com |
| URLs | https://api.example.com/v2/users |
| Dates | 2026-04-13, 04/13/2026, 2026-04-13T10:30 |
| UUIDs | 550e8400-e29b-41d4-a716-446655440000 |
| Tickets | JIRA-1234, ACME-2041 |
| Phone numbers | +1-555-867-5309 |
| Money | $4,500.00, 1000 USD |
| IPs | 192.168.1.100 |
| ID-like values | account_id=acct_12345 |
| Versions | v2.1.0 |

Pre-built Model Pricing

from contextbuddy.pricing import (
    OPENAI_GPT4O, OPENAI_GPT4O_MINI, OPENAI_GPT41, OPENAI_GPT41_MINI,
    OPENAI_O3, OPENAI_O4_MINI,
    CLAUDE_OPUS_4, CLAUDE_SONNET_4, CLAUDE_HAIKU_35,
    GEMINI_25_PRO, GEMINI_25_FLASH,
    LOCAL_FREE,
    get_pricing,  # get_pricing("gpt-4o") -> ModelPricing
)

Programmatic Report

report = engine.last_report

report.original_prompt_tokens   # 15000
report.final_prompt_tokens      # 3000
report.reduction_pct            # 80.0
report.estimated_savings        # 0.06
report.kept_chunks              # 4
report.total_chunks             # 12
report.entities                 # ["INV-92831", "2026-04-01", ...]
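
For instance, a minimal per-request cost log built from these fields (a sketch; it assumes an engine from the quickstart and uses only the fields listed above):

import json
import logging

logging.basicConfig(level=logging.INFO)

report = engine.last_report  # populated after engine.run(...) or engine.build_prompt(...)
logging.info(json.dumps({
    "tokens_before": report.original_prompt_tokens,
    "tokens_after": report.final_prompt_tokens,
    "reduction_pct": report.reduction_pct,
    "estimated_savings_usd": report.estimated_savings,
    "kept_chunks": report.kept_chunks,
    "total_chunks": report.total_chunks,
    "entities": report.entities,
}))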

Public API Reference

| Export | Module | Description |
| --- | --- | --- |
| ContextEngine | contextbuddy.engine | Core compression engine |
| ContextEngineConfig | contextbuddy.engine | Configuration dataclass |
| ContextReport | contextbuddy.engine | Compression telemetry / ROI report |
| HybridScorer | contextbuddy.hybrid_scorer | BM25 + stemming + synonyms + n-grams scorer |
| SemanticScorer | contextbuddy.scoring | Embedding-based cosine scorer |
| MemoryStore | contextbuddy.store.memory | In-memory vector store |
| PersistentStore | contextbuddy.store.persistent | Disk-backed vector store |
| Retriever | contextbuddy.retriever | Search + compress pipeline |
| Pipeline | contextbuddy.pipeline | Full end-to-end pipeline |
| Router | contextbuddy.router | Complexity-based model router |
| EmbeddingCache | contextbuddy.cache | Persistent embedding cache |
| ResponseCache | contextbuddy.cache | TTL response cache |
| ContextBuddyCompressor | contextbuddy.langchain | LangChain BaseDocumentCompressor |
| ContextBuddyRetriever | contextbuddy.langchain | LangChain BaseRetriever with compression |
| wrap_openai | contextbuddy.wrappers | OpenAI client drop-in wrapper |
| load | contextbuddy.loaders | Universal file/URL/directory loader |
| get_pricing | contextbuddy.pricing | Model pricing lookup |
| Embedder | contextbuddy.types | Protocol for custom embedders |
| Tokenizer | contextbuddy.types | Protocol for custom tokenizers |

Real-World Use Cases

Customer Support Bot

Your chatbot pulls a customer's full history (invoices, tickets, emails, notes) for every query -- ~15,000 tokens. Most of it is irrelevant.

from contextbuddy import Pipeline

pipeline = Pipeline.from_directory("./customer_data/acct_12345/", dev_mode=True, max_context_tokens=3000)
answer = pipeline.query("What was my last invoice amount?", llm_call=my_llm)
# [ContextBuddy] 15000 -> 2800 tokens (81.3% reduction). Est. savings: $0.0305
# Entity keep-list preserved: INV-92831, $4,500.00, 2026-04-01, acct_12345

At 10,000 queries/day: $11,250/month without ContextBuddy vs $2,250/month with it.
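
The arithmetic behind those numbers (assuming roughly $2.50 per 1M input tokens, in the ballpark of GPT-4o input pricing; adjust for your model):

queries_per_month = 10_000 * 30
price_per_million_input_tokens = 2.50   # assumed

without_cb = queries_per_month * 15_000 / 1e6 * price_per_million_input_tokens   # $11,250/month
with_cb = queries_per_month * 3_000 / 1e6 * price_per_million_input_tokens       # $2,250/month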

Legal Document Review

A law firm uploads a 50-page contract. Lawyers ask questions about specific clauses.

from contextbuddy import Pipeline

pipeline = Pipeline.from_directory("./contracts/", dev_mode=True, max_context_tokens=4000)
answer = pipeline.query("What are the payment terms and late penalties?", llm_call=my_llm)

ContextBuddy loads the PDF, indexes 200+ paragraphs, retrieves the relevant ones, prunes to the 5 that matter, and preserves all clause numbers, dates, and dollar amounts. Without it, you'd need LangChain + a vector database + 50 lines of glue code.

Internal Knowledge Base

500 internal docs (Confluence exports, PDFs, Markdown). Engineers ask questions via Slack bot.

from contextbuddy import Pipeline, PersistentStore, Router

pipeline = Pipeline(
    store=PersistentStore("./index.json"),
    router=Router([
        {"max_complexity": 0.3, "model": "gpt-4o-mini"},
        {"max_complexity": 1.0, "model": "gpt-4o"},
    ]),
    dev_mode=True,
)
pipeline.add("./company_docs/")
answer = pipeline.query(slack_message, llm_calls={"gpt-4o-mini": cheap_fn, "gpt-4o": expensive_fn})

Simple questions ("What's the WiFi password?") route to the cheap model. Complex questions ("Compare our auth architecture options") route to the expensive one. Router alone saves 60-70% on top of compression.


When NOT to Use ContextBuddy

Being honest:

  • Full agent orchestration (multi-step reasoning, tool chains, long-term memory) -- use LangGraph or CrewAI instead. ContextBuddy compresses context, it doesn't orchestrate agents.
  • Billion-scale vector search -- if you have 100M+ documents and need sub-millisecond search, use Pinecone or Weaviate directly. ContextBuddy's in-memory store is designed for <100k chunks.
  • Already deep in LangChain and it's working -- don't rewrite. Instead, add ContextBuddy as a compression layer with zero disruption:
from langchain.retrievers import ContextualCompressionRetriever
from contextbuddy import ContextBuddyCompressor

# 4 lines. Your existing retriever stays untouched.
compressor = ContextBuddyCompressor(max_context_tokens=4000)
compressed_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=your_existing_langchain_retriever,
)
docs = compressed_retriever.invoke(user_question)
# Irrelevant chunks pruned, entities preserved, token budget enforced.
# Pass `docs` to your chain exactly as before -- just cheaper.

Or, if you prefer the lower-level approach:

from contextbuddy import ContextEngine

engine = ContextEngine(max_context_tokens=4000)

# Inside your LangChain chain, after retrieval but before the LLM call:
compressed_prompt, report = engine.build_prompt(
    user_prompt=user_question,
    context=retrieved_documents,   # from your existing LangChain retriever
)
# Pass compressed_prompt to your LLM instead of the raw retrieved docs

FAQ

Will this hurt answer quality? It can if you prune too aggressively. Start with min_relevance=0.10 and inspect the compressed prompt in dev mode. The entity keep-list ensures critical data points survive.

Does it send my data anywhere? Not by default. The built-in embedder and vector store run 100% locally with zero dependencies. Only if you explicitly plug in OpenAIEmbedder does it call an external API.

Does it work with async frameworks (FastAPI, etc.)? engine.arun() is async-compatible -- the LLM call is awaited. Note: the compression step (chunking + scoring) runs synchronously inside the coroutine. For high-concurrency workloads, wrap compression with asyncio.to_thread(engine.build_prompt, ...). True async compression is planned for v0.4.0.
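
A minimal sketch of that pattern (assumes the engine and build_prompt API from the quickstart and an async OpenAI client; asyncio.to_thread needs Python 3.9+):

import asyncio
import openai

async_client = openai.AsyncOpenAI()

async def answer(question: str, context: str) -> str:
    # Run the synchronous compression step off the event loop.
    final_prompt, report = await asyncio.to_thread(
        engine.build_prompt, user_prompt=question, context=context
    )
    # Await the LLM call as usual (any async client works).
    response = await async_client.responses.create(model="gpt-4o-mini", input=final_prompt)
    return response.output_text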

Does it work with streaming? Yes. Pass stream=True to engine.run(). ContextBuddy emits the ROI report, then yields LLM chunks.

How accurate is the token count? The default HeuristicTokenizer uses a 4-chars-per-token rule. For exact counts: pip install contextbuddy[tiktoken].

Can I use this in production? Yes. The core pipeline is deterministic, dependency-free, and fast (<10ms for typical payloads). Set dev_mode=False to disable telemetry.

How is this different from LangChain? ContextBuddy is compression-first. LangChain retrieves context but sends it all to the LLM. ContextBuddy retrieves, compresses, preserves entities, and shows you exactly how much you're saving. Zero core dependencies vs 100+. And with the [langchain] extra, the two work together -- ContextBuddy plugs in as the compression layer LangChain never had.

Does it work with LangChain? Yes, natively. Install contextbuddy[langchain] and use ContextBuddyCompressor as a drop-in base_compressor for ContextualCompressionRetriever, or use ContextBuddyRetriever to wrap a MemoryStore. See the LangChain Integration section.

How does compression work without an LLM? It doesn't need one. The pipeline has four stages: (1) document-aware chunking, (2) relevance scoring via BM25 + stemming + synonym expansion + character n-gram fuzzy matching, (3) entity force-keep — any chunk containing an ID, date, dollar amount, UUID, etc. is kept regardless of score, (4) greedy budget packing. No neural network, no API calls, no randomness. Sub-5ms on a typical payload.

How do you guarantee compression quality without an LLM? Two ways. First, the entity keep-list is a hard guarantee — regex-matched entities (IDs, dates, money, tickets) always survive, no matter what the scorer says. Second, every release must pass a benchmark gate: 100% entity survival rate and a minimum answer coverage threshold. If a code change breaks either, it doesn't ship. You can run the gate yourself: python -m contextbuddy bench --gate.

Do I need to set up OpenAI/Gemini/Meta embeddings manually? No. Each provider is a one-line install:

pip install "contextbuddy[openai]"    # OpenAI
pip install "contextbuddy[gemini]"    # Google Gemini
pip install "contextbuddy[ollama]"    # Meta Llama / any local model via Ollama (no API key)
pip install "contextbuddy[sbert]"     # sentence-transformers (fully local, no API key)

Then pass the embedder as a single argument to ContextEngine. Your API key goes in the environment (OPENAI_API_KEY, GOOGLE_API_KEY). Nothing else to configure.

What about Meta / Llama embeddings specifically? Meta doesn't offer a hosted embedding API, so the practical path is Ollama — install Ollama, pull a model (ollama pull nomic-embed-text), and use OllamaEmbedder. Runs fully local, no API key, no data leaving your machine, zero cost.

Why use this over other retrieval frameworks? Frameworks like LangChain, LlamaIndex, and LightRAG solve retrieval -- fetching the right documents. None of them compress what they retrieve: they send all 20 retrieved chunks to the LLM regardless of relevance. ContextBuddy cuts that down to the 4 that actually matter, preserves every entity, enforces a token budget, and shows you the dollar savings on every call. It's not a replacement for those frameworks -- it's the compression layer they're all missing, and it plugs into all of them with 3 lines.


Why I Built This

I'm a recent CS grad. I was deep in the rabbit hole of context engineering -- reading papers, watching talks, experimenting with how LLMs actually use the context you feed them. And I kept hitting the same wall.

I had a project that needed RAG. Load some PDFs, ask questions, get answers. Simple, right? So I reached for LangChain. And then I spent two days wrestling with 100+ dependencies, cryptic abstractions, and a codebase that felt like it was designed for a different problem. I just wanted to load a PDF and compress the context before sending it to an LLM. I didn't need an agent framework. I didn't need a plugin ecosystem. I needed maybe 200 lines of focused code.

So I closed my laptop, went for a walk, and thought: what if the entire layer between "raw data" and "LLM call" was just... simple?

That's what ContextBuddy is. It's the library I wished existed when I started.

The core insight was that most LLM applications are sending 5-10x more context than they need to. You scrape a 50-page contract, dump the whole thing into GPT-4, and pay for 15,000 tokens when only 3,000 matter. The LLM doesn't even perform better with the extra noise -- it performs worse. Context engineering isn't about stuffing more tokens in. It's about sending the right tokens.

I built ContextBuddy with a few principles:

  1. Zero dependencies for the core. If you just want to compress text, you shouldn't need to install anything else. No numpy. No torch. No tiktoken. Just Python.
  2. Three lines to integrate. If it takes more than that, developers will bounce. I know because I bounced.
  3. Show the ROI. Every call prints exactly how many tokens and dollars you saved. Not because it's a gimmick -- because developers need to justify tool choices to their managers, and a screenshot of "$0.12 saved per call" does that instantly.
  4. Grow with you. Start with 3 lines. When you need PDF loading, add it. When you need a vector store, add it. When you need model routing, add it. You should never have to rip out ContextBuddy and replace it with LangChain because you outgrew it.

I'm not claiming this replaces LangChain for every use case. If you need multi-step agent orchestration with tool chains and long-term memory, LangChain/LangGraph is the right call. But for the 80% of LLM applications that just need to load data, compress context, and call a model? ContextBuddy does it in a fraction of the code, with zero bloat, and it shows you exactly how much money you're saving.

This started as a side project born out of frustration. I'm sharing it because I think every developer building with LLMs deserves a simpler option.

If it saves you time or money, star the repo. That's all I ask.


Contributing

git clone https://github.com/mohithgowdak/ContextBuddy.git
cd ContextBuddy
pip install -e ".[dev]"
pytest

License

MIT License. See LICENSE.
