Skip to main content

Production-grade LLMOps infrastructure for context window management, token counting, document chunking, and compression

Project description

LLM Context Forge

Production-Grade LLMOps Infrastructure for Context Window Management

Deterministic token counting · Intelligent chunking · Priority-based context assembly · Cost estimation — the foundation every AI application needs.

![Tests](https://github.com/dhruv-atomic-mui21/layout lm-forge/workflows/Tests/badge.svg) PyPI Python License: MIT Downloads

---

Note: This package is a general-purpose LLM context management toolkit and is not related to Microsoft's LayoutLM multimodal models.

Why LLM Context Forge?

Every production AI application eventually hits the same infrastructure problems:

Problem Impact LLM Context Forge Solution
Context window overflow Silent failures, truncated responses Priority-based assembly with overflow tracking
Inaccurate token counting Budget overruns, dropped requests Deterministic counting via tiktoken + heuristic fallbacks
Naive text splitting Broken semantics, degraded LLM reasoning 5 chunking strategies (sentence, paragraph, semantic, code, fixed)
Unpredictable API costs Surprise bills, no cost governance Pre-flight cost estimation across 15+ models
Oversized prompts Wasted tokens, slow responses 4 compression strategies (extractive, truncate, middle-out, map-reduce)

Installation

pip install llm-context-forge

With API server support:

pip install "llm-context-forge[api]"

Quick Start

Token Counting

from llm_context_forge import TokenCounter

counter = TokenCounter("gpt-4o")
tokens = counter.count("Hello, world!")
print(f"Tokens: {tokens}")  # Tokens: 4

# Check context window fit
fits = counter.fits_in_window("Your prompt...", reserve_output=500)

# Estimate cost before sending
cost = counter.estimate_cost("Your prompt...", direction="input")
print(f"Cost: ${cost:.6f}")

Intelligent Chunking

from llm_context_forge import DocumentChunker, ChunkStrategy

chunker = DocumentChunker("gpt-4o")

# Chunk respecting paragraph boundaries
chunks = chunker.chunk(
    long_document,
    strategy=ChunkStrategy.PARAGRAPH,
    max_tokens=500,
    overlap_tokens=50,
)

# Specialized chunkers
code_chunks = chunker.chunk_code(source_code, language="python")
md_chunks = chunker.chunk_markdown(readme_text)

Priority-Based Context Assembly

The core pattern for RAG applications — guarantee critical context fits while gracefully dropping lower-priority content:

from llm_context_forge import ContextWindow, Priority

window = ContextWindow("gpt-4o")

# System instructions — always included
window.add_block("You are a legal assistant.", Priority.CRITICAL, "system")

# User query — high priority
window.add_block("What is the statute of limitations?", Priority.HIGH, "query")

# RAG search results — included if space permits
window.add_block(search_result_1, Priority.MEDIUM, "rag_1")
window.add_block(search_result_2, Priority.LOW, "rag_2")

# Assemble: packs highest-priority blocks first
prompt = window.assemble(max_tokens=4096)

# See what was included/dropped
usage = window.usage()
print(f"Included: {usage['num_included']} blocks ({usage['included_tokens']} tokens)")
print(f"Dropped:  {usage['num_excluded']} blocks")

Cost Estimation

from llm_context_forge import CostCalculator

calc = CostCalculator("gpt-4o")

# Single prompt cost
cost = calc.estimate_prompt("Your prompt text here")
print(f"Input cost: ${cost.usd:.6f}")

# Compare models
comparison = calc.compare_models(
    texts=["Document 1...", "Document 2..."],
    models=["gpt-4o", "gpt-4o-mini", "claude-3.5-sonnet", "gemini-flash"],
)
for model, analysis in comparison.items():
    print(f"{model}: ${analysis.total_usd:.6f} for {analysis.total_tokens} tokens")

Context Compression

from llm_context_forge import ContextCompressor, CompressionStrategy

compressor = ContextCompressor("gpt-4o")

# Extractive: keeps most important sentences via TF-IDF scoring
result = compressor.compress(long_text, target_tokens=200)
print(f"Compressed: {result.original_tokens}{result.compressed_tokens} tokens")
print(f"Savings: {result.savings_pct:.1f}%")

# Middle-out: preserves start and end, removes middle
result = compressor.compress(log_text, target_tokens=300, strategy=CompressionStrategy.MIDDLE_OUT)

Conversation Management

from llm_context_forge import ConversationManager

manager = ConversationManager("gpt-4o")

manager.add_message("system", "You are a helpful Python tutor.")
manager.add_message("user", "Explain decorators")
manager.add_message("assistant", "Decorators are...")
# ... many more turns ...

# Auto-trim older messages to fit budget, preserving system prompt
trimmed = manager.get_context(max_tokens=4096, preserve_system=True)

Supported Models

Provider Models Token Counting Pricing
OpenAI GPT-4, GPT-4 Turbo, GPT-4o, GPT-4o-mini, GPT-3.5 Turbo tiktoken
Anthropic Claude 3 Opus, Claude 3.5 Sonnet, Claude 3 Haiku anthropic
Google Gemini Pro, Gemini Flash transformers
Meta Llama 3 8B, Llama 3 70B, Llama 3.1 405B transformers
Mistral Mistral Large mistral-common
Cohere Command R+ transformers

Production-Grade Tokenizer Fallback

In production environments, external tokenizer packages (transformers, mistral-common) might fail to download or initialize due to network errors. llm-context-forge provides a robust, production-grade fallback:

  • If a native tokenizer fails to load, the system degrades to OpenAI's fast cl100k_base (tiktoken).
  • Since most modern LLMs utilize similar Byte-Pair Encoding (BPE), cl100k_base offers a highly accurate baseline.
  • llm-context-forge automatically applies structural safety multipliers (e.g. 1.05x) specifically tuned to each backend before throwing an overflow warning.
  • A one-time warning is emitted via standard python logging to notify infrastructure teams of the fallback engagement.

Register custom models:

from llm_context_forge import ModelRegistry, ModelInfo, TokenizerBackend

ModelRegistry.register(ModelInfo(
    name="my-fine-tuned-model",
    backend=TokenizerBackend.OPENAI,
    context_window=16_384,
    encoding_name="cl100k_base",
    input_cost_per_1k=0.002,
    output_cost_per_1k=0.006,
))

CLI

# Count tokens
llm_context_forge count "Hello world" --model gpt-4o

# Chunk a document
llm_context_forge chunk document.md --strategy semantic --max-tokens 500

# Estimate cost
llm_context_forge cost document.txt --model claude-3.5-sonnet

# List all models
llm_context_forge models

# Health check
llm_context_forge doctor

# Start API server
llm_context_forge serve --port 8000

# Interactive demo
llm_context_forge demo

REST API

Start the server and access interactive docs at http://localhost:8000/docs:

pip install "llm_context_forge[api]"
llm_context_forge serve
Endpoint Method Description
/health GET System health + version
/api/v1/tokens/count POST Count tokens
/api/v1/tokens/validate POST Check context window fit
/api/v1/chunks/ POST Chunk text
/api/v1/context/assemble POST Priority-based assembly
/api/v1/compress/ POST Compress text
/api/v1/cost/estimate POST Estimate cost

Architecture

llm_context_forge/
├── models.py        # Model registry (15+ models, pricing, backends)
├── tokenizer.py     # Multi-provider token counter (tiktoken + heuristics)
├── chunker.py       # 5-strategy document chunker with overlap
├── context.py       # Priority-based context assembly + conversation manager
├── compressor.py    # 4-strategy compression engine (TF-IDF, middle-out, etc.)
├── cost.py          # Cost estimation engine with model comparison
├── cli/main.py      # Typer CLI with Rich output
└── api/             # FastAPI server with versioned routes

Docker

docker build -t llm_context_forge .
docker-compose up

Development

git clone https://github.com/dhruv-atomic-mui21/llm_context_forge.git
cd llm_context_forge
pip install -e ".[dev]"
pytest

Contributing

See CONTRIBUTING.md for development workflow guidelines.

License

MIT — see LICENSE.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

llm_context_forge-0.1.5.tar.gz (36.5 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

llm_context_forge-0.1.5-py3-none-any.whl (32.7 kB view details)

Uploaded Python 3

File details

Details for the file llm_context_forge-0.1.5.tar.gz.

File metadata

  • Download URL: llm_context_forge-0.1.5.tar.gz
  • Upload date:
  • Size: 36.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for llm_context_forge-0.1.5.tar.gz
Algorithm Hash digest
SHA256 609cdf8e868e7e2119f5d307d1773cc59b811b5606b9f4df933306d7ce94ff78
MD5 6865a6c0ac23f9c19f60fde045a02e25
BLAKE2b-256 7db7b2ab2b349a643dae947a5021c14003cbd420063d78a09cce774ded12d5fe

See more details on using hashes here.

File details

Details for the file llm_context_forge-0.1.5-py3-none-any.whl.

File metadata

File hashes

Hashes for llm_context_forge-0.1.5-py3-none-any.whl
Algorithm Hash digest
SHA256 b4916f96da412384306ab7e64570604f643cadfbf10fca31d4b19f8f82f0dff5
MD5 a091d2c5091a37b3cdc930d7cd01fe5d
BLAKE2b-256 88c26e380a9228a8d133d6a095cb747548d72c2db8d91c9d6cca4d5669132281

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page