Production-grade LLMOps infrastructure for context window management, token counting, document chunking, and compression

These details have not been verified by PyPI

Project links

Project description

LLM Context Forge

Production-Grade LLMOps Infrastructure for Context Window Management

Deterministic token counting · Intelligent chunking · Priority-based context assembly · Cost estimation — the foundation every AI application needs.

![Tests](https://github.com/dhruv-atomic-mui21/layout lm-forge/workflows/Tests/badge.svg)

---

Note: This package is a general-purpose LLM context management toolkit and is not related to Microsoft's LayoutLM multimodal models.

Why LLM Context Forge?

Every production AI application eventually hits the same infrastructure problems:

Problem	Impact	LLM Context Forge Solution
Context window overflow	Silent failures, truncated responses	Priority-based assembly with overflow tracking
Inaccurate token counting	Budget overruns, dropped requests	Deterministic counting via tiktoken + heuristic fallbacks
Naive text splitting	Broken semantics, degraded LLM reasoning	5 chunking strategies (sentence, paragraph, semantic, code, fixed)
Unpredictable API costs	Surprise bills, no cost governance	Pre-flight cost estimation across 15+ models
Oversized prompts	Wasted tokens, slow responses	4 compression strategies (extractive, truncate, middle-out, map-reduce)

Installation

pip install llm-context-forge

With API server support:

pip install "llm-context-forge[api]"

Quick Start

Token Counting

from llm_context_forge import TokenCounter

counter = TokenCounter("gpt-4o")
tokens = counter.count("Hello, world!")
print(f"Tokens: {tokens}")  # Tokens: 4

# Check context window fit
fits = counter.fits_in_window("Your prompt...", reserve_output=500)

# Estimate cost before sending
cost = counter.estimate_cost("Your prompt...", direction="input")
print(f"Cost: ${cost:.6f}")

Intelligent Chunking

from llm_context_forge import DocumentChunker, ChunkStrategy

chunker = DocumentChunker("gpt-4o")

# Chunk respecting paragraph boundaries
chunks = chunker.chunk(
    long_document,
    strategy=ChunkStrategy.PARAGRAPH,
    max_tokens=500,
    overlap_tokens=50,
)

# Specialized chunkers
code_chunks = chunker.chunk_code(source_code, language="python")
md_chunks = chunker.chunk_markdown(readme_text)

Priority-Based Context Assembly

The core pattern for RAG applications — guarantee critical context fits while gracefully dropping lower-priority content:

from llm_context_forge import ContextWindow, Priority

window = ContextWindow("gpt-4o")

# System instructions — always included
window.add_block("You are a legal assistant.", Priority.CRITICAL, "system")

# User query — high priority
window.add_block("What is the statute of limitations?", Priority.HIGH, "query")

# RAG search results — included if space permits
window.add_block(search_result_1, Priority.MEDIUM, "rag_1")
window.add_block(search_result_2, Priority.LOW, "rag_2")

# Assemble: packs highest-priority blocks first
prompt = window.assemble(max_tokens=4096)

# See what was included/dropped
usage = window.usage()
print(f"Included: {usage['num_included']} blocks ({usage['included_tokens']} tokens)")
print(f"Dropped:  {usage['num_excluded']} blocks")

Cost Estimation

from llm_context_forge import CostCalculator

calc = CostCalculator("gpt-4o")

# Single prompt cost
cost = calc.estimate_prompt("Your prompt text here")
print(f"Input cost: ${cost.usd:.6f}")

# Compare models
comparison = calc.compare_models(
    texts=["Document 1...", "Document 2..."],
    models=["gpt-4o", "gpt-4o-mini", "claude-3.5-sonnet", "gemini-flash"],
)
for model, analysis in comparison.items():
    print(f"{model}: ${analysis.total_usd:.6f} for {analysis.total_tokens} tokens")

Context Compression

from llm_context_forge import ContextCompressor, CompressionStrategy

compressor = ContextCompressor("gpt-4o")

# Extractive: keeps most important sentences via TF-IDF scoring
result = compressor.compress(long_text, target_tokens=200)
print(f"Compressed: {result.original_tokens} → {result.compressed_tokens} tokens")
print(f"Savings: {result.savings_pct:.1f}%")

# Middle-out: preserves start and end, removes middle
result = compressor.compress(log_text, target_tokens=300, strategy=CompressionStrategy.MIDDLE_OUT)

Conversation Management

from llm_context_forge import ConversationManager

manager = ConversationManager("gpt-4o")

manager.add_message("system", "You are a helpful Python tutor.")
manager.add_message("user", "Explain decorators")
manager.add_message("assistant", "Decorators are...")
# ... many more turns ...

# Auto-trim older messages to fit budget, preserving system prompt
trimmed = manager.get_context(max_tokens=4096, preserve_system=True)

Supported Models

Provider	Models	Token Counting	Pricing
OpenAI	GPT-4, GPT-4 Turbo, GPT-4o, GPT-4o-mini, GPT-3.5 Turbo	✅ `tiktoken`	✅
Anthropic	Claude 3 Opus, Claude 3.5 Sonnet, Claude 3 Haiku	✅ `anthropic`	✅
Google	Gemini Pro, Gemini Flash	✅ `transformers`	✅
Meta	Llama 3 8B, Llama 3 70B, Llama 3.1 405B	✅ `transformers`	✅
Mistral	Mistral Large	✅ `mistral-common`	✅
Cohere	Command R+	✅ `transformers`	✅

Production-Grade Tokenizer Fallback

In production environments, external tokenizer packages (transformers, mistral-common) might fail to download or initialize due to network errors. llm-context-forge provides a robust, production-grade fallback:

If a native tokenizer fails to load, the system degrades to OpenAI's fast cl100k_base (tiktoken).
Since most modern LLMs utilize similar Byte-Pair Encoding (BPE), cl100k_base offers a highly accurate baseline.
llm-context-forge automatically applies structural safety multipliers (e.g. 1.05x) specifically tuned to each backend before throwing an overflow warning.
A one-time warning is emitted via standard python logging to notify infrastructure teams of the fallback engagement.

from llm_context_forge import ModelRegistry, ModelInfo, TokenizerBackend

ModelRegistry.register(ModelInfo(
    name="my-fine-tuned-model",
    backend=TokenizerBackend.OPENAI,
    context_window=16_384,
    encoding_name="cl100k_base",
    input_cost_per_1k=0.002,
    output_cost_per_1k=0.006,
))

CLI

# Count tokens
llm_context_forge count "Hello world" --model gpt-4o

# Chunk a document
llm_context_forge chunk document.md --strategy semantic --max-tokens 500

# Estimate cost
llm_context_forge cost document.txt --model claude-3.5-sonnet

# List all models
llm_context_forge models

# Health check
llm_context_forge doctor

# Start API server
llm_context_forge serve --port 8000

# Interactive demo
llm_context_forge demo

REST API

Start the server and access interactive docs at http://localhost:8000/docs:

pip install "llm_context_forge[api]"
llm_context_forge serve

Endpoint	Method	Description
`/health`	GET	System health + version
`/api/v1/tokens/count`	POST	Count tokens
`/api/v1/tokens/validate`	POST	Check context window fit
`/api/v1/chunks/`	POST	Chunk text
`/api/v1/context/assemble`	POST	Priority-based assembly
`/api/v1/compress/`	POST	Compress text
`/api/v1/cost/estimate`	POST	Estimate cost

Architecture

llm_context_forge/
├── models.py        # Model registry (15+ models, pricing, backends)
├── tokenizer.py     # Multi-provider token counter (tiktoken + heuristics)
├── chunker.py       # 5-strategy document chunker with overlap
├── context.py       # Priority-based context assembly + conversation manager
├── compressor.py    # 4-strategy compression engine (TF-IDF, middle-out, etc.)
├── cost.py          # Cost estimation engine with model comparison
├── cli/main.py      # Typer CLI with Rich output
└── api/             # FastAPI server with versioned routes

Docker

docker build -t llm_context_forge .
docker-compose up

Development

git clone https://github.com/dhruv-atomic-mui21/llm_context_forge.git
cd llm_context_forge
pip install -e ".[dev]"
pytest

Contributing

See CONTRIBUTING.md for development workflow guidelines.

License

MIT — see LICENSE.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

0.1.5

Apr 9, 2026

0.1.1

Apr 6, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

llm_context_forge-0.1.5.tar.gz (36.5 kB view details)

Uploaded Apr 9, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

llm_context_forge-0.1.5-py3-none-any.whl (32.7 kB view details)

Uploaded Apr 9, 2026 Python 3

File details

Details for the file llm_context_forge-0.1.5.tar.gz.

File metadata

Download URL: llm_context_forge-0.1.5.tar.gz
Upload date: Apr 9, 2026
Size: 36.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for llm_context_forge-0.1.5.tar.gz
Algorithm	Hash digest
SHA256	`609cdf8e868e7e2119f5d307d1773cc59b811b5606b9f4df933306d7ce94ff78`
MD5	`6865a6c0ac23f9c19f60fde045a02e25`
BLAKE2b-256	`7db7b2ab2b349a643dae947a5021c14003cbd420063d78a09cce774ded12d5fe`

See more details on using hashes here.

File details

Details for the file llm_context_forge-0.1.5-py3-none-any.whl.

File metadata

Download URL: llm_context_forge-0.1.5-py3-none-any.whl
Upload date: Apr 9, 2026
Size: 32.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for llm_context_forge-0.1.5-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b4916f96da412384306ab7e64570604f643cadfbf10fca31d4b19f8f82f0dff5`
MD5	`a091d2c5091a37b3cdc930d7cd01fe5d`
BLAKE2b-256	`88c26e380a9228a8d133d6a095cb747548d72c2db8d91c9d6cca4d5669132281`

See more details on using hashes here.

llm-context-forge 0.1.5

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

LLM Context Forge

Why LLM Context Forge?

Installation

Quick Start

Token Counting

Intelligent Chunking

Priority-Based Context Assembly

Cost Estimation

Context Compression

Conversation Management

Supported Models

Production-Grade Tokenizer Fallback

CLI

REST API

Architecture

Docker

Development

Contributing

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes