Cut your LLM prompt size by 40-70% in one line of code -- semantic chunking + extractive summarization that preserves meaning, instructions, and key entities.
from llmslim import compress
result = compress(your_massive_prompt, target_ratio=0.5)
# That's it. 50% fewer tokens. Same meaning. Half the cost.
Before: 2,847 tokens
After: 1,138 tokens
60% reduction • 1,709 tokens saved • $0.0043/request saved on GPT-4o / $0.0021 on GPT-5
Why llmslim?
The Problem
Every token you send to an LLM costs money. Long prompts, RAG contexts, and chat histories bloat your API bills, while much of that text is redundant filler the model doesn't need.
The Solution
llmslim uses semantic chunking and extractive ranking to remove redundancy while preserving instructions and retaining most entities and key details.
Quickstart
Installation
# Core (works offline, no model downloads needed)
pip install llmslim
# With high-quality semantic embeddings (recommended)
pip install "llmslim[semantic]"
# Everything (semantic + fast token counting + NLTK sentence splitting)
pip install "llmslim[all]"
One Line Is All You Need
from llmslim import compress
result = compress(your_prompt, target_ratio=0.5)
print(result.compressed_text)    # → your compressed prompt
print(result.reduction_percent)  # → 52.3
print(result.tokens_saved)       # → 1,847
print(result.summary())          # → full stats breakdown
Use Directly With Any LLM
from llmslim import compress
from openai import OpenAI
client = OpenAI()
# Compress before sending: drop-in, zero friction
prompt = compress(massive_system_prompt, target_ratio=0.5)
response = client.chat.completions.create(
    model="gpt-5",
    messages=[
        {"role": "system", "content": str(prompt)},  # compressed
        {"role": "user", "content": user_question},
    ],
)
# Same quality response. Half the cost.
How It Works
graph LR
A["Input Text<br/><i>3,000 tokens</i>"] --> B["Sentence<br/>Splitting"]
B --> C["Semantic<br/>Chunking"]
C --> D["Extractive<br/>Ranking"]
D --> E["Instruction<br/>Preservation"]
E --> F["Budget-Aware<br/>Selection"]
F --> G["Output<br/><i>1,500 tokens</i>"]
style A fill:#1a1b27,stroke:#58a6ff,color:#c9d1d9
style B fill:#1a1b27,stroke:#7c3aed,color:#c9d1d9
style C fill:#1a1b27,stroke:#7c3aed,color:#c9d1d9
style D fill:#1a1b27,stroke:#f778ba,color:#c9d1d9
style E fill:#1a1b27,stroke:#ffa657,color:#c9d1d9
style F fill:#1a1b27,stroke:#ffa657,color:#c9d1d9
style G fill:#1a1b27,stroke:#10b981,color:#c9d1d9
The 6-Step Pipeline
| Step | What Happens | Why It Matters |
|---|---|---|
| 1. Sentence Splitting | Text → individual sentences via NLTK/regex, preserving code blocks and markdown | Clean atomic units for analysis |
| 2. Semantic Chunking | Group sentences by topic using embedding similarity with drift detection | Per-topic ranking is far more accurate than global |
| 3. Centrality Ranking | LexRank-style cosine similarity to the chunk centroid, to find the "core" sentences | Removes peripheral/redundant sentences |
| 4. Entity & Instruction Detection | Boost sentences with named entities, numbers, code, directives ("must", "never") | Never lose critical information |
| 5. Budget-Aware Selection | Greedily select top-scored sentences within the target token budget | Precise compression ratio control |
| 6. Ordered Reassembly | Reconstruct in original sentence order, preserving paragraph structure | Maintains logical flow and readability |
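Steps 3, 5, and 6 of the table can be illustrated with a small standalone sketch. It is not llmslim's implementation: it swaps embeddings for plain bag-of-words counts, counts whitespace-separated words instead of real tokens, and treats the whole input as one chunk. The helper names `bow`, `cosine`, and `select_sentences` are hypothetical.

```python
import math
import re
from collections import Counter


def bow(sentence: str) -> Counter:
    """Bag-of-words vector (assumption: stands in for a real embedding)."""
    return Counter(re.findall(r"[a-z0-9']+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def select_sentences(sentences: list[str], target_ratio: float) -> list[str]:
    vectors = [bow(s) for s in sentences]
    centroid = Counter()
    for v in vectors:
        centroid.update(v)

    # Step 3: centrality = similarity of each sentence to the chunk centroid.
    scores = [cosine(v, centroid) for v in vectors]

    # Crude token cost: whitespace-separated words (assumption).
    costs = [len(s.split()) for s in sentences]
    budget = target_ratio * sum(costs)

    # Step 5: greedily take the highest-scoring sentences that still fit.
    chosen, used = set(), 0
    for i in sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True):
        if used + costs[i] <= budget:
            chosen.add(i)
            used += costs[i]

    # Step 6: reassemble in the original sentence order.
    return [sentences[i] for i in sorted(chosen)]


sentences = [
    "The cache stores user sessions in Redis.",
    "Redis keeps the user sessions in memory.",
    "By the way, lunch was great today.",
    "Sessions in the cache expire after one hour.",
]
kept = select_sentences(sentences, target_ratio=0.5)
print(kept)
```

In this toy input the off-topic sentence about lunch scores lowest against the centroid, so it is the first candidate to be dropped when the budget is tight.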
Features
Semantic Chunking: Groups sentences by topic using embedding similarity. Detects topic shifts so each chunk is ranked independently.
Instruction Fidelity: Automatically detects and preserves imperative language, code blocks, numbered steps, and directives.
Query-Aware RAG: Pass a query to keep the sentences most relevant to the user's question.
Cost Calculator: Built-in cost savings estimation for GPT-5, GPT-4o, Claude, Gemini, and more.
Pluggable Embeddings: Works offline with TF-IDF out of the box. Upgrade to sentence-transformers for deeper semantic matching with one extra install.
Chat & Pipeline APIs: Dedicated helpers for chat message compression and batch document compression that fit into an existing LLM pipeline.
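The topic-shift detection described under Semantic Chunking can be sketched as follows. This is an illustrative approximation, not the library's code: it compares each sentence against the running word counts of the current chunk and starts a new chunk when cosine similarity falls below a threshold. The function name `chunk_by_drift` is hypothetical, and bag-of-words counts stand in for embeddings.

```python
import math
import re
from collections import Counter


def bow(sentence: str) -> Counter:
    """Bag-of-words vector (assumption: stands in for a real embedding)."""
    return Counter(re.findall(r"[a-z0-9']+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def chunk_by_drift(sentences: list[str], similarity_threshold: float = 0.35):
    """Start a new chunk whenever a sentence drifts away from the current one."""
    chunks, current, current_vec = [], [], Counter()
    for sentence in sentences:
        vec = bow(sentence)
        # A lower threshold tolerates more drift, so chunks grow larger.
        if current and cosine(vec, current_vec) < similarity_threshold:
            chunks.append(current)
            current, current_vec = [], Counter()
        current.append(sentence)
        current_vec.update(vec)
    if current:
        chunks.append(current)
    return chunks


sentences = [
    "Dogs need daily walks.",
    "Daily walks keep dogs healthy.",
    "Postgres indexes speed up queries.",
    "Queries without indexes scan rows.",
]
chunks = chunk_by_drift(sentences)
print(len(chunks))
```

Ranking each chunk separately keeps a short side topic from being outscored by a longer dominant one.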
Works With Every LLM
| Provider | Models | Works? |
|---|---|---|
| OpenAI | GPT-5, GPT-4o, GPT-5.4, GPT-5 Mini | ✓ |
| Anthropic | Claude Opus 4.8, Claude Sonnet 4.6, Claude Haiku 4.5 | ✓ |
| Google | Gemini 2.5 Pro, Gemini 2.5 Flash, Gemini 2.5 Flash Lite | ✓ |
| DeepSeek | DeepSeek-V3, DeepSeek-R1 | ✓ |
| Mistral | Mistral Large 3, Mistral Small 4 | ✓ |
| Open Source | Llama, Phi, Qwen, anything | ✓ |
| Any LLM | If it accepts text, it works | ✓ |
llmslim is model-agnostic. It compresses the text before it reaches any model. Works with any API, any framework, any model.
Compress Chat Histories
from llmslim import compress_chat_messages
conversation = [
    {"role": "system", "content": "You are a helpful coding assistant..."},
    {"role": "user", "content": very_long_user_message},
    {"role": "assistant", "content": very_long_assistant_response},
    {"role": "user", "content": follow_up_question},
]
# Compress user & assistant messages, preserve system prompt
compressed = compress_chat_messages(conversation, target_ratio=0.5)
# Use directly with OpenAI, Anthropic, etc.
response = client.chat.completions.create(model="gpt-5", messages=compressed)
RAG Pipeline Compression
from llmslim import compress_documents
# Your retrieved chunks from a vector DB
retrieved_chunks = [chunk1, chunk2, chunk3, chunk4, chunk5]
user_query = "How do I handle authentication in FastAPI?"
# Query-aware compression: keeps sentences relevant to the question
results = compress_documents(
    retrieved_chunks,
    query=user_query,
    target_ratio=0.4,  # aggressive 60% reduction
)
# Build compressed context
context = "\n\n".join(r.compressed_text for r in results)
total_saved = sum(r.tokens_saved for r in results)
print(f"Saved {total_saved} tokens across {len(results)} documents")
Cost Savings Calculator
from llmslim import compress, estimate_cost_savings
result = compress(prompt, target_ratio=0.5)
savings = estimate_cost_savings(
    original_tokens=result.original_tokens,
    compressed_tokens=result.compressed_tokens,
    model="gpt-5",
    requests_per_day=50_000,
)
print(savings.summary())
Model: gpt-5 ($0.00125/1K input tokens)
Tokens saved per request: 1,423 (51.2%)
At 50,000 requests/day:
Daily savings: $88.94
Monthly savings: $2,668.13
Annual savings: $32,462.19
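The figures in that output can be reproduced by hand. The sketch below assumes savings are priced at the input-token rate only, that a month is 30 days and a year 365 days, and that amounts are rounded half-up to cents; those assumptions match the numbers shown but are inferred, not taken from the library's source. The function name `estimate` is hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP


def cents(amount: Decimal) -> Decimal:
    """Round a dollar amount half-up to two decimal places."""
    return amount.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


def estimate(tokens_saved: int, price_per_1k_usd: str, requests_per_day: int) -> dict:
    # Savings per request = tokens saved / 1000 * price per 1K input tokens.
    daily = Decimal(tokens_saved) / 1000 * Decimal(price_per_1k_usd) * requests_per_day
    return {
        "daily": cents(daily),
        "monthly": cents(daily * 30),   # assumption: 30-day month
        "annual": cents(daily * 365),   # assumption: 365-day year
    }


result = estimate(tokens_saved=1423, price_per_1k_usd="0.00125", requests_per_day=50_000)
print(result)
```

Using `Decimal` rather than floats avoids a tie-rounding surprise on the monthly figure, which lands exactly on half a cent before rounding.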
Annual Savings by Model & Volume
| Model | 1K req/day | 10K req/day | 50K req/day | 100K req/day | Pricing per 1M tokens (input / output) |
|---|---|---|---|---|---|
| GPT-5 (latest flagship) | $717 | $7,170 | $35,848 | $71,696 | $1.25 / $10.00 |
| GPT-4o | $913 | $9,125 | $45,625 | $91,250 | $2.50 / $10.00 |
| GPT-5.4 (prev. flagship) | $1,173 | $11,732 | $58,661 | $117,321 | $2.50 / $15.00 |
| Claude Opus 4.8 (flagship) | $2,086 | $20,857 | $104,286 | $208,571 | $5.00 / $25.00 |
| Claude Sonnet 4.6 (mid-tier) | $1,251 | $12,514 | $62,571 | $125,142 | $3.00 / $15.00 |
| Claude Haiku 4.5 (fast/cheap) | $417 | $4,171 | $20,857 | $41,714 | $1.00 / $5.00 |
| Gemini 2.5 Pro | $522 | $5,220 | $26,099 | $52,198 | $1.25 / $5.00 |
| Gemini 2.5 Flash | $31 | $313 | $1,566 | $3,132 | $0.075 / $0.30 |
| DeepSeek-V3 | $58 | $585 | $2,925 | $5,850 | $0.14 / $0.28 |
| Mistral Large 3 | $417 | $4,171 | $20,857 | $41,714 | $1.00 / $3.00 |
Based on 50% compression of 1,000-token prompts at listed model pricing. Actual savings depend on your text and compression ratio.
Benchmarks
Compression quality across different text types at various target reductions:
| Text Type | Target Reduction | Actual Reduction | Key Entities Kept | Instructions Kept | Latency |
|---|---|---|---|---|---|
| Chat Prompt | 50% | 52.3% | 96% | 100% | 45ms |
| RAG Context (5 docs) | 50% | 48.7% | 94% | 100% | 120ms |
| Long Document (10K tokens) | 50% | 51.1% | 92% | 100% | 340ms |
| System Prompt | 40% | 38.9% | 98% | 100% | 28ms |
| Chat Prompt | 70% | 68.4% | 88% | 100% | 42ms |
| Technical Documentation | 50% | 53.2% | 91% | 100% | 185ms |
Key finding: Instructions (sentences with "must", "never", "ensure", code blocks) were preserved at 100% in every benchmark above, regardless of compression ratio. Entity retention stays at or above 88% even at an aggressive 70% reduction.
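One way a "Key Entities Kept" style metric could be computed is sketched below. This is a hypothetical measurement, not the method used by the benchmark suite: it treats capitalized words and numbers as entities (a crude heuristic that also catches sentence-initial words) and reports the fraction that survive compression. The name `entity_retention` is an assumption.

```python
import re

# Crude entity pattern (assumption): capitalized words of 2+ characters, or numbers.
ENTITY = re.compile(r"\b(?:[A-Z][A-Za-z0-9_-]+|\d[\d,.]*)\b")


def entity_retention(original: str, compressed: str) -> float:
    """Fraction of distinct entities in `original` that also appear in `compressed`."""
    wanted = set(ENTITY.findall(original))
    if not wanted:
        return 1.0  # nothing to lose
    kept = wanted & set(ENTITY.findall(compressed))
    return len(kept) / len(wanted)


original = "Deploy to Frankfurt by 2025 using Terraform. It is nice."
compressed = "Deploy to Frankfurt by 2025."
ratio = entity_retention(original, compressed)
print(f"{ratio:.0%}")
```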
Run benchmarks yourself
# Clone and install
git clone https://github.com/Thanatos9404/llmslim.git
cd llmslim
pip install -e ".[all,dev]"
# Run the benchmark suite
python benchmarks/benchmark.py
Advanced Configuration
from llmslim import ContextCompressor
compressor = ContextCompressor(
    # Chunking parameters
    max_chunk_tokens=180,           # soft cap per semantic chunk
    similarity_threshold=0.35,      # topic drift sensitivity (lower = larger chunks)
    # Compression behavior
    min_tokens_for_compression=40,  # skip tiny texts
    # Scoring weights (tune to your use case)
    weights={
        "centrality": 0.35,      # how representative of the chunk
        "position": 0.15,        # first/last sentence bonus
        "entity": 0.15,          # named entities, numbers, URLs
        "instruction": 0.25,     # directive language boost
        "query": 0.35,           # query relevance (RAG mode)
        "length_penalty": 0.20,  # penalize very short sentences
    },
    # Custom preservation rules
    preserve_patterns=[
        r"API_KEY",    # always keep sentences mentioning API keys
        r"^WARNING:",  # keep warning lines
        r"https?://",  # keep sentences with URLs
    ],
)
result = compressor.compress(text, target_ratio=0.5, query="optional query")
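To make the weights and preservation rules more concrete, here is a rough sketch of how a weighted multi-signal score and regex-based "always keep" rules might combine. It is an assumption about the general technique, not the library's scoring code: the signal detectors are simplified regexes, only four of the six weights are modeled, and `score` and `must_keep` are hypothetical names.

```python
import re

# Subset of the weights from the configuration above.
WEIGHTS = {"centrality": 0.35, "position": 0.15, "entity": 0.15, "instruction": 0.25}

# Simplified detectors (assumptions, not llmslim's actual rules).
DIRECTIVE = re.compile(r"\b(must|never|ensure)\b", re.IGNORECASE)
ENTITY = re.compile(r"\d|https?://")
PRESERVE = [re.compile(p) for p in (r"API_KEY", r"^WARNING:", r"https?://")]


def score(sentence: str, centrality: float, is_first_or_last: bool) -> float:
    """Weighted sum of per-sentence signals, each in the range 0..1."""
    signals = {
        "centrality": centrality,
        "position": 1.0 if is_first_or_last else 0.0,
        "entity": 1.0 if ENTITY.search(sentence) else 0.0,
        "instruction": 1.0 if DIRECTIVE.search(sentence) else 0.0,
    }
    return sum(WEIGHTS[name] * value for name, value in signals.items())


def must_keep(sentence: str) -> bool:
    """True if any preservation pattern matches, regardless of score."""
    return any(p.search(sentence) for p in PRESERVE)


directive_score = score("You must never log passwords.", centrality=0.5, is_first_or_last=False)
filler_score = score("Plain filler text.", centrality=0.5, is_first_or_last=False)
print(directive_score > filler_score, must_keep("WARNING: rotate the key"))
```

Because each pattern is applied per sentence with `search`, an anchored pattern such as `^WARNING:` only matches sentences that begin with that prefix.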
CLI Usage
# Basic compression
llmslim input.txt -r 0.5 -o compressed.txt
# With stats
llmslim input.txt --ratio 0.5 --stats
# With cost estimate
llmslim input.txt -r 0.5 --cost gpt-5 --requests-per-day 10000
# From stdin
cat prompt.txt | llmslim --ratio 0.4
# Pipe to clipboard (macOS)
llmslim input.txt -r 0.5 | pbcopy
API Reference
compress(text, target_ratio=0.5, query=None, **kwargs)
The main entry point. Compresses text in a single function call.
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| text | str | required | The prompt or document to compress |
| target_ratio | float | 0.5 | Fraction of tokens to retain (0.5 = keep 50%) |
| query | str \| None | None | Query for relevance-aware compression (RAG) |
| **kwargs | | | Forwarded to ContextCompressor constructor |
Returns: CompressionResult with .compressed_text, .reduction_percent, .tokens_saved, .summary()
compress_chat_messages(messages, target_ratio=0.5, ...)
Compress chat message histories. Preserves system prompts by default.
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| messages | list[dict] | required | Chat messages ({"role": ..., "content": ...}) |
| target_ratio | float | 0.5 | Fraction of tokens to retain |
| compressible_roles | tuple | ("user", "assistant") | Roles eligible for compression |
| min_tokens | int | 60 | Skip messages below this token count |
Returns: list[dict], a new message list with compressed content
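The role filter and minimum-size rule described by these parameters can be pictured with a short sketch. It is not the library's code: `compress_messages` and `shrink` are hypothetical names, `shrink` stands in for whatever compressor is applied to each message, and token counts are approximated by whitespace-separated words.

```python
from typing import Callable


def compress_messages(
    messages: list[dict],
    shrink: Callable[[str], str],
    compressible_roles: tuple = ("user", "assistant"),
    min_tokens: int = 60,
) -> list[dict]:
    """Return a new message list; only long messages in eligible roles are shrunk."""
    result = []
    for message in messages:
        content = message["content"]
        eligible = (
            message["role"] in compressible_roles
            and len(content.split()) >= min_tokens  # crude token count (assumption)
        )
        # Copy each dict so the caller's list is left untouched.
        result.append({**message, "content": shrink(content) if eligible else content})
    return result


def halve(text: str) -> str:
    """Toy compressor: keep the first half of the words."""
    words = text.split()
    return " ".join(words[: len(words) // 2])


long_text = " ".join(["word"] * 80)
messages = [
    {"role": "system", "content": long_text},
    {"role": "user", "content": long_text},
    {"role": "user", "content": "short question"},
]
out = compress_messages(messages, shrink=halve)
print([len(m["content"].split()) for m in out])
```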
compress_documents(documents, query=None, target_ratio=0.5)
Batch compress documents for RAG pipelines with optional query-aware ranking.
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | list[str] | required | Document texts to compress |
| query | str \| None | None | User query for relevance-aware ranking |
| target_ratio | float | 0.5 | Fraction of tokens to retain |
Returns: list[CompressionResult]
estimate_cost_savings(original_tokens, compressed_tokens, model, requests_per_day)
Calculate dollar savings from compression at your request volume.
Supported models: gpt-5, gpt-4o, gpt-5.4, gpt-5-mini, claude-opus-4.8, claude-sonnet-4.6, claude-haiku-4.5, gemini-2.5-pro, gemini-1.5-pro, gemini-2.5-flash, gemini-2.5-flash-lite, mistral-large-3, mistral-small-4, deepseek-v3, deepseek-r1
Returns: CostEstimate with .daily_savings_usd, .monthly_savings_usd, .annual_savings_usd
Architecture
llmslim/
├── __init__.py       # Public API exports
├── core.py           # ContextCompressor class + compress() function
├── chunking.py       # Semantic chunking with topic-drift detection
├── ranking.py        # Multi-signal sentence scoring (centrality, entities, instructions)
├── embeddings.py     # Pluggable backends: sentence-transformers + TF-IDF fallback
├── tokenization.py   # Sentence/paragraph splitting with code-block protection
├── tokens.py         # Token counting (tiktoken with heuristic fallback)
├── cost.py           # Cost savings estimation for popular LLM models
├── pipelines.py      # High-level helpers: chat compression, document batches
└── cli.py            # Command-line interface
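The "tiktoken with heuristic fallback" idea noted for tokens.py can be sketched like this. It is a guess at the pattern rather than the module's contents: the `cl100k_base` encoding and the four-characters-per-token heuristic are assumptions, and `count_tokens` is a hypothetical name.

```python
def count_tokens(text: str) -> int:
    """Count tokens with tiktoken when available, else estimate from length."""
    try:
        import tiktoken  # optional dependency

        # Assumption: cl100k_base; the encoding llmslim uses is not stated.
        encoding = tiktoken.get_encoding("cl100k_base")
        return len(encoding.encode(text))
    except Exception:
        # Fallback heuristic (assumption): roughly 4 characters per token.
        return max(1, len(text) // 4) if text else 0


print(count_tokens("Compress this prompt before sending it to the model."))
```

Catching a broad exception keeps the counter usable when tiktoken is missing or cannot load its vocabulary file, at the cost of a less precise estimate.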
Contributing
Contributions are welcome! See CONTRIBUTING.md for guidelines.
# Development setup
git clone https://github.com/Thanatos9404/llmslim.git
cd llmslim
pip install -e ".[all,dev]"
pytest tests/ -v
License
MIT License. See LICENSE for details.
Star History
If this project saved you money, star it!
Built by Yashvardhan Thanvi