
Cut your LLM prompt size by 40-70% in one line of code -- semantic chunking + extractive summarization that preserves meaning, instructions, and key entities.


from llmslim import compress

result = compress(your_massive_prompt, target_ratio=0.5)
# That's it. 50% fewer tokens. Same meaning. Half the cost. 🚀


โŒ Before (2,847 tokens โ†’ $$$)

You are an AI assistant that helps users with their
coding questions. You should be helpful, harmless,
and honest. When answering questions, you should
provide detailed explanations with code examples
where appropriate. Make sure to consider edge cases
and provide best practices. If you're not sure about
something, say so rather than making things up.
Please format your responses using markdown for
better readability. Include relevant links to
documentation when possible. Always test your code
before sharing it. Remember to handle errors
gracefully and explain your reasoning step by step...

[... 200 more lines of context ...]

✅ After (1,138 tokens → 💰)

You are an AI assistant for coding questions.
Be helpful, harmless, honest. Provide detailed
explanations with code examples. Consider edge
cases and best practices. If unsure, say so.
Format responses in markdown. Include documentation
links. Always test code before sharing. Handle
errors gracefully, explain reasoning step by step.

[... compressed with meaning preserved ...]

📉 60% reduction • 1,709 tokens saved • $0.0043/request saved on GPT-4o / $0.0021 on GPT-5


🎯 Why llmslim?

😤 The Problem

Every token you send to an LLM costs money. Long prompts, RAG contexts, and chat histories bloat your API bills while most of the text is redundant filler that the model doesn't need.

  • 💸 GPT-4o costs $2.50/M input tokens (GPT-5 costs $1.25/M)
  • 📊 Average prompt has 40-60% redundancy
  • 🔄 Chat histories grow unbounded
  • 📄 RAG contexts are mostly noise

🎉 The Solution

llmslim uses semantic understanding to remove redundancy while keeping instructions intact and retaining most entities and key details.

  • ⚡ One function call: compress(text)
  • 🧠 Semantic chunking: understands topics
  • 🎯 Smart ranking: keeps what matters
  • 🔒 Instruction preservation: never drops directives
  • 💰 Save 40-70% on every API call

⚡ Quickstart

Installation

# Core (works offline, no model downloads needed)
pip install llmslim

# With high-quality semantic embeddings (recommended)
pip install "llmslim[semantic]"

# Everything (semantic + fast token counting + NLTK sentence splitting)
pip install "llmslim[all]"

One Line Is All You Need

from llmslim import compress

result = compress(your_prompt, target_ratio=0.5)

print(result.compressed_text)      # → your compressed prompt
print(result.reduction_percent)    # → 52.3
print(result.tokens_saved)         # → 1847
print(result.summary())            # → full stats breakdown

Use Directly With Any LLM

from llmslim import compress
from openai import OpenAI

client = OpenAI()

# Compress before sending: drop-in, zero friction
prompt = compress(massive_system_prompt, target_ratio=0.5)

response = client.chat.completions.create(
    model="gpt-5",
    messages=[
        {"role": "system", "content": str(prompt)},  # ← compressed!
        {"role": "user", "content": user_question},
    ],
)
# Comparable response quality at roughly half the input-token cost for the system prompt.

🧠 How It Works

graph LR
    A["📝 Input Text<br/><i>3,000 tokens</i>"] --> B["✂️ Sentence<br/>Splitting"]
    B --> C["🧩 Semantic<br/>Chunking"]
    C --> D["📊 Extractive<br/>Ranking"]
    D --> E["🔒 Instruction<br/>Preservation"]
    E --> F["🎯 Budget-Aware<br/>Selection"]
    F --> G["✨ Output<br/><i>1,500 tokens</i>"]

    style A fill:#1a1b27,stroke:#58a6ff,color:#c9d1d9
    style B fill:#1a1b27,stroke:#7c3aed,color:#c9d1d9
    style C fill:#1a1b27,stroke:#7c3aed,color:#c9d1d9
    style D fill:#1a1b27,stroke:#f778ba,color:#c9d1d9
    style E fill:#1a1b27,stroke:#ffa657,color:#c9d1d9
    style F fill:#1a1b27,stroke:#ffa657,color:#c9d1d9
    style G fill:#1a1b27,stroke:#10b981,color:#c9d1d9

The 6-Step Pipeline

| Step | What Happens | Why It Matters |
|---|---|---|
| 1. Sentence Splitting | Text → individual sentences via NLTK/regex, preserving code blocks and markdown | Clean atomic units for analysis |
| 2. Semantic Chunking | Group sentences by topic using embedding similarity with drift detection | Per-topic ranking is far more accurate than global ranking |
| 3. Centrality Ranking | LexRank-style cosine similarity to the chunk centroid, to find the "core" sentences | Removes peripheral/redundant sentences |
| 4. Entity & Instruction Detection | Boost sentences with named entities, numbers, code, directives ("must", "never") | Never lose critical information |
| 5. Budget-Aware Selection | Greedily select top-scored sentences within the target token budget | Precise compression ratio control |
| 6. Ordered Reassembly | Reconstruct in original sentence order, preserving paragraph structure | Maintains logical flow and readability |
🔥 Features

🎯 Semantic Chunking

Groups sentences by topic using embedding similarity. Detects topic shifts so each chunk is ranked independently for maximum accuracy.
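
One way topic-shift detection could work is sketched below. This is an illustration under stated assumptions, not llmslim's actual algorithm: a sentence joins the current chunk while its similarity to the chunk's running centroid stays at or above a threshold, and otherwise starts a new chunk. Bag-of-words counts stand in for embeddings, and `chunk_by_drift` is a hypothetical name. The default of 0.35 mirrors the `similarity_threshold` shown under Advanced Configuration.

```python
import math
import re
from collections import Counter


def bow(sentence):
    """Bag-of-words vector used here in place of a real embedding."""
    return Counter(re.findall(r"[a-z0-9']+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def chunk_by_drift(sentences, similarity_threshold=0.35):
    """Group consecutive sentences; open a new chunk when the topic drifts."""
    chunks, centroid = [], Counter()
    for sentence in sentences:
        vec = bow(sentence)
        if chunks and cosine(vec, centroid) >= similarity_threshold:
            chunks[-1].append(sentence)   # same topic: extend the current chunk
            centroid.update(vec)
        else:
            chunks.append([sentence])     # drift detected: start a new chunk
            centroid = Counter(vec)
    return chunks
```

A lower threshold lets more sentences join the current chunk, which yields fewer, larger chunks.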

🔒 Instruction Fidelity

Automatically detects and preserves imperative language, code blocks, numbered steps, and directives. Your instructions never get dropped.
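
A simple detector for this kind of content can be sketched with regular expressions. The patterns and the function name `looks_like_instruction` are assumptions chosen for illustration; the library's real rules may differ and are likely broader.

```python
import re

# Hypothetical patterns: directive keywords, numbered steps, and inline code.
DIRECTIVE_RE = re.compile(r"\b(must|never|always|ensure|do not|don't|should)\b", re.IGNORECASE)
NUMBERED_STEP_RE = re.compile(r"^\s*\d+[.)]\s+")
INLINE_CODE_RE = re.compile(r"`[^`]+`")


def looks_like_instruction(sentence: str) -> bool:
    """Return True if the sentence should be protected from removal."""
    return bool(
        DIRECTIVE_RE.search(sentence)
        or NUMBERED_STEP_RE.match(sentence)
        or INLINE_CODE_RE.search(sentence)
    )
```

A compressor could either exclude matching sentences from pruning entirely or give them a large score boost so they survive budget selection.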

📊 Query-Aware RAG

Pass a query parameter to favor sentences relevant to the user's question. This is well suited to compressing retrieved documents.
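
The idea behind query-aware ranking can be sketched as scoring each sentence by its similarity to the query. The example below is an assumption-laden illustration: it uses word overlap rather than embeddings, and `rank_by_query` is a hypothetical helper, not an llmslim function.

```python
import math
import re
from collections import Counter


def bow(text):
    """Bag-of-words vector used here in place of a real embedding."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_by_query(sentences, query):
    """Order sentences from most to least similar to the query."""
    query_vec = bow(query)
    return sorted(sentences, key=lambda s: cosine(bow(s), query_vec), reverse=True)
```

In a full pipeline this relevance score would be one signal among several, blended with centrality and instruction signals rather than used alone.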

💰 Cost Calculator

Built-in cost savings estimation for GPT-5, GPT-4o, Claude, Gemini, and more. Know exactly how much you're saving.

🔌 Pluggable Embeddings

Works offline with TF-IDF out of the box. Upgrade to sentence-transformers for deep semantic understanding with one extra install.

⚡ Chat & Pipeline APIs

Dedicated helpers for chat message compression and batch document compression that fit right into your existing LLM pipeline.


๐Ÿค Works With Every LLM

Provider Models Works?
OpenAI GPT-5, GPT-4o, GPT-5.4, GPT-5 Mini โœ…
Anthropic Claude Opus 4.8, Claude Sonnet 4.6, Claude Haiku 4.5 โœ…
Google Gemini 2.5 Pro, Gemini 2.5 Flash, Gemini 2.5 Flash Lite โœ…
DeepSeek DeepSeek-V3, DeepSeek-R1 โœ…
Mistral Mistral Large 3, Mistral Small 4 โœ…
Open Source Llama, Phi, Qwen, anything โœ…
Any LLM If it accepts text, it works โœ…

llmslim is model-agnostic. It compresses the text before it reaches any model. Works with any API, any framework, any model.


💬 Compress Chat Histories

from llmslim import compress_chat_messages

conversation = [
    {"role": "system", "content": "You are a helpful coding assistant..."},
    {"role": "user", "content": very_long_user_message},
    {"role": "assistant", "content": very_long_assistant_response},
    {"role": "user", "content": follow_up_question},
]

# Compress user & assistant messages, preserve system prompt
compressed = compress_chat_messages(conversation, target_ratio=0.5)

# Use directly with OpenAI, Anthropic, etc.
response = client.chat.completions.create(model="gpt-5", messages=compressed)

📚 RAG Pipeline Compression

from llmslim import compress_documents

# Your retrieved chunks from a vector DB
retrieved_chunks = [chunk1, chunk2, chunk3, chunk4, chunk5]
user_query = "How do I handle authentication in FastAPI?"

# Query-aware compression: keeps sentences relevant to the question
results = compress_documents(
    retrieved_chunks,
    query=user_query,
    target_ratio=0.4,  # aggressive 60% reduction
)

# Build compressed context
context = "\n\n".join(r.compressed_text for r in results)
total_saved = sum(r.tokens_saved for r in results)
print(f"Saved {total_saved} tokens across {len(results)} documents")

💰 Cost Savings Calculator

from llmslim import compress, estimate_cost_savings

result = compress(prompt, target_ratio=0.5)

savings = estimate_cost_savings(
    original_tokens=result.original_tokens,
    compressed_tokens=result.compressed_tokens,
    model="gpt-5",
    requests_per_day=50_000,
)

print(savings.summary())

Output:

Model: gpt-5 ($0.00125/1K input tokens)
Tokens saved per request: 1,423 (51.2%)
At 50,000 requests/day:
  Daily savings:   $88.94
  Monthly savings: $2,668.13
  Annual savings:  $32,462.19
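
The figures above follow from simple arithmetic on input-token pricing. The sketch below reproduces them, assuming a 30-day month and a 365-day year (inferred from the numbers shown, not taken from the library's source). `savings_usd` is a hypothetical helper, not the library's `estimate_cost_savings`.

```python
def savings_usd(tokens_saved_per_request, requests_per_day, price_per_million_input_usd):
    """Input-token savings in USD, assuming 30-day months and 365-day years."""
    tokens_per_day = tokens_saved_per_request * requests_per_day
    daily = tokens_per_day * price_per_million_input_usd / 1_000_000
    return {"daily": daily, "monthly": daily * 30, "annual": daily * 365}


# 1,423 tokens saved per request, 50,000 requests/day, $1.25 per 1M input tokens
s = savings_usd(1_423, 50_000, 1.25)
print(f"{s['daily']:.2f}")    # 88.94
print(s["monthly"])           # 2668.125, shown rounded as $2,668.13 above
print(f"{s['annual']:.2f}")   # 32462.19
```

Only input tokens are counted here; output-token cost is unaffected by prompt compression.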

💸 Annual Savings by Model & Volume

| Model | 1K req/day | 10K req/day | 50K req/day | 100K req/day | Pricing per 1M tokens (input / output) |
|---|---|---|---|---|---|
| GPT-5 (latest flagship) | $717 | $7,170 | $35,848 | $71,696 | $1.25 / $10.00 |
| GPT-4o | $913 | $9,125 | $45,625 | $91,250 | $2.50 / $10.00 |
| GPT-5.4 (prev. flagship) | $1,173 | $11,732 | $58,661 | $117,321 | $2.50 / $15.00 |
| Claude Opus 4.8 (flagship) | $2,086 | $20,857 | $104,286 | $208,571 | $5.00 / $25.00 |
| Claude Sonnet 4.6 (mid-tier) | $1,251 | $12,514 | $62,571 | $125,142 | $3.00 / $15.00 |
| Claude Haiku 4.5 (fast/cheap) | $417 | $4,171 | $20,857 | $41,714 | $1.00 / $5.00 |
| Gemini 2.5 Pro | $522 | $5,220 | $26,099 | $52,198 | $1.25 / $5.00 |
| Gemini 2.5 Flash | $31 | $313 | $1,566 | $3,132 | $0.075 / $0.30 |
| DeepSeek-V3 | $58 | $585 | $2,925 | $5,850 | $0.14 / $0.28 |
| Mistral Large 3 | $417 | $4,171 | $20,857 | $41,714 | $1.00 / $3.00 |

Based on 50% compression of 1,000-token prompts at listed model pricing. Actual savings depend on your text and compression ratio.


📊 Benchmarks

Compression quality across different text types at various target ratios:

| Text Type | Target Reduction | Actual Reduction | Key Entities Kept | Instructions Kept | Latency |
|---|---|---|---|---|---|
| Chat Prompt | 50% | 52.3% | 96% | 100% | 45ms |
| RAG Context (5 docs) | 50% | 48.7% | 94% | 100% | 120ms |
| Long Document (10K tokens) | 50% | 51.1% | 92% | 100% | 340ms |
| System Prompt | 40% | 38.9% | 98% | 100% | 28ms |
| Chat Prompt | 70% | 68.4% | 88% | 100% | 42ms |
| Technical Documentation | 50% | 53.2% | 91% | 100% | 185ms |

📌 Key finding: Instructions (sentences with "must", "never", "ensure", code blocks) were preserved at 100% in every benchmark, regardless of compression ratio. Entity retention stays at or above 88% even at an aggressive 70% reduction.

🔬 Run benchmarks yourself

# Clone and install
git clone https://github.com/Thanatos9404/llmslim.git
cd llmslim
pip install -e ".[all,dev]"

# Run the benchmark suite
python benchmarks/benchmark.py

🛠️ Advanced Configuration

from llmslim import ContextCompressor

compressor = ContextCompressor(
    # Chunking parameters
    max_chunk_tokens=180,          # soft cap per semantic chunk
    similarity_threshold=0.35,     # topic drift sensitivity (lower = larger chunks)

    # Compression behavior
    min_tokens_for_compression=40, # skip tiny texts

    # Scoring weights (tune to your use case)
    weights={
        "centrality": 0.35,        # how representative of the chunk
        "position": 0.15,          # first/last sentence bonus
        "entity": 0.15,            # named entities, numbers, URLs
        "instruction": 0.25,       # directive language boost
        "query": 0.35,             # query relevance (RAG mode)
        "length_penalty": 0.20,    # penalize very short sentences
    },

    # Custom preservation rules
    preserve_patterns=[
        r"API_KEY",                # always keep sentences mentioning API keys
        r"^WARNING:",              # keep warning lines
        r"https?://",              # keep sentences with URLs
    ],
)

result = compressor.compress(text, target_ratio=0.5, query="optional query")
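
How these weights are combined internally is not documented here. One plausible reading, shown purely as an assumption, is a weighted sum of per-sentence signals in the range 0 to 1, with the length penalty subtracted, plus a regex check that force-keeps sentences matching `preserve_patterns`. The names `combined_score` and `must_keep` are hypothetical.

```python
import re

# Weights copied from the configuration example above.
WEIGHTS = {
    "centrality": 0.35,
    "position": 0.15,
    "entity": 0.15,
    "instruction": 0.25,
    "query": 0.35,
    "length_penalty": 0.20,
}

# Patterns copied from the configuration example above.
PRESERVE_PATTERNS = [re.compile(p, re.MULTILINE) for p in (r"API_KEY", r"^WARNING:", r"https?://")]


def combined_score(signals, weights=WEIGHTS):
    """Assumed scheme: weighted sum of signals in [0, 1], minus the length penalty."""
    positive = sum(
        weights[name] * signals.get(name, 0.0)
        for name in weights
        if name != "length_penalty"
    )
    return positive - weights["length_penalty"] * signals.get("length_penalty", 0.0)


def must_keep(sentence):
    """True if any preservation pattern matches, regardless of score."""
    return any(pattern.search(sentence) for pattern in PRESERVE_PATTERNS)
```

Under this reading the weights do not need to sum to 1, because only the relative ordering of sentence scores matters for selection.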

🖥️ CLI Usage

# Basic compression
llmslim input.txt -r 0.5 -o compressed.txt

# With stats
llmslim input.txt --ratio 0.5 --stats

# With cost estimate
llmslim input.txt -r 0.5 --cost gpt-5 --requests-per-day 10000

# From stdin
cat prompt.txt | llmslim --ratio 0.4

# Pipe to clipboard (macOS)
llmslim input.txt -r 0.5 | pbcopy

📦 API Reference

compress(text, target_ratio=0.5, query=None, **kwargs)

The main entry point. Compresses text in a single function call.

Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| text | str | required | The prompt or document to compress |
| target_ratio | float | 0.5 | Fraction of tokens to retain (0.5 = keep 50%) |
| query | str \| None | None | Query for relevance-aware compression (RAG) |
| **kwargs | | | Forwarded to ContextCompressor constructor |

Returns: CompressionResult with .compressed_text, .reduction_percent, .tokens_saved, .summary()
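
As a rough illustration of how these values relate, the sketch below assumes that `tokens_saved` and `reduction_percent` are derived from the original and compressed token counts in the obvious way, and that `target_ratio` translates into a token budget by simple multiplication. Both helper names are hypothetical. The inputs are the Before/After and diagram figures from earlier in this README.

```python
def compression_stats(original_tokens, compressed_tokens):
    """Tokens saved and percentage reduction for one compression run."""
    tokens_saved = original_tokens - compressed_tokens
    reduction_percent = 100 * tokens_saved / original_tokens
    return tokens_saved, round(reduction_percent, 1)


def token_budget(original_tokens, target_ratio):
    """Number of tokens to retain: target_ratio is the fraction kept, not removed."""
    return int(original_tokens * target_ratio)


print(compression_stats(2_847, 1_138))  # (1709, 60.0)
print(token_budget(3_000, 0.5))         # 1500
```

Note the direction of the ratio: `target_ratio=0.4` keeps 40% of the tokens, which is a 60% reduction.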

compress_chat_messages(messages, target_ratio=0.5, ...)

Compress chat message histories. Preserves system prompts by default.

Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| messages | list[dict] | required | Chat messages ({"role": ..., "content": ...}) |
| target_ratio | float | 0.5 | Fraction of tokens to retain |
| compressible_roles | tuple | ("user", "assistant") | Roles eligible for compression |
| min_tokens | int | 60 | Skip messages below this token count |

Returns: list[dict], a new message list with compressed content
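
The interaction between `compressible_roles` and `min_tokens` can be pictured as a filter that decides which messages are candidates for compression. The sketch below is an assumption about that behavior, not the library's code: `compressible_indices` is a hypothetical helper, and it counts whitespace-separated words where the library would count model tokens.

```python
def compressible_indices(
    messages,
    compressible_roles=("user", "assistant"),
    min_tokens=60,
    count_tokens=lambda text: len(text.split()),  # word count as a token proxy
):
    """Indices of messages that are eligible for compression."""
    return [
        i
        for i, message in enumerate(messages)
        if message["role"] in compressible_roles
        and count_tokens(message["content"]) >= min_tokens
    ]


long_text = "word " * 100
conversation = [
    {"role": "system", "content": long_text},    # excluded: role not compressible
    {"role": "user", "content": long_text},      # eligible
    {"role": "assistant", "content": "Sure."},   # excluded: below min_tokens
]
print(compressible_indices(conversation))  # [1]
```

With the defaults, the system prompt passes through unchanged because its role is not in `compressible_roles`.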

compress_documents(documents, query=None, target_ratio=0.5)

Batch compress documents for RAG pipelines with optional query-aware ranking.

Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | list[str] | required | Document texts to compress |
| query | str \| None | None | User query for relevance-aware ranking |
| target_ratio | float | 0.5 | Fraction of tokens to retain |

Returns: list[CompressionResult]

estimate_cost_savings(original_tokens, compressed_tokens, model, requests_per_day)

Calculate dollar savings from compression at your request volume.

Supported models: gpt-5, gpt-4o, gpt-5.4, gpt-5-mini, claude-opus-4.8, claude-sonnet-4.6, claude-haiku-4.5, gemini-2.5-pro, gemini-1.5-pro, gemini-2.5-flash, gemini-2.5-flash-lite, mistral-large-3, mistral-small-4, deepseek-v3, deepseek-r1.5

Returns: CostEstimate with .daily_savings_usd, .monthly_savings_usd, .annual_savings_usd


🏗️ Architecture

llmslim/
├── __init__.py          # Public API exports
├── core.py              # ContextCompressor class + compress() function
├── chunking.py          # Semantic chunking with topic-drift detection
├── ranking.py           # Multi-signal sentence scoring (centrality, entities, instructions)
├── embeddings.py        # Pluggable backends: sentence-transformers + TF-IDF fallback
├── tokenization.py      # Sentence/paragraph splitting with code-block protection
├── tokens.py            # Token counting (tiktoken with heuristic fallback)
├── cost.py              # Cost savings estimation for popular LLM models
├── pipelines.py         # High-level helpers: chat compression, document batches
└── cli.py               # Command-line interface

🤝 Contributing

Contributions are welcome! See CONTRIBUTING.md for guidelines.

# Development setup
git clone https://github.com/Thanatos9404/llmslim.git
cd llmslim
pip install -e ".[all,dev]"
pytest tests/ -v

📄 License

MIT License. See LICENSE for details.


⭐ Star History

If this project saved you money, star it! ⭐



Built with ❤️ by Yashvardhan Thanvi


