TokenOptim

Reduce LLM costs by 10-30% through tokenizer-aware prompt compression. Works with any LLM provider.

If you find TokenOptim useful, consider giving it a star on GitHub — it helps others discover the project and motivates continued development.

Quick Start

pip install tokenoptim

Compress a prompt

import tokenoptim

result = tokenoptim.optimize("your long prompt here", model="gpt-4")
print(result.text)           # compressed text
print(result.savings_pct)    # e.g. 28.0
print(result.cost_saved_usd) # e.g. 0.000630
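
The savings fields can be read as simple arithmetic over token counts. A minimal sketch, assuming a flat per-1K-token input price; the `savings` helper and its `price_per_1k_usd` parameter are illustrative and not part of TokenOptim:

```python
def savings(original_tokens: int, compressed_tokens: int,
            price_per_1k_usd: float) -> tuple[int, float, float]:
    """Return (tokens_saved, savings_pct, cost_saved_usd)."""
    tokens_saved = original_tokens - compressed_tokens
    # Guard against empty prompts to avoid dividing by zero.
    pct = 100.0 * tokens_saved / original_tokens if original_tokens else 0.0
    cost = tokens_saved / 1000 * price_per_1k_usd
    return tokens_saved, round(pct, 1), round(cost, 6)


# 75 -> 54 tokens at an assumed $0.03 per 1K input tokens
print(savings(75, 54, 0.03))
```

Under these assumed numbers, saving 21 of 75 tokens gives 28.0% and $0.000630, which matches the scale of the example values in the comments above.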

Use with any provider

import tokenoptim
from openai import OpenAI

client = OpenAI()
result = tokenoptim.optimize("your long prompt here", model="gpt-4")

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": result.text}],
)

Works the same way with Anthropic, DeepSeek, Mistral, Google, or any other provider:

import tokenoptim
from anthropic import Anthropic

client = Anthropic()
result = tokenoptim.optimize("your long prompt here", model="claude-sonnet-4")

response = client.messages.create(
    model="claude-sonnet-4",
    max_tokens=1024,
    messages=[{"role": "user", "content": result.text}],
)

Compress chat messages

import tokenoptim
from openai import OpenAI

client = OpenAI()

optimized = tokenoptim.optimize_messages([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain quantum computing in detail."},
], model="gpt-4")

# Use optimized.messages directly in your API call
response = client.chat.completions.create(
    model="gpt-4",
    messages=optimized.messages,
)
print(f"Saved {optimized.tokens_saved} tokens ({optimized.savings_pct}%)")

Track cumulative savings

import tokenoptim

with tokenoptim.session(model="gpt-4") as s:
    s.optimize("first prompt")
    s.optimize("second prompt")
    s.optimize_messages([{"role": "user", "content": "third prompt"}])

print(f"Saved {s.total_tokens_saved} tokens across {s.call_count} calls")
print(f"Cost saved: ${s.total_cost_saved_usd:.4f}")
print(f"Avg savings: {s.avg_savings_pct}%")
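
To make the cumulative fields concrete, here is a small stand-in accumulator. It is a sketch only: `SessionStats` is a hypothetical class, and the token-weighted average is an assumption (the library may instead average per-call percentages).

```python
from dataclasses import dataclass


@dataclass
class SessionStats:
    """Hypothetical stand-in for the stats a session accumulates."""
    call_count: int = 0
    total_original_tokens: int = 0
    total_tokens_saved: int = 0

    def record(self, original_tokens: int, compressed_tokens: int) -> None:
        self.call_count += 1
        self.total_original_tokens += original_tokens
        self.total_tokens_saved += original_tokens - compressed_tokens

    @property
    def avg_savings_pct(self) -> float:
        # Token-weighted average: total saved over total seen.
        if not self.total_original_tokens:
            return 0.0
        return round(100.0 * self.total_tokens_saved / self.total_original_tokens, 1)


stats = SessionStats()
for before, after in [(38, 29), (38, 21), (24, 18)]:
    stats.record(before, after)
print(stats.call_count, stats.total_tokens_saved, stats.avg_savings_pct)
```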

API Reference

| Function | Description |
| --- | --- |
| tokenoptim.optimize(text, model=, ...) | Compress a string. Returns CompressionResult |
| tokenoptim.optimize_messages(messages, model=, ...) | Compress chat messages. Returns MessagesResult |
| tokenoptim.session(model=, ...) | Context manager tracking cumulative stats |
| tokenoptim.suggest_cache_split(text, model=) | Suggest prefix-caching split point. Returns CacheSplitResult |
| tokenoptim.suggest_output_format(text) | Detect verbose output patterns. Returns list[OutputFormatSuggestion] |
| tokenoptim.compare_models(text, models=) | Compare token counts and costs. Returns list[ModelCostComparison] |

Options accepted by optimize(), optimize_messages(), and session():

| Parameter | Default | Description |
| --- | --- | --- |
| model | "gpt-4" | Target model (for tokenizer and cost calculation) |
| enable_contractions | True | Apply contractions (do not -> don't) |
| enable_filler_removal | True | Strip filler phrases |
| enable_phrase_shortening | True | Replace verbose phrases (due to the fact that -> because) |
| enable_numeric_normalization | True | Normalize numbers (1,000 -> 1000, 3.00 -> 3) |
| enable_separator_removal | True | Remove separator lines (---, ===) and boilerplate phrases |
| enable_html_stripping | False | Strip HTML/XML tags and unescape entities |
| enable_code_comment_stripping | False | Strip # ... and // ... end-of-line comments |
| enable_json_minification | False | Minify JSON blocks and inline objects |
| enable_duplicate_removal | False | Remove consecutive duplicate lines/paragraphs |
| enable_abbreviations | False | Replace common long words (configuration -> config) |
| enable_markdown_stripping | False | Strip markdown formatting (preserves code blocks) |
| enable_semantic_dedup | False | Remove near-duplicate sentences via TF-IDF similarity |
| semantic_dedup_threshold | 0.8 | Cosine similarity threshold for semantic dedup |
| enable_indentation_compaction | False | Reduce 4-space/tab indentation to 2-space |
| enable_url_shortening | False | Replace URLs with domain only (preserves code blocks) |
| enable_article_trimming | False | Remove redundant articles after prepositions/verbs |
| enable_list_compaction | False | Convert short bullet/numbered lists to comma-separated |
| enable_xml_minification | False | Minify XML in fenced blocks and inline |
| enable_yaml_minification | False | Minify YAML in fenced blocks (strip comments, reduce indent) |
| track | True | Log metrics to the dashboard database |

Lower-level Compressor class

For direct control without metrics tracking:

from tokenoptim import Compressor

c = Compressor(model="gpt-4")
result = c.compress("your prompt text here")

print(f"Original:   {result.original_tokens} tokens")
print(f"Compressed: {result.compressed_tokens} tokens")
print(f"Saved:      {result.savings_pct}%")
print(f"Cost saved: ${result.cost_saved_usd:.6f}")

Compression Examples

System prompt with fillers and contractions

Before (38 tokens):

You are a helpful coding assistant. Please note that you should provide
concise and accurate code. It is important to mention that you should
not make up APIs. You should always include error handling.

After (29 tokens):

You're a helpful coding assistant. You should provide concise and accurate
code. You shouldn't make up APIs. You should always include error handling.

Savings: 24%. Filler removal (please note that, it is important to mention that) and contractions (You are -> You're, should not -> shouldn't).


Filler-heavy requirements prompt

Before (38 tokens):

It is important to note that I need a REST API. Please note that it should
handle authentication. It should be noted that rate limiting is required.
As previously mentioned we are using PostgreSQL.

After (21 tokens):

I need a REST API. It should handle authentication. Rate limiting is
required. We're using PostgreSQL.

Savings: 45% — four filler phrases stripped, plus contractions.
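
The transformation in this example can be approximated with a few regular expressions. This is a rough sketch, not the library's implementation: the phrase lists and the `strip_fillers` and `contract` helpers are illustrative assumptions.

```python
import re

# Illustrative phrase lists; the library's real lists are not shown in this README.
FILLERS = [
    "it is important to note that",
    "it should be noted that",
    "as previously mentioned",
    "please note that",
]
CONTRACTIONS = {"we are": "we're", "do not": "don't", "should not": "shouldn't"}


def strip_fillers(text: str) -> str:
    for phrase in FILLERS:
        text = re.sub(rf"\b{re.escape(phrase)}\s*", "", text, flags=re.IGNORECASE)
    # Re-capitalize the first letter of each sentence after removal.
    return re.sub(r"(^|[.!?]\s+)([a-z])", lambda m: m.group(1) + m.group(2).upper(), text)


def contract(text: str) -> str:
    for full, short in CONTRACTIONS.items():
        def repl(m: re.Match, short: str = short) -> str:
            # Keep the capitalization of the first letter.
            return m.group(0)[0] + short[1:]
        text = re.sub(rf"\b{re.escape(full)}\b", repl, text, flags=re.IGNORECASE)
    return text


before = (
    "It is important to note that I need a REST API. Please note that it should "
    "handle authentication. It should be noted that rate limiting is required. "
    "As previously mentioned we are using PostgreSQL."
)
print(contract(strip_fillers(before)))
```

Removing a filler at the start of a sentence leaves a lowercase first letter, which is why the sketch re-capitalizes before applying contractions.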


Code-adjacent prompt

Before (24 tokens):

Write a Python function that does not raise an exception. It is important
to note that the function should return a list.

After (18 tokens):

Write a Python function that doesn't raise an exception. The function
should return a list.

Savings: 25% — contractions and filler removal apply; code structure is preserved.


Unicode and whitespace cleanup

Before (35 tokens):

The model\u2019s predictions are   very   accurate.    We have not tested
the   edge cases yet,  but  we  should  not  skip  them.

After (24 tokens):

The model's predictions are very accurate. We've not tested the edge
cases yet, but we shouldn't skip them.

Savings: 31% — smart quotes normalized, extra whitespace collapsed, contractions applied.
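
The quote normalization and whitespace collapsing shown here can be sketched with the standard library alone. The `normalize` helper and its character map are illustrative assumptions covering only a small subset of typographic punctuation:

```python
import re
import unicodedata

# Map common typographic punctuation to ASCII (an illustrative subset).
ASCII_PUNCT = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # single smart quotes
    "\u201c": '"', "\u201d": '"',   # double smart quotes
    "\u2013": "-", "\u2014": "-",   # en and em dashes
})


def normalize(text: str) -> str:
    text = unicodedata.normalize("NFC", text).translate(ASCII_PUNCT)
    # Collapse runs of spaces/tabs and strip the ends.
    return re.sub(r"[ \t]+", " ", text).strip()


print(normalize("The model\u2019s predictions are   very   accurate.    We have not tested"))
```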


Chat messages (multi-message compression)

Before (61 tokens):

messages = [
    {"role": "system", "content": "You are an expert Python developer. Please note that you should write clean code. You should not use global variables. You should always add type hints."},
    {"role": "user", "content": "It is important to note that I need a function to parse JSON. The function does not need to handle errors. It is worth noting that performance matters."},
]

After (47 tokens):

messages = [
    {"role": "system", "content": "You're an expert Python developer. You should write clean code. You shouldn't use global variables. You should always add type hints."},
    {"role": "user", "content": "I need a function to parse JSON. The function doesn't need to handle errors. Performance matters."},
]

Savings: 23% — each message is compressed independently; fillers and contractions stack up across the conversation.
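
Compressing each message independently amounts to mapping a text-level function over the message list while leaving roles untouched. A minimal sketch, where `compress_messages` and the `toy` compressor are hypothetical stand-ins for the library's internals:

```python
from typing import Callable


def compress_messages(messages: list[dict], compress: Callable[[str], str]) -> list[dict]:
    """Apply `compress` to each message's content, leaving roles untouched."""
    return [{**m, "content": compress(m["content"])} for m in messages]


# Toy compressor for demonstration only: drops one filler phrase.
def toy(text: str) -> str:
    return text.replace("Please note that you", "You")


msgs = [
    {"role": "system", "content": "Please note that you should write clean code."},
    {"role": "user", "content": "I need a function to parse JSON."},
]
print(compress_messages(msgs, toy))
```

The sketch builds new dictionaries rather than mutating the input, so the original messages remain available for comparison.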

How It Works

TokenOptim applies 24 compression strategies in order:

  1. Line ending normalization — normalize \r\n and \r to \n (always on)
  2. Unicode normalization — NFC normalize, replace smart quotes/dashes with ASCII
  3. Indentation compaction — reduce 4-space/tab indentation to 2-space (opt-in)
  4. Whitespace normalization — collapse multiple spaces, tabs, blank lines
  5. JSON minification — minify fenced and inline JSON blocks (opt-in)
  6. XML minification — minify fenced and inline XML (opt-in)
  7. YAML minification — strip comments, reduce indent, remove blank lines in fenced YAML (opt-in)
  8. Redundant punctuation (!!! -> !, ???? -> ?)
  9. HTML/XML stripping — remove tags and unescape entities (opt-in)
  10. Markdown stripping — remove formatting while preserving code blocks (opt-in)
  11. URL shortening — replace full URLs with domain only, strip www. (opt-in)
  12. Filler removal — strip phrases like "please note that", "basically", "it is important to mention that"
  13. Separator/boilerplate removal — remove lines of ---, ===, etc. and phrases like "please find below"
  14. Duplicate line removal — remove consecutive duplicate lines and paragraphs (opt-in)
  15. List compaction — convert short bullet/numbered lists to comma-separated (opt-in)
  16. Verbose phrase shortening (due to the fact that -> because, in order to -> to, prior to -> before)
  17. Abbreviations (configuration -> config, documentation -> docs, database -> db; opt-in)
  18. Article trimming — remove redundant the/a/an after prepositions and common verbs (opt-in)
  19. Contractions (do not -> don't, it is -> it's; configurable)
  20. Numeric normalization (1,000,000 -> 1000000, 3.00 -> 3, 007 -> 7)
  21. Code comment stripping — remove # ... and // ... end-of-line comments (opt-in)
  22. Semantic deduplication — remove near-duplicate sentences using TF-IDF cosine similarity (opt-in)
  23. Trailing whitespace — strip per-line trailing spaces
  24. Tokenizer-specific, model-aware optimizations (e.g., \n \n -> \n\n saves 2 tokens in tiktoken)

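The default strategies are designed to preserve semantic meaning, and code and structured data pass through with minimal changes. Some opt-in strategies (URL shortening, code comment stripping, semantic deduplication) discard information by design, so enable them only where that loss is acceptable.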
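
An ordered pipeline of individually switchable steps can be pictured as a list of named functions applied in sequence. The sketch below covers only a handful of the steps, and its regular expressions, step names, and `run` helper are illustrative guesses rather than the library's code:

```python
import re
from typing import Callable

Step = tuple[str, Callable[[str], str]]

# A few steps in the order listed above; the regexes are illustrative guesses.
PIPELINE: list[Step] = [
    ("line_endings", lambda t: t.replace("\r\n", "\n").replace("\r", "\n")),
    ("whitespace", lambda t: re.sub(r"[ \t]{2,}", " ", t)),
    ("punctuation", lambda t: re.sub(r"([!?])\1+", r"\1", t)),
    ("numbers", lambda t: re.sub(r"(?<=\d),(?=\d{3}\b)", "", t)),
    ("trailing_ws", lambda t: re.sub(r"[ \t]+$", "", t, flags=re.MULTILINE)),
]


def run(text: str, disabled: frozenset[str] = frozenset()) -> str:
    for name, step in PIPELINE:
        if name not in disabled:
            text = step(text)
    return text


print(repr(run("Budget is  1,000,000 dollars!!!  \r\nReally???")))
```

Order matters in a design like this: normalizing line endings first lets every later step assume \n, and stripping trailing whitespace late cleans up anything earlier steps leave behind.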

Dashboard

Launch the real-time savings dashboard:

tokenoptim dashboard

Open http://localhost:8383 to see:

  • Total tokens and cost saved
  • Savings over time charts
  • Per-model breakdown
  • Recent requests log
  • ROI calculator

Terminal stats

tokenoptim stats

Configuration

from tokenoptim import Compressor

# Disable contractions (for formal prompts)
c = Compressor(model="gpt-4", enable_contractions=False)

# Disable filler removal
c = Compressor(model="gpt-4", enable_filler_removal=False)

# Add custom filler phrases
c = Compressor(model="gpt-4", custom_fillers=["in my opinion", "to be honest"])

# Enable HTML stripping (opt-in — useful for web-scraped content)
c = Compressor(model="gpt-4", enable_html_stripping=True)

# Enable code comment stripping (opt-in — useful for code-heavy prompts)
c = Compressor(model="gpt-4", enable_code_comment_stripping=True)

# Disable verbose phrase shortening
c = Compressor(model="gpt-4", enable_phrase_shortening=False)

# Enable JSON minification (opt-in — useful for prompts with JSON data)
c = Compressor(model="gpt-4", enable_json_minification=True)

# Enable markdown stripping (opt-in — useful for web-scraped markdown)
c = Compressor(model="gpt-4", enable_markdown_stripping=True)

# Enable abbreviations (opt-in — replaces common long words)
c = Compressor(model="gpt-4", enable_abbreviations=True)

# Enable semantic deduplication (opt-in — removes near-duplicate sentences)
c = Compressor(model="gpt-4", enable_semantic_dedup=True, semantic_dedup_threshold=0.8)

# Enable indentation compaction (opt-in — reduces 4-space/tab to 2-space)
c = Compressor(model="gpt-4", enable_indentation_compaction=True)

# Enable URL shortening (opt-in — replaces URLs with domain only)
c = Compressor(model="gpt-4", enable_url_shortening=True)

# Enable article trimming (opt-in — removes redundant the/a/an)
c = Compressor(model="gpt-4", enable_article_trimming=True)

# Enable list compaction (opt-in — converts short lists to comma-separated)
c = Compressor(model="gpt-4", enable_list_compaction=True)

# Enable XML minification (opt-in — minifies XML in fenced blocks)
c = Compressor(model="gpt-4", enable_xml_minification=True)

# Enable YAML minification (opt-in — strips comments, reduces indent in YAML blocks)
c = Compressor(model="gpt-4", enable_yaml_minification=True)
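
To give a feel for what semantic_dedup_threshold controls, here is a standard-library sketch of TF-IDF cosine similarity with a greedy keep-or-drop pass. The functions `tfidf_vectors`, `cosine`, and `dedup` are illustrative; the library's tokenization, IDF weighting, and sentence splitting may differ.

```python
import math
import re
from collections import Counter


def tfidf_vectors(sentences: list[str]) -> list[dict[str, float]]:
    docs = [Counter(re.findall(r"\w+", s.lower())) for s in sentences]
    df = Counter(term for d in docs for term in d)
    n = len(docs)
    # Smoothed IDF so terms shared by every sentence keep a nonzero weight.
    return [{t: tf * (math.log((1 + n) / (1 + df[t])) + 1) for t, tf in d.items()}
            for d in docs]


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def dedup(sentences: list[str], threshold: float = 0.8) -> list[str]:
    vecs = tfidf_vectors(sentences)
    kept: list[int] = []
    for i, v in enumerate(vecs):
        # Keep a sentence only if it is not too similar to one already kept.
        if all(cosine(v, vecs[j]) < threshold for j in kept):
            kept.append(i)
    return [sentences[i] for i in kept]


sentences = ["Always include error handling.", "Always include error handling!", "Use type hints."]
print(dedup(sentences))
```

A lower threshold removes more aggressively; a higher one drops only sentences that are nearly identical.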

DeepSeek

import tokenoptim
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com", api_key="your-key")
result = tokenoptim.optimize("your prompt here", model="deepseek-chat")

response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": result.text}],
)

Advisor Utilities

TokenOptim includes advisory functions that help you optimize LLM costs beyond compression:

Suggest cache-friendly splits

import tokenoptim

result = tokenoptim.suggest_cache_split("""
You are a helpful assistant specialized in Python.
Always provide working code examples.
Answer the user's question about: {topic}
""")

print(f"Static prefix: {result.static_tokens} tokens")
print(f"Dynamic suffix: {result.dynamic_tokens} tokens")
print(f"Cache savings estimate: {result.cache_savings_estimate:.0%}")
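
One plausible way to find such a split is to cut the prompt at the line containing the first template placeholder. How suggest_cache_split actually chooses its split point is not documented here, so `split_for_cache` below is a hypothetical heuristic:

```python
import re


def split_for_cache(prompt: str) -> tuple[str, str]:
    """Split at the first {placeholder}; everything before its line is the static prefix."""
    match = re.search(r"\{\w+\}", prompt)
    if match is None:
        return prompt, ""  # fully static prompt
    # Cut at the start of the line containing the placeholder.
    cut = prompt.rfind("\n", 0, match.start()) + 1
    return prompt[:cut], prompt[cut:]


prompt = (
    "You are a helpful assistant specialized in Python.\n"
    "Always provide working code examples.\n"
    "Answer the user's question about: {topic}\n"
)
static, dynamic = split_for_cache(prompt)
print(repr(static))
print(repr(dynamic))
```

Keeping the static part as an unchanged prefix is what lets provider-side prefix caching reuse it across calls.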

Suggest concise output formats

import tokenoptim

suggestions = tokenoptim.suggest_output_format(
    "Please explain in detail how neural networks work and provide a detailed analysis."
)
for s in suggestions:
    print(f"Pattern: '{s.current_pattern}' → {s.suggestion} (saves ~{s.estimated_savings_pct}%)")
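
Detection of verbose-output cues could be as simple as matching a table of phrases. The patterns, advice strings, and `suggest` helper below are assumptions made for illustration; the library's actual patterns are not listed in this README.

```python
import re

# Hypothetical pattern table; the library's actual patterns are not documented here.
VERBOSE_PATTERNS = {
    r"\bexplain in detail\b": "ask for a short explanation or a fixed number of bullet points",
    r"\bdetailed analysis\b": "ask for a summary with a word limit",
}


def suggest(prompt: str) -> list[tuple[str, str]]:
    """Return (matched text, suggestion) pairs for verbose-output cues in the prompt."""
    found = []
    for pattern, advice in VERBOSE_PATTERNS.items():
        m = re.search(pattern, prompt, flags=re.IGNORECASE)
        if m:
            found.append((m.group(0), advice))
    return found


example = "Please explain in detail how neural networks work and provide a detailed analysis."
for matched, advice in suggest(example):
    print(f"'{matched}': {advice}")
```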

Compare model costs

import tokenoptim

comparisons = tokenoptim.compare_models(
    "Your prompt text here",
    models=["gpt-4", "gpt-4o", "gpt-3.5-turbo", "claude-3-5-sonnet", "deepseek-chat"],
)
for c in comparisons:
    print(f"{c.model:20s} {c.tokens:5d} tokens  ${c.cost_per_call:.6f}/call  ({c.provider})")

Supported Models

| Provider | Models | Tokenizer |
| --- | --- | --- |
| OpenAI | gpt-5.2, gpt-5.2-pro, gpt-5.1, gpt-4.1, gpt-4.1-mini, gpt-4.1-nano, gpt-4, gpt-4o, gpt-3.5-turbo, o3, o4-mini, o1 | tiktoken |
| Anthropic | claude-3-opus/sonnet/haiku, claude-3.5-*, claude-opus-4, claude-sonnet-4 | tiktoken (approx) |
| DeepSeek | deepseek-chat, deepseek-reasoner, deepseek-v3, deepseek-r1 | transformers / tiktoken fallback |
| Mistral | mistral-large, mistral-small, codestral, mixtral | tiktoken (approx) |
| Google | gemini-1.5-pro, gemini-1.5-flash, gemini-2.0-flash | tiktoken (approx) |
| Meta | llama-3, llama-2 | tiktoken (approx) |
| Qwen | qwen-max, qwen-plus, qwen-turbo | tiktoken (approx) |
| Local | Any HuggingFace model | transformers / fallback |

Installation Options

# Core (includes tiktoken for token counting)
pip install tokenoptim

# With local model support (HuggingFace transformers)
pip install "tokenoptim[local]"

# Development
pip install "tokenoptim[dev]"

Project Structure

tokenoptim/
├── src/tokenoptim/
│   ├── compressor.py          # Core compression engine
│   ├── tokenizers.py          # Tokenizer registry & pricing
│   ├── metrics/               # Usage tracking (SQLite)
│   │   ├── collector.py
│   │   ├── models.py
│   │   └── db.py
│   ├── server/                # FastAPI dashboard backend
│   │   ├── app.py
│   │   └── routes.py
│   ├── api.py                 # Public Python API (optimize, session)
│   └── advisor.py             # Advisory utilities (cache split, model compare)
├── dashboard/                 # React dashboard (Vite + TailwindCSS)
└── tests/                     # pytest suite (198 tests)

Development

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

# Run dashboard dev server (frontend)
cd dashboard && npm install && npm run dev

# Run API server (backend)
tokenoptim dashboard

Limitations

  • English only — contractions, filler removal, and verbose phrase shortening are designed for English text. Unicode normalization, whitespace cleanup, numeric normalization, and HTML/code comment stripping work with any language.
  • No semantic compression — TokenOptim applies rule-based transformations only. It does not paraphrase, summarize, or use ML models.
  • Tokenizer approximation: for providers whose native tokenizer is not used here (Anthropic, Mistral, Google, Meta, Qwen), token counts are approximated using tiktoken's cl100k_base encoding, so reported counts and savings may differ from what the provider bills.

Contributing

Found a bug or have an idea? Open an issue or submit a PR. If TokenOptim saved you tokens (and money), a star goes a long way!

Adding a new compression strategy

Want to contribute a new strategy? Here's how — it only takes 3 steps:

1. Add your strategy to src/tokenoptim/compressor.py:

# Module-level data (if needed)
_MY_REPLACEMENTS: dict[str, str] = {
    "long phrase": "short",
}

# In the Compressor class:

# Add a constructor param (opt-in strategies default to False)
def __init__(self, ..., enable_my_strategy: bool = False):
    self.enable_my_strategy = enable_my_strategy

# Add a method
@staticmethod
def _apply_my_strategy(text: str) -> str:
    for old, new in _MY_REPLACEMENTS.items():
        text = text.replace(old, new)
    return text

# Wire it into the compress() pipeline at the right position
if self.enable_my_strategy:
    compressed = self._apply_my_strategy(compressed)

2. Propagate the param through src/tokenoptim/api.py:

Add enable_my_strategy: bool = False to optimize(), optimize_messages(), the Session dataclass, and session() — then pass it to the Compressor constructor in each.

3. Add tests in tests/test_compressor.py:

class TestMyStrategy:
    def test_basic(self):
        c = Compressor(model="gpt-4", enable_my_strategy=True)
        result = c.compress("some long phrase here")
        assert "short" in result.text

    def test_disabled_by_default(self, compressor):
        result = compressor.compress("some long phrase here")
        assert "long phrase" in result.text

Run PYTHONPATH=src python3 -m pytest tests/ -v and make sure everything passes.
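
For orientation, the fragments from step 1 fit together roughly as follows. `MiniCompressor` and `Result` are stand-ins written for this sketch, not the real Compressor or CompressionResult, which carry many more options and fields:

```python
from dataclasses import dataclass

_MY_REPLACEMENTS: dict[str, str] = {"long phrase": "short"}


@dataclass
class Result:
    text: str


class MiniCompressor:
    """Stand-in for Compressor with a single opt-in strategy; not the real class."""

    def __init__(self, enable_my_strategy: bool = False):
        self.enable_my_strategy = enable_my_strategy

    @staticmethod
    def _apply_my_strategy(text: str) -> str:
        for old, new in _MY_REPLACEMENTS.items():
            text = text.replace(old, new)
        return text

    def compress(self, text: str) -> Result:
        compressed = text
        # Opt-in strategies run only when their flag is set.
        if self.enable_my_strategy:
            compressed = self._apply_my_strategy(compressed)
        return Result(text=compressed)


print(MiniCompressor(enable_my_strategy=True).compress("some long phrase here").text)
print(MiniCompressor().compress("some long phrase here").text)
```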

License

MIT
