# tokenpruner

Slash LLM input tokens by 70–80% without losing meaning.

tokenpruner compresses prompts, code context, and multi-turn conversations before they are sent to Claude, GPT-4, or any LLM — reducing cost and latency with zero external model dependencies.

```python
from tokenpruner import TextPruner, PruningConfig, PruningStrategy

pruner = TextPruner(PruningConfig(strategy=PruningStrategy.COMPOSITE))
result = pruner.prune(my_12000_token_prompt)
print(result)
# PruningResult(original=12400tok, pruned=2700tok, saved=78%)
```
## Why tokenpruner?

| Pain point | tokenpruner solution |
|---|---|
| Long codebase context sent to Claude | Code minification strips comments + whitespace (40–60% reduction) |
| Repeated boilerplate in system prompts | Template stripping removes redundant instructions |
| Near-duplicate RAG chunks | Jaccard-based deduplication before embedding |
| Long conversation history | Smart sliding-window + semantic compression of older turns |
| Uncontrolled token spend | Rate limiter + circuit breaker protect every API call |
| Batch processing at scale | Async batch with bounded concurrency and per-item error isolation |
## Installation

```bash
pip install tokenpruner

# Optional: exact token counts via tiktoken
pip install "tokenpruner[tiktoken]"
```
## Quick start

### Compress a single prompt

```python
from tokenpruner import TextPruner, PruningConfig, PruningStrategy

# COMPOSITE runs: template_strip → code_minify → dedup → semantic → sliding_window
pruner = TextPruner(PruningConfig(
    strategy=PruningStrategy.COMPOSITE,
    target_ratio=0.25,  # keep 25% of tokens = 75% reduction
))

result = pruner.prune(long_prompt)
print(f"Saved {result.reduction_ratio:.0%} — {result.tokens_saved} tokens")
```
### Prune a full conversation

```python
from tokenpruner import ConversationPruner, Message, PruningConfig

msgs = [Message(role=m["role"], content=m["content"]) for m in conversation]

pruner = ConversationPruner(
    PruningConfig(max_tokens=8000),
    keep_recent_turns=3,  # keep the last 3 exchanges verbatim
)
result = pruner.prune(msgs)
pruned_dicts = [{"role": m.role, "content": m.content} for m in result.pruned_messages]
```
### Drop-in Claude adapter

```python
import anthropic

from tokenpruner import PruningConfig
from tokenpruner.adapters.claude import ClaudeAdapter

client = anthropic.Anthropic()
adapter = ClaudeAdapter(client, config=PruningConfig(max_tokens=8000))

response, meta = adapter.messages_create(
    model="claude-opus-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": giant_codebase_dump}],
    system="You are a code reviewer.",
)
print(f"Tokens saved: {meta['messages_reduction']:.0%}")
```
### Drop-in OpenAI adapter

```python
import openai

from tokenpruner import PruningConfig
from tokenpruner.adapters.openai import OpenAIAdapter

client = openai.OpenAI()
adapter = OpenAIAdapter(client, config=PruningConfig(max_tokens=6000))

response, meta = adapter.chat_completions_create(
    model="gpt-4o",
    messages=[{"role": "user", "content": long_context}],
)
```
## Strategies

| Strategy | Best for | Typical reduction |
|---|---|---|
| `COMPOSITE` | General prompts, mixed content | 60–80% |
| `CODE_MINIFY` | Code files, diffs | 40–60% |
| `SEMANTIC` | Long documents, RAG chunks | 40–70% |
| `DEDUP` | Repeated passages, few-shot examples | 30–70% |
| `TEMPLATE_STRIP` | System prompts with boilerplate | 20–40% |
| `SLIDING_WINDOW` | Long conversation history | Configurable |
| `TRUNCATE` | Hard budget enforcement | Configurable |
| `SUMMARY` | Semantic + dedup combined | 50–75% |
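
When the input is entirely one content type, you can pick the matching strategy directly instead of COMPOSITE. A minimal sketch using only the classes shown above (the file path is illustrative):

```python
from tokenpruner import TextPruner, PruningConfig, PruningStrategy

# Dedicated pruner for source code: strips comments and extra whitespace
code_pruner = TextPruner(PruningConfig(strategy=PruningStrategy.CODE_MINIFY))

with open("app/models.py") as f:  # illustrative path
    result = code_pruner.prune(f.read())
print(f"Saved {result.reduction_ratio:.0%} on the code context")
```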
## Advanced features

### Smart Cache (LRU + TTL)

```python
from tokenpruner import SmartCache, TextPruner

cache: SmartCache = SmartCache(maxsize=256, ttl=300)
pruner = TextPruner()

@cache.memoize
def cached_prune(text: str):
    return pruner.prune(text)

cached_prune(prompt)  # computes
cached_prune(prompt)  # cache hit — free
print(cache.stats())  # {'hits': 1, 'misses': 1, 'hit_rate': 0.5, ...}
```
### Pipeline with audit log and retry

```python
from tokenpruner import PruningPipeline, PruningConfig, PruningStrategy, TextPruner

def make_pruner(strategy):
    """Build a pipeline step that prunes text with a single strategy."""
    pruner = TextPruner(PruningConfig(strategy=strategy))
    return lambda text: pruner.prune(text).pruned_text

pipeline = (
    PruningPipeline()
    .add_step("dedup", make_pruner(PruningStrategy.DEDUP))
    .add_step("semantic", make_pruner(PruningStrategy.SEMANTIC))
    .with_retry(n=2, backoff=0.5)
)

result, audit = pipeline.run(long_text)
print(audit)  # per-step token counts, duration, errors

# Async
result, audit = await pipeline.arun(long_text)
```
### Declarative validator

```python
from tokenpruner import PruningValidator

validator = (
    PruningValidator()
    .require_min_length(10)
    .require_max_tokens(100_000)
    .require_no_secrets()
    .add_rule("no_pii", lambda t: "@" not in t, "Email address detected")
)

errors = validator.validate(prompt)   # {} = valid
await validator.avalidate(prompt)     # async version
validator.validate_or_raise(prompt)   # raises ValidationError if invalid
```
### Async batch processing

```python
from tokenpruner import async_batch, batch

# Sync
results = batch(list_of_1000_prompts, concurrency=16)

# Async
results = await async_batch(list_of_1000_prompts, concurrency=32)

# Per-item errors are isolated — one bad item doesn't abort the batch
for r in results:
    if isinstance(r, Exception):
        print("Item failed:", r)
    else:
        print(f"Saved {r.reduction_ratio:.0%}")
```
### Rate limiter

```python
from tokenpruner import RateLimiter, get_rate_limiter

# Global limiter
limiter = RateLimiter(rate=10, capacity=10)

with limiter:
    result = pruner.prune(text)

# The same limiter works as an async context manager inside async code
async with limiter:
    result = pruner.prune(text)

# Per-key limiting (e.g. per user/API key)
limiter = get_rate_limiter("user-abc", rate=5, capacity=5)
print(limiter.stats())
```
### Streaming for large documents

```python
from tokenpruner import stream_prune, async_stream_prune

# Sync streaming
for chunk_result in stream_prune(huge_document, chunk_size=2000):
    send_to_llm(chunk_result.pruned_text)

# Async streaming (cancellation-safe)
async for chunk_result in async_stream_prune(huge_document, chunk_size=2000):
    await send_to_llm(chunk_result.pruned_text)
```
### Diff engine

```python
from tokenpruner import diff_prune

diff = diff_prune(original_prompt)
print(diff.summary())
# Pruning Summary
# Original : 12,400 tokens, 312 lines
# Pruned   :  2,730 tokens,  68 lines
# Removed  :  9,670 tokens (78.0%)
# ...

data = diff.to_json()  # machine-readable dict
```
### Circuit breaker

```python
from tokenpruner import CircuitBreaker

cb = CircuitBreaker(failure_threshold=5, reset_timeout=30)

@cb.protect
def call_llm_api(prompt: str) -> str:
    ...  # your API call

# Or inline
result = cb.call(pruner.prune, long_text)

print(cb.stats())
# {'state': 'closed', 'failures': 0, 'total_calls': 42, 'rejected_calls': 0}
```
## Compression techniques

tokenpruner applies these evidence-based techniques:

- Template stripping — removes boilerplate like "You are a helpful assistant" and empty XML tags (20–40%)
- Code minification — strips comments, normalises whitespace (40–60% on code)
- Jaccard deduplication — removes near-duplicate sentences (30–70%); see the sketch below
- Heuristic semantic scoring — keeps high-value sentences (keyword density, position, structure)
- Sliding window — retains a prefix anchor + the most recent suffix; also sketched below
- Hard truncation — deterministic budget enforcement
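
For intuition, here is a self-contained sketch of the two most mechanical techniques: sentence-level Jaccard deduplication and the prefix-anchor sliding window. It is illustrative only, not tokenpruner's actual implementation; the 0.8 threshold, the sentence splitter, and the window sizes are assumptions.

```python
import re

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b) if a | b else 1.0

def dedup_sentences(text: str, threshold: float = 0.8) -> str:
    """Drop sentences whose word set overlaps an already-kept sentence."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    kept, kept_sets = [], []
    for s in sentences:
        words = set(s.lower().split())
        if all(jaccard(words, seen) < threshold for seen in kept_sets):
            kept.append(s)
            kept_sets.append(words)
    return " ".join(kept)

def sliding_window(lines: list[str], anchor: int = 5, recent: int = 50) -> list[str]:
    """Keep the first `anchor` lines (context) plus the last `recent` lines."""
    if len(lines) <= anchor + recent:
        return lines
    return lines[:anchor] + ["... [pruned] ..."] + lines[-recent:]
```

Word-set Jaccard is cheap and order-insensitive, which makes it a natural fit for the pre-embedding RAG-chunk dedup case in the table above.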
## Optional: exact token counting

```python
from tokenpruner.utils.tokenizers import count_tokens_exact

# Uses tiktoken if installed, falls back to a fast estimate
n = count_tokens_exact("hello world", model="cl100k_base")
```
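
The same install-or-estimate pattern is easy to replicate outside the library. A hedged sketch (the 4-characters-per-token divisor is a common rough heuristic for English text, not necessarily tokenpruner's exact fallback):

```python
def count_tokens(text: str, encoding: str = "cl100k_base") -> int:
    """Exact count via tiktoken when available, otherwise a rough estimate."""
    try:
        import tiktoken
        return len(tiktoken.get_encoding(encoding).encode(text))
    except ImportError:
        # Roughly 4 characters per token for typical English text
        return max(1, len(text) // 4)
```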
## License

MIT