
tokenpruner

Slash LLM input tokens by 70–80% without losing meaning.

tokenpruner compresses prompts, code context, and multi-turn conversations before they are sent to Claude, GPT-4, or any LLM — reducing cost and latency with zero external model dependencies.

from tokenpruner import TextPruner, PruningConfig, PruningStrategy

pruner = TextPruner(PruningConfig(strategy=PruningStrategy.COMPOSITE))
result = pruner.prune(my_12000_token_prompt)
print(result)
# PruningResult(original=12400tok, pruned=2700tok, saved=78%)

Why tokenpruner?

| Pain point | tokenpruner solution |
|---|---|
| Long codebase context sent to Claude | Code minification strips comments + whitespace (40–60% reduction) |
| Repeated boilerplate in system prompts | Template stripping removes redundant instructions |
| Near-duplicate RAG chunks | Jaccard-based deduplication before embedding |
| Long conversation history | Smart sliding-window + semantic compression of older turns |
| Uncontrolled token spend | Rate limiter + circuit breaker protect every API call |
| Batch processing at scale | Async batch with bounded concurrency and per-item error isolation |

Installation

pip install tokenpruner

# Optional: exact token counts via tiktoken
pip install "tokenpruner[tiktoken]"

Quick start

Compress a single prompt

from tokenpruner import TextPruner, PruningConfig, PruningStrategy

# COMPOSITE runs: template_strip → code_minify → dedup → semantic → sliding_window
pruner = TextPruner(PruningConfig(
    strategy=PruningStrategy.COMPOSITE,
    target_ratio=0.25,   # keep 25% of tokens = 75% reduction
))

result = pruner.prune(long_prompt)
print(f"Saved {result.reduction_ratio:.0%}{result.tokens_saved} tokens")

Prune a full conversation

from tokenpruner import ConversationPruner, Message, PruningConfig

msgs = [Message(role=m["role"], content=m["content"]) for m in conversation]

pruner = ConversationPruner(
    PruningConfig(max_tokens=8000),
    keep_recent_turns=3,  # last 3 exchanges verbatim
)
result = pruner.prune(msgs)
pruned_dicts = [{"role": m.role, "content": m.content} for m in result.pruned_messages]

Drop-in Claude adapter

import anthropic
from tokenpruner import PruningConfig, PruningStrategy
from tokenpruner.adapters.claude import ClaudeAdapter

client = anthropic.Anthropic()
adapter = ClaudeAdapter(client, config=PruningConfig(max_tokens=8000))

response, meta = adapter.messages_create(
    model="claude-opus-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": giant_codebase_dump}],
    system="You are a code reviewer.",
)
print(f"Tokens saved: {meta['messages_reduction']:.0%}")

Drop-in OpenAI adapter

import openai
from tokenpruner import PruningConfig
from tokenpruner.adapters.openai import OpenAIAdapter

client = openai.OpenAI()
adapter = OpenAIAdapter(client, config=PruningConfig(max_tokens=6000))

response, meta = adapter.chat_completions_create(
    model="gpt-4o",
    messages=[{"role": "user", "content": long_context}],
)

Strategies

| Strategy | Best for | Typical reduction |
|---|---|---|
| COMPOSITE | General prompts, mixed content | 60–80% |
| CODE_MINIFY | Code files, diffs | 40–60% |
| SEMANTIC | Long documents, RAG chunks | 40–70% |
| DEDUP | Repeated passages, few-shot examples | 30–70% |
| TEMPLATE_STRIP | System prompts with boilerplate | 20–40% |
| SLIDING_WINDOW | Long conversation history | Configurable |
| TRUNCATE | Hard budget enforcement | Configurable |
| SUMMARY | Semantic + dedup combined | 50–75% |
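
The strategy is just a field on PruningConfig, so different content types can get different pruners. A minimal sketch using the API shown above (the input variables are illustrative placeholders):

from tokenpruner import TextPruner, PruningConfig, PruningStrategy

# Minify code context so identifiers survive intact
code_pruner = TextPruner(PruningConfig(strategy=PruningStrategy.CODE_MINIFY))

# Score prose semantically so high-value sentences are kept
doc_pruner = TextPruner(PruningConfig(strategy=PruningStrategy.SEMANTIC))

code_result = code_pruner.prune(source_file_text)   # e.g. a large diff
doc_result = doc_pruner.prune(rag_chunk_text)       # e.g. a retrieved document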

Advanced features

Smart Cache (LRU + TTL)

from tokenpruner import SmartCache, TextPruner

cache: SmartCache = SmartCache(maxsize=256, ttl=300)
pruner = TextPruner()

@cache.memoize
def cached_prune(text: str):
    return pruner.prune(text)

cached_prune(prompt)   # computes
cached_prune(prompt)   # cache hit — free
print(cache.stats())   # {'hits': 1, 'misses': 1, 'hit_rate': 0.5, ...}

Pipeline with audit log and retry

from tokenpruner import PruningPipeline, PruningConfig, PruningStrategy, TextPruner

def make_pruner(strategy):
    # Each pipeline step maps text in -> pruned text out for one strategy
    return lambda text: TextPruner(PruningConfig(strategy=strategy)).prune(text).pruned_text

pipeline = (
    PruningPipeline()
    .add_step("dedup", make_pruner(PruningStrategy.DEDUP))
    .add_step("semantic", make_pruner(PruningStrategy.SEMANTIC))
    .with_retry(n=2, backoff=0.5)
)

result, audit = pipeline.run(long_text)
print(audit)  # per-step token counts, duration, errors

# Async
result, audit = await pipeline.arun(long_text)

Declarative validator

from tokenpruner import PruningValidator

validator = (
    PruningValidator()
    .require_min_length(10)
    .require_max_tokens(100_000)
    .require_no_secrets()
    .add_rule("no_pii", lambda t: "@" not in t, "Email address detected")
)

errors = validator.validate(prompt)   # {} = valid
await validator.avalidate(prompt)     # async version
validator.validate_or_raise(prompt)   # raises ValidationError if invalid

Async batch processing

from tokenpruner import async_batch, batch, PruningConfig

# Sync
results = batch(list_of_1000_prompts, concurrency=16)

# Async
results = await async_batch(list_of_1000_prompts, concurrency=32)

# Per-item errors are isolated — one bad item doesn't abort the batch
for r in results:
    if isinstance(r, Exception):
        print("Item failed:", r)
    else:
        print(f"Saved {r.reduction_ratio:.0%}")

Rate limiter

from tokenpruner import RateLimiter, get_rate_limiter

# Global limiter
limiter = RateLimiter(rate=10, capacity=10)
with limiter:
    result = pruner.prune(text)

# In async code:
async with limiter:
    result = pruner.prune(text)

# Per-key limiting (e.g. per user/API key)
limiter = get_rate_limiter("user-abc", rate=5, capacity=5)
print(limiter.stats())
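
The rate and capacity parameters read like token-bucket semantics: capacity bounds the burst, rate sets the steady refill. As an assumption about the model (not tokenpruner's internals), here is a minimal sketch of that idea:

import time

class TokenBucket:
    # Illustrative only: `rate` tokens/sec refill, `capacity` caps the burst
    def __init__(self, rate: float, capacity: float):
        self.rate, self.capacity = rate, capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def acquire(self) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, never above capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False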

Streaming for large documents

from tokenpruner import stream_prune, async_stream_prune

# Sync streaming
for chunk_result in stream_prune(huge_document, chunk_size=2000):
    send_to_llm(chunk_result.pruned_text)

# Async streaming (cancellation-safe)
async for chunk_result in async_stream_prune(huge_document, chunk_size=2000):
    await send_to_llm(chunk_result.pruned_text)

Diff engine

from tokenpruner import diff_prune

diff = diff_prune(original_prompt)
print(diff.summary())
# Pruning Summary
#   Original : 12,400 tokens, 312 lines
#   Pruned   : 2,730 tokens, 68 lines
#   Removed  : 9,670 tokens (78.0%)
#   ...

data = diff.to_json()   # machine-readable dict

Circuit breaker

from tokenpruner import CircuitBreaker

cb = CircuitBreaker(failure_threshold=5, reset_timeout=30)

@cb.protect
def call_llm_api(prompt: str) -> str:
    ...  # your API call

# Or inline
result = cb.call(pruner.prune, long_text)

print(cb.stats())
# {'state': 'closed', 'failures': 0, 'total_calls': 42, 'rejected_calls': 0}

Compression techniques

tokenpruner applies these evidence-based techniques:

  1. Template stripping — removes boilerplate like "You are a helpful assistant" and empty XML tags (20–40%)
  2. Code minification — strips comments, normalises whitespace (40–60% on code)
  3. Jaccard deduplication — removes near-duplicate sentences (30–70%; sketched below)
  4. Heuristic semantic scoring — keeps high-value sentences (keyword density, position, structure)
  5. Sliding window — retains a prefix anchor + most-recent suffix
  6. Hard truncation — deterministic budget enforcement
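
For intuition, Jaccard similarity over word sets is simple to compute. A minimal sketch of the dedup idea under that definition (not the library's implementation):

def jaccard(a: str, b: str) -> float:
    # |A ∩ B| / |A ∪ B| over lowercased word sets
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def dedup_sentences(sentences: list[str], threshold: float = 0.8) -> list[str]:
    # Keep a sentence only if it is not a near-duplicate of one already kept
    kept: list[str] = []
    for s in sentences:
        if all(jaccard(s, k) < threshold for k in kept):
            kept.append(s)
    return kept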

Optional: exact token counting

from tokenpruner.utils.tokenizers import count_tokens_exact

# Uses tiktoken if installed, falls back to fast estimate
n = count_tokens_exact("hello world", model="cl100k_base")
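
The shape of the fallback is not documented here; a common heuristic is roughly four characters per token for English text. An illustrative version of that pattern (an assumption, not tokenpruner's code):

def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for typical English text
    return max(1, len(text) // 4)

try:
    import tiktoken
    n = len(tiktoken.get_encoding("cl100k_base").encode("hello world"))
except ImportError:
    n = estimate_tokens("hello world")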

License

MIT
