Project description
llm-token-optimizer
Token cost control and auto-optimization for LLM applications.
Compress prompts, estimate costs before calls, enforce budgets, route to cheap models, and cut LLM spend by up to 60% — with no vendor lock-in.
pip install llm-token-optimizer
Why llm-token-optimizer?
In 2026, LLM API costs are one of the largest operational expenses for AI teams. Every wasted token costs money, yet most teams have no easy way to:
- Estimate cost before making an API call
- Compress prompts without breaking them
- Route requests to cheaper models automatically
- Enforce per-day or per-job token budgets
- Detect cost drift across model upgrades
llm-token-optimizer fixes all of this — with a clean, provider-agnostic API.
Quickstart
from llm_token_optimizer import (
optimize_prompt, CostEstimator, estimate_tokens,
)
prompt = """
Please note that you should summarize the following document.
As an AI language model, I'd be happy to help.
The quick brown fox jumps over the lazy dog.
The quick brown fox jumps over the lazy dog.
"""
# Step 1: Estimate cost before calling the LLM
estimator = CostEstimator()
estimate = estimator.estimate("gpt-4o", prompt, estimated_output_tokens=200)
print(f"Estimated cost: ${estimate.total_cost_usd:.6f}")
print(f"Input tokens: {estimate.input_tokens}")
# Step 2: Optimize the prompt
result = optimize_prompt(prompt, strategies=["whitespace", "fillers", "dedup"])
print(f"Tokens saved: {result.tokens_saved}")
print(f"Compression ratio: {result.compression_ratio:.2f}")
print(result.optimized_text)
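As a quick follow-up, you can re-estimate on the optimized text to quantify the saving per call; this sketch uses only the estimator, prompt, and result objects created above.
# Re-estimate the same call on the optimized prompt and compare costs
before = estimator.estimate("gpt-4o", prompt, estimated_output_tokens=200)
after = estimator.estimate("gpt-4o", result.optimized_text, estimated_output_tokens=200)
print(f"Cost saved per call: ${before.total_cost_usd - after.total_cost_usd:.6f}")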
Built-in Optimization Strategies
| Strategy | Description |
|---|---|
| whitespace | Collapse redundant spaces and blank lines |
| fillers | Remove filler phrases ("Please note that", "As an AI...") |
| dedup | Remove repeated paragraphs |
| examples | Trim few-shot examples to first N (default 3) |
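Below is a minimal sketch of the examples strategy in isolation; it assumes the default of keeping the first three examples, and how the library detects example blocks is up to its own heuristics.
from llm_token_optimizer import optimize_prompt
# Build a prompt with six numbered few-shot examples
few_shot = "\n\n".join(
    f"Example {i}:\nInput: add {i} and {i}\nOutput: {i + i}" for i in range(1, 7)
)
result = optimize_prompt(few_shot, strategies=["examples"])
print(result.tokens_saved)    # tokens removed by trimming trailing examples
print(result.optimized_text)  # only the leading examples remain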
Model Pricing (2026 catalog)
Pre-loaded pricing for OpenAI, Anthropic, Google, and Mistral:
from llm_token_optimizer import CostEstimator, ModelTier
estimator = CostEstimator()
# Compare models before choosing
results = estimator.compare_models(
    ["gpt-4o", "gpt-4o-mini", "claude-haiku-4-5-20251001"],
    prompt="Your prompt here",
    estimated_output_tokens=300,
)
for r in results:
    print(f"{r.model_id}: ${r.total_cost_usd:.6f}")
# Find cheapest in a tier
cheapest = estimator.cheapest_model(prompt, tier=ModelTier.CHEAP)
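For manual routing, here is a small sketch that picks the lowest-cost candidate from compare_models(); it relies only on the model_id and total_cost_usd fields shown above.
candidates = estimator.compare_models(
    ["gpt-4o", "gpt-4o-mini", "claude-haiku-4-5-20251001"],
    prompt="Your prompt here",
    estimated_output_tokens=300,
)
# Pick the cheapest candidate, falling back to a default if the list is empty
best = min(candidates, key=lambda r: r.total_cost_usd) if candidates else None
chosen = best.model_id if best else "gpt-4o-mini"
print(f"Routing to {chosen}")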
Advanced Features
Caching (LRU + TTL + SHA-256)
from llm_token_optimizer.advanced import OptimizationCache
cache = OptimizationCache(max_size=1000, ttl=600)
memoized = cache.memoize(optimize_prompt)
result = memoized(prompt, ["whitespace", "fillers"]) # cached on second call
print(cache.stats())
Semantic Cache (cosine similarity)
from llm_token_optimizer.advanced import SemanticCache
sc = SemanticCache(threshold=0.92)
sc.put(prompt, result)
cached = sc.get("similar prompt text...") # returns if similarity >= 0.92
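A get-or-compute pattern fits naturally here; this sketch assumes get() returns None on a miss (put/get and the threshold are from the example above, the miss handling is ours).
# Look up a semantically similar prior result; compute and store on a miss
cached = sc.get(prompt)
if cached is None:
    cached = optimize_prompt(prompt, strategies=["whitespace", "fillers"])
    sc.put(prompt, cached)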
Optimization Pipeline
from llm_token_optimizer.advanced import OptimizationPipeline
pipeline = (
    OptimizationPipeline()
    .map("strip", lambda t: t.strip())
    .filter("non_empty", lambda t: len(t) > 0)
    .branch(
        condition=lambda t: len(t) > 2000,
        true_fn=lambda t: t[:2000],
        false_fn=lambda t: t,
    )
    .with_retry("strip", retries=2)
)
optimized = pipeline.run(prompt)
print(pipeline.audit_log)
import asyncio
optimized = asyncio.run(pipeline.arun(prompt))
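The pipeline composes naturally with the core optimizer; the chaining below is our suggestion, while both calls come from the examples above.
# Pre-process with the pipeline, then apply the token-level strategies
pre_processed = pipeline.run(prompt)
result = optimize_prompt(pre_processed, strategies=["whitespace", "fillers", "dedup"])
print(result.tokens_saved)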
Declarative Token Constraints
from llm_token_optimizer.advanced import PromptConstraintValidator, PromptConstraint
validator = (
    PromptConstraintValidator()
    .add(PromptConstraint("context_limit", max_tokens=4096, model_id="gpt-4o"))
    .add(PromptConstraint("min_content", min_tokens=10, model_id="gpt-4o"))
)
violations = validator.validate(prompt)
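A sketch of acting on the result; it assumes validate() returns an empty collection when every constraint passes, which the snippet above does not state explicitly.
# Block the call if any constraint is violated; otherwise proceed
if violations:
    for v in violations:
        print(f"Constraint violated: {v}")
else:
    result = optimize_prompt(prompt, strategies=["whitespace", "fillers"])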
PII Scrubbing
from llm_token_optimizer.advanced import PIIScrubber
scrubber = PIIScrubber()
clean = scrubber.scrub("Contact: john@example.com, SSN: 123-45-6789")
# → "Contact: [EMAIL], SSN: [SSN]"
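Scrubbing typically runs before optimization so that redacted placeholders are what get compressed and sent; the ordering below is our suggestion.
# Redact PII first, then optimize the redacted text
clean = scrubber.scrub(prompt)
result = optimize_prompt(clean, strategies=["whitespace", "fillers"])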
Rate Limiter (sync + async)
from llm_token_optimizer.advanced import RateLimiter
import asyncio
limiter = RateLimiter(rate=10, capacity=10) # 10 calls/s
if limiter.acquire():
    result = optimize_prompt(prompt)
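acquire() appears to be non-blocking, so a simple polling wait can be layered on top; the loop and its sleep interval below are ours, not part of the library.
import time
# Poll until a token is available, then do the work
while not limiter.acquire():
    time.sleep(0.05)
result = optimize_prompt(prompt)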
Async Batch Optimization
from llm_token_optimizer.advanced import abatch_optimize, batch_optimize
import asyncio
prompts = [prompt1, prompt2, prompt3]
results = asyncio.run(abatch_optimize(prompts, optimize_prompt, concurrency=8))  # async variant, bounded concurrency
results = batch_optimize(prompts, optimize_prompt, max_workers=4)  # or the synchronous variant
Budget-Controlled Optimization
from llm_token_optimizer.advanced import optimize_with_budget
results = optimize_with_budget(prompts, optimize_prompt, budget_seconds=5.0)
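If the time budget runs out before every prompt is processed, fewer results may come back; the check below assumes the call returns a list of per-prompt results, which is not stated explicitly above.
# Detect whether the time budget cut the batch short
if len(results) < len(prompts):
    print(f"Time budget hit: optimized {len(results)} of {len(prompts)} prompts")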
Observability
from llm_token_optimizer.advanced import OperationProfiler, CostTelemetry, DriftDetector
# Timing profiler
profiler = OperationProfiler()
profiled = profiler.profile(optimize_prompt)
profiled(prompt)
print(profiler.report())
# Cost tracking
telemetry = CostTelemetry()
from llm_token_optimizer.models import TokenUsage
telemetry.record(TokenUsage(
    model_id="gpt-4o", input_tokens=500, output_tokens=100,
    input_cost_usd=0.0025, output_cost_usd=0.0015, total_cost_usd=0.004,
))
print(telemetry.summary())
print(telemetry.by_model())
# Drift detection
drift_detector = DriftDetector(threshold=0.05)
drift_detector.set_baseline(result_v1)
drift = drift_detector.detect(result_v2)
Streaming
from llm_token_optimizer.advanced import stream_optimize, results_to_ndjson, results_to_csv
results = []
for result in stream_optimize(prompts, optimize_prompt):
    results.append(result)
    print(result.tokens_saved)
for line in results_to_ndjson(prompts, optimize_prompt):
    print(line)
csv_str = results_to_csv(results)  # results: the list collected above
Diff & Regression Tracking
from llm_token_optimizer.advanced import diff_optimizations, RegressionTracker, ScoreTrend
diff = diff_optimizations(result_v1, result_v2)
print(diff.summary())
print(diff.to_json())
tracker = RegressionTracker(window=20)
tracker.record(result_v1)
tracker.record(result_v2)
print(tracker.trend()) # "improving" / "declining" / "stable"
trend = ScoreTrend(window=10)
trend.record(result.tokens_saved)
print(trend.trend(), trend.volatility())
Cost Ledger, Batch API Router, Model Router
from llm_token_optimizer.advanced import CostLedger, BatchAPIRouter, ModelRouter
# Hard budget enforcement
ledger = CostLedger(budget_usd=5.0)
ledger.record("gpt-4o", tokens=1000, cost_usd=0.005)
print(ledger.summary()) # raises BudgetExceededError if over budget
# 50% batch discount routing
router = BatchAPIRouter(latency_sensitive=False)
model_id, use_batch = router.route("gpt-4o", prompt)
effective_cost = router.effective_cost("gpt-4o", tokens=10000)
# Auto-route cheap vs. frontier
model_router = ModelRouter(cheap_token_threshold=500)
recommended_model = model_router.route(prompt) # e.g. "gemini-2.0-flash" for short prompts
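A sketch of a budget-guarded loop over the prompts from earlier: BudgetExceededError is named in the comment above, but its import path and the assumption that record() is what raises it are ours.
from llm_token_optimizer.advanced import BudgetExceededError  # import path assumed
# Estimate each call up front and charge it against the hard budget
for p in prompts:
    est = estimator.estimate("gpt-4o", p, estimated_output_tokens=200)
    try:
        ledger.record("gpt-4o", tokens=est.input_tokens, cost_usd=est.total_cost_usd)
    except BudgetExceededError:
        print("Budget exhausted, stopping")
        break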
Audit Log
from llm_token_optimizer.advanced import AuditLog
log = AuditLog()
log.log("optimize", {"tokens_saved": 150, "model": "gpt-4o"})
print(log.to_json())
Custom Model Pricing
from llm_token_optimizer import PricingRegistry, ModelPricing, ModelTier
registry = PricingRegistry()
registry.register(ModelPricing(
    model_id="my-fine-tuned-model",
    tier=ModelTier.STANDARD,
    input_cost_per_1k=0.002,
    output_cost_per_1k=0.006,
    context_window=32768,
    supports_batch=True,
    batch_discount=0.50,
))
Installation
pip install llm-token-optimizer
# With exact tiktoken counting (optional):
pip install "llm-token-optimizer[tiktoken]"
Python 3.8+ · Minimal dependencies: stdlib + pydantic (tiktoken optional, for exact token counts)
License
MIT
Project details
Download files
File details
Details for the file llm_token_optimizer-1.0.0.tar.gz.
File metadata
- Download URL: llm_token_optimizer-1.0.0.tar.gz
- Upload date:
- Size: 21.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9f3ff6ea2427efe9e630b9aaf4734d81ae657178b8d80c3453e286c41d8032cf |
| MD5 | aae23ea3cdf2c5259a9a7643314f8f23 |
| BLAKE2b-256 | 1b61c60f40c1fad50d3cc1b895997357be2186e5f9b4f47a98d0776e93700cd4 |
File details
Details for the file llm_token_optimizer-1.0.0-py3-none-any.whl.
File metadata
- Download URL: llm_token_optimizer-1.0.0-py3-none-any.whl
- Upload date:
- Size: 18.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | aa17abad43cd15a0406540e4d299857026fe0d8aa031a62487ee2d63a106780b |
| MD5 | 5f7cc2a3ee1241ad6c169954e2fa01b8 |
| BLAKE2b-256 | b950723e515e1cb03f101441765dc33def03a2b21f7722d1dca84f97b27f6d95 |