
llm-token-optimizer

Token cost control and auto-optimization for LLM applications.

Compress prompts, estimate costs before calls, enforce budgets, route to cheap models, and cut LLM spend by up to 60% — with no vendor lock-in.

pip install llm-token-optimizer

Why llm-token-optimizer?

LLM API spend has become one of the largest operational expenses for AI teams, and every wasted token costs money. Yet most teams have no easy way to:

  • Estimate cost before making an API call
  • Compress prompts without breaking them
  • Route requests to cheaper models automatically
  • Enforce per-day or per-job token budgets
  • Detect cost drift across model upgrades

llm-token-optimizer fixes all of this — with a clean, provider-agnostic API.


Quickstart

from llm_token_optimizer import (
    optimize_prompt, CostEstimator, estimate_tokens,
)

prompt = """
Please note that you should summarize the following document.
As an AI language model, I'd be happy to help.
The quick brown fox jumps over the lazy dog.
The quick brown fox jumps over the lazy dog.
"""

# Step 1: Estimate cost before calling the LLM
estimator = CostEstimator()
estimate = estimator.estimate("gpt-4o", prompt, estimated_output_tokens=200)
print(f"Estimated cost: ${estimate.total_cost_usd:.6f}")
print(f"Input tokens: {estimate.input_tokens}")

# Step 2: Optimize the prompt
result = optimize_prompt(prompt, strategies=["whitespace", "fillers", "dedup"])
print(f"Tokens saved: {result.tokens_saved}")
print(f"Compression ratio: {result.compression_ratio:.2f}")
print(result.optimized_text)
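
To see the saving in dollars, the compressed prompt can be fed back into the same estimator. A minimal follow-up sketch, reusing only the calls shown above:

# Step 3: Re-estimate the optimized prompt
optimized_estimate = estimator.estimate(
    "gpt-4o", result.optimized_text, estimated_output_tokens=200
)
saved_usd = estimate.total_cost_usd - optimized_estimate.total_cost_usd
print(f"Estimated savings: ${saved_usd:.6f} per call")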

Built-in Optimization Strategies

Strategy     Description
whitespace   Collapse redundant spaces and blank lines
fillers      Remove filler phrases ("Please note that", "As an AI...")
dedup        Remove repeated paragraphs
examples     Trim few-shot examples to first N (default 3)
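
Strategies can also be applied one at a time to see where the savings come from; a minimal sketch, reusing optimize_prompt exactly as in the quickstart:

for strategy in ["whitespace", "fillers", "dedup", "examples"]:
    r = optimize_prompt(prompt, strategies=[strategy])
    print(f"{strategy}: {r.tokens_saved} tokens saved")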

Model Pricing (2026 catalog)

Pre-loaded pricing for OpenAI, Anthropic, Google, and Mistral:

from llm_token_optimizer import CostEstimator, ModelTier

estimator = CostEstimator()

# Compare models before choosing
results = estimator.compare_models(
    ["gpt-4o", "gpt-4o-mini", "claude-haiku-4-5-20251001"],
    prompt="Your prompt here",
    estimated_output_tokens=300,
)
for r in results:
    print(f"{r.model_id}: ${r.total_cost_usd:.6f}")

# Find cheapest in a tier
cheapest = estimator.cheapest_model(prompt, tier=ModelTier.CHEAP)
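
Since compare_models returns one estimate per model, picking the cheapest candidate from a comparison is a one-liner (using only the fields printed above):

cheapest_estimate = min(results, key=lambda r: r.total_cost_usd)
print(f"Cheapest: {cheapest_estimate.model_id} at ${cheapest_estimate.total_cost_usd:.6f}")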

Advanced Features

Caching (LRU + TTL + SHA-256)

from llm_token_optimizer.advanced import OptimizationCache

cache = OptimizationCache(max_size=1000, ttl=600)
memoized = cache.memoize(optimize_prompt)
result = memoized(prompt, ["whitespace", "fillers"])  # cached on second call
print(cache.stats())

Semantic Cache (cosine similarity)

from llm_token_optimizer.advanced import SemanticCache

sc = SemanticCache(threshold=0.92)
sc.put(prompt, result)
cached = sc.get("similar prompt text...")  # returns if similarity >= 0.92
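
In practice the cache is consulted before optimizing. A minimal sketch, assuming get() returns None on a miss (that miss behavior is an assumption, not documented above):

query = "similar prompt text..."
hit = sc.get(query)
if hit is None:  # assumed miss sentinel
    hit = optimize_prompt(query, strategies=["whitespace", "fillers"])
    sc.put(query, hit)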

Optimization Pipeline

from llm_token_optimizer.advanced import OptimizationPipeline

pipeline = (
    OptimizationPipeline()
    .map("strip", lambda t: t.strip())
    .filter("non_empty", lambda t: len(t) > 0)
    .branch(
        condition=lambda t: len(t) > 2000,
        true_fn=lambda t: t[:2000],
        false_fn=lambda t: t,
    )
    .with_retry("strip", retries=2)
)
optimized = pipeline.run(prompt)
print(pipeline.audit_log)

import asyncio
optimized = asyncio.run(pipeline.arun(prompt))

Declarative Token Constraints

from llm_token_optimizer.advanced import PromptConstraintValidator, PromptConstraint

validator = (
    PromptConstraintValidator()
    .add(PromptConstraint("context_limit", max_tokens=4096, model_id="gpt-4o"))
    .add(PromptConstraint("min_content", min_tokens=10, model_id="gpt-4o"))
)
violations = validator.validate(prompt)
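
A typical guard skips or flags prompts that break a constraint. A small sketch, assuming validate() returns an empty list when every constraint passes (that return convention is an assumption):

if violations:  # assumed: empty list means all constraints passed
    for v in violations:
        print(f"Constraint violation: {v}")
else:
    result = optimize_prompt(prompt, strategies=["whitespace", "fillers"])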

PII Scrubbing

from llm_token_optimizer.advanced import PIIScrubber

scrubber = PIIScrubber()
clean = scrubber.scrub("Contact: john@example.com, SSN: 123-45-6789")
# → "Contact: [EMAIL], SSN: [SSN]"

Rate Limiter (sync + async)

from llm_token_optimizer.advanced import RateLimiter
import asyncio

limiter = RateLimiter(rate=10, capacity=10)  # 10 calls/s
if limiter.acquire():
    result = optimize_prompt(prompt)

Async Batch Optimization

from llm_token_optimizer.advanced import abatch_optimize, batch_optimize
import asyncio

prompts = [prompt1, prompt2, prompt3]
results = asyncio.run(abatch_optimize(prompts, optimize_prompt, concurrency=8))
results = batch_optimize(prompts, optimize_prompt, max_workers=4)

Budget-Controlled Optimization

from llm_token_optimizer.advanced import optimize_with_budget

results = optimize_with_budget(prompts, optimize_prompt, budget_seconds=5.0)

Observability

from llm_token_optimizer.advanced import OperationProfiler, CostTelemetry, DriftDetector

# Timing profiler
profiler = OperationProfiler()
profiled = profiler.profile(optimize_prompt)
profiled(prompt)
print(profiler.report())

# Cost tracking
telemetry = CostTelemetry()
from llm_token_optimizer.models import TokenUsage
telemetry.record(TokenUsage(model_id="gpt-4o", input_tokens=500, output_tokens=100,
                            input_cost_usd=0.0025, output_cost_usd=0.0015, total_cost_usd=0.004))
print(telemetry.summary())
print(telemetry.by_model())

# Drift detection
drift_detector = DriftDetector(threshold=0.05)
drift_detector.set_baseline(result_v1)
drift = drift_detector.detect(result_v2)

Streaming

from llm_token_optimizer.advanced import stream_optimize, results_to_ndjson, results_to_csv

for result in stream_optimize(prompts, optimize_prompt):
    print(result.tokens_saved)

for line in results_to_ndjson(prompts, optimize_prompt):
    print(line)

csv_str = results_to_csv(results)

Diff & Regression Tracking

from llm_token_optimizer.advanced import diff_optimizations, RegressionTracker, ScoreTrend

diff = diff_optimizations(result_v1, result_v2)
print(diff.summary())
print(diff.to_json())

tracker = RegressionTracker(window=20)
tracker.record(result_v1)
tracker.record(result_v2)
print(tracker.trend())  # "improving" / "declining" / "stable"

trend = ScoreTrend(window=10)
trend.record(result.tokens_saved)
print(trend.trend(), trend.volatility())

Cost Ledger, Batch API Router, Model Router

from llm_token_optimizer.advanced import CostLedger, BatchAPIRouter, ModelRouter

# Hard budget enforcement
ledger = CostLedger(budget_usd=5.0)
ledger.record("gpt-4o", tokens=1000, cost_usd=0.005)
print(ledger.summary())  # raises BudgetExceededError if over budget

# 50% batch discount routing
router = BatchAPIRouter(latency_sensitive=False)
model_id, use_batch = router.route("gpt-4o", prompt)
effective_cost = router.effective_cost("gpt-4o", tokens=10000)

# Auto-route cheap vs. frontier
model_router = ModelRouter(cheap_token_threshold=500)
recommended_model = model_router.route(prompt)  # e.g. "gemini-2.0-flash" for short prompts
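
The router pairs naturally with the cost estimator: route first, then estimate on the recommended model. A sketch reusing only APIs shown earlier:

from llm_token_optimizer import CostEstimator

estimator = CostEstimator()
recommended_model = model_router.route(prompt)
estimate = estimator.estimate(recommended_model, prompt, estimated_output_tokens=300)
print(f"{recommended_model}: ${estimate.total_cost_usd:.6f}")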

Audit Log

from llm_token_optimizer.advanced import AuditLog

log = AuditLog()
log.log("optimize", {"tokens_saved": 150, "model": "gpt-4o"})
print(log.to_json())

Custom Model Pricing

from llm_token_optimizer import PricingRegistry, ModelPricing, ModelTier

registry = PricingRegistry()
registry.register(ModelPricing(
    model_id="my-fine-tuned-model",
    tier=ModelTier.STANDARD,
    input_cost_per_1k=0.002,
    output_cost_per_1k=0.006,
    context_window=32768,
    supports_batch=True,
    batch_discount=0.50,
))
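
As a sanity check, the registered numbers imply the following cost for a 10,000-token input / 2,000-token output job; this is plain arithmetic on the fields above, not a library call:

input_cost = 10_000 / 1000 * 0.002        # $0.020
output_cost = 2_000 / 1000 * 0.006        # $0.012
standard_cost = input_cost + output_cost  # $0.032
batch_cost = standard_cost * (1 - 0.50)   # $0.016 with the 50% batch discount
print(standard_cost, batch_cost)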

Installation

pip install llm-token-optimizer

# With exact tiktoken counting (optional):
pip install "llm-token-optimizer[tiktoken]"

Python 3.8+ · Minimal dependencies: stdlib plus pydantic (tiktoken optional for exact token counts)


License

MIT
