Reduce LLM costs by 90% - AI recommendations with NO API keys needed!
LLMOptimize
Cut your AI API costs — automatically. One import. Zero config. No API key required.
pip install llmoptimize
What It Does
LLMOptimize is a complete AI cost optimization SDK that silently watches every AI API call your code makes and surfaces actionable savings — without ever touching your prompts or responses.
- Zero setup: just `import llmoptimize`
- No API key needed for recommendations
- Never reads your prompt text — only token counts and model names
- Works with OpenAI, Anthropic, Groq and any framework built on them (LangChain, CrewAI, LlamaIndex, etc.)
What it analyzes:
| Feature | Description |
|---|---|
| 💰 Cost tracking | Real token usage → exact cost per call |
| 💡 Model recommendations | Heuristic + ML engine finds cheaper alternatives |
| 🔁 Loop detection | Catches agent loops before they drain your budget |
| 📚 RAG pattern detection | Identifies RAG pipelines and embedding savings |
| ⚡ Cache opportunities | Finds repeated prompts that should be cached |
| 🧠 ML model | Learns from your usage and improves over time |
| 🤖 Agent workflow | Multi-step tracking, context growth, step analytics |
| 📏 Context optimizer | Detects context window growth, compression tips |
| 🛡️ Security guardrails | Flags if API keys or sensitive data appear in prompts |
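The cost-tracking row comes down to multiplying token counts by a per-token price. A minimal sketch of that arithmetic, assuming a hypothetical price table: the model names `example-large` and `example-small` and their prices are placeholders, not real pricing and not LLMOptimize's own data.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per 1 million tokens (placeholders, not real pricing).
PRICES_PER_MILLION = {
    "example-large": {"input": 30.00, "output": 60.00},
    "example-small": {"input": 0.15, "output": 0.60},
}


@dataclass
class Usage:
    model: str
    prompt_tokens: int
    completion_tokens: int


def call_cost(usage: Usage) -> float:
    """Return the USD cost of one call from its token counts."""
    price = PRICES_PER_MILLION[usage.model]
    total = (
        usage.prompt_tokens * price["input"]
        + usage.completion_tokens * price["output"]
    )
    return total / 1_000_000


def savings_percent(usage: Usage, cheaper_model: str) -> float:
    """Percentage saved by running the same token counts on another model."""
    current = call_cost(usage)
    alternative = call_cost(
        Usage(cheaper_model, usage.prompt_tokens, usage.completion_tokens)
    )
    return 100.0 * (current - alternative) / current


usage = Usage("example-large", prompt_tokens=400, completion_tokens=120)
print(f"cost per call: ${call_cost(usage):.4f}")
print(f"savings on example-small: {savings_percent(usage, 'example-small'):.1f}%")
```

Input and output tokens are priced separately because most providers bill them at different rates.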
Quickstart (2 lines)
import llmoptimize # ← add this at the top
import openai
client = openai.OpenAI()
response = client.chat.completions.create(
model = "gpt-4",
messages = [{"role": "user", "content": "Summarize this article..."}],
)
print(response.choices[0].message.content) # your real output, unchanged
llmoptimize.report() # ← add this at the end
That's it. Your code runs exactly as before. At the end you'll see a full cost + optimization report.
What the Report Shows
Running llmoptimize.report() produces a full interactive terminal report covering:
╔══════════════════════════════════════════════════════════════╗
║ 🚀 L L M O P T I M I Z E R E P O R T 🚀 ║
╚══════════════════════════════════════════════════════════════╝
📊 YOUR USAGE SUMMARY
🚀 Total API Calls Tracked 3
💰 Total Cost $0.0041
💎 Potential Savings $0.0039 (94% less!)
📋 USAGE BY TYPE
💬 Chat 2 calls $0.0040 → gpt-4o-mini (saves 94%)
📚 Embedding 1 call $0.0001 → text-embedding-3-small (saves 80%)
💡 PERSONALIZED RECOMMENDATIONS
╭────────────────────────────────────────────────────╮
│ #1 Switch to: gpt-4o-mini 💰 Save 94% │
│ Why: You called gpt-4 2x — gpt-4o-mini costs 94% │
│ less, saves ~$0.18 per 1,000 calls │
│ Fix: model="gpt-4" → "gpt-4o-mini" │
╰────────────────────────────────────────────────────╯
🤖 AGENT WORKFLOW ANALYSIS
Steps tracked: 3
Models used: 2 (multi-model workflow)
Avg tokens/step: 420
Context growth: 1.2x
💡 Multi-model workflow — heuristic engine can recommend
the cheapest model per step automatically
⚡ CACHING OPPORTUNITIES
Found 1 repeated prompt pattern — caching could save ~$0.0008
🧠 ML MODEL STATUS
ML collecting training data (12 samples — activates after 50+)
Track more calls to unlock ML-powered model selection
📏 CONTEXT WINDOW OPTIMIZER
Prompt sizes stable (avg 380 tokens) — context is well-managed
🔧 SDK UTILITIES
llmoptimize.select_model(code) → pick cheapest Groq model
llmoptimize.check_loop(actions) → detect agent loops
llmoptimize.analyze(prompt, model) → instant recommendation
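The caching line in the report above relies on spotting prompts that recur. One way this could work, sketched under the assumption that prompts are compared by a hash of their normalised text so the raw text need not be kept (the function names are hypothetical, not part of the SDK):

```python
import hashlib
from collections import Counter


def fingerprint(prompt: str) -> str:
    """Hash a whitespace- and case-normalised prompt."""
    normalised = " ".join(prompt.split()).lower()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def repeated_prompts(prompts: list[str]) -> dict[str, int]:
    """Map fingerprint -> count for every prompt seen more than once."""
    counts = Counter(fingerprint(p) for p in prompts)
    return {fp: n for fp, n in counts.items() if n > 1}


calls = ["Summarize this.", "summarize   this.", "Translate this."]
repeats = repeated_prompts(calls)
print(len(repeats), "repeated prompt pattern(s)")
```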
Dry-Run Mode — Plan Costs Before Spending
Test your full code flow and get savings advice before spending a dollar.
Wrap your code in a `with llmoptimize.report:` block: real API calls are intercepted,
mock responses are returned so your code runs fully, and the report prints on exit.
import llmoptimize
import openai
client = openai.OpenAI(api_key="anything") # not used in dry-run
with llmoptimize.report:
    # No real API calls: mock responses are returned automatically
    client.embeddings.create(
        model = "text-embedding-3-large",
        input = ["RAG systems retrieve relevant documents."],
    )
    client.chat.completions.create(
        model = "gpt-4",
        messages = [{"role": "user", "content": "Summarize this."}],
    )
# Report prints automatically when the block exits
When you're ready to go live, remove the `with llmoptimize.report:` line, dedent its body, and pass a real API key. No other changes are needed.
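To illustrate the interception idea, here is a minimal sketch of a dry-run context manager built on `unittest.mock`. It is not LLMOptimize's implementation: `ChatClient` is a hypothetical stand-in for a provider SDK client, and `dry_run` is an illustrative name.

```python
from contextlib import contextmanager
from unittest import mock


class ChatClient:
    """Hypothetical stand-in for a provider SDK client."""

    def create(self, model: str, messages: list) -> dict:
        raise RuntimeError("would hit the network")


@contextmanager
def dry_run(recorded: list):
    """Swap ChatClient.create for a mock that records calls and returns a canned reply."""

    def fake_create(self, model, messages):
        recorded.append({"model": model, "messages": len(messages)})
        return {"choices": [{"message": {"content": "[mock response]"}}]}

    # patch.object restores the original method when the block exits.
    with mock.patch.object(ChatClient, "create", fake_create):
        yield recorded


calls = []
with dry_run(calls):
    reply = ChatClient().create(
        model="gpt-4", messages=[{"role": "user", "content": "Hi"}]
    )

print(reply["choices"][0]["message"]["content"])
print(calls)
```

Because the patch is scoped to the `with` block, the same client code makes real calls once the block is removed.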
Named Task Sessions
Use llmoptimize.task() to get a separate labelled report per pipeline stage.
Each block gets a clean slate, its own label, and optional dry-run mode.
import llmoptimize
import openai
client = openai.OpenAI()
# Track real costs per stage
with llmoptimize.task("rag-pipeline"):
    chunks = client.embeddings.create(model="text-embedding-3-large", input=["..."])
    summary = client.chat.completions.create(model="gpt-4", messages=[...])
# Plan costs before shipping: no real API calls
with llmoptimize.task("cost-planning", dry_run=True):
    client.chat.completions.create(model="gpt-4", messages=[...])
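The "clean slate plus label" behaviour can be pictured as a context manager that resets a tracker on entry and files a summary under the task name on exit. A rough sketch under that assumption; `Tracker`, `task`, and `reports` here are hypothetical and only mimic the shape of the feature:

```python
from contextlib import contextmanager


class Tracker:
    """Hypothetical in-memory call log."""

    def __init__(self):
        self.calls = []

    def track(self, model, prompt_tokens, completion_tokens):
        self.calls.append((model, prompt_tokens + completion_tokens))

    def reset(self):
        self.calls.clear()


tracker = Tracker()
reports = {}


@contextmanager
def task(name):
    tracker.reset()  # each block starts from a clean slate
    try:
        yield tracker
    finally:
        # Summarise only what was tracked inside this block.
        reports[name] = {
            "calls": len(tracker.calls),
            "tokens": sum(tokens for _, tokens in tracker.calls),
        }


with task("rag-pipeline") as t:
    t.track("text-embedding-3-large", 50, 0)
    t.track("gpt-4", 400, 120)

with task("cost-planning") as t:
    t.track("gpt-4", 100, 20)

print(reports)
```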
SDK Utility Functions
These work anywhere in your code — no extra setup, no API key.
Instant Model Recommendation
result = llmoptimize.analyze(
prompt = "Classify this support ticket as urgent or normal: ...",
model = "gpt-4o",
)
# result["recommendation"]["suggested_model"] → "gpt-4o-mini"
# result["recommendation"]["estimated_savings_percent"] → 96
# result["recommendation"]["reasoning"] → "Classification task — cheaper models maintain 95%+ accuracy"
Smart Model Selector for Code Tasks
Automatically picks the cheapest Groq model that can handle your code complexity. Saves up to 84% vs always using the 70B model.
result = llmoptimize.select_model("""
Extract the user name and email from this JSON string.
Return as a Python dict.
""")
# result["selected_model"] → "llama-3.1-8b-instant"
# result["complexity_level"] → "simple"
# result["model_info"]["best_for"] → "Simple scripts, single API calls, basic logic"
# result["vs_heavy_model"]["savings_pct"] → 84
# result["vs_heavy_model"]["message"] → "Simple task — llama-3.1-8b-instant is 84% cheaper..."
| Complexity | Model Selected | Use Case |
|---|---|---|
| simple | llama-3.1-8b-instant | Single API calls, basic logic, extraction |
| medium | openai/gpt-oss-20b | Multi-step workflows, moderate complexity |
| complex | llama-3.3-70b-versatile | Complex agents, advanced reasoning |
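The mapping in the table can be expressed as a lookup once a complexity level is known. The sketch below assumes a simple keyword rule to pick the level; the hint words and the `classify` and `pick_model` names are hypothetical, since the real selector's rules are not described here.

```python
# Complexity -> model, taken from the table above.
MODEL_BY_COMPLEXITY = {
    "simple": "llama-3.1-8b-instant",
    "medium": "openai/gpt-oss-20b",
    "complex": "llama-3.3-70b-versatile",
}

# Hypothetical signal words, for illustration only.
COMPLEX_HINTS = ("agent", "reasoning", "planner")
MEDIUM_HINTS = ("workflow", "pipeline", "multi-step")


def classify(task: str) -> str:
    """Return 'simple', 'medium', or 'complex' from keyword hints."""
    text = task.lower()
    if any(hint in text for hint in COMPLEX_HINTS):
        return "complex"
    if any(hint in text for hint in MEDIUM_HINTS):
        return "medium"
    return "simple"


def pick_model(task: str) -> dict:
    level = classify(task)
    return {"complexity_level": level, "selected_model": MODEL_BY_COMPLEXITY[level]}


print(pick_model("Extract the user name and email from this JSON string."))
print(pick_model("Build a multi-step workflow that retries failed uploads."))
```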
Agent Loop Detection
Catches repetitive agent behavior before it drains your budget.
result = llmoptimize.check_loop([
"search web for python docs",
"read python docs",
"search web for python docs", # repeated!
"read python docs",
"search web for python docs", # repeated again!
])
# result["loop_detected"] → True
# result["loops"][0]["pattern_type"] → "exact_repeat"
# result["loops"][0]["severity"] → "warning"
# result["loops"][0]["recommendation"] → "Add a stop condition..."
# result["message"] → "⚠️ 1 loop pattern detected..."
Detects:
- Exact repeats — same action 3+ times
- Circular patterns — A→B→C→A→B→C
- Alternating loops — A→B→A→B
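The three rules above can be sketched locally in a few lines. This is an illustration, not the server's detector: `detect_loops` is a hypothetical name, and the `"alternating"` and `"circular"` labels are assumptions (only `"exact_repeat"` appears in the output shown earlier).

```python
from collections import Counter


def has_cycle(actions, period, repeats=2):
    """True if some contiguous window is a block of `period` distinct actions repeated `repeats` times."""
    span = period * repeats
    for start in range(len(actions) - span + 1):
        window = actions[start:start + span]
        block = window[:period]
        if len(set(block)) == period and window == block * repeats:
            return True
    return False


def detect_loops(actions):
    loops = []
    # Exact repeats: the same action appears 3 or more times.
    for action, count in Counter(actions).items():
        if count >= 3:
            loops.append({"pattern_type": "exact_repeat", "action": action})
    # Alternating loops: A, B, A, B.
    if has_cycle(actions, period=2):
        loops.append({"pattern_type": "alternating"})
    # Circular patterns: a block of 3 or more distinct actions repeated back to back.
    if any(has_cycle(actions, p) for p in range(3, len(actions) // 2 + 1)):
        loops.append({"pattern_type": "circular"})
    return {"loop_detected": bool(loops), "loops": loops}


result = detect_loops([
    "search web for python docs",
    "read python docs",
    "search web for python docs",
    "read python docs",
    "search web for python docs",
])
print(result["loop_detected"], [loop["pattern_type"] for loop in result["loops"]])
```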
LangChain & CrewAI
No changes needed to your agents or chains. Just add import llmoptimize at the top — all LLM calls inside chains and agents are tracked automatically.
import llmoptimize # ← one line at the top
from langchain_openai import ChatOpenAI
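from langchain.chains import LLMChain
from langchain_core.prompts import PromptTemplate
llm = ChatOpenAI(model="gpt-4")
my_prompt = PromptTemplate.from_template("{input}")  # minimal prompt with one variable
chain = LLMChain(llm=llm, prompt=my_prompt)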
chain.invoke({"input": "..."})
llmoptimize.report() # see exactly what the chain spent and how to cut it
import llmoptimize # ← one line at the top
from crewai import Agent, Task, Crew
researcher = Agent(role="Researcher", llm="gpt-4", ...)
crew = Crew(agents=[researcher], tasks=[...])
crew.kickoff()
llmoptimize.report() # full agent workflow analysis included
CLI — Audit a File Before Running It
No code changes needed. Point it at any Python file:
llmoptimize audit mycode.py
╔════════════════════════════════════════════════════════════════╗
║ 🤖 AI CODE AUDIT REPORT ║
╚════════════════════════════════════════════════════════════════╝
📄 File: mycode.py
📊 SUMMARY
API calls found: 7
Issues detected: 4
Models used: gpt-4, claude-3-opus
Est. monthly cost: $342 (at 1,000 runs/month)
Potential savings: $298 (87%)
🔍 RECOMMENDATIONS
🔴 Line 42 — claude-3-opus
Switch to: claude-3-5-haiku | saves 95%
Why: Classification task — claude-3-5-haiku is 18x cheaper
with comparable accuracy for structured output.
Options:
llmoptimize audit mycode.py # full report
llmoptimize audit mycode.py --quiet # one-line summary
llmoptimize audit mycode.py --force # skip cache, always re-analyze
llmoptimize stats # show cache statistics
llmoptimize clear-cache # clear cached results
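A static audit of this kind can be approximated with Python's `ast` module, which parses a file without running it. The sketch below only finds calls that pass `model="..."` as a string literal; it is an assumption about how such a scan might start, not the CLI's actual logic, and `find_model_arguments` is a hypothetical helper.

```python
import ast


def find_model_arguments(source: str) -> list[tuple[int, str]]:
    """Return (line number, model name) for each call passing model="..." as a string literal."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        for kw in node.keywords:
            value = kw.value
            if (
                kw.arg == "model"
                and isinstance(value, ast.Constant)
                and isinstance(value.value, str)
            ):
                found.append((value.lineno, value.value))
    return sorted(found)


sample = '''
import openai
client = openai.OpenAI()
client.chat.completions.create(model="gpt-4", messages=[])
client.embeddings.create(model="text-embedding-3-large", input=["x"])
'''

for line, model in find_model_arguments(sample):
    print(f"line {line}: {model}")
```

Model names held in variables or config files would need extra resolution steps that this sketch does not attempt.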
Supported Providers
import llmoptimize automatically patches every AI library you have installed.
| Provider | Library | Chat | Embeddings |
|---|---|---|---|
| OpenAI | openai | ✅ | ✅ |
| Anthropic | anthropic | ✅ | n/a |
| Groq | groq | ✅ | n/a |
Pricing data for 60+ models: OpenAI, Anthropic, Groq, Gemini, Mistral, Cohere, Voyage AI, Jina AI, AWS Bedrock.
How Recommendations Work
Recommendations use a 3-layer engine — never just the cheapest model:
1. Heuristic engine (5ms)
↓ Keyword-based task detection (classification → gpt-4o-mini, etc.)
2. ML model (10ms)
↓ Trained on real accept/reject decisions from all SDK users
↓ Learns which models work best per prompt category + complexity
3. Crowd-sourced patterns (instant)
↓ Global anonymised data: which model won for this task type?
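The first layer, keyword-based task detection, might look roughly like the sketch below. Only the classification → gpt-4o-mini pairing comes from the text above; the keyword lists and the `detect_task` and `suggest` names are hypothetical.

```python
# Hypothetical keyword table, for illustration only.
TASK_KEYWORDS = {
    "classification": ("classify", "categorize", "label"),
    "summarization": ("summarize", "summarise"),
}

# Only this pairing is stated in the text above.
SUGGESTED_MODEL = {"classification": "gpt-4o-mini"}


def detect_task(prompt: str):
    """Return the first task category whose keyword appears in the prompt, else None."""
    text = prompt.lower()
    for task, keywords in TASK_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return task
    return None


def suggest(prompt: str, current_model: str) -> dict:
    task = detect_task(prompt)
    # Fall back to the current model when no rule matches.
    return {"task": task, "suggested_model": SUGGESTED_MODEL.get(task, current_model)}


print(suggest("Classify this support ticket as urgent or normal", "gpt-4o"))
print(suggest("Write a poem about autumn", "gpt-4o"))
```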
Capability tiers are respected — you'll never see a recommendation that drops more than one quality tier:
| Tier | Examples |
|---|---|
| Frontier | gpt-4, claude-3-opus, o1 |
| Strong | gpt-4o, claude-3-5-sonnet, gemini-1.5-pro |
| Capable | gpt-4o-mini, claude-3-haiku, gemini-1.5-flash |
| Lightweight | gemini-1.5-flash-8b, llama-3.1-8b-instant |
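The tier rule can be read as a guard on candidate recommendations. A minimal sketch using the tier table above, assuming "drops more than one tier" means a difference in tier position greater than one; `is_allowed` is a hypothetical helper name.

```python
# Tiers ordered from most to least capable, as in the table above.
TIERS = ["Frontier", "Strong", "Capable", "Lightweight"]

MODEL_TIER = {
    "gpt-4": "Frontier", "claude-3-opus": "Frontier", "o1": "Frontier",
    "gpt-4o": "Strong", "claude-3-5-sonnet": "Strong", "gemini-1.5-pro": "Strong",
    "gpt-4o-mini": "Capable", "claude-3-haiku": "Capable", "gemini-1.5-flash": "Capable",
    "gemini-1.5-flash-8b": "Lightweight", "llama-3.1-8b-instant": "Lightweight",
}


def tier_drop(current: str, candidate: str) -> int:
    """Number of tiers the candidate sits below the current model (negative if above)."""
    return TIERS.index(MODEL_TIER[candidate]) - TIERS.index(MODEL_TIER[current])


def is_allowed(current: str, candidate: str, max_drop: int = 1) -> bool:
    """Reject candidates that fall more than `max_drop` tiers below the current model."""
    return tier_drop(current, candidate) <= max_drop


print(is_allowed("gpt-4o", "gpt-4o-mini"))
print(is_allowed("gpt-4", "llama-3.1-8b-instant"))
```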
Session Management
llmoptimize.new_session() # clear tracking, start fresh
llmoptimize.report(interactive=False) # no menu prompt — useful in scripts
Jupyter / VS Code Interactive Window note: The Python kernel stays alive between cells, so `llmoptimize` accumulates calls across all cells. Call `llmoptimize.new_session()` before each test run:
import llmoptimize
llmoptimize.new_session() # ← reset before testing a new model
resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
llmoptimize.report()
Regular `.py` scripts reset automatically on each run.
Manual Tracking
For custom or self-hosted models not auto-patched:
llmoptimize.track(
model = "my-custom-model",
prompt_tokens = 400,
completion_tokens = 120,
provider = "custom",
)
llmoptimize.report()
Free Tier & License
LLMOptimize includes 500 free tracked calls per machine.
Activate a paid license
llmoptimize activate llmopt-xxxxxxxxxxxx
# ✅ License activated! Plan: starter | 500 calls/month
# Valid through: 2026-04
Remove a license
llmoptimize deactivate
# ✅ License removed. Free tier limits restored.
For servers / containers
export AIOPTIMIZE_LICENSE_KEY="llmopt-xxxxxxxxxxxx"
Privacy
| Data | Stored locally | Sent to server |
|---|---|---|
| Your full prompt text | Never | Never |
| Token counts | Yes | Yes (anonymised) |
| Model names | Yes | Yes |
| Cost figures | Yes | Yes |
| API keys | Never stored | Never sent |
Only the first 100 chars of your prompt are optionally sent for category classification (e.g. "classification" vs "summarization") — never the full text, never stored.
To disable server tracking entirely:
export AIOPTIMIZE_SERVER_URL=""
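To make the privacy rules concrete, here is a sketch of how an outgoing payload could be assembled: token counts and the model name always, at most the first 100 characters of the prompt when opted in, and nothing at all when the server URL is blank. The payload field names, the `build_payload` helper, and the treatment of an unset variable as "use the default server" are all assumptions.

```python
import os

PREFIX_LIMIT = 100  # from the text above: at most the first 100 characters


def build_payload(model, prompt, prompt_tokens, completion_tokens,
                  send_prefix=False, env=os.environ):
    """Return the dict that would be sent, or None when server tracking is disabled."""
    # An explicitly empty AIOPTIMIZE_SERVER_URL disables sending.
    if env.get("AIOPTIMIZE_SERVER_URL") == "":
        return None
    payload = {
        "model": model,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
    }
    if send_prefix:
        # Never the full text: only a truncated prefix for category classification.
        payload["prompt_prefix"] = prompt[:PREFIX_LIMIT]
    return payload


long_prompt = "Classify this support ticket: " + "x" * 500
enabled = build_payload("gpt-4o", long_prompt, 130, 5, send_prefix=True, env={})
disabled = build_payload("gpt-4o", long_prompt, 130, 5,
                         env={"AIOPTIMIZE_SERVER_URL": ""})
print(len(enabled["prompt_prefix"]), disabled)
```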
Full API Reference
# ── Core ──────────────────────────────────────────────────────
import llmoptimize
llmoptimize.report() # interactive report
llmoptimize.report(interactive=False) # plain text, no menu
llmoptimize.new_session() # reset all tracking
llmoptimize.track(model, prompt_tokens, completion_tokens)
# ── Dry-run / named sessions ──────────────────────────────────
with llmoptimize.report: # dry-run + report on exit
...
with llmoptimize.report(interactive=False):
...
with llmoptimize.task("pipeline-name"): # named session, live calls
...
with llmoptimize.task("plan", dry_run=True): # dry-run + labelled report
...
# ── Smart utilities ───────────────────────────────────────────
result = llmoptimize.analyze(prompt, model) # instant recommendation
result = llmoptimize.select_model(code) # pick cheapest Groq model
result = llmoptimize.check_loop(actions) # detect agent loops
FAQ
Do I need to configure anything?
No. import llmoptimize is all the setup required.
Will it slow down my app?
No. Tracking happens after your response is returned and never blocks the critical path. All server calls are fire-and-forget.
What if the recommendation server is unreachable?
It falls back to local pricing data instantly. Your app is never affected.
Does it work with LangChain / LlamaIndex / CrewAI?
Yes. They all use the underlying OpenAI/Anthropic/Groq SDKs, which are patched automatically.
Does it work with streaming?
Yes. Token counts are recorded from the final usage block after streaming completes.
Can I use it without an API key at all?
Yes — use dry_run=True or with llmoptimize.report:. Your code runs end-to-end with mock responses. No API key, no cost, full recommendations.
What's the difference between task() and report()?
task("name") resets the session first and labels the output. report() shows everything tracked since the last reset. Use task() when benchmarking specific pipeline stages.
What does select_model() do?
It sends your code/task description to the server's GroqModelSelector — a hybrid rule-based + caching engine that picks the cheapest Groq model capable of handling the complexity. No API key needed.
What does check_loop() do?
It sends a list of action strings to the server's LoopDetector, which runs rule-based checks for exact repeats, circular patterns (A→B→C→A), and alternating loops. Returns which steps are looping and a recommendation to fix it.
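The "fire-and-forget" behaviour mentioned in the FAQ is commonly built with a queue and a background thread, so the caller only pays for an enqueue. A minimal sketch of that pattern, not the SDK's code; `track_async` and the `delivered` list (standing in for a network send) are hypothetical.

```python
import queue
import threading

events = queue.Queue()
delivered = []  # stand-in for "sent to the server"


def _worker():
    while True:
        event = events.get()
        try:
            delivered.append(event)  # a real worker would do the network send here
        except Exception:
            pass  # telemetry failures must never reach the caller
        finally:
            events.task_done()


# Daemon thread: it never keeps the program alive on exit.
threading.Thread(target=_worker, daemon=True).start()


def track_async(model, prompt_tokens, completion_tokens):
    """Enqueue and return immediately; the caller never waits on delivery."""
    events.put_nowait({"model": model, "tokens": prompt_tokens + completion_tokens})


track_async("gpt-4", 400, 120)
events.join()  # only for this demo, so the print below is deterministic
print(delivered)
```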
LLMOptimize v3.3.0 — spend less, build more.
File details
Details for the file llmoptimize-3.3.0.tar.gz.
File metadata
- Download URL: llmoptimize-3.3.0.tar.gz
- Upload date:
- Size: 31.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e8dfd1576f1805bfd05af1a660b0962c2a62217c685e4c55124e69fdf64d4ed0 |
| MD5 | ef6d24ca259ffff20fcd1e681426a42f |
| BLAKE2b-256 | 096ebb0991a835217c86f68c693fdd3b12cd8dd943a810bdcc0b542c15f32751 |
File details
Details for the file llmoptimize-3.3.0-py3-none-any.whl.
File metadata
- Download URL: llmoptimize-3.3.0-py3-none-any.whl
- Upload date:
- Size: 30.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6306e40e8edb09fa6e77043c5e1d11713d91a2f209f5cfb601a44251830750d1 |
| MD5 | b9b4d407290fe4861f6bd7198da93008 |
| BLAKE2b-256 | 786502a36d8d2c8cd1ea3fa5d390ddd291f10d3d882729d02fc325ed4f43650c |