Automatically find cheaper LLM alternatives while maintaining performance
LLMuxer
Find the cheapest LLM that meets your quality bar (Currently supports classification tasks only)
Quick Start
import llmuxer
# Example: Classify sentiment with 90% accuracy requirement
examples = [
    {"input": "This product is amazing!", "label": "positive"},
    {"input": "Terrible service", "label": "negative"},
    {"input": "It's okay", "label": "neutral"}
]
result = llmuxer.optimize_cost(
    baseline="gpt-4",
    examples=examples,
    task="classification",  # Currently only classification is supported
    options=["positive", "negative", "neutral"],
    min_accuracy=0.9  # Require 90% accuracy
)
print(result)
# Takes ~30-60 seconds for small datasets, ~10-15 minutes for 1k samples
Example Output
{
    "model": "anthropic/claude-3-haiku",
    "accuracy": 0.92,
    "cost_per_million": 0.25,
    "cost_savings": 0.975,  # 97.5% cheaper than GPT-4
    "baseline_cost_per_million": 10.0,
    "tokens_evaluated": 1500
}
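The `cost_savings` value in the sample output is consistent with one minus the ratio of the two per-million prices. The sketch below reproduces that arithmetic; the function name is hypothetical and the formula is inferred from the sample numbers, not taken from the library's source.

```python
def cost_savings(candidate_cost_per_million: float,
                 baseline_cost_per_million: float) -> float:
    """Fraction of cost saved by switching from the baseline to the candidate.

    Assumption: savings = 1 - candidate / baseline, which matches the sample output.
    """
    if baseline_cost_per_million <= 0:
        raise ValueError("baseline cost must be positive")
    return 1.0 - candidate_cost_per_million / baseline_cost_per_million


# Values from the example output: $0.25 vs $10.00 per million tokens.
savings = cost_savings(0.25, 10.0)
print(f"{savings:.1%}")  # prints: 97.5%
```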
The Problem
You're using GPT-4 for classification. It works well but costs $20/million tokens. Could GPT-3.5 do just as well for $0.50? What about Claude Haiku at $0.25? Or Llama-3.1 at $0.06?
LLMuxer automatically tests your classification task across 18 models to find the cheapest one that maintains your required accuracy.
How It Works
Your Dataset → LLMuxer → Tests 18 Models → Returns Cheapest That Works
↓
Uses OpenRouter API
(unified interface)
LLMuxer:
- Takes your baseline model (e.g., GPT-4) and test dataset
- Evaluates cheaper alternatives via OpenRouter
- Returns the cheapest model meeting your accuracy threshold
- Shows detailed cost breakdown and savings
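The selection step described above can be sketched as a filter followed by a minimum. This is an illustration only: `pick_cheapest`, the result dictionaries, and the placeholder model names and numbers are assumptions, not LLMuxer's internal code or measured results.

```python
from typing import Dict, List, Optional


def pick_cheapest(results: List[Dict], min_accuracy: float) -> Optional[Dict]:
    """Return the lowest-cost result whose accuracy meets the threshold, or None."""
    qualifying = [r for r in results if r["accuracy"] >= min_accuracy]
    if not qualifying:
        return None
    return min(qualifying, key=lambda r: r["cost_per_million"])


# Made-up evaluation results for three placeholder models.
results = [
    {"model": "model-a", "accuracy": 0.95, "cost_per_million": 2.00},
    {"model": "model-b", "accuracy": 0.92, "cost_per_million": 0.25},
    {"model": "model-c", "accuracy": 0.81, "cost_per_million": 0.06},
]

best = pick_cheapest(results, min_accuracy=0.9)
print(best["model"])  # prints: model-b
```

The cheapest model overall (`model-c`) is rejected because it falls below the accuracy threshold, so the cheapest qualifying model wins.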
Installation
Prerequisites
- Python 3.8+
- OpenRouter API key (for model access)
Install
pip install llmuxer
Setup
export OPENROUTER_API_KEY="your-api-key-here"
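Before starting a long run, it can help to confirm from Python that the key is visible to the process. The helper below is a hypothetical convenience, not part of the llmuxer API; it only reads the `OPENROUTER_API_KEY` variable named above.

```python
import os
from typing import Mapping


def get_openrouter_key(env: Mapping[str, str] = os.environ) -> str:
    """Return the OpenRouter key from the environment, or fail with a clear message."""
    key = env.get("OPENROUTER_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENROUTER_API_KEY is not set; export it before calling llmuxer"
        )
    return key
```

Calling `get_openrouter_key()` at the top of a script surfaces a missing key immediately instead of partway through an evaluation.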
Key Features
- 18 models tested - OpenAI, Anthropic, Google, Meta, Mistral, Qwen, DeepSeek
- Smart stopping - Skips smaller models if larger ones fail
- Cost breakdown - See token counts and costs per model
- Fast testing - Use sample_size to test on a subset first
- Simple API - One function does everything
- Classification only - Support for extraction, generation, and binary tasks coming in v0.2
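One plausible reading of "smart stopping" is sketched below: within a model family, ordered largest first, evaluation stops as soon as one model misses the threshold. The grouping by family, the function name, and the placeholder models and scores are all assumptions; the source does not describe the actual rule.

```python
from typing import Callable, Dict, List


def evaluate_with_smart_stopping(
    families: Dict[str, List[str]],
    evaluate: Callable[[str], float],
    min_accuracy: float,
) -> Dict[str, float]:
    """Evaluate models family by family; each list is ordered largest first."""
    accuracies: Dict[str, float] = {}
    for models in families.values():
        for model in models:
            accuracy = evaluate(model)
            accuracies[model] = accuracy
            if accuracy < min_accuracy:
                # Assumption: smaller siblings are unlikely to do better, so skip them.
                break
    return accuracies


# Placeholder families and made-up scores standing in for real API evaluations.
families = {
    "family-a": ["a-large", "a-small"],
    "family-b": ["b-large", "b-small"],
}
fake_scores = {"a-large": 0.95, "a-small": 0.91, "b-large": 0.70, "b-small": 0.60}

tested = evaluate_with_smart_stopping(families, fake_scores.__getitem__, 0.9)
print(sorted(tested))  # prints: ['a-large', 'a-small', 'b-large']
```

`b-small` is never evaluated because `b-large` already failed, which is where the time savings would come from.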
Benchmarks
Tested Models
Live pricing data from OpenRouter API (updated automatically):
| Provider | Models | Price Range ($/M tokens) |
|---|---|---|
| OpenAI | gpt-4o-mini, gpt-3.5-turbo | $0.75 - $2.00 |
| Anthropic | claude-3-haiku | $1.50 |
| DeepSeek | deepseek-chat | $0.90 |
| Mistral | 3 models | $0.08 - $8.00 |
| Meta | llama-3.1-8b-instruct, llama-3.1-70b-instruct | $0.04 - $0.38 |
Total: 9 models across 5 providers
Reproduce Our Benchmarks
# Test all 9 models on Banking77 dataset
python scripts/prepare_banking77.py
python examples/banking77_test.py
Expected Results: Most models achieve 85-92% accuracy on Banking77. Claude-3-haiku typically provides the best accuracy/cost ratio for classification tasks.
Performance Benchmarks
Fixed Dataset Results (50 job classification samples, tested 2025-08-10)
| Metric | Baseline (GPT-4o) | Best Model (Claude-3-haiku) | Savings |
|---|---|---|---|
| Accuracy | ~95% (assumed) | 92.0% | Quality maintained |
| Cost/Million Tokens | $12.50 | $1.50 | 88.0% cheaper |
| Cost/Request* | $0.001875 | $0.000225 | $0.00165 saved |
| Monthly (1K requests) | $1.88 | $0.23 | $1.65 saved |
*Conservative estimate: 150 tokens/request (100 input + 50 output)
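The per-request and monthly rows follow directly from the per-million price and the 150-token assumption. A short sketch of that arithmetic, using a hypothetical helper name and the prices from the table:

```python
def cost_per_request(price_per_million_tokens: float,
                     tokens_per_request: int = 150) -> float:
    """Cost of one request at a single blended price per million tokens."""
    return price_per_million_tokens * tokens_per_request / 1_000_000


baseline_cost = cost_per_request(12.50)   # GPT-4o row in the table
candidate_cost = cost_per_request(1.50)   # Claude-3-haiku row in the table

print(f"per request: ${baseline_cost:.6f} vs ${candidate_cost:.6f}")
print(f"saved per 1K requests: ${(baseline_cost - candidate_cost) * 1000:.2f}")
```

Changing `tokens_per_request` shows how sensitive the absolute savings are to request size, while the 88% relative saving stays the same.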
📊 Full Benchmark Report | 🔄 Reproduction Guide
Reproduction
# Install and setup
pip install llmuxer
export OPENROUTER_API_KEY="your-key"
# Run exact benchmark
./scripts/bench.sh
# Generates: benchmarks/bench_YYYYMMDD.json + docs/benchmarks.md
Benchmark Notes:
- Fixed dataset: data/jobs_50.jsonl (8 categories, 50 samples)
- Pinned models: 8 specific models with exact API versions
- Conservative estimates: 150 tokens/request assumption
- No cherry-picking: Single test run results
- Quality threshold: 85%+ accuracy required
API Reference
optimize_cost()
Find the cheapest model meeting your requirements for classification tasks.
Parameters:
- baseline (str): Your current model (e.g., "gpt-4")
- examples (list): Test examples with input and label
- dataset (str): Path to JSONL file (alternative to examples)
- task (str): Must be "classification" (other tasks coming soon)
- options (list): Valid output classes for classification
- min_accuracy (float): Minimum acceptable accuracy (0.0-1.0)
- sample_size (float): Fraction of dataset to test (0.0-1.0)
- prompt (str): Optional system prompt
Returns: Dictionary with model name, accuracy, cost, and savings.
Error Handling:
- Returns {"error": "message"} if no model meets threshold
- Retries on API failures
- Validates dataset format
Full Example: Banking Intent Classification
import llmuxer
# Using the Banking77 dataset (77 intent categories)
baseline = "gpt-4"  # Defined once so it can be reused in the summary below
result = llmuxer.optimize_cost(
    baseline=baseline,
    dataset="data/banking77.jsonl",  # Your prepared dataset
    task="classification",
    min_accuracy=0.8,
    sample_size=0.2  # Test on 20% first for speed
)
if "error" in result:
    print(f"No model found: {result['error']}")
else:
    print(f"Switch from {baseline} to {result['model']}")
    print(f"Save {result['cost_savings']:.0%} on costs")
    print(f"Accuracy: {result['accuracy']:.1%}")
Dataset Format
JSONL format with input and label fields:
{"input": "What's my account balance?", "label": "balance_inquiry"}
{"input": "I lost my card", "label": "card_lost"}
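A dataset file can be checked before it is passed to `optimize_cost()`. The loader below is a minimal sketch, assuming one JSON object per line with the `input` and `label` fields shown above; `load_examples` is a hypothetical helper, not a function exported by llmuxer.

```python
import json
import os
import tempfile
from typing import Dict, List


def load_examples(path: str) -> List[Dict[str, str]]:
    """Read a JSONL file and check that every record has 'input' and 'label'."""
    examples = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            if not isinstance(record, dict):
                raise ValueError(f"line {line_no}: expected a JSON object")
            missing = {"input", "label"} - set(record)
            if missing:
                raise ValueError(f"line {line_no}: missing field(s) {sorted(missing)}")
            examples.append({"input": record["input"], "label": record["label"]})
    return examples


# Round-trip the two sample records from this section through a temporary file.
sample = (
    '{"input": "What\'s my account balance?", "label": "balance_inquiry"}\n'
    '{"input": "I lost my card", "label": "card_lost"}\n'
)
with tempfile.NamedTemporaryFile(
    "w", suffix=".jsonl", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(sample)
examples = load_examples(tmp.name)
os.remove(tmp.name)
print(len(examples), examples[1]["label"])  # prints: 2 card_lost
```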
Performance Notes
Timing Estimates
For a dataset with 1,000 samples:
| Model Type | Time per 1k samples | Token Speed |
|---|---|---|
| GPT-3.5-turbo | ~45-60 seconds | ~2,000 tokens/sec |
| Claude-3-haiku | ~30-45 seconds | ~2,500 tokens/sec |
| Gemini-1.5-flash | ~20-30 seconds | ~3,000 tokens/sec |
| Llama-3.1-8b | ~15-25 seconds | ~3,500 tokens/sec |
| Total for 18 models | ~10-15 minutes | Sequential |
Speed Considerations
- Sequential Processing: Currently tests one model at a time (parallel in v0.2)
- Sample Size: Use sample_size=0.1 to test on 10% first for quick validation
- Smart Stopping: Saves 30-50% time by skipping smaller models when larger ones fail
- Rate Limits: Automatic handling with exponential backoff
- Caching: Not yet implemented (planned for v0.2; expected to reduce re-evaluation time by 90%)
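Exponential backoff, as mentioned for rate limits, generally means doubling the wait after each failed attempt. The sketch below shows the general technique with a little random jitter; the function name, defaults, and the choice to catch every exception are assumptions for illustration, not LLMuxer's actual retry logic.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    call: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `call`, waiting base_delay * 2**attempt (plus jitter) between tries."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            # Real code would catch only rate-limit or transient network errors.
            if attempt == max_attempts - 1:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
    raise RuntimeError("unreachable")


# Demo: a call that fails twice, then succeeds; delays are recorded, not slept.
attempts = {"count": 0}


def flaky() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("simulated rate limit")
    return "ok"


delays = []
result = call_with_backoff(flaky, sleep=delays.append)
print(result, len(delays))  # prints: ok 2
```

Passing `sleep` as a parameter keeps the demo instant and makes the schedule easy to inspect in tests.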
License
MIT - see LICENSE file.
Support
- Issues: GitHub Issues
- Email: mihirahuja09@gmail.com