Automatically find cheaper LLM alternatives while maintaining performance

LLMuxer

Find the cheapest LLM that meets your quality bar (Currently supports classification tasks only)

Quick Start

import llmuxer

# Example: Classify sentiment with 90% accuracy requirement
examples = [
    {"input": "This product is amazing!", "label": "positive"},
    {"input": "Terrible service", "label": "negative"},
    {"input": "It's okay", "label": "neutral"}
]

result = llmuxer.optimize_cost(
    baseline="gpt-4",
    examples=examples,
    task="classification",  # Currently only classification is supported
    options=["positive", "negative", "neutral"],
    min_accuracy=0.9  # Require 90% accuracy
)

print(result)
# Takes ~30-60 seconds for small datasets, ~10-15 minutes for 1k samples

Example Output

{
    "model": "anthropic/claude-3-haiku",
    "accuracy": 0.92,
    "cost_per_million": 0.25,
    "cost_savings": 0.975,  # 97.5% cheaper than GPT-4
    "baseline_cost_per_million": 10.0,
    "tokens_evaluated": 1500
}
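The cost_savings value above is consistent with the two price fields. A quick check, assuming savings is defined as one minus the ratio of the candidate price to the baseline price (the README does not state the formula, so cost_savings() here is a hypothetical helper, not part of llmuxer):

```python
# Assumption: cost_savings = 1 - candidate_price / baseline_price.
result = {
    "cost_per_million": 0.25,
    "baseline_cost_per_million": 10.0,
}


def cost_savings(candidate: float, baseline: float) -> float:
    """Fraction saved by switching from the baseline to the candidate."""
    return 1 - candidate / baseline


savings = cost_savings(
    result["cost_per_million"], result["baseline_cost_per_million"]
)
print(f"{savings:.1%}")  # 97.5%
```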

The Problem

You're using GPT-4 for classification. It works well but costs $20/million tokens. Could GPT-3.5 do just as well for $0.50? What about Claude Haiku at $0.25? Or Llama-3.1 at $0.06?

LLMuxer automatically tests your classification task across 18 models to find the cheapest one that maintains your required accuracy.

How It Works

Your Dataset → LLMuxer → Tests 18 Models → Returns Cheapest That Works
                  ↓
           Uses OpenRouter API
           (unified interface)

LLMuxer:

  1. Takes your baseline model (e.g., GPT-4) and test dataset
  2. Evaluates cheaper alternatives via OpenRouter
  3. Returns the cheapest model meeting your accuracy threshold
  4. Shows detailed cost breakdown and savings
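The selection in step 3 can be sketched as a filter followed by a minimum. This is an illustration of the idea only, not LLMuxer's internal code; pick_cheapest() and the candidate entries are hypothetical, and the numbers are placeholders rather than measured results.

```python
from typing import Dict, List, Optional


def pick_cheapest(results: List[Dict], min_accuracy: float) -> Optional[Dict]:
    """Return the cheapest result whose accuracy meets the threshold, or None."""
    qualifying = [r for r in results if r["accuracy"] >= min_accuracy]
    return min(qualifying, key=lambda r: r["cost_per_million"], default=None)


# Placeholder numbers for illustration only, not measured results.
candidates = [
    {"model": "model-a", "accuracy": 0.95, "cost_per_million": 5.00},
    {"model": "model-b", "accuracy": 0.91, "cost_per_million": 0.50},
    {"model": "model-c", "accuracy": 0.82, "cost_per_million": 0.10},
]

best = pick_cheapest(candidates, min_accuracy=0.9)
print(best["model"])  # model-b
```

Here model-c is cheapest overall but falls below the threshold, so the cheapest qualifying entry is returned instead.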

Installation

Prerequisites

  • Python 3.8+
  • An OpenRouter API key

Install

pip install llmuxer

Setup

export OPENROUTER_API_KEY="your-api-key-here"
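Before a long run it can help to confirm the key is visible to Python. A minimal sketch, assuming the library reads the key from the OPENROUTER_API_KEY environment variable as the export above suggests; has_openrouter_key() is a hypothetical helper, not part of llmuxer.

```python
import os
from typing import Mapping


def has_openrouter_key(env: Mapping[str, str] = os.environ) -> bool:
    """True if OPENROUTER_API_KEY is set to a non-blank value."""
    return bool(env.get("OPENROUTER_API_KEY", "").strip())


if not has_openrouter_key():
    print("OPENROUTER_API_KEY is not set; export it before calling llmuxer.")
```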

Key Features

  • 18 models tested - OpenAI, Anthropic, Google, Meta, Mistral, Qwen, DeepSeek
  • Smart stopping - Skips smaller models if larger ones fail
  • Cost breakdown - See token counts and costs per model
  • Fast testing - Use sample_size to test on subset first
  • Simple API - One function does everything
  • Classification only - Support for extraction, generation, and binary tasks coming in v0.2
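The smart stopping idea can be sketched as follows. This assumes models are grouped by family and tried from largest to smallest, which the README implies but does not spell out; evaluate_family(), the model names, and the accuracies are all hypothetical.

```python
from typing import Callable, Dict, List


def evaluate_family(
    models_large_to_small: List[str],
    evaluate: Callable[[str], float],
    min_accuracy: float,
) -> Dict[str, float]:
    """Evaluate models in order and stop at the first one below the threshold."""
    scores: Dict[str, float] = {}
    for model in models_large_to_small:
        accuracy = evaluate(model)
        scores[model] = accuracy
        if accuracy < min_accuracy:
            # Assumption: smaller siblings are unlikely to beat a larger
            # model that already failed, so they are skipped.
            break
    return scores


# Stand-in evaluator with made-up accuracies; a real one would call the API.
fake_accuracy = {"family-large": 0.95, "family-medium": 0.80, "family-small": 0.70}
scores = evaluate_family(
    ["family-large", "family-medium", "family-small"],
    fake_accuracy.__getitem__,
    min_accuracy=0.9,
)
print(sorted(scores))  # ['family-large', 'family-medium']
```

The smallest model is never evaluated because the medium one already failed, which is where the time saving comes from.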

Benchmarks

Tested Models

Live pricing data from OpenRouter API (updated automatically):

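| Provider  | Models                                         | Price Range ($/M tokens) |
|-----------|------------------------------------------------|--------------------------|
| OpenAI    | gpt-4o-mini, gpt-3.5-turbo                     | $0.75 - $2.00            |
| Anthropic | claude-3-haiku                                 | $1.50                    |
| DeepSeek  | deepseek-chat                                  | $0.90                    |
| Mistral   | 3 models                                       | $0.08 - $8.00            |
| Meta      | llama-3.1-8b-instruct, llama-3.1-70b-instruct  | $0.04 - $0.38            |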

Total: 9 models across 5 providers

Reproduce Our Benchmarks

# Test all 9 models on Banking77 dataset
python scripts/prepare_banking77.py
python examples/banking77_test.py

Expected Results: Most models achieve 85-92% accuracy on Banking77. Claude-3-haiku typically provides the best accuracy/cost ratio for classification tasks.

Performance Benchmarks

Fixed Dataset Results (50 job classification samples, tested 2025-08-10)

| Metric                | Baseline (GPT-4o) | Best Model (Claude-3-haiku) | Savings            |
|-----------------------|-------------------|-----------------------------|--------------------|
| Accuracy              | ~95% (assumed)    | 92.0%                       | Quality maintained |
| Cost/Million Tokens   | $12.50            | $1.50                       | 88.0% cheaper      |
| Cost/Request*         | $0.001875         | $0.000225                   | $0.00165 saved     |
| Monthly (1K requests) | $1.88             | $0.23                       | $1.65 saved        |

* Conservative estimate: 150 tokens/request (100 input + 50 output)
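The per-request and monthly figures follow directly from the per-million prices and the 150-token assumption. A short worked check, where cost_per_request() is a hypothetical helper written for this illustration:

```python
TOKENS_PER_REQUEST = 150  # 100 input + 50 output, as assumed above


def cost_per_request(price_per_million: float, tokens: int = TOKENS_PER_REQUEST) -> float:
    """Dollar cost of one request at a blended price per million tokens."""
    return price_per_million * tokens / 1_000_000


baseline_cost = cost_per_request(12.50)   # GPT-4o row
candidate_cost = cost_per_request(1.50)   # Claude-3-haiku row
monthly_saving = (baseline_cost - candidate_cost) * 1_000  # 1K requests

print(f"{baseline_cost:.6f} {candidate_cost:.6f} {monthly_saving:.2f}")
# 0.001875 0.000225 1.65
```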

Reproduction

# Install and setup
pip install llmuxer
export OPENROUTER_API_KEY="your-key"

# Run exact benchmark  
./scripts/bench.sh

# Generates: benchmarks/bench_YYYYMMDD.json + docs/benchmarks.md

Benchmark Notes:

  • Fixed dataset: data/jobs_50.jsonl (8 categories, 50 samples)
  • Pinned models: 8 specific models with exact API versions
  • Conservative estimates: 150 tokens/request assumption
  • No cherry-picking: Single test run results
  • Quality threshold: 85%+ accuracy required

API Reference

optimize_cost()

Find the cheapest model meeting your requirements for classification tasks.

Parameters:

  • baseline (str): Your current model (e.g., "gpt-4")
  • examples (list): Test examples with input and label
  • dataset (str): Path to JSONL file (alternative to examples)
  • task (str): Must be "classification" (other tasks coming soon)
  • options (list): Valid output classes for classification
  • min_accuracy (float): Minimum acceptable accuracy (0.0-1.0)
  • sample_size (float): Fraction of dataset to test (0.0-1.0)
  • prompt (str): Optional system prompt

Returns: Dictionary with model name, accuracy, cost, and savings.
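To make the sample_size parameter concrete, here is one way a fractional subset could be drawn. How llmuxer actually samples (random or first-N, seeded or not) is not documented here, so take_sample() is a hypothetical stand-in and the seeded random draw is an assumption.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def take_sample(examples: Sequence[T], sample_size: float, seed: int = 0) -> List[T]:
    """Return a reproducible random subset covering `sample_size` of the data."""
    if not 0.0 < sample_size <= 1.0:
        raise ValueError("sample_size must be in (0.0, 1.0]")
    if not examples:
        return []
    # Keep at least one example so very small fractions still test something.
    k = max(1, round(len(examples) * sample_size))
    return random.Random(seed).sample(list(examples), k)


subset = take_sample(list(range(100)), sample_size=0.2)
print(len(subset))  # 20
```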

Error Handling:

  • Returns {"error": "message"} if no model meets threshold
  • Retries on API failures
  • Validates dataset format

Full Example: Banking Intent Classification

import llmuxer

# Using the Banking77 dataset (77 intent categories)
baseline = "gpt-4"  # The model you currently use
result = llmuxer.optimize_cost(
    baseline=baseline,
    dataset="data/banking77.jsonl",  # Your prepared dataset
    task="classification",
    min_accuracy=0.8,
    sample_size=0.2  # Test on 20% first for speed
)

# optimize_cost() returns {"error": ...} when no model meets the threshold
if "error" in result:
    print(f"No model found: {result['error']}")
else:
    print(f"Switch from {baseline} to {result['model']}")
    print(f"Save {result['cost_savings']:.0%} on costs")
    print(f"Accuracy: {result['accuracy']:.1%}")

Dataset Format

JSONL format with input and label fields:

{"input": "What's my account balance?", "label": "balance_inquiry"}
{"input": "I lost my card", "label": "card_lost"}
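A dataset in this format can be checked locally before spending API calls on it. The sketch below uses only the standard library; parse_jsonl() is a hypothetical helper and is not the validation that llmuxer itself performs.

```python
import json
from typing import Dict, List

REQUIRED_FIELDS = {"input", "label"}


def parse_jsonl(text: str) -> List[Dict[str, str]]:
    """Parse JSONL text and check that every row has input and label fields."""
    rows = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        row = json.loads(line)
        if not isinstance(row, dict):
            raise ValueError(f"line {lineno}: expected a JSON object")
        missing = REQUIRED_FIELDS - set(row)
        if missing:
            raise ValueError(f"line {lineno}: missing {sorted(missing)}")
        rows.append(row)
    return rows


sample = (
    '{"input": "What\'s my account balance?", "label": "balance_inquiry"}\n'
    '{"input": "I lost my card", "label": "card_lost"}\n'
)
rows = parse_jsonl(sample)
print(len(rows), rows[1]["label"])  # 2 card_lost
```

To check a file, read it first and pass the contents in, for example parse_jsonl(open("data/banking77.jsonl", encoding="utf-8").read()).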

Performance Notes

Timing Estimates

For a dataset with 1,000 samples:

| Model Type          | Time per 1k samples | Token Speed       |
|---------------------|---------------------|-------------------|
| GPT-3.5-turbo       | ~45-60 seconds      | ~2,000 tokens/sec |
| Claude-3-haiku      | ~30-45 seconds      | ~2,500 tokens/sec |
| Gemini-1.5-flash    | ~20-30 seconds      | ~3,000 tokens/sec |
| Llama-3.1-8b        | ~15-25 seconds      | ~3,500 tokens/sec |
| Total for 18 models | ~10-15 minutes      | Sequential        |

Speed Considerations

  • Sequential Processing: Currently tests one model at a time (parallel in v0.2)
  • Sample Size: Use sample_size=0.1 to test on 10% first for quick validation
  • Smart Stopping: Saves 30-50% time by skipping smaller models when larger ones fail
  • Rate Limits: Automatic handling with exponential backoff
  • Caching: Not yet implemented; planned for v0.2 and expected to reduce re-evaluation time by 90%
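For readers unfamiliar with the rate-limit handling mentioned above, exponential backoff means doubling the wait after each failed attempt. A generic sketch, not llmuxer's actual retry code; with_backoff(), its defaults, and the added jitter are assumptions made for illustration.

```python
import random
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def with_backoff(
    call: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `call`, waiting base_delay * 2**attempt plus jitter between tries."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
    raise ValueError("max_attempts must be at least 1")


# Demo: a call that fails twice, then succeeds. Delays are recorded, not slept.
attempts: List[int] = []
delays: List[float] = []


def flaky() -> str:
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"


outcome = with_backoff(flaky, sleep=delays.append)
print(outcome, len(attempts), len(delays))  # ok 3 2
```

Passing sleep as a parameter keeps the demo instant; in real use the default time.sleep applies.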

License

MIT - see LICENSE file.
