
rightsize-cli

The Biggest Model for Every Task? That's Just Lazy.

Stop overpaying for AI. Benchmark your prompts against 200+ models via OpenRouter to find the cheapest one that still works.

This is the production-grade CLI version of the RightSize web tool.

The Problem

You're probably using Claude Sonnet or GPT-5 for everything. That's expensive. Many tasks work just as well with smaller, cheaper models - you just don't know which ones.

The Solution

RightSize benchmarks your actual prompts against multiple models and shows you:

  • Accuracy - How well each model performs (judged by a strong LLM)
  • Latency - Response time (p95)
  • Cost - Projected cost per 1,000 runs
  • Savings - How much you save vs your baseline

Quick Start

# Install
uv pip install -e .

# Set your OpenRouter API key
export RIGHTSIZE_OPENROUTER_API_KEY="sk-or-..."

# Run a benchmark
rightsize benchmark data/test_cases.csv \
  --template prompts/classify.j2 \
  --model google/gemma-3-12b-it \
  --model mistralai/mistral-small-3.1-24b-instruct \
  --model deepseek/deepseek-chat-v3.1 \
  --judge google/gemini-2.5-flash \
  --baseline google/gemini-2.5-flash

Output:

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┓
┃ Model                                   ┃ Accuracy ┃ Latency (p95) ┃ Cost/1k  ┃ Savings  ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━┩
│ google/gemma-3-12b-it                   │    71.0% │        4200ms │  $0.0028 │   +93.7% │
│ mistralai/mistral-small-3.1-24b-instruct│    85.0% │        1200ms │  $0.0035 │   +92.2% │
│ deepseek/deepseek-chat-v3.1             │    95.0% │         800ms │  $0.0180 │   +60.0% │
│ google/gemini-2.5-flash                 │   100.0% │        1900ms │  $0.0450 │       —  │
└─────────────────────────────────────────┴──────────┴───────────────┴──────────┴──────────┘
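
Savings is measured against the baseline's cost per 1,000 runs: for mistral-small above, 1 - $0.0035 / $0.0450 ≈ 92.2%.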

How It Works

  1. You provide test cases - A CSV with inputs and expected outputs
  2. Candidate models compete - All models run the same prompts in parallel
  3. LLM-as-Judge scores - A judge model compares each output to your expected output
  4. You see the results - Cost, accuracy, latency - pick the cheapest model that meets your bar
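
Conceptually, step 2 is a concurrent fan-out of one rendered prompt to every candidate model. The sketch below illustrates the idea against OpenRouter's OpenAI-compatible chat completions endpoint; it is a simplified illustration, not rightsize's actual implementation:

# Illustrative fan-out: every candidate model answers the same prompt
# concurrently via OpenRouter's OpenAI-compatible API. A sketch of the
# idea, not rightsize's source code.
import asyncio
import os
import time

import httpx

API_URL = "https://openrouter.ai/api/v1/chat/completions"
API_KEY = os.environ["RIGHTSIZE_OPENROUTER_API_KEY"]

async def complete(client: httpx.AsyncClient, model: str, prompt: str) -> dict:
    start = time.perf_counter()
    resp = await client.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"model": model, "messages": [{"role": "user", "content": prompt}]},
        timeout=60.0,
    )
    resp.raise_for_status()
    text = resp.json()["choices"][0]["message"]["content"]
    return {"model": model, "output": text, "latency_ms": (time.perf_counter() - start) * 1000}

async def run_case(models: list[str], prompt: str) -> list[dict]:
    # All candidates answer the same rendered prompt in parallel,
    # e.g. asyncio.run(run_case(["google/gemma-3-12b-it"], prompt))
    async with httpx.AsyncClient() as client:
        return await asyncio.gather(*(complete(client, m, prompt) for m in models))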

CSV Format

Two columns: input_data and expected_output:

input_data,expected_output
"My order hasn't arrived.",billing::high
"How do I reset my password?",account::high
"I want a refund!",refund::high

The judge model compares each model's output to expected_output and scores:

  • 1.0 - Exact or semantic match
  • 0.8 - Very close with minor differences
  • 0.5 - Partially correct
  • 0.0 - Wrong or irrelevant
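
A scoring call can be as simple as a prompt that embeds the rubric plus both answers, then parses a single number back out. The wording and parsing below are assumptions for illustration, not rightsize's actual judge prompt:

# Sketch of an LLM-as-judge scoring step. Fill the template with
# JUDGE_PROMPT.format(expected=..., candidate=...) and send it to the
# judge model; the rubric mirrors the scores listed above.
JUDGE_PROMPT = """Score the CANDIDATE answer against the EXPECTED answer.
Rubric: 1.0 exact or semantic match; 0.8 very close; 0.5 partially correct; 0.0 wrong.
EXPECTED: {expected}
CANDIDATE: {candidate}
Respond with the number only."""

def parse_score(reply: str) -> float:
    # Clamp to [0, 1] and treat an unparseable reply as a wrong answer.
    try:
        return max(0.0, min(1.0, float(reply.strip())))
    except ValueError:
        return 0.0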

Best practices for test data

  1. Use minimal output formats - Delimiter-separated (category::confidence) keeps responses short, costs low
  2. Consistent task type - All rows should be the same kind of task
  3. Representative samples - Use real data from your production use case
  4. Clear expected outputs - Unambiguous so the judge can score fairly
  5. 10-20 test cases - Enough to expose real differences between models, fast to run

Prompt Templates

Templates wrap your inputs with instructions. Supports Jinja2 (.j2) or Python f-strings.

Example: Classification template (prompts/classify.j2):

Classify this support ticket.

CATEGORIES: billing, account, refund, subscription, technical
CONFIDENCE: high, medium, low

OUTPUT FORMAT: <category>::<confidence>
OUTPUT ONLY the format above. No explanation. No punctuation. No other text.

TICKET: {{ input_data }}

OUTPUT:

Example: Extraction template (prompts/extract.j2):

Extract the email from this text.

OUTPUT FORMAT: <email or NONE>
OUTPUT ONLY the format above. No explanation. No other text.

TEXT: {{ input_data }}

OUTPUT:
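
To sanity-check a template before benchmarking, you can render it directly with jinja2 against one CSV row (a quick standalone check, independent of rightsize):

# Preview how a template expands for one row; input_data is the only
# variable rightsize passes in.
from jinja2 import Template

with open("prompts/classify.j2") as f:
    tmpl = Template(f.read())

print(tmpl.render(input_data="My order hasn't arrived."))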

When to use templates

Use Case        Template Purpose
Classification  Define categories, enforce output format
Extraction      Specify what to extract, output format
Summarization   Set length constraints, style
Translation     Specify target language, formality
Q&A             Provide context, format requirements

Template variable

Variable    Description
input_data  The value from your CSV's input_data column

CLI Reference

rightsize benchmark

rightsize benchmark <csv_file> [OPTIONS]

Option         Short  Default     Description
--template     -t     (required)  Path to prompt template file
--model        -m     (required)  Model ID to test (repeat for multiple)
--judge        -j     (required)  Model for judging outputs
--baseline     -b     None        Baseline model for savings calculation
--concurrency  -c     10          Max parallel requests
--output       -o     table       Output format: table, json, csv

rightsize models

List all available models and their pricing:

rightsize models

Configuration

Set via environment variables or .env file:

Variable                      Required  Default  Description
RIGHTSIZE_OPENROUTER_API_KEY  Yes       -        Your OpenRouter API key
RIGHTSIZE_MAX_CONCURRENCY     No        10       Default concurrency
RIGHTSIZE_TIMEOUT_SECONDS     No        60       Request timeout
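
For example, an equivalent .env file:

RIGHTSIZE_OPENROUTER_API_KEY=sk-or-...
RIGHTSIZE_MAX_CONCURRENCY=10
RIGHTSIZE_TIMEOUT_SECONDS=60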

Examples

Support ticket classification

rightsize benchmark data/test_cases.csv \
  -t prompts/classify.j2 \
  -m google/gemma-3-12b-it \
  -m mistralai/mistral-small-3.1-24b-instruct \
  -m deepseek/deepseek-chat-v3.1 \
  -j google/gemini-2.5-flash \
  -b google/gemini-2.5-flash

Find the cheapest model that works

rightsize benchmark data/test_cases.csv \
  -t prompts/classify.j2 \
  -m google/gemma-3-12b-it \
  -m google/gemma-3-27b-it \
  -m qwen/qwen3-8b \
  -m meta-llama/llama-3.3-70b-instruct \
  -j anthropic/claude-sonnet-4 \
  -b google/gemini-2.5-flash

Export results to JSON

rightsize benchmark data/test_cases.csv \
  -t prompts/classify.j2 \
  -m mistralai/mistral-small-3.1-24b-instruct \
  -m deepseek/deepseek-chat-v3.1 \
  -j google/gemini-2.5-flash \
  -b google/gemini-2.5-flash \
  -o json > results.json
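
From there you can filter results programmatically. The field names below are assumptions about the JSON schema, so check the actual output and adjust:

# Hypothetical post-processing of results.json; the keys "model",
# "accuracy" (assumed to be a 0-1 fraction), and "cost_per_1k" are
# illustrative guesses, not documented fields.
import json

with open("results.json") as f:
    results = json.load(f)

# Keep models at or above a 90% accuracy bar, cheapest first.
good = [r for r in results if r.get("accuracy", 0) >= 0.90]
for r in sorted(good, key=lambda r: r.get("cost_per_1k", 0.0)):
    print(f'{r["model"]}: {r["accuracy"]:.0%} at ${r["cost_per_1k"]:.4f}/1k')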

Tips

  1. Use minimal output formats - category::confidence is cheaper than JSON, JSON is cheaper than prose
  2. End prompts with "OUTPUT:" - Primes the model to respond immediately without preamble
  3. Start with 10-20 test cases - Enough to be representative, fast to iterate
  4. Set a quality bar - Decide what accuracy % is acceptable (e.g., 95%+)
  5. Consider latency - Sometimes a slower cheap model isn't worth it
  6. Iterate on prompts - A better prompt can make cheaper models work better

License

MIT
