Benchmark prompts against multiple LLMs via OpenRouter to find the cheapest model that works.

rightsize-cli

The Biggest Model for Every Task? That's Just Lazy.

Stop overpaying for AI. Benchmark your prompts against 200+ models via OpenRouter to find the cheapest one that still works.

This is the production-grade CLI version of the RightSize web tool.

Installation

# Using pip
pip install rightsize-cli

# Using uv
uv pip install rightsize-cli

Quick Start

# Set your OpenRouter API key
export RIGHTSIZE_OPENROUTER_API_KEY="sk-or-..."

# List available models
rightsize-cli models

# Run a benchmark
rightsize-cli benchmark test_cases.csv \
  -t prompt.j2 \
  -m google/gemma-3-12b-it \
  -m deepseek/deepseek-chat-v3.1 \
  -j google/gemini-2.5-flash \
  -b google/gemini-2.5-flash

Run without installing (uvx)

# Set API key
export RIGHTSIZE_OPENROUTER_API_KEY="sk-or-..."

# List models
uvx rightsize-cli models

# Run benchmark
uvx rightsize-cli benchmark test_cases.csv \
  -t prompt.j2 \
  -m google/gemma-3-12b-it \
  -m deepseek/deepseek-chat-v3.1 \
  -j google/gemini-2.5-flash \
  -b google/gemini-2.5-flash

Output

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┓
┃ Model                       ┃ Accuracy ┃ Latency (p95) ┃ Cost/1k  ┃ Savings  ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━┩
│ google/gemma-3-12b-it       │    71.0% │        4200ms │  $0.0028 │   +93.7% │
│ deepseek/deepseek-chat-v3.1 │    95.0% │         800ms │  $0.0180 │   +60.0% │
│ google/gemini-2.5-flash     │   100.0% │        1900ms │  $0.0450 │       —  │
└─────────────────────────────┴──────────┴───────────────┴──────────┴──────────┘
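
The Savings column appears to be the relative cost reduction against the baseline model. The sketch below assumes that formula (it is not taken from the tool's source) and checks it against the Cost/1k values shown in the table. The helper name `savings_pct` is hypothetical.

```python
def savings_pct(cost_per_1k: float, baseline_cost_per_1k: float) -> float:
    """Relative cost reduction versus the baseline, in percent (assumed formula)."""
    if baseline_cost_per_1k <= 0:
        raise ValueError("baseline cost must be positive")
    return (1 - cost_per_1k / baseline_cost_per_1k) * 100


# Cost/1k values copied from the example table above.
baseline = 0.0450  # google/gemini-2.5-flash
deepseek = savings_pct(0.0180, baseline)
gemma = savings_pct(0.0028, baseline)

print(f"deepseek/deepseek-chat-v3.1: {deepseek:+.1f}%")
print(f"google/gemma-3-12b-it:       {gemma:+.1f}%")
```

Because the table shows rounded costs, a value recomputed this way can differ from the displayed one in the last digit.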

How It Works

  1. You provide test cases - A CSV with inputs and expected outputs
  2. Candidate models compete - All models run the same prompts in parallel
  3. LLM-as-Judge scores - A judge model compares each output to your expected output
  4. You see the results - Compare cost, accuracy, and latency, then pick the cheapest model that meets your bar
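
The last step, picking the cheapest model that meets your bar, can be sketched in a few lines. This is a minimal illustration using the numbers from the example output, not the tool's own code. The `Result` class and `pick_cheapest` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Result:
    model: str
    accuracy: float       # fraction in [0, 1]
    p95_latency_ms: int
    cost_per_1k: float    # USD per 1k requests


def pick_cheapest(
    results: Sequence[Result],
    min_accuracy: float,
    max_latency_ms: Optional[int] = None,
) -> Optional[Result]:
    """Return the cheapest result that clears the accuracy (and optional latency) bar."""
    eligible = [
        r for r in results
        if r.accuracy >= min_accuracy
        and (max_latency_ms is None or r.p95_latency_ms <= max_latency_ms)
    ]
    return min(eligible, key=lambda r: r.cost_per_1k) if eligible else None


results = [
    Result("google/gemma-3-12b-it", 0.71, 4200, 0.0028),
    Result("deepseek/deepseek-chat-v3.1", 0.95, 800, 0.0180),
    Result("google/gemini-2.5-flash", 1.00, 1900, 0.0450),
]

choice = pick_cheapest(results, min_accuracy=0.95)
print(choice.model)  # deepseek/deepseek-chat-v3.1
```

With a 95% bar, the cheapest model is excluded for accuracy, so the mid-priced one wins.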

CSV Format

The CSV has two columns, input_data and expected_output:

input_data,expected_output
"My order hasn't arrived.",billing::high
"How do I reset my password?",account::high
"I want a refund!",refund::high

The judge model compares each model's output to expected_output and assigns one of these scores:

  • 1.0 - Exact or semantic match
  • 0.8 - Very close with minor differences
  • 0.5 - Partially correct
  • 0.0 - Wrong or irrelevant
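
One plausible way to turn per-case judge scores into the Accuracy column is a simple mean. That aggregation is an assumption made here for illustration, since the README does not state it. The `accuracy` helper is a hypothetical name.

```python
from typing import Sequence

# The four score levels listed above.
ALLOWED_SCORES = {0.0, 0.5, 0.8, 1.0}


def accuracy(scores: Sequence[float]) -> float:
    """Mean judge score across test cases (assumed aggregation)."""
    if not scores:
        raise ValueError("no scores to aggregate")
    unexpected = [s for s in scores if s not in ALLOWED_SCORES]
    if unexpected:
        raise ValueError(f"unexpected judge scores: {unexpected}")
    return sum(scores) / len(scores)


# Ten hypothetical per-case scores for one candidate model.
scores = [1.0, 1.0, 0.8, 0.5, 1.0, 0.0, 1.0, 1.0, 0.8, 1.0]
acc = accuracy(scores)
print(f"{acc:.1%}")  # 81.0%
```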

Best practices for test data

  1. Use minimal output formats - A delimiter-separated format (category::confidence) keeps responses short and costs low
  2. Consistent task type - All rows should be the same kind of task
  3. Representative samples - Use real data from your production use case
  4. Clear expected outputs - Unambiguous so the judge can score fairly
  5. 10-20 test cases - Enough for a useful first signal while staying fast to run
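
A test file in this format can be written and sanity-checked with Python's standard csv module before spending API credits. The sketch below is a hypothetical pre-flight check, not part of rightsize-cli. It uses an in-memory buffer in place of test_cases.csv, and the function names are illustrative.

```python
import csv
import io

REQUIRED_COLUMNS = {"input_data", "expected_output"}

rows = [
    ("My order hasn't arrived.", "billing::high"),
    ("How do I reset my password?", "account::high"),
    ("I want a refund!", "refund::high"),
]


def write_cases(fh, rows):
    """Write the header plus one row per test case; csv handles the quoting."""
    writer = csv.writer(fh)
    writer.writerow(["input_data", "expected_output"])
    writer.writerows(rows)


def read_cases(fh):
    """Load test cases and fail early if a required column is missing or empty."""
    reader = csv.DictReader(fh)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    cases = list(reader)
    for i, case in enumerate(cases, start=1):
        if not case["input_data"] or not case["expected_output"]:
            raise ValueError(f"row {i} has an empty field")
    return cases


buf = io.StringIO()  # stand-in for open("test_cases.csv", "w", newline="")
write_cases(buf, rows)
buf.seek(0)
cases = read_cases(buf)
print(len(cases), "test cases loaded")
```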

Prompt Templates

Templates wrap your inputs with instructions. Both Jinja2 (.j2) files and Python f-string style templates are supported.

Example: Classification template (prompt.j2):

Classify this support ticket.

CATEGORIES: billing, account, refund, subscription, technical
CONFIDENCE: high, medium, low

OUTPUT FORMAT: <category>::<confidence>
OUTPUT ONLY the format above. No explanation. No punctuation. No other text.

TICKET: {{ input_data }}

OUTPUT:

Example: Extraction template (extract.j2):

Extract the email from this text.

OUTPUT FORMAT: <email or NONE>
OUTPUT ONLY the format above. No explanation. No other text.

TEXT: {{ input_data }}

OUTPUT:

Template variable

Variable     Description
input_data   The value from your CSV's input_data column
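
To check what a model will actually receive, it can help to preview a rendered prompt. The sketch below substitutes the `{{ input_data }}` placeholder with a regular expression. It is a deliberately simple stand-in for Jinja2 that handles only this one variable, and it is not how the CLI renders templates. The `preview` function is a hypothetical name.

```python
import re

TEMPLATE = """Classify this support ticket.

CATEGORIES: billing, account, refund, subscription, technical
CONFIDENCE: high, medium, low

OUTPUT FORMAT: <category>::<confidence>
OUTPUT ONLY the format above. No explanation. No punctuation. No other text.

TICKET: {{ input_data }}

OUTPUT:"""

# Matches {{ input_data }} with any amount of inner whitespace.
_PLACEHOLDER = re.compile(r"\{\{\s*input_data\s*\}\}")


def preview(template: str, input_data: str) -> str:
    """Fill the input_data placeholder; no other Jinja2 syntax is supported."""
    # A function replacement keeps backslashes in the input from being read as escapes.
    return _PLACEHOLDER.sub(lambda _match: input_data, template)


prompt = preview(TEMPLATE, "How do I reset my password?")
print(prompt)
```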

CLI Reference

rightsize-cli benchmark

rightsize-cli benchmark <csv_file> [OPTIONS]

Option         Short  Default     Description
--template     -t     (required)  Path to prompt template file
--model        -m     (required)  Model ID to test (repeat for multiple)
--judge        -j     (required)  Model for judging outputs
--baseline     -b     None        Baseline model for savings calculation
--concurrency  -c     10          Max parallel requests
--output       -o     table       Output format: table, json, csv
--verbose      -v     False       Show detailed outputs and judge scores
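
When benchmarks are launched from a script, the argument list can be assembled in Python. The sketch below uses only the options listed in the table above. The `benchmark_command` helper is a hypothetical convenience, not something the package provides.

```python
import shlex
from typing import List, Optional, Sequence


def benchmark_command(
    csv_file: str,
    template: str,
    models: Sequence[str],
    judge: str,
    baseline: Optional[str] = None,
    concurrency: Optional[int] = None,
    output: Optional[str] = None,
    verbose: bool = False,
) -> List[str]:
    """Build the argv list for `rightsize-cli benchmark` from the documented options."""
    if not models:
        raise ValueError("at least one model is required")
    cmd = ["rightsize-cli", "benchmark", csv_file, "-t", template]
    for model in models:
        cmd += ["-m", model]  # -m is repeated once per candidate model
    cmd += ["-j", judge]
    if baseline is not None:
        cmd += ["-b", baseline]
    if concurrency is not None:
        cmd += ["-c", str(concurrency)]
    if output is not None:
        cmd += ["-o", output]
    if verbose:
        cmd.append("-v")
    return cmd


cmd = benchmark_command(
    "test_cases.csv",
    "prompt.j2",
    models=["google/gemma-3-12b-it", "deepseek/deepseek-chat-v3.1"],
    judge="google/gemini-2.5-flash",
    baseline="google/gemini-2.5-flash",
    output="json",
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the CLI is installed and the API key is set.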

rightsize-cli models

List all available models and their pricing:

rightsize-cli models

Configuration

Set via environment variables or .env file:

Variable                      Required  Default  Description
RIGHTSIZE_OPENROUTER_API_KEY  Yes       -        Your OpenRouter API key
RIGHTSIZE_MAX_CONCURRENCY     No        10       Default concurrency
RIGHTSIZE_TIMEOUT_SECONDS     No        60       Request timeout in seconds
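
A wrapper script may want to check these variables itself before calling the CLI. The sketch below reads them with the defaults from the table above. It only mirrors the documented names and defaults and is not the tool's actual settings loader. `Settings` and `load_settings` are hypothetical names.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    api_key: str
    max_concurrency: int = 10
    timeout_seconds: float = 60.0


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read the documented variables, failing early if the API key is missing."""
    env = os.environ if env is None else env
    api_key = env.get("RIGHTSIZE_OPENROUTER_API_KEY")
    if not api_key:
        raise RuntimeError("RIGHTSIZE_OPENROUTER_API_KEY is not set")
    return Settings(
        api_key=api_key,
        max_concurrency=int(env.get("RIGHTSIZE_MAX_CONCURRENCY", "10")),
        timeout_seconds=float(env.get("RIGHTSIZE_TIMEOUT_SECONDS", "60")),
    )


# A plain dict stands in for the process environment in this example.
settings = load_settings({"RIGHTSIZE_OPENROUTER_API_KEY": "sk-or-example"})
print(settings.max_concurrency, settings.timeout_seconds)
```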

Examples

Compare cheap models against a baseline

uvx rightsize-cli benchmark test_cases.csv \
  -t prompt.j2 \
  -m google/gemma-3-12b-it \
  -m google/gemma-3-27b-it \
  -m qwen/qwen3-8b \
  -m meta-llama/llama-3.3-70b-instruct \
  -j google/gemini-2.5-flash \
  -b google/gemini-2.5-flash

Use a stronger judge model

uvx rightsize-cli benchmark test_cases.csv \
  -t prompt.j2 \
  -m google/gemma-3-12b-it \
  -m deepseek/deepseek-chat-v3.1 \
  -j anthropic/claude-sonnet-4 \
  -b google/gemini-2.5-flash

Export results to JSON

uvx rightsize-cli benchmark test_cases.csv \
  -t prompt.j2 \
  -m google/gemma-3-12b-it \
  -m deepseek/deepseek-chat-v3.1 \
  -j google/gemini-2.5-flash \
  -b google/gemini-2.5-flash \
  -o json > results.json

Debug with verbose mode

uvx rightsize-cli benchmark test_cases.csv \
  -t prompt.j2 \
  -m google/gemma-3-12b-it \
  -j google/gemini-2.5-flash \
  -b google/gemini-2.5-flash \
  -v

Tips

  1. Use minimal output formats - category::confidence is cheaper than JSON, JSON is cheaper than prose
  2. End prompts with "OUTPUT:" - Primes the model to respond immediately without preamble
  3. Start with 10-20 test cases - Enough to be representative, fast to iterate
  4. Set a quality bar - Decide what accuracy % is acceptable (e.g., 95%+)
  5. Consider latency - A cheaper model may not be worth it if it is much slower
  6. Iterate on prompts - A better prompt can make cheaper models work better

Development

# Clone the repo
git clone https://github.com/NehmeAILabs/rightsize-cli.git
cd rightsize-cli

# Install in dev mode
uv pip install -e .

# Run locally
rightsize-cli models

License

MIT
