rightsize-cli
The Biggest Model for Every Task? That's Just Lazy.
Stop overpaying for AI. Benchmark your prompts against 200+ models via OpenRouter to find the cheapest one that still works.
This is the production-grade CLI version of the RightSize web tool.
Installation
# Using pip
pip install rightsize-cli
# Using uv
uv pip install rightsize-cli
Quick Start
# Set your OpenRouter API key
export RIGHTSIZE_OPENROUTER_API_KEY="sk-or-..."
# List available models
rightsize-cli models
# Run a benchmark
rightsize-cli benchmark test_cases.csv \
-t prompt.j2 \
-m google/gemma-3-12b-it \
-m deepseek/deepseek-chat-v3.1 \
-j google/gemini-3-flash-preview \
-b google/gemini-2.5-flash
# Open interactive visualization in the browser
rightsize-cli benchmark data/test_cases.csv \
-t prompts/classify.j2 \
-m google/gemma-3-12b-it \
-m deepseek/deepseek-chat-v3.1 \
-j google/gemini-3-flash-preview \
-b google/gemini-2.5-flash \
--visualize
Run without installing (uvx)
# Set API key
export RIGHTSIZE_OPENROUTER_API_KEY="sk-or-..."
# List models
uvx rightsize-cli models
# Run benchmark
uvx rightsize-cli benchmark test_cases.csv \
-t prompt.j2 \
-m google/gemma-3-12b-it \
-m deepseek/deepseek-chat-v3.1 \
-j google/gemini-3-flash-preview \
-b google/gemini-2.5-flash
# Run benchmark + open web visualization
uvx rightsize-cli benchmark test_cases.csv \
-t prompt.j2 \
-m google/gemma-3-12b-it \
-m deepseek/deepseek-chat-v3.1 \
-j google/gemini-3-flash-preview \
-b google/gemini-2.5-flash \
--visualize
Output
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┓
┃ Model ┃ Accuracy ┃ Latency (p95) ┃ Cost/1k ┃ Savings ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━┩
│ google/gemma-3-12b-it │ 71.0% │ 4200ms │ $0.0028 │ +93.7% │
│ deepseek/deepseek-chat-v3.1 │ 95.0% │ 800ms │ $0.0180 │ +60.0% │
│ google/gemini-2.5-flash │ 100.0% │ 1900ms │ $0.0450 │ — │
└─────────────────────────────┴──────────┴───────────────┴──────────┴──────────┘
How It Works
- You provide test cases - A CSV with inputs and expected outputs
- Candidate models compete - All models run the same prompts in parallel
- LLM-as-Judge scores - A judge model compares each output to your expected output
- You see the results - Cost, accuracy, latency - pick the cheapest model that meets your bar
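The Savings column in the sample output can be sanity-checked from the Cost/1k figures, assuming savings is the percentage cost reduction relative to the baseline model:

```python
def savings_vs_baseline(cost_per_1k: float, baseline_cost_per_1k: float) -> float:
    """Percent cost reduction relative to the baseline model."""
    return (1 - cost_per_1k / baseline_cost_per_1k) * 100

# Figures from the sample output above; baseline is google/gemini-2.5-flash ($0.0450).
print(round(savings_vs_baseline(0.0180, 0.0450), 1))  # deepseek-chat-v3.1 -> 60.0
print(round(savings_vs_baseline(0.0028, 0.0450), 1))  # gemma-3-12b-it -> 93.8
```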
CSV Format
Two columns: input_data and expected_output:
input_data,expected_output
"My order hasn't arrived.",billing::high
"How do I reset my password?",account::high
"I want a refund!",refund::high
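If you generate test cases programmatically, the stdlib csv module handles the quoting for you (note the comma inside the first input):

```python
import csv

# Build a small benchmark file in the two-column format shown above.
rows = [
    ("My order hasn't arrived.", "billing::high"),
    ("How do I reset my password?", "account::high"),
    ("I want a refund!", "refund::high"),
]

with open("test_cases.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["input_data", "expected_output"])
    writer.writerows(rows)
```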
The judge model compares each model's output to expected_output and scores:
- 1.0 - Exact or semantic match
- 0.8 - Very close with minor differences
- 0.5 - Partially correct
- 0.0 - Wrong or irrelevant
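How these per-case scores roll up into the Accuracy column isn't specified here; one plausible aggregation (an assumption, not necessarily the tool's exact formula) is the mean judge score per model:

```python
from statistics import mean

def accuracy(judge_scores: list[float]) -> float:
    """Mean judge score as a percentage, one score per test case."""
    return mean(judge_scores) * 100

# e.g. 18 exact matches, one partial, one miss over 20 cases
scores = [1.0] * 18 + [0.5, 0.0]
print(f"{accuracy(scores):.1f}%")  # 92.5%
```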
Best practices for test data
- Use minimal output formats - Delimiter-separated (category::confidence) keeps responses short and costs low
- Consistent task type - All rows should be the same kind of task
- Representative samples - Use real data from your production use case
- Clear expected outputs - Unambiguous so the judge can score fairly
- 10-20 test cases - Enough to be statistically meaningful, fast to run
Prompt Templates
Templates wrap your inputs with instructions. Supports Jinja2 (.j2) or Python f-strings.
Example: Classification template (prompt.j2):
Classify this support ticket.
CATEGORIES: billing, account, refund, subscription, technical
CONFIDENCE: high, medium, low
OUTPUT FORMAT: <category>::<confidence>
OUTPUT ONLY the format above. No explanation. No punctuation. No other text.
TICKET: {{ input_data }}
OUTPUT:
Example: Extraction template (extract.j2):
Extract the email from this text.
OUTPUT FORMAT: <email or NONE>
OUTPUT ONLY the format above. No explanation. No other text.
TEXT: {{ input_data }}
OUTPUT:
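For the f-string template flavor, the same substitution can be sketched with str.format; the template text below mirrors extract.j2 with single braces (an assumption about the f-string syntax the tool accepts):

```python
# Rendering sketch for the f-string template flavor; the Jinja2 flavor
# substitutes {{ input_data }} the same way.
template = (
    "Extract the email from this text.\n"
    "OUTPUT FORMAT: <email or NONE>\n"
    "OUTPUT ONLY the format above. No explanation. No other text.\n"
    "TEXT: {input_data}\n"
    "OUTPUT:"
)

prompt = template.format(input_data="Reach me at jane@example.com, thanks!")
print(prompt)
```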
Template variable
| Variable | Description |
|---|---|
| input_data | The value from your CSV's input_data column |
CLI Reference
rightsize-cli benchmark
rightsize-cli benchmark <csv_file> [OPTIONS]
| Option | Short | Default | Description |
|---|---|---|---|
| --template | -t | (required) | Path to prompt template file |
| --model | -m | (required) | Model ID to test (repeat for multiple) |
| --judge | -j | (required) | Model for judging outputs |
| --baseline | -b | None | Baseline model for savings calculation |
| --concurrency | -c | 10 | Max parallel requests |
| --output | -o | table | Output format: table, json, csv |
| --verbose | -v | False | Show detailed outputs and judge scores |
| --visualize | -V | False | Open interactive web visualization |
rightsize-cli models
List all available models and their pricing:
rightsize-cli models
Configuration
Set via environment variables or .env file:
| Variable | Required | Default | Description |
|---|---|---|---|
| RIGHTSIZE_OPENROUTER_API_KEY | Yes | - | Your OpenRouter API key |
| RIGHTSIZE_MAX_CONCURRENCY | No | 10 | Default concurrency |
| RIGHTSIZE_TIMEOUT_SECONDS | No | 60 | Request timeout |
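A sketch of how these settings might resolve (the variable names and defaults come from the table above; the function itself is illustrative, not the tool's actual code):

```python
import os

def load_config() -> dict:
    """Resolve settings from the environment, falling back to the documented defaults."""
    api_key = os.environ.get("RIGHTSIZE_OPENROUTER_API_KEY")
    if not api_key:
        raise RuntimeError("RIGHTSIZE_OPENROUTER_API_KEY is required")
    return {
        "api_key": api_key,
        "max_concurrency": int(os.environ.get("RIGHTSIZE_MAX_CONCURRENCY", "10")),
        "timeout_seconds": int(os.environ.get("RIGHTSIZE_TIMEOUT_SECONDS", "60")),
    }
```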
Examples
Compare cheap models against a baseline
uvx rightsize-cli benchmark test_cases.csv \
-t prompt.j2 \
-m google/gemma-3-12b-it \
-m google/gemma-3-27b-it \
-m qwen/qwen3-8b \
-m meta-llama/llama-3.3-70b-instruct \
-j google/gemini-3-flash-preview \
-b google/gemini-2.5-flash
Use a stronger judge model
uvx rightsize-cli benchmark test_cases.csv \
-t prompt.j2 \
-m google/gemma-3-12b-it \
-m deepseek/deepseek-chat-v3.1 \
-j anthropic/claude-sonnet-4 \
-b google/gemini-2.5-flash
Export results to JSON
uvx rightsize-cli benchmark test_cases.csv \
-t prompt.j2 \
-m google/gemma-3-12b-it \
-m deepseek/deepseek-chat-v3.1 \
-j google/gemini-3-flash-preview \
-b google/gemini-2.5-flash \
-o json > results.json
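The JSON schema isn't documented here, but assuming one record per model with accuracy and cost fields (the field names below are hypothetical), picking the cheapest model that clears your quality bar is a short filter:

```python
# Hypothetical result records; the tool's actual JSON layout may differ.
results = [
    {"model": "google/gemma-3-12b-it", "accuracy": 0.71, "cost_per_1k": 0.0028},
    {"model": "deepseek/deepseek-chat-v3.1", "accuracy": 0.95, "cost_per_1k": 0.0180},
    {"model": "google/gemini-2.5-flash", "accuracy": 1.00, "cost_per_1k": 0.0450},
]

QUALITY_BAR = 0.95  # minimum acceptable accuracy

passing = [r for r in results if r["accuracy"] >= QUALITY_BAR]
winner = min(passing, key=lambda r: r["cost_per_1k"])
print(winner["model"])  # deepseek/deepseek-chat-v3.1
```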
Debug with verbose mode
uvx rightsize-cli benchmark test_cases.csv \
-t prompt.j2 \
-m google/gemma-3-12b-it \
-j google/gemini-3-flash-preview \
-b google/gemini-2.5-flash \
-v
Tips
- Use minimal output formats - category::confidence is cheaper than JSON, and JSON is cheaper than prose
- End prompts with "OUTPUT:" - Primes the model to respond immediately without preamble
- Start with 10-20 test cases - Enough to be representative, fast to iterate
- Set a quality bar - Decide what accuracy % is acceptable (e.g., 95%+)
- Consider latency - Sometimes a slower cheap model isn't worth it
- Iterate on prompts - A better prompt can make cheaper models work better
Development
# Clone the repo
git clone https://github.com/NehmeAILabs/rightsize-cli.git
cd rightsize-cli
# Install in dev mode
uv pip install -e .
# Run locally
rightsize-cli models
License
MIT
File details
Details for the file rightsize_cli-0.1.2.tar.gz.
File metadata
- Download URL: rightsize_cli-0.1.2.tar.gz
- Upload date:
- Size: 29.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.6.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | bb68fcfde48bfa1264d5739467977b924216740ca3b280e573b27a017203fca5 |
| MD5 | d531c512893f30a366af465e453b6296 |
| BLAKE2b-256 | ce87abc6e330e8d2622a66036fa5d0b1db56976f65ba141974ee34969367b2c8 |
File details
Details for the file rightsize_cli-0.1.2-py3-none-any.whl.
File metadata
- Download URL: rightsize_cli-0.1.2-py3-none-any.whl
- Upload date:
- Size: 13.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.6.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | bf24211c0b5c32adc7b2ddd2510ec1d0f17d14182bc62542bc22e11c14d9b32d |
| MD5 | a06b96bf178eab57cb4b79d758bc92cb |
| BLAKE2b-256 | 48948d0e46ab5c94c5abb24cc69ed61e925d8dba609613a44aef06bb1a531767 |