# rightsize-cli

**The Biggest Model for Every Task? That's Just Lazy.**

Stop overpaying for AI. Benchmark your prompts against 200+ models via OpenRouter to find the cheapest one that still works.

This is the production-grade CLI version of the RightSize web tool.

## The Problem

You're probably using Claude Sonnet or GPT-5 for everything. That's expensive. Many tasks work just as well with smaller, cheaper models - you just don't know which ones.

## The Solution

RightSize benchmarks your actual prompts against multiple models and shows you:

- **Accuracy** - How well each model performs (judged by a strong LLM)
- **Latency** - Response time (p95)
- **Cost** - Projected cost per 1,000 runs
- **Savings** - How much you save vs your baseline (see the sketch below)
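
The cost and savings columns are simple arithmetic. A minimal sketch, assuming OpenRouter-style prices quoted in USD per million tokens and hypothetical token counts (the CLI's internal accounting may differ):

```python
def cost_per_1k_runs(prompt_tokens: float, completion_tokens: float,
                     in_price_per_m: float, out_price_per_m: float) -> float:
    """Projected USD cost of 1,000 runs, from average token counts per run
    and per-million-token prices (assumed pricing model)."""
    per_run = (prompt_tokens * in_price_per_m
               + completion_tokens * out_price_per_m) / 1_000_000
    return per_run * 1_000

def savings_vs_baseline(candidate_cost: float, baseline_cost: float) -> float:
    """Relative savings against the baseline; 0.937 renders as +93.7%."""
    return (baseline_cost - candidate_cost) / baseline_cost

# Hypothetical example: ~200 prompt tokens and ~5 completion tokens per run.
cheap = cost_per_1k_runs(200, 5, 0.05, 0.10)  # small-model prices
base = cost_per_1k_runs(200, 5, 0.30, 2.50)   # baseline prices
print(f"{savings_vs_baseline(cheap, base):+.1%}")
```
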
## Quick Start

```bash
# Install
uv pip install -e .

# Set your OpenRouter API key
export RIGHTSIZE_OPENROUTER_API_KEY="sk-or-..."

# Run a benchmark
rightsize benchmark data/test_cases.csv \
  --template prompts/classify.j2 \
  --model google/gemma-3-12b-it \
  --model mistralai/mistral-small-3.1-24b-instruct \
  --model deepseek/deepseek-chat-v3.1 \
  --judge google/gemini-2.5-flash \
  --baseline google/gemini-2.5-flash
```

Output:

```
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
┃ Model                                    ┃ Accuracy ┃ Latency (p95) ┃ Cost/1k ┃ Savings ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
│ google/gemma-3-12b-it                    │ 71.0%    │ 4200ms        │ $0.0028 │ +93.7%  │
│ mistralai/mistral-small-3.1-24b-instruct │ 85.0%    │ 1200ms        │ $0.0035 │ +92.2%  │
│ deepseek/deepseek-chat-v3.1              │ 95.0%    │ 800ms         │ $0.0180 │ +60.0%  │
│ google/gemini-2.5-flash                  │ 100.0%   │ 1900ms        │ $0.0450 │ —       │
└──────────────────────────────────────────┴──────────┴───────────────┴─────────┴─────────┘
```

## How It Works

1. **You provide test cases** - A CSV with inputs and expected outputs
2. **Candidate models compete** - All models run the same prompts in parallel (see the sketch below)
3. **LLM-as-Judge scores** - A judge model compares each output to your expected output
4. **You see the results** - Cost, accuracy, latency: pick the cheapest model that meets your bar
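
A minimal sketch of the fan-out in step 2, assuming OpenRouter's OpenAI-compatible chat completions endpoint and the third-party `httpx` library (illustrative, not the CLI's actual code):

```python
import asyncio
import os

import httpx

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

async def run_one(client: httpx.AsyncClient, model: str, prompt: str) -> str:
    # One chat completion request against a single candidate model.
    resp = await client.post(
        OPENROUTER_URL,
        headers={"Authorization": f"Bearer {os.environ['RIGHTSIZE_OPENROUTER_API_KEY']}"},
        json={"model": model, "messages": [{"role": "user", "content": prompt}]},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

async def run_all(models: list[str], prompt: str, concurrency: int = 10) -> dict[str, str]:
    # The semaphore plays the role of the --concurrency option.
    sem = asyncio.Semaphore(concurrency)
    async with httpx.AsyncClient() as client:
        async def bounded(model: str) -> str:
            async with sem:
                return await run_one(client, model, prompt)
        outputs = await asyncio.gather(*(bounded(m) for m in models))
    return dict(zip(models, outputs))

# Usage: asyncio.run(run_all(["google/gemma-3-12b-it"], prompt))
```
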
## CSV Format

Two columns: `input_data` and `expected_output`:

```csv
input_data,expected_output
"My order hasn't arrived.",billing::high
"How do I reset my password?",account::high
"I want a refund!",refund::high
```

The judge model compares each model's output to `expected_output` and scores on a simple rubric (sketched below):

- **1.0** - Exact or semantic match
- **0.8** - Very close with minor differences
- **0.5** - Partially correct
- **0.0** - Wrong or irrelevant
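
The CLI's actual judging prompt is internal, so the following is only a hypothetical sketch of an LLM-as-judge prompt that follows the rubric above:

```python
# Illustrative judge prompt -- a sketch of the idea, not the CLI's internal
# prompt. The judge is asked to reply with a bare score for easy parsing.
JUDGE_TEMPLATE = """You are grading a model's answer against an expected answer.

EXPECTED: {expected}
ACTUAL: {actual}

Reply with ONLY one number: 1.0 (exact or semantic match),
0.8 (very close with minor differences), 0.5 (partially correct),
or 0.0 (wrong or irrelevant).
SCORE:"""

def build_judge_prompt(expected: str, actual: str) -> str:
    return JUDGE_TEMPLATE.format(expected=expected, actual=actual)
```
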
### Best practices for test data

- **Use minimal output formats** - Delimiter-separated (`category::confidence`) keeps responses short and costs low
- **Consistent task type** - All rows should be the same kind of task
- **Representative samples** - Use real data from your production use case
- **Clear expected outputs** - Unambiguous so the judge can score fairly
- **10-20 test cases** - Enough to be statistically meaningful, fast to run
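
A quick programmatic sanity check catches empty or misnamed columns before you spend API credits. A minimal sketch (an illustrative helper, not a CLI command):

```python
import csv

def load_test_cases(path: str) -> list[dict[str, str]]:
    # Read the benchmark CSV and verify both required columns are
    # present and non-empty on every row.
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    for i, row in enumerate(rows, start=1):
        if not row.get("input_data") or not row.get("expected_output"):
            raise ValueError(f"row {i}: input_data and expected_output are required")
    return rows

cases = load_test_cases("data/test_cases.csv")
print(f"{len(cases)} test cases look well-formed")
```
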
## Prompt Templates

Templates wrap your inputs with instructions. Both Jinja2 (`.j2`) and Python f-string templates are supported.

Example: Classification template (`prompts/classify.j2`):

```jinja2
Classify this support ticket.
CATEGORIES: billing, account, refund, subscription, technical
CONFIDENCE: high, medium, low
OUTPUT FORMAT: <category>::<confidence>
OUTPUT ONLY the format above. No explanation. No punctuation. No other text.
TICKET: {{ input_data }}
OUTPUT:
```

Example: Extraction template (`prompts/extract.j2`):

```jinja2
Extract the email from this text.
OUTPUT FORMAT: <email or NONE>
OUTPUT ONLY the format above. No explanation. No other text.
TEXT: {{ input_data }}
OUTPUT:
```
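
Rendering is plain Jinja2: the template receives one variable, `input_data`, per CSV row. A minimal sketch of what happens to each row:

```python
from jinja2 import Template

# Render the classification template against a single CSV row's input.
with open("prompts/classify.j2") as f:
    template = Template(f.read())

prompt = template.render(input_data="My order hasn't arrived.")
print(prompt)  # the full prompt sent to each candidate model
```
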
### When to use templates

| Use Case | Template Purpose |
|---|---|
| Classification | Define categories, enforce output format |
| Extraction | Specify what to extract, output format |
| Summarization | Set length constraints, style |
| Translation | Specify target language, formality |
| Q&A | Provide context, format requirements |

### Template variable

| Variable | Description |
|---|---|
| `input_data` | The value from your CSV's `input_data` column |

## CLI Reference

### rightsize benchmark

```bash
rightsize benchmark <csv_file> [OPTIONS]
```

| Option | Short | Default | Description |
|---|---|---|---|
| `--template` | `-t` | (required) | Path to prompt template file |
| `--model` | `-m` | (required) | Model ID to test (repeat for multiple) |
| `--judge` | `-j` | (required) | Model for judging outputs |
| `--baseline` | `-b` | None | Baseline model for savings calculation |
| `--concurrency` | `-c` | 10 | Max parallel requests |
| `--output` | `-o` | `table` | Output format: `table`, `json`, `csv` |

### rightsize models

List all available models and their pricing:

```bash
rightsize models
```

## Configuration

Set via environment variables or a `.env` file:

| Variable | Required | Default | Description |
|---|---|---|---|
| `RIGHTSIZE_OPENROUTER_API_KEY` | Yes | - | Your OpenRouter API key |
| `RIGHTSIZE_MAX_CONCURRENCY` | No | 10 | Default concurrency |
| `RIGHTSIZE_TIMEOUT_SECONDS` | No | 60 | Request timeout |
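
For example, a `.env` file in the working directory (illustrative values mirroring the defaults above):

```ini
RIGHTSIZE_OPENROUTER_API_KEY=sk-or-...
RIGHTSIZE_MAX_CONCURRENCY=10
RIGHTSIZE_TIMEOUT_SECONDS=60
```
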
## Examples

### Support ticket classification

```bash
rightsize benchmark data/test_cases.csv \
  -t prompts/classify.j2 \
  -m google/gemma-3-12b-it \
  -m mistralai/mistral-small-3.1-24b-instruct \
  -m deepseek/deepseek-chat-v3.1 \
  -j google/gemini-2.5-flash \
  -b google/gemini-2.5-flash
```

### Find the cheapest model that works

```bash
rightsize benchmark data/test_cases.csv \
  -t prompts/classify.j2 \
  -m google/gemma-3-12b-it \
  -m google/gemma-3-27b-it \
  -m qwen/qwen3-8b \
  -m meta-llama/llama-3.3-70b-instruct \
  -j anthropic/claude-sonnet-4 \
  -b google/gemini-2.5-flash
```

### Export results to JSON

```bash
rightsize benchmark data/test_cases.csv \
  -t prompts/classify.j2 \
  -m mistralai/mistral-small-3.1-24b-instruct \
  -m deepseek/deepseek-chat-v3.1 \
  -j google/gemini-2.5-flash \
  -b google/gemini-2.5-flash \
  -o json > results.json
```
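
The exported JSON can then be post-processed, e.g. to apply a quality bar. The field names below are hypothetical (the schema isn't documented in this README); inspect `results.json` and adjust:

```python
import json

# NOTE: "accuracy", "cost_per_1k", and "model" are hypothetical field names.
with open("results.json") as f:
    results = json.load(f)

passing = [r for r in results if r["accuracy"] >= 0.95]  # quality bar: 95%+
passing.sort(key=lambda r: r["cost_per_1k"])             # cheapest first
for r in passing:
    print(r["model"], r["cost_per_1k"])
```
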
## Tips

- **Use minimal output formats** - `category::confidence` is cheaper than JSON, and JSON is cheaper than prose
- **End prompts with "OUTPUT:"** - Primes the model to respond immediately without preamble
- **Start with 10-20 test cases** - Enough to be representative, fast to iterate
- **Set a quality bar** - Decide what accuracy % is acceptable (e.g., 95%+)
- **Consider latency** - Sometimes a slower cheap model isn't worth it
- **Iterate on prompts** - A better prompt can make cheaper models work better

## License

MIT