
rightsize-cli

The Biggest Model for Every Task? That's Just Lazy.

Stop overpaying for AI. Benchmark your prompts against 200+ models via OpenRouter to find the cheapest one that still works.

This is the production-grade CLI version of the RightSize web tool.

The Problem

You're probably using Claude Sonnet or GPT-5 for everything. That's expensive. Many tasks work just as well with smaller, cheaper models - you just don't know which ones.

The Solution

RightSize benchmarks your actual prompts against multiple models and shows you:

  • Accuracy - How well each model performs (judged by a strong LLM)
  • Latency - Response time (p95)
  • Cost - Projected cost per 1,000 runs
  • Savings - How much you save vs your baseline

Quick Start

# Install
uv pip install -e .

# Set your OpenRouter API key
export RIGHTSIZE_OPENROUTER_API_KEY="sk-or-..."

# Run a benchmark
rightsize benchmark data/test_cases.csv \
  --template prompts/classify.j2 \
  --model google/gemma-3-12b-it \
  --model mistralai/mistral-small-3.1-24b-instruct \
  --model deepseek/deepseek-chat-v3.1 \
  --judge google/gemini-2.5-flash \
  --baseline google/gemini-2.5-flash

Output:

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┓
┃ Model                                   ┃ Accuracy ┃ Latency (p95) ┃ Cost/1k  ┃ Savings  ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━┩
│ google/gemma-3-12b-it                   │    71.0% │        4200ms │  $0.0028 │   +93.7% │
│ mistralai/mistral-small-3.1-24b-instruct│    85.0% │        1200ms │  $0.0035 │   +92.2% │
│ deepseek/deepseek-chat-v3.1             │    95.0% │         800ms │  $0.0180 │   +60.0% │
│ google/gemini-2.5-flash                 │   100.0% │        1900ms │  $0.0450 │       —  │
└─────────────────────────────────────────┴──────────┴───────────────┴──────────┴──────────┘
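
Savings is measured against the baseline's cost per 1,000 runs: for mistral-small above, 1 - $0.0035 / $0.0450 ≈ 92.2%.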

How It Works

  1. You provide test cases - A CSV with inputs and expected outputs
  2. Candidate models compete - All models run the same prompts in parallel
  3. LLM-as-Judge scores - A judge model compares each output to your expected output
  4. You see the results - Cost, accuracy, latency - pick the cheapest model that meets your bar
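
Conceptually, step 2 is a concurrent fan-out of one rendered prompt to every candidate model. The sketch below illustrates the idea against OpenRouter's OpenAI-compatible chat completions endpoint; it is a simplified illustration, not rightsize's actual implementation:

# Illustrative fan-out: every candidate model answers the same prompt
# concurrently via OpenRouter's OpenAI-compatible API. A sketch of the
# idea, not rightsize's source code.
import asyncio
import os
import time

import httpx

API_URL = "https://openrouter.ai/api/v1/chat/completions"
API_KEY = os.environ["RIGHTSIZE_OPENROUTER_API_KEY"]

async def complete(client: httpx.AsyncClient, model: str, prompt: str) -> dict:
    start = time.perf_counter()
    resp = await client.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"model": model, "messages": [{"role": "user", "content": prompt}]},
        timeout=60.0,
    )
    resp.raise_for_status()
    text = resp.json()["choices"][0]["message"]["content"]
    return {"model": model, "output": text, "latency_ms": (time.perf_counter() - start) * 1000}

async def run_case(models: list[str], prompt: str) -> list[dict]:
    # All candidates answer the same rendered prompt in parallel,
    # e.g. asyncio.run(run_case(["google/gemma-3-12b-it"], prompt))
    async with httpx.AsyncClient() as client:
        return await asyncio.gather(*(complete(client, m, prompt) for m in models))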

CSV Format

Two columns: input_data and expected_output:

input_data,expected_output
"My order hasn't arrived.",billing::high
"How do I reset my password?",account::high
"I want a refund!",refund::high

The judge model compares each model's output to expected_output and scores:

  • 1.0 - Exact or semantic match
  • 0.8 - Very close with minor differences
  • 0.5 - Partially correct
  • 0.0 - Wrong or irrelevant
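
A scoring call can be as simple as a prompt that embeds the rubric plus both answers, then parses a single number back out. The wording and parsing below are assumptions for illustration, not rightsize's actual judge prompt:

# Sketch of an LLM-as-judge scoring step. Fill the template with
# JUDGE_PROMPT.format(expected=..., candidate=...) and send it to the
# judge model; the rubric mirrors the scores listed above.
JUDGE_PROMPT = """Score the CANDIDATE answer against the EXPECTED answer.
Rubric: 1.0 exact or semantic match; 0.8 very close; 0.5 partially correct; 0.0 wrong.
EXPECTED: {expected}
CANDIDATE: {candidate}
Respond with the number only."""

def parse_score(reply: str) -> float:
    # Clamp to [0, 1] and treat an unparseable reply as a wrong answer.
    try:
        return max(0.0, min(1.0, float(reply.strip())))
    except ValueError:
        return 0.0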

Best practices for test data

  1. Use minimal output formats - Delimiter-separated (category::confidence) keeps responses short, costs low
  2. Consistent task type - All rows should be the same kind of task
  3. Representative samples - Use real data from your production use case
  4. Clear expected outputs - Unambiguous so the judge can score fairly
  5. 10-20 test cases - Enough to expose real differences between models, fast to run

Prompt Templates

Templates wrap your inputs with instructions. Supports Jinja2 (.j2) or Python f-strings.

Example: Classification template (prompts/classify.j2):

Classify this support ticket.

CATEGORIES: billing, account, refund, subscription, technical
CONFIDENCE: high, medium, low

OUTPUT FORMAT: <category>::<confidence>
OUTPUT ONLY the format above. No explanation. No punctuation. No other text.

TICKET: {{ input_data }}

OUTPUT:

Example: Extraction template (prompts/extract.j2):

Extract the email from this text.

OUTPUT FORMAT: <email or NONE>
OUTPUT ONLY the format above. No explanation. No other text.

TEXT: {{ input_data }}

OUTPUT:
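
To sanity-check a template before benchmarking, you can render it directly with jinja2 against one CSV row (a quick standalone check, independent of rightsize):

# Preview how a template expands for one row; input_data is the only
# variable rightsize passes in.
from jinja2 import Template

with open("prompts/classify.j2") as f:
    tmpl = Template(f.read())

print(tmpl.render(input_data="My order hasn't arrived."))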

When to use templates

Use Case        Template Purpose
Classification  Define categories, enforce output format
Extraction      Specify what to extract, output format
Summarization   Set length constraints, style
Translation     Specify target language, formality
Q&A             Provide context, format requirements

Template variable

Variable    Description
input_data  The value from your CSV's input_data column

CLI Reference

rightsize benchmark

rightsize benchmark <csv_file> [OPTIONS]

Option         Short  Default     Description
--template     -t     (required)  Path to prompt template file
--model        -m     (required)  Model ID to test (repeat for multiple)
--judge        -j     (required)  Model for judging outputs
--baseline     -b     None        Baseline model for savings calculation
--concurrency  -c     10          Max parallel requests
--output       -o     table       Output format: table, json, csv

rightsize models

List all available models and their pricing:

rightsize models

Configuration

Set via environment variables or .env file:

Variable                      Required  Default  Description
RIGHTSIZE_OPENROUTER_API_KEY  Yes       -        Your OpenRouter API key
RIGHTSIZE_MAX_CONCURRENCY     No        10       Default concurrency
RIGHTSIZE_TIMEOUT_SECONDS     No        60       Request timeout
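
For example, an equivalent .env file:

RIGHTSIZE_OPENROUTER_API_KEY=sk-or-...
RIGHTSIZE_MAX_CONCURRENCY=10
RIGHTSIZE_TIMEOUT_SECONDS=60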

Examples

Support ticket classification

rightsize benchmark data/test_cases.csv \
  -t prompts/classify.j2 \
  -m google/gemma-3-12b-it \
  -m mistralai/mistral-small-3.1-24b-instruct \
  -m deepseek/deepseek-chat-v3.1 \
  -j google/gemini-2.5-flash \
  -b google/gemini-2.5-flash

Find the cheapest model that works

rightsize benchmark data/test_cases.csv \
  -t prompts/classify.j2 \
  -m google/gemma-3-12b-it \
  -m google/gemma-3-27b-it \
  -m qwen/qwen3-8b \
  -m meta-llama/llama-3.3-70b-instruct \
  -j anthropic/claude-sonnet-4 \
  -b google/gemini-2.5-flash

Export results to JSON

rightsize benchmark data/test_cases.csv \
  -t prompts/classify.j2 \
  -m mistralai/mistral-small-3.1-24b-instruct \
  -m deepseek/deepseek-chat-v3.1 \
  -j google/gemini-2.5-flash \
  -b google/gemini-2.5-flash \
  -o json > results.json
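
From there you can filter results programmatically. The field names below are assumptions about the JSON schema, so check the actual output and adjust:

# Hypothetical post-processing of results.json; the keys "model",
# "accuracy" (assumed to be a 0-1 fraction), and "cost_per_1k" are
# illustrative guesses, not documented fields.
import json

with open("results.json") as f:
    results = json.load(f)

# Keep models at or above a 90% accuracy bar, cheapest first.
good = [r for r in results if r.get("accuracy", 0) >= 0.90]
for r in sorted(good, key=lambda r: r.get("cost_per_1k", 0.0)):
    print(f'{r["model"]}: {r["accuracy"]:.0%} at ${r["cost_per_1k"]:.4f}/1k')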

Tips

  1. Use minimal output formats - category::confidence is cheaper than JSON, JSON is cheaper than prose
  2. End prompts with "OUTPUT:" - Primes the model to respond immediately without preamble
  3. Start with 10-20 test cases - Enough to be representative, fast to iterate
  4. Set a quality bar - Decide what accuracy % is acceptable (e.g., 95%+)
  5. Consider latency - Sometimes a slower cheap model isn't worth it
  6. Iterate on prompts - A better prompt can make cheaper models work better

License

MIT
