Benchmark prompts against multiple LLMs via OpenRouter to find the cheapest model that works.

rightsize-cli

The Biggest Model for Every Task? That's Just Lazy.

Stop overpaying for AI. Benchmark your prompts against 200+ models via OpenRouter to find the cheapest one that still works.

This is the production-grade CLI version of the RightSize web tool.

Installation

# Using pip
pip install rightsize-cli

# Using uv
uv pip install rightsize-cli

Quick Start

# Set your OpenRouter API key
export RIGHTSIZE_OPENROUTER_API_KEY="sk-or-..."

# List available models
rightsize-cli models

# Run a benchmark
rightsize-cli benchmark test_cases.csv \
  -t prompt.j2 \
  -m google/gemma-3-12b-it \
  -m deepseek/deepseek-chat-v3.1 \
  -j google/gemini-3-flash-preview \
  -b google/gemini-2.5-flash

# Run a benchmark and open the interactive visualization in the browser
rightsize-cli benchmark test_cases.csv \
  -t prompt.j2 \
  -m google/gemma-3-12b-it \
  -m deepseek/deepseek-chat-v3.1 \
  -j google/gemini-3-flash-preview \
  -b google/gemini-2.5-flash \
  --visualize

Run without installing (uvx)

# Set API key
export RIGHTSIZE_OPENROUTER_API_KEY="sk-or-..."

# List models
uvx rightsize-cli models

# Run benchmark
uvx rightsize-cli benchmark test_cases.csv \
  -t prompt.j2 \
  -m google/gemma-3-12b-it \
  -m deepseek/deepseek-chat-v3.1 \
  -j google/gemini-3-flash-preview \
  -b google/gemini-2.5-flash

# Run benchmark + open web visualization
uvx rightsize-cli benchmark test_cases.csv \
  -t prompt.j2 \
  -m google/gemma-3-12b-it \
  -m deepseek/deepseek-chat-v3.1 \
  -j google/gemini-3-flash-preview \
  -b google/gemini-2.5-flash \
  --visualize

Output

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┓
┃ Model                       ┃ Accuracy ┃ Latency (p95) ┃ Cost/1k  ┃ Savings  ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━┩
│ google/gemma-3-12b-it       │    71.0% │        4200ms │  $0.0028 │   +93.7% │
│ deepseek/deepseek-chat-v3.1 │    95.0% │         800ms │  $0.0180 │   +60.0% │
│ google/gemini-2.5-flash     │   100.0% │        1900ms │  $0.0450 │       —  │
└─────────────────────────────┴──────────┴───────────────┴──────────┴──────────┘
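The Savings column appears to be the cost reduction relative to the baseline model (-b). The README does not state the formula, so the sketch below is an assumption that happens to match the sample row for deepseek/deepseek-chat-v3.1; `savings_pct` is a hypothetical helper, not part of the CLI.

```python
def savings_pct(cost_per_1k: float, baseline_cost_per_1k: float) -> float:
    """Percent saved versus the baseline; positive means cheaper than the baseline."""
    if baseline_cost_per_1k <= 0:
        raise ValueError("baseline cost must be positive")
    return (1 - cost_per_1k / baseline_cost_per_1k) * 100


# Cost/1k values copied from the sample table above.
baseline = 0.0450  # google/gemini-2.5-flash
print(f"{savings_pct(0.0180, baseline):+.1f}%")  # deepseek/deepseek-chat-v3.1 -> +60.0%
```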

How It Works

1. You provide test cases - A CSV with inputs and expected outputs
2. Candidate models compete - All models run the same prompts in parallel
3. LLM-as-Judge scores - A judge model compares each output to your expected output
4. You see the results - Cost, accuracy, latency - pick the cheapest model that meets your bar
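The steps above can be sketched as a small asyncio pipeline. This is an illustration under stated assumptions, not the CLI's actual code: `benchmark`, `call_model`, and `judge` are hypothetical names, and the stubs below stand in for real OpenRouter requests.

```python
import asyncio
from statistics import mean


async def benchmark(models, cases, call_model, judge, concurrency=10):
    """Run every (model, case) pair with bounded parallelism and average the judge scores."""
    limit = asyncio.Semaphore(concurrency)

    async def run_one(model, prompt, expected):
        async with limit:  # never more than `concurrency` requests in flight
            output = await call_model(model, prompt)
            return model, await judge(output, expected)

    results = await asyncio.gather(
        *(run_one(m, prompt, expected) for m in models for prompt, expected in cases)
    )
    scores = {m: [] for m in models}
    for model, score in results:
        scores[model].append(score)
    return {m: mean(s) for m, s in scores.items()}


# Stub model and judge (hypothetical) so the sketch runs offline.
async def fake_model(model, prompt):
    return "billing::high" if model == "good" else "unknown"


async def exact_judge(output, expected):
    return 1.0 if output == expected else 0.0


cases = [("ticket 1", "billing::high"), ("ticket 2", "billing::high")]
print(asyncio.run(benchmark(["good", "bad"], cases, fake_model, exact_judge)))
```

The semaphore plays the role that --concurrency plays in the CLI: all requests are scheduled at once, but only a bounded number run at a time.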

CSV Format

The CSV has two columns, input_data and expected_output:

input_data,expected_output
"My order hasn't arrived.",billing::high
"How do I reset my password?",account::high
"I want a refund!",refund::high

The judge model compares each model's output to expected_output and scores:

1.0 - Exact or semantic match
0.8 - Very close with minor differences
0.5 - Partially correct
0.0 - Wrong or irrelevant
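How the per-case scores roll up into the Accuracy column is not spelled out here. A minimal sketch, assuming accuracy is simply the mean judge score expressed as a percentage (`accuracy_pct` is a hypothetical helper):

```python
from statistics import mean


def accuracy_pct(scores: list[float]) -> float:
    """Average judge scores (each between 0.0 and 1.0) and express them as a percentage."""
    if not scores:
        raise ValueError("no scores to aggregate")
    if any(not 0.0 <= s <= 1.0 for s in scores):
        raise ValueError("judge scores must be between 0.0 and 1.0")
    return mean(scores) * 100


# Five test cases: two exact matches, one close, one partial, one wrong.
print(f"{accuracy_pct([1.0, 1.0, 0.8, 0.5, 0.0]):.1f}%")  # 66.0%
```

Under this assumption, partial credit matters: with 10-20 test cases, a single 0.5 score moves accuracy by several points.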

Best practices for test data

Use minimal output formats - Delimiter-separated output (category::confidence) keeps responses short and costs low
Consistent task type - All rows should be the same kind of task
Representative samples - Use real data from your production use case
Clear expected outputs - Unambiguous so the judge can score fairly
10-20 test cases - Enough to be representative, fast to run
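Before spending API credits, it can help to check a test file against the format above. The loader below is a sketch using only the standard library; `load_cases` is a hypothetical helper, and the CLI may validate its input differently.

```python
import csv
import io

REQUIRED_COLUMNS = ("input_data", "expected_output")


def load_cases(csv_text: str) -> list[tuple[str, str]]:
    """Parse CSV text into (input_data, expected_output) pairs, rejecting bad rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing column(s): {', '.join(missing)}")
    cases = []
    # Row numbers are approximate if a quoted field spans several lines.
    for row_no, row in enumerate(reader, start=2):  # row 1 is the header
        inp = (row["input_data"] or "").strip()
        exp = (row["expected_output"] or "").strip()
        if not inp or not exp:
            raise ValueError(f"row {row_no}: empty input_data or expected_output")
        cases.append((inp, exp))
    return cases


sample = '''input_data,expected_output
"My order hasn't arrived.",billing::high
"I want a refund!",refund::high
'''
print(load_cases(sample))
```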

Prompt Templates

Templates wrap your inputs with instructions. The CLI supports Jinja2 (.j2) templates or Python f-strings.

Example: Classification template (prompt.j2):

Classify this support ticket.

CATEGORIES: billing, account, refund, subscription, technical
CONFIDENCE: high, medium, low

OUTPUT FORMAT: <category>::<confidence>
OUTPUT ONLY the format above. No explanation. No punctuation. No other text.

TICKET: {{ input_data }}

OUTPUT:

Example: Extraction template (extract.j2):

Extract the email from this text.

OUTPUT FORMAT: <email or NONE>
OUTPUT ONLY the format above. No explanation. No other text.

TEXT: {{ input_data }}

OUTPUT:

Template variable

Variable	Description
`input_data`	The value from your CSV's `input_data` column
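To preview the exact prompt a model would receive, the substitution can be reproduced locally. The sketch below assumes the f-string style uses a single-brace `{input_data}` placeholder, which this README does not show (only the Jinja2 form `{{ input_data }}` appears above); `render` is a hypothetical helper.

```python
# Classification template from above, rewritten with an assumed {input_data} placeholder.
TEMPLATE = """Classify this support ticket.

CATEGORIES: billing, account, refund, subscription, technical
CONFIDENCE: high, medium, low

OUTPUT FORMAT: <category>::<confidence>
OUTPUT ONLY the format above. No explanation. No punctuation. No other text.

TICKET: {input_data}

OUTPUT:"""


def render(template: str, input_data: str) -> str:
    """Fill the single input_data placeholder using str.format."""
    return template.format(input_data=input_data)


prompt = render(TEMPLATE, "How do I reset my password?")
print(prompt)
```

Printing a few rendered prompts is a cheap way to catch a template that drops the input or leaves stray whitespace after "OUTPUT:".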

CLI Reference

`rightsize-cli benchmark`

rightsize-cli benchmark <csv_file> [OPTIONS]

Option	Short	Default	Description
`--template`	`-t`	(required)	Path to prompt template file
`--model`	`-m`	(required)	Model ID to test (repeat for multiple)
`--judge`	`-j`	(required)	Model for judging outputs
`--baseline`	`-b`	None	Baseline model for savings calculation
`--concurrency`	`-c`	10	Max parallel requests
`--output`	`-o`	`table`	Output format: table, json, csv
`--verbose`	`-v`	False	Show detailed outputs and judge scores
`--visualize`	`-V`	False	Open interactive web visualization
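When benchmarks are launched from a script, the options in the table map directly onto an argument list. A sketch, using only the flags documented above; `build_benchmark_cmd` is a hypothetical helper and nothing is executed here.

```python
import shlex


def build_benchmark_cmd(csv_file, template, models, judge,
                        baseline=None, output="table", concurrency=None):
    """Assemble a `rightsize-cli benchmark` argument list from the documented options."""
    if not models:
        raise ValueError("at least one model is required")
    cmd = ["rightsize-cli", "benchmark", csv_file, "--template", template]
    for model in models:  # --model is repeated once per candidate
        cmd += ["--model", model]
    cmd += ["--judge", judge]
    if baseline:
        cmd += ["--baseline", baseline]
    if concurrency is not None:
        cmd += ["--concurrency", str(concurrency)]
    cmd += ["--output", output]
    return cmd


cmd = build_benchmark_cmd(
    "test_cases.csv",
    "prompt.j2",
    ["google/gemma-3-12b-it", "deepseek/deepseek-chat-v3.1"],
    judge="google/gemini-3-flash-preview",
    baseline="google/gemini-2.5-flash",
    output="json",
)
print(shlex.join(cmd))
```

The list can then be passed to `subprocess.run(cmd, check=True)` once the API key is set in the environment.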

`rightsize-cli models`

List all available models and their pricing:

rightsize-cli models

Configuration

Set these via environment variables or a .env file:

Variable	Required	Default	Description
`RIGHTSIZE_OPENROUTER_API_KEY`	Yes	-	Your OpenRouter API key
`RIGHTSIZE_MAX_CONCURRENCY`	No	10	Default max parallel requests
`RIGHTSIZE_TIMEOUT_SECONDS`	No	60	Request timeout in seconds
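A wrapper script can run a quick pre-flight check on these variables before invoking the CLI. The sketch below mirrors the defaults in the table; `load_settings` is a hypothetical helper, it is not how the CLI itself reads configuration, and it does not parse .env files.

```python
import os


def load_settings(env=None):
    """Resolve the documented variables from a mapping, applying the table's defaults."""
    env = os.environ if env is None else env
    api_key = env.get("RIGHTSIZE_OPENROUTER_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("RIGHTSIZE_OPENROUTER_API_KEY is required")
    return {
        "api_key": api_key,
        "max_concurrency": int(env.get("RIGHTSIZE_MAX_CONCURRENCY", "10")),
        "timeout_seconds": float(env.get("RIGHTSIZE_TIMEOUT_SECONDS", "60")),
    }


# Pass an explicit mapping so the example does not depend on the real environment.
print(load_settings({"RIGHTSIZE_OPENROUTER_API_KEY": "sk-or-example"}))
```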

Examples

Compare cheap models against a baseline

uvx rightsize-cli benchmark test_cases.csv \
  -t prompt.j2 \
  -m google/gemma-3-12b-it \
  -m google/gemma-3-27b-it \
  -m qwen/qwen3-8b \
  -m meta-llama/llama-3.3-70b-instruct \
  -j google/gemini-3-flash-preview \
  -b google/gemini-2.5-flash

Use a stronger judge model

uvx rightsize-cli benchmark test_cases.csv \
  -t prompt.j2 \
  -m google/gemma-3-12b-it \
  -m deepseek/deepseek-chat-v3.1 \
  -j anthropic/claude-sonnet-4 \
  -b google/gemini-2.5-flash

Export results to JSON

uvx rightsize-cli benchmark test_cases.csv \
  -t prompt.j2 \
  -m google/gemma-3-12b-it \
  -m deepseek/deepseek-chat-v3.1 \
  -j google/gemini-3-flash-preview \
  -b google/gemini-2.5-flash \
  -o json > results.json

Debug with verbose mode

uvx rightsize-cli benchmark test_cases.csv \
  -t prompt.j2 \
  -m google/gemma-3-12b-it \
  -j google/gemini-3-flash-preview \
  -b google/gemini-2.5-flash \
  -v

Tips

Use minimal output formats - category::confidence is cheaper than JSON, and JSON is cheaper than prose
End prompts with "OUTPUT:" - Primes the model to respond immediately without preamble
Start with 10-20 test cases - Enough to be representative, fast to iterate
Set a quality bar - Decide what accuracy % is acceptable (e.g., 95%+)
Consider latency - A cheaper model that is much slower may not be worth the savings
Iterate on prompts - A better prompt can make cheaper models work better
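The quality-bar and latency tips can be turned into a simple selection rule: filter by minimum accuracy (and optionally a latency budget), then take the cheapest survivor. The sketch below reuses the numbers from the sample output table; `Result` and `cheapest_passing` are hypothetical names, not CLI features.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    model: str
    accuracy: float     # percent
    p95_ms: int         # p95 latency in milliseconds
    cost_per_1k: float  # dollars


# Rows copied from the sample output table.
RESULTS = [
    Result("google/gemma-3-12b-it", 71.0, 4200, 0.0028),
    Result("deepseek/deepseek-chat-v3.1", 95.0, 800, 0.0180),
    Result("google/gemini-2.5-flash", 100.0, 1900, 0.0450),
]


def cheapest_passing(results, min_accuracy=95.0, max_p95_ms=None):
    """Return the cheapest result that meets the accuracy bar and latency budget, or None."""
    ok = [
        r for r in results
        if r.accuracy >= min_accuracy and (max_p95_ms is None or r.p95_ms <= max_p95_ms)
    ]
    return min(ok, key=lambda r: r.cost_per_1k) if ok else None


print(cheapest_passing(RESULTS).model)  # deepseek/deepseek-chat-v3.1
```

Raising the bar to 99% would select google/gemini-2.5-flash instead, which is why the bar is worth deciding before looking at the costs.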

Development

# Clone the repo
git clone https://github.com/NehmeAILabs/rightsize-cli.git
cd rightsize-cli

# Install in dev mode
uv pip install -e .

# Run locally
rightsize-cli models

License

MIT

Release history

0.1.3 - Jan 29, 2026
0.1.2 - Jan 29, 2026 (this version)
0.1.1 - Jan 28, 2026
0.1.0 - Jan 28, 2026

Download files

Source Distribution

rightsize_cli-0.1.2.tar.gz (29.6 kB)

Uploaded Jan 29, 2026 Source

Built Distribution

rightsize_cli-0.1.2-py3-none-any.whl (13.4 kB)

Uploaded Jan 29, 2026 Python 3

File details

Details for the file rightsize_cli-0.1.2.tar.gz.

File metadata

Download URL: rightsize_cli-0.1.2.tar.gz
Upload date: Jan 29, 2026
Size: 29.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.6.6

File hashes

Hashes for rightsize_cli-0.1.2.tar.gz
Algorithm	Hash digest
SHA256	`bb68fcfde48bfa1264d5739467977b924216740ca3b280e573b27a017203fca5`
MD5	`d531c512893f30a366af465e453b6296`
BLAKE2b-256	`ce87abc6e330e8d2622a66036fa5d0b1db56976f65ba141974ee34969367b2c8`
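A downloaded file can be checked against the SHA256 digest listed above with the standard library. This is a generic sketch; `sha256_of` is a hypothetical helper and the file path is whatever location the archive was saved to.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# SHA256 digest listed above for rightsize_cli-0.1.2.tar.gz.
EXPECTED_SHA256 = "bb68fcfde48bfa1264d5739467977b924216740ca3b280e573b27a017203fca5"
```

Comparing `sha256_of("rightsize_cli-0.1.2.tar.gz")` with `EXPECTED_SHA256` confirms the download matches the published file.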

File details

Details for the file rightsize_cli-0.1.2-py3-none-any.whl.

File metadata

Download URL: rightsize_cli-0.1.2-py3-none-any.whl
Upload date: Jan 29, 2026
Size: 13.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.6.6

File hashes

Hashes for rightsize_cli-0.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`bf24211c0b5c32adc7b2ddd2510ec1d0f17d14182bc62542bc22e11c14d9b32d`
MD5	`a06b96bf178eab57cb4b79d758bc92cb`
BLAKE2b-256	`48948d0e46ab5c94c5abb24cc69ed61e925d8dba609613a44aef06bb1a531767`
