Run a prompt across multiple LLMs and compare outputs side by side in the terminal.

assayer

Send a prompt to multiple language models in parallel and compare their outputs in the terminal. Useful for evaluating which model handles a given task better, measuring semantic similarity between responses, or running an LLM-as-judge evaluation, all without leaving the shell.

Installation

pip install assayer

Similarity scoring requires the optional score extra:

pip install "assayer[score]"

Python 3.11 or newer is required.

Contributing? See CONTRIBUTING.md for setup, code style, and PR guidelines.

Supported Providers

  • OpenAI: GPT and o-series models.
  • Anthropic: Claude 4.5 to 4.7 models (Opus, Sonnet, Haiku).
  • Google Gemini: 2.0 to 3.1 Pro, Flash, and Flash-Lite models.
  • Ollama: Local models running on your machine.

Configuration

Assayer looks for API keys in environment variables or a configuration file at ~/.assayer/config.json.

Environment Variables

export OPENAI_API_KEY="your-key"
export ANTHROPIC_API_KEY="your-key"
export GEMINI_API_KEY="your-key"

Configuration File

{
  "OPENAI_API_KEY": "sk-...",
  "ANTHROPIC_API_KEY": "sk-ant-...",
  "GEMINI_API_KEY": "..."
}

Use assayer models check to verify your configuration.

Quickstart

assayer run "Explain recursion in one sentence." --models gpt-4o,claude-haiku-4-5-20251001

Commands

run

assayer run "prompt" --models gpt-4o,claude-sonnet-4-5
assayer run --prompt-file prompt.txt --models gpt-4o,ollama/llama3
assayer run "prompt" --models gpt-4o,claude-sonnet-4-5 --score
assayer run "prompt" --models gpt-4o,claude-sonnet-4-5 --judge gpt-4o --judge-criteria "clarity,brevity"
assayer run "prompt" --models gpt-4o,claude-sonnet-4-5 --output results.json
assayer run "prompt" --models gpt-4o,claude-sonnet-4-5 --output results.csv
assayer run "prompt with {var}" --models gpt-4o --var key=value

Flag              Description
--models          Comma-separated model identifiers (required)
--prompt-file     Path to a .txt file instead of an inline prompt
--var             KEY=VALUE template variable, repeatable
--system          System prompt applied to all models
--temperature     Sampling temperature
--max-tokens      Maximum output tokens
--score           Show pairwise similarity matrix
--judge           Model to use as judge
--judge-criteria  Comma-separated criteria for the judge
--output          Save results to .json or .csv
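
To make the --var behavior concrete, here is a minimal sketch of how repeated KEY=VALUE pairs could be parsed and substituted into {var} placeholders. The helper names parse_vars and render_prompt are hypothetical, and the choice to leave unknown placeholders untouched is an assumption; assayer's actual templating may differ.

```python
def parse_vars(pairs):
    """Turn ["key=value", ...] into a dict, splitting on the first "=" only."""
    variables = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected KEY=VALUE, got {pair!r}")
        variables[key] = value
    return variables


def render_prompt(template, variables):
    """Replace each {key} placeholder; unknown placeholders are left as-is."""
    for key, value in variables.items():
        template = template.replace("{" + key + "}", value)
    return template


# Equivalent of: --var topic=recursion --var lang=Python
variables = parse_vars(["topic=recursion", "lang=Python"])
print(render_prompt("Explain {topic} in {lang}.", variables))
```

Splitting on the first "=" only means a value may itself contain "=".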

models

assayer models list               # list all supported model identifiers
assayer models check              # check which API keys are configured
assayer models check ollama       # check if Ollama is running and list local models

config

assayer config set OPENAI_API_KEY sk-...
assayer config show

Keys are saved to ~/.assayer/config.json. Environment variables take precedence.
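
The precedence rule can be sketched as a small lookup: check the environment first, then fall back to the config file. This is an illustrative sketch only; resolve_key is a hypothetical helper, not part of assayer's documented API, and the handling of a missing or malformed file is an assumption.

```python
import json
import os
from pathlib import Path

# Path taken from the documentation above.
CONFIG_PATH = Path.home() / ".assayer" / "config.json"


def resolve_key(name, config_path=CONFIG_PATH, environ=None):
    """Return the key from the environment if set, else from the config file."""
    environ = os.environ if environ is None else environ
    if environ.get(name):
        return environ[name]  # environment variables take precedence
    try:
        config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        return None  # assumption: a missing or malformed file means "no key"
    return config.get(name) if isinstance(config, dict) else None
```

Calling resolve_key("OPENAI_API_KEY") would return the exported value when one exists, even if the config file holds a different key.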

Providers

OpenAI

export OPENAI_API_KEY=sk-...

Supported models: gpt-5.5, gpt-5.5-pro, gpt-5.4, gpt-5.4-pro, gpt-5.4-mini, gpt-5.4-nano, gpt-5.2, gpt-5, gpt-5-mini, gpt-5-nano, gpt-4.1, gpt-4.1-mini, gpt-4.1-nano, gpt-4o, gpt-4o-mini, o3, o3-mini, o4-mini

Anthropic

export ANTHROPIC_API_KEY=sk-ant-...

Supported models: claude-opus-4-7, claude-sonnet-4-6, claude-haiku-4-5-20251001, claude-opus-4-6, claude-sonnet-4-5, claude-opus-4-5

Google Gemini

export GEMINI_API_KEY=...

Supported models: gemini-3.1-pro-preview, gemini-3.1-flash-lite, gemini-3-flash-preview, gemini-2.5-pro, gemini-2.5-flash, gemini-2.5-flash-lite, gemini-2.0-flash, gemini-2.0-flash-lite

Ollama (local)

No API key needed. Start Ollama and use the ollama/ prefix:

ollama serve
assayer run "prompt" --models ollama/llama4-scout,ollama/llama3.2,ollama/qwen3
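
As an illustration of how a --models value with the ollama/ prefix might be split into provider and model name, here is a minimal sketch. parse_models is a hypothetical helper, and returning None as the provider for unprefixed identifiers is an assumption; how assayer actually routes names such as gpt-4o to a provider is not described here.

```python
def parse_models(spec):
    """Split "a,provider/b" into [(None, "a"), ("provider", "b")]."""
    models = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue  # ignore empty entries such as a trailing comma
        provider, sep, name = item.partition("/")
        if sep:
            models.append((provider, name))
        else:
            models.append((None, item))  # no explicit provider prefix
    return models


print(parse_models("gpt-4o,ollama/llama3.2,ollama/qwen3"))
```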

Scoring

--score embeds all outputs using all-MiniLM-L6-v2 (runs locally, no API call) and displays a pairwise cosine similarity matrix. Values typically fall between 0 (unrelated) and 1 (identical meaning); cosine similarity can in principle be slightly negative for very dissimilar texts.
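
The matrix itself is plain cosine similarity between every pair of embedding vectors. The sketch below uses small hand-written vectors in place of real all-MiniLM-L6-v2 embeddings, so the model names and numbers are illustrative assumptions, not assayer output.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|), or 0.0 for a zero vector."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_matrix(vectors):
    """Pairwise cosine similarity; entry [i][j] compares vectors i and j."""
    return [[cosine(u, v) for v in vectors] for u in vectors]


# Toy stand-ins for sentence embeddings (real ones have hundreds of dimensions).
embeddings = {
    "model-a": [1.0, 0.0, 1.0],
    "model-b": [1.0, 0.0, 1.0],
    "model-c": [0.0, 1.0, 0.0],
}
matrix = similarity_matrix(list(embeddings.values()))
for name, row in zip(embeddings, matrix):
    print(name, [round(x, 2) for x in row])
```

The matrix is symmetric with ones on the diagonal, so only the entries above the diagonal carry information.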

LLM-as-judge

--judge <model> sends all outputs to the specified model and asks it to pick a winner. Use --judge-criteria to focus the evaluation:

assayer run "Write a sorting algorithm." \
  --models gpt-4o,claude-sonnet-4-5 \
  --judge gpt-4o \
  --judge-criteria "correctness,readability"

If the judge call fails, a warning is printed to stderr and the run continues normally.
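
The failure behavior described above amounts to wrapping the judge call so an error degrades to a warning instead of aborting the run. A minimal sketch, assuming a hypothetical run_judge wrapper and a judge passed in as a plain callable:

```python
import sys


def run_judge(judge, outputs):
    """Call the judge; on any error, warn on stderr and return None."""
    try:
        return judge(outputs)
    except Exception as exc:  # broad on purpose: the judge is optional
        print(f"warning: judge call failed: {exc}", file=sys.stderr)
        return None


def failing_judge(outputs):
    # Stand-in for a provider call that errors out.
    raise RuntimeError("simulated provider error")


verdict = run_judge(failing_judge, ["output A", "output B"])
print("verdict:", verdict)  # the rest of the run still proceeds
```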

Export

--output results.json saves full results as JSON. --output results.csv saves as CSV. The file format is determined by the extension.
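
Choosing the format from the file extension can be sketched with the standard json and csv modules. The save_results helper and the row fields (model, output) are hypothetical; the actual structure of assayer's exported results is not documented here.

```python
import csv
import json
from pathlib import Path


def save_results(rows, path):
    """Write a list of flat dicts as JSON or CSV, chosen by file extension."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".json":
        path.write_text(json.dumps(rows, indent=2), encoding="utf-8")
    elif suffix == ".csv":
        fieldnames = list(rows[0]) if rows else []
        with path.open("w", newline="", encoding="utf-8") as handle:
            writer = csv.DictWriter(handle, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)
    else:
        raise ValueError(f"unsupported output format: {suffix or path.name}")
```

For example, save_results([{"model": "gpt-4o", "output": "..."}], "results.csv") would write a header row followed by one line per model.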
