prompt-lab
Test prompt variants across LLM providers with LLM-as-judge evaluation.
Installation
pip install llm-prompt-lab
Or with pipx for isolated installs:
pipx install llm-prompt-lab
API Keys
Set your provider API keys as environment variables:
export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...
Alternatively, create a .env file in your working directory — prompt-lab loads it automatically:
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
Only the keys for providers you use are required.
Development
uv sync
Quick Start
# Create a new experiment
prompt-lab new
prompt-lab new --config spec.yaml
# Run an experiment
prompt-lab run experiments/my-experiment
# Run a single variant
prompt-lab run experiments/my-experiment/v1
# View results
prompt-lab results experiments/my-experiment/v1
# Compare variants
prompt-lab compare experiments/my-experiment
How It Works
system.md (optional) + prompt.md + inputs.yaml → LLM → response → judge.md → score
- prompt.md is the user message sent to the LLM (the prompt you want to evaluate)
- system.md (optional) is the system message — persona, instructions, constraints
- inputs.yaml provides test cases with variables for both files
- The messages are sent to each configured model (LLM)
- judge.md evaluates each response and assigns a score
Create multiple variants (v1, v2, etc.) to compare different prompt approaches.
Experiment Structure
my-experiment/
├── experiment.md # Config: models, runs (required)
├── judge.md # Evaluator: scoring criteria (required)
├── inputs.yaml # Shared test cases (optional, used by all variants)
├── v1/ # Variant (at least one required)
│ ├── prompt.md # User message (required)
│ ├── system.md # System message (optional)
│ └── tools.yaml # Tool definitions (optional)
└── v2/ # Another variant to compare...
├── prompt.md # Different prompt approach
└── system.md # Different system instructions
Both judge.md and inputs.yaml support fallback: if not found in the variant folder, the experiment-level file is used. This allows sharing test cases across variants for fair A/B comparison.
File Formats
experiment.md
Defines the experiment name, description, models, and default number of runs per input.
---
name: my-experiment
description: Testing different prompt styles
hypothesis: Concise prompts will score higher than verbose ones
models:
- openai:gpt-4o-mini
- anthropic:claude-sonnet-4-20250514
runs: 5
---
Optional markdown content describing the experiment.
Experiment options:
| Option | Default | Description |
|---|---|---|
| name | folder name | Experiment identifier |
| description | "" | Brief description |
| models | required | List of models to test |
| runs | 5 | Runs per input (for statistical analysis) |
| hypothesis | "" | What you're testing (displayed in results) |
prompt.md (user message)
The user message sent to the LLM. Use {{ variables }} to inject test data from inputs.yaml.
Generate 5 creative product names.
Product description: {{ description }}
Seed words: {{ seeds }}
Product names:
Each variant folder contains a different prompt.md to compare approaches (e.g., zero-shot vs few-shot, formal vs casual tone, etc.).
For hardcoded prompts without variables, just write the prompt directly:
Tell me a joke about programming.
system.md (optional system message)
If present, becomes the system message for the LLM call. Uses the same {{ variables }} from inputs.yaml.
You are a helpful assistant. You can check the weather using the get_weather tool when users ask about weather conditions.
Only use the weather tool when the user is actually asking about weather. For other questions, just respond normally without using any tools.
When system.md is absent, the prompt is sent as the user message with no system message.
When to use system.md:
- Setting a persona or role for the LLM
- Providing instructions that frame behavior (e.g., tool usage rules)
- Separating "what the model is" from "what the user asks"
inputs.yaml (optional)
Test cases with variables matching the prompt and system templates. If omitted, runs once with empty data (useful for static prompts without variables).
- id: alice
name: Alice
- id: bob
name: Bob
runs: 10 # Override experiment's runs for this input
Each input case can have any number of fields. All fields (except id and runs) are available as {{ variables }} in both prompt.md and system.md.
Location: Can be placed at experiment level (shared across all variants) or in a variant folder (variant-specific). Variant-level inputs take precedence over experiment-level.
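The field rules above can be sketched in a few lines. An illustrative sketch (helper names are hypothetical): per-input `runs` overrides the experiment default, and every other field except `id` becomes a template variable:

```python
def effective_runs(input_case: dict, experiment_runs: int = 5) -> int:
    """Per-input `runs` overrides the experiment-level default."""
    return int(input_case.get("runs", experiment_runs))

def template_vars(input_case: dict) -> dict:
    """All fields except `id` and `runs` become {{ variables }}."""
    return {k: v for k, v in input_case.items() if k not in ("id", "runs")}
```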
Input options:
| Field | Default | Description |
|---|---|---|
| id | input-N | Unique identifier for results |
| runs | experiment runs | Override runs for this specific input |
| (other) | - | Variables available in prompt and system templates |
tools.yaml (optional)
Define tools (functions) that the LLM can call during execution. Useful for testing prompts that involve function calling.
- name: get_weather
description: Get current weather for a location
parameters:
type: object
properties:
location:
type: string
description: City name
unit:
type: string
enum: [celsius, fahrenheit]
required:
- location
- name: search
description: Search the web
parameters:
type: object
properties:
query:
type: string
Tool fields:
| Field | Required | Description |
|---|---|---|
| name | yes | Tool/function name |
| description | no | What the tool does |
| parameters | no | JSON Schema for tool parameters |
Tool calls made by the model are captured in the response and available for judge evaluation.
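Each tools.yaml entry is already close to a JSON-Schema function definition. As an illustration (this is an assumption about prompt-lab's internals, not its actual code), one such entry can be wrapped in the OpenAI chat-completions tool format like this:

```python
def to_openai_tool(tool: dict) -> dict:
    """Wrap one tools.yaml entry in the OpenAI chat-completions tool schema.

    Illustrative sketch; defaults to an empty object schema when no
    parameters are declared.
    """
    fn = {"name": tool["name"]}
    if "description" in tool:
        fn["description"] = tool["description"]
    fn["parameters"] = tool.get("parameters", {"type": "object", "properties": {}})
    return {"type": "function", "function": fn}
```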
judge.md (evaluator)
Defines how to score each LLM response. The judge is another LLM that evaluates quality based on your criteria.
---
model: openai:gpt-4o-mini
score_range: [1, 10]
temperature: 0
---
You are evaluating a greeting response.
## Rubric
- **10**: Uses user's name, warm tone, offers to help
- **8-9**: Uses name and friendly, but generic
- **6-7**: Friendly but doesn't use name
- **4-5**: Cold or overly formal
- **1-3**: Inappropriate or ignores user
**Prompt:** {{ prompt }}
**Model Response:** {{ response }}
Judge options:
| Option | Default | Description |
|---|---|---|
| model | openai:gpt-4o | Model to use for judging (single judge) |
| models | - | List of models for multi-judge (opt-in, see below) |
| aggregation | mean | Score aggregation: mean or median (multi-judge only) |
| score_range | [1, 10] | Min and max score |
| temperature | 0 | 0 = deterministic, higher = more varied |
| chain_of_thought | true | Step-by-step reasoning before scoring (disable with false) |
Multi-Judge Evaluation (Opt-in)
Use multiple LLM models as judges to reduce self-enhancement bias (when a model scores itself favorably). Scores are aggregated using mean or median.
---
models:
- openai:gpt-4o-mini
- anthropic:claude-sonnet-4-20250514
aggregation: mean
score_range: [1, 10]
---
## Rubric
...
When to use multi-judge:
- Testing responses from GPT models? Add Claude as a judge (and vice versa)
- Need more reliable scores? Multiple perspectives reduce bias
- High-stakes evaluations where accuracy matters
Trade-offs:
- Requires API keys for multiple providers
- 2x API costs for judging
- Slightly slower execution
Note: Use model: (singular) for single judge, models: (plural) for multi-judge.
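The aggregation step itself is simple. A sketch of the two supported modes, using Python's standard library (median is more robust when one judge is an outlier):

```python
from statistics import mean, median

def aggregate_scores(scores: list[float], aggregation: str = "mean") -> float:
    """Combine per-judge scores into a single score (mean or median)."""
    if aggregation == "median":
        return median(scores)
    return mean(scores)
```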
Chain-of-Thought Evaluation
By default, the judge analyzes responses step-by-step before scoring. This improves alignment with human judgment by reducing anchoring bias.
To disable Chain-of-Thought (for faster/cheaper evaluations):
---
model: openai:gpt-4o-mini
score_range: [1, 10]
chain_of_thought: false
---
## Rubric
...
When enabled, the judge will:
- Review each rubric criterion
- Analyze how the response meets each criterion
- Identify strengths and weaknesses
- Only then provide the final score
Multiple Runs & Statistics
For more reliable evaluation, run each input multiple times and get statistical analysis:
# experiment.md
---
name: my-experiment
models:
- openai:gpt-4o-mini
runs: 5
---
Results show hypothesis and mean with 95% confidence interval:
Hypothesis: Concise prompts will score higher than verbose ones
┏━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Input ┃ Model ┃ Mean ┃ 95% CI ┃ Range ┃ Scores ┃
┡━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━┩
│ alice │ openai:gpt-4o-mini │ 9.2 │ (8.5-9.9) │ 8-10 │ 9, 10, 9, 9, 9│
│ bob │ openai:gpt-4o-mini │ 8.4 │ (7.8-9.0) │ 8-9 │ 8, 9, 8, 8, 9 │
└───────┴────────────────────┴──────┴──────────────┴───────┴───────────────┘
⚠ Low sample size (3 runs). Consider runs: 5+ for reliable statistics.
When runs > 1:
- Cache is disabled to get independent LLM responses
- Each input is evaluated N times
- 95% confidence intervals show the reliability of your results
- Warning shown when sample size is too small for reliable statistics
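A 95% confidence interval over N runs can be computed from the sample mean and standard deviation. The sketch below is not prompt-lab's implementation; it assumes a t-interval with a small lookup of critical values, falling back to the normal approximation 1.96 for larger samples:

```python
from math import sqrt
from statistics import mean, stdev

# Two-sided 95% t critical values for small samples (df = n - 1).
T95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
       6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262}

def ci95(scores: list[float]) -> tuple[float, float]:
    """95% confidence interval for the mean score over repeated runs."""
    n = len(scores)
    m = mean(scores)
    if n < 2:
        return (m, m)  # no spread estimate from a single run
    t = T95.get(n - 1, 1.96)
    half = t * stdev(scores) / sqrt(n)
    return (m - half, m + half)
```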
Statistical Significance
When comparing variants, the compare command shows whether differences are statistically significant:
prompt-lab compare experiments/my-experiment
Hypothesis: Concise prompts will score higher than verbose ones
┏━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━┓
┃ Variant ┃ Mean Score ┃ 95% CI ┃ Avg Latency ┃ Runs ┃
┡━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━┩
│ v1 │ 8.5/10 │ (8.1-8.9) │ 450ms │ 2×5 │
│ v2 │ 7.2/10 │ (6.8-7.6) │ 420ms │ 2×5 │
└─────────┴────────────┴──────────────┴─────────────┴──────┘
Statistical Significance (Welch's t-test, α=0.05):
✓ v1 > v2 (p≤0.01)
This helps you know if v1 is actually better than v2, or if the difference is just noise.
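The test statistic behind that comparison can be sketched with the standard library. This computes Welch's t and the Welch–Satterthwaite degrees of freedom only; turning t into a p-value needs the t-distribution CDF (e.g. scipy.stats), which is omitted here:

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a: list[float], b: list[float]) -> tuple[float, float]:
    """Welch's t statistic and degrees of freedom for two score samples."""
    na, nb = len(a), len(b)
    va, vb = variance(a) / na, variance(b) / nb  # squared standard errors
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch–Satterthwaite approximation for the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df
```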
Templating
Variables from inputs.yaml are available in both prompt.md and system.md:
Hello {{ name }}, you are {{ age }} years old.
Literal braces don't need escaping:
Return JSON: {"result": "value"}
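The substitution behavior can be illustrated with a regex-based sketch. This is an assumption about the semantics, not prompt-lab's actual template engine: only `{{ name }}` placeholders are replaced, so single braces in JSON pass through untouched:

```python
import re

def render(template: str, variables: dict) -> str:
    """Substitute {{ name }} placeholders; leave everything else as-is."""
    def sub(m: re.Match) -> str:
        name = m.group(1)
        # Unknown variables are left in place rather than erased.
        return str(variables.get(name, m.group(0)))
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", sub, template)
```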
CLI Commands
new
Create a new experiment. Runs an interactive wizard or reads from a config file.
# Interactive wizard
prompt-lab new
# From config file
prompt-lab new --config spec.yaml
Options:
| Option | Short | Description |
|---|---|---|
| --config | -c | Path to experiment spec YAML |
run
Run a prompt experiment. Auto-detects scope from path.
# Run all variants
prompt-lab run experiments/my-experiment
# Run single variant
prompt-lab run experiments/my-experiment/v1
# Run specific model only
prompt-lab run experiments/my-experiment/v1 --model openai:gpt-4o-mini
# Skip cache (fresh API calls)
prompt-lab run experiments/my-experiment/v1 --no-cache
# Hide progress bar
prompt-lab run experiments/my-experiment -q
Options:
| Option | Short | Description |
|---|---|---|
| --model | -m | Run only this model |
| --no-cache | | Disable response caching |
| --quiet | -q | Hide progress bar |
results
Show results table for a variant.
prompt-lab results experiments/my-experiment/v1
# Show specific run
prompt-lab results experiments/my-experiment/v1 --run 2026-01-25T19-30-00
compare
Compare results across all variants.
prompt-lab compare experiments/my-experiment
show
Show detailed responses with judge reasoning.
# Show all responses
prompt-lab show experiments/my-experiment/v1
# Filter by input
prompt-lab show experiments/my-experiment/v1 --input alice
# Filter by model
prompt-lab show experiments/my-experiment/v1 --model openai:gpt-4o-mini
# Combine filters
prompt-lab show experiments/my-experiment/v1 --input alice --model openai:gpt-4o-mini
clean
Clean experiment results. Auto-detects scope from path.
# Clean single variant results
prompt-lab clean experiments/my-experiment/v1
# Clean all variants (auto-detected from experiment path)
prompt-lab clean experiments/my-experiment
# Skip confirmation
prompt-lab clean experiments/my-experiment --yes
Options:
| Option | Short | Description |
|---|---|---|
| --yes | -y | Skip confirmation prompt |
cache
Manage response cache.
# Clear all cached responses
prompt-lab cache clear
Supported Providers
| Provider | Model format |
|---|---|
| OpenAI | openai:gpt-4o, openai:gpt-4o-mini |
| Anthropic | anthropic:claude-sonnet-4-20250514 |
References
LLM-as-judge evaluation methodology and best practices:
- Evidently AI - LLM-as-a-Judge Complete Guide
- Sebastian Raschka - Understanding the 4 Main Approaches to LLM Evaluation
- Eugene Yan - Evaluating the Effectiveness of LLM-Evaluators
- Monte Carlo - LLM-as-Judge: 7 Best Practices
- Arize AI - Evidence-Based Prompting Strategies for LLM-as-a-Judge
- A Survey on LLM-as-a-Judge (2024)
- Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge