
llm-regtest

A lightweight regression testing framework for LLM prompts. Catch semantic drift before it ships.


What does this tool do?

When you use an AI model, the exact wording of your instructions — called a prompt — has a large effect on response quality. Small wording changes, or a model upgrade, can cause regressions that are hard to catch without systematic testing.

This tool works like a test suite for prompts:

  1. You define test cases with prompts and optional inputs
  2. Run llm-regtest update-baseline — the model's responses become your golden dataset
  3. Later, after changing a prompt or upgrading a model, run llm-regtest run
  4. The tool compares new responses to the baselines and flags anything that regressed

Each comparison produces a score from 0.0 to 1.0 using one or more methods:

Method      How it works                                                   Best for
-------------------------------------------------------------------------------------------------------------------
exact       Character-for-character match (0 or 1)                         Classification labels, structured outputs
fuzzy       Levenshtein edit-distance ratio                                General text where minor wording drift is acceptable
semantic    Cosine similarity of sentence embeddings (all-MiniLM-L6-v2)    Longer outputs where meaning matters more than wording
llm_judge   An LLM rates quality similarity 0.0–1.0                        Conversational outputs where both fuzzy and semantic are too strict

Results are bucketed into PASS / WARN / FAIL based on configurable score thresholds.
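
As a rough illustration of the aggregation and bucketing, here is a minimal sketch — the function names and structure are assumptions, not llm-regtest's internals:

# Sketch of weighted score aggregation and PASS/WARN/FAIL bucketing.
# Names and structure are illustrative assumptions, not the tool's code.
def aggregate(scores, weights=None):
    """Weighted mean of per-method scores; equal weights if none given."""
    if not weights:
        weights = {method: 1.0 for method in scores}
    total = sum(weights[m] for m in scores)
    return sum(scores[m] * weights[m] for m in scores) / total

def bucket(score, pass_threshold=0.8, warn_threshold=0.5):
    if score >= pass_threshold:
        return "PASS"
    if score >= warn_threshold:
        return "WARN"
    return "FAIL"

agg = aggregate({"fuzzy": 0.71, "semantic": 0.79}, {"fuzzy": 0.3, "semantic": 0.7})
print(round(agg, 2), bucket(agg))  # 0.77 WARN (matches the sample run later in the quickstart)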


What you need before starting

You need a working Python installation with pip, plus an API key for your chosen provider (the built-in stub provider needs no key). To check your Python version:

python --version

Installation

# OpenAI support
pip install "llm-regtest[openai]"

# Anthropic (Claude) support
pip install "llm-regtest[anthropic]"

# Semantic similarity scoring (sentence-transformers)
pip install "llm-regtest[semantic]"

# Everything
pip install "llm-regtest[all]"

The [semantic] extra installs sentence-transformers and numpy. The first time you use semantic scoring, the all-MiniLM-L6-v2 model (~80 MB) is downloaded automatically.
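
If you want to pre-fetch the model (for example on a CI machine), you can warm the cache with plain sentence-transformers, outside of llm-regtest:

python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('all-MiniLM-L6-v2')"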


Quickstart

1. Initialise the project

llm-regtest init

Creates .promptregtest/config.json and a prompt_cases.json file.

2. Set your API key

# Mac / Linux
export OPENAI_API_KEY="sk-..."

# Windows PowerShell
$env:OPENAI_API_KEY="sk-..."

3. Define test cases

Edit prompt_cases.json:

[
  {
    "id": "summarize-article",
    "prompt": "Summarize the following text in one sentence:",
    "input": "Scientists discovered that short 10-minute walks after meals reduce blood sugar spikes.",
    "baseline_output": "",
    "tags": ["summarization"]
  },
  {
    "id": "sentiment-label",
    "prompt": "Classify the sentiment. Reply with exactly one word: positive, negative, or neutral.",
    "input": "I absolutely loved the product!",
    "baseline_output": "",
    "tags": ["classification"]
  }
]

4. Generate baselines

llm-regtest update-baseline

This calls the AI for every case and saves the responses as baselines inside prompt_cases.json.
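
For example, the sentiment-label case might come back as follows (the exact wording depends on the model):

{
  "id": "sentiment-label",
  "prompt": "Classify the sentiment. Reply with exactly one word: positive, negative, or neutral.",
  "input": "I absolutely loved the product!",
  "baseline_output": "positive",
  "tags": ["classification"]
}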

5. Run the regression tests

llm-regtest run

Sample output:

  [PASS] summarize-article (fuzzy: 0.94, semantic: 0.97, agg=0.96) [312ms, $0.00018]
  [WARN] email-rewrite (fuzzy: 0.71, semantic: 0.79, agg=0.77) [289ms, $0.00014]
  [FAIL] sentiment-label (exact: 0.00, fuzzy: 0.43, agg=0.21) [198ms, $0.00009]

  ------------------------------------------------
  Results (3 cases): 1 passed, 1 warned, 1 failed
  Cost: $0.00041  |  Latency: 266.3ms avg
  ------------------------------------------------

Each result line shows latency (ms) and cost (USD) alongside the scores. The summary line aggregates total cost and average latency across the run.

6. View a saved report

llm-regtest report

How the Regression Check Works

prompt_cases.json            .promptregtest/baselines/
┌──────────────────┐         ┌───────────────────────────┐
│  id: "summarize" │         │  summarize.txt             │
│  prompt: "..."   │──run──▶ │  "Short 10-min walks..."   │◀── baseline
│  input:  "..."   │         └───────────────────────────┘
└──────────────────┘                      │
                                          │ compare
         New run output ──────────────────┘
         "Brief post-meal walks..."
                  │
                  ▼
         scorer.semantic_similarity(new, baseline)
              = cosine(embed(new), embed(baseline))
              = 0.91  →  PASS (threshold: 0.85)

On each run, the tool:

  1. Loads every case from prompt_cases.json
  2. Sends the prompt (+ optional input) to the configured model
  3. Measures latency (wall-clock ms) and token counts for that call
  4. Computes USD cost from the provider's pricing table
  5. Runs each configured scorer against the stored baseline
  6. Computes a weighted aggregate score
  7. Compares it to the pass / warn thresholds
  8. Saves a JSON report to .promptregtest/reports/

Cost and Latency Tracking

Every run automatically captures:

Metric                         Where it appears
------------------------------------------------------------------
Per-request latency (ms)       Beside each result: [312ms, $0.00018]
Per-request USD cost           Beside each result: [312ms, $0.00018]
Total run cost                 Summary line: Cost: $0.00041
Average latency                Summary line: Latency: 266.3ms avg
Input / output token counts    Saved in the JSON report

Cost is calculated from built-in pricing tables for common OpenAI and Anthropic models. Unknown models report $0.00000 (no crash). The JSON report stores input_tokens, output_tokens, cost_usd, and latency_ms per case for later analysis.
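
Conceptually, the calculation is a per-token price lookup. A minimal sketch — the PRICING_PER_1M table and its values below are placeholder assumptions, not the tool's built-in data:

# Sketch of per-request cost from a pricing table.
# PRICING_PER_1M and its values are placeholder assumptions.
PRICING_PER_1M = {            # USD per 1M tokens: (input, output)
    "gpt-4o-mini": (0.15, 0.60),
}

def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    if model not in PRICING_PER_1M:
        return 0.0  # unknown model: report $0 rather than crash
    price_in, price_out = PRICING_PER_1M[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000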


Semantic Similarity Scoring

The semantic scorer encodes both the new output and the baseline as sentence embeddings, then computes their cosine similarity. Two responses that express the same idea in different words score close to 1.0; responses with entirely different meaning score near 0.0.
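
The idea can be sketched directly with sentence-transformers — this illustrates the technique, though llm-regtest's own scorer may differ in details:

# Sketch of embedding-based similarity with sentence-transformers.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_similarity(new: str, baseline: str) -> float:
    emb = model.encode([new, baseline], convert_to_tensor=True)
    return float(util.cos_sim(emb[0], emb[1]))

score = semantic_similarity(
    "Brief post-meal walks lower blood sugar spikes.",
    "Short 10-minute walks after meals reduce blood sugar spikes.",
)
print(score)  # close to 1.0 despite the different wording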

Install:

pip install "llm-regtest[semantic]"

Configure:

{
  "scoring": {
    "methods": ["fuzzy", "semantic"],
    "weights": { "fuzzy": 0.3, "semantic": 0.7 },
    "thresholds": { "pass": 0.85, "warn": 0.65 }
  }
}

The all-MiniLM-L6-v2 model is downloaded once, cached on disk, and kept in memory across cases in the same run. If sentence-transformers is not installed, the scorer is simply not registered — there is no crash on import. You only get an error if your config explicitly requests "semantic" without the package installed.


Comparing Two Versions of a Prompt

The most common workflow: lock in version A, change the prompt, compare.

1. Generate the baseline (version A)

llm-regtest update-baseline

2. Modify your prompt in prompt_cases.json

"prompt": "Write a one-sentence summary focusing on the key finding:"

3. Run the comparison

llm-regtest run --verbose

--verbose also prints a line-by-line diff between the baseline and the new output for each case.
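
The diff itself is standard unified-diff output; Python's built-in difflib produces the same shape, though llm-regtest's exact formatting may differ:

# Sketch: a unified diff between baseline and new output, as --verbose shows.
import difflib

baseline = "Short 10-minute walks after meals reduce blood sugar spikes."
new = "Brief post-meal walks lower blood sugar spikes."

for line in difflib.unified_diff(
    baseline.splitlines(), new.splitlines(),
    fromfile="baseline", tofile="new", lineterm="",
):
    print(line)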

4. Compare two saved reports side-by-side

llm-regtest compare \
  --report-a .promptregtest/reports/report_before.json \
  --report-b .promptregtest/reports/report_after.json

Output:

  Case                score-a  score-b    delta  change
  ------------------------------------------------------
  email-reply           0.73     0.95    +0.22  IMPROVED
  summarize-article     0.91     0.84    -0.07  REGRESSED
  sentiment-label       1.00     1.00    +0.00  unchanged

  1 improved, 1 regressed, 1 unchanged

End-to-End OpenAI Demo

A complete demo script is included that shows the full workflow with real OpenAI API calls — including baseline generation, clean regression run, drift simulation, and A/B comparison.

Prerequisites:

pip install "llm-regtest[openai,semantic]"
export OPENAI_API_KEY="sk-..."

Run:

python examples/demo_openai.py

The script runs four steps automatically:

Step                     What happens
---------------------------------------------------------------------------
1/4  Generate baselines  Calls gpt-4o-mini for all 8 cases and saves responses
2/4  Regression run      Reruns all cases — scores should be near 1.0
3/4  Simulate drift      Rewrites two prompt wordings to mimic a real prompt change
4/4  Detect regression   Reruns with drifted prompts — expect WARN/FAIL on changed cases
Bonus  A/B compare       Side-by-side delta table between the two runs

The 8 demo cases cover summarization, tone rewriting, sentiment classification, and factual Q&A — a representative spread for testing how different task types respond to prompt drift.


Configuration Reference

Default config lives at .promptregtest/config.json:

{
  "model": {
    "provider": "openai",
    "model_name": "gpt-4o-mini",
    "temperature": 0.0,
    "max_tokens": 1024,
    "system_prompt": ""
  },
  "scoring": {
    "methods": ["fuzzy", "semantic"],
    "weights": { "fuzzy": 0.3, "semantic": 0.7 },
    "thresholds": {
      "pass": 0.85,
      "warn": 0.65
    },
    "llm_judge_model": null
  },
  "prompt_cases_path": "prompt_cases.json",
  "reports_dir": ".promptregtest/reports",
  "baselines_dir": ".promptregtest/baselines",
  "concurrency": 1
}

Setting            What it does
---------------------------------------------------------------------------------------------
provider           "openai", "anthropic", or "stub" (no API key needed, for testing)
model_name         Model ID, e.g. "gpt-4o-mini", "claude-sonnet-4-6"
temperature        Set to 0.0 for deterministic, repeatable outputs — strongly recommended for testing
max_tokens         Maximum response length
system_prompt      Global system-level instruction sent with every case
methods            Scorers to use: any combination of "exact", "fuzzy", "semantic", "llm_judge"
weights            Per-method weights for the aggregate. Omit for equal weighting
thresholds.pass    Score at or above this → PASS (default: 0.8)
thresholds.warn    Score at or above this → WARN (default: 0.5); below → FAIL
llm_judge_model    Model config for the LLM-as-judge scorer (same shape as model)
concurrency        Cases to run in parallel (default: 1)

Test Case Fields

{
  "id": "my-test",
  "prompt": "Summarize in one sentence:",
  "prompt_file": "prompts/summarize.md",
  "system_prompt": "You are a concise assistant.",
  "system_prompt_file": "prompts/system/concise.md",
  "input": "Text to summarize goes here.",
  "inputs": ["Input A", "Input B", "Input C"],
  "inputs_file": "fixtures/reviews.json",
  "variables": { "name": "Alice", "role": "engineer" },
  "baseline_output": "",
  "tags": ["smoke", "summarization"]
}

Field                Required  What it does
-----------------------------------------------------------------------------
id                   Yes       Unique identifier (no spaces)
prompt               Yes*      The instruction text
prompt_file          Yes*      Path to a .txt or .md file containing the prompt
system_prompt        No        Per-case system prompt (overrides global config)
system_prompt_file   No        Path to a file containing the system prompt
input                No        Extra text appended to the prompt
inputs               No        List of inputs — generates id[0], id[1], ... sub-cases
inputs_file          No        JSON file with a list of input strings
variables            No        Values for {placeholder} templates in the prompt
baseline_output      No        Auto-filled by update-baseline — leave blank
tags                 No        Labels for filtering with --tag

*Exactly one of prompt or prompt_file is required.
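
How variables fills {placeholder} slots is plain string templating. A minimal sketch of the assumed behaviour (str.format-style substitution is my assumption, not confirmed by the docs):

# Sketch of {placeholder} substitution; assumed str.format-style behaviour.
prompt = "Greet {name}, the {role}."
variables = {"name": "Alice", "role": "engineer"}
print(prompt.format(**variables))  # -> Greet Alice, the engineer.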


CLI Reference

Command                       What it does
---------------------------------------------------------------------------
llm-regtest init              Create the project folder structure
llm-regtest init --ci         Also create a GitHub Actions workflow file
llm-regtest update-baseline   Run prompts and save responses as new baselines
llm-regtest run               Run prompts and compare to existing baselines
llm-regtest report            Display the most recent saved report
llm-regtest compare           Compare two saved reports side-by-side

Flags for run and update-baseline:

Flag               What it does
------------------------------------------------------------------
--config PATH      Use a non-default config file
--case ID          Run only this case ID (repeatable)
--tag TAG          Run only cases with this tag (repeatable, OR logic)
--concurrency N    Run N cases in parallel

Flags for run only:

Flag               What it does
----------------------------------------------------------------------
--verbose / -v     Print a unified diff for each case
--format console   Default coloured output
--format github    GitHub Actions ::error:: / ::warning:: annotations

Advanced Features

Semantic similarity

"scoring": {
  "methods": ["fuzzy", "semantic"],
  "weights": { "fuzzy": 0.3, "semantic": 0.7 }
}

Requires pip install "llm-regtest[semantic]". Uses all-MiniLM-L6-v2.

LLM-as-judge

"scoring": {
  "methods": ["fuzzy", "llm_judge"],
  "weights": { "fuzzy": 0.4, "llm_judge": 0.6 },
  "llm_judge_model": {
    "provider": "openai",
    "model_name": "gpt-4o-mini"
  }
}
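
The judge model receives both outputs and a rating instruction. A plausible judge prompt, shown as a sketch — this is illustrative only, not llm-regtest's actual template:

# Illustrative judge-prompt construction; not llm-regtest's actual template.
def build_judge_prompt(baseline: str, new: str) -> str:
    return (
        "Rate how similar these two responses are in meaning and quality, "
        "from 0.0 (completely different) to 1.0 (equivalent). "
        "Reply with only the number.\n\n"
        f"Response A (baseline):\n{baseline}\n\n"
        f"Response B (new):\n{new}"
    )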

Parallel execution

llm-regtest run --concurrency 10

Tag filtering

llm-regtest run --tag smoke           # fast per-PR smoke suite
llm-regtest run --tag smoke --tag critical  # OR logic
llm-regtest update-baseline --tag customer-facing

Parameterized inputs

{
  "id": "classify-sentiment",
  "prompt": "Classify as positive, negative, or neutral:",
  "inputs": [
    "Love it!",
    "Broke after one day.",
    "It's fine."
  ]
}

Generates sub-cases classify-sentiment[0], classify-sentiment[1], classify-sentiment[2].
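
A sketch of the expansion — the behaviour here is inferred from the id[i] naming above, not taken from the tool's code:

# Sketch: expanding a parameterized case into sub-cases, one per input.
def expand(case: dict) -> list:
    inputs = case.get("inputs")
    if not inputs:
        return [case]
    subs = []
    for i, text in enumerate(inputs):
        sub = {k: v for k, v in case.items() if k != "inputs"}
        sub["id"] = f'{case["id"]}[{i}]'
        sub["input"] = text
        subs.append(sub)
    return subs

for sub in expand({"id": "classify-sentiment",
                   "inputs": ["Love it!", "Broke after one day.", "It's fine."]}):
    print(sub["id"])  # classify-sentiment[0] ... classify-sentiment[2]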

Prompt files

{
  "id": "legal-analysis",
  "prompt_file": "prompts/analyze_contract.md",
  "system_prompt_file": "prompts/system/legal_analyst.md",
  "input": "..."
}

Paths are relative to the directory containing prompt_cases.json. Keeping prompts in standalone files also means prompt changes show up as clean, reviewable diffs in pull requests.

Custom scorers

from llm_regtest.scorer import register_scorer

def keyword_overlap(output: str, baseline: str) -> float:
    """Score the fraction of baseline words that also appear in the new output."""
    out_words = set(output.lower().split())
    base_words = set(baseline.lower().split())
    if not base_words:
        return 1.0  # empty baseline: nothing to miss
    return len(out_words & base_words) / len(base_words)

register_scorer("keyword_overlap", keyword_overlap)

Then add "keyword_overlap" to methods in your config.


CI / GitHub Actions

Generate a workflow file

llm-regtest init --ci

Creates .github/workflows/prompt-regression.yml — runs on every PR that touches prompt files.

Annotated PR output

llm-regtest run --format github

Emits ::error:: and ::warning:: lines that GitHub renders as inline annotations on the PR diff.
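
Illustratively, the emitted lines have this shape (the message text below reuses the sample run from the quickstart; the exact format is up to the tool):

  ::warning::[WARN] email-rewrite (fuzzy: 0.71, semantic: 0.79, agg=0.77)
  ::error::[FAIL] sentiment-label (exact: 0.00, fuzzy: 0.43, agg=0.21)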

Example workflow

name: Prompt Regression Tests
on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'prompt_cases.json'
jobs:
  regression:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - run: pip install -e ".[openai,semantic]"
      - run: llm-regtest run --format github
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

Add OPENAI_API_KEY as a repository secret under Settings → Secrets → Actions.


Using Claude (Anthropic) as the Model

{
  "model": {
    "provider": "anthropic",
    "model_name": "claude-haiku-4-5-20251001",
    "temperature": 0.0,
    "max_tokens": 1024
  }
}

pip install "llm-regtest[anthropic]"
export ANTHROPIC_API_KEY="sk-ant-..."

Available model names: claude-haiku-4-5-20251001, claude-sonnet-4-6, claude-opus-4-6.


Understanding the Results

Status   Score range          Meaning
--------------------------------------------------------------------------
PASS     ≥ thresholds.pass    Response is very similar to baseline
WARN     ≥ thresholds.warn    Response has drifted noticeably — worth reviewing
FAIL     < thresholds.warn    Significant regression detected
SKIP     n/a                  No baseline exists for this case yet

Default thresholds: pass = 0.8, warn = 0.5. Adjust in config.json under scoring.thresholds.


Troubleshooting

"No module named llm_regtest" Run pip install -e . from the project root directory.

"OPENAI_API_KEY not set" or authentication errors Set the environment variable in the same terminal window you're running the tool from.

All tests show "skip" No baselines yet. Run llm-regtest update-baseline first.

Semantic scorer not available Install with pip install -e ".[semantic]". The scorer is silently skipped if sentence-transformers is not installed.

Scores are lower than expected after a small change Switch from exact to fuzzy or semantic scoring, which are more tolerant of minor wording differences.

Runs are slow with many cases Use --concurrency N to run cases in parallel. Start with 5 and increase if your API rate limits allow.
