
llm-regtest

A lightweight regression testing framework for LLM prompts. Catch semantic drift before it ships.


What does this tool do?

When you use an AI model, the exact wording of your instructions — called a prompt — has a large effect on response quality. Small wording changes, or a model upgrade, can cause regressions that are hard to catch without systematic testing.

This tool works like a test suite for prompts:

  1. You define test cases with prompts and optional inputs
  2. Run llm-regtest update-baseline — the model's responses become your golden dataset
  3. Later, after changing a prompt or upgrading a model, run llm-regtest run
  4. The tool compares new responses to the baselines and flags anything that regressed

Each comparison produces a score from 0.0 to 1.0 using one or more methods:

Method      How it works                                                   Best for
-------------------------------------------------------------------------------------------------------------------
exact       Character-for-character match (0 or 1)                         Classification labels, structured outputs
fuzzy       Levenshtein edit-distance ratio                                General text where minor wording drift is acceptable
semantic    Cosine similarity of sentence embeddings (all-MiniLM-L6-v2)    Longer outputs where meaning matters more than wording
llm_judge   An LLM rates quality similarity 0.0–1.0                        Conversational outputs where both fuzzy and semantic are too strict

Results are bucketed into PASS / WARN / FAIL based on configurable score thresholds.
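
As a rough illustration of the aggregation and bucketing, here is a minimal sketch — the function names and structure are assumptions, not llm-regtest's internals:

# Sketch of weighted score aggregation and PASS/WARN/FAIL bucketing.
# Names and structure are illustrative assumptions, not the tool's code.
def aggregate(scores, weights=None):
    """Weighted mean of per-method scores; equal weights if none given."""
    if not weights:
        weights = {method: 1.0 for method in scores}
    total = sum(weights[m] for m in scores)
    return sum(scores[m] * weights[m] for m in scores) / total

def bucket(score, pass_threshold=0.8, warn_threshold=0.5):
    if score >= pass_threshold:
        return "PASS"
    if score >= warn_threshold:
        return "WARN"
    return "FAIL"

agg = aggregate({"fuzzy": 0.71, "semantic": 0.79}, {"fuzzy": 0.3, "semantic": 0.7})
print(round(agg, 2), bucket(agg))  # 0.77 WARN (matches the sample run later in the quickstart)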


What you need before starting

You need a working Python installation with pip, plus an API key for your chosen provider (the built-in stub provider needs no key). To check your Python version:

python --version

Installation

# OpenAI support
pip install "llm-regtest[openai]"

# Anthropic (Claude) support
pip install "llm-regtest[anthropic]"

# Semantic similarity scoring (sentence-transformers)
pip install "llm-regtest[semantic]"

# Everything
pip install "llm-regtest[all]"

The [semantic] extra installs sentence-transformers and numpy. The first time you use semantic scoring, the all-MiniLM-L6-v2 model (~80 MB) is downloaded automatically.
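
If you want to pre-fetch the model (for example on a CI machine), you can warm the cache with plain sentence-transformers, outside of llm-regtest:

python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('all-MiniLM-L6-v2')"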


Quickstart

1. Initialise the project

llm-regtest init

Creates .promptregtest/config.json and a prompt_cases.json file.

2. Set your API key

# Mac / Linux
export OPENAI_API_KEY="sk-..."

# Windows PowerShell
$env:OPENAI_API_KEY="sk-..."

3. Define test cases

Edit prompt_cases.json:

[
  {
    "id": "summarize-article",
    "prompt": "Summarize the following text in one sentence:",
    "input": "Scientists discovered that short 10-minute walks after meals reduce blood sugar spikes.",
    "baseline_output": "",
    "tags": ["summarization"]
  },
  {
    "id": "sentiment-label",
    "prompt": "Classify the sentiment. Reply with exactly one word: positive, negative, or neutral.",
    "input": "I absolutely loved the product!",
    "baseline_output": "",
    "tags": ["classification"]
  }
]

4. Generate baselines

llm-regtest update-baseline

This calls the AI for every case and saves the responses as baselines inside prompt_cases.json.
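
For example, the sentiment-label case might come back as follows (the exact wording depends on the model):

{
  "id": "sentiment-label",
  "prompt": "Classify the sentiment. Reply with exactly one word: positive, negative, or neutral.",
  "input": "I absolutely loved the product!",
  "baseline_output": "positive",
  "tags": ["classification"]
}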

5. Run the regression tests

llm-regtest run

Sample output:

  [PASS] summarize-article (fuzzy: 0.94, semantic: 0.97, agg=0.96) [312ms, $0.00018]
  [WARN] email-rewrite (fuzzy: 0.71, semantic: 0.79, agg=0.77) [289ms, $0.00014]
  [FAIL] sentiment-label (exact: 0.00, fuzzy: 0.43, agg=0.21) [198ms, $0.00009]

  ------------------------------------------------
  Results (3 cases): 1 passed, 1 warned, 1 failed
  Cost: $0.00041  |  Latency: 266.3ms avg
  ------------------------------------------------

Each result line shows latency (ms) and cost (USD) alongside the scores. The summary line aggregates total cost and average latency across the run.

6. View a saved report

llm-regtest report

How the Regression Check Works

prompt_cases.json            .promptregtest/baselines/
┌──────────────────┐         ┌───────────────────────────┐
│  id: "summarize" │         │  summarize.txt             │
│  prompt: "..."   │──run──▶ │  "Short 10-min walks..."   │◀── baseline
│  input:  "..."   │         └───────────────────────────┘
└──────────────────┘                      │
                                          │ compare
         New run output ──────────────────┘
         "Brief post-meal walks..."
                  │
                  ▼
         scorer.semantic_similarity(new, baseline)
              = cosine(embed(new), embed(baseline))
              = 0.91  →  PASS (threshold: 0.85)

On each run, the tool:

  1. Loads every case from prompt_cases.json
  2. Sends the prompt (+ optional input) to the configured model
  3. Measures latency (wall-clock ms) and token counts for that call
  4. Computes USD cost from the provider's pricing table
  5. Runs each configured scorer against the stored baseline
  6. Computes a weighted aggregate score
  7. Compares it to the pass / warn thresholds
  8. Saves a JSON report to .promptregtest/reports/

Cost and Latency Tracking

Every run automatically captures:

Metric                         Where it appears
------------------------------------------------------------------
Per-request latency (ms)       Beside each result: [312ms, $0.00018]
Per-request USD cost           Beside each result: [312ms, $0.00018]
Total run cost                 Summary line: Cost: $0.00041
Average latency                Summary line: Latency: 266.3ms avg
Input / output token counts    Saved in the JSON report

Cost is calculated from built-in pricing tables for common OpenAI and Anthropic models. Unknown models report $0.00000 (no crash). The JSON report stores input_tokens, output_tokens, cost_usd, and latency_ms per case for later analysis.
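
Conceptually, the calculation is a per-token price lookup. A minimal sketch — the PRICING_PER_1M table and its values below are placeholder assumptions, not the tool's built-in data:

# Sketch of per-request cost from a pricing table.
# PRICING_PER_1M and its values are placeholder assumptions.
PRICING_PER_1M = {            # USD per 1M tokens: (input, output)
    "gpt-4o-mini": (0.15, 0.60),
}

def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    if model not in PRICING_PER_1M:
        return 0.0  # unknown model: report $0 rather than crash
    price_in, price_out = PRICING_PER_1M[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000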


Semantic Similarity Scoring

The semantic scorer encodes both the new output and the baseline as sentence embeddings, then computes their cosine similarity. Two responses that express the same idea in different words score close to 1.0; responses with entirely different meaning score near 0.0.
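
The idea can be sketched directly with sentence-transformers — this illustrates the technique, though llm-regtest's own scorer may differ in details:

# Sketch of embedding-based similarity with sentence-transformers.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_similarity(new: str, baseline: str) -> float:
    emb = model.encode([new, baseline], convert_to_tensor=True)
    return float(util.cos_sim(emb[0], emb[1]))

score = semantic_similarity(
    "Brief post-meal walks lower blood sugar spikes.",
    "Short 10-minute walks after meals reduce blood sugar spikes.",
)
print(score)  # close to 1.0 despite the different wording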

Install:

pip install "llm-regtest[semantic]"

Configure:

{
  "scoring": {
    "methods": ["fuzzy", "semantic"],
    "weights": { "fuzzy": 0.3, "semantic": 0.7 },
    "thresholds": { "pass": 0.85, "warn": 0.65 }
  }
}

The all-MiniLM-L6-v2 model is downloaded once, cached on disk, and kept in memory across cases in the same run. If sentence-transformers is not installed, the scorer is simply not registered — there is no crash on import. You only get an error if your config explicitly requests "semantic" without the package installed.


Comparing Two Versions of a Prompt

The most common workflow: lock in version A, change the prompt, compare.

1. Generate the baseline (version A)

llm-regtest update-baseline

2. Modify your prompt in prompt_cases.json

"prompt": "Write a one-sentence summary focusing on the key finding:"

3. Run the comparison

llm-regtest run --verbose

--verbose also prints a line-by-line diff between the baseline and the new output for each case.
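
The diff itself is standard unified-diff output; Python's built-in difflib produces the same shape, though llm-regtest's exact formatting may differ:

# Sketch: a unified diff between baseline and new output, as --verbose shows.
import difflib

baseline = "Short 10-minute walks after meals reduce blood sugar spikes."
new = "Brief post-meal walks lower blood sugar spikes."

for line in difflib.unified_diff(
    baseline.splitlines(), new.splitlines(),
    fromfile="baseline", tofile="new", lineterm="",
):
    print(line)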

4. Compare two saved reports side-by-side

llm-regtest compare \
  --report-a .promptregtest/reports/report_before.json \
  --report-b .promptregtest/reports/report_after.json

Output:

  Case                score-a  score-b    delta  change
  ------------------------------------------------------
  email-reply           0.73     0.95    +0.22  IMPROVED
  summarize-article     0.91     0.84    -0.07  REGRESSED
  sentiment-label       1.00     1.00    +0.00  unchanged

  1 improved, 1 regressed, 1 unchanged

End-to-End OpenAI Demo

A complete demo script is included that shows the full workflow with real OpenAI API calls — including baseline generation, clean regression run, drift simulation, and A/B comparison.

Prerequisites:

pip install "llm-regtest[openai,semantic]"
export OPENAI_API_KEY="sk-..."

Run:

python examples/demo_openai.py

The script runs four steps automatically:

Step                     What happens
---------------------------------------------------------------------------
1/4  Generate baselines  Calls gpt-4o-mini for all 8 cases and saves responses
2/4  Regression run      Reruns all cases — scores should be near 1.0
3/4  Simulate drift      Rewrites two prompt wordings to mimic a real prompt change
4/4  Detect regression   Reruns with drifted prompts — expect WARN/FAIL on changed cases
Bonus  A/B compare       Side-by-side delta table between the two runs

The 8 demo cases cover summarization, tone rewriting, sentiment classification, and factual Q&A — a representative spread for testing how different task types respond to prompt drift.


Configuration Reference

Default config lives at .promptregtest/config.json:

{
  "model": {
    "provider": "openai",
    "model_name": "gpt-4o-mini",
    "temperature": 0.0,
    "max_tokens": 1024,
    "system_prompt": ""
  },
  "scoring": {
    "methods": ["fuzzy", "semantic"],
    "weights": { "fuzzy": 0.3, "semantic": 0.7 },
    "thresholds": {
      "pass": 0.85,
      "warn": 0.65
    },
    "llm_judge_model": null
  },
  "prompt_cases_path": "prompt_cases.json",
  "reports_dir": ".promptregtest/reports",
  "baselines_dir": ".promptregtest/baselines",
  "concurrency": 1
}

Setting            What it does
---------------------------------------------------------------------------------------------
provider           "openai", "anthropic", or "stub" (no API key needed, for testing)
model_name         Model ID, e.g. "gpt-4o-mini", "claude-sonnet-4-6"
temperature        Set to 0.0 for deterministic, repeatable outputs — strongly recommended for testing
max_tokens         Maximum response length
system_prompt      Global system-level instruction sent with every case
methods            Scorers to use: any combination of "exact", "fuzzy", "semantic", "llm_judge"
weights            Per-method weights for the aggregate. Omit for equal weighting
thresholds.pass    Score at or above this → PASS (default: 0.8)
thresholds.warn    Score at or above this → WARN (default: 0.5); below → FAIL
llm_judge_model    Model config for the LLM-as-judge scorer (same shape as model)
concurrency        Cases to run in parallel (default: 1)

Test Case Fields

{
  "id": "my-test",
  "prompt": "Summarize in one sentence:",
  "prompt_file": "prompts/summarize.md",
  "system_prompt": "You are a concise assistant.",
  "system_prompt_file": "prompts/system/concise.md",
  "input": "Text to summarize goes here.",
  "inputs": ["Input A", "Input B", "Input C"],
  "inputs_file": "fixtures/reviews.json",
  "variables": { "name": "Alice", "role": "engineer" },
  "baseline_output": "",
  "tags": ["smoke", "summarization"]
}

Field                Required  What it does
-----------------------------------------------------------------------------
id                   Yes       Unique identifier (no spaces)
prompt               Yes*      The instruction text
prompt_file          Yes*      Path to a .txt or .md file containing the prompt
system_prompt        No        Per-case system prompt (overrides global config)
system_prompt_file   No        Path to a file containing the system prompt
input                No        Extra text appended to the prompt
inputs               No        List of inputs — generates id[0], id[1], ... sub-cases
inputs_file          No        JSON file with a list of input strings
variables            No        Values for {placeholder} templates in the prompt
baseline_output      No        Auto-filled by update-baseline — leave blank
tags                 No        Labels for filtering with --tag

*Exactly one of prompt or prompt_file is required.
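
How variables fills {placeholder} slots is plain string templating. A minimal sketch of the assumed behaviour (str.format-style substitution is my assumption, not confirmed by the docs):

# Sketch of {placeholder} substitution; assumed str.format-style behaviour.
prompt = "Greet {name}, the {role}."
variables = {"name": "Alice", "role": "engineer"}
print(prompt.format(**variables))  # -> Greet Alice, the engineer.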


CLI Reference

Command                       What it does
---------------------------------------------------------------------------
llm-regtest init              Create the project folder structure
llm-regtest init --ci         Also create a GitHub Actions workflow file
llm-regtest update-baseline   Run prompts and save responses as new baselines
llm-regtest run               Run prompts and compare to existing baselines
llm-regtest report            Display the most recent saved report
llm-regtest compare           Compare two saved reports side-by-side

Flags for run and update-baseline:

Flag               What it does
------------------------------------------------------------------
--config PATH      Use a non-default config file
--case ID          Run only this case ID (repeatable)
--tag TAG          Run only cases with this tag (repeatable, OR logic)
--concurrency N    Run N cases in parallel

Flags for run only:

Flag               What it does
----------------------------------------------------------------------
--verbose / -v     Print a unified diff for each case
--format console   Default coloured output
--format github    GitHub Actions ::error:: / ::warning:: annotations

Advanced Features

Semantic similarity

"scoring": {
  "methods": ["fuzzy", "semantic"],
  "weights": { "fuzzy": 0.3, "semantic": 0.7 }
}

Requires pip install "llm-regtest[semantic]". Uses all-MiniLM-L6-v2.

LLM-as-judge

"scoring": {
  "methods": ["fuzzy", "llm_judge"],
  "weights": { "fuzzy": 0.4, "llm_judge": 0.6 },
  "llm_judge_model": {
    "provider": "openai",
    "model_name": "gpt-4o-mini"
  }
}
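
The judge model receives both outputs and a rating instruction. A plausible judge prompt, shown as a sketch — this is illustrative only, not llm-regtest's actual template:

# Illustrative judge-prompt construction; not llm-regtest's actual template.
def build_judge_prompt(baseline: str, new: str) -> str:
    return (
        "Rate how similar these two responses are in meaning and quality, "
        "from 0.0 (completely different) to 1.0 (equivalent). "
        "Reply with only the number.\n\n"
        f"Response A (baseline):\n{baseline}\n\n"
        f"Response B (new):\n{new}"
    )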

Parallel execution

llm-regtest run --concurrency 10

Tag filtering

llm-regtest run --tag smoke           # fast per-PR smoke suite
llm-regtest run --tag smoke --tag critical  # OR logic
llm-regtest update-baseline --tag customer-facing

Parameterized inputs

{
  "id": "classify-sentiment",
  "prompt": "Classify as positive, negative, or neutral:",
  "inputs": [
    "Love it!",
    "Broke after one day.",
    "It's fine."
  ]
}

Generates sub-cases classify-sentiment[0], classify-sentiment[1], classify-sentiment[2].
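
A sketch of the expansion — the behaviour here is inferred from the id[i] naming above, not taken from the tool's code:

# Sketch: expanding a parameterized case into sub-cases, one per input.
def expand(case: dict) -> list:
    inputs = case.get("inputs")
    if not inputs:
        return [case]
    subs = []
    for i, text in enumerate(inputs):
        sub = {k: v for k, v in case.items() if k != "inputs"}
        sub["id"] = f'{case["id"]}[{i}]'
        sub["input"] = text
        subs.append(sub)
    return subs

for sub in expand({"id": "classify-sentiment",
                   "inputs": ["Love it!", "Broke after one day.", "It's fine."]}):
    print(sub["id"])  # classify-sentiment[0] ... classify-sentiment[2]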

Prompt files

{
  "id": "legal-analysis",
  "prompt_file": "prompts/analyze_contract.md",
  "system_prompt_file": "prompts/system/legal_analyst.md",
  "input": "..."
}

Paths are relative to the directory containing prompt_cases.json. Keeping prompts in standalone files also means prompt changes show up as clean, reviewable diffs in pull requests.

Custom scorers

from llm_regtest.scorer import register_scorer

def keyword_overlap(output: str, baseline: str) -> float:
    """Score the fraction of baseline words that also appear in the new output."""
    out_words = set(output.lower().split())
    base_words = set(baseline.lower().split())
    if not base_words:
        return 1.0  # empty baseline: nothing to miss
    return len(out_words & base_words) / len(base_words)

register_scorer("keyword_overlap", keyword_overlap)

Then add "keyword_overlap" to methods in your config.


CI / GitHub Actions

Generate a workflow file

llm-regtest init --ci

Creates .github/workflows/prompt-regression.yml — runs on every PR that touches prompt files.

Annotated PR output

llm-regtest run --format github

Emits ::error:: and ::warning:: lines that GitHub renders as inline annotations on the PR diff.
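
Illustratively, the emitted lines have this shape (the message text below reuses the sample run from the quickstart; the exact format is up to the tool):

  ::warning::[WARN] email-rewrite (fuzzy: 0.71, semantic: 0.79, agg=0.77)
  ::error::[FAIL] sentiment-label (exact: 0.00, fuzzy: 0.43, agg=0.21)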

Example workflow

name: Prompt Regression Tests
on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'prompt_cases.json'
jobs:
  regression:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - run: pip install -e ".[openai,semantic]"
      - run: llm-regtest run --format github
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

Add OPENAI_API_KEY as a repository secret under Settings → Secrets → Actions.


Using Claude (Anthropic) as the Model

{
  "model": {
    "provider": "anthropic",
    "model_name": "claude-haiku-4-5-20251001",
    "temperature": 0.0,
    "max_tokens": 1024
  }
}

pip install "llm-regtest[anthropic]"
export ANTHROPIC_API_KEY="sk-ant-..."

Available model names: claude-haiku-4-5-20251001, claude-sonnet-4-6, claude-opus-4-6.


Understanding the Results

Status   Score range          Meaning
--------------------------------------------------------------------------
PASS     ≥ thresholds.pass    Response is very similar to baseline
WARN     ≥ thresholds.warn    Response has drifted noticeably — worth reviewing
FAIL     < thresholds.warn    Significant regression detected
SKIP     n/a                  No baseline exists for this case yet

Default thresholds: pass = 0.8, warn = 0.5. Adjust in config.json under scoring.thresholds.


Troubleshooting

"No module named llm_regtest" Run pip install -e . from the project root directory.

"OPENAI_API_KEY not set" or authentication errors Set the environment variable in the same terminal window you're running the tool from.

All tests show "skip" No baselines yet. Run llm-regtest update-baseline first.

Semantic scorer not available Install with pip install -e ".[semantic]". The scorer is silently skipped if sentence-transformers is not installed.

Scores are lower than expected after a small change Switch from exact to fuzzy or semantic scoring, which are more tolerant of minor wording differences.

Runs are slow with many cases Use --concurrency N to run cases in parallel. Start with 5 and increase if your API rate limits allow.
