llm-regtest
A lightweight regression testing framework for LLM prompts. Catch semantic drift before it ships.
What does this tool do?
When you use an AI model, the exact wording of your instructions — called a prompt — has a large effect on response quality. Small wording changes can cause unexpected regressions that are hard to spot without running tests.
This tool works like a test suite for prompts:
- You define test cases with prompts and optional inputs
- Run llm-regtest update-baseline — the AI's responses become your golden dataset
- Later, after changing a prompt or upgrading a model, run llm-regtest run
- The tool compares new responses to the baselines and flags anything that regressed
Each comparison produces a score from 0.0 to 1.0 using one or more methods:
| Method | How it works | Best for |
|---|---|---|
| exact | Character-for-character match (0 or 1) | Classification labels, structured outputs |
| fuzzy | Levenshtein edit-distance ratio | General text where minor wording drift is acceptable |
| semantic | Cosine similarity of sentence embeddings (all-MiniLM-L6-v2) | Longer outputs where meaning matters more than wording |
| llm_judge | An LLM rates quality similarity 0.0–1.0 | Conversational outputs where both fuzzy and semantic are too strict |
Results are bucketed into PASS / WARN / FAIL based on configurable score thresholds.
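The bucketing itself is a plain threshold comparison. A minimal sketch in Python (illustrative only, not the tool's source; threshold values come from your config):

def bucket(score: float, pass_threshold: float, warn_threshold: float) -> str:
    # PASS at or above the pass threshold, WARN at or above the warn threshold, FAIL below
    if score >= pass_threshold:
        return "PASS"
    if score >= warn_threshold:
        return "WARN"
    return "FAIL"

print(bucket(0.91, 0.85, 0.65))  # PASS
print(bucket(0.70, 0.85, 0.65))  # WARN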
What you need before starting
- Python 3.10 or newer
- pip — comes with Python automatically
- An API key for OpenAI or Anthropic
To check your Python version:
python --version
Installation
# OpenAI support
pip install -e ".[openai]"
# Anthropic (Claude) support
pip install -e ".[anthropic]"
# Semantic similarity scoring (sentence-transformers)
pip install -e ".[semantic]"
# Everything
pip install -e ".[all]"
The [semantic] extra installs sentence-transformers and numpy. The first time you use semantic scoring, the all-MiniLM-L6-v2 model (~80 MB) is downloaded automatically.
Quickstart
1. Initialise the project
llm-regtest init
Creates .promptregtest/config.json and a prompt_cases.json file.
2. Set your API key
# Mac / Linux
export OPENAI_API_KEY="sk-..."
# Windows PowerShell
$env:OPENAI_API_KEY="sk-..."
3. Define test cases
Edit prompt_cases.json:
[
{
"id": "summarize-article",
"prompt": "Summarize the following text in one sentence:",
"input": "Scientists discovered that short 10-minute walks after meals reduce blood sugar spikes.",
"baseline_output": "",
"tags": ["summarization"]
},
{
"id": "sentiment-label",
"prompt": "Classify the sentiment. Reply with exactly one word: positive, negative, or neutral.",
"input": "I absolutely loved the product!",
"baseline_output": "",
"tags": ["classification"]
}
]
4. Generate baselines
llm-regtest update-baseline
This calls the AI for every case and saves the responses as baselines inside prompt_cases.json.
5. Run the regression tests
llm-regtest run
Sample output:
[PASS] summarize-article (fuzzy: 0.94, semantic: 0.97, agg=0.96) [312ms, $0.00018]
[WARN] email-rewrite (fuzzy: 0.71, semantic: 0.79, agg=0.77) [289ms, $0.00014]
[FAIL] sentiment-label (exact: 0.00, fuzzy: 0.43, agg=0.21) [198ms, $0.00009]
------------------------------------------------
Results (3 cases): 1 passed, 1 warned, 1 failed
Cost: $0.00041 | Latency: 266.3ms avg
------------------------------------------------
Each result line now shows latency (ms) and cost (USD) alongside the score. The summary line aggregates total cost and average latency across the run.
6. View a saved report
llm-regtest report
How the Regression Check Works
prompt_cases.json                .promptregtest/baselines/
┌──────────────────┐             ┌───────────────────────────┐
│ id: "summarize"  │             │ summarize.txt             │
│ prompt: "..."    │──run──▶     │ "Short 10-min walks..."   │◀── baseline
│ input: "..."     │             └───────────────────────────┘
└──────────────────┘                          │
                                              │ compare
New run output ───────────────────────────────┘
"Brief post-meal walks..."
        │
        ▼
scorer.semantic_similarity(new, baseline)
    = cosine(embed(new), embed(baseline))
    = 0.91 → PASS (threshold: 0.85)
On each run, the tool:
- Loads every case from prompt_cases.json
- Sends the prompt (+ optional input) to the configured model
- Measures latency (wall-clock ms) and token counts for that call
- Computes USD cost from the provider's pricing table
- Runs each configured scorer against the stored baseline
- Computes a weighted aggregate score (see the sketch below)
- Compares it to the pass/warn thresholds
- Saves a JSON report to .promptregtest/reports/
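The weighted aggregate is a normalised weighted mean of the per-method scores. A minimal sketch (illustrative, not the actual implementation; per the config reference below, omitted weights mean equal weighting):

def aggregate(scores: dict[str, float], weights: dict[str, float] | None = None) -> float:
    # Weighted mean of per-method scores; falls back to equal weighting
    if not weights:
        weights = {method: 1.0 for method in scores}
    total = sum(weights[m] for m in scores)
    return sum(scores[m] * weights[m] for m in scores) / total

aggregate({"fuzzy": 0.94, "semantic": 0.97}, {"fuzzy": 0.3, "semantic": 0.7})
# -> 0.961, matching the agg=0.96 sample line above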
Cost and Latency Tracking
Every run automatically captures:
| Metric | Where it appears |
|---|---|
| Per-request latency (ms) | Beside each result: [312ms, $0.00018] |
| Per-request USD cost | Beside each result: [312ms, $0.00018] |
| Total run cost | Summary line: Cost: $0.00041 |
| Average latency | Summary line: Latency: 266.3ms avg |
| Input / output token counts | Saved in the JSON report |
Cost is calculated from built-in pricing tables for common OpenAI and Anthropic models. Unknown models report $0.00000 (no crash). The JSON report stores input_tokens, output_tokens, cost_usd, and latency_ms per case for later analysis.
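Conceptually, the cost calculation is a lookup-and-multiply against per-token prices. A sketch with invented numbers (PRICE_PER_MTOK and its values are hypothetical; the real tables ship inside the package):

# Hypothetical USD prices per million tokens: (input, output)
PRICE_PER_MTOK = {"gpt-4o-mini": (0.15, 0.60)}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    if model not in PRICE_PER_MTOK:
        return 0.0  # unknown models report $0.00000 rather than crashing
    price_in, price_out = PRICE_PER_MTOK[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000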
Semantic Similarity Scoring
The semantic scorer encodes both the new output and the baseline as sentence embeddings, then computes their cosine similarity. Two responses that express the same idea in different words score close to 1.0; responses with entirely different meaning score near 0.0.
Install:
pip install -e ".[semantic]"
Configure:
{
"scoring": {
"methods": ["fuzzy", "semantic"],
"weights": { "fuzzy": 0.3, "semantic": 0.7 },
"thresholds": { "pass": 0.85, "warn": 0.65 }
}
}
The all-MiniLM-L6-v2 model is downloaded once and cached in memory across cases in the same run. If sentence-transformers is not installed, the scorer is simply not registered — no crash on import. It will only fail if you explicitly request "semantic" in your config without the package installed.
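Under the hood this is standard sentence-transformers usage. A minimal standalone sketch of the same computation (not the tool's internal code):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
new = model.encode("Brief post-meal walks lower glucose spikes.")
base = model.encode("Short 10-minute walks after meals reduce blood sugar spikes.")
score = float(util.cos_sim(new, base))  # near 1.0 for paraphrases, near 0.0 for unrelated text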
Comparing Two Versions of a Prompt
The most common workflow: lock in version A, change the prompt, compare.
1. Generate the baseline (version A)
llm-regtest update-baseline
2. Modify your prompt in prompt_cases.json
"prompt": "Write a one-sentence summary focusing on the key finding:"
3. Run the comparison
llm-regtest run --verbose
--verbose also prints a line-by-line diff between the baseline and the new output for each case.
4. Compare two saved reports side-by-side
llm-regtest compare \
--report-a .promptregtest/reports/report_before.json \
--report-b .promptregtest/reports/report_after.json
Output:
Case score-a score-b delta change
------------------------------------------------------
email-reply 0.73 0.95 +0.22 IMPROVED
summarize-article 0.91 0.84 -0.07 REGRESSED
sentiment-label 1.00 1.00 +0.00 unchanged
1 improved, 1 regressed, 1 unchanged
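The delta column is simply score-b minus score-a. How large a delta counts as a real change is not documented here, so the tolerance in this sketch is a hypothetical illustration:

def classify(delta: float, eps: float = 0.005) -> str:
    # eps is a made-up tolerance: deltas within it count as unchanged
    if delta > eps:
        return "IMPROVED"
    if delta < -eps:
        return "REGRESSED"
    return "unchanged"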
End-to-End OpenAI Demo
A complete demo script is included that shows the full workflow with real OpenAI API calls — including baseline generation, clean regression run, drift simulation, and A/B comparison.
Prerequisites:
pip install "llm-regtest[openai,semantic]"
export OPENAI_API_KEY="sk-..."
Run:
python examples/demo_openai.py
The script runs four steps automatically:
| Step | What happens |
|---|---|
| 1/4 Generate baselines | Calls gpt-4o-mini for all 8 cases and saves responses |
| 2/4 Regression run | Reruns all cases — scores should be near 1.0 |
| 3/4 Simulate drift | Rewrites two prompt wordings to mimic a real prompt change |
| 4/4 Detect regression | Reruns with drifted prompts — expect WARN/FAIL on changed cases |
| Bonus A/B compare | Side-by-side delta table between the two runs |
The 8 demo cases cover summarization, tone rewriting, sentiment classification, and factual Q&A — a representative spread for testing how different task types respond to prompt drift.
Configuration Reference
Default config lives at .promptregtest/config.json:
{
"model": {
"provider": "openai",
"model_name": "gpt-4o-mini",
"temperature": 0.0,
"max_tokens": 1024,
"system_prompt": ""
},
"scoring": {
"methods": ["fuzzy", "semantic"],
"weights": { "fuzzy": 0.3, "semantic": 0.7 },
"thresholds": {
"pass": 0.85,
"warn": 0.65
},
"llm_judge_model": null
},
"prompt_cases_path": "prompt_cases.json",
"reports_dir": ".promptregtest/reports",
"baselines_dir": ".promptregtest/baselines",
"concurrency": 1
}
| Setting | What it does |
|---|---|
| provider | "openai", "anthropic", or "stub" (no API key needed, for testing) |
| model_name | Model ID, e.g. "gpt-4o-mini", "claude-sonnet-4-6" |
| temperature | Set to 0.0 for deterministic, repeatable outputs — strongly recommended for testing |
| max_tokens | Maximum response length |
| system_prompt | Global system-level instruction sent with every case |
| methods | Scorers to use: any combination of "exact", "fuzzy", "semantic", "llm_judge" |
| weights | Per-method weights for the aggregate. Omit for equal weighting |
| thresholds.pass | Score at or above this → PASS (default: 0.8) |
| thresholds.warn | Score at or above this → WARN (default: 0.5); below → FAIL |
| llm_judge_model | Model config for the LLM-as-judge scorer (same shape as model) |
| concurrency | Cases to run in parallel (default: 1) |
Test Case Fields
{
"id": "my-test",
"prompt": "Summarize in one sentence:",
"prompt_file": "prompts/summarize.md",
"system_prompt": "You are a concise assistant.",
"system_prompt_file": "prompts/system/concise.md",
"input": "Text to summarize goes here.",
"inputs": ["Input A", "Input B", "Input C"],
"inputs_file": "fixtures/reviews.json",
"variables": { "name": "Alice", "role": "engineer" },
"baseline_output": "",
"tags": ["smoke", "summarization"]
}
| Field | Required | What it does |
|---|---|---|
| id | Yes | Unique identifier (no spaces) |
| prompt | Yes* | The instruction text |
| prompt_file | Yes* | Path to a .txt or .md file containing the prompt |
| system_prompt | No | Per-case system prompt (overrides global config) |
| system_prompt_file | No | Path to a file containing the system prompt |
| input | No | Extra text appended to the prompt |
| inputs | No | List of inputs — generates id[0], id[1], ... sub-cases |
| inputs_file | No | JSON file with a list of input strings |
| variables | No | Values for {placeholder} templates in the prompt (see the sketch after this table) |
| baseline_output | No | Auto-filled by update-baseline — leave blank |
| tags | No | Labels for filtering with --tag |
*Exactly one of prompt and prompt_file is required.
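The variables field suggests str.format-style substitution over {placeholder} names. A sketch of that assumption (not confirmed against the source):

prompt = "Write a short bio for {name}, a {role}."
variables = {"name": "Alice", "role": "software engineer"}
prompt.format(**variables)  # -> "Write a short bio for Alice, a software engineer."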
CLI Reference
| Command | What it does |
|---|---|
| llm-regtest init | Create the project folder structure |
| llm-regtest init --ci | Also create a GitHub Actions workflow file |
| llm-regtest update-baseline | Run prompts and save responses as new baselines |
| llm-regtest run | Run prompts and compare to existing baselines |
| llm-regtest report | Display the most recent saved report |
| llm-regtest compare | Compare two saved reports side-by-side |
Flags for run and update-baseline:
| Flag | What it does |
|---|---|
| --config PATH | Use a non-default config file |
| --case ID | Run only this case ID (repeatable) |
| --tag TAG | Run only cases with this tag (repeatable, OR logic) |
| --concurrency N | Run N cases in parallel |
Flags for run only:
| Flag | What it does |
|---|---|
| --verbose / -v | Print a unified diff for each case |
| --format console | Default coloured output |
| --format github | GitHub Actions ::error:: / ::warning:: annotations |
Advanced Features
Semantic similarity
"scoring": {
"methods": ["fuzzy", "semantic"],
"weights": { "fuzzy": 0.3, "semantic": 0.7 }
}
Requires pip install -e ".[semantic]". Uses all-MiniLM-L6-v2.
LLM-as-judge
"scoring": {
"methods": ["fuzzy", "llm_judge"],
"weights": { "fuzzy": 0.4, "llm_judge": 0.6 },
"llm_judge_model": {
"provider": "openai",
"model_name": "gpt-4o-mini"
}
}
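The judge model receives the baseline and the new response and returns a similarity rating. The prompt below is hypothetical wording to illustrate the idea, not the tool's actual judge prompt:

JUDGE_PROMPT = (
    "Rate how similar RESPONSE is to BASELINE in meaning and quality, "
    "from 0.0 (completely different) to 1.0 (equivalent). "
    "Reply with only the number.\n\n"
    "BASELINE:\n{baseline}\n\nRESPONSE:\n{response}"
)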
Parallel execution
llm-regtest run --concurrency 10
Tag filtering
llm-regtest run --tag smoke # fast per-PR smoke suite
llm-regtest run --tag smoke --tag critical # OR logic
llm-regtest update-baseline --tag customer-facing
Parameterized inputs
{
"id": "classify-sentiment",
"prompt": "Classify as positive, negative, or neutral:",
"inputs": [
"Love it!",
"Broke after one day.",
"It's fine."
]
}
Generates sub-cases classify-sentiment[0], classify-sentiment[1], classify-sentiment[2].
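Each sub-case ID is the parent ID plus a zero-based index, so the expansion behaves like this illustrative one-liner:

inputs = ["Love it!", "Broke after one day.", "It's fine."]
sub_ids = [f"classify-sentiment[{i}]" for i in range(len(inputs))]
# -> ['classify-sentiment[0]', 'classify-sentiment[1]', 'classify-sentiment[2]']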
Prompt files
{
"id": "legal-analysis",
"prompt_file": "prompts/analyze_contract.md",
"system_prompt_file": "prompts/system/legal_analyst.md",
"input": "..."
}
Paths are relative to the directory containing prompt_cases.json. Prompt files appear as clean diffs in pull requests.
Custom scorers
from llm_regtest.scorer import register_scorer

def keyword_overlap(output: str, baseline: str) -> float:
    # Score = fraction of the baseline's words that also appear in the new output
    out_words = set(output.lower().split())
    base_words = set(baseline.lower().split())
    if not base_words:
        return 1.0  # an empty baseline trivially matches
    return len(out_words & base_words) / len(base_words)

register_scorer("keyword_overlap", keyword_overlap)
Then add "keyword_overlap" to methods in your config.
CI / GitHub Actions
Generate a workflow file
llm-regtest init --ci
Creates .github/workflows/prompt-regression.yml — runs on every PR that touches prompt files.
Annotated PR output
llm-regtest run --format github
Emits ::error:: and ::warning:: lines that GitHub renders as inline annotations on the PR diff.
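For reference, GitHub's workflow-command syntax has this shape (the message text here is illustrative, not verbatim llm-regtest output):

::warning::[WARN] email-rewrite (agg=0.77) drifted from baseline
::error::[FAIL] sentiment-label (agg=0.21) regressed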
Example workflow
name: Prompt Regression Tests
on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'prompt_cases.json'
jobs:
  regression:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - run: pip install -e ".[openai,semantic]"
      - run: llm-regtest run --format github
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
Add OPENAI_API_KEY as a repository secret under Settings → Secrets → Actions.
Using Claude (Anthropic) as the Model
{
"model": {
"provider": "anthropic",
"model_name": "claude-haiku-4-5-20251001",
"temperature": 0.0,
"max_tokens": 1024
}
}
pip install -e ".[anthropic]"
export ANTHROPIC_API_KEY="sk-ant-..."
Available model names: claude-haiku-4-5-20251001, claude-sonnet-4-6, claude-opus-4-6.
Understanding the Results
| Status | Score range | Meaning |
|---|---|---|
| PASS | ≥ thresholds.pass | Response is very similar to baseline |
| WARN | ≥ thresholds.warn | Response has drifted noticeably — worth reviewing |
| FAIL | < thresholds.warn | Significant regression detected |
| SKIP | — | No baseline exists for this case yet |
Default thresholds: pass = 0.8, warn = 0.5. Adjust in config.json under scoring.thresholds.
Troubleshooting
"No module named llm_regtest"
Run pip install -e . from the project root directory.
"OPENAI_API_KEY not set" or authentication errors Set the environment variable in the same terminal window you're running the tool from.
All tests show "skip"
No baselines yet. Run llm-regtest update-baseline first.
Semantic scorer not available
Install with pip install -e ".[semantic]". The scorer is silently skipped if sentence-transformers is not installed.
Scores are lower than expected after a small change
Switch from exact to fuzzy or semantic scoring, which are more tolerant of minor wording differences.
Runs are slow with many cases
Use --concurrency N to run cases in parallel. Start with 5 and increase if your API rate limits allow.