prompt-lock
Git-native prompt regression testing with judge calibration.
Guards at the gaps in your LLM CI/CD pipeline. Fails the build when a prompt change causes a regression — and verifies that your LLM judge actually agrees with humans before trusting it as a gate.
pip install prompt-lock
The problem
You changed a prompt. Did your model outputs get worse?
You probably don't know. 82% of teams have no automated detection for prompt quality regressions. The few that do often use LLM-as-a-judge — but the judge itself is typically miscalibrated, disagreeing with human evaluators on 20–40% of examples, and its calibration has never been measured.
The solution
prompt-lock does three things no other tool does together in a single pip install:
- Detects changed prompts via git diff — only evaluates what changed, keeping costs low (see the sketch after this list)
- Verifies judge calibration — runs your LLM judge against human-labeled examples, measures agreement rate and Spearman correlation, and blocks the CI pipeline if the judge can't be trusted
- Regression gate — fails the build if eval scores drop more than a configurable threshold from baseline
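To make the first point concrete, here is a minimal sketch of git-diff-aware detection. The git command is standard; the function name and default glob are illustrative, not prompt-lock's actual internals:

import subprocess

# Ask git which prompt files changed in the last commit; only these
# get evaluated, which is what keeps eval costs low.
def changed_prompts(glob: str = "prompts/*.txt") -> list[str]:
    out = subprocess.run(
        ["git", "diff", "--name-only", "HEAD~1", "HEAD", "--", glob],
        capture_output=True, text=True, check=True,
    )
    return [path for path in out.stdout.splitlines() if path]

This is also why the GitHub Actions example below sets fetch-depth: 2; git needs the previous commit to diff against.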
Quick start
pip install prompt-lock
cd your-llm-project
prompt-lock init
init creates .prompt-lock.yml, prompts/, tests/test_cases.jsonl, and tests/human_labels.jsonl.
Fill in your test cases:
{"input": "Summarize this article: ...", "output": "The article discusses ...", "expected_output": "A summary of the article."}
Run:
prompt-lock check
Configuration
# .prompt-lock.yml
version: "1"
model: gpt-4o-mini

# Judge calibration — the key differentiator
judge:
  enabled: true
  human_labels_file: tests/human_labels.jsonl
  model: gpt-4o-mini
  criteria: "Rate the quality of this response from 0.0 to 1.0."
  min_agreement: 0.80   # 80% of examples must agree (within ±0.15)
  min_spearman: 0.70    # Spearman correlation with human scores

prompts:
  - path: "prompts/*.txt"
    name: "My Prompts"
    test_cases_file: tests/test_cases.jsonl
    evals:
      - type: llm_judge
        criteria: "Is the response helpful, accurate, and well-structured?"
        threshold: 0.70
      - type: semantic_similarity
        threshold: 0.80

gate:
  mode: regression            # hard | regression | soft
  regression_threshold: 0.05  # fail if score drops >5% from baseline
Eval types
| Type | What it checks | Requires |
|---|---|---|
| llm_judge | LLM scores output against criteria (0.0–1.0) | criteria |
| semantic_similarity | Cosine similarity to expected output (offline, all-MiniLM-L6-v2) | expected_output in test cases |
| exact_match | Exact string match | expected_output in test cases |
| regex | Output matches a regex pattern | pattern |
| custom | Your own Python function fn(input, output) -> float | custom_fn |
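For the custom type, the table above gives the contract: a callable taking (input, output) and returning a float in [0.0, 1.0]. A sketch of one such function; the keyword list is purely illustrative:

# custom_eval.py — referenced via the custom_fn config key.
def keyword_coverage(input: str, output: str) -> float:
    """Score in [0.0, 1.0]: fraction of required keywords present in the output."""
    required = ["summary", "conclusion"]
    hits = sum(1 for kw in required if kw.lower() in output.lower())
    return hits / len(required)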
Works with any LLM provider via LiteLLM:
- gpt-4o-mini, gpt-4o
- claude-haiku-4-5-20251001, claude-sonnet-4-6
- mistral/mistral-small
- Any local model via Ollama: ollama/llama3
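Any model string LiteLLM accepts should work. As a minimal sketch, a judge call routed through LiteLLM's completion API looks like this (the prompt text is illustrative):

import litellm

response = litellm.completion(
    model="gpt-4o-mini",  # or "ollama/llama3", "mistral/mistral-small", ...
    messages=[{
        "role": "user",
        "content": "Rate the quality of this response from 0.0 to 1.0.\n\nResponse: ...",
    }],
)
print(response.choices[0].message.content)  # OpenAI-style response shape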
Gate modes
regression (default) — fail if score drops more than regression_threshold from recent baseline. Good for ongoing development.
hard — fail if score is below hard_threshold. Good for critical prompts with known minimum quality.
soft — never fail, warn only. Good for new prompts without established baselines.
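The three modes reduce to a small decision, sketched below with hypothetical names; regression_threshold and hard_threshold mirror the config keys above, and the default values here are illustrative:

def gate_passes(mode: str, score: float, baseline: float,
                regression_threshold: float = 0.05,
                hard_threshold: float = 0.70) -> bool:
    """Return True if the build should pass."""
    if mode == "hard":
        return score >= hard_threshold
    if mode == "regression":
        return (baseline - score) <= regression_threshold
    return True  # soft: warn only, never fail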
Judge calibration
The unique feature. Before running evals, prompt-lock checks whether your LLM judge actually agrees with human evaluators:
prompt-lock calibrate
┌─────────────────────────────────────────────────────────────┐
│ Calibration Summary │
│ │
│ PASSED │
│ │
│ Agreement rate 87.5% (min: 80%) │
│ Spearman r 0.831 (min: 0.70) │
│ Bias +0.042 (positive = judge inflates scores) │
│ Examples 16 │
└─────────────────────────────────────────────────────────────┘
If calibration fails, prompt-lock check exits with code 2 and blocks deployment. Your CI pipeline doesn't trust an uncalibrated judge.
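If you wrap prompt-lock in your own CI script, you can branch on the exit code. Code 2 for calibration failure is documented above; treating every other nonzero code as a regression is an assumption in this sketch:

import subprocess
import sys

result = subprocess.run(["prompt-lock", "check"])
if result.returncode == 2:
    sys.exit("Judge failed calibration; fix the judge config or add human labels.")
elif result.returncode != 0:
    sys.exit("Prompt regression detected; see the report above.")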
Create tests/human_labels.jsonl:
{"input": "What is 2+2?", "output": "The answer is 4.", "human_score": 1.0}
{"input": "What is 2+2?", "output": "It's roughly 5.", "human_score": 0.0}
{"input": "Explain Python.", "output": "Python is a high-level language.", "human_score": 0.9}
Minimum 5 examples. More is better.
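The numbers in the calibration summary come from a few lines of arithmetic. A sketch assuming the ±0.15 agreement tolerance from the config comment above, with Spearman computed via scipy:

import json
from scipy.stats import spearmanr

def calibration_metrics(labels_path: str, judge_scores: list[float]):
    human = [json.loads(line)["human_score"] for line in open(labels_path)]
    pairs = list(zip(judge_scores, human))
    agreement = sum(abs(j - h) <= 0.15 for j, h in pairs) / len(pairs)
    rho, _ = spearmanr(judge_scores, human)           # rank correlation
    bias = sum(j - h for j, h in pairs) / len(pairs)  # positive = judge inflates
    return agreement, rho, bias  # checked against min_agreement / min_spearman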
GitHub Actions
# .github/workflows/prompt-lock.yml
name: Prompt Regression Tests
on: [push, pull_request]

jobs:
  prompt-lock:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 2  # needed for git diff detection
      - uses: buildworld-ai/prompt-lock@v1
        with:
          config: .prompt-lock.yml
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
Or with other providers:
env:
  ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
CLI reference
prompt-lock init # initialize config and example files
prompt-lock check # run regression checks (git-diff aware)
prompt-lock check --all-prompts # eval all prompts, not just changed ones
prompt-lock check --no-calibrate # skip calibration check
prompt-lock check -v # verbose: show per-test-case results
prompt-lock calibrate # run calibration and show detailed results
prompt-lock traces show # show recent eval runs from trace ledger
prompt-lock traces show -n 50 # show last 50 runs
prompt-lock traces diff abc123 def456 # compare scores between two commits
Trace ledger
Every eval run is recorded in a local SQLite database (.prompt-lock/traces.db) with the git commit SHA. This is how regression detection works — it compares current scores to recent passing baselines.
prompt-lock traces show
┌───────────────────────┬─────────┬─────────────────┬───────────┬───────┬──────┐
│ Timestamp │ Commit │ Prompt │ Type │ Score │ Pass │
├───────────────────────┼─────────┼─────────────────┼───────────┼───────┼──────┤
│ 2026-03-27T14:32:01 │ a1b2c3d │ prompts/sum.txt │ llm_judge │ 0.841 │ ✓ │
│ 2026-03-27T14:32:00 │ a1b2c3d │ prompts/sum.txt │ semantic │ 0.923 │ ✓ │
│ 2026-03-26T09:15:44 │ e4f5g6h │ prompts/sum.txt │ llm_judge │ 0.710 │ ✓ │
└───────────────────────┴─────────┴─────────────────┴───────────┴───────┴──────┘
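Conceptually, fetching a baseline for comparison is a single query. The table and column names below are assumptions for illustration, not the tool's documented schema:

import sqlite3

con = sqlite3.connect(".prompt-lock/traces.db")
row = con.execute(
    "SELECT score FROM traces "
    "WHERE prompt = ? AND eval_type = ? AND passed = 1 "
    "ORDER BY timestamp DESC LIMIT 1",
    ("prompts/sum.txt", "llm_judge"),
).fetchone()
baseline = row[0] if row else None  # no baseline yet -> nothing to regress from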
Why not Promptfoo / LangSmith / DeepEval?
| Capability | prompt-lock | Promptfoo | LangSmith | DeepEval |
|---|---|---|---|---|
| Git-diff aware (only eval changed prompts) | ✓ | ✗ | ✗ | ✗ |
| Judge calibration against human labels | ✓ | ✗ | partial | ✗ |
| Block CI if judge is miscalibrated | ✓ | ✗ | ✗ | ✗ |
| Regression gate (baseline comparison) | ✓ | ✓ | ✓ | ✓ |
| Commit-linked trace ledger | ✓ | ✗ | ✓ | ✗ |
| Framework-agnostic (LiteLLM) | ✓ | ✓ | ✗ | ✓ |
| Offline semantic similarity | ✓ | ✗ | ✗ | ✓ |
| Zero hosted infrastructure | ✓ | ✓ | ✗ | partial |
| pip install in 30 seconds | ✓ | ✗ | ✗ | ✓ |
Promptfoo was acquired by OpenAI in March 2026 — its roadmap is now OpenAI-aligned. prompt-lock is MIT licensed and provider-agnostic.
Contributing
Issues and PRs welcome. See CONTRIBUTING.md.
License
MIT. Built by BuildWorld.
Guards at the gaps. Nehemiah 4:13.