prompt-lock

Git-native prompt regression testing with judge calibration.

Python 3.11+ · MIT License

Guards at the gaps in your LLM CI/CD pipeline. Fails the build when a prompt change causes a regression — and verifies that your LLM judge actually agrees with humans before trusting it as a gate.

pip install prompt-lock

The problem

You changed a prompt. Did your model outputs get worse?

You probably don't know. 82% of teams have no automated detection for prompt quality regressions. The few that do often use LLM-as-a-judge — but their judge is miscalibrated: it disagrees with human evaluators on 20–40% of examples, and the teams relying on it have never measured that gap.

The solution

prompt-lock does three things no other tool does together in a single pip install:

  1. Detects changed prompts via git diff — only evaluates what changed, keeping costs low
  2. Verifies judge calibration — runs your LLM judge against human-labeled examples, measures agreement rate and Spearman correlation, and blocks the CI pipeline if the judge can't be trusted
  3. Regression gate — fails the build if eval scores drop more than a configurable threshold from baseline
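
The change detection in step 1 amounts to asking git which files under your prompt globs differ from the baseline commit. A minimal sketch of that idea (the helper name is illustrative, not part of prompt-lock's API):

# Sketch: list prompt files changed in the last commit, so only those need re-evaluation.
# Illustrative only — prompt-lock implements this internally; the function name is made up.
import fnmatch
import subprocess

def changed_prompt_files(pattern: str = "prompts/*.txt") -> list[str]:
    """Return prompt files that differ between HEAD~1 and HEAD."""
    diff = subprocess.run(
        ["git", "diff", "--name-only", "HEAD~1", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout
    changed = [line.strip() for line in diff.splitlines() if line.strip()]
    return [path for path in changed if fnmatch.fnmatch(path, pattern)]

print(changed_prompt_files())   # e.g. ['prompts/summarize.txt']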

Quick start

pip install prompt-lock
cd your-llm-project
prompt-lock init

init creates .prompt-lock.yml, prompts/, tests/test_cases.jsonl, and tests/human_labels.jsonl.

Fill in your test cases:

{"input": "Summarize this article: ...", "output": "The article discusses ...", "expected_output": "A summary of the article."}

Run:

prompt-lock check

Configuration

# .prompt-lock.yml
version: "1"
model: gpt-4o-mini

# Judge calibration — the key differentiator
judge:
  enabled: true
  human_labels_file: tests/human_labels.jsonl
  model: gpt-4o-mini
  criteria: "Rate the quality of this response from 0.0 to 1.0."
  min_agreement: 0.80   # 80% of examples must agree (within ±0.15)
  min_spearman: 0.70    # Spearman correlation with human scores

prompts:
  - path: "prompts/*.txt"
    name: "My Prompts"
    test_cases_file: tests/test_cases.jsonl
    evals:
      - type: llm_judge
        criteria: "Is the response helpful, accurate, and well-structured?"
        threshold: 0.70
      - type: semantic_similarity
        threshold: 0.80

gate:
  mode: regression      # hard | regression | soft
  regression_threshold: 0.05   # fail if score drops >5% from baseline

Eval types

Type                | What it checks                                                   | Requires
--------------------|------------------------------------------------------------------|------------------------------
llm_judge           | LLM scores output against criteria (0.0–1.0)                    | criteria
semantic_similarity | Cosine similarity to expected output (offline, all-MiniLM-L6-v2) | expected_output in test cases
exact_match         | Exact string match                                               | expected_output in test cases
regex               | Output matches a regex pattern                                   | pattern
custom              | Your own Python function fn(input, output) -> float              | custom_fn
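
For the custom type, only the signature fn(input, output) -> float is fixed; the function itself is yours. A hypothetical example (how it is referenced from custom_fn in the config, e.g. as an import path, is an assumption here):

# evals/conciseness.py — hypothetical custom eval; only the fn(input, output) -> float contract is documented.
def conciseness(input: str, output: str) -> float:
    """Score 1.0 for outputs up to 100 words, decaying linearly toward 0.0 above that."""
    words = len(output.split())
    if words <= 100:
        return 1.0
    return max(0.0, 1.0 - (words - 100) / 400)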

Works with any LLM provider via LiteLLM:

  • gpt-4o-mini, gpt-4o
  • claude-haiku-4-5-20251001, claude-sonnet-4-6
  • mistral/mistral-small
  • Any local model via Ollama: ollama/llama3
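
These model strings go into the model fields of .prompt-lock.yml. For example, to point both the top-level model and judge.model at a local Ollama model:

# .prompt-lock.yml — running evals and the judge against a local model
model: ollama/llama3
judge:
  model: ollama/llama3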

Gate modes

regression (default) — fail if score drops more than regression_threshold from recent baseline. Good for ongoing development.

hard — fail if score is below hard_threshold. Good for critical prompts with known minimum quality.

soft — never fail, warn only. Good for new prompts without established baselines.
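
For example, a hard gate for a critical prompt might look like this (hard_threshold is named above, but its exact placement under gate: is an assumption):

# .prompt-lock.yml — hard gate: fail outright below a fixed score
gate:
  mode: hard
  hard_threshold: 0.85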


Judge calibration

The unique feature. Before running evals, prompt-lock checks whether your LLM judge actually agrees with human evaluators:

prompt-lock calibrate
┌─────────────────────────────────────────────────────────────┐
│ Calibration Summary                                         │
│                                                             │
│ PASSED                                                      │
│                                                             │
│ Agreement rate   87.5%  (min: 80%)                         │
│ Spearman r       0.831  (min: 0.70)                         │
│ Bias             +0.042  (positive = judge inflates scores) │
│ Examples         16                                         │
└─────────────────────────────────────────────────────────────┘

If calibration fails, prompt-lock check exits with code 2 and blocks deployment. Your CI pipeline doesn't trust an uncalibrated judge.

Create tests/human_labels.jsonl:

{"input": "What is 2+2?", "output": "The answer is 4.", "human_score": 1.0}
{"input": "What is 2+2?", "output": "It's roughly 5.", "human_score": 0.0}
{"input": "Explain Python.", "output": "Python is a high-level language.", "human_score": 0.9}

Minimum 5 examples. More is better.
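
For intuition, the three calibration numbers reduce to simple arithmetic over pairs of human and judge scores. A sketch of the metrics (not prompt-lock's implementation; the judge scores here are made up):

# Agreement within ±0.15, Spearman correlation, and bias between judge and human scores.
from scipy.stats import spearmanr

human = [1.0, 0.0, 0.9, 0.3, 0.8]     # human_score values from human_labels.jsonl
judge = [0.9, 0.1, 0.95, 0.5, 0.85]   # scores the LLM judge gave the same outputs

agreement = sum(abs(h - j) <= 0.15 for h, j in zip(human, judge)) / len(human)
rho, _ = spearmanr(human, judge)
bias = sum(j - h for h, j in zip(human, judge)) / len(human)   # positive = judge inflates scores

print(f"agreement={agreement:.0%}  spearman={rho:.3f}  bias={bias:+.3f}")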


GitHub Actions

# .github/workflows/prompt-lock.yml
name: Prompt Regression Tests

on: [push, pull_request]

jobs:
  prompt-lock:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 2   # needed for git diff detection

      - uses: buildworld-ai/prompt-lock@v1
        with:
          config: .prompt-lock.yml
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

Or with other providers:

env:
  ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

CLI reference

prompt-lock init                    # initialize config and example files
prompt-lock check                   # run regression checks (git-diff aware)
prompt-lock check --all-prompts     # eval all prompts, not just changed ones
prompt-lock check --no-calibrate    # skip calibration check
prompt-lock check -v                # verbose: show per-test-case results
prompt-lock calibrate               # run calibration and show detailed results
prompt-lock traces show             # show recent eval runs from trace ledger
prompt-lock traces show -n 50       # show last 50 runs
prompt-lock traces diff abc123 def456  # compare scores between two commits

Trace ledger

Every eval run is recorded in a local SQLite database (.prompt-lock/traces.db) with the git commit SHA. This is how regression detection works — it compares current scores to recent passing baselines.
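
Conceptually, the regression comparison is just: did the score drop by more than regression_threshold from that baseline? A sketch (whether the threshold is an absolute drop or a fraction of the baseline is an assumption; this sketch treats it as absolute):

# Sketch of the regression gate: pass unless the score fell more than `threshold` below baseline.
def regression_gate(current: float, baseline: float, threshold: float = 0.05) -> bool:
    return (baseline - current) <= threshold

print(regression_gate(current=0.841, baseline=0.850))   # True  — small dip, within tolerance
print(regression_gate(current=0.720, baseline=0.850))   # False — drop > 0.05, build fails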

prompt-lock traces show

┌─────────────────────┬─────────┬─────────────────┬───────────┬───────┬──────┐
│ Timestamp           │ Commit  │ Prompt          │ Type      │ Score │ Pass │
├─────────────────────┼─────────┼─────────────────┼───────────┼───────┼──────┤
│ 2026-03-27T14:32:01 │ a1b2c3d │ prompts/sum.txt │ llm_judge │ 0.841 │  ✓   │
│ 2026-03-27T14:32:00 │ a1b2c3d │ prompts/sum.txt │ semantic  │ 0.923 │  ✓   │
│ 2026-03-26T09:15:44 │ e4f5g6h │ prompts/sum.txt │ llm_judge │ 0.710 │  ✓   │
└─────────────────────┴─────────┴─────────────────┴───────────┴───────┴──────┘

Why not Promptfoo / LangSmith / DeepEval?

prompt-lock is the only one of the four that combines all of the following; Promptfoo, LangSmith, and DeepEval each cover them only partially or not at all:

  • Git-diff aware (only eval changed prompts)
  • Judge calibration against human labels
  • Blocks CI if the judge is miscalibrated
  • Regression gate (baseline comparison)
  • Commit-linked trace ledger
  • Framework-agnostic (LiteLLM)
  • Offline semantic similarity
  • Zero hosted infrastructure
  • pip install in 30 seconds

Promptfoo was acquired by OpenAI in March 2026 — its roadmap is now OpenAI-aligned. prompt-lock is MIT licensed and provider-agnostic.


Contributing

Issues and PRs welcome. See CONTRIBUTING.md.


License

MIT. Built by BuildWorld.

Guards at the gaps. Nehemiah 4:13.

