Snapshot testing for LLM prompts - catch meaningful output drift, ignore harmless rephrasing. Local embeddings, no API key.
pdrift
You changed a prompt (or a model version) and now you don't know which of your LLM outputs silently changed meaning. pdrift is snapshot testing for prompts — like jest snapshots, but judged semantically instead of byte-by-byte.
pip install pdrift
# pdrift_answers.py
from pdrift import case

@case(inputs="cases.jsonl")  # one JSON object per line: {"id": "...", "input": "..."}
def answers(input: str) -> str:
    return my_llm_call(input)  # any function that returns a string
pdrift snapshot # run the suite, record baseline outputs
pdrift check # re-run and flag *meaningful* drift — exit 1 in CI
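The `cases.jsonl` file referenced in the decorator holds one JSON object per line. A minimal sketch for generating and reading such a file from Python follows. Only the `id` and `input` keys come from the format described above; the case values and the helper names `write_cases` and `read_cases` are made up for illustration and are not part of pdrift.

```python
import json
from pathlib import Path

# Hypothetical example cases: only the "id" and "input" keys come from the
# documented format, the values are invented for illustration.
CASES = [
    {"id": "greeting", "input": "Say hello to a new user."},
    {"id": "coffee-study", "input": "Summarize the coffee study in one sentence."},
]


def write_cases(path, cases):
    """Write one JSON object per line (JSONL)."""
    lines = [json.dumps(case, ensure_ascii=False) for case in cases]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")


def read_cases(path):
    """Read the file back, skipping blank lines."""
    text = Path(path).read_text(encoding="utf-8")
    return [json.loads(line) for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    write_cases("cases.jsonl", CASES)
    print(read_cases("cases.jsonl"))
```

Keeping ids stable matters more than the wording of the inputs, since baselines are matched to cases by id.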
See it catch a flipped meaning
A one-word prompt tweak turns "found no link" into "confirmed a strong link". Exact-match snapshots scream at every harmless rephrase; humans skim and miss the real one. pdrift buckets each case by embedding similarity — actual output:
$ pdrift check
pdrift check (threshold 0.9)
+----------------------------------------------------------------------------+
| suite | identical | trivial | meaningful | new case | missing case | error |
|-------+-----------+---------+------------+----------+--------------+-------|
| llm | 0 | 3 | 2 | 0 | 0 | 0 |
+----------------------------------------------------------------------------+
meaningful changes - llm
+-----------------------------------------------------------------------------+
| case | sim | baseline vs current |
|---------------+--------+----------------------------------------------------|
| coffee-study | 0.7449 | - The study found no link between coffee |
| | | consumption and heart disease. |
| | | + The study confirmed a strong link between coffee |
| | | consumption and heart disease. |
| deploy-status | 0.5679 | - The deployment completed successfully and no |
| | | downtime was reported. |
| | | + The deployment failed and caused significant |
| | | downtime across all regions. |
+-----------------------------------------------------------------------------+
Drift detected.
# exit code 1
The three rephrasings (for example, "Hello! How can I help you today?" → "Hi there! How can I assist you today?", sim 0.9684) landed in trivial, which on its own keeps the exit code at 0. The two flipped meanings were flagged and set the exit code to 1. That's the whole tool.
No server, no dashboard, no API key, no YAML pipeline — a dev tool, not a platform. Baselines are JSON files in your repo; the check is a CLI command with an exit code.
Why this doesn't drown you in false positives
Two things make pdrift trustworthy where naive semantic diffing isn't:
1. Local embeddings: free, offline, no API key. Similarity is computed with fastembed (ONNX, BAAI/bge-small-en-v1.5) on your machine. After the one-time model download, checking costs zero dollars and zero network calls, so you can run it on every commit. Embeddings are cached per suite (keyed by output hash), so repeated checks don't even re-embed.
2. The noise floor — the tool learns each case's natural variance. LLMs at temperature > 0 rephrase themselves constantly. Take multiple baseline samples and pdrift measures how much the baselines differ from each other — the noise floor. A new output is flagged only if it's more different from the baselines than they are from each other:
pdrift snapshot --samples 3
meaningful changes - noisy
+-----------------------------------------------------------------------------+
| case | sim | noise floor | diff |
|------------+--------+-------------+-----------------------------------------|
| water-boil | 0.6407 | 0.9633 | - The boiling point of water at sea |
| | | | level is 100 degrees Celsius. |
| | | | + Water never boils no matter how hot |
| | | | it gets; boiling is impossible. |
+-----------------------------------------------------------------------------+
This case's baseline samples agree with each other at 0.9633; the new output only manages 0.6407 against the closest one — flagged, with the numbers shown so you can see why. Meanwhile an honest paraphrase scoring 0.95 sails through, because that's within the case's own noise. No hand-tuned per-case thresholds.
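The comparison described above can be sketched in a few lines. This is an illustration under stated assumptions, not pdrift's actual code: it assumes the noise floor is the lowest pairwise cosine similarity among the baseline samples, and that a new output is scored against its closest baseline. The function names and the toy 2-D vectors (standing in for real embeddings) are hypothetical.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def noise_floor(baselines):
    """Lowest pairwise similarity among baseline samples.

    ASSUMPTION: pdrift may aggregate differently (mean, percentile, ...).
    Needs at least two baseline samples.
    """
    return min(cosine(a, b) for a, b in combinations(baselines, 2))


def closest_similarity(current, baselines):
    """Similarity of the new output to its nearest baseline sample."""
    return max(cosine(current, b) for b in baselines)


# Toy vectors standing in for embeddings of three baseline samples.
baselines = [[1.0, 0.0], [0.98, 0.2], [0.97, 0.24]]
floor = noise_floor(baselines)

paraphrase = [0.99, 0.1]  # points in nearly the same direction
flipped = [0.2, 0.98]     # points somewhere else entirely

for name, vec in [("paraphrase", paraphrase), ("flipped", flipped)]:
    sim = closest_similarity(vec, baselines)
    print(name, round(sim, 4), "ok" if sim >= floor else "flagged")
```

The point of the design is that the bar is derived from the case's own samples, so a chatty case and a terse case each get a bar that fits them.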
How it works
Each case lands in exactly one bucket:
| verdict | meaning | exit code |
|---|---|---|
| identical | exact string match to a baseline sample (embeddings skipped entirely) | 0 |
| trivial | differs, but similarity ≥ min(noise floor, threshold) | 0 |
| meaningful | more different from the baselines than they are from each other | 1 |
| new case | in the JSONL but not in the baseline | 0 |
| missing case | in the baseline but gone from the JSONL | 1 |
| error | the target function raised (recorded, never crashes the run) | 1 |
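Read as code, the table amounts to a small decision function plus an exit-code rule. The sketch below is one hypothetical reading of the table, not pdrift's implementation; the names `classify` and `exit_code` are invented, and the similarity value is assumed to be computed elsewhere.

```python
# Verdicts that make the run exit with code 1, per the table above.
FAILING = {"meaningful", "missing case", "error"}


def classify(current, baselines, similarity, threshold=0.9, noise_floor=None):
    """Bucket one case that exists in both the JSONL and the baseline.

    `similarity` is the score against the closest baseline sample.
    `noise_floor` is None when only one baseline sample was recorded.
    """
    if current in baselines:
        return "identical"  # exact string match, no embedding needed
    bar = threshold if noise_floor is None else min(noise_floor, threshold)
    return "trivial" if similarity >= bar else "meaningful"


def exit_code(verdicts):
    """1 if any case landed in a failing bucket, else 0."""
    return 1 if any(v in FAILING for v in verdicts) else 0


# Numbers taken from the reports shown earlier in this README.
print(classify("Hi there!", ["Hello!"], similarity=0.9684))   # trivial
print(classify("flipped", ["original"], similarity=0.7449))   # meaningful
print(exit_code(["trivial", "new case", "identical"]))        # 0
```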
Baselines are pretty-printed, key-sorted JSON in .pdrift/<suite>/baseline.json — commit them, and prompt-output changes show up as reviewable diffs in PRs. pdrift accept promotes the latest check run to the new baseline after you've reviewed a change.
JSON outputs get a structural diff
If both baseline and current outputs parse as JSON, pdrift skips embeddings and diffs the structure — keys added/removed, values changed, with dotted paths:
meaningful changes - jsonapi
+-------------------------------------------------------------------------+
| case | sim | noise floor | diff |
|---------+-----+-------------+-------------------------------------------|
| profile | - | - | ~ user.address.city: "Berlin" -> "Munich" |
| | | | - removed user.email |
+-------------------------------------------------------------------------+
String values longer than 40 chars (summaries, bios) fall back to embedding similarity, so a rephrased description doesn't fail your schema check.
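For intuition, a structural diff with dotted paths can be written as a short recursive walk. The sketch below only imitates the line format of the report above; it is not pdrift's code. The `+ added` wording is an assumption (the report shows only changed and removed keys), lists are compared as whole values, and the embedding fallback for long strings is left out.

```python
import json


def diff_json(old, new, path=""):
    """Yield change lines with dotted paths for two parsed JSON values."""
    if isinstance(old, dict) and isinstance(new, dict):
        for key in sorted(old.keys() | new.keys()):
            child = f"{path}.{key}" if path else key
            if key not in new:
                yield f"- removed {child}"
            elif key not in old:
                yield f"+ added {child}"  # wording assumed, not from pdrift
            else:
                yield from diff_json(old[key], new[key], child)
    elif old != new:
        # Leaf values (and lists) are compared as a whole.
        yield f"~ {path}: {json.dumps(old)} -> {json.dumps(new)}"


baseline = json.loads(
    '{"user": {"email": "a@example.com", "address": {"city": "Berlin"}}}'
)
current = json.loads('{"user": {"address": {"city": "Munich"}}}')

changes = list(diff_json(baseline, current))
for line in changes:
    print(line)
```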
Configuration (optional — zero config needed)
CLI flags override pdrift.toml, which overrides defaults. The report header shows the effective value and where it came from.
# pdrift.toml — everything optional
# threshold = 0.90 # similarity at/above which a change is trivial
# samples = 1 # baseline runs per case (3+ enables the noise floor)
# model = "BAAI/bge-small-en-v1.5" # any fastembed-supported model
# [suite.summarizer] # per-suite overrides
# threshold = 0.85
# samples = 5
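The precedence rule (CLI flags, then pdrift.toml, then defaults) can be pictured as a layered lookup that also remembers where each value came from. This is a hypothetical sketch rather than pdrift's loader: the `resolve` function and the source labels are invented, and placing per-suite overrides between CLI flags and top-level file values is an assumption.

```python
# Defaults as listed in the commented pdrift.toml above.
DEFAULTS = {"threshold": 0.90, "samples": 1, "model": "BAAI/bge-small-en-v1.5"}


def resolve(key, cli=None, suite_cfg=None, file_cfg=None):
    """Return (value, source) for one setting, highest-priority layer first."""
    layers = [
        ("cli", cli or {}),
        ("pdrift.toml [suite]", suite_cfg or {}),  # ASSUMED to beat top-level
        ("pdrift.toml", file_cfg or {}),
        ("default", DEFAULTS),
    ]
    for source, values in layers:
        if values.get(key) is not None:
            return values[key], source
    raise KeyError(key)


print(resolve("threshold"))
print(resolve("threshold", cli={"threshold": 0.8}, file_cfg={"threshold": 0.88}))
```

Returning the source alongside the value is what makes it possible for a report header to show both.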
pytest plugin
Installed automatically. Each case becomes a pytest test:
pytest --pdrift # MEANINGFUL = fail, TRIVIAL/IDENTICAL = pass
pytest --pdrift --pdrift-path prompts/
Missing baseline → the case is skipped with "run pdrift snapshot first". Failure messages include the similarity, noise floor, and diff. Without --pdrift the plugin does nothing.
CI: fail PRs on meaningful drift
# .github/workflows/prompt-drift.yml
name: prompt-drift
on: pull_request
jobs:
pdrift:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with:
python-version: "3.12"
- run: pip install pdrift
- run: pdrift check prompts/ # exits 1 on meaningful changes
Because baselines live in git, the reviewable prompt-output diff is right there in the PR alongside the code change that caused it.
FAQ
Why local embeddings instead of an LLM judge or an embeddings API?
Cost and trust. A check that costs money per run doesn't get run. Local ONNX embeddings are free, deterministic, offline, and fast enough to run on every commit. The first check downloads the model (~130 MB) once; after that, no network at all.
My outputs are non-deterministic. Won't every check fail?
That's the noise floor's job. Snapshot with --samples 3 (or more): pdrift measures how much your own baselines disagree and only flags outputs that fall below that self-similarity. Temperature noise passes; meaning flips don't.
What does a check cost?
Zero. No API keys anywhere in the tool. Identical outputs skip embedding entirely, and everything embedded once is cached in .pdrift/<suite>/embeddings.npy.
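The caching idea is simple enough to sketch: hash the output text and only call the embedder on a miss. The example below is an assumption-laden illustration, not pdrift's cache. It uses SHA-256 and an in-memory dict instead of the embeddings.npy file, and `fake_embed` is a stand-in for a real embedding model.

```python
import hashlib


class EmbeddingCache:
    """Cache embeddings keyed by a hash of the output text."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn
        self._store = {}
        self.misses = 0  # how many times the embedder actually ran

    def get(self, text):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed_fn(text)
        return self._store[key]


def fake_embed(text):
    # Stand-in for a real model: returns a tiny deterministic vector.
    return [float(len(text)), float(text.count(" "))]


cache = EmbeddingCache(fake_embed)
first = cache.get("The deployment completed successfully.")
second = cache.get("The deployment completed successfully.")
print(cache.misses)  # the second call is served from the cache
```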
Embeddings are weak at negation — can a meaning flip sneak through?
Sometimes. Embedding models can score a negated sentence as more similar to the original than a human would. In practice, the flips we tested scored 0.57–0.74 against the 0.90 default and were comfortably flagged, but embedding-based comparison is a tradeoff, not magic. Multi-sample baselines tighten the bar further; tune threshold per suite for sensitive cases.
Windows?
First-class. Developed on Windows; CI runs the matrix on ubuntu + windows, py3.10–3.12.
License
MIT