Snapshot testing for LLM prompts - catch meaningful output drift, ignore harmless rephrasing. Local embeddings, no API key.

Maintainers

mithunvasanth

Project description

pdrift

You changed a prompt (or a model version) and now you don't know which of your LLM outputs silently changed meaning. pdrift is snapshot testing for prompts — like jest snapshots, but judged semantically instead of byte-by-byte.

pip install pdrift

# pdrift_answers.py
from pdrift import case

@case(inputs="cases.jsonl")          # one JSON object per line: {"id": "...", "input": "..."}
def answers(input: str) -> str:
    return my_llm_call(input)        # any function that returns a string

pdrift snapshot   # run the suite, record baseline outputs
pdrift check      # re-run and flag *meaningful* drift — exit 1 in CI
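
The cases file referenced in the decorator is plain JSONL, so it can be generated from a script. A minimal sketch, assuming only the two keys shown in the comment above ("id" and "input"); `write_cases` and the example cases are hypothetical and not part of pdrift:

```python
import json
from pathlib import Path


def write_cases(path, cases):
    """Write one JSON object per line, each with an "id" and an "input" key."""
    lines = [json.dumps({"id": case_id, "input": text}) for case_id, text in cases]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)


# Illustrative cases only; replace them with your own prompts.
EXAMPLE_CASES = [
    ("greeting", "Say hello to a new user."),
    ("coffee-study", "Summarize the coffee study in one sentence."),
]
```

Calling `write_cases("cases.jsonl", EXAMPLE_CASES)` produces a two-line file that the `@case(inputs="cases.jsonl")` decorator above could point at.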

See it catch a flipped meaning

A one-word prompt tweak turns "found no link" into "confirmed a strong link". Exact-match snapshots scream at every harmless rephrase; humans skim and miss the real one. pdrift buckets each case by embedding similarity — actual output:

$ pdrift check
                         pdrift check (threshold 0.9)
+----------------------------------------------------------------------------+
| suite | identical | trivial | meaningful | new case | missing case | error |
|-------+-----------+---------+------------+----------+--------------+-------|
| llm   |         0 |       3 |          2 |        0 |            0 |     0 |
+----------------------------------------------------------------------------+
                           meaningful changes - llm
+-----------------------------------------------------------------------------+
| case          |    sim | baseline vs current                                |
|---------------+--------+----------------------------------------------------|
| coffee-study  | 0.7449 | - The study found no link between coffee           |
|               |        | consumption and heart disease.                     |
|               |        | + The study confirmed a strong link between coffee |
|               |        | consumption and heart disease.                     |
| deploy-status | 0.5679 | - The deployment completed successfully and no     |
|               |        | downtime was reported.                             |
|               |        | + The deployment failed and caused significant     |
|               |        | downtime across all regions.                       |
+-----------------------------------------------------------------------------+
Drift detected.
# exit code 1

The three rephrasings (for example, "Hello! How can I help you today?" → "Hi there! How can I assist you today?", sim 0.9684) landed in trivial, which on its own leaves the exit code at 0. The two flipped meanings were flagged as meaningful, so the run exits 1. That's the whole tool.

No server, no dashboard, no API key, no YAML pipeline — a dev tool, not a platform. Baselines are JSON files in your repo; the check is a CLI command with an exit code.

Why this doesn't drown you in false positives

Two things make pdrift trustworthy where naive semantic diffing isn't:

1. Local embeddings — free, offline, no API key. Similarity is computed with fastembed (ONNX, BAAI/bge-small-en-v1.5) on your machine. Checking costs zero dollars and zero network calls, so you can run it on every commit. Embeddings are cached per suite (keyed by output hash) — repeated checks don't even re-embed.

2. The noise floor — the tool learns each case's natural variance. LLMs at temperature > 0 rephrase themselves constantly. Take multiple baseline samples and pdrift measures how much the baselines differ from each other — the noise floor. A new output is flagged only if it's more different from the baselines than they are from each other:

pdrift snapshot --samples 3

                          meaningful changes - noisy
+-----------------------------------------------------------------------------+
| case       |    sim | noise floor | diff                                    |
|------------+--------+-------------+-----------------------------------------|
| water-boil | 0.6407 |      0.9633 | - The boiling point of water at sea     |
|            |        |             | level is 100 degrees Celsius.           |
|            |        |             | + Water never boils no matter how hot   |
|            |        |             | it gets; boiling is impossible.         |
+-----------------------------------------------------------------------------+

This case's baseline samples agree with each other at 0.9633; the new output only manages 0.6407 against the closest one, so it is flagged, with the numbers shown so you can see why. Meanwhile an honest paraphrase scoring 0.95 passes: the bar is min(noise floor, threshold), which here is min(0.9633, 0.90) = 0.90. No hand-tuned per-case thresholds.
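
The comparison described above can be sketched in a few lines. This is an illustration under stated assumptions, not pdrift's implementation: the noise floor is taken as the lowest pairwise cosine similarity among baseline embeddings (the actual aggregation is not documented here), the new output is scored against its closest baseline, and the floor is only computed from three samples up, matching the config comment. The function names and the toy 2-D vectors are hypothetical.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def noise_floor(baselines, min_samples=3):
    """Lowest pairwise similarity among baseline embeddings (assumed aggregation)."""
    if len(baselines) < min_samples:
        return None  # too few samples to estimate natural variance
    return min(cosine(a, b) for a, b in combinations(baselines, 2))


def closest_similarity(current, baselines):
    """Similarity of the new output to the nearest baseline sample."""
    return max(cosine(current, b) for b in baselines)


def is_meaningful(current, baselines, threshold=0.90):
    """Flag the output when it falls below min(noise floor, threshold)."""
    floor = noise_floor(baselines)
    bar = threshold if floor is None else min(floor, threshold)
    return closest_similarity(current, baselines) < bar
```

With three baselines that only agree at 0.6, an output scoring 0.8 against its nearest baseline is not flagged, even though 0.8 is below the 0.90 threshold; with a single baseline the same output would be.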

How it works

Each case lands in exactly one bucket:

| verdict        | meaning                                                               | exit code |
|----------------+-----------------------------------------------------------------------+-----------|
| `identical`    | exact string match to a baseline sample (embeddings skipped entirely) | 0         |
| `trivial`      | differs, but similarity ≥ `min(noise floor, threshold)`               | 0         |
| `meaningful`   | differs, and similarity < `min(noise floor, threshold)`               | 1         |
| `new case`     | in the JSONL but not in the baseline                                  | 0         |
| `missing case` | in the baseline but gone from the JSONL                               | 1         |
| `error`        | the target function raised (recorded, never crashes the run)          | 1         |
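
The bucketing rules in the table can be written down as a small function. A sketch only: `verdict` and `exit_code` are hypothetical names, the similarity and noise floor are assumed to be computed elsewhere, and the `new case`, `missing case`, and `error` buckets are assumed to be assigned before this function is reached.

```python
FAILING = {"meaningful", "missing case", "error"}  # verdicts with exit code 1


def verdict(current, baselines, similarity=None, noise_floor=None, threshold=0.90):
    """Bucket one case that exists in both the JSONL and the baseline.

    `similarity` is only needed when the output is not an exact match.
    """
    if current in baselines:
        return "identical"  # exact string match, no embedding needed
    bar = threshold if noise_floor is None else min(noise_floor, threshold)
    return "trivial" if similarity >= bar else "meaningful"


def exit_code(verdicts):
    """Return 1 if any case landed in a failing bucket, else 0."""
    return 1 if any(v in FAILING for v in verdicts) else 0
```

For the run shown earlier, three `trivial` and two `meaningful` verdicts give `exit_code(...) == 1`.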

Baselines are pretty-printed, key-sorted JSON in .pdrift/<suite>/baseline.json — commit them, and prompt-output changes show up as reviewable diffs in PRs. pdrift accept promotes the latest check run to the new baseline after you've reviewed a change.
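
Pretty-printing with sorted keys is what keeps those PR diffs small: the same data always serializes to the same bytes, so only real output changes appear. A minimal sketch using the standard library; the baseline schema shown in the usage note is made up for illustration and `dump_baseline` is a hypothetical helper, not pdrift's API:

```python
import json


def dump_baseline(cases):
    """Serialize with sorted keys and indentation so git diffs stay line-oriented."""
    return json.dumps(cases, indent=2, sort_keys=True, ensure_ascii=False) + "\n"
```

For example, `dump_baseline({"b": {"output": "x"}, "a": {"output": "y"}})` and the same dictionary built in the opposite key order produce identical text.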

JSON outputs get a structural diff

If both baseline and current outputs parse as JSON, pdrift skips whole-output embedding and diffs the structure instead: keys added or removed and values changed, reported with dotted paths:

                       meaningful changes - jsonapi
+-------------------------------------------------------------------------+
| case    | sim | noise floor | diff                                      |
|---------+-----+-------------+-------------------------------------------|
| profile |   - |           - | ~ user.address.city: "Berlin" -> "Munich" |
|         |     |             | - removed user.email                      |
+-------------------------------------------------------------------------+

String values longer than 40 chars (summaries, bios) fall back to embedding similarity, so a rephrased description doesn't fail your schema check.
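
A structural diff of this kind can be sketched as a short recursive walk. This is an illustration, not pdrift's code: `structural_diff` is a hypothetical name, lists are compared as whole values, and a changed string is deferred to semantic comparison only when both the old and new values exceed 40 characters (whether pdrift requires both or either is an assumption).

```python
import json

LONG_STRING = 40  # per the text above: longer strings are compared semantically


def structural_diff(old, new, path=""):
    """Return (changes, semantic_pairs) for two parsed JSON values."""
    changes, semantic = [], []
    if isinstance(old, dict) and isinstance(new, dict):
        for key in sorted(old.keys() | new.keys()):
            child = f"{path}.{key}" if path else key
            if key not in new:
                changes.append(f"- removed {child}")
            elif key not in old:
                changes.append(f"+ added {child}")
            else:
                sub_changes, sub_semantic = structural_diff(old[key], new[key], child)
                changes += sub_changes
                semantic += sub_semantic
    elif old != new:
        both_long = (
            isinstance(old, str) and isinstance(new, str)
            and len(old) > LONG_STRING and len(new) > LONG_STRING
        )
        if both_long:
            semantic.append((path, old, new))  # defer to embedding similarity
        else:
            changes.append(f"~ {path}: {json.dumps(old)} -> {json.dumps(new)}")
    return changes, semantic
```

Feeding it the `profile` case from the table (city changed from "Berlin" to "Munich", email removed) yields the same two lines shown in the diff column.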

Configuration (optional — zero config needed)

CLI flags override pdrift.toml, which overrides defaults. The report header shows the effective value and where it came from.

# pdrift.toml — everything optional
# threshold = 0.90                    # similarity at/above which a change is trivial
# samples = 1                         # baseline runs per case (3+ enables the noise floor)
# model = "BAAI/bge-small-en-v1.5"    # any fastembed-supported model

# [suite.summarizer]                  # per-suite overrides
# threshold = 0.85
# samples = 5
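
The precedence rule can be sketched as a layered lookup that also reports where the value came from. The defaults below are the ones shown in the commented config; the position of per-suite overrides (below CLI flags, above top-level `pdrift.toml` values) is an assumption, and `resolve` is a hypothetical helper, not pdrift's API:

```python
DEFAULTS = {"threshold": 0.90, "samples": 1, "model": "BAAI/bge-small-en-v1.5"}


def resolve(key, cli=None, toml=None, suite=None):
    """Return (value, source) for one setting, highest-priority source first."""
    cli, toml = cli or {}, toml or {}
    suite_cfg = toml.get("suite", {}).get(suite, {}) if suite else {}
    layers = (
        ("cli", cli),                                   # flags explicitly passed
        (f"pdrift.toml [suite.{suite}]", suite_cfg),    # assumed per-suite layer
        ("pdrift.toml", toml),                          # top-level file values
        ("default", DEFAULTS),
    )
    for source, layer in layers:
        if key in layer:
            return layer[key], source
    raise KeyError(key)
```

With the `[suite.summarizer]` table above uncommented, `resolve("threshold", toml=parsed, suite="summarizer")` would return `(0.85, "pdrift.toml [suite.summarizer]")`, where `parsed` stands for the parsed file contents.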

pytest plugin

Installed automatically. Each case becomes a pytest test:

pytest --pdrift                    # MEANINGFUL = fail, TRIVIAL/IDENTICAL = pass
pytest --pdrift --pdrift-path prompts/

Missing baseline → the case is skipped with "run pdrift snapshot first". Failure messages include the similarity, noise floor, and diff. Without --pdrift the plugin does nothing.

CI: fail PRs on meaningful drift

# .github/workflows/prompt-drift.yml
name: prompt-drift
on: pull_request
jobs:
  pdrift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install pdrift
      - run: pdrift check prompts/   # exits 1 on meaningful changes

Because baselines live in git, the reviewable prompt-output diff is right there in the PR alongside the code change that caused it.

FAQ

Why local embeddings instead of an LLM judge or an embeddings API? Cost and trust. A check that costs money per run doesn't get run. Local ONNX embeddings are free, deterministic, offline, and fast enough to run on every commit. The first check downloads the model (~130 MB) once; after that, no network at all.

My outputs are non-deterministic. Won't every check fail? That's the noise floor's job. Snapshot with --samples 3 (or more): pdrift measures how much your own baselines disagree and only flags outputs that fall below that self-similarity. Temperature noise passes; meaning flips don't.

What does a check cost? Zero. No API keys anywhere in the tool. Identical outputs skip embedding entirely, and everything embedded once is cached in .pdrift/<suite>/embeddings.npy.
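
To illustrate why repeated checks avoid re-embedding, here is a minimal in-memory sketch of a cache keyed by the hash of the output text. It is not pdrift's implementation: the real cache lives in the `.npy` file mentioned above, SHA-256 is an assumed hash choice, and `EmbeddingCache` and the `embed` callable are hypothetical.

```python
import hashlib


class EmbeddingCache:
    """In-memory stand-in for a per-suite cache keyed by the output's hash."""

    def __init__(self, embed):
        self._embed = embed  # callable: str -> vector (hypothetical)
        self._store = {}
        self.misses = 0      # number of times the embedder was actually called

    def get(self, text):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed(text)
        return self._store[key]
```

Asking for the same output twice calls the embedder once; only a changed output triggers new work.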

Embeddings are weak at negation. Can a meaning flip sneak through? Sometimes. Similarity models can score negations higher than humans would. In practice, the flips we tested scored 0.57–0.74 against the 0.90 default and were comfortably flagged, but embedding-based comparison is a tradeoff, not magic. Because the bar is min(noise floor, threshold), multi-sample baselines never raise it above threshold; for sensitive cases, raise threshold per suite.

Windows? First-class. Developed on Windows; CI runs the matrix on ubuntu + windows, py3.10–3.12.

License

MIT

Release history

0.1.0 (this version), Jul 3, 2026

0.0.1, Jul 3, 2026

Download files

Source Distribution

pdrift-0.1.0.tar.gz (27.2 kB)

Uploaded Jul 3, 2026 Source

Built Distribution

pdrift-0.1.0-py3-none-any.whl (21.7 kB)

Uploaded Jul 3, 2026 Python 3

File details

Details for the file pdrift-0.1.0.tar.gz.

File metadata

Download URL: pdrift-0.1.0.tar.gz
Upload date: Jul 3, 2026
Size: 27.2 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for pdrift-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`17e7b31fad0904e7a7884f9924b1459d7a92d792698d181fdcba2afb3c769634`
MD5	`29310966f9189f268cf3c5ae779d1607`
BLAKE2b-256	`a51414826970fb38053292951cc120b2b0c73d0415fa0c2b7fcd2808f8476b23`

Provenance

The following attestation bundles were made for pdrift-0.1.0.tar.gz:

Publisher: release.yml on MIthunvasanth/pdrift

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pdrift-0.1.0.tar.gz
- Subject digest: 17e7b31fad0904e7a7884f9924b1459d7a92d792698d181fdcba2afb3c769634
- Sigstore transparency entry: 2063384541
- Sigstore integration time: Jul 3, 2026
Source repository:
- Permalink: MIthunvasanth/pdrift@917b9e01e3447f86288547dd4f1ff85b9cf3bcee
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/MIthunvasanth
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@917b9e01e3447f86288547dd4f1ff85b9cf3bcee
- Trigger Event: release

File details

Details for the file pdrift-0.1.0-py3-none-any.whl.

File metadata

Download URL: pdrift-0.1.0-py3-none-any.whl
Upload date: Jul 3, 2026
Size: 21.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for pdrift-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`3dcfdb17bd503dab98c8090ff63cd5d7b47fd39a4c6726d526fac923ca387cdd`
MD5	`c6ead384a4184da1cb9ed751735cfded`
BLAKE2b-256	`6a3622534c5726d8c45c2ee08edf7bf2d9afc454b1b0fc4e414eca1173968ac1`

Provenance

The following attestation bundles were made for pdrift-0.1.0-py3-none-any.whl:

Publisher: release.yml on MIthunvasanth/pdrift

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pdrift-0.1.0-py3-none-any.whl
- Subject digest: 3dcfdb17bd503dab98c8090ff63cd5d7b47fd39a4c6726d526fac923ca387cdd
- Sigstore transparency entry: 2063385287
- Sigstore integration time: Jul 3, 2026
Source repository:
- Permalink: MIthunvasanth/pdrift@917b9e01e3447f86288547dd4f1ff85b9cf3bcee
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/MIthunvasanth
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@917b9e01e3447f86288547dd4f1ff85b9cf3bcee
- Trigger Event: release
