Snapshot testing for LLM prompts - catch meaningful output drift, ignore harmless rephrasing. Local embeddings, no API key.

pdrift

You changed a prompt (or a model version), and now you don't know which of your LLM outputs silently changed meaning. pdrift is snapshot testing for prompts: like Jest snapshots, but judged semantically instead of byte by byte.

pip install pdrift

# pdrift_answers.py
from pdrift import case

@case(inputs="cases.jsonl")          # one JSON object per line: {"id": "...", "input": "..."}
def answers(input: str) -> str:
    return my_llm_call(input)        # any function that returns a string

pdrift snapshot   # run the suite, record baseline outputs
pdrift check      # re-run and flag *meaningful* drift; exits 1 in CI
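
The cases file is plain JSON Lines. Here is a minimal sketch for generating one, assuming only the two keys shown in the comment above (`id` and `input`). The case ids and prompts are made up for illustration, and `write_cases` / `read_cases` are hypothetical helpers, not part of pdrift:

```python
import json
from pathlib import Path

# Hypothetical cases; only the "id" and "input" keys come from the format above.
CASES = [
    {"id": "coffee-study", "input": "Summarize the coffee study in one sentence."},
    {"id": "deploy-status", "input": "Summarize last night's deployment."},
]


def write_cases(path: str, cases: list[dict]) -> None:
    """Write one JSON object per line (JSON Lines)."""
    lines = [json.dumps(case, ensure_ascii=False) for case in cases]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")


def read_cases(path: str) -> list[dict]:
    """Read the file back, skipping blank lines."""
    text = Path(path).read_text(encoding="utf-8")
    return [json.loads(line) for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    write_cases("cases.jsonl", CASES)
    print(read_cases("cases.jsonl"))
```

Keeping ids short and stable helps, since the reports identify each case by its id.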

See it catch a flipped meaning

A one-word prompt tweak turns "found no link" into "confirmed a strong link". Exact-match snapshots scream at every harmless rephrase; humans skim and miss the real one. pdrift buckets each case by embedding similarity — actual output:

$ pdrift check
                         pdrift check (threshold 0.9)
+----------------------------------------------------------------------------+
| suite | identical | trivial | meaningful | new case | missing case | error |
|-------+-----------+---------+------------+----------+--------------+-------|
| llm   |         0 |       3 |          2 |        0 |            0 |     0 |
+----------------------------------------------------------------------------+
                           meaningful changes - llm
+-----------------------------------------------------------------------------+
| case          |    sim | baseline vs current                                |
|---------------+--------+----------------------------------------------------|
| coffee-study  | 0.7449 | - The study found no link between coffee           |
|               |        | consumption and heart disease.                     |
|               |        | + The study confirmed a strong link between coffee |
|               |        | consumption and heart disease.                     |
| deploy-status | 0.5679 | - The deployment completed successfully and no     |
|               |        | downtime was reported.                             |
|               |        | + The deployment failed and caused significant     |
|               |        | downtime across all regions.                       |
+-----------------------------------------------------------------------------+
Drift detected.
# exit code 1

The three rephrasings (for example, "Hello! How can I help you today?" → "Hi there! How can I assist you today?", sim 0.9684) landed in trivial; trivial changes on their own leave the exit code at 0. The two flipped meanings got flagged. That's the whole tool.
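
To make "buckets each case by embedding similarity" concrete, here is a minimal sketch of the comparison step. It assumes cosine similarity over embedding vectors and uses tiny hand-written vectors in place of real fastembed output. `cosine_similarity` and `is_trivial` are illustrative names, not pdrift's API:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def is_trivial(similarity: float, threshold: float = 0.90) -> bool:
    """A change counts as trivial when similarity is at or above the threshold."""
    return similarity >= threshold


if __name__ == "__main__":
    baseline = [0.20, 0.70, 0.10]  # toy stand-ins for real embeddings
    rephrase = [0.22, 0.68, 0.12]
    flipped = [0.90, 0.05, 0.40]
    for name, vec in [("rephrase", rephrase), ("flipped", flipped)]:
        sim = cosine_similarity(baseline, vec)
        print(name, round(sim, 4), "trivial" if is_trivial(sim) else "meaningful")
```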

No server, no dashboard, no API key, no YAML pipeline — a dev tool, not a platform. Baselines are JSON files in your repo; the check is a CLI command with an exit code.

Why this doesn't drown you in false positives

Two things make pdrift trustworthy where naive semantic diffing isn't:

1. Local embeddings — free, offline, no API key. Similarity is computed with fastembed (ONNX, BAAI/bge-small-en-v1.5) on your machine. Checking costs zero dollars and zero network calls, so you can run it on every commit. Embeddings are cached per suite (keyed by output hash) — repeated checks don't even re-embed.

2. The noise floor — the tool learns each case's natural variance. LLMs at temperature > 0 rephrase themselves constantly. Take multiple baseline samples and pdrift measures how much the baselines differ from each other — the noise floor. A new output is flagged only if it's more different from the baselines than they are from each other:

pdrift snapshot --samples 3

                          meaningful changes - noisy
+-----------------------------------------------------------------------------+
| case       |    sim | noise floor | diff                                    |
|------------+--------+-------------+-----------------------------------------|
| water-boil | 0.6407 |      0.9633 | - The boiling point of water at sea     |
|            |        |             | level is 100 degrees Celsius.           |
|            |        |             | + Water never boils no matter how hot   |
|            |        |             | it gets; boiling is impossible.         |
+-----------------------------------------------------------------------------+

This case's baseline samples agree with each other at 0.9633; the new output only manages 0.6407 against the closest one, so it is flagged, with the numbers shown so you can see why. Meanwhile an honest paraphrase scoring 0.95 sails through: the bar is min(noise floor, threshold), which with the default threshold of 0.90 is min(0.9633, 0.90) = 0.90, and 0.95 clears it. No hand-tuned per-case thresholds.
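
A minimal sketch of the noise-floor idea, under stated assumptions: the floor is taken as the lowest pairwise similarity among the baseline samples, and a new output is scored against its closest baseline. The README does not say which aggregate pdrift uses, so treat `noise_floor` and `closest_similarity` as hypothetical helpers. Similarity is passed in as a function so any embedding backend could be plugged in; the demo uses difflib's character-level ratio purely as a stand-in, not as a semantic measure.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import Callable

Similarity = Callable[[str, str], float]


def noise_floor(baselines: list[str], sim: Similarity) -> float:
    """Lowest pairwise similarity among baseline samples (assumed aggregate)."""
    if len(baselines) < 2:
        raise ValueError("need at least two baseline samples")
    return min(sim(a, b) for a, b in combinations(baselines, 2))


def closest_similarity(output: str, baselines: list[str], sim: Similarity) -> float:
    """Similarity of a new output to its most similar baseline sample."""
    return max(sim(output, b) for b in baselines)


def char_ratio(a: str, b: str) -> float:
    """Stand-in similarity: character overlap, NOT embedding similarity."""
    return SequenceMatcher(None, a, b).ratio()


if __name__ == "__main__":
    samples = [
        "The boiling point of water at sea level is 100 degrees Celsius.",
        "At sea level, water boils at 100 degrees Celsius.",
        "Water boils at 100 degrees Celsius at sea level.",
    ]
    new_output = "Water never boils no matter how hot it gets; boiling is impossible."
    print("noise floor:", round(noise_floor(samples, char_ratio), 4))
    print("closest sim:", round(closest_similarity(new_output, samples, char_ratio), 4))
```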

How it works

Each case lands in exactly one bucket:

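+--------------------------------------------------------------------------------------------------+
| verdict      | meaning                                                               | exit code |
|--------------+-----------------------------------------------------------------------+-----------|
| identical    | exact string match to a baseline sample (embeddings skipped entirely) |         0 |
| trivial      | differs, but similarity ≥ min(noise floor, threshold)                 |         0 |
| meaningful   | more different from the baselines than they are from each other       |         1 |
| new case     | in the JSONL but not in the baseline                                  |         0 |
| missing case | in the baseline but gone from the JSONL                               |         1 |
| error        | the target function raised (recorded, never crashes the run)          |         1 |
+--------------------------------------------------------------------------------------------------+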
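
The bucket rules above can be sketched as a small function. This illustrates the table and is not pdrift's source: `classify` and `exit_code` are hypothetical names, and the new case, missing case, and error buckets are left out of `classify` on the assumption that they are decided before any comparison happens.

```python
from typing import Optional

# Buckets that make the run exit 1, per the table above.
FAILING = {"meaningful", "missing case", "error"}


def classify(
    output: str,
    baselines: list[str],
    similarity: float,
    threshold: float = 0.90,
    noise_floor: Optional[float] = None,
) -> str:
    """Return 'identical', 'trivial', or 'meaningful' for one case."""
    if output in baselines:
        return "identical"  # exact match: no embedding needed
    # With a noise floor available, the bar is min(noise floor, threshold).
    bar = threshold if noise_floor is None else min(noise_floor, threshold)
    return "trivial" if similarity >= bar else "meaningful"


def exit_code(verdicts: list[str]) -> int:
    """1 if any case landed in a failing bucket, else 0."""
    return 1 if any(v in FAILING for v in verdicts) else 0


if __name__ == "__main__":
    verdicts = [
        classify("Hi there!", ["Hello!"], similarity=0.9684),
        classify("strong link", ["no link"], similarity=0.7449),
        classify("never boils", ["boils at 100"], similarity=0.6407, noise_floor=0.9633),
    ]
    print(verdicts, "exit", exit_code(verdicts))
```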

Baselines are pretty-printed, key-sorted JSON in .pdrift/<suite>/baseline.json — commit them, and prompt-output changes show up as reviewable diffs in PRs. pdrift accept promotes the latest check run to the new baseline after you've reviewed a change.
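
Why key-sorted, pretty-printed JSON matters for review: the serialized text no longer depends on dict insertion order, so a git diff shows only real changes. A minimal sketch using the standard library; the layout here (`cases` mapping ids to sample lists) is invented for illustration and is not pdrift's actual baseline schema.

```python
import json


def dump_baseline(data: dict) -> str:
    """Serialize with sorted keys and indentation for stable, line-oriented diffs."""
    return json.dumps(data, indent=2, sort_keys=True, ensure_ascii=False) + "\n"


if __name__ == "__main__":
    # Hypothetical layout: case id -> list of recorded samples.
    first = {"cases": {"b-case": ["hello"], "a-case": ["world"]}}
    second = {"cases": {"a-case": ["world"], "b-case": ["hello"]}}
    # Same content in a different insertion order serializes identically.
    assert dump_baseline(first) == dump_baseline(second)
    print(dump_baseline(first), end="")
```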

JSON outputs get a structural diff

If both baseline and current outputs parse as JSON, pdrift skips embeddings and diffs the structure — keys added/removed, values changed, with dotted paths:

                       meaningful changes - jsonapi
+-------------------------------------------------------------------------+
| case    | sim | noise floor | diff                                      |
|---------+-----+-------------+-------------------------------------------|
| profile |   - |           - | ~ user.address.city: "Berlin" -> "Munich" |
|         |     |             | - removed user.email                      |
+-------------------------------------------------------------------------+

String values longer than 40 chars (summaries, bios) fall back to embedding similarity, so a rephrased description doesn't fail your schema check.
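
A structural diff with dotted paths can be sketched in a few lines. This is an illustration of the idea, not pdrift's implementation: `diff_json` is a hypothetical name, the `+ added` line format is assumed by analogy with the `- removed` lines in the report, lists are compared as whole values, and the long-string fallback to embeddings is left out.

```python
import json
from typing import Any


def diff_json(old: Any, new: Any, path: str = "") -> list[str]:
    """Recursively compare two parsed JSON values and describe the changes."""
    if isinstance(old, dict) and isinstance(new, dict):
        changes: list[str] = []
        for key in sorted(old.keys() | new.keys()):
            child = f"{path}.{key}" if path else key
            if key not in new:
                changes.append(f"- removed {child}")
            elif key not in old:
                changes.append(f"+ added {child}")
            else:
                changes.extend(diff_json(old[key], new[key], child))
        return changes
    if old != new:
        label = path or "<root>"
        return [f"~ {label}: {json.dumps(old)} -> {json.dumps(new)}"]
    return []


if __name__ == "__main__":
    baseline = json.loads(
        '{"user": {"address": {"city": "Berlin"}, "email": "a@example.com"}}'
    )
    current = json.loads('{"user": {"address": {"city": "Munich"}}}')
    for line in diff_json(baseline, current):
        print(line)
```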

Configuration (optional — zero config needed)

CLI flags override pdrift.toml, which overrides defaults. The report header shows the effective value and where it came from.

# pdrift.toml — everything optional
# threshold = 0.90                    # similarity at/above which a change is trivial
# samples = 1                         # baseline runs per case (3+ enables the noise floor)
# model = "BAAI/bge-small-en-v1.5"    # any fastembed-supported model

# [suite.summarizer]                  # per-suite overrides
# threshold = 0.85
# samples = 5
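
The precedence chain can be sketched as a lookup that also reports where a value came from. Assumptions, loudly: `resolve` is a hypothetical helper, the config is passed in as an already-parsed dict (so `[suite.summarizer]` appears as `{"suite": {"summarizer": {...}}}`), and per-suite values are assumed to win over top-level file values while CLI flags win over both.

```python
from typing import Any, Optional

# Defaults as listed in the commented pdrift.toml above.
DEFAULTS = {"threshold": 0.90, "samples": 1, "model": "BAAI/bge-small-en-v1.5"}


def resolve(
    key: str,
    cli: dict[str, Any],
    file_cfg: dict[str, Any],
    suite: Optional[str] = None,
) -> tuple[Any, str]:
    """Return (effective value, source) for one setting."""
    if cli.get(key) is not None:
        return cli[key], "cli"
    suite_cfg = file_cfg.get("suite", {}).get(suite, {}) if suite else {}
    if key in suite_cfg:
        return suite_cfg[key], f"pdrift.toml [suite.{suite}]"
    if key in file_cfg:
        return file_cfg[key], "pdrift.toml"
    return DEFAULTS[key], "default"


if __name__ == "__main__":
    cfg = {"threshold": 0.92, "suite": {"summarizer": {"threshold": 0.85, "samples": 5}}}
    print(resolve("threshold", {}, cfg, suite="summarizer"))
    print(resolve("threshold", {"threshold": 0.8}, cfg, suite="summarizer"))
    print(resolve("samples", {}, cfg))
```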

pytest plugin

Installed automatically. Each case becomes a pytest test:

pytest --pdrift                    # MEANINGFUL = fail, TRIVIAL/IDENTICAL = pass
pytest --pdrift --pdrift-path prompts/

Missing baseline → the case is skipped with "run pdrift snapshot first". Failure messages include the similarity, noise floor, and diff. Without --pdrift the plugin does nothing.

CI: fail PRs on meaningful drift

# .github/workflows/prompt-drift.yml
name: prompt-drift
on: pull_request
jobs:
  pdrift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install pdrift
      - run: pdrift check prompts/   # exits 1 on meaningful changes

Because baselines live in git, the reviewable prompt-output diff is right there in the PR alongside the code change that caused it.

FAQ

Why local embeddings instead of an LLM judge or an embeddings API? Cost and trust. A check that costs money per run doesn't get run. Local ONNX embeddings are free, deterministic, offline, and fast enough to run on every commit. The first check downloads the model (~130 MB) once; after that, no network at all.

My outputs are non-deterministic. Won't every check fail? That's the noise floor's job. Snapshot with --samples 3 (or more): pdrift measures how much your own baselines disagree and only flags outputs that fall below that self-similarity. Temperature noise passes; meaning flips don't.

What does a check cost? Zero. No API keys anywhere in the tool. Identical outputs skip embedding entirely, and everything embedded once is cached in .pdrift/<suite>/embeddings.npy.
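
As a rough illustration of why repeated checks cost nothing, here is a minimal sketch of a cache keyed by output hash. It is an assumption-laden stand-in: `EmbeddingCache` is a hypothetical class, SHA-256 is an assumed hash choice, the store is in memory rather than a `.npy` file, and the embedding function is injected so a fake one can be used here.

```python
import hashlib
from typing import Callable


class EmbeddingCache:
    """In-memory cache keyed by the SHA-256 of the output text."""

    def __init__(self, embed: Callable[[str], list[float]]) -> None:
        self._embed = embed
        self._store: dict[str, list[float]] = {}
        self.misses = 0  # number of times the embed function was actually called

    @staticmethod
    def key(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get(self, text: str) -> list[float]:
        k = self.key(text)
        if k not in self._store:
            self.misses += 1
            self._store[k] = self._embed(text)
        return self._store[k]


if __name__ == "__main__":
    # Fake embedder for the demo: a one-dimensional "embedding" of the text length.
    cache = EmbeddingCache(lambda text: [float(len(text))])
    cache.get("The deployment completed successfully.")
    cache.get("The deployment completed successfully.")  # served from the cache
    print("embed calls:", cache.misses)
```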

Embeddings are weak at negation — can a meaning flip sneak through? Sometimes similarity models score negations higher than humans would. In practice flips we tested score 0.57–0.74 against the 0.90 default — comfortably flagged — but embedding-based comparison is a tradeoff, not magic. Multi-sample baselines tighten the bar further; tune threshold per suite for sensitive cases.

Windows? First-class. Developed on Windows; CI runs the matrix on ubuntu + windows, py3.10–3.12.

License

MIT
