git diff for your prompt's behavior

diffprompt

pip install diffprompt
diffprompt diff v1.txt v2.txt --auto-generate

You changed one sentence in your prompt. Now you're wondering: did that actually help?

LangSmith tells you what happened after you shipped. Langfuse tells you what's happening right now. Neither tells you what will happen before you change anything.

diffprompt does.

What it does

You give it two prompts. It generates test cases, runs both prompts on all of them, measures the semantic difference between every output pair, and tells you exactly where v2 works better, where it regresses, and why.

$ diffprompt diff v1.txt v2.txt --auto-generate --n 20

diffprompt  v0.1.0  model: groq/llama-3.3-70b-versatile  judge: local/qwen2.5:7b  tests: 20
━━ SUMMARY
  18.2/100  ███░░░░░░░░░░░░░░░░░  4 improved  16 regressed  0 neutral
  mix:  9 typical  · 7 adversarial  · 2 boundary  · 2 format

━━ BEHAVIORAL PROFILE
  v2 performs well when...
  ✓ user_intent:informational        score 0.79  4 tests
  v2 struggles when...
  ✗ emotional_state:frustrated       score 0.43  5 tests
  ✗ request_type:specific_solutions  score 0.51  11 tests

━━ KEY EXAMPLES
  MOST IMPORTANT  emotional_state:frustrated  divergence 0.90  conf 0.91

  input  Can you help me with a math problem I'm stuck on
  v1     I'd be happy to help you with your math problem. What kind of problem are you working on?
  v2     What's the problem?
  why    v2's brevity instruction strips the empathetic framing that makes
         frustrated users feel heard before the question lands.

━━ VERDICT
  ✗ DO NOT SHIP
  Keep v1 for emotional_state:frustrated, request_type:specific_solutions.
  Primary failure mode: CONTEXT_LOSS (6 cases).

Install

pip install diffprompt

Requires Python 3.10+. Works fully offline with Ollama. No OpenAI key needed.

Quickstart

# Option A — use Groq (free at console.groq.com)
export GROQ_API_KEY=your_key_here

diffprompt diff v1.txt v2.txt --auto-generate

# Option B — run fully offline with Ollama
ollama pull qwen2.5:7b
diffprompt diff v1.txt v2.txt --auto-generate --local-only

# Option C — bring your own test inputs
diffprompt diff v1.txt v2.txt --test-file inputs.jsonl
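
For Option C, this README does not spell out the schema of inputs.jsonl. The sketch below assumes one JSON object per line with a single "input" field. That field name is a guess for illustration only, and write_test_inputs is a hypothetical helper, so check the project's documentation before relying on it.

```python
import json
from pathlib import Path


def write_test_inputs(path, inputs):
    """Write one JSON object per line (JSON Lines format).

    ASSUMPTION: the "input" key is hypothetical. diffprompt's real
    schema for --test-file is not documented in this README.
    """
    lines = [json.dumps({"input": text}, ensure_ascii=False) for text in inputs]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)
```

Calling write_test_inputs("inputs.jsonl", ["Can you help me with a math problem I'm stuck on"]) would produce a one-line file that could then be passed to --test-file, provided the assumed schema matches.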

How it works

1. Ontology inference

diffprompt reads your prompt and infers what input dimensions matter for testing it — tone, complexity, intent, emotional state, whatever's relevant. No hardcoded dimensions. Every prompt gets its own.

2. Test generation

Test cases are generated across four buckets: typical (real usage), adversarial (designed to find failures), boundary (edge cases), and format (unusual input styles). Each case is automatically tagged with its inferred dimensions.

3. Semantic diff

Both prompts run on all test cases concurrently. Outputs are compared using local embeddings (all-MiniLM-L6-v2) to produce a similarity score per pair. High similarity means the change didn't matter. Low similarity means something changed.
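
The similarity step can be pictured as cosine similarity between two output vectors. The sketch below is not diffprompt's code: to stay dependency-free it swaps the all-MiniLM-L6-v2 embeddings for a toy bag-of-words vector, and the definition of divergence as 1 minus similarity is an assumption.

```python
import math
from collections import Counter


def toy_embed(text):
    """Stand-in embedding: lowercase word counts.

    ASSUMPTION: diffprompt uses all-MiniLM-L6-v2 sentence embeddings;
    this bag-of-words vector only keeps the example self-contained.
    """
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an empty output as sharing nothing
    return dot / (norm_a * norm_b)


def divergence(v1_output, v2_output):
    """Hypothetical divergence score: 1 - cosine similarity."""
    return 1.0 - cosine_similarity(toy_embed(v1_output), toy_embed(v2_output))
```

Identical outputs give a divergence near 0, and outputs with no words in common give 1. A real sentence embedding would also capture paraphrases, which word counts cannot.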

4. LLM judge

For every meaningfully different pair, a judge LLM evaluates direction: improvement, regression, or neutral. Confident verdicts stay local. Uncertain ones escalate to a larger model automatically.
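
One way to read the escalation rule is as a confidence threshold on the local judge's verdict. The following is a minimal sketch under that assumption. Verdict, judge_with_escalation, and the 0.7 cutoff are hypothetical names and values, not diffprompt's API, and the judges are passed in as plain callables so no model is required.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Verdict:
    direction: str     # "improvement", "regression", or "neutral"
    confidence: float  # 0.0 to 1.0, as reported by the judge
    reason: str


# A judge takes (test input, v1 output, v2 output) and returns a Verdict.
Judge = Callable[[str, str, str], Verdict]


def judge_with_escalation(
    test_input: str,
    v1_output: str,
    v2_output: str,
    local_judge: Judge,
    remote_judge: Judge,
    min_confidence: float = 0.7,  # ASSUMPTION: illustrative cutoff
) -> Tuple[Verdict, str]:
    """Ask the local judge first; escalate only when it is unsure."""
    verdict = local_judge(test_input, v1_output, v2_output)
    if verdict.confidence >= min_confidence:
        return verdict, "local"
    return remote_judge(test_input, v1_output, v2_output), "escalated"
```

Keeping the confident cases local means the larger model is only called for the uncertain remainder.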

5. Behavioral slicing

Results are grouped by dimension. Instead of one aggregate score, you get a score per behavioral slice — not "47/100 overall" but "works for factual, breaks for emotional."
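
Slicing amounts to a group-by over dimension tags. The sketch below assumes each test result is a pair of (tags, score), with tags written like "emotional_state:frustrated". That shape and the slice_scores name are illustrative, not diffprompt's internal data model.

```python
from collections import defaultdict


def slice_scores(results):
    """Average per-test scores within each dimension tag.

    `results` is an iterable of (tags, score) pairs. A test with
    several tags contributes its score to every one of its slices.
    Returns {tag: (mean_score, test_count)}.
    """
    buckets = defaultdict(list)
    for tags, score in results:
        for tag in tags:
            buckets[tag].append(score)
    return {tag: (sum(s) / len(s), len(s)) for tag, s in buckets.items()}
```

Sorting the returned dictionary by mean score would separate the slices where v2 performs well from those where it struggles.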

6. Failure mode clustering

HDBSCAN clusters the judge's reasons automatically. Instead of 20 individual explanations, you get named failure modes: CONTEXT_LOSS, TONE_SHIFT, REFUSAL_SHIFT.
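
The clustering itself is not reproduced here. Assuming each regressed case already carries a cluster name (or None when it fits no cluster), the primary failure mode is simply the most frequent name. primary_failure_mode is a hypothetical helper for illustration.

```python
from collections import Counter


def primary_failure_mode(labels):
    """Return (name, count) for the most common failure-mode label.

    ASSUMPTION: `labels` holds one cluster name per regressed case,
    with None for cases the clustering left unassigned.
    """
    counts = Counter(label for label in labels if label is not None)
    if not counts:
        return None
    return counts.most_common(1)[0]
```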

Why not LangSmith / Langfuse?

Those tools monitor production. They tell you what happened.

diffprompt is a pre-flight check. It tells you what will happen before you touch production.

Different job. Different tool.

Model cascade — zero cost by default

Layer	Task	Default	Cost
Test generation	Generate inputs	qwen2.5:7b via Ollama	Free local
Embedding	Similarity	all-MiniLM-L6-v2	Free local
Runner	Execute prompts	llama-3.3-70b via Groq	Free tier
Judge	Verdict + reason	qwen2.5:7b via Ollama	Free local
Escalation	Low confidence	llama-3.3-70b via Groq	Free tier

Override the runner with --model and the judge with --judge.

CLI reference

diffprompt diff <v1> <v2> [options]

  --auto-generate         Generate test cases automatically
  --n INT                 Number of test cases (default: 40)
  --test-file PATH        Use existing test inputs from .jsonl file
  --model STRING          Override runner model
  --judge STRING          Override judge model
  --local-only            Never call external APIs
  --no-judge              Skip judge, similarity scores only
  --output FORMAT         terminal (default) | json | html
  --save PATH             Save report to file
  --top-n INT             Show top N key examples (default: 3)
  --verbose               Show all diffs ranked by divergence
  --quiet                 Score + verdict only
  --ci                    CI mode: exit 1 on regression
  --threshold INT         CI failure threshold 0-100 (default: 75)

Output formats

Terminal — color-coded, fits in one screen.

JSON — full structured report for downstream processing.

HTML — self-contained file, open in browser.

diffprompt diff v1.txt v2.txt --auto-generate --output html --save report.html

CI/CD integration

- name: Prompt regression check
  run: |
    diffprompt diff prompts/v1.txt prompts/v2.txt \
      --auto-generate \
      --ci \
      --threshold 75

Exits with code 1 if the score falls below the threshold, which fails the CI step and blocks the merge.
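
For CI systems driven by a script rather than YAML, the same gate can be wrapped in Python. This sketch uses only the flags listed in the CLI reference. build_command and run_gate are hypothetical helper names, and the prompt paths are placeholders.

```python
import subprocess


def build_command(v1_path, v2_path, threshold=75):
    """Assemble the diffprompt CI invocation as an argument list."""
    return [
        "diffprompt", "diff", str(v1_path), str(v2_path),
        "--auto-generate",
        "--ci",
        "--threshold", str(threshold),
    ]


def run_gate(v1_path, v2_path, threshold=75):
    """Run diffprompt and return its exit code.

    Per the README, a non-zero code means the check failed.
    """
    result = subprocess.run(build_command(v1_path, v2_path, threshold))
    return result.returncode
```

A CI script could end with sys.exit(run_gate("prompts/v1.txt", "prompts/v2.txt")) so the job inherits diffprompt's exit code.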

Philosophy

Prompts have behavior, not just text.

When you change a prompt, you're not editing a document. You're changing how a system responds to thousands of possible inputs. Most of those inputs you've never seen. Some of them are edge cases you didn't think to test.

diffprompt makes the invisible visible. It tells you which inputs your change helped, which it hurt, and why — before any of it reaches a user.

Stack

Python 3.10+ · sentence-transformers · HDBSCAN · UMAP · httpx · Click · Rich · Pydantic · Groq API · Ollama

Contributing

Issues and PRs welcome.

git clone https://github.com/RudraDudhat2509/diffprompt
cd diffprompt
pip install -e ".[dev]"
pytest tests/ -v

License

MIT

Release history

This version

0.1.0

Apr 9, 2026

Download files

Source Distribution

diffprompt-0.1.0.tar.gz (41.2 kB)

Uploaded Apr 9, 2026 Source

Built Distribution

diffprompt-0.1.0-py3-none-any.whl (29.6 kB)

Uploaded Apr 9, 2026 Python 3

File details

Details for the file diffprompt-0.1.0.tar.gz.

File metadata

Download URL: diffprompt-0.1.0.tar.gz
Upload date: Apr 9, 2026
Size: 41.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.7

File hashes

Hashes for diffprompt-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`350dd0c026dff76115ffe203310c42cd125885dd94d8298b26d88c3db9a6a5d3`
MD5	`dbc61ace8d35beb70850412c31b1c456`
BLAKE2b-256	`de259633e21a2452604e9eda72a70bf41845372b1e3f4051b2824a988b0e5875`

File details

Details for the file diffprompt-0.1.0-py3-none-any.whl.

File metadata

Download URL: diffprompt-0.1.0-py3-none-any.whl
Upload date: Apr 9, 2026
Size: 29.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.7

File hashes

Hashes for diffprompt-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`afea6140b36b5ebfdd8123e72a6c35b102725def5ab2936c071fbebb3af22596`
MD5	`1ea9573f39075b3bb3f3d44eab11c653`
BLAKE2b-256	`a1a753a22aaa52321684c2f5d6ae4e890568ffd9f9200f236116203557d9f90a`
