git diff for your prompt's behavior
Project description
diffprompt
git diff for your prompt's behavior
pip install diffprompt
diffprompt diff v1.txt v2.txt --auto-generate
You changed one sentence in your prompt. Now you're wondering: did that actually help?
LangSmith tells you what happened after you shipped. LangFuse tells you what's happening right now. Neither tells you what will happen before you change anything.
diffprompt does.
What it does
You give it two prompts. It generates test cases, runs both prompts on all of them, measures the semantic difference between every output pair, and tells you exactly where v2 works better, where it regresses, and why.
$ diffprompt diff v1.txt v2.txt --auto-generate --n 20
diffprompt v0.1.0 model: groq/llama-3.3-70b-versatile judge: local/qwen2.5:7b tests: 20
━━ SUMMARY
18.2/100 ███░░░░░░░░░░░░░░░░░ 4 improved 16 regressed 0 neutral
mix: 9 typical · 7 adversarial · 2 boundary · 2 format
━━ BEHAVIORAL PROFILE
v2 performs well when...
✓ user_intent:informational score 0.79 4 tests
v2 struggles when...
✗ emotional_state:frustrated score 0.43 5 tests
✗ request_type:specific_solutions score 0.51 11 tests
━━ KEY EXAMPLES
MOST IMPORTANT emotional_state:frustrated divergence 0.90 conf 0.91
input Can you help me with a math problem I'm stuck on
v1 I'd be happy to help you with your math problem. What kind of problem are you working on?
v2 What's the problem?
why v2's brevity instruction strips the empathetic framing that makes
frustrated users feel heard before the question lands.
━━ VERDICT
✗ DO NOT SHIP
Keep v1 for emotional_state:frustrated, request_type:specific_solutions.
Primary failure mode: CONTEXT_LOSS (6 cases).
Install
pip install diffprompt
Requires Python 3.10+. Works fully offline with Ollama. No OpenAI key needed.
Quickstart
# Option A — use Groq (free at console.groq.com)
export GROQ_API_KEY=your_key_here
diffprompt diff v1.txt v2.txt --auto-generate
# Option B — run fully offline with Ollama
ollama pull qwen2.5:7b
diffprompt diff v1.txt v2.txt --auto-generate --local-only
# Option C — bring your own test inputs
diffprompt diff v1.txt v2.txt --test-file inputs.jsonl
How it works
1. Ontology inference
diffprompt reads your prompt and infers what input dimensions matter for testing it — tone, complexity, intent, emotional state, whatever's relevant. No hardcoded dimensions. Every prompt gets its own.
2. Test generation
Test cases are generated across four buckets: typical (real usage), adversarial (designed to find failures), boundary (edge cases), and format (unusual input styles). Each case is automatically tagged with its inferred dimensions.
3. Semantic diff
Both prompts run on all test cases concurrently. Outputs are compared using local embeddings (all-MiniLM-L6-v2) to produce a similarity score per pair. High similarity means the change didn't matter. Low similarity means something changed.
4. LLM judge
For every meaningfully different pair, a judge LLM evaluates direction: improvement, regression, or neutral. Confident verdicts stay local. Uncertain ones escalate to a larger model automatically.
5. Behavioral slicing
Results are grouped by dimension. Instead of one aggregate score, you get a score per behavioral slice — not "47/100 overall" but "works for factual, breaks for emotional."
6. Failure mode clustering
HDBSCAN clusters the judge's reasons automatically. Instead of 20 individual explanations, you get named failure modes: CONTEXT_LOSS, TONE_SHIFT, REFUSAL_SHIFT.
Why not LangSmith / Langfuse?
Those tools monitor production. They tell you what happened.
diffprompt is a pre-flight check. It tells you what will happen before you touch production.
Different job. Different tool.
Model cascade — zero cost by default
| Layer | Task | Default | Cost |
|---|---|---|---|
| Test generation | Generate inputs | qwen2.5:7b via Ollama | Free local |
| Embedding | Similarity | all-MiniLM-L6-v2 | Free local |
| Runner | Execute prompts | llama-3.3-70b via Groq | Free tier |
| Judge | Verdict + reason | qwen2.5:7b via Ollama | Free local |
| Escalation | Low confidence | llama-3.3-70b via Groq | Free tier |
Override any layer with --model and --judge.
CLI reference
diffprompt diff <v1> <v2> [options]
--auto-generate Generate test cases automatically
--n INT Number of test cases (default: 40)
--test-file PATH Use existing test inputs from .jsonl file
--model STRING Override runner model
--judge STRING Override judge model
--local-only Never call external APIs
--no-judge Skip judge, similarity scores only
--output FORMAT terminal (default) | json | html
--save PATH Save report to file
--top-n INT Show top N key examples (default: 3)
--verbose Show all diffs ranked by divergence
--quiet Score + verdict only
--ci CI mode: exit 1 on regression
--threshold INT CI failure threshold 0-100 (default: 75)
Output formats
Terminal — color-coded, fits in one screen.
JSON — full structured report for downstream processing.
HTML — self-contained file, open in browser.
diffprompt diff v1.txt v2.txt --auto-generate --output html --save report.html
CI/CD integration
- name: Prompt regression check
run: |
diffprompt diff prompts/v1.txt prompts/v2.txt \
--auto-generate \
--ci \
--threshold 75
Exits with code 1 if regression score drops below threshold. Merge blocked.
Philosophy
Prompts have behavior, not just text.
When you change a prompt, you're not editing a document. You're changing how a system responds to thousands of possible inputs. Most of those inputs you've never seen. Some of them are edge cases you didn't think to test.
diffprompt makes the invisible visible. It tells you which inputs your change helped, which it hurt, and why — before any of it reaches a user.
Stack
Python 3.10+ · sentence-transformers · HDBSCAN · UMAP · httpx · Click · Rich · Pydantic · Groq API · Ollama
Contributing
Issues and PRs welcome.
git clone https://github.com/RudraDudhat2509/diffprompt
cd diffprompt
pip install -e ".[dev]"
pytest tests/ -v
License
MIT
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file diffprompt-0.1.0.tar.gz.
File metadata
- Download URL: diffprompt-0.1.0.tar.gz
- Upload date:
- Size: 41.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
350dd0c026dff76115ffe203310c42cd125885dd94d8298b26d88c3db9a6a5d3
|
|
| MD5 |
dbc61ace8d35beb70850412c31b1c456
|
|
| BLAKE2b-256 |
de259633e21a2452604e9eda72a70bf41845372b1e3f4051b2824a988b0e5875
|
File details
Details for the file diffprompt-0.1.0-py3-none-any.whl.
File metadata
- Download URL: diffprompt-0.1.0-py3-none-any.whl
- Upload date:
- Size: 29.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
afea6140b36b5ebfdd8123e72a6c35b102725def5ab2936c071fbebb3af22596
|
|
| MD5 |
1ea9573f39075b3bb3f3d44eab11c653
|
|
| BLAKE2b-256 |
a1a753a22aaa52321684c2f5d6ae4e890568ffd9f9200f236116203557d9f90a
|