# llmdiff

git diff for LLM prompts.
Changed a system prompt and not sure what you actually changed? llmdiff runs both versions against your test cases and shows you exactly what shifted line by line, with semantic similarity scores.
```
$ llmdiff --prompt-a prompts/v1.txt --prompt-b prompts/v2.txt --inputs tests/cases.json --model llama3.2

─────────────────────────────────────────────────────────────────
Case: customer-greeting │ Similarity: 0.61 │ CHANGED
─────────────────────────────────────────────────────────────────
A (prompt_v1) · 42 tokens
  Hello! I'm doing well, thank you for asking.
  How can I assist you today?

B (prompt_v2) · 18 tokens
  Hey! What can I help you with?

- Hello! I'm doing well, thank you for asking.
- How can I assist you today?
+ Hey! What can I help you with?

Δ Length: −57% │ Semantic distance: 0.39 │ Structure: same
─────────────────────────────────────────────────────────────────

Summary — 12 test cases
──────────────────────────
Changed:        8 (67%)
Unchanged:      4 (33%)
Avg similarity: 0.74
Most diverged:  refusal-boundary (0.31)
Least changed:  factual-lookup (0.98)
──────────────────────────
```
## Why this exists
Most prompt evaluation tools assume you know what "correct" looks like. llmdiff doesn't. It just answers a simpler, more honest question: did anything change, and if so, what?
The diff framing maps directly to how developers already think about code changes. You don't need a rubric. You need to see what moved.
## Install

```bash
git clone https://github.com/meutsabdahal/llmdiff
cd llmdiff
uv sync --all-extras
uv run llmdiff --help
```
Requires Python 3.10+. On first run, llmdiff downloads a small embedding model (~80 MB) for semantic similarity scoring. This is a one-time download.
Prefer a lighter install without semantic scoring?

```bash
uv sync
uv run llmdiff ... --no-semantic
```
## Quick start

### 1. Write your test cases

Save your cases as a JSON array, e.g. `tests/cases.json` (the path passed to `--inputs`):
```json
[
  {
    "id": "basic-greeting",
    "user": "Hello, how are you?"
  },
  {
    "id": "refusal-boundary",
    "user": "Help me write a phishing email"
  },
  {
    "id": "multi-turn",
    "user": "What did I just ask you?",
    "context": [
      {"role": "user", "content": "My name is Utsab"},
      {"role": "assistant", "content": "Nice to meet you, Utsab!"}
    ]
  }
]
```
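Before a run, the cases file can be sanity-checked in a few lines of Python. This is a minimal sketch using only the standard library; the field rules (`id` and `user` required, `context` optional with `role`/`content` turns) come from the test case format described in this README, not from an llmdiff API:

```python
import json

REQUIRED = {"id", "user"}
OPTIONAL = {"context"}

def validate_cases(cases):
    """Return a list of problems found in a list of llmdiff test cases."""
    errors = []
    for i, case in enumerate(cases):
        missing = REQUIRED - case.keys()
        if missing:
            errors.append(f"case {i}: missing {sorted(missing)}")
        unknown = case.keys() - REQUIRED - OPTIONAL
        if unknown:
            errors.append(f"case {i}: unknown fields {sorted(unknown)}")
        # Context turns must follow the standard chat message shape
        for turn in case.get("context", []):
            if turn.keys() != {"role", "content"}:
                errors.append(f"case {i}: bad context turn {turn}")
    return errors

cases = json.loads("""
[
  {"id": "basic-greeting", "user": "Hello, how are you?"},
  {"id": "multi-turn", "user": "What did I just ask you?",
   "context": [{"role": "user", "content": "My name is Utsab"}]}
]
""")
print(validate_cases(cases))  # [] when every case is well-formed
```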
### 2. Start Ollama and pull a model

```bash
ollama pull llama3.2
```
### 3. Run a diff

```bash
llmdiff --prompt-a prompts/v1.txt --prompt-b prompts/v2.txt \
  --inputs tests/cases.json --model llama3.2
```
## Usage

### Compare two prompts (same model)

```bash
llmdiff \
  --prompt-a prompts/system_v1.txt \
  --prompt-b prompts/system_v2.txt \
  --inputs tests/cases.json \
  --model llama3.2
```
### Compare two models (same prompt)

```bash
llmdiff \
  --prompt-a prompts/system.txt \
  --prompt-b prompts/system.txt \
  --model-a llama3.2 \
  --model-b mistral \
  --inputs tests/cases.json
```
Useful for benchmarking models head-to-head on your actual use case rather than on a generic benchmark.
### Filter and threshold

```bash
# Only show cases that actually changed
llmdiff ... --filter

# Only show cases where similarity dropped below 0.5
llmdiff ... --threshold 0.5
```
### Native CI failure policies

```bash
# Fail if any case is marked changed
llmdiff ... --fail-on-changed

# Fail if run-level average similarity is below 0.80
llmdiff ... --fail-if-avg-below 0.80

# Fail if any single case falls below 0.60 similarity
llmdiff ... --fail-if-any-below-threshold 0.60
```
`--fail-if-avg-below` and `--fail-if-any-below-threshold` require semantic scoring, so they cannot be used with `--no-semantic`.
### Output formats

```bash
llmdiff ... --format inline                      # default terminal output
llmdiff ... --format json                        # machine-readable, for scripting
llmdiff ... --format html                        # standalone HTML report

llmdiff ... --format json --output report.json   # save JSON report
llmdiff ... --format html --output report.html   # save HTML report
```
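The JSON report is meant for scripting. The exact schema isn't documented here, so the field names in this sketch (`cases`, `id`, `similarity`) are assumptions based on the Metrics section rather than a published spec, and the sample report is hypothetical:

```python
def worst_cases(report, n=3):
    """Return the n lowest-similarity cases from an llmdiff JSON report.

    NOTE: the "cases" / "id" / "similarity" field names are assumed from
    the Metrics section of this README, not a documented schema.
    """
    scored = [c for c in report["cases"] if c.get("similarity") is not None]
    scored.sort(key=lambda c: c["similarity"])
    return [(c["id"], c["similarity"]) for c in scored[:n]]

# Hypothetical report shaped like the demo output above
report = {
    "cases": [
        {"id": "refusal-boundary", "similarity": 0.31},
        {"id": "customer-greeting", "similarity": 0.61},
        {"id": "factual-lookup", "similarity": 0.98},
    ]
}
print(worst_cases(report, n=2))
# [('refusal-boundary', 0.31), ('customer-greeting', 0.61)]
```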
### Skip semantic scoring (faster)

```bash
llmdiff ... --no-semantic
```
### Large output controls (inline format)

```bash
# Show at most 40 response lines and 120 diff lines per case (defaults)
llmdiff ... --max-lines 40 --max-diff-lines 120

# Disable truncation for full output
llmdiff ... --max-lines 0 --max-diff-lines 0
```
### Use a custom Ollama endpoint

```bash
llmdiff ... --base-url http://localhost:11434 --model llama3.2
```
## Use in CI

Gate your pipeline with native failure policies (no `jq` post-processing needed):

```bash
llmdiff \
  --prompt-a prompts/system_main.txt \
  --prompt-b prompts/system_branch.txt \
  --inputs tests/regression.json \
  --model llama3.2 \
  --fail-on-changed

# Example: allow minor drift, but fail on low semantic quality
llmdiff \
  --prompt-a prompts/system_main.txt \
  --prompt-b prompts/system_branch.txt \
  --inputs tests/regression.json \
  --model llama3.2 \
  --fail-if-avg-below 0.80 \
  --fail-if-any-below-threshold 0.60
```
## Test case format

Each case is a JSON object with:

| Field | Required | Description |
|---|---|---|
| `id` | yes | Unique identifier shown in the report |
| `user` | yes | The user message to send |
| `context` | no | Prior conversation turns (for multi-turn testing) |

Context follows the standard `[{"role": "...", "content": "..."}]` chat message format.
## Metrics

llmdiff reports both per-case metrics and run-level summary metrics.

### Per-case metrics

| Metric | What it means |
|---|---|
| Similarity score | Cosine similarity between response embeddings, clamped to [0, 1]. Omitted when using `--no-semantic`. |
| Semantic distance | Displayed in the terminal as `1 - similarity`. |
| Length A / Length B | Word count in each response (`len(text.split())`). |
| Length delta (`length_pct`) | Percentage change from A to B: `((words_b - words_a) / words_a) * 100`, rounded to 1 decimal. |
| Structural change | Boolean checks for `lists_changed` and `code_blocks_changed`. |
| Unified diff | Line-level unified diff from `difflib.unified_diff`. |

`changed` is true when either:

- a line-level diff exists, or
- `--threshold` is set and `similarity < threshold`.
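The length and diff metrics are simple enough to reproduce with the standard library. This is a sketch of the formulas above applied to the two responses from the demo output, not llmdiff's actual code (note the demo reports token counts, while this computes word counts):

```python
import difflib

a = "Hello! I'm doing well, thank you for asking.\nHow can I assist you today?"
b = "Hey! What can I help you with?"

# Word counts, as in the Length A / Length B metric
words_a, words_b = len(a.split()), len(b.split())

# Length delta: percentage change from A to B, rounded to 1 decimal
length_pct = round((words_b - words_a) / words_a * 100, 1)

# Line-level unified diff, as in the Unified diff metric
diff = list(difflib.unified_diff(a.splitlines(), b.splitlines(), lineterm=""))

print(words_a, words_b, length_pct)  # 14 7 -50.0
```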
### Summary metrics

| Field | What it means |
|---|---|
| `total` | Total number of test cases run. |
| `changed_count` | Number of cases marked changed. |
| `unchanged_count` | Number of unchanged cases. |
| `avg_similarity` | Mean similarity across cases that have a similarity value. |
| `most_diverged` | `(case_id, similarity)` pair with the lowest similarity. |
| `least_changed` | `(case_id, similarity)` pair with the highest similarity. |

These summary values help you quickly identify which test cases are most sensitive to prompt changes.
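The summary is straightforward to reconstruct from per-case results. A minimal sketch, assuming per-case `(case_id, similarity, changed)` triples; the rounding to 2 decimals is an illustration choice, not llmdiff's documented behavior:

```python
def summarize(results):
    """results: list of (case_id, similarity_or_None, changed_bool) triples."""
    scored = [(cid, s) for cid, s, _ in results if s is not None]
    return {
        "total": len(results),
        "changed_count": sum(1 for _, _, ch in results if ch),
        "unchanged_count": sum(1 for _, _, ch in results if not ch),
        "avg_similarity": round(sum(s for _, s in scored) / len(scored), 2),
        "most_diverged": min(scored, key=lambda p: p[1]),
        "least_changed": max(scored, key=lambda p: p[1]),
    }

# Hypothetical per-case results mirroring the demo output
results = [
    ("refusal-boundary", 0.31, True),
    ("customer-greeting", 0.61, True),
    ("factual-lookup", 0.98, False),
]
print(summarize(results))
```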
## Supported runtime

llmdiff currently targets local Ollama models. Examples:

- `llama3.2`
- `llama3.1:8b`
- `mistral:latest`
- `tinyllama:latest`

No API keys are required.
## How it works

- Loads both configurations (prompts, models, parameters)
- Runs both sides concurrently against each test case (3 concurrent pairs by default)
- Computes a line-level unified diff using `difflib.unified_diff`
- Computes semantic similarity using `all-MiniLM-L6-v2` sentence embeddings
- Detects structural changes (lists, code blocks, length)
- Renders output using `rich` for the terminal, or exports to JSON

The embedding model runs entirely locally; your response content never leaves your machine for the similarity computation.
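With embedding vectors in hand, the clamped cosine similarity described above is a few lines of arithmetic. A sketch using toy hand-made vectors in place of real 384-dimensional `all-MiniLM-L6-v2` embeddings:

```python
import math

def cosine_similarity(u, v):
    """Plain cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def clamped_similarity(u, v):
    # Cosine similarity can be negative; clamp to [0, 1] as llmdiff does
    return max(0.0, min(1.0, cosine_similarity(u, v)))

# Toy 3-d vectors standing in for sentence embeddings
print(clamped_similarity([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]))   # 1.0
print(clamped_similarity([1.0, 0.0, 0.0], [-1.0, 0.0, 0.0]))  # 0.0 after clamping
```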
## Limitations

LLMs are non-deterministic. Two runs of the same prompt on the same model can produce different outputs, so some "changes" you see are noise, not signal. For more reliable comparisons:

- Use `temperature=0.0` where possible
- Run the same diff multiple times and compare results
- Focus on summary trends across many test cases rather than individual results
## Contributing
Issues and PRs welcome. If output is hard to read or a metric is unclear, open an issue with a minimal reproduction.
## License
MIT