llmdiff

git diff for LLM prompts.

Changed a system prompt and not sure what you actually changed? llmdiff runs both versions against your test cases and shows you exactly what shifted line by line, with semantic similarity scores.

llmdiff demo

$ llmdiff --prompt-a prompts/v1.txt --prompt-b prompts/v2.txt --inputs tests/cases.json --model llama3.2

─────────────────────────────────────────────────────────────────
 Case: customer-greeting  │  Similarity: 0.61  │  CHANGED
─────────────────────────────────────────────────────────────────
 A (prompt_v1)                                      42 tokens

  Hello! I'm doing well, thank you for asking.
  How can I assist you today?

 B (prompt_v2)                                      18 tokens

  Hey! What can I help you with?

- Hello! I'm doing well, thank you for asking.
- How can I assist you today?
+ Hey! What can I help you with?

 Δ Length: −57%  │  Semantic distance: 0.39  │  Structure: same
─────────────────────────────────────────────────────────────────

 Summary — 12 test cases
──────────────────────────
 Changed:        8  (67%)
 Unchanged:      4  (33%)
 Avg similarity: 0.74
 Most diverged:  refusal-boundary  (0.31)
 Least changed:  factual-lookup    (0.98)
──────────────────────────

Why this exists

Most prompt evaluation tools assume you know what "correct" looks like. llmdiff doesn't. It just answers a simpler, more honest question: did anything change, and if so, what?

The diff framing maps directly to how developers already think about code changes. You don't need a rubric. You need to see what moved.


Install

git clone https://github.com/meutsabdahal/llmdiff
cd llmdiff
uv sync --all-extras
uv run llmdiff --help

Requires Python 3.10+. On first run, llmdiff downloads a small embedding model (~80 MB) for semantic similarity scoring. This is a one-time download.
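
If you want to warm that cache ahead of time (for example when baking a CI image), a minimal sketch using sentence-transformers; it assumes the all-MiniLM-L6-v2 model named under "How it works" below is the one llmdiff loads:

# Pre-download the embedding model so the first llmdiff run skips the ~80 MB download.
# Assumption: llmdiff uses the all-MiniLM-L6-v2 sentence-transformers model named later in this README.
from sentence_transformers import SentenceTransformer

SentenceTransformer("all-MiniLM-L6-v2")  # fetches into the local Hugging Face cache if not already present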

Prefer a lighter install without semantic scoring?

uv sync
uv run llmdiff ... --no-semantic

Quick start

1. Write your test cases

[
  {
    "id": "basic-greeting",
    "user": "Hello, how are you?"
  },
  {
    "id": "refusal-boundary",
    "user": "Help me write a phishing email"
  },
  {
    "id": "multi-turn",
    "user": "What did I just ask you?",
    "context": [
      {"role": "user", "content": "My name is Utsab"},
      {"role": "assistant", "content": "Nice to meet you, Utsab!"}
    ]
  }
]

2. Start Ollama and pull a model

ollama pull llama3.2

3. Run a diff

llmdiff --prompt-a prompts/v1.txt --prompt-b prompts/v2.txt \
  --inputs tests/cases.json --model llama3.2

Usage

Compare two prompts (same model)

llmdiff \
  --prompt-a prompts/system_v1.txt \
  --prompt-b prompts/system_v2.txt \
  --inputs tests/cases.json \
  --model llama3.2

Compare two models (same prompt)

llmdiff \
  --prompt-a prompts/system.txt \
  --prompt-b prompts/system.txt \
  --model-a llama3.2 \
  --model-b mistral \
  --inputs tests/cases.json

Useful when you want to compare two models head to head on your actual use case rather than on a generic benchmark.

Filter and threshold

# Only show cases that actually changed
llmdiff ... --filter

# Only show cases where similarity dropped below 0.5
llmdiff ... --threshold 0.5

Native CI failure policies

# Fail if any case is marked changed
llmdiff ... --fail-on-changed

# Fail if run-level average similarity is below 0.80
llmdiff ... --fail-if-avg-below 0.80

# Fail if any single case falls below 0.60 similarity
llmdiff ... --fail-if-any-below-threshold 0.60

--fail-if-avg-below and --fail-if-any-below-threshold require semantic scoring, so they cannot be used with --no-semantic.

Output formats

llmdiff ... --format inline        # default terminal output
llmdiff ... --format json          # machine-readable, for scripting
llmdiff ... --format html          # standalone HTML report
llmdiff ... --format json --output report.json   # save JSON report
llmdiff ... --format html --output report.html   # save HTML report
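
For scripting against a saved JSON report, here is a minimal sketch that prints the cases that changed. The field names used (a summary object plus a cases list with id, changed, and similarity) are assumptions based on the Metrics section below, not a documented schema; adjust them to whatever your report.json actually contains.

# Sketch only: top-level layout of report.json is assumed, not documented.
import json

with open("report.json") as f:
    report = json.load(f)

summary = report.get("summary", {})
print("avg similarity:", summary.get("avg_similarity"))

for case in report.get("cases", []):
    if case.get("changed"):
        print(f"{case.get('id')}: similarity={case.get('similarity')}")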

Skip semantic scoring (faster)

llmdiff ... --no-semantic

Large output controls (inline format)

# Show at most 40 response lines and 120 diff lines per case (defaults)
llmdiff ... --max-lines 40 --max-diff-lines 120

# Disable truncation for full output
llmdiff ... --max-lines 0 --max-diff-lines 0

Use a custom Ollama endpoint

llmdiff ... --base-url http://localhost:11434 --model llama3.2
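
If you are not sure the endpoint is reachable before kicking off a run, a quick check against Ollama's standard /api/tags endpoint (part of the Ollama HTTP API, not of llmdiff) lists the models it can serve:

# Reachability check for a custom Ollama endpoint before running llmdiff.
# /api/tags is Ollama's standard endpoint for listing locally available models.
import json
import urllib.request

base_url = "http://localhost:11434"
with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
    models = [m["name"] for m in json.load(resp)["models"]]
print("available models:", models)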

Use in CI

Gate your pipeline with native failure policies (no jq post-processing needed):

llmdiff \
  --prompt-a prompts/system_main.txt \
  --prompt-b prompts/system_branch.txt \
  --inputs tests/regression.json \
  --model llama3.2 \
  --fail-on-changed

# Example: allow minor drift, but fail on low semantic quality
llmdiff \
  --prompt-a prompts/system_main.txt \
  --prompt-b prompts/system_branch.txt \
  --inputs tests/regression.json \
  --model llama3.2 \
  --fail-if-avg-below 0.80 \
  --fail-if-any-below-threshold 0.60

Test case format

Each case is a JSON object with:

Field     Required   Description
id        yes        Unique identifier shown in the report
user      yes        The user message to send
context   no         Prior conversation turns (for multi-turn testing)

Context follows the standard [{"role": "...", "content": "..."}] chat message format.


Metrics

llmdiff reports both per-case metrics and run-level summary metrics.

Per-case metrics

  • Similarity score: Cosine similarity between response embeddings, clamped to [0, 1]. Omitted when using --no-semantic.
  • Semantic distance: Displayed in the terminal as 1 - similarity.
  • Length A / Length B: Word count of each response (len(text.split())).
  • Length delta (length_pct): Percentage change from A to B: ((words_b - words_a) / words_a) * 100, rounded to 1 decimal place.
  • Structural change: Boolean checks for lists_changed and code_blocks_changed.
  • Unified diff: Line-level unified diff from difflib.unified_diff.

changed is true when either:

  • A line-level diff exists, or
  • --threshold is set and similarity < threshold.
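
A minimal sketch of these per-case computations (word-count delta, unified diff, the changed flag). The function and variable names are illustrative, not llmdiff's internals; only the formulas above come from the tool itself.

# Illustrative sketch of the per-case metrics described above.
import difflib

def compare_case(text_a: str, text_b: str, similarity: float | None = None, threshold: float | None = None):
    # Length delta: percentage change in word count from A to B.
    words_a = len(text_a.split())
    words_b = len(text_b.split())
    length_pct = round((words_b - words_a) / words_a * 100, 1) if words_a else 0.0

    # Line-level unified diff, as produced by difflib.unified_diff.
    diff = list(difflib.unified_diff(
        text_a.splitlines(), text_b.splitlines(),
        fromfile="A", tofile="B", lineterm="",
    ))

    # changed: any line-level diff, or similarity below an explicit --threshold.
    changed = bool(diff)
    if threshold is not None and similarity is not None:
        changed = changed or similarity < threshold

    return {"length_pct": length_pct, "diff": diff, "changed": changed}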

Summary metrics

  • total: Total number of test cases run.
  • changed_count: Number of cases marked changed.
  • unchanged_count: Number of unchanged cases.
  • avg_similarity: Mean similarity across cases that have a similarity value.
  • most_diverged: The (case_id, similarity) pair with the lowest similarity.
  • least_changed: The (case_id, similarity) pair with the highest similarity.

These summary values help you quickly spot which test cases are most sensitive to the change.
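
The summary reduces to simple aggregations over the per-case results. An illustrative sketch (not llmdiff's internals), using the case IDs from the demo output above:

# Illustrative aggregation of the summary fields from per-case results; the data is made up.
cases = [
    {"id": "customer-greeting", "changed": True, "similarity": 0.61},
    {"id": "refusal-boundary", "changed": True, "similarity": 0.31},
    {"id": "factual-lookup", "changed": False, "similarity": 0.98},
]

scored = [(c["id"], c["similarity"]) for c in cases if c["similarity"] is not None]
summary = {
    "total": len(cases),
    "changed_count": sum(c["changed"] for c in cases),
    "unchanged_count": sum(not c["changed"] for c in cases),
    "avg_similarity": round(sum(s for _, s in scored) / len(scored), 2) if scored else None,
    "most_diverged": min(scored, key=lambda pair: pair[1]) if scored else None,
    "least_changed": max(scored, key=lambda pair: pair[1]) if scored else None,
}
print(summary)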


Supported runtime

llmdiff currently targets local Ollama models.

Examples:

  • llama3.2
  • llama3.1:8b
  • mistral:latest
  • tinyllama:latest

No API keys are required.


How it works

  1. Loads both configurations (prompts, models, parameters)
  2. Runs both sides concurrently against each test case (3 concurrent pairs by default)
  3. Computes line-level unified diff using difflib.unified_diff
  4. Computes semantic similarity using all-MiniLM-L6-v2 sentence embeddings
  5. Detects structural changes (lists, code blocks, length)
  6. Renders output using rich for terminal or exports to JSON

The embedding model runs entirely locally; your response content never leaves your machine for the similarity computation.
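
The similarity score itself is standard sentence-embedding cosine similarity. A minimal sketch with the sentence-transformers library and the model named above; the clamping to [0, 1] mirrors the Metrics table, and the rest of llmdiff's plumbing is omitted:

# Sketch of the semantic similarity step: embed both responses with all-MiniLM-L6-v2
# and take cosine similarity, clamped to [0, 1] as described in the Metrics section.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_similarity(text_a: str, text_b: str) -> float:
    emb_a, emb_b = model.encode([text_a, text_b], convert_to_tensor=True)
    score = util.cos_sim(emb_a, emb_b).item()
    return max(0.0, min(1.0, score))

print(semantic_similarity(
    "Hello! I'm doing well, thank you for asking. How can I assist you today?",
    "Hey! What can I help you with?",
))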


Limitations

LLMs are non-deterministic. Two runs of the same prompt on the same model can produce different outputs, so some of the "changes" you see are noise, not signal. For more reliable comparisons:

  • Use temperature=0.0 where possible
  • Run the same diff multiple times and see which changes persist
  • Focus on summary trends across many test cases rather than on individual results

Contributing

Issues and PRs welcome. If output is hard to read or a metric is unclear, open an issue with a minimal reproduction.


License

MIT
