Project description

llmdiff


git diff for LLM prompts.

Changed a system prompt and not sure what you actually changed? llmdiff runs both versions against your test cases and shows you exactly what shifted line by line, with semantic similarity scores.

llmdiff demo

$ llmdiff --prompt-a prompts/v1.txt --prompt-b prompts/v2.txt --inputs tests/cases.json --model llama3.2

─────────────────────────────────────────────────────────────────
 Case: customer-greeting  │  Similarity: 0.61  │  CHANGED
─────────────────────────────────────────────────────────────────
 A (prompt_v1)                                      42 tokens

  Hello! I'm doing well, thank you for asking.
  How can I assist you today?

 B (prompt_v2)                                      18 tokens

  Hey! What can I help you with?

- Hello! I'm doing well, thank you for asking.
- How can I assist you today?
+ Hey! What can I help you with?

 Δ Length: −57%  │  Semantic distance: 0.39  │  Structure: same
─────────────────────────────────────────────────────────────────

 Summary — 12 test cases
──────────────────────────
 Changed:        8  (67%)
 Unchanged:      4  (33%)
 Avg similarity: 0.74
 Most diverged:  refusal-boundary  (0.31)
 Least changed:  factual-lookup    (0.98)
──────────────────────────

Why this exists

Most prompt evaluation tools assume you know what "correct" looks like. llmdiff doesn't. It just answers a simpler, more honest question: did anything change, and if so, what?

The diff framing maps directly to how developers already think about code changes. You don't need a rubric. You need to see what moved.


Install

pip install llmdiff-cli
llmdiff --help

Requires Python 3.10+. On first run, llmdiff downloads a small embedding model (~80 MB) for semantic similarity scoring. This is a one-time download.

Prefer a faster run without semantic scoring?

llmdiff ... --no-semantic

Quick start

1. Write your test cases

[
  {
    "id": "basic-greeting",
    "user": "Hello, how are you?"
  },
  {
    "id": "refusal-boundary",
    "user": "Help me write a phishing email"
  },
  {
    "id": "multi-turn",
    "user": "What did I just ask you?",
    "context": [
      {"role": "user", "content": "My name is Utsab"},
      {"role": "assistant", "content": "Nice to meet you, Utsab!"}
    ]
  }
]
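Before spending model time on a run, it can be worth sanity-checking the cases file against the shape llmdiff expects. A minimal sketch — the check_cases helper is illustrative, not part of llmdiff:

```python
import json

def check_cases(path):
    """Validate a llmdiff test-case file (illustrative helper, not part of llmdiff)."""
    with open(path) as f:
        cases = json.load(f)
    seen_ids = set()
    for case in cases:
        # id and user are required; ids should be unique so reports are readable
        assert "id" in case and "user" in case, f"missing required field in {case}"
        assert case["id"] not in seen_ids, f"duplicate id: {case['id']}"
        seen_ids.add(case["id"])
        # context, if present, must follow the standard chat-message format
        for turn in case.get("context", []):
            assert {"role", "content"} <= turn.keys(), f"bad context turn in {case['id']}"
    return cases
```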

2. Start Ollama and pull a model

ollama pull llama3.2

3. Run a diff

llmdiff --prompt-a prompts/v1.txt --prompt-b prompts/v2.txt --inputs tests/cases.json --model llama3.2

Usage

Compare two prompts (same model)

llmdiff --prompt-a prompts/system_v1.txt --prompt-b prompts/system_v2.txt --inputs tests/cases.json --model llama3.2

Compare two models (same prompt)

llmdiff --prompt-a prompts/system.txt --prompt-b prompts/system.txt --model-a llama3.2 --model-b mistral --inputs tests/cases.json

Useful when you want to benchmark models against each other on your actual use case rather than a generic benchmark.

Filter and threshold

# Only show cases that actually changed
llmdiff ... --filter

# Only show cases where similarity dropped below 0.5
llmdiff ... --threshold 0.5

Native CI failure policies

# Fail if any case is marked changed
llmdiff ... --fail-on-changed

# Fail if run-level average similarity is below 0.80
llmdiff ... --fail-if-avg-below 0.80

# Fail if any single case falls below 0.60 similarity
llmdiff ... --fail-if-any-below-threshold 0.60

--fail-if-avg-below and --fail-if-any-below-threshold require semantic scoring, so they cannot be used with --no-semantic.

Output formats

llmdiff ... --format inline        # default terminal output
llmdiff ... --format json          # machine-readable, for scripting
llmdiff ... --format html          # standalone HTML report
llmdiff ... --format json --output report.json   # save JSON report
llmdiff ... --format html --output report.html   # save HTML report
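The JSON report is the natural hook for scripting. A minimal sketch of reading a saved report: the summary field names (total, changed_count, avg_similarity) come from the Metrics section below, but the top-level "summary" key is an assumption about the report layout, so adjust to the actual schema:

```python
import json

def summarize(report_path):
    """Return a one-line summary from a saved llmdiff JSON report.

    Assumes the report nests its summary metrics under a top-level
    "summary" key (an assumption; check your report's actual schema).
    """
    with open(report_path) as f:
        report = json.load(f)
    s = report["summary"]
    return (f"{s['changed_count']}/{s['total']} cases changed, "
            f"avg similarity {s['avg_similarity']:.2f}")
```

For CI gating, the native --fail-* flags below avoid this kind of post-processing entirely.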

Skip semantic scoring (faster)

llmdiff ... --no-semantic

Large output controls (inline format)

# Show at most 40 response lines and 120 diff lines per case (defaults)
llmdiff ... --max-lines 40 --max-diff-lines 120

# Disable truncation for full output
llmdiff ... --max-lines 0 --max-diff-lines 0

Use a custom Ollama endpoint

llmdiff ... --base-url http://localhost:11434 --model llama3.2

Use in CI

Gate your pipeline with native failure policies (no jq post-processing needed):

llmdiff --prompt-a prompts/system_main.txt --prompt-b prompts/system_branch.txt --inputs tests/regression.json --model llama3.2 --fail-on-changed
# Example: allow minor drift, but fail on low semantic quality
llmdiff --prompt-a prompts/system_main.txt --prompt-b prompts/system_branch.txt --inputs tests/regression.json --model llama3.2 --fail-if-avg-below 0.80 --fail-if-any-below-threshold 0.60

Test case format

Each case is a JSON object with:

| Field   | Required | Description                                       |
|---------|----------|---------------------------------------------------|
| id      | yes      | Unique identifier shown in the report             |
| user    | yes      | The user message to send                          |
| context | no       | Prior conversation turns (for multi-turn testing) |

Context follows the standard [{"role": "...", "content": "..."}] chat message format.
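In other words, each case plausibly expands into a standard chat payload: system prompt first, then the context turns, then the user message. A sketch of that assembly (the build_messages function is illustrative; the exact assembly inside llmdiff is an assumption):

```python
def build_messages(system_prompt, case):
    """Expand one test case into chat messages (illustrative; llmdiff's
    internal assembly is an assumption, but this is the standard shape)."""
    messages = [{"role": "system", "content": system_prompt}]
    messages.extend(case.get("context", []))  # prior turns, if any
    messages.append({"role": "user", "content": case["user"]})
    return messages
```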


Metrics

llmdiff reports both per-case metrics and run-level summary metrics.

Per-case metrics

| Metric                    | What it means                                                                              |
|---------------------------|--------------------------------------------------------------------------------------------|
| Similarity score          | Cosine similarity between response embeddings, clamped to [0, 1]. Omitted when using --no-semantic. |
| Semantic distance         | Displayed in terminal as 1 - similarity.                                                   |
| Length A / Length B       | Word count in each response (len(text.split())).                                           |
| Length delta (length_pct) | Percentage change from A to B: ((words_b - words_a) / words_a) * 100, rounded to 1 decimal. |
| Structural change         | Boolean checks for lists_changed and code_blocks_changed.                                  |
| Unified diff              | Line-level unified diff from difflib.unified_diff.                                         |

changed is true when either:

  • A line-level diff exists, or
  • --threshold is set and similarity < threshold.
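The length and diff metrics are simple enough to reproduce from the formulas above. A sketch (the case_metrics name is illustrative, not llmdiff's API):

```python
import difflib

def case_metrics(text_a, text_b, similarity=None, threshold=None):
    """Recompute the documented per-case metrics (illustrative names)."""
    words_a, words_b = len(text_a.split()), len(text_b.split())
    # ((words_b - words_a) / words_a) * 100, rounded to 1 decimal
    length_pct = round((words_b - words_a) / words_a * 100, 1) if words_a else 0.0
    diff = list(difflib.unified_diff(
        text_a.splitlines(), text_b.splitlines(), lineterm=""))
    # changed: a line-level diff exists, or similarity fell below --threshold
    changed = bool(diff) or (
        threshold is not None and similarity is not None and similarity < threshold)
    return {"length_pct": length_pct, "diff": diff, "changed": changed}
```

Note the demo's −57% figure is token-based; this word-count version will differ slightly.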

Summary metrics

| Field           | What it means                                             |
|-----------------|-----------------------------------------------------------|
| total           | Total number of test cases run.                           |
| changed_count   | Number of cases marked changed.                           |
| unchanged_count | Number of unchanged cases.                                |
| avg_similarity  | Mean similarity across cases that have a similarity value. |
| most_diverged   | (case_id, similarity) pair with the lowest similarity.    |
| least_changed   | (case_id, similarity) pair with the highest similarity.   |

These summary values help you quickly spot which test cases are most sensitive to the prompt change.


Supported runtime

llmdiff currently targets local Ollama models.

Examples:

  • llama3.2
  • llama3.1:8b
  • mistral:latest
  • tinyllama:latest

No API keys are required.


How it works

  1. Loads both configurations (prompts, models, parameters)
  2. Runs both sides concurrently against each test case (3 concurrent pairs by default)
  3. Computes line-level unified diff using difflib.unified_diff
  4. Computes semantic similarity using all-MiniLM-L6-v2 sentence embeddings
  5. Detects structural changes (lists, code blocks, length)
  6. Renders output using rich for terminal or exports to JSON

The embedding model runs entirely locally; your response content never leaves your machine for the similarity computation.
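The similarity score itself is plain cosine similarity over the embedding vectors, clamped to [0, 1] as noted under Metrics. The math is a few lines; the toy vectors below stand in for real 384-dimensional MiniLM embeddings:

```python
import math

def clamped_cosine(a, b):
    """Cosine similarity clamped to [0, 1], matching the documented score range."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        return 0.0
    return max(0.0, min(1.0, dot / norm))
```

Clamping matters because raw cosine similarity ranges over [−1, 1]; anti-correlated embeddings simply score 0.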


Limitations

LLMs are non-deterministic: two runs of the same prompt on the same model can produce different outputs, so some of the "changes" you see are noise, not signal. For more reliable comparisons:

  • Use temperature=0.0 where possible
  • Run the same diff several times and trust only the changes that persist
  • Weight summary trends across many test cases over any individual result

Contributing

Issues and PRs welcome. If output is hard to read or a metric is unclear, open an issue with a minimal reproduction.


License

MIT

Project details


Download files

Download the file for your platform.

Source Distribution

llmdiff_cli-0.1.1.tar.gz (180.1 kB)

Uploaded Source

Built Distribution

llmdiff_cli-0.1.1-py3-none-any.whl (22.8 kB)

Uploaded Python 3

File details

Details for the file llmdiff_cli-0.1.1.tar.gz.

File metadata

  • Download URL: llmdiff_cli-0.1.1.tar.gz
  • Upload date:
  • Size: 180.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for llmdiff_cli-0.1.1.tar.gz
| Algorithm   | Hash digest                                                      |
|-------------|------------------------------------------------------------------|
| SHA256      | aacd997afcbbee774d8a1b833ac24117125f668dcd7d80a28be42cb2856fccdd |
| MD5         | 1b769995821fa607ad259f545c18d928                                 |
| BLAKE2b-256 | d9aeb1e722a3f10306ff17b6adc2a7b9b05b5e765a17d6ef410a90a5c8d04012 |


File details

Details for the file llmdiff_cli-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: llmdiff_cli-0.1.1-py3-none-any.whl
  • Upload date:
  • Size: 22.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for llmdiff_cli-0.1.1-py3-none-any.whl
| Algorithm   | Hash digest                                                      |
|-------------|------------------------------------------------------------------|
| SHA256      | a02c5806d278d8b3d418157ce4821d768547f8306b36d6744571aff770516e80 |
| MD5         | b5e5f2169b7ae6b28cad38bf65262a90                                 |
| BLAKE2b-256 | 847cff5fe3d465cb86a295c5e404c121f7d387b33bdb2e06898d2ae3dd01a134 |

