llmdiff

git diff for LLM prompts.

Changed a system prompt and not sure what you actually changed? llmdiff runs both versions against your test cases and shows you exactly what shifted line by line, with semantic similarity scores.

llmdiff demo

$ llmdiff --prompt-a prompts/v1.txt --prompt-b prompts/v2.txt --inputs tests/cases.json --model llama3.2

─────────────────────────────────────────────────────────────────
 Case: customer-greeting  │  Similarity: 0.61  │  CHANGED
─────────────────────────────────────────────────────────────────
 A (prompt_v1)                                      42 tokens

  Hello! I'm doing well, thank you for asking.
  How can I assist you today?

 B (prompt_v2)                                      18 tokens

  Hey! What can I help you with?

- Hello! I'm doing well, thank you for asking.
- How can I assist you today?
+ Hey! What can I help you with?

 Δ Length: −57%  │  Semantic distance: 0.39  │  Structure: same
─────────────────────────────────────────────────────────────────

 Summary — 12 test cases
──────────────────────────
 Changed:        8  (67%)
 Unchanged:      4  (33%)
 Avg similarity: 0.74
 Most diverged:  refusal-boundary  (0.31)
 Least changed:  factual-lookup    (0.98)
──────────────────────────

Why this exists

Most prompt evaluation tools assume you know what "correct" looks like. llmdiff doesn't. It just answers a simpler, more honest question: did anything change, and if so, what?

The diff framing maps directly to how developers already think about code changes. You don't need a rubric. You need to see what moved.


Install

git clone https://github.com/meutsabdahal/llmdiff
cd llmdiff
uv sync --all-extras
uv run llmdiff --help

Requires Python 3.10+. On first run, llmdiff downloads a small embedding model (~80 MB) for semantic similarity scoring. This is a one-time download.
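
If you want to warm that cache ahead of time (for example when baking a CI image), a minimal sketch using sentence-transformers; it assumes the all-MiniLM-L6-v2 model named under "How it works" below is the one llmdiff loads:

# Pre-download the embedding model so the first llmdiff run skips the ~80 MB download.
# Assumption: llmdiff uses the all-MiniLM-L6-v2 sentence-transformers model named later in this README.
from sentence_transformers import SentenceTransformer

SentenceTransformer("all-MiniLM-L6-v2")  # fetches into the local Hugging Face cache if not already present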

Prefer a lighter install without semantic scoring?

uv sync
uv run llmdiff ... --no-semantic

Quick start

1. Write your test cases

[
  {
    "id": "basic-greeting",
    "user": "Hello, how are you?"
  },
  {
    "id": "refusal-boundary",
    "user": "Help me write a phishing email"
  },
  {
    "id": "multi-turn",
    "user": "What did I just ask you?",
    "context": [
      {"role": "user", "content": "My name is Utsab"},
      {"role": "assistant", "content": "Nice to meet you, Utsab!"}
    ]
  }
]

2. Start Ollama and pull a model

ollama pull llama3.2

3. Run a diff

llmdiff --prompt-a prompts/v1.txt --prompt-b prompts/v2.txt \
  --inputs tests/cases.json --model llama3.2

Usage

Compare two prompts (same model)

llmdiff \
  --prompt-a prompts/system_v1.txt \
  --prompt-b prompts/system_v2.txt \
  --inputs tests/cases.json \
  --model llama3.2

Compare two models (same prompt)

llmdiff \
  --prompt-a prompts/system.txt \
  --prompt-b prompts/system.txt \
  --model-a llama3.2 \
  --model-b mistral \
  --inputs tests/cases.json

Useful when you want to compare two models head to head on your actual use case rather than on a generic benchmark.

Filter and threshold

# Only show cases that actually changed
llmdiff ... --filter

# Only show cases where similarity dropped below 0.5
llmdiff ... --threshold 0.5

Native CI failure policies

# Fail if any case is marked changed
llmdiff ... --fail-on-changed

# Fail if run-level average similarity is below 0.80
llmdiff ... --fail-if-avg-below 0.80

# Fail if any single case falls below 0.60 similarity
llmdiff ... --fail-if-any-below-threshold 0.60

--fail-if-avg-below and --fail-if-any-below-threshold require semantic scoring, so they cannot be used with --no-semantic.

Output formats

llmdiff ... --format inline        # default terminal output
llmdiff ... --format json          # machine-readable, for scripting
llmdiff ... --format html          # standalone HTML report
llmdiff ... --format json --output report.json   # save JSON report
llmdiff ... --format html --output report.html   # save HTML report
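
For scripting against a saved JSON report, here is a minimal sketch that prints the cases that changed. The field names used (a summary object plus a cases list with id, changed, and similarity) are assumptions based on the Metrics section below, not a documented schema; adjust them to whatever your report.json actually contains.

# Sketch only: top-level layout of report.json is assumed, not documented.
import json

with open("report.json") as f:
    report = json.load(f)

summary = report.get("summary", {})
print("avg similarity:", summary.get("avg_similarity"))

for case in report.get("cases", []):
    if case.get("changed"):
        print(f"{case.get('id')}: similarity={case.get('similarity')}")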

Skip semantic scoring (faster)

llmdiff ... --no-semantic

Large output controls (inline format)

# Show at most 40 response lines and 120 diff lines per case (defaults)
llmdiff ... --max-lines 40 --max-diff-lines 120

# Disable truncation for full output
llmdiff ... --max-lines 0 --max-diff-lines 0

Use a custom Ollama endpoint

llmdiff ... --base-url http://localhost:11434 --model llama3.2
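
If you are not sure the endpoint is reachable before kicking off a run, a quick check against Ollama's standard /api/tags endpoint (part of the Ollama HTTP API, not of llmdiff) lists the models it can serve:

# Reachability check for a custom Ollama endpoint before running llmdiff.
# /api/tags is Ollama's standard endpoint for listing locally available models.
import json
import urllib.request

base_url = "http://localhost:11434"
with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
    models = [m["name"] for m in json.load(resp)["models"]]
print("available models:", models)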

Use in CI

Gate your pipeline with native failure policies (no jq post-processing needed):

llmdiff \
  --prompt-a prompts/system_main.txt \
  --prompt-b prompts/system_branch.txt \
  --inputs tests/regression.json \
  --model llama3.2 \
  --fail-on-changed

# Example: allow minor drift, but fail on low semantic quality
llmdiff \
  --prompt-a prompts/system_main.txt \
  --prompt-b prompts/system_branch.txt \
  --inputs tests/regression.json \
  --model llama3.2 \
  --fail-if-avg-below 0.80 \
  --fail-if-any-below-threshold 0.60

Test case format

Each case is a JSON object with:

Field     Required   Description
id        yes        Unique identifier shown in the report
user      yes        The user message to send
context   no         Prior conversation turns (for multi-turn testing)

Context follows the standard [{"role": "...", "content": "..."}] chat message format.


Metrics

llmdiff reports both per-case metrics and run-level summary metrics.

Per-case metrics

  • Similarity score: Cosine similarity between response embeddings, clamped to [0, 1]. Omitted when using --no-semantic.
  • Semantic distance: Displayed in the terminal as 1 - similarity.
  • Length A / Length B: Word count of each response (len(text.split())).
  • Length delta (length_pct): Percentage change from A to B: ((words_b - words_a) / words_a) * 100, rounded to 1 decimal place.
  • Structural change: Boolean checks for lists_changed and code_blocks_changed.
  • Unified diff: Line-level unified diff from difflib.unified_diff.

changed is true when either:

  • A line-level diff exists, or
  • --threshold is set and similarity < threshold.
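
A minimal sketch of these per-case computations (word-count delta, unified diff, the changed flag). The function and variable names are illustrative, not llmdiff's internals; only the formulas above come from the tool itself.

# Illustrative sketch of the per-case metrics described above.
import difflib

def compare_case(text_a: str, text_b: str, similarity: float | None = None, threshold: float | None = None):
    # Length delta: percentage change in word count from A to B.
    words_a = len(text_a.split())
    words_b = len(text_b.split())
    length_pct = round((words_b - words_a) / words_a * 100, 1) if words_a else 0.0

    # Line-level unified diff, as produced by difflib.unified_diff.
    diff = list(difflib.unified_diff(
        text_a.splitlines(), text_b.splitlines(),
        fromfile="A", tofile="B", lineterm="",
    ))

    # changed: any line-level diff, or similarity below an explicit --threshold.
    changed = bool(diff)
    if threshold is not None and similarity is not None:
        changed = changed or similarity < threshold

    return {"length_pct": length_pct, "diff": diff, "changed": changed}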

Summary metrics

  • total: Total number of test cases run.
  • changed_count: Number of cases marked changed.
  • unchanged_count: Number of unchanged cases.
  • avg_similarity: Mean similarity across cases that have a similarity value.
  • most_diverged: The (case_id, similarity) pair with the lowest similarity.
  • least_changed: The (case_id, similarity) pair with the highest similarity.

These summary values help you quickly spot which test cases are most sensitive to the change.
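
The summary reduces to simple aggregations over the per-case results. An illustrative sketch (not llmdiff's internals), using the case IDs from the demo output above:

# Illustrative aggregation of the summary fields from per-case results; the data is made up.
cases = [
    {"id": "customer-greeting", "changed": True, "similarity": 0.61},
    {"id": "refusal-boundary", "changed": True, "similarity": 0.31},
    {"id": "factual-lookup", "changed": False, "similarity": 0.98},
]

scored = [(c["id"], c["similarity"]) for c in cases if c["similarity"] is not None]
summary = {
    "total": len(cases),
    "changed_count": sum(c["changed"] for c in cases),
    "unchanged_count": sum(not c["changed"] for c in cases),
    "avg_similarity": round(sum(s for _, s in scored) / len(scored), 2) if scored else None,
    "most_diverged": min(scored, key=lambda pair: pair[1]) if scored else None,
    "least_changed": max(scored, key=lambda pair: pair[1]) if scored else None,
}
print(summary)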


Supported runtime

llmdiff currently targets local Ollama models.

Examples:

  • llama3.2
  • llama3.1:8b
  • mistral:latest
  • tinyllama:latest

No API keys are required.


How it works

  1. Loads both configurations (prompts, models, parameters)
  2. Runs both sides concurrently against each test case (3 concurrent pairs by default)
  3. Computes line-level unified diff using difflib.unified_diff
  4. Computes semantic similarity using all-MiniLM-L6-v2 sentence embeddings
  5. Detects structural changes (lists, code blocks, length)
  6. Renders output using rich for terminal or exports to JSON

The embedding model runs entirely locally; your response content never leaves your machine for the similarity computation.
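
The similarity score itself is standard sentence-embedding cosine similarity. A minimal sketch with the sentence-transformers library and the model named above; the clamping to [0, 1] mirrors the Metrics table, and the rest of llmdiff's plumbing is omitted:

# Sketch of the semantic similarity step: embed both responses with all-MiniLM-L6-v2
# and take cosine similarity, clamped to [0, 1] as described in the Metrics section.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_similarity(text_a: str, text_b: str) -> float:
    emb_a, emb_b = model.encode([text_a, text_b], convert_to_tensor=True)
    score = util.cos_sim(emb_a, emb_b).item()
    return max(0.0, min(1.0, score))

print(semantic_similarity(
    "Hello! I'm doing well, thank you for asking. How can I assist you today?",
    "Hey! What can I help you with?",
))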


Limitations

LLMs are non-deterministic. Two runs of the same prompt on the same model can produce different outputs, so some of the "changes" you see are noise, not signal. For more reliable comparisons:

  • Use temperature=0.0 where possible
  • Run the same diff multiple times and see which changes persist
  • Focus on summary trends across many test cases rather than on individual results

Contributing

Issues and PRs welcome. If output is hard to read or a metric is unclear, open an issue with a minimal reproduction.


License

MIT
