Skip to main content

A CLI tool for comparing LLM outputs — semantically, visually, and at scale

Project description

llm-diff

Compare LLM outputs — semantically, visually, and at scale.

PyPI CI Coverage Python License Status


Table of Contents


The Problem

LLMs do not produce deterministic output. When you are evaluating models, iterating on prompts, or trying to understand how a model upgrade affects your application, you are left doing this manually:

  • Copy response A into one text editor
  • Copy response B into another
  • Scroll through them side-by-side trying to spot differences
  • Guess whether the meaning actually changed or just the phrasing
  • Repeat for every prompt, every model pair, every iteration

This is time-consuming, error-prone, and does not scale. If you have 20 prompts and just swapped gpt-4o for gpt-4o-mini, you are looking at an hour of tedious comparison with no way to record results or enforce a quality bar.

Teams hit this wall repeatedly:

  • Model evaluation — "Is claude-3-5-sonnet actually better than gpt-4o for our use case?"
  • Prompt engineering — "Did changing the system prompt improve consistency, or just rearrange words?"
  • Regression testing — "Our model provider pushed an update. Did anything break?"
  • A/B testing — "We are running two prompt variants in production. How different are they really?"

There was no tool built for this job.


How llm-diff Solves It

llm-diff is a CLI tool and Python library that automates LLM output comparison from end to end.

You run one command. llm-diff:

  1. Calls both models in parallel via their APIs (OpenAI, Groq, Ollama, LM Studio, or any OpenAI-compatible endpoint)
  2. Diffs the responses word-by-word — every insertion, deletion, and unchanged span is tagged and highlighted
  3. Scores them semantically using sentence embeddings so you know if the meaning diverged even when the words look different
  4. Renders the result as a coloured terminal diff or exports a self-contained HTML report
  5. Scales to batch workloads — run a YAML file of prompts across two models concurrently and get a combined summary report
  6. Caches responses so iterating on thresholds or report settings does not burn API credits
  7. Gates CI pipelines via --fail-under — fail the build if similarity drops below a threshold

What You Get

Benefit Detail
Instant visual diff Word-level colour highlighting in the terminal — no copy-paste, no manual comparison
Objective similarity score A 0-100% number you can track over time, compare across models, and enforce in CI
Semantic awareness Detects when two responses mean the same thing despite different words
Shareable reports Single self-contained HTML file, no CDN, works offline
Batch evaluation Score all your prompts in one run with a summary table
Zero wasted API calls Response cache means re-running after tweaking settings costs nothing
CI integration --fail-under 0.85 turns a model upgrade into a pass/fail regression gate
Vendor agnostic OpenAI, Groq, Mistral, DeepSeek, Ollama, LM Studio — any OpenAI-compatible API
Scriptable Full Python library API with no shell dependency

Installation

Core (word-diff + HTML reports)

pip install llm-diff

With semantic scoring

pip install "llm-diff[semantic]"

The [semantic] extra installs sentence-transformers. The default model (all-MiniLM-L6-v2, ~80 MB) downloads on first use.

Development install

git clone https://github.com/user/llm-diff.git
cd llm-diff
python -m venv .venv
source .venv/bin/activate        # Windows: .venv\Scripts\activate
pip install -e ".[semantic,dev]"

Configuration

llm-diff needs API keys. Provide them as environment variables or in a .llmdiff TOML config file.

Option A — environment variables

export OPENAI_API_KEY="sk-..."
export GROQ_API_KEY="gsk_..."
export ANTHROPIC_API_KEY="sk-ant-..."

Windows (PowerShell):

$env:OPENAI_API_KEY = "sk-..."

Option B — .llmdiff config file

Create .llmdiff in your project root or ~/.llmdiff for global defaults:

[providers.openai]
api_key = "sk-..."          # overrides OPENAI_API_KEY env var

[providers.groq]
api_key = "gsk_..."         # overrides GROQ_API_KEY env var

[defaults]
temperature = 0.7
max_tokens  = 1024
timeout     = 30
save        = false         # auto-save every report to ./diffs/
no_color    = false

All keys are optional. The file is looked up in the current directory first, then ~/.llmdiff, then environment variables as the final fallback.

Provider auto-detection

llm-diff detects the provider from the model name — no extra flags needed:

Model prefix Provider Key env var
gpt-*, o1-*, o3-* OpenAI OPENAI_API_KEY
llama*, mixtral*, gemma* Groq GROQ_API_KEY
mistral*, codestral* Mistral AI MISTRAL_API_KEY
deepseek* DeepSeek DEEPSEEK_API_KEY
claude* via LiteLLM proxy ANTHROPIC_API_KEY
anything else Custom (base_url required)

Local models (Ollama / LM Studio)

ollama pull llama3.2
ollama serve
[providers.custom]
api_key  = "ollama"
base_url = "http://localhost:11434/v1"

Anthropic (via LiteLLM proxy)

pip install litellm
litellm --model anthropic/claude-3-5-sonnet-20241022 --port 4000
[providers.custom]
api_key  = "sk-ant-..."
base_url = "http://localhost:4000/v1"

Usage

All examples below assume OPENAI_API_KEY is set.


1. Compare two models on the same prompt

llm-diff "Explain recursion in one sentence." -a gpt-4o -b gpt-3.5-turbo

Output:

Model A (gpt-4o):
  Recursion is a technique where a function calls itself with a simpler
  version of the problem until a base case is reached.

Model B (gpt-3.5-turbo):
  Recursion is when a function calls itself repeatedly until a base
  condition is met, solving the problem step by step.

Word diff:
  Recursion is [-a technique where a-] {+when a+} function calls itself
  [-with a simpler version of the problem until a base case is reached.-]
  {+repeatedly until a base condition is met, solving the problem step by step.+}

  Word similarity:  61.3%
  Model A tokens:   28    latency: 0.84s
  Model B tokens:   24    latency: 0.61s

[-deleted-] spans are text removed from A. {+inserted+} spans are text added in B. In the terminal these render as red and green respectively.


2. Add semantic similarity scoring

llm-diff "Explain recursion in one sentence." \
  -a gpt-4o -b gpt-3.5-turbo --semantic

Output:

  Word similarity:     61.3%
  Semantic similarity: 92.7%   <- same meaning, different words
  Primary score:       92.7%

A semantic score above ~85% generally means the responses are equivalent in meaning even if the wording differs. Requires pip install "llm-diff[semantic]".


3. Per-paragraph scoring

llm-diff "Write a 3-paragraph summary of transformer architecture." \
  -a gpt-4o -b gpt-4o-mini --paragraph

Output:

  Word similarity:     54.1%
  Semantic similarity: 88.6%

  Paragraph similarities:
    #1  Transformers rely on self-attention to...    94.2%
    #2  The encoder processes input tokens in...     81.3%
    #3  Training uses masked language modelling...   79.8%

Use --paragraph when responses are long enough that a single score does not tell the whole story.


4. Prompt-diff mode

Compare how two different prompts affect the same model:

llm-diff --model gpt-4o \
  --prompt-a "Explain recursion concisely." \
  --prompt-b "Explain recursion with a real-world analogy." \
  --semantic

Or using files:

llm-diff --model gpt-4o --prompt-a v1.txt --prompt-b v2.txt --semantic

Output:

  Comparing prompts on gpt-4o
  Prompt A: Explain recursion concisely.
  Prompt B: Explain recursion with a real-world analogy.

  Word similarity:     38.4%
  Semantic similarity: 74.1%

This is the primary workflow for prompt engineering — see whether your rewrites actually change the output and by how much.


5. Save an HTML report

llm-diff "Explain recursion." -a gpt-4o -b gpt-3.5-turbo \
  --semantic --out report.html

Output:

  Word similarity:     61.3%
  Semantic similarity: 92.7%
  Report saved -> report.html

Open report.html in any browser. The file is fully self-contained — no server, no internet, no CDN dependencies.

To auto-save every run to ./diffs/ without specifying --out each time:

llm-diff "Explain recursion." -a gpt-4o -b gpt-3.5-turbo --save

Or set save = true in your .llmdiff config.


6. JSON output

llm-diff "Hello." -a gpt-4o -b gpt-3.5-turbo --json

Output:

{
  "prompt_a": "Hello.",
  "prompt_b": "Hello.",
  "model_a": "gpt-4o",
  "model_b": "gpt-3.5-turbo",
  "response_a": "Hello! How can I assist you today?",
  "response_b": "Hello! How can I help you?",
  "diff": [
    {"type": "equal",  "text": "Hello! How can I "},
    {"type": "delete", "text": "assist you today"},
    {"type": "insert", "text": "help you"},
    {"type": "equal",  "text": "?"}
  ],
  "word_similarity": 0.714,
  "semantic_score": null,
  "tokens_a": {"prompt": 2, "completion": 9, "total": 11},
  "tokens_b": {"prompt": 2, "completion": 7, "total": 9},
  "latency_a_ms": 812,
  "latency_b_ms": 594
}

Extract a single field with jq:

llm-diff "Hello." -a gpt-4o -b gpt-3.5-turbo --json | jq '.word_similarity'
# 0.714

7. Batch mode

Evaluate a set of prompts across two models in one command.

Create prompts.yml:

prompts:
  - id: explain-recursion
    text: "Explain recursion in one sentence."

  - id: code-review
    text: "Review this Python function for readability: {input}"
    inputs:
      - examples/func.py

  - id: summarise
    text: "Summarise the following: {input}"
    inputs:
      - examples/article1.txt
      - examples/article2.txt
  • id — label shown in the terminal and HTML report
  • text — prompt template; use {input} as a file placeholder
  • inputs — list of files relative to the YAML; one batch item per file

Run the batch:

llm-diff --batch prompts.yml -a gpt-4o -b gpt-3.5-turbo \
  --semantic --out batch_report.html

Output:

  Fetching responses for 4 prompt(s) (concurrency=4)...

  explain-recursion          word: 61.3%   semantic: 92.7%
  code-review/func.py        word: 74.8%   semantic: 89.2%
  summarise/article1.txt     word: 48.3%   semantic: 83.1%
  summarise/article2.txt     word: 51.0%   semantic: 85.5%

  Average word similarity:      58.9%
  Average semantic similarity:  87.6%
  Report saved -> batch_report.html

By default up to 4 API calls run in parallel. Override with --concurrency:

llm-diff --batch prompts.yml -a gpt-4o -b gpt-3.5-turbo \
  --concurrency 8 --out batch_report.html

A working example is in examples/prompts.yml.


8. CI gate with --fail-under

Fail the command if similarity drops below a threshold:

llm-diff "Summarise the changelog." \
  -a gpt-4o -b gpt-4o-mini \
  --semantic --fail-under 0.85

Passes (exit code 0):

  Semantic similarity: 91.3%   threshold: 85.0%   PASS

Fails (exit code 1):

  Semantic similarity: 72.4%   threshold: 85.0%   FAIL
Error: similarity 0.724 < threshold 0.850

In batch mode, exit 1 is triggered if any single prompt drops below the threshold:

llm-diff --batch prompts.yml -a gpt-4o -b gpt-4o-mini \
  --semantic --fail-under 0.80

Output:

  explain-recursion        87.2%   PASS
  code-review/func.py      74.1%   FAIL  <- below 80%
  summarise/article1.txt   83.6%   PASS

Error: 1 prompt(s) failed the similarity threshold (0.80)

9. Result caching

By default llm-diff caches responses to ~/.cache/llm-diff/ keyed on (model, prompt, temperature, max_tokens). Re-running the same diff is instant and free:

# First run — calls the API
llm-diff "Explain recursion." -a gpt-4o -b gpt-3.5-turbo --semantic
# latency: 0.84s / 0.61s

# Second run — served from cache
llm-diff "Explain recursion." -a gpt-4o -b gpt-3.5-turbo --semantic
# latency: 0.00s / 0.00s  (cached)

Caching is most useful when iterating on thresholds, scoring options, or report formatting without re-paying for the same API calls.

To bypass the cache and always call the API:

llm-diff "Explain recursion." -a gpt-4o -b gpt-3.5-turbo --no-cache

10. Verbose metadata

llm-diff "Hello." -a gpt-4o -b gpt-3.5-turbo --verbose

Output:

  [gpt-4o]
    provider:          openai
    prompt_tokens:     2
    completion_tokens: 9
    total_tokens:      11
    latency:           812ms

  [gpt-3.5-turbo]
    provider:          openai
    prompt_tokens:     2
    completion_tokens: 7
    total_tokens:      9
    latency:           594ms

Full CLI Reference

Usage: llm-diff [OPTIONS] [PROMPT]

  Compare two LLM responses — semantically, visually, and at scale.

Arguments:
  PROMPT               Prompt text sent to both models.

Model selection:
  -a, --model-a MODEL  Model for side A  (e.g. gpt-4o).
  -b, --model-b MODEL  Model for side B  (e.g. gpt-4o-mini).
  -m, --model   MODEL  Same model for both sides (prompt-diff mode).

Prompt sources:
  --prompt-a PATH      Path to a text file used as prompt for model A.
  --prompt-b PATH      Path to a text file used as prompt for model B.
  --batch    PATH      Path to a prompts.yml file for batch comparison.

Scoring:
  -s, --semantic       Compute embedding-based cosine similarity score.
  -p, --paragraph      Per-paragraph similarity (implies --semantic).

Output:
  -j, --json           Output raw JSON to stdout.
  -o, --out PATH       Save HTML report to this path.
  --save               Auto-save HTML report to ./diffs/.

API settings:
  -t, --temperature FLOAT   Temperature for both models  (default 0.7).
  --max-tokens INT          Max tokens per response      (default 1024).
  --timeout SECS            Request timeout in seconds   (default 30).

Quality gate:
  --fail-under FLOAT   Exit 1 if primary score < threshold.

Performance:
  --concurrency INT    Max parallel API calls in batch mode (default 4).
  --no-cache           Skip cache; always call the API.

Display:
  --no-color           Disable terminal colour output.
  -v, --verbose        Show full API metadata per request.

  --version            Show version and exit.
  -h, --help           Show this message and exit.

Programmatic API

Use llm-diff as a Python library:

import asyncio
from llm_diff import compare

report = asyncio.run(
    compare("Explain recursion.", model_a="gpt-4o", model_b="gpt-3.5-turbo")
)

print(f"Word similarity:   {report.word_similarity:.2%}")
print(f"Response A tokens: {report.comparison.response_a.total_tokens}")

With semantic scoring

report = asyncio.run(
    compare(
        "Explain recursion.",
        model_a="gpt-4o",
        model_b="gpt-3.5-turbo",
        semantic=True,
    )
)
print(f"Semantic similarity: {report.semantic_score:.2%}")
print(f"Primary score:       {report.primary_score:.2%}")

Prompt-diff (same model, two prompts)

from llm_diff import compare_prompts

report = asyncio.run(
    compare_prompts(
        "Explain recursion concisely.",
        "Explain recursion with a real-world analogy.",
        model="gpt-4o",
        semantic=True,
    )
)
print(f"Similarity: {report.primary_score:.2%}")

Batch

from llm_diff import compare_batch

reports = asyncio.run(
    compare_batch(
        "examples/prompts.yml",
        model_a="gpt-4o",
        model_b="gpt-3.5-turbo",
        semantic=True,
    )
)
for r in reports:
    print(f"{r.comparison.response_a.model}: {r.word_similarity:.2%}")

ComparisonReport fields

Field Type Description
prompt_a str Prompt sent to model A
prompt_b str Prompt sent to model B
comparison ComparisonResult Raw paired model responses
diff_result DiffResult Word-level diff chunks + similarity
semantic_score float or None Whole-text cosine similarity (0-1)
paragraph_scores list or None Per-paragraph similarity scores
html_report str or None Self-contained HTML (when build_html=True)
word_similarity float Property — diff_result.similarity
primary_score float Property — semantic if set, else word

CI/CD Integration

# .github/workflows/llm-check.yml
name: LLM regression check

on: [push, pull_request]

jobs:
  llm-diff:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install llm-diff
        run: pip install "llm-diff[semantic]"

      - name: Run similarity check
        run: |
          llm-diff --batch prompts.yml \
            -a gpt-4o -b gpt-4o-mini \
            --semantic --fail-under 0.85 \
            --out diff_report.html
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

      - name: Upload report
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: diff-report
          path: diff_report.html

This makes model upgrades and prompt changes reviewable, auditable, and enforceable — the same discipline applied to code, now applied to LLM output.


HTML Reports

Pass --out report.html to generate a fully self-contained HTML report. No CDN dependencies — works offline in air-gapped environments.

Single-diff reports include:

  • Side-by-side responses with coloured word-diff
  • Word similarity score and optional semantic score
  • Optional per-paragraph similarity table
  • Token usage and latency metadata

Batch reports add:

  • Summary table with all prompts, scores, and pass/fail highlighting
  • Per-item expandable diff cards
  • Average similarity across the batch

Contributing

See CONTRIBUTING.md for dev setup, testing instructions, and PR guidelines.


License

MIT — see LICENSE.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

llm_diff-1.0.0.tar.gz (40.3 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

llm_diff-1.0.0-py3-none-any.whl (46.6 kB view details)

Uploaded Python 3

File details

Details for the file llm_diff-1.0.0.tar.gz.

File metadata

  • Download URL: llm_diff-1.0.0.tar.gz
  • Upload date:
  • Size: 40.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.9

File hashes

Hashes for llm_diff-1.0.0.tar.gz
Algorithm Hash digest
SHA256 dbb62343773f3960c791abaea4ad8dbb40831329d5ef8a755c96090cf7a327dd
MD5 94640bb6dadd7466770d04fce0188287
BLAKE2b-256 9a2c9c51db532c0c509dc335b09906a0abeced7f68ea8d8e2a3a099229e752bd

See more details on using hashes here.

File details

Details for the file llm_diff-1.0.0-py3-none-any.whl.

File metadata

  • Download URL: llm_diff-1.0.0-py3-none-any.whl
  • Upload date:
  • Size: 46.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.9

File hashes

Hashes for llm_diff-1.0.0-py3-none-any.whl
Algorithm Hash digest
SHA256 6aa4402464dd8bc8b040fc1be618cde31d276e8fd28b6a8e4ba5ed142bd05dd1
MD5 87f2b4381f1db5d8fb5a82b547f68489
BLAKE2b-256 581c2c4063078b284bb3f588a69038d9c519ca7bb90567de0d224f3193782d24

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page