llm-diff

A CLI tool and Python library for comparing LLM outputs — semantically, visually, and at scale.


llm-diff calls two LLMs in parallel, diffs their responses word by word, scores them semantically, and renders the results in the terminal or as a self-contained HTML report. It scales to batch workloads, caches API responses, gates CI pipelines via --fail-under, and emits structured llm-toolkit-schema events for observability tooling.

What is llm-diff?

LLMs do not produce deterministic output. Evaluating models, iterating on prompts, or assessing the impact of a model upgrade all require you to compare responses — and doing that by hand does not scale.

llm-diff automates the entire workflow: it calls both models concurrently, produces a word-level diff, optionally scores semantic similarity via sentence embeddings, and outputs results to the terminal or as a shareable HTML report. It supports batch workloads from a YAML file and caches API responses, so re-running a comparison with a different threshold incurs no additional API cost. It exits with code 1 when similarity falls below a threshold, which lets it gate CI/CD pipelines directly.
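To make the word-level diff and the threshold gate concrete, here is a minimal sketch built on Python's standard difflib module. It is an illustration of the idea only, not llm-diff's implementation: the function names word_diff and exit_code are hypothetical, and the similarity value is assumed to come from a separate scoring step.

```python
import difflib


def word_diff(text_a: str, text_b: str) -> list[tuple[str, str]]:
    """Return (tag, word) pairs: ' ' unchanged, '-' only in A, '+' only in B.

    Hypothetical helper for illustration; not part of llm-diff's API.
    """
    words_a = text_a.split()
    words_b = text_b.split()
    matcher = difflib.SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    result: list[tuple[str, str]] = []
    for op, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        if op == "equal":
            result.extend((" ", word) for word in words_a[a_start:a_end])
        else:
            # 'replace', 'delete', and 'insert' all reduce to removals
            # from A followed by additions from B.
            result.extend(("-", word) for word in words_a[a_start:a_end])
            result.extend(("+", word) for word in words_b[b_start:b_end])
    return result


def exit_code(similarity: float, fail_under: float) -> int:
    """Return 1 when similarity falls below the threshold, else 0."""
    return 1 if similarity < fail_under else 0


if __name__ == "__main__":
    response_a = "Recursion is a function calling itself"
    response_b = "Recursion is a function that calls itself"
    for tag, word in word_diff(response_a, response_b):
        print(tag, word)
```

Splitting on whitespace keeps the sketch short; a real tool would likely handle punctuation and whitespace more carefully.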

Version 1.2 adds LLM-as-a-Judge scoring, per-call USD cost tracking, multi-model (3–4 model) comparison, and structured JSON diff.

Version 1.2.2 integrates llm-toolkit-schema as a built-in observability layer: every comparison, model call, cache lookup, cost record, judge evaluation, and --fail-under regression failure now emits a validated schema event that can be collected in memory, exported to JSONL, or forwarded to any custom backend.

Version 1.2.3 adds EVAL_REGRESSION_FAILED schema event emission: --fail-under gate failures now emit a structured llm.eval.regression.failed event (via make_eval_regression_event()) in addition to returning exit code 1, providing a full audit trail for CI regression gates.
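The following sketch shows the general shape of this pattern: a gate that records a regression event in memory when similarity falls below the threshold, and an exporter that writes collected events as JSONL. Only the event name llm.eval.regression.failed comes from the text above. The helper names (sketch_event, gate, export_jsonl) and the envelope fields are assumptions made for illustration and do not reflect the actual llm-toolkit-schema event format or API.

```python
import io
import json
from datetime import datetime, timezone


def sketch_event(event_type: str, payload: dict) -> dict:
    """Build a hypothetical event envelope (field names are assumptions)."""
    return {
        "event_type": event_type,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "payload": payload,
    }


def gate(similarity: float, fail_under: float, collected: list) -> int:
    """Return an exit code and record an event when the gate fails."""
    if similarity < fail_under:
        collected.append(
            sketch_event(
                "llm.eval.regression.failed",
                {"similarity": similarity, "fail_under": fail_under},
            )
        )
        return 1
    return 0


def export_jsonl(events: list, stream) -> None:
    """Write one JSON object per line to any text stream."""
    for event in events:
        stream.write(json.dumps(event, sort_keys=True) + "\n")


if __name__ == "__main__":
    events: list = []
    code = gate(0.80, 0.85, events)
    buffer = io.StringIO()
    export_jsonl(events, buffer)
    print("exit code:", code)
    print(buffer.getvalue(), end="")
```

Collecting events in a plain list before export mirrors the "collect in memory, then export" option described above; a custom backend would replace export_jsonl with its own sink.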

Documentation

Guide | Description
Getting Started | Installation, API keys, first diff
Tutorials | Step-by-step learning path from first run to Python API (12 tutorials)
CLI Reference | All flags, option groups, exit codes, YAML format
Python API | All public functions, dataclasses, and field descriptions
Schema Events | Observability integration with llm-toolkit-schema
Configuration | .llmdiff TOML schema, env vars, config priority
Provider Setup | OpenAI, Groq, Mistral, Ollama, LM Studio, Anthropic
HTML Reports | Report anatomy, batch reports, judge card, cost table
CI / CD Integration | GitHub Actions examples, threshold recommendations

Quick Start

# Install with semantic scoring support
pip install "llm-diff[semantic]"

# Install with schema-events observability
pip install "llm-diff[semantic]" llm-toolkit-schema

# Set an API key
export OPENAI_API_KEY="sk-..."

# Compare two models on the same prompt
llm-diff "Explain recursion in one sentence." -a gpt-4o -b gpt-4o-mini --semantic

# Save a self-contained HTML report
llm-diff "Explain recursion." -a gpt-4o -b gpt-4o-mini --semantic --out report.html

# Run a batch from a YAML prompt file and gate on similarity
llm-diff --batch prompts.yml -a gpt-4o -b gpt-4o-mini --semantic --fail-under 0.85

See Getting Started for quick examples, or work through the Tutorials for a guided learning path covering prompt engineering, batch evaluation, CI/CD gating, LLM-as-a-Judge, cost tracking, and the Python API.
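For scripted use, the batch command above can also be driven from Python through subprocess. The sketch below is one possible wrapper, not part of llm-diff: build_batch_command and run_gate are hypothetical helper names, and only the flags shown in the Quick Start commands are used. It treats exit code 0 as a pass and any non-zero exit code as a failure.

```python
import subprocess


def build_batch_command(
    batch_file: str, model_a: str, model_b: str, fail_under: float
) -> list[str]:
    """Assemble the argument list for a gated batch run."""
    return [
        "llm-diff",
        "--batch", batch_file,
        "-a", model_a,
        "-b", model_b,
        "--semantic",
        "--fail-under", str(fail_under),
    ]


def run_gate(command: list[str]) -> bool:
    """Run the command and report whether it exited with code 0."""
    completed = subprocess.run(command, check=False)
    return completed.returncode == 0
```

A CI script could call run_gate(build_batch_command("prompts.yml", "gpt-4o", "gpt-4o-mini", 0.85)) and stop the pipeline when it returns False. Passing the arguments as a list avoids shell quoting issues with prompts and file names.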

Getting Help

Bug reports | Open an issue
Feature requests | Open a feature request
Questions & discussion | GitHub Discussions
Open issues | github.com/veerarag1973/llmdiff/issues
PyPI project page | pypi.org/project/llm-diff
Roadmap | IMPLEMENTATION_PLAN.md
Changelog | CHANGELOG.md

When filing a bug, please include: llm-diff --version, your OS, Python version, the full command you ran, and the complete error output.

Contributing

See CONTRIBUTING.md for development setup, running the test suite, code style guidelines, and pull request instructions.

License

llm-diff is distributed under the MIT License.
