llm-diff
A CLI tool and Python library for comparing LLM outputs — semantically, visually, and at scale.
llm-diff calls two LLMs in parallel, diffs their responses word by word,
scores them semantically, and renders the results in the terminal or as a
self-contained HTML report. It scales to batch workloads, caches API responses,
gates CI pipelines via --fail-under, and emits structured AgentOBS events for
observability tooling.
What is llm-diff?
LLMs do not produce deterministic output. Evaluating models, iterating on prompts, or assessing the impact of a model upgrade all require you to compare responses — and doing that by hand does not scale.
llm-diff automates the entire workflow: it calls both models concurrently,
produces a word-level diff, optionally scores semantic similarity via sentence
embeddings, and outputs results to the terminal or as a shareable HTML report.
It supports batch workloads from a YAML file, caches API responses so that
re-running with different thresholds incurs no additional API cost, and returns
exit code 1 when similarity falls below a threshold, which makes it
straightforward to use as a gate in CI/CD pipelines.
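The diff-and-gate idea can be sketched with the standard library alone. This is an illustration, not llm-diff's implementation: the function names `word_diff`, `word_similarity`, and `gate` are hypothetical, and the lexical ratio from `difflib` stands in for the embedding-based semantic score, which it does not reproduce.

```python
import difflib


def word_diff(a: str, b: str) -> list[tuple[str, str]]:
    """Return (tag, text) pairs: 'equal', 'delete' (only in a), 'insert' (only in b)."""
    words_a, words_b = a.split(), b.split()
    matcher = difflib.SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    ops: list[tuple[str, str]] = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("equal", " ".join(words_a[i1:i2])))
            continue
        # A 'replace' opcode is reported as a delete followed by an insert.
        if i1 != i2:
            ops.append(("delete", " ".join(words_a[i1:i2])))
        if j1 != j2:
            ops.append(("insert", " ".join(words_b[j1:j2])))
    return ops


def word_similarity(a: str, b: str) -> float:
    """Lexical similarity in [0, 1]; a stand-in for a semantic score."""
    return difflib.SequenceMatcher(a=a.split(), b=b.split(), autojunk=False).ratio()


def gate(score: float, fail_under: float) -> int:
    """Return exit code 1 when the score falls below the threshold, else 0."""
    return 1 if score < fail_under else 0
```

A score exactly equal to the threshold passes in this sketch, following the "falls below a threshold" wording above.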
Version 1.2 adds LLM-as-a-Judge scoring, per-call USD cost tracking, multi-model (3–4 model) comparison, and structured JSON diff.
Version 1.2.2 integrates AgentOBS as a built-in observability layer: every
comparison, model call, cache lookup, cost record, judge evaluation, and
--fail-under regression failure now emits a validated schema event that can be
collected in memory, exported to JSONL, or forwarded to any custom backend.
Version 1.3.0 adds EVAL_REGRESSION_DETECTED schema event emission: --fail-under
gate failures now emit a structured llm.eval.regression.detected event (via
make_eval_regression_event()) in addition to returning exit code 1,
providing a full audit trail for CI regression gates.
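To make the event flow concrete, here is a minimal sketch of collecting a regression event in memory and exporting it as JSONL. Only the event type string llm.eval.regression.detected comes from the description above. The field names and the helpers `build_regression_event` and `write_jsonl` are hypothetical; they do not reflect the actual AgentOBS schema or the signature of make_eval_regression_event().

```python
import io
import json
from datetime import datetime, timezone
from typing import Iterable, TextIO


def build_regression_event(score: float, fail_under: float,
                           model_a: str, model_b: str) -> dict:
    """Build an illustrative event dict (field names are assumptions)."""
    return {
        "event_type": "llm.eval.regression.detected",
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "payload": {
            "score": score,
            "fail_under": fail_under,
            "model_a": model_a,
            "model_b": model_b,
        },
    }


def write_jsonl(events: Iterable[dict], stream: TextIO) -> int:
    """Write one JSON object per line and return the number of events written."""
    count = 0
    for event in events:
        stream.write(json.dumps(event, sort_keys=True) + "\n")
        count += 1
    return count


if __name__ == "__main__":
    buffer = io.StringIO()  # stands in for an open .jsonl file
    event = build_regression_event(0.72, 0.85, "gpt-4o", "gpt-4o-mini")
    write_jsonl([event], buffer)
    print(buffer.getvalue(), end="")
```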
Documentation
| Guide | Description |
|---|---|
| Getting Started | Installation, API keys, first diff |
| Tutorials | Step-by-step learning path from first run to Python API (12 tutorials) |
| CLI Reference | All flags, option groups, exit codes, YAML format |
| Python API | All public functions, dataclasses, and field descriptions |
| Schema Events | Observability integration with AgentOBS |
| Configuration | .llmdiff TOML schema, env vars, config priority |
| Provider Setup | OpenAI, Groq, Mistral, Ollama, LM Studio, Anthropic |
| HTML Reports | Report anatomy, batch reports, judge card, cost table |
| CI / CD Integration | GitHub Actions examples, threshold recommendations |
Quick Start
```bash
# Install with semantic scoring support
pip install "llm-diff[semantic]"

# Install with schema-events observability
pip install "llm-diff[semantic]" agentobs

# Set an API key
export OPENAI_API_KEY="sk-..."

# Compare two models on the same prompt
llm-diff "Explain recursion in one sentence." -a gpt-4o -b gpt-4o-mini --semantic

# Save a self-contained HTML report
llm-diff "Explain recursion." -a gpt-4o -b gpt-4o-mini --semantic --out report.html

# Run a batch from a YAML prompt file and gate on similarity
llm-diff --batch prompts.yml -a gpt-4o -b gpt-4o-mini --semantic --fail-under 0.85
```
See Getting Started for quick examples, or work through the Tutorials for a guided learning path covering prompt engineering, batch evaluation, CI/CD gating, LLM-as-a-Judge, cost tracking, and the Python API.
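If you drive the batch gate from a Python script rather than a shell step, a thin wrapper around the CLI may be enough. The sketch below uses only the flags shown in the Quick Start. The helper names `build_gate_command` and `run_gate` are hypothetical, and treating every non-zero exit code as a failed gate is an assumption.

```python
import subprocess
from typing import Callable, Sequence


def build_gate_command(batch_file: str, model_a: str, model_b: str,
                       fail_under: float) -> list[str]:
    """Assemble the batch command using the flags from the Quick Start."""
    return [
        "llm-diff", "--batch", batch_file,
        "-a", model_a, "-b", model_b,
        "--semantic", "--fail-under", str(fail_under),
    ]


def run_gate(command: Sequence[str], runner: Callable = subprocess.run) -> bool:
    """Run the command; return True when the gate passes (exit code 0)."""
    # check=False: a failed gate is reported via the return code, not an exception.
    result = runner(list(command), check=False)
    return result.returncode == 0
```

A CI script could call `run_gate(build_gate_command("prompts.yml", "gpt-4o", "gpt-4o-mini", 0.85))` and exit non-zero when it returns False. The `runner` parameter exists so the wrapper can be tested without invoking the real CLI.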
Getting Help
| Need | Where to go |
|---|---|
| Bug reports | Open an issue |
| Feature requests | Open a feature request |
| Questions & discussion | GitHub Discussions |
| Open issues | github.com/veerarag1973/llmdiff/issues |
| PyPI project page | pypi.org/project/llm-diff |
| Roadmap | IMPLEMENTATION_PLAN.md |
| Changelog | CHANGELOG.md |
When filing a bug, please include the output of llm-diff --version, your OS,
your Python version, the full command you ran, and the complete error output.
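Most of that environment information can be gathered with the standard library. The following is a minimal sketch; `collect_bug_report_info` is a hypothetical helper rather than part of llm-diff, and it assumes the installed distribution is named llm-diff, as in the pip command above.

```python
import platform
from importlib import metadata


def collect_bug_report_info(package: str = "llm-diff") -> dict[str, str]:
    """Collect package version, OS, and Python version for a bug report."""
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        version = "not installed"
    return {
        "package": package,
        "package_version": version,
        "os": platform.platform(),
        "python": platform.python_version(),
    }


if __name__ == "__main__":
    for key, value in collect_bug_report_info().items():
        print(f"{key}: {value}")
```

The full command and the complete error output still need to be copied by hand.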
Contributing
See CONTRIBUTING.md for development setup, running the test suite, code style guidelines, and pull request instructions.
License
llm-diff is distributed under the MIT License.
Details for the file llm_diff-1.3.1.tar.gz.
File metadata
- Download URL: llm_diff-1.3.1.tar.gz
- Upload date:
- Size: 60.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f8a17671dab2b5e121eca904a8099d08d4db801906d450ec92dc6aaad0f8c77f |
| MD5 | 111b7999467a3702e9ccc4ea7ef242de |
| BLAKE2b-256 | d112d49506d4d76df55950a7077bf4f22f16bc23c159bad05ea751b0416d32ae |
File details
Details for the file llm_diff-1.3.1-py3-none-any.whl.
File metadata
- Download URL: llm_diff-1.3.1-py3-none-any.whl
- Upload date:
- Size: 70.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 4ac712c9b9e95fe151db14b4f767278113b601877c0f747623a03f3648e3494e |
| MD5 | c056911a69df25077ebc379ca3f7ff5f |
| BLAKE2b-256 | ed092984eba22a311b764c0080ebe0d5e11d51b30a1dad9df4281483d9098b32 |
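A downloaded file can be checked against its published SHA256 digest with a few lines of standard-library Python. This is a generic sketch; the helper names `sha256_of` and `matches_published_digest` are illustrative and not part of llm-diff.

```python
import hashlib
from pathlib import Path


def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 with a published digest, ignoring case and whitespace."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `matches_published_digest("llm_diff-1.3.1.tar.gz", "<SHA256 from the table above>")` should return True for an intact download of the source distribution.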