N-way structural & semantic XML diff that generates human-readable Markdown reports, driven by per-dialect recipes (Control-M, sitemaps, and more).

xmldiffreport

📖 Documentation: https://bilouro.github.io/xmldiffreport/ · Português

N-way structural & semantic XML diff that produces human-readable Markdown reports — driven by per-dialect recipes.

xmldiffreport compares two or more XML files at once and tells you what actually changed, element by element and attribute by attribute — not a noisy line-by-line text diff. It aligns elements by a natural key (not by position), ignores volatile attributes, and renders a clean Markdown report with a summary table plus per-element detail.

It was born from a real problem — spotting differences between BMC Control-M job patches flowing through test → uat → bench → prod — and generalized into a recipe-driven engine that works on any XML dialect (Control-M exports, sitemaps, POMs, manifests, …).

Status: early (0.1.0), but already useful. Feedback and recipes welcome.


Why not a normal diff / xmldiff?

A plain diff (or git diff) on XML lies, for three reasons:

  1. Volatile attributes: VERSION, CREATION_TIME, JOBISN… change on every export with no functional meaning.
  2. Reordering: children are often unordered, so a reorder is not a change.
  3. Attribute order: the order of attributes inside a tag is irrelevant.
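
The three points above can be reproduced with the standard library alone. The sketch below is not how xmldiffreport is implemented; the `normalized` helper, the `VOLATILE` set, and the sample `<JOB>` snippets are made up for illustration. It shows two exports that differ as text but compare equal once volatile attributes are dropped and attribute and child order are ignored.

```python
import xml.etree.ElementTree as ET

# Hypothetical set of volatile attributes, taken from the examples named above.
VOLATILE = {"VERSION", "CREATION_TIME", "JOBISN"}


def normalized(elem):
    """Return an order-insensitive, comparable view of an element."""
    # Sorting the attribute pairs removes any dependence on attribute order.
    attrs = tuple(sorted((k, v) for k, v in elem.attrib.items() if k not in VOLATILE))
    text = (elem.text or "").strip()
    # Sorting the normalized children removes any dependence on child order.
    children = tuple(sorted(normalized(child) for child in elem))
    return (elem.tag, attrs, text, children)


old = '<JOB JOBNAME="LOAD" MAXRERUN="0" VERSION="1"><A/><B/></JOB>'
new = '<JOB VERSION="2" MAXRERUN="0" JOBNAME="LOAD"><B/><A/></JOB>'
changed = '<JOB VERSION="2" MAXRERUN="5" JOBNAME="LOAD"><B/><A/></JOB>'

same_text = old == new
same_meaning = normalized(ET.fromstring(old)) == normalized(ET.fromstring(new))
real_change = normalized(ET.fromstring(old)) != normalized(ET.fromstring(changed))
print(same_text, same_meaning, real_change)  # False True True
```

Sorting children is only appropriate where order carries no meaning; a dialect with ordered children would need a different rule.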

Text/edit-script diffs (like the excellent xmldiff) solve part of this but are 2-way, algorithm-matched (you can't say "match <JOB> by JOBNAME"), and output an edit script rather than a review-friendly report.

xmldiffreport xmldiff DiffDog / Oxygen DeltaXML
Match by declared natural key ⚠️ limited
N-way (3+ files at once)
Markdown report out of the box ❌ (edit script) ⚠️ GUI ❌ (delta XML)
Open source

When to use which — choose xmldiffreport for N-way, key-aligned, report-first comparison (e.g. "the same folder in uat, bench and prod"); reach for xmldiff to produce a patch/edit script, DiffDog/Oxygen for interactive 2-way merging, DeltaXML for heuristic matching of keyless documents, and git diff for raw line changes on already-normalized XML. Full breakdown: How it compares.


Install

pip install xmldiffreport

Requires Python 3.11+ (uses the standard-library tomllib). No third-party dependencies.

Quickstart

Compare two XML files — that's the core idea:

xmldiffreport old.xml new.xml -o report.md

report.md lists every element that changed, one column per file. No options needed — it uses the generic recipe by default. Pass as many files as you like; the report just grows a column each:

xmldiffreport v1.xml v2.xml v3.xml -o report.md

Prefer an HTML page? Add -f html (or name the output *.html):

xmldiffreport old.xml new.xml -f html -o report.html

Exit code is 1 when a difference is found (handy for CI), 0 otherwise.
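
A CI step can branch on that exit code. The wrapper below is a minimal sketch, not part of the package: `has_differences` is a hypothetical helper, and treating any code other than 0 or 1 as a failure is an assumption, since only 0 and 1 are documented above.

```python
import subprocess
import sys


def has_differences(cmd):
    """Run a diff command; map exit code 0 to False and 1 to True."""
    proc = subprocess.run(cmd)
    if proc.returncode in (0, 1):
        return proc.returncode == 1
    # Assumption: any other exit code signals a failure, not a diff result.
    raise RuntimeError(f"command failed with exit code {proc.returncode}")


# Stand-in commands so the sketch runs without the tool installed.
# In CI the list would be something like:
#   ["xmldiffreport", "old.xml", "new.xml", "-o", "report.md"]
differs = has_differences([sys.executable, "-c", "raise SystemExit(1)"])
clean = has_differences([sys.executable, "-c", "raise SystemExit(0)"])
print(differs, clean)  # True False
```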

No files handy? git clone the repo and try the bundled, synthetic examples/:

xmldiffreport examples/sitemap/old/sitemap.xml examples/sitemap/new/sitemap.xml --recipe sitemap

Sharper results: recipes

The default compares any XML, but a recipe teaches the tool how to identify elements in a specific dialect — matching "the same" element by a key (not by position) and ignoring volatile attributes. Built-ins: controlm, sitemap, generic; or write your own.

xmldiffreport old.xml new.xml --recipe sitemap -o report.md
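
To see why matching by key matters, here is a rough sketch of key-aligned comparison using only the standard library. It is an illustration of the idea, not the engine's code: `index_by_key` and `changed_attrs` are hypothetical names, and the sample XML is invented. Elements are paired by their `JOBNAME`, so the reorder between the two files is not reported, and the ignored `VERSION` attribute produces no noise.

```python
import xml.etree.ElementTree as ET


def index_by_key(root, tag, attr):
    """Map each <tag> element's key attribute to its attribute dict."""
    return {e.get(attr): dict(e.attrib) for e in root.iter(tag)}


def changed_attrs(old_root, new_root, tag, attr, ignore=()):
    """Yield (key, attribute, old, new) for keyed elements present in both trees."""
    old_idx = index_by_key(old_root, tag, attr)
    new_idx = index_by_key(new_root, tag, attr)
    for key in sorted(old_idx.keys() & new_idx.keys()):
        names = (old_idx[key].keys() | new_idx[key].keys()) - set(ignore)
        for name in sorted(names):
            a, b = old_idx[key].get(name), new_idx[key].get(name)
            if a != b:
                yield key, name, a, b


old = ET.fromstring(
    '<F><JOB JOBNAME="A" MAXRERUN="0" VERSION="1"/>'
    '<JOB JOBNAME="B" MAXRERUN="2" VERSION="1"/></F>'
)
new = ET.fromstring(
    '<F><JOB JOBNAME="B" MAXRERUN="2" VERSION="2"/>'
    '<JOB JOBNAME="A" MAXRERUN="5" VERSION="2"/></F>'
)

diffs = list(changed_attrs(old, new, "JOB", "JOBNAME", ignore={"VERSION"}))
print(diffs)  # [('A', 'MAXRERUN', '0', '5')]
```

A positional comparison of the same two files would pair job A with job B and report every attribute as changed.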

See Writing recipes, or generate one from your XML with an LLM.

Comparing many files (or whole directories)

Point it at directories too — they're scanned recursively for *.xml, and every file found becomes a source:

xmldiffreport ./dump-a ./dump-b --recipe controlm -o report.md

Mental model: every file is a source (labelled by its path); a unit is the recipe's unit element (e.g. a Control-M SMART_FOLDER); the engine compares each unit across every source that contains it (2+). A unit that appears in only one file is ignored. The tool has no notion of "environments" — if it matters which file is production, name it so.
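
The mental model can be sketched in a few lines. This is an assumption-laden illustration rather than the tool's code: the function names are hypothetical, and the unit is identified by a `NAME` attribute purely for the example (a real recipe declares the key). Directories are expanded recursively, each file becomes a source labelled by its path, and only units found in two or more sources are kept.

```python
import tempfile
import xml.etree.ElementTree as ET
from collections import defaultdict
from pathlib import Path


def collect_sources(paths):
    """Expand files and directories (recursively, *.xml) into a file list."""
    files = []
    for p in map(Path, paths):
        files.extend(sorted(p.rglob("*.xml")) if p.is_dir() else [p])
    return files


def units_by_source(files, unit_tag, name_attr):
    """Map unit name -> {source label: element} for every unit found."""
    found = defaultdict(dict)
    for f in files:
        for unit in ET.parse(f).getroot().iter(unit_tag):
            found[unit.get(name_attr)][str(f)] = unit
    return found


def comparable_units(found):
    """Keep only units present in two or more sources."""
    return {name: srcs for name, srcs in found.items() if len(srcs) >= 2}


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "uat").mkdir()
    (root / "prod").mkdir()
    (root / "uat" / "a.xml").write_text(
        '<D><SMART_FOLDER NAME="X"/><SMART_FOLDER NAME="ONLY_UAT"/></D>'
    )
    (root / "prod" / "b.xml").write_text('<D><SMART_FOLDER NAME="X"/></D>')

    files = collect_sources([root / "uat", root / "prod"])
    units = comparable_units(units_by_source(files, "SMART_FOLDER", "NAME"))

print(sorted(units))  # ['X']
```

`ONLY_UAT` appears in a single source, so it drops out, which mirrors the "ignored" rule described above.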

→ Full, worked guide with directory trees and a complete example: Inputs & file layout.


What the report looks like

For each unit (e.g. a Control-M SMART_FOLDER) present in 2+ sources with differences (names below are from the synthetic examples/):

GLX_INGEST_DAILY (SMART_FOLDER)

Sources: bench/patch-a.xml, uat/patch-b.xml, prod/hotfix-c.xml

**~ JOB GLX_INGEST_LOAD**

Element · attribute bench/patch-a.xml uat/patch-b.xml prod/hotfix-c.xml
CMDLINE --force --retry …%%P_DATE
MAXRERUN 0 5 3
INCOND GLX_INGEST_STAGE-…_OK · AND_OR A O A
OUTCOND GLX_INGEST_LOAD-…_OK · SIGN - + +
ON NOTOK|RERUN present present

Notice: it's N-way (one column per file), it shows attribute-level changes of the same element (the SIGN flip, the AND_OR change), it collapses identical jobs into a count, and the volatile VERSION/CREATION_TIME noise is gone.
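
The one-column-per-file layout is straightforward to picture in code. The following is a minimal sketch of such a renderer, assuming the differences have already been collected into a dict; `render_table` is a hypothetical name, the `RUN_AS` row is invented, and the real report contains more than this (headings, summary table, counts).

```python
def render_table(sources, rows):
    """Render {label: {source: value}} as a Markdown table, one column per source."""
    header = "| Element · attribute | " + " | ".join(sources) + " |"
    rule = "|" + " --- |" * (len(sources) + 1)
    lines = [header, rule]
    for label, values in rows.items():
        cells = [values.get(src, "") for src in sources]
        # Rows where every source agrees carry no information, so skip them.
        if len(set(cells)) > 1:
            lines.append("| " + label + " | " + " | ".join(cells) + " |")
    return "\n".join(lines)


sources = ["bench/patch-a.xml", "uat/patch-b.xml", "prod/hotfix-c.xml"]
rows = {
    "MAXRERUN": dict(zip(sources, ["0", "5", "3"])),
    "RUN_AS": dict(zip(sources, ["svc", "svc", "svc"])),  # identical, so omitted
}
table = render_table(sources, rows)
print(table)
```

Labels that themselves contain a `|` would need escaping before being placed in a Markdown table; the sketch does not handle that.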


Recipes

A recipe is a small TOML file that teaches the generic engine about one XML dialect: the natural key per element and which attributes to ignore.

name = "controlm"

[defaults]
unit = "SMART_FOLDER"           # the unit of comparison
ignore_attrs = ["VERSION", "JOBISN", "CREATION_TIME", "LAST_UPLOAD", "..."]

[elements.JOB]
key = ["@JOBNAME"]

[elements.OUTCOND]
key = ["@NAME"]                 # SIGN / ODATE are compared as attributes

[elements.ON]                   # no clear key → synthesize from CODE + DO actions
key = ["@CODE", "*kinds"]
inline = true                   # treat children as pseudo-attributes

Key mini-language

A key is a list of tokens, joined by |:

Token Meaning
@ATTR value of attribute ATTR
#text the element's own text
*tag the element's tag name (use for singletons compared by their text)
child:TAG@ATTR attribute of a child element
child:TAG#text text of a child element (e.g. sitemap <loc>)
*kinds summary of child kinds / DOACTION actions (for keyless elements like <ON>)

If no key is given, the engine falls back to @NAME, then #text, then a composite of all attributes.
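
As a rough illustration of how such tokens could be resolved, here is a sketch built on `xml.etree.ElementTree`. It is not the engine's implementation: `token_value` and `natural_key` are hypothetical names, `*kinds` is deliberately left out, XML namespaces are ignored, and the `name=value` format of the all-attributes fallback is an assumption.

```python
import xml.etree.ElementTree as ET


def token_value(elem, token):
    """Resolve one key token against an element; return '' when it is absent."""
    if token.startswith("@"):
        return elem.get(token[1:], "")
    if token == "#text":
        return (elem.text or "").strip()
    if token == "*tag":
        return elem.tag
    if token.startswith("child:"):
        spec = token[len("child:"):]
        if spec.endswith("#text"):
            child = elem.find(spec[: -len("#text")])
            return (child.text or "").strip() if child is not None else ""
        tag, _, attr = spec.partition("@")
        child = elem.find(tag)
        return child.get(attr, "") if child is not None else ""
    # Tokens such as *kinds are not covered by this sketch.
    raise ValueError(f"unsupported token: {token}")


def natural_key(elem, tokens=None):
    """Join token values with '|', or apply the documented fallback order."""
    if tokens:
        return "|".join(token_value(elem, t) for t in tokens)
    if elem.get("NAME"):
        return elem.get("NAME")
    text = (elem.text or "").strip()
    if text:
        return text
    # Assumed format for the composite of all attributes.
    return "|".join(f"{k}={v}" for k, v in sorted(elem.attrib.items()))


job = ET.fromstring('<JOB JOBNAME="GLX_INGEST_LOAD"/>')
url = ET.fromstring("<url><loc>https://example.com/a</loc></url>")
on = ET.fromstring('<ON CODE="NOTOK"/>')
bare = ET.fromstring('<X B="2" A="1"/>')

print(natural_key(job, ["@JOBNAME"]))        # GLX_INGEST_LOAD
print(natural_key(url, ["child:loc#text"]))  # https://example.com/a
print(natural_key(on, ["@CODE", "*tag"]))    # NOTOK|ON
print(natural_key(bare))                     # A=1|B=2
```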

Built-in recipes

  • controlm: BMC Control-M exports (DEFTABLE → SMART_FOLDER → JOB → INCOND/OUTCOND/QUANTITATIVE/CONTROL/ON).
  • sitemap: sitemap.xml (identity by <loc> text; compares <lastmod>/<priority>/<changefreq>).
  • generic: no dialect knowledge (default).

Drop a .toml anywhere and pass its path to --recipe to add your own dialect.

Generate & validate a recipe

Don't want to write one by hand? Let an LLM draft it from a sample of your XML:

xmldiffreport-recipe scaffold sample.xml > prompt.txt   # paste prompt.txt into any LLM
xmldiffreport-recipe validate my-dialect.toml           # check the result (ships a JSON Schema)

See Generate a recipe with an LLM.


Project layout — tool vs. your usage

src/xmldiffreport/     the installable TOOL (engine, recipes, CLI) — generic, reusable
examples/              synthetic datasets + generator (no real data)
usage/                 a config-driven HARNESS to run the tool on YOUR files
tests/                 pytest suite

The tool in src/ knows nothing about your folders. The usage/ folder is the thin layer you adapt: a config.toml listing the inputs (files/dirs), a report_dir, and a collect.py that runs the diff and writes the report.

cp usage/config.example.toml usage/config.toml   # then edit the paths
python usage/collect.py                            # writes usage/reports/<timestamp>.md

Your config.toml, reports, and any XML under usage/ are git-ignored — real data and paths never get committed.


Library use

from xmldiffreport import diff

result = diff(["old.xml", "new.xml"], recipe="sitemap")   # a file, files, or dir(s)
print(result.render())                                    # Markdown — or result.render("html")

for unit in result.units:        # what differs
    print(unit.ident, unit.sources)
if result:                       # truthy when anything differs (handy for exit codes)
    ...

Performance

Each file is parsed once into an in-memory tree (xml.etree.ElementTree); the diff cost is roughly linear in the number of nodes. For typical Control-M exports (a few MB) it's instant, and it's fine up to the order of tens of MB. It is not designed for gigabyte-scale files — we deliberately favour simple, maintainable code over incremental/streaming parsing.

Development

python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

ruff check . && ruff format --check .
mypy src
pytest

See CONTRIBUTING.md. Examples and tests use synthetic data only — never real exports.

Roadmap

  • Report top-level units that exist in only one source (added/removed units).
  • JSON report format (Markdown and HTML already ship; formats are pluggable).
  • Similarity-based matching fallback for keyless elements.
  • More built-in recipes (Maven POM, Android manifest, RSS/Atom, JUnit).

License

MIT © Victor H. Bilouro — see LICENSE.
