
N-way structural & semantic XML diff that generates human-readable Markdown reports, driven by per-dialect recipes (Control-M, sitemaps, and more).


xmldiffreport


📖 Documentation: https://bilouro.github.io/xmldiffreport/ · English
📖 Documentation: https://bilouro.github.io/xmldiffreport/pt/ · Português

N-way structural & semantic XML diff that produces human-readable Markdown reports — driven by per-dialect recipes.

xmldiffreport compares two or more XML files at once — BMC Control-M exports, Maven POMs, JUnit/xUnit reports, sitemaps, or any dialect you teach it with a small recipe — and tells you what actually changed, element by element and attribute by attribute, not a noisy line-by-line text diff. It aligns elements by a natural key (not by position), ignores volatile attributes, and renders a clean Markdown report with a summary table plus per-element detail.

It was born from a real problem — spotting differences between BMC Control-M job patches flowing through test → uat → bench → prod — and generalized into a recipe-driven engine that works on any XML dialect (Control-M exports, sitemaps, POMs, manifests, …).

Status: early (0.1.0), but already useful. Feedback and recipes welcome.


Why not a normal diff / xmldiff?

A plain diff (or git diff) on XML lies, for three reasons:

  1. Volatile attributes: VERSION, CREATION_TIME, JOBISN… change on every export with no functional meaning.
  2. Reordering — children are often unordered; a reorder is not a change.
  3. Attribute order inside a tag is irrelevant.
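
The three points above can be illustrated with a small standard-library sketch. It is not how xmldiffreport is implemented; the `canonical` helper and the `VOLATILE` set are hypothetical names chosen for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical set of attributes to drop; a real recipe would list its own.
VOLATILE = {"VERSION", "CREATION_TIME", "JOBISN"}


def canonical(elem: ET.Element) -> tuple:
    """Order-insensitive fingerprint of an element and its subtree."""
    # Sorting attributes makes their order inside the tag irrelevant.
    attrs = tuple(sorted((k, v) for k, v in elem.attrib.items() if k not in VOLATILE))
    text = (elem.text or "").strip()
    # Sorting children by their representation makes sibling order irrelevant.
    children = tuple(sorted((canonical(child) for child in elem), key=repr))
    return (elem.tag, attrs, text, children)


old = '<JOB JOBNAME="LOAD" VERSION="1" MAXRERUN="0"><A/><B/></JOB>'
new = '<JOB MAXRERUN="0" VERSION="2" JOBNAME="LOAD"><B/><A/></JOB>'

print(old == new)  # False: the text differs
print(canonical(ET.fromstring(old)) == canonical(ET.fromstring(new)))  # True
```

A text diff flags the whole line, while the normalized comparison sees no functional change.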

Text/edit-script diffs (like the excellent xmldiff) solve part of this but are 2-way, algorithm-matched (you can't say "match <JOB> by JOBNAME"), and output an edit script rather than a review-friendly report.

xmldiffreport xmldiff DiffDog / Oxygen DeltaXML
Match by declared natural key ⚠️ limited
N-way (3+ files at once)
Markdown report out of the box ❌ (edit script) ⚠️ GUI ❌ (delta XML)
Open source

When to use which — choose xmldiffreport for N-way, key-aligned, report-first comparison (e.g. "the same folder in uat, bench and prod"); reach for xmldiff to produce a patch/edit script, DiffDog/Oxygen for interactive 2-way merging, DeltaXML for heuristic matching of keyless documents, and git diff for raw line changes on already-normalized XML. Full breakdown: How it compares.


Install

pip install xmldiffreport

Requires Python 3.11+ (uses the standard-library tomllib). No third-party dependencies.

Quickstart

Compare two XML files — that's the core idea:

xmldiffreport old.xml new.xml -o report.md

report.md lists every element that changed, one column per file. No options needed — it uses the generic recipe by default. Pass as many files as you like; the report just grows a column each:

xmldiffreport v1.xml v2.xml v3.xml -o report.md

Prefer an HTML page? Add -f html (or name the output *.html):

xmldiffreport old.xml new.xml -f html -o report.html

Exit code is 1 when a difference is found (handy for CI), 0 otherwise.
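
In a CI script the exit code can be mapped to a boolean. The wrapper below is a minimal sketch: `has_differences` is a hypothetical helper, and treating any code other than 0 or 1 as an error is an assumption of this example, not documented behaviour.

```python
import subprocess


def has_differences(cmd: list[str]) -> bool:
    """Run a diff command and map its exit code to a boolean.

    Per the text above: 0 means no difference, 1 means a difference was found.
    Raising on any other code is an assumption of this sketch.
    """
    code = subprocess.run(cmd, check=False).returncode
    if code not in (0, 1):
        raise RuntimeError(f"{cmd[0]} exited with unexpected code {code}")
    return code == 1


# Usage (requires xmldiffreport on PATH; file names are placeholders):
# changed = has_differences(["xmldiffreport", "old.xml", "new.xml", "-o", "report.md"])
```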

No files handy? git clone the repo and try the bundled, synthetic examples/:

xmldiffreport examples/sitemap/old/sitemap.xml examples/sitemap/new/sitemap.xml --recipe sitemap

Sharper results: recipes

The default compares any XML, but a recipe teaches the tool how to identify elements in a specific dialect — matching "the same" element by a key (not by position) and ignoring volatile attributes. Built-ins: controlm, maven-pom, junit, sitemap, generic; or write your own.

xmldiffreport old.xml new.xml --recipe sitemap -o report.md

Writing recipes · generate one from your XML with an LLM.

Comparing many files (or whole directories)

Point it at directories too — they're scanned recursively for *.xml, and every file found becomes a source:

xmldiffreport ./dump-a ./dump-b --recipe controlm -o report.md
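
The recursive scan can be pictured with `pathlib`. This is only a sketch of the idea; `collect_sources` is a hypothetical name, and sorting the results is a choice made here for stable output, not something the tool is known to do.

```python
from pathlib import Path


def collect_sources(inputs: list[str]) -> list[Path]:
    """Expand files and directories into a flat list of XML files."""
    sources: list[Path] = []
    for raw in inputs:
        path = Path(raw)
        if path.is_dir():
            # rglob walks the directory tree recursively.
            sources.extend(sorted(path.rglob("*.xml")))
        else:
            sources.append(path)
    return sources
```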

Mental model: every file is a source (labelled by its path); a unit is the recipe's unit element (e.g. a Control-M SMART_FOLDER); the engine compares each unit across every source that contains it (2+). A unit that appears in only one file is ignored. The tool has no notion of "environments" — if it matters which file is production, name it so.
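
The mental model can be sketched in a few lines. This is an illustration, not the engine's code: `units_to_compare` is a hypothetical helper, and identifying a unit by its `NAME` attribute is an assumption made for the example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def units_to_compare(sources: dict[str, str], unit_tag: str) -> dict[str, list[str]]:
    """Map each unit identifier to the sources that contain it (2+ only)."""
    seen: dict[str, list[str]] = defaultdict(list)
    for label, xml_text in sources.items():
        root = ET.fromstring(xml_text)
        for unit in root.iter(unit_tag):
            # Identifying a unit by its NAME attribute is an assumption here.
            seen[unit.get("NAME", "")].append(label)
    # A unit present in a single source has nothing to be compared against.
    return {ident: labels for ident, labels in seen.items() if len(labels) >= 2}


sources = {
    "uat/a.xml": '<DEFTABLE><SMART_FOLDER NAME="F1"/><SMART_FOLDER NAME="F2"/></DEFTABLE>',
    "prod/b.xml": '<DEFTABLE><SMART_FOLDER NAME="F1"/></DEFTABLE>',
}
print(units_to_compare(sources, "SMART_FOLDER"))  # {'F1': ['uat/a.xml', 'prod/b.xml']}
```

F2 exists only in uat/a.xml, so it drops out, matching the rule that single-source units are ignored.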

→ Full, worked guide with directory trees and a complete example: Inputs & file layout.


What the report looks like

For each unit (e.g. a Control-M SMART_FOLDER) present in 2+ sources with differences (names below are from the synthetic examples/):

GLX_INGEST_DAILY (SMART_FOLDER)

Sources: bench/patch-a.xml, uat/patch-b.xml, prod/hotfix-c.xml

**~ JOB GLX_INGEST_LOAD**

Element · attribute bench/patch-a.xml uat/patch-b.xml prod/hotfix-c.xml
CMDLINE --force --retry …%%P_DATE
MAXRERUN 0 5 3
INCOND GLX_INGEST_STAGE-…_OK · AND_OR A O A
OUTCOND GLX_INGEST_LOAD-…_OK · SIGN - + +
ON NOTOK|RERUN present present

Notice: it's N-way (one column per file), it shows attribute-level changes of the same element (the SIGN flip, the AND_OR change), it collapses identical jobs into a count, and the volatile VERSION/CREATION_TIME noise is gone.
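
To make the "one column per file" layout concrete, here is a minimal sketch of rendering such a table. `render_table` is a hypothetical function, the PRIORITY row is made-up sample data, and the real renderer may well work differently.

```python
def render_table(sources: list[str], rows: dict[str, list[str]]) -> str:
    """Render differing attributes as a Markdown table, one column per source."""
    lines = [
        "| Element · attribute | " + " | ".join(sources) + " |",
        "|" + " --- |" * (len(sources) + 1),
    ]
    for name, values in rows.items():
        if len(set(values)) > 1:  # identical across all sources: not a difference
            lines.append(f"| {name} | " + " | ".join(values) + " |")
    return "\n".join(lines)


sources = ["bench/patch-a.xml", "uat/patch-b.xml", "prod/hotfix-c.xml"]
rows = {"MAXRERUN": ["0", "5", "3"], "PRIORITY": ["1", "1", "1"]}  # PRIORITY is made up
print(render_table(sources, rows))
```

Only MAXRERUN appears in the output, because PRIORITY has the same value in every source.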


Recipes

A recipe is a small TOML file that teaches the generic engine about one XML dialect: the natural key per element and which attributes to ignore.

name = "controlm"

[defaults]
unit = "SMART_FOLDER"           # the unit of comparison
ignore_attrs = ["VERSION", "JOBISN", "CREATION_TIME", "LAST_UPLOAD", "..."]

[elements.JOB]
key = ["@JOBNAME"]

[elements.OUTCOND]
key = ["@NAME"]                 # SIGN / ODATE are compared as attributes

[elements.ON]                   # no clear key → synthesize from CODE + DO actions
key = ["@CODE", "*kinds"]
inline = true                   # treat children as pseudo-attributes

Key mini-language

A key is a list of tokens, joined by |:

| Token | Meaning |
| --- | --- |
| @ATTR | value of attribute ATTR |
| #text | the element's own text |
| *tag | the element's tag name (use for singletons compared by their text) |
| child:TAG@ATTR | attribute of a child element |
| child:TAG#text | text of a child element (e.g. sitemap <loc>) |
| *kinds | summary of child kinds / DOACTION actions (for keyless elements like <ON>) |

If no key is given, the engine falls back to @NAME, then #text, then a composite of all attributes.
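
The token semantics can be sketched with `xml.etree.ElementTree`. This is an illustration under assumptions, not the engine's resolver: `resolve_token` and `resolve_key` are hypothetical names, `*kinds` and the fallback chain are left out, XML namespaces are not handled, and the sample attributes are made up.

```python
import xml.etree.ElementTree as ET


def resolve_token(elem: ET.Element, token: str) -> str:
    """Resolve one key token against an element (empty string if missing)."""
    if token == "#text":
        return (elem.text or "").strip()
    if token == "*tag":
        return elem.tag
    if token.startswith("@"):
        return elem.get(token[1:], "")
    if token.startswith("child:"):
        spec = token[len("child:"):]
        # Split "TAG@ATTR" or "TAG#text" into the child tag and what to read.
        for sep in ("@", "#"):
            if sep in spec:
                tag, _, rest = spec.partition(sep)
                child = elem.find(tag)
                return "" if child is None else resolve_token(child, sep + rest)
    raise ValueError(f"unsupported token: {token}")


def resolve_key(elem: ET.Element, tokens: list[str]) -> str:
    """Join the resolved tokens with '|', as described above."""
    return "|".join(resolve_token(elem, t) for t in tokens)


url = ET.fromstring("<url><loc>https://example.com/a</loc><lastmod>2024-01-01</lastmod></url>")
job = ET.fromstring('<JOB JOBNAME="LOAD" APPLICATION="GLX"/>')
print(resolve_key(url, ["child:loc#text"]))            # https://example.com/a
print(resolve_key(job, ["@JOBNAME", "@APPLICATION"]))  # LOAD|GLX
```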

Built-in recipes

  • controlm — BMC Control-M exports (DEFTABLE → SMART_FOLDER → JOB → INCOND/OUTCOND/QUANTITATIVE/CONTROL/ON).
  • maven-pom — Maven pom.xml: dependency & plugin drift, keyed by coordinates (groupId:artifactId). Reports version/scope changes and added/removed entries across <dependencies>, <dependencyManagement> and <build>.
  • junit — JUnit/xUnit reports (Surefire, Gradle, pytest, …): keyed by classname+name. Surfaces pass↔fail↔skip transitions and added/removed tests, ignoring time/timestamp/hostname.
  • sitemap: sitemap.xml (identity by <loc> text; compares <lastmod>/<priority>/<changefreq>).
  • generic — no dialect knowledge (default).

Drop a .toml anywhere and pass its path to --recipe to add your own dialect.

Generate & validate a recipe

Don't want to write one by hand? Let an LLM draft it from a sample of your XML:

xmldiffreport-recipe scaffold sample.xml > prompt.txt   # paste prompt.txt into any LLM
xmldiffreport-recipe validate my-dialect.toml           # check the result (ships a JSON Schema)
xmldiffreport-recipe show controlm                      # print a built-in recipe to learn from

See Generate a recipe with an LLM.


Project layout — tool vs. your usage

src/xmldiffreport/     the installable TOOL (engine, recipes, CLI) — generic, reusable
examples/              synthetic datasets + generator (no real data)
usage/                 a config-driven HARNESS to run the tool on YOUR files
tests/                 pytest suite

The tool in src/ knows nothing about your folders. The usage/ folder is the thin layer you adapt: a config.toml listing the inputs (files/dirs), a report_dir, and a collect.py that runs the diff and writes the report.

cp usage/config.example.toml usage/config.toml   # then edit the paths
python usage/collect.py                            # writes usage/reports/<timestamp>.md

Your config.toml, reports, and any XML under usage/ are git-ignored — real data and paths never get committed.


Library use

from xmldiffreport import diff

result = diff(["old.xml", "new.xml"], recipe="sitemap")   # a file, files, or dir(s)
print(result.render())                                    # Markdown — or result.render("html")

for unit in result.units:        # what differs
    print(unit.ident, unit.sources)
if result:                       # truthy when anything differs (handy for exit codes)
    ...

Performance

Each file is parsed once into an in-memory tree (xml.etree.ElementTree); the diff cost is roughly linear in the number of nodes. For typical Control-M exports (a few MB) it's instant, and it's fine up to the order of tens of MB. It is not designed for gigabyte-scale files — we deliberately favour simple, maintainable code over incremental/streaming parsing.

Development

python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

ruff check . && ruff format --check .
mypy src
pytest

See CONTRIBUTING.md. Examples and tests use synthetic data only — never real exports.

Roadmap

  • Report top-level units that exist in only one source (added/removed units).
  • JSON report format (Markdown and HTML already ship; formats are pluggable).
  • Similarity-based matching fallback for keyless elements.
  • More built-in recipes (Android manifest, RSS/Atom, .NET web.config, …).

License

MIT © Victor H. Bilouro — see LICENSE.

