Skip to main content

Compares two XML documents by diffing their text, ignoring structure, and wraps changed text in <del>/<ins> tags.

Project description

xml_diff

Compares the text inside two XML documents and marks up the differences with <del> and <ins> tags.

This is the result of about 7 years of trying to get this right and coded simply. I've used code like this in one form or another to compare bill text on GovTrack.us <https://www.govtrack.us>.

The comparison is completely blind to the structure of the two XML documents. It does a word-by-word comparison on the text content only, and then it goes back into the original documents and wraps changed text in new <del> and <ins> wrapper elements.

The documents are then concatenated to form a new document and the new document is printed on standard output. Or use this as a library and call compare yourself with two lxml.etree.Element nodes (the roots of your documents).

The script is written in Python 3.

Example

Comparing these two documents:

<html>
    Here is <b>some bold</b> text.
</html>

and:

<html>
    Here is <i>some italic</i> content that shows how <tt>xml_diff</tt> works.
</html> 

Yields:

<documents>
    <html>
        Here is <b>some <del>bold</del></b><del> text</del>.
    </html>
    <html>
        Here is <i>some <ins>italic</ins></i><ins> content that shows how </ins><tt><ins>xml_diff</ins></tt><ins> works</ins>.
    </html>
</documents>

On Ubuntu, get dependencies with:

apt-get install python3-lxml libxml2-dev libxslt1-dev

For really fast comparisons, get Google's Diff Match Patch library <https://code.google.com/p/google-diff-match-patch/>, as re-written and sped-up by @leutloff <https://github.com/leutloff/diff-match-patch-cpp-stl> and then turned into a Python extension module by me <https://github.com/JoshData/diff_match_patch-python>:

pip3 install diff_match_patch_python

Or if you can't install that for any reason, use the pure-Python library:

pip3 install diff-match-patch

This is also at <https://code.google.com/p/google-diff-match-patch/source/browse/trunk/python3/diff_match_patch.py>. xml_diff will use whichever is installed.

Finally, install this module:

pip3 install xml_diff

Then call the module from the command line:

python3 -m xml_diff  --tags del,ins doc1.xml doc2.xml > changes.xml

Or use the module from Python:

import lxml.etree
from xml_diff import compare

dom1 = lxml.etree.parse("doc1.xml").getroot()
dom2 = lxml.etree.parse("doc2.xml").getroot()
comparison = compare(dom1, dom2)

The two DOMs are modified in-place.

Optional Arguments

The compare function takes other optional keyword arguments:

merge is a boolean (default false) that indicates whether the comparison function should perform a merge. If true, dom1 will contain not just <del> nodes but also <ins> nodes and, similarly, dom2 will contain not just <ins> nodes but also <del> nodes. Although the two DOMs will now contain the same semantic information about changes, and the same text content, each preserves their original structure --- since the comparison is only over text and not structure. The new ins/del nodes contain content from the other document (including whole subtrees), and so there's no guarantee that the final documents will conform to any particular structural schema after this operation.

word_separator_regex (default r"\s+|[^\s\w]") is a regular expression for how to separate words. The default splits on one or more spaces in a row and single instances of non-word characters.

differ is a function that takes two arguments (text1, text2) and returns an iterator over difference operations given as tuples of the form (operation, text_length), where operation is one of "=" (no change in text), "+" (text inserted into text2), or "-" (text deleted from text1). (See xml_diff/__init__.py's default_differ function for how the default differ works.)

tags is a two-tuple of tag names to use for deleted and inserted content. The default is ('del', 'ins').

make_tag_func is a function that takes one argument, which is either "ins" or "del", and returns a new lxml.etree.Element to be inserted into the DOM to wrap changed content. If given, the tags argument is ignored.

License

This project is free of any copyright restrictions per the [Unlicense](https://unlicense.org/). (Prior to May 3, 2025, the project was made available under the terms of the [CC0 1.0 Universal public domain dedication](http://creativecommons.org/publicdomain/zero/1.0/).) See [LICENSE](LICENSE) and [CONTRIBUTING.md](CONTRIBUTING.md).

For Project Maintainers

To release:

  • Update the version number in setup.py.
  • Push to github.
  • Publish a source and wheel distribution to pypi (see command below).
source venv/bin/activate
pip3 install --upgrade build twine
rm -rf dist
python3 -m build
twine upload -u __token__ dist/*

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

xml_diff-0.7.1.tar.gz (12.7 kB view details)

Uploaded Source

Built Distribution

xml_diff-0.7.1-py3-none-any.whl (11.1 kB view details)

Uploaded Python 3

File details

Details for the file xml_diff-0.7.1.tar.gz.

File metadata

  • Download URL: xml_diff-0.7.1.tar.gz
  • Upload date:
  • Size: 12.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.3

File hashes

Hashes for xml_diff-0.7.1.tar.gz
Algorithm Hash digest
SHA256 44681b8c9ccd7ce861d474a2aba543f9fde3b20a8398b347acf4b4d30931a0f3
MD5 b98da85ff9974b3d201fd35fd52f8225
BLAKE2b-256 d63b85312ae1ec054783211898b9f27e877061260725c2ae5c8ff54ce15982ec

See more details on using hashes here.

File details

Details for the file xml_diff-0.7.1-py3-none-any.whl.

File metadata

  • Download URL: xml_diff-0.7.1-py3-none-any.whl
  • Upload date:
  • Size: 11.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.3

File hashes

Hashes for xml_diff-0.7.1-py3-none-any.whl
Algorithm Hash digest
SHA256 51ee802b11ea9399d5c4e685133441f70aeb0ca4c0faac64da2f71150b0eb37f
MD5 d9f30da938e73bbeb1812ba708de375f
BLAKE2b-256 2a5953a89515e08caf183fd24f271255667eb60ced3548659c40ceda8736634c

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page