Compares two XML documents by diffing their text, ignoring structure, and wraps changed text in <del>/<ins> tags.

Compares the text inside two XML documents and marks up the differences with <del> and <ins> tags.

This is the result of about seven years of trying to get this right and to keep the code simple. I’ve used code like this in one form or another to compare bill text on GovTrack.us <https://www.govtrack.us>.

The comparison is completely blind to the structure of the two XML documents. It does a word-by-word comparison on the text content only, and then it goes back into the original documents and wraps the changed text in new <del> and <ins> elements.

The two marked-up documents are then concatenated into a new document, which is printed on standard output. Alternatively, use this as a library and call compare yourself with two lxml.etree.Element nodes (the roots of your documents).
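
Library use might look roughly like the sketch below. Several details are assumptions not confirmed by the text above: the import path `from xml_diff import compare`, that compare takes the two roots as its first two arguments, and that it marks up both trees in place. The helper name compare_files is hypothetical.

```python
def compare_files(path1, path2):
    """Parse two XML files, mark up their differences, and return both roots."""
    # Imported inside the function so this sketch loads even when the
    # dependencies are not installed.
    import lxml.etree
    from xml_diff import compare  # assumed import path

    dom1 = lxml.etree.parse(path1).getroot()
    dom2 = lxml.etree.parse(path2).getroot()

    # Assumption: compare() inserts the <del>/<ins> elements into
    # dom1 and dom2 directly, so its return value is not used here.
    compare(dom1, dom2)
    return dom1, dom2
```

Each returned root could then be serialized with `lxml.etree.tostring(root, encoding="unicode")`.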

The script is written in Python 3 and uses Google’s Diff Match Patch library <https://code.google.com/p/google-diff-match-patch/>, as rewritten and sped up by @leutloff <https://github.com/leutloff/diff-match-patch-cpp-stl> and then turned into a Python extension module by me <https://github.com/JoshData/diff_match_patch-python>. (A great pull request would be to replace that dependency with Python’s built-in difflib <https://docs.python.org/3/library/difflib.html> module. It would be slower, but the script would then have no unusual dependencies.)

Example

Comparing these two documents:

<html>
        Here is <b>some bold</b> text.
</html>

and:

<html>
        Here is <i>some italic</i> content that shows how <tt>xml_diff</tt> works.
</html>

Yields:

<documents>
        <html>
                Here is <b>some <del>bold</del></b><del> text</del>.
        </html>
        <html>
                Here is <i>some <ins>italic</ins></i><ins> content that shows how </ins><tt><ins>xml_diff</ins></tt><ins> works</ins>.
        </html>
</documents>
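
The <documents> wrapper in the output above can be reproduced with a few lines. This is a sketch that assumes the two trees have already been marked up; it uses the standard library's xml.etree.ElementTree as a stand-in for lxml, and the function name concatenate is hypothetical.

```python
import xml.etree.ElementTree as ET


def concatenate(root1, root2, wrapper_tag="documents"):
    """Place two document roots side by side under one new wrapper element."""
    wrapper = ET.Element(wrapper_tag)
    wrapper.append(root1)
    wrapper.append(root2)
    return wrapper


# Already marked-up inputs, shortened from the example above.
old = ET.fromstring(
    "<html>Here is <b>some <del>bold</del></b><del> text</del>.</html>")
new = ET.fromstring(
    "<html>Here is <i>some <ins>italic</ins></i><ins> content</ins>.</html>")

combined = concatenate(old, new)
print(ET.tostring(combined, encoding="unicode"))
```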

Installation

To install on Ubuntu, follow these steps:

# install lxml from the Ubuntu package
sudo apt-get install python3-lxml
# or install its build dependencies and build it with pip
sudo apt-get install libxml2-dev libxslt1-dev
sudo pip3 install lxml

# get my Python extension module for the Google Diff Match Patch library
# so we can compute differences in text very quickly
git clone --recursive https://github.com/JoshData/diff_match_patch-python
cd diff_match_patch-python
sudo apt-get install python3-dev
python3 setup.py build
sudo python3 setup.py install

Running

Compare two XML documents:

python3 xml_diff.py --tags del,ins doc1.xml doc2.xml > with_changes.xml

