
Compares two XML documents by diffing their text, ignoring structure, and wraps changed text in <del>/<ins> tags.


Compares the text inside two XML documents and marks up the differences with <del> and <ins> tags.

This is the result of about 7 years of trying to get this right and to code it simply. I’ve used code like this in one form or another to compare bill text on GovTrack.us <https://www.govtrack.us>.

The comparison is completely blind to the structure of the two XML documents. It does a word-by-word comparison on the text content only, and then it goes back into the original documents and wraps changed text in new <del> and <ins> wrapper elements.
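The structure-blind comparison described above can be sketched in a few lines. This is only an illustration and not the package's own code: it swaps in Python's built-in difflib for Diff Match Patch and xml.etree.ElementTree for lxml, and the helper names text_words and word_changes are hypothetical.

```python
import difflib
import xml.etree.ElementTree as ET


def text_words(xml_source):
    """Return the document's text content as a list of words, ignoring all tags."""
    root = ET.fromstring(xml_source)
    return "".join(root.itertext()).split()


def word_changes(xml_a, xml_b):
    """Return (deleted, inserted) runs of words between the two documents' text."""
    a, b = text_words(xml_a), text_words(xml_b)
    deleted, inserted = [], []
    # Assumption: difflib stands in for the Diff Match Patch library here.
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    for op, a1, a2, b1, b2 in matcher.get_opcodes():
        if op in ("delete", "replace"):
            deleted.append(" ".join(a[a1:a2]))
        if op in ("insert", "replace"):
            inserted.append(" ".join(b[b1:b2]))
    return deleted, inserted


doc1 = "<html>Here is <b>some bold</b> text.</html>"
doc2 = ("<html>Here is <i>some italic</i> content that shows how "
        "<tt>xml_diff</tt> works.</html>")

deleted, inserted = word_changes(doc1, doc2)
print(deleted)   # ['bold text.']
print(inserted)  # ['italic content that shows how xml_diff works.']
```

Because this sketch splits on whitespace only, trailing punctuation stays attached to its word, so "text." and "works." count as different words. A finer tokenization would be needed to leave the final period unmarked.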

The two documents are then concatenated into a new document, which is printed on standard output. Alternatively, use this as a library and call compare yourself with two lxml.etree.Element nodes (the roots of your documents).

The script is written in Python 3 and uses Google’s Diff Match Patch library <https://code.google.com/p/google-diff-match-patch/>, as rewritten and sped up by @leutloff <https://github.com/leutloff/diff-match-patch-cpp-stl> and then turned into a Python extension module by me <https://github.com/JoshData/diff_match_patch-python>. (A great pull request would be to replace that dependency with Python’s built-in difflib <https://docs.python.org/3/library/difflib.html> module. It would be slower, but the project would then have no unusual dependencies.)

Example

Comparing these two documents:

<html>
        Here is <b>some bold</b> text.
</html>

and:

<html>
        Here is <i>some italic</i> content that shows how <tt>xml_diff</tt> works.
</html>

Yields:

<documents>
        <html>
                Here is <b>some <del>bold</del></b><del> text</del>.
        </html>
        <html>
                Here is <i>some <ins>italic</ins></i><ins> content that shows how </ins><tt><ins>xml_diff</ins></tt><ins> works</ins>.
        </html>
</documents>
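The wrapping step seen in the output above, where changed text inside an element is enclosed in a new element, can be illustrated for the simplest case: a character range that lies entirely within one element's leading text. This is a minimal sketch under that assumption, using the standard library's xml.etree.ElementTree rather than lxml. The helper wrap_text_range is hypothetical and is not part of xml_diff, which also has to handle ranges that cross element boundaries.

```python
import xml.etree.ElementTree as ET


def wrap_text_range(elem, start, end, tag):
    """Wrap characters [start, end) of elem's leading text in a new <tag> child."""
    text = elem.text or ""
    wrapper = ET.Element(tag)
    wrapper.text = text[start:end]   # the changed text goes inside the wrapper
    wrapper.tail = text[end:]        # unchanged text after the change follows it
    elem.text = text[:start]         # unchanged text before the change stays put
    elem.insert(0, wrapper)          # wrapper precedes any existing children


b = ET.fromstring("<b>some bold</b>")
wrap_text_range(b, 5, 9, "del")      # characters 5..8 are "bold"
print(ET.tostring(b, encoding="unicode"))  # <b>some <del>bold</del></b>
```

In the ElementTree data model, text that follows a child element is stored in that child's tail, which is why the remainder of the original text moves onto the wrapper.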

Installation

To install on Ubuntu, follow these steps:

sudo apt-get install python3-lxml
# or
sudo apt-get install libxml2-dev libxslt1-dev
sudo pip3 install lxml

# get my Python extension module for the Google Diff Match Patch library
# so we can compute differences in text very quickly
git clone --recursive https://github.com/JoshData/diff_match_patch-python
cd diff_match_patch-python
sudo apt-get install python3-dev
python3 setup.py build
sudo python3 setup.py install

Running

Compare two XML documents:

python3 xml_diff.py --tags del,ins doc1.xml doc2.xml > with_changes.xml


