A convenient way to calculate the edit distance between html files

Vanguard kit

A convenient way to calculate the edit distance between html files to scrape with confidence

Scraping can be a hard task because web sites change continuously. What if there were a way to detect those changes before scraping a site? Vanguard is a toolkit that calculates the edit distance between two HTML files using the Zhang-Shasha algorithm. This package is based on zss.

Installation

OS X & Linux:

From PyPI

$ pip3 install vanguardkit

From source

$ git clone https://github.com/dany2691/vanguard-kit.git
$ cd vanguard-kit
$ python3 setup.py install

Usage example

With vanguard, HTML content can be converted into a tree of nodes. The create_html_tree function does this: it returns an instance of the VanguardNode class, which inherits from the zss.Node class:

from vanguardkit import create_html_tree

with open("target_website.html") as target_website:
    html_tree = create_html_tree(target_website)
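
To illustrate what "converting HTML into a tree" means, here is a minimal sketch that uses only the standard library's html.parser. It is not vanguard's implementation: TagNode, TagTreeBuilder, build_tag_tree and outline are hypothetical names introduced for this example, and the sketch keeps tag names only, ignoring text and attributes.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TagNode:
    """Hypothetical tree node holding a tag name and its child nodes."""

    def __init__(self, label):
        self.label = label
        self.children = []


class TagTreeBuilder(HTMLParser):
    """Builds a TagNode tree by tracking the currently open tags on a stack."""

    def __init__(self):
        super().__init__()
        self.root = TagNode("document")  # synthetic root above the <html> element
        self._stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = TagNode(tag)
        self._stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self._stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest matching open tag; ignore stray end tags.
        for depth in range(len(self._stack) - 1, 0, -1):
            if self._stack[depth].label == tag:
                del self._stack[depth:]
                break


def build_tag_tree(html_text):
    builder = TagTreeBuilder()
    builder.feed(html_text)
    builder.close()
    return builder.root


def outline(node, depth=0):
    """Return the tree as a list of indented tag names, one per node."""
    lines = ["  " * depth + node.label]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


tree = build_tag_tree("<html><body><div><p>hi</p><br></div></body></html>")
print("\n".join(outline(tree)))
```

A tree of this shape, with one labelled node per element and ordered children, is the kind of input a tree edit distance algorithm works on.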

It is also possible to build the tree from a specific part of an HTML file.

By tag:

with open("target_website.html") as target_website:
    html_tree = create_html_tree(
        html_file=target_website,
        specific_tag="footer"
    )

By tag and class:

with open("target_website.html") as target_website:
    html_tree = create_html_tree(
        html_file=target_website,
        specific_tag="div",
        class_="main-div"
    )

By tag and id:

with open("target_website.html") as target_website:
    html_tree = create_html_tree(
        html_file=target_website,
        specific_tag="div",
        id="super-div"
    )

Calculating distance

As noted above, vanguard relies on the Zhang-Shasha algorithm, which computes the edit distance between two given trees. The zss package does this behind the scenes; vanguard only provides a way to convert HTML files into trees.

from vanguard_kit import create_html_tree, calcuate_html_tree_distance

with open("stored_target_website.html") as stored_file:
    with open("current_target_website.html") as current_file:
        previous_tree = create_html_tree(stored_file)
        current_tree = create_html_tree(current_file)
        print(calcuate_html_tree_distance(previous_tree, current_tree))
        # Prints 1
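
To make the notion of tree edit distance concrete, the following is a compact, standard-library-only sketch of the Zhang-Shasha dynamic program with unit costs (insert, delete and relabel each cost 1). It is an illustration, not the code that zss or vanguard actually runs; SimpleNode and tree_edit_distance are hypothetical names used only here.

```python
class SimpleNode:
    """Hypothetical ordered-tree node: a label plus a list of children."""

    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)


def _annotate(root):
    """Return postorder labels, leftmost-leaf indices and keyroots of a tree."""
    labels, leftmost = [], []

    def visit(node):
        first_leaf = None
        for child in node.children:
            leaf = visit(child)
            if first_leaf is None:
                first_leaf = leaf
        index = len(labels)  # postorder index of this node
        labels.append(node.label)
        leftmost.append(index if first_leaf is None else first_leaf)
        return leftmost[index]

    visit(root)
    # Keyroot: the highest-indexed node for each distinct leftmost leaf.
    keyroots = sorted({leaf: i for i, leaf in enumerate(leftmost)}.values())
    return labels, leftmost, keyroots


def tree_edit_distance(a, b):
    """Unit-cost edit distance between two ordered, labelled trees."""
    la, lmd_a, keys_a = _annotate(a)
    lb, lmd_b, keys_b = _annotate(b)
    td = [[0] * len(lb) for _ in la]  # td[i][j]: distance between subtrees i and j

    for i in keys_a:
        for j in keys_b:
            li, lj = lmd_a[i], lmd_b[j]
            rows, cols = i - li + 2, j - lj + 2
            fd = [[0] * cols for _ in range(rows)]  # forest distances
            for x in range(1, rows):
                fd[x][0] = fd[x - 1][0] + 1  # delete every node
            for y in range(1, cols):
                fd[0][y] = fd[0][y - 1] + 1  # insert every node
            for x in range(1, rows):
                for y in range(1, cols):
                    ni, nj = li + x - 1, lj + y - 1
                    delete = fd[x - 1][y] + 1
                    insert = fd[x][y - 1] + 1
                    if lmd_a[ni] == li and lmd_b[nj] == lj:
                        # Both forests are whole trees: relabel or match the roots.
                        relabel = fd[x - 1][y - 1] + (la[ni] != lb[nj])
                        fd[x][y] = min(delete, insert, relabel)
                        td[ni][nj] = fd[x][y]
                    else:
                        # Reuse the subtree distance computed for an earlier keyroot pair.
                        p, q = lmd_a[ni] - li, lmd_b[nj] - lj
                        fd[x][y] = min(delete, insert, fd[p][q] + td[ni][nj])
    return td[-1][-1]


stored = SimpleNode("html", [SimpleNode("body", [SimpleNode("div"), SimpleNode("p")])])
current = SimpleNode("html", [SimpleNode("body", [SimpleNode("div")])])
print(tree_edit_distance(stored, current))  # Prints 1 (the <p> node was removed)
```

Here a distance of 1 corresponds to a single node deletion; a distance of 0 would mean the two tag trees are identical under this cost model.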

Because the VanguardNode class implements the __sub__ dunder method, the edit distance can also be calculated with the subtraction operator:

from vanguard_kit import create_html_tree, calcuate_html_tree_distance

with open("stored_target_website.html") as stored_file:
    with open("current_target_website.html") as current_file:
        previous_tree = create_html_tree(stored_file)
        current_tree = create_html_tree(current_file)
        print(previous_tree - current_tree)
        # Prints 1

Consequently, the following expression evaluates to True:

calcuate_html_tree_distance(previous_tree, current_tree) == previous_tree - current_tree
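
The equivalence above follows the usual Python pattern of having __sub__ delegate to a named function. The sketch below shows that pattern in isolation, together with one possible way to act on the result before scraping. Everything in it is an assumption made for illustration: SketchNode, placeholder_distance and is_safe_to_scrape are hypothetical names, and the placeholder metric (difference in node counts) merely stands in for a real tree edit distance.

```python
class SketchNode:
    """Hypothetical stand-in for a tree node; not the real VanguardNode."""

    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

    def size(self):
        # Number of nodes in the subtree rooted at this node.
        return 1 + sum(child.size() for child in self.children)

    def __sub__(self, other):
        # `a - b` delegates to the named distance function.
        return placeholder_distance(self, other)


def placeholder_distance(a, b):
    """Placeholder metric: difference in node counts, NOT a tree edit distance."""
    return abs(a.size() - b.size())


def is_safe_to_scrape(distance, max_distance=0):
    """Hypothetical guard: scrape only if the page changed by at most max_distance."""
    return distance <= max_distance


stored = SketchNode("html", [SketchNode("body", [SketchNode("div"), SketchNode("p")])])
current = SketchNode("html", [SketchNode("body", [SketchNode("div")])])

print(placeholder_distance(stored, current) == stored - current)  # Prints True
print(is_safe_to_scrape(stored - current))  # Prints False
```

The acceptable threshold depends on the site and on how tolerant the scraper's selectors are, so max_distance is left as a parameter rather than a fixed value.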

Development setup

This project uses pipenv for dependency resolution. It is a kind of mix between pip and virtualenv. Follow these instructions to set up the development environment.

$ git clone https://github.com/dany2691/vanguard-kit.git
$ cd vanguard-kit
$ pipenv shell
$ pipenv install -e .

To run the test suite, from inside the project directory:

$ pytest tests/ -vv

Meta

Daniel Omar Vergara Pérez – @dan1_net – daniel.omar.vergara@gmail.com

https://github.com/dany2691

Contributing

  1. Fork it (https://github.com/dany2691/vanguard-kit)
  2. Create your feature branch (git checkout -b feature/fooBar)
  3. Commit your changes (git commit -am 'Add some fooBar')
  4. Push to the branch (git push origin feature/fooBar)
  5. Create a new Pull Request
