A convenient way to calculate the edit distance between html files

Vanguard kit

A convenient way to calculate the edit distance between HTML files, so you can scrape with confidence

Scraping can be hard because websites change continuously. What if there were a way to detect those changes before scraping a site? Vanguard is a toolkit that calculates the edit distance between two HTML files using the Zhang-Shasha algorithm. The package is based on zss.

Installation

OS X & Linux:

From PyPI:

$ pip3 install vanguardkit

From source:

$ git clone https://github.com/dany2691/vanguard-kit.git
$ cd vanguard-kit
$ python3 setup.py install

Usage example

Vanguard converts HTML content into a tree of nodes. The create_html_tree function does this: it returns an instance of the VanguardNode class, which inherits from zss.Node:

from vanguardkit import create_html_tree

with open("target_website.html") as target_website:
    html_tree = create_html_tree(target_website)
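
To give an idea of what "converting HTML into a tree" means, here is a minimal sketch that uses only the standard library's html.parser. It is not vanguard's implementation: TagNode, TreeBuilder, build_tree and outline are hypothetical names made up for this illustration, and the sketch keeps only tag names, ignoring text and attributes.

```python
from html.parser import HTMLParser

# Void elements have no closing tag, so they must never stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagNode:
    """Hypothetical tree node: a tag name plus its ordered children."""

    def __init__(self, label):
        self.label = label
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a TagNode tree while the parser walks the markup."""

    def __init__(self):
        super().__init__()
        self.root = TagNode("document")  # synthetic root above <html>
        self._stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = TagNode(tag)
        self._stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self._stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed elements.
        for index in range(len(self._stack) - 1, 0, -1):
            if self._stack[index].label == tag:
                del self._stack[index:]
                break


def build_tree(html_text):
    builder = TreeBuilder()
    builder.feed(html_text)
    builder.close()
    return builder.root


def outline(node, depth=0):
    """Return the tree as indented lines, one tag per line."""
    lines = ["  " * depth + node.label]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


tree = build_tree(
    "<html><body><div><p>Hi<br></p></div><footer></footer></body></html>"
)
print("\n".join(outline(tree)))
```

Only the structure survives in such a tree, which is what makes it suitable for spotting layout changes that would break a scraper.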

It is also possible to build the tree from only a specific part of an HTML file.

By tag:

with open("target_website.html") as target_website:
    html_tree = create_html_tree(
        html_file=target_website,
        specific_tag="footer"
    )

By tag and class:

with open("target_website.html") as target_website:
    html_tree = create_html_tree(
        html_file=target_website,
        specific_tag="div",
        class_="main-div"
    )

By tag and id:

with open("target_website.html") as target_website:
    html_tree = create_html_tree(
        html_file=target_website,
        specific_tag="div",
        id="super-div"
    )
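
The idea behind selecting by tag, class, or id can be sketched with the standard library alone. The code below is an illustration under stated assumptions, not how vanguard does it: SubtreeTags and select are hypothetical names, the markup is assumed to be well nested, and only the first matching element is kept.

```python
from html.parser import HTMLParser

# Void elements have no closing tag and must not change the nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class SubtreeTags(HTMLParser):
    """Collects (depth, tag) pairs inside the first element that matches."""

    def __init__(self, tag, class_=None, id_=None):
        super().__init__()
        self.tag, self.class_, self.id_ = tag, class_, id_
        self.depth = 0      # greater than 0 while inside the matched element
        self.done = False   # True once the matched element has been closed
        self.tags = []

    def _matches(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        return (
            tag == self.tag
            and (self.class_ is None or self.class_ in classes)
            and (self.id_ is None or attrs.get("id") == self.id_)
        )

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth or self._matches(tag, attrs):
            self.tags.append((self.depth, tag))
            if tag not in VOID_TAGS:
                self.depth += 1

    def handle_endtag(self, tag):
        if self.done or not self.depth or tag in VOID_TAGS:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True


def select(html_text, tag, class_=None, id_=None):
    parser = SubtreeTags(tag, class_=class_, id_=id_)
    parser.feed(html_text)
    parser.close()
    return parser.tags


PAGE = (
    '<body><div class="nav"><a></a></div>'
    '<div id="super-div" class="main-div wide"><p><span></span></p></div>'
    "<footer><a></a></footer></body>"
)

print(select(PAGE, "footer"))
print(select(PAGE, "div", class_="main-div"))
print(select(PAGE, "div", id_="super-div"))
```

Restricting the comparison to one region of the page keeps unrelated changes, such as a rotating banner, from inflating the distance.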

Calculating distance

As mentioned earlier, the Zhang-Shasha algorithm computes the edit distance between two given trees. The zss package does this work behind the scenes; vanguard only provides a way to convert HTML files into trees.

from vanguardkit import create_html_tree, calcuate_html_tree_distance

with open("stored_target_website.html") as stored_file:
    with open("current_target_website.html") as current_file:
        previous_tree = create_html_tree(stored_file)
        current_tree = create_html_tree(current_file)
        print(calcuate_html_tree_distance(previous_tree, current_tree))
        # Prints 1
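
To make the printed number concrete, the sketch below computes an ordered tree edit distance with unit costs (insert, delete, or relabel a node each cost 1). It is not the Zhang-Shasha algorithm itself, nor the code zss or vanguard runs: it is the plain memoized recurrence over forests that Zhang-Shasha evaluates more efficiently. The names node, forest_distance and tree_distance are hypothetical.

```python
from functools import lru_cache


def node(label, *children):
    """A tree is a (label, children) pair; children is a tuple of trees."""
    return (label, children)


def size(forest):
    """Number of nodes in a forest (a tuple of trees)."""
    return sum(1 + size(children) for _, children in forest)


@lru_cache(maxsize=None)
def forest_distance(forest1, forest2):
    if not forest1 and not forest2:
        return 0
    if not forest1:
        return size(forest2)  # insert everything that is missing
    if not forest2:
        return size(forest1)  # delete everything that is left over
    (label1, kids1), (label2, kids2) = forest1[-1], forest2[-1]
    # Delete the rightmost root of forest1; its children take its place.
    delete = forest_distance(forest1[:-1] + kids1, forest2) + 1
    # Insert the rightmost root of forest2.
    insert = forest_distance(forest1, forest2[:-1] + kids2) + 1
    # Match the two rightmost roots, relabelling if their labels differ.
    match = (
        forest_distance(forest1[:-1], forest2[:-1])
        + forest_distance(kids1, kids2)
        + (label1 != label2)
    )
    return min(delete, insert, match)


def tree_distance(tree1, tree2):
    return forest_distance((tree1,), (tree2,))


stored = node("html", node("body", node("div", node("p")), node("footer")))
current = node("html", node("body", node("div", node("span")), node("footer")))
print(tree_distance(stored, current))  # 1: the <p> was relabelled to <span>
```

A distance of 0 means the two structures are identical; larger values mean more nodes were inserted, deleted, or renamed between the stored and the current page.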

Because the VanguardNode class implements the __sub__ dunder method, the edit distance can also be calculated with the subtraction operator:

from vanguardkit import create_html_tree, calcuate_html_tree_distance

with open("stored_target_website.html") as stored_file:
    with open("current_target_website.html") as current_file:
        previous_tree = create_html_tree(stored_file)
        current_tree = create_html_tree(current_file)
        print(previous_tree - current_tree)
        # Prints 1

Consequently, the following expression evaluates to True:

calcuate_html_tree_distance(previous_tree, current_tree) == previous_tree - current_tree
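
The equivalence above follows from how operator overloading works in Python. The sketch below shows the general pattern with a hypothetical DiffableNode class. It is not the VanguardNode source, and toy_distance is only a placeholder (a difference in node counts), not a real edit distance.

```python
class Node:
    """Hypothetical base node: a label plus a list of child nodes."""

    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)


def count_nodes(node):
    return 1 + sum(count_nodes(child) for child in node.children)


def toy_distance(tree1, tree2):
    # Placeholder metric for illustration only, NOT a tree edit distance.
    return abs(count_nodes(tree1) - count_nodes(tree2))


class DiffableNode(Node):
    def __sub__(self, other):
        # "a - b" is evaluated as a.__sub__(b), so the operator can simply
        # delegate to the distance function.
        if not isinstance(other, Node):
            return NotImplemented
        return toy_distance(self, other)


previous_tree = DiffableNode("div", [DiffableNode("p"), DiffableNode("a")])
current_tree = DiffableNode("div", [DiffableNode("p")])

print(previous_tree - current_tree)
print(toy_distance(previous_tree, current_tree) == previous_tree - current_tree)
```

Returning NotImplemented for unsupported operands lets Python raise the usual TypeError instead of failing inside the distance function.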

Development setup

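This project uses Poetry for dependency resolution. Poetry handles both package installation and virtual environment management, much like pip and virtualenv combined. Follow these instructions to set up the development environment.

Install Poetry, clone the repository, and install the dependencies: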

$ pip install poetry
$ git clone https://github.com/dany2691/vanguard-kit.git
$ cd vanguard-kit
$ poetry install

To run the test suite, from inside the vanguard-kit directory:

$ poetry run pytest test/ -vv

Meta

Daniel Omar Vergara Pérez - @__danvergara - daniel.omar.vergara@gmail.com - github.com/danvergara

Valery Briz - @valerybriz - github.com/valerybriz

Contributing

  1. Fork it (https://github.com/BentoBox-Project/vanguard-kit)
  2. Create your feature branch (git checkout -b feature/fooBar)
  3. Commit your changes (git commit -am 'Add some fooBar')
  4. Push to the branch (git push origin feature/fooBar)
  5. Create a new Pull Request
