A convenient way to calculate the edit distance between html files
Vanguard kit
A convenient way to calculate the edit distance between html files to scrape with confidence
Scraping can become a hard task because web sites change continuously. What if there were a way to detect those changes before scraping a site? Vanguard is a tool kit that calculates the edit distance between two HTML files using the Zhang-Shasha algorithm. This package is based on zss.
Installation
OS X & Linux:
From PyPI
$ pip3 install vanguardkit
From source
$ git clone https://github.com/dany2691/vanguard-kit.git
$ cd vanguard-kit
$ python3 setup.py install
Usage example
With vanguard, HTML content can be converted into a tree (graph) of nodes. The create_html_tree function is responsible for this: it returns an instance of the VanguardNode class, which inherits from the zss.Node class:
from vanguardkit import create_html_tree

with open("target_website.html") as target_website:
    html_tree = create_html_tree(target_website)
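To illustrate what turning HTML into a tree of nodes means, here is a minimal sketch that uses only the standard library's html.parser. It is not vanguard's implementation: SketchNode, TreeBuilder, build_tree and outline are hypothetical names chosen for this example, and only tag names are kept as node labels.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class SketchNode:
    """Hypothetical stand-in for a tree node: a tag label plus ordered children."""

    def __init__(self, label):
        self.label = label
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a SketchNode tree from the tags of an HTML document."""

    def __init__(self):
        super().__init__()
        self.root = SketchNode("document")  # artificial root above <html>
        self._stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = SketchNode(tag)
        self._stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self._stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; stray closing tags are ignored.
        for index in range(len(self._stack) - 1, 0, -1):
            if self._stack[index].label == tag:
                del self._stack[index:]
                break


def build_tree(html_text):
    builder = TreeBuilder()
    builder.feed(html_text)
    builder.close()
    return builder.root


def outline(node, depth=0):
    """Return the tree as indented lines, one tag per line."""
    lines = ["  " * depth + node.label]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


tree = build_tree(
    "<html><body><div><p>Hi</p><br></div><footer></footer></body></html>"
)
print("\n".join(outline(tree)))
```

Text content and attributes are dropped in this sketch, so two pages with the same tag structure produce the same tree. Whether vanguard does the same is not stated here.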
It is also possible to build the tree from a specific part of an HTML file.
By tag:
with open("target_website.html") as target_website:
    html_tree = create_html_tree(
        html_file=target_website,
        specific_tag="footer"
    )
By tag and class:
with open("target_website.html") as target_website:
    html_tree = create_html_tree(
        html_file=target_website,
        specific_tag="div",
        class_="main-div"
    )
By tag and id:
with open("target_website.html") as target_website:
    html_tree = create_html_tree(
        html_file=target_website,
        specific_tag="div",
        id="super-div"
    )
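The selection by tag, class and id can be pictured with a small standard-library sketch. This is an assumption about the general idea, not vanguard's code: SubtreeCollector and collect_subtree are hypothetical names, and the sketch only records the tags found under the first matching element.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class SubtreeCollector(HTMLParser):
    """Collects (depth, tag) pairs under the first element that matches."""

    def __init__(self, tag, required_attrs):
        super().__init__()
        self.tag = tag
        self.required_attrs = required_attrs
        self.open_tags = []   # open tags inside the matched element
        self.collected = []   # (depth, tag) pairs in document order
        self.done = False

    def _matches(self, tag, attrs):
        if tag != self.tag:
            return False
        attrs = dict(attrs)
        for name, wanted in self.required_attrs.items():
            # class holds a space-separated list, so compare token by token.
            if wanted not in (attrs.get(name) or "").split():
                return False
        return True

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if not self.open_tags and not self._matches(tag, attrs):
            return
        self.collected.append((len(self.open_tags), tag))
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)
        elif not self.open_tags:
            self.done = True  # the matched element itself was a void tag

    def handle_endtag(self, tag):
        if self.done or tag not in self.open_tags:
            return
        # Close back to the most recent open tag with this name.
        while self.open_tags.pop() != tag:
            pass
        if not self.open_tags:
            self.done = True  # the matched element has been closed


def collect_subtree(html_text, tag, class_=None, id=None):
    # The keyword names mirror the ones used in the examples above.
    required = {}
    if class_ is not None:
        required["class"] = class_
    if id is not None:
        required["id"] = id
    collector = SubtreeCollector(tag, required)
    collector.feed(html_text)
    collector.close()
    return collector.collected


page = (
    '<body><div id="nav"><a></a></div>'
    '<div class="main-div wide"><p></p><img></div>'
    "<footer><span></span></footer></body>"
)
print(collect_subtree(page, "footer"))
print(collect_subtree(page, "div", class_="main-div"))
print(collect_subtree(page, "div", id="nav"))
```

Restricting the comparison to one stable region of a page, such as the main content container, keeps unrelated changes elsewhere on the page from inflating the distance.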
Calculating distance
As mentioned above, the algorithm used is Zhang-Shasha, which computes the edit distance between two given trees. The zss package does this behind the scenes; vanguard only provides a way to convert HTML files into trees.
from vanguardkit import create_html_tree, calcuate_html_tree_distance

with open("stored_target_website.html") as stored_file:
    with open("current_target_website.html") as current_file:
        previous_tree = create_html_tree(stored_file)
        current_tree = create_html_tree(current_file)

print(calcuate_html_tree_distance(previous_tree, current_tree))
# Prints 1
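To make the notion of a tree edit distance concrete, the following sketch computes it for small trees with a memoized forest recursion. It is not the Zhang-Shasha dynamic program that zss implements, only a simpler formulation that yields the same ordered edit distance under the unit costs assumed here (insert 1, delete 1, relabel 1). The names tree, forest_distance and tree_distance are hypothetical.

```python
from functools import lru_cache


def tree(label, *children):
    """A tree is a hashable pair: (label, tuple of child trees)."""
    return (label, tuple(children))


@lru_cache(maxsize=None)
def forest_distance(first, second):
    """Edit distance between two ordered forests (tuples of trees)."""
    if not first and not second:
        return 0
    if not first:
        # Only insertions remain: insert the rightmost root of `second`.
        _, kids = second[-1]
        return forest_distance(first, second[:-1] + kids) + 1
    if not second:
        # Only deletions remain: delete the rightmost root of `first`.
        _, kids = first[-1]
        return forest_distance(first[:-1] + kids, second) + 1

    (f_label, f_kids), (s_label, s_kids) = first[-1], second[-1]
    # Deleting a root promotes its children into the forest.
    delete = forest_distance(first[:-1] + f_kids, second) + 1
    insert = forest_distance(first, second[:-1] + s_kids) + 1
    # Match the two rightmost roots, relabeling if their labels differ.
    relabel = (
        forest_distance(f_kids, s_kids)
        + forest_distance(first[:-1], second[:-1])
        + (0 if f_label == s_label else 1)
    )
    return min(delete, insert, relabel)


def tree_distance(first, second):
    return forest_distance((first,), (second,))


stored = tree("html", tree("body", tree("div"), tree("p")))
current = tree("html", tree("body", tree("div"), tree("span")))
print(tree_distance(stored, current))  # 1, one relabel: p -> span
```

A distance of 0 means the two trees are identical under this model, and each inserted, deleted or renamed node adds 1. The recursion is only practical for small trees, which is one reason to rely on zss for real pages.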
Because the VanguardNode class implements the __sub__ dunder method, the edit distance can also be calculated with the subtraction operator:
from vanguardkit import create_html_tree, calcuate_html_tree_distance

with open("stored_target_website.html") as stored_file:
    with open("current_target_website.html") as current_file:
        previous_tree = create_html_tree(stored_file)
        current_tree = create_html_tree(current_file)

print(previous_tree - current_tree)
# Prints 1
As a result, the following expression evaluates to True:
calcuate_html_tree_distance(previous_tree, current_tree) == previous_tree - current_tree
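The equivalence above follows from how Python maps the - operator to __sub__. The sketch below shows that delegation pattern with a hypothetical SketchNode class. The metric is a deliberately crude placeholder (it counts labels present in one tree but not in the other), not Zhang-Shasha and not vanguard's implementation.

```python
from collections import Counter


class SketchNode:
    """Hypothetical node whose subtraction operator returns a distance."""

    def __init__(self, label, *children):
        self.label = label
        self.children = list(children)

    def labels(self):
        """Yield this node's label followed by those of all descendants."""
        yield self.label
        for child in self.children:
            yield from child.labels()

    def __sub__(self, other):
        # `a - b` calls a.__sub__(b); delegate to the distance function.
        if not isinstance(other, SketchNode):
            return NotImplemented
        return placeholder_distance(self, other)


def placeholder_distance(first, second):
    """Placeholder metric: labels found in one tree but not in the other."""
    a, b = Counter(first.labels()), Counter(second.labels())
    return sum(((a - b) + (b - a)).values())


stored = SketchNode("html", SketchNode("body", SketchNode("div"), SketchNode("p")))
current = SketchNode("html", SketchNode("body", SketchNode("div")))

print(stored - current)
print(placeholder_distance(stored, current) == stored - current)
```

Returning NotImplemented for foreign operands lets Python raise a TypeError instead of silently producing a meaningless number.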
Development setup
This project uses Poetry for dependency resolution. It is a kind of mix between pip and virtualenv. Follow the next instructions to set up the development environment.
First, install Poetry, then clone the repository and install the dependencies:
$ pip install poetry
$ git clone https://github.com/dany2691/vanguard-kit.git
$ cd vanguard-kit
$ poetry install
To run the test suite, from inside the vanguard-kit directory:
$ poetry run pytest test/ -vv
Meta
Daniel Omar Vergara Pérez - @__danvergara__ - daniel.omar.vergara@gmail.com - github.com/danvergara
Valery Briz - @valerybriz - github.com/valerybriz
Contributing
- Fork it (https://github.com/BentoBox-Project/vanguard-kit)
- Create your feature branch (git checkout -b feature/fooBar)
- Commit your changes (git commit -am 'Add some fooBar')
- Push to the branch (git push origin feature/fooBar)
- Create a new Pull Request