A set of similarity metrics to compare HTML files.
This package provides a set of functions to measure the similarity between web pages.
Install
The quick way:
pip install html-similarity
How it works
Structural Similarity
Uses sequence comparison of the HTML tags to compute the similarity.
We do not implement similarity based on tree edit distance because it is slower than sequence comparison.
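The idea can be sketched with the standard library alone. This is a minimal illustration, not the package's implementation: the names TagCollector, tag_sequence and structural_similarity_sketch are hypothetical, and html.parser is used here only to keep the example dependency-free. Because a different parser yields a different tag sequence (for example, one that wraps fragments in html and body tags), the values will generally not match those returned by structural_similarity.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper: records opening tag names in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html):
    """Return the list of opening tags found in an HTML string."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def structural_similarity_sketch(html_1, html_2):
    """Compare two documents by the similarity of their tag sequences."""
    tags_1 = tag_sequence(html_1)
    tags_2 = tag_sequence(html_2)
    # ratio() is 2 * matches / (len(tags_1) + len(tags_2)), a value in [0, 1].
    return SequenceMatcher(None, tags_1, tags_2).ratio()


doc_a = '<h1>A</h1><ul><li>x</li><li>y</li></ul>'
doc_b = '<h1>B</h1><ul><li>z</li></ul>'
# Tag sequences are [h1, ul, li, li] and [h1, ul, li]: 3 matches out of 7 tags,
# so the ratio is 6/7 (about 0.857).
print(structural_similarity_sketch(doc_a, doc_b))
```

Sequence comparison only looks at the flattened order of tags, which is what makes it cheaper than a tree edit distance.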
Style Similarity
Extracts the CSS classes of each HTML document and calculates the Jaccard similarity of the two sets of classes.
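As a rough sketch of this step, the following collects class names with the standard library and computes the Jaccard similarity, the size of the intersection divided by the size of the union. The names ClassCollector, css_classes, jaccard and style_similarity_sketch are hypothetical, and returning 1.0 when neither document has any class is an assumption made here, not documented behavior of the package.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Hypothetical helper: gathers every CSS class name used in a document."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                # A class attribute may hold several space-separated names.
                self.classes.update(value.split())


def css_classes(html):
    """Return the set of class names found in an HTML string."""
    collector = ClassCollector()
    collector.feed(html)
    collector.close()
    return collector.classes


def jaccard(set_1, set_2):
    """Jaccard similarity: |intersection| / |union|, a value in [0, 1]."""
    if not set_1 and not set_2:
        return 1.0  # assumption: two class-free documents count as identical
    return len(set_1 & set_2) / len(set_1 | set_2)


def style_similarity_sketch(html_1, html_2):
    return jaccard(css_classes(html_1), css_classes(html_2))


page_a = '<div class="menu active"><p class="title">A</p></div>'
page_b = '<div class="menu"><p class="footer">B</p></div>'
# Classes are {menu, active, title} and {menu, footer}:
# 1 shared class out of 4 distinct classes, so the result is 0.25.
print(style_similarity_sketch(page_a, page_b))
```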
Joint Similarity (Structural Similarity and Style Similarity)
The joint similarity metric is calculated as:
k * structural_similarity(document_1, document_2) + (1 - k) * style_similarity(document_1, document_2)
All the similarity metrics take values between 0 and 1.
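The weighting can be checked by hand. The sketch below applies the formula to the two scores shown in the Examples section (structural 0.9090909090909091, style 1.0). The function name joint_similarity_sketch is hypothetical, and the default k=0.5 is an assumption inferred from the fact that it reproduces the joint score in that example.

```python
def joint_similarity_sketch(structural, style, k=0.5):
    """Weighted mix of the two scores; k is the weight of the structural part."""
    if not 0.0 <= k <= 1.0:
        raise ValueError("k must lie between 0 and 1")
    return k * structural + (1 - k) * style


structural = 0.9090909090909091
style = 1.0

# Equal weights: roughly 0.9545, the joint score shown in the Examples section.
print(joint_similarity_sketch(structural, style, k=0.5))

# k=0.3 puts more weight on style: roughly 0.9727.
print(joint_similarity_sketch(structural, style, k=0.3))
```

Since both inputs lie between 0 and 1 and the weights sum to 1, the joint score stays in the same range.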
Recommendations for joint similarity
Using k=0.3 gives us better results. Style similarity carries more information about how similar two pages are than structural similarity does.
Examples
Here is an example:

    In [1]: html_1 = '''
    <h1 class="title">First Document</h1>
    <ul class="menu">
      <li class="active">Documents</li>
      <li>Extra</li>
    </ul>
    '''

    In [2]: html_2 = '''
    <h1 class="title">Second document Document</h1>
    <ul class="menu">
      <li class="active">Extra Documents</li>
    </ul>
    '''

    In [3]: from html_similarity import style_similarity, structural_similarity, similarity

    In [4]: style_similarity(html_1, html_2)
    Out[4]: 1.0

    In [7]: structural_similarity(html_1, html_2)
    Out[7]: 0.9090909090909091

    In [8]: similarity(html_1, html_2)
    Out[8]: 0.9545454545454546
References
The idea of sequence comparison was taken from Page Compare.
The other ideas were taken from T. Gowda and C. A. Mattmann, Clustering Web Pages Based on Structure and Style Similarity, 2016 IEEE 17th International Conference on Information Reuse and Integration (IRI), Pittsburgh, PA, 2016, pp. 175-180.
Use case
Clustering web pages based on structure and style similarity.