A set of similarity metrics to compare HTML files.
===============
HTML Similarity
===============
[![Build Status](https://travis-ci.org/matiskay/html-similarity.svg?branch=master)](https://travis-ci.org/matiskay/html-similarity)
This package provides a set of functions to measure the similarity between web pages.
Install
=======
The quick way::

    pip install html-similarity

How does it work?
=================
Structural Similarity
---------------------
We use sequence comparison of the HTML tags to compute the structural similarity instead of
tree edit distance, because tree edit distance is slower than sequence comparison.
The idea of sequence comparison was taken from [Page Compare](https://github.com/TeamHG-Memex/page-compare).
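
The sequence-comparison idea can be illustrated with the standard library alone. This is a minimal sketch, not the package's actual implementation: the names `TagCollector`, `tag_sequence` and `structural_similarity_sketch` are hypothetical, and using `difflib.SequenceMatcher` as the comparison routine is an assumption.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper: records opening tag names in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html):
    """Return the list of opening tags found in an HTML string."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def structural_similarity_sketch(html_1, html_2):
    """Compare two documents by the order of their tags (assumed approach)."""
    matcher = SequenceMatcher(None, tag_sequence(html_1), tag_sequence(html_2))
    return matcher.ratio()  # 2 * matches / total number of tags


page_a = "<div><p>Hello</p></div>"
page_b = "<div><span>Hello</span></div>"
print(structural_similarity_sketch(page_a, page_a))  # 1.0
print(structural_similarity_sketch(page_a, page_b))  # 0.5
```

Only the tag names are compared here, so two pages with the same layout but different text score as identical.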
Style Similarity
----------------
Extracts the CSS classes of each HTML document and calculates the Jaccard similarity of the two sets of classes.
The idea was taken from [1]_.
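
As an illustration, the Jaccard similarity of two class sets is the size of their intersection divided by the size of their union. The sketch below uses only the standard library; the names `ClassCollector`, `css_classes` and `style_similarity_sketch` are hypothetical, and returning 0.0 when neither document has any class is an assumption, not something stated in this README.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Hypothetical helper: gathers every CSS class used in a document."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                # A class attribute may hold several space-separated names.
                self.classes.update(value.split())


def css_classes(html):
    """Return the set of CSS class names found in an HTML string."""
    collector = ClassCollector()
    collector.feed(html)
    collector.close()
    return collector.classes


def jaccard_similarity(set_1, set_2):
    """|A & B| / |A | B|; 0.0 for two empty sets (assumed convention)."""
    union = set_1 | set_2
    if not union:
        return 0.0
    return len(set_1 & set_2) / len(union)


def style_similarity_sketch(html_1, html_2):
    return jaccard_similarity(css_classes(html_1), css_classes(html_2))


page_a = '<div class="header nav"></div>'
page_b = '<p class="nav footer"></p>'
# Shared: {"nav"}; union: {"header", "nav", "footer"}, so the result is 1/3.
print(style_similarity_sketch(page_a, page_b))
```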
Joint Similarity (Structural Similarity and Style Similarity)
-------------------------------------------------------------
The joint similarity metric is calculated as::

    k * structural_similarity(document_1, document_2) + (1 - k) * style_similarity(document_1, document_2)

This was taken from [1]_.
For ``k`` between 0 and 1, the value lies in the interval [0, 1].
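
The weighted sum can be checked with a few lines of arithmetic. In this sketch the function name `joint_similarity_sketch` is hypothetical, the two input scores are made-up values for illustration, and rejecting `k` outside [0, 1] is an assumption rather than documented behaviour of the package.

```python
def joint_similarity_sketch(structural, style, k):
    """Blend two scores in [0, 1] using weight k for the structural part."""
    if not 0.0 <= k <= 1.0:
        # Assumption: k outside [0, 1] would let the result leave [0, 1].
        raise ValueError("k must lie in the interval [0, 1]")
    return k * structural + (1 - k) * style


# Illustrative scores only: structural = 0.5, style = 1.0.
# With k = 0.3: 0.3 * 0.5 + 0.7 * 1.0, which is approximately 0.85.
score = joint_similarity_sketch(0.5, 1.0, k=0.3)
print(round(score, 2))  # 0.85
```

Because the two weights sum to 1, the result is a weighted average of the inputs and cannot fall outside their range.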
Recommendations for joint similarity
------------------------------------
Using `k=0.3` gives us better results. The style similarity can give more information
about the similarity than the structural similarity.
References
==========
.. [1] T. Gowda and C. A. Mattmann, "Clustering Web Pages Based on Structure and Style Similarity (Application Paper)," 2016 IEEE 17th International Conference on Information Reuse and Integration (IRI), Pittsburgh, PA, 2016, pp. 175-180.
   doi: 10.1109/IRI.2016.30. http://ieeexplore.ieee.org/document/7785739/
Development
===========
See the `CONTRIBUTING.md` file.
TODO
====
* [ ] Add information about the package on PyPI
* [ ] Add documentation
* [ ] Add examples