A set of similarity metrics to compare HTML files
niteru
This package provides a set of functions to measure the similarity between HTML documents.
Note: This is a fork of html-similarity.
Key differences
- Type hints
  - All functions have proper type hints
- Dependency free
  - Works with plain Python, with no third-party packages required
Installation
pip install niteru
How it works
Structural Similarity
Compares the sequences of HTML tags in the two documents to compute the similarity.
Similarity based on tree edit distance is not implemented because it is slower than sequence comparison.
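As an illustration of sequence comparison over tags, here is a minimal sketch that uses only the standard library (`html.parser` and `difflib`). The names `TagCollector`, `tag_sequence` and `structural_similarity_sketch` are hypothetical, and this is not necessarily how niteru implements the metric internally.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser
from typing import List


class TagCollector(HTMLParser):
    """Records the name of every opening tag in document order."""

    def __init__(self) -> None:
        super().__init__()
        self.tags: List[str] = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <br/> also end up here by default.
        self.tags.append(tag)


def tag_sequence(html: str) -> List[str]:
    """Return the flat list of tag names found in an HTML string."""
    parser = TagCollector()
    parser.feed(html)
    parser.close()
    return parser.tags


def structural_similarity_sketch(html1: str, html2: str) -> float:
    """Assumed approach: similarity ratio of the two tag sequences."""
    matcher = SequenceMatcher(None, tag_sequence(html1), tag_sequence(html2))
    return matcher.ratio()
```

`SequenceMatcher.ratio()` returns 2 * M / T, where M is the number of matching elements and T is the combined length of both sequences. For the tag sequences `h1, ul, li, li` and `h1, ul, li` that is 2 * 3 / 7, or about 0.857.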
Style Similarity
Extracts the CSS classes of each HTML document and calculates the Jaccard similarity of the two sets of classes.
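The idea can be sketched with the standard library alone. The names `ClassCollector`, `css_classes`, `jaccard` and `style_similarity_sketch` are hypothetical, and the handling of two documents without any classes is an assumption rather than documented niteru behaviour.

```python
from html.parser import HTMLParser
from typing import Set


class ClassCollector(HTMLParser):
    """Gathers every CSS class name used in class="..." attributes."""

    def __init__(self) -> None:
        super().__init__()
        self.classes: Set[str] = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                # A class attribute may hold several space-separated names.
                self.classes.update(value.split())


def css_classes(html: str) -> Set[str]:
    """Return the set of CSS class names found in an HTML string."""
    parser = ClassCollector()
    parser.feed(html)
    parser.close()
    return parser.classes


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    if not union:
        # Assumption: two documents with no classes count as identical.
        return 1.0
    return len(a & b) / len(union)


def style_similarity_sketch(html1: str, html2: str) -> float:
    return jaccard(css_classes(html1), css_classes(html2))
```

Two documents that use exactly the same set of classes score 1.0, regardless of how often or where each class appears.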
Joint Similarity (Structural Similarity and Style Similarity)
The joint similarity metric is calculated as:
k * structural_similarity(html1, html2) + (1 - k) * style_similarity(html1, html2)
All the similarity metrics take values between 0.0 and 1.0.
Recommendations for joint similarity
Using k=0.3 tends to give better results, because the style similarity carries more information about how similar two documents are than the structural similarity does.
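To make the effect of k concrete, the weighting can be sketched as a small standalone function. `joint_similarity` is a hypothetical helper, not part of niteru, and its default of k=0.5 is an assumption (the joint score reported in the Examples section is consistent with that value).

```python
def joint_similarity(structural: float, style: float, k: float = 0.5) -> float:
    """Weighted mix of the two scores; k is the weight of the structural part."""
    if not 0.0 <= k <= 1.0:
        raise ValueError("k must be between 0.0 and 1.0")
    return k * structural + (1 - k) * style


# Scores taken from the Examples section of this document.
structural = 0.8571428571428571
style = 1.0

balanced = joint_similarity(structural, style, k=0.5)     # about 0.929
style_heavy = joint_similarity(structural, style, k=0.3)  # about 0.957
```

With k=0.3 the style score contributes 70% of the result, so documents that share the same classes stay close to 1.0 even when their tag sequences differ.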
Examples
Here is an example:
html1 = '''
<h1 class="title">First Document</h1>
<ul class="menu">
<li class="active">Documents</li>
<li>Extra</li>
</ul>
'''
html2 = '''
<h1 class="title">Second document Document</h1>
<ul class="menu">
<li class="active">Extra Documents</li>
</ul>
'''
from niteru import style_similarity, structural_similarity, similarity
style_similarity(html1, html2) # => 1.0
structural_similarity(html1, html2) # => 0.8571428571428571
similarity(html1, html2) # => 0.9285714285714286