A set of similarity metrics to compare HTML files

niteru

This package provides a set of functions to measure the similarity between HTML documents.

Note: This is a fork of html-similarity.

Key differences

  • Type hints
    • All functions have proper type hints
  • Dependency free
    • Works with plain Python; no third-party packages are required

Installation

pip install niteru

How it works

Structural Similarity

Compares the sequences of HTML tags in the two documents to compute the similarity.

Similarity based on tree edit distance is not implemented because it is slower than sequence comparison.
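
The idea can be sketched with the standard library alone. This is a minimal illustration, not niteru's actual implementation: it assumes tags are collected with `html.parser` and compared with `difflib.SequenceMatcher`, and the names `TagCollector`, `tag_sequence`, and `structural_similarity_sketch` are hypothetical.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser
from typing import List


class TagCollector(HTMLParser):
    """Records the name of every opening tag, in document order."""

    def __init__(self) -> None:
        super().__init__()
        self.tags: List[str] = []

    def handle_starttag(self, tag, attrs):
        # Text content and attributes are ignored; only the tag name matters.
        self.tags.append(tag)


def tag_sequence(html: str) -> List[str]:
    """Return the opening tags of an HTML string as a flat list."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def structural_similarity_sketch(html1: str, html2: str) -> float:
    """Ratio in [0.0, 1.0] of matching tags between the two tag sequences."""
    matcher = SequenceMatcher(None, tag_sequence(html1), tag_sequence(html2))
    return matcher.ratio()
```

For the tag sequences `h1, ul, li, li` and `h1, ul, li`, three tags match out of seven in total, so the ratio is 2 * 3 / 7, roughly 0.857.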

Style Similarity

Extracts the CSS classes of each HTML document and calculates the Jaccard similarity of the two sets of classes.
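
A rough sketch of this metric using only the standard library is shown here. It is an illustration under stated assumptions rather than niteru's own code: the names `ClassCollector`, `css_classes`, `jaccard`, and `style_similarity_sketch` are hypothetical, and treating two documents without any classes as identical (1.0) is an assumption.

```python
from html.parser import HTMLParser
from typing import Set


class ClassCollector(HTMLParser):
    """Gathers every CSS class name found in a `class` attribute."""

    def __init__(self) -> None:
        super().__init__()
        self.classes: Set[str] = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                # A class attribute may list several space-separated names.
                self.classes.update(value.split())


def css_classes(html: str) -> Set[str]:
    """Return the set of CSS classes used anywhere in an HTML string."""
    collector = ClassCollector()
    collector.feed(html)
    collector.close()
    return collector.classes


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by the size of the union."""
    if not a and not b:
        return 1.0  # assumption: two empty sets count as identical
    return len(a & b) / len(a | b)


def style_similarity_sketch(html1: str, html2: str) -> float:
    return jaccard(css_classes(html1), css_classes(html2))
```

Because the comparison works on sets, how often a class appears and where it appears in the document do not affect the score.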

Joint Similarity (Structural Similarity and Style Similarity)

The joint similarity metric is calculated as:

k * structural_similarity(html1, html2) + (1 - k) * style_similarity(html1, html2)

All the similarity metrics take values between 0.0 and 1.0.
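
The weighting can be written out as a small helper. This is only a sketch of the formula above: `joint_similarity` is a hypothetical name, it takes the two scores as plain floats instead of HTML strings, and the default `k=0.5` is an assumption chosen because it is consistent with the example output in this README.

```python
def joint_similarity(structural: float, style: float, k: float = 0.5) -> float:
    """Weighted mix of two scores: k * structural + (1 - k) * style."""
    if not 0.0 <= k <= 1.0:
        # Keeping k in [0, 1] keeps the result in [0, 1] as well.
        raise ValueError("k must be between 0.0 and 1.0")
    return k * structural + (1 - k) * style


# Structural score 6/7 (about 0.857) and style score 1.0:
balanced = joint_similarity(6 / 7, 1.0)            # about 0.93
style_heavy = joint_similarity(6 / 7, 1.0, k=0.3)  # about 0.957
```

With `k=1.0` the result is the structural score alone, and with `k=0.0` it is the style score alone.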

Recommendations for joint similarity

Using k=0.3 tends to give better results, because style similarity carries more information about how alike two documents are than structural similarity does.

Examples

Here is an example:

html1 = '''
<h1 class="title">First Document</h1>
<ul class="menu">
  <li class="active">Documents</li>
  <li>Extra</li>
</ul>
'''

html2 = '''
<h1 class="title">Second document Document</h1>
<ul class="menu">
  <li class="active">Extra Documents</li>
</ul>
'''

from niteru import style_similarity, structural_similarity, similarity

style_similarity(html1, html2) # => 1.0
structural_similarity(html1, html2) # => 0.8571428571428571
similarity(html1, html2) # => 0.9285714285714286
