A set of similarity metrics to compare HTML files.

Project description

This package provides a set of functions to measure the similarity between web pages.

Install

The quick way:

pip install html-similarity

How it works

Structural Similarity

Uses sequence comparison of the HTML tags to compute the similarity.

We do not implement similarity based on tree edit distance because it is slower than sequence comparison.
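
As a rough illustration of sequence comparison over tags, here is a minimal sketch that uses only the Python standard library. The names TagCollector, tag_sequence and structural_similarity_sketch are hypothetical and are not part of this package. The package's own parser may produce a different tag sequence (for example by adding wrapper elements), so this sketch is not expected to reproduce the package's exact scores.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper: records opening tag names in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(document):
    """Return the list of opening tag names found in an HTML string."""
    collector = TagCollector()
    collector.feed(document)
    collector.close()
    return collector.tags


def structural_similarity_sketch(document_1, document_2):
    """Compare two documents by the similarity ratio of their tag sequences.

    Assumption: difflib's ratio (2 * matches / total length) stands in for
    whatever sequence comparison the package actually uses.
    """
    tags_1 = tag_sequence(document_1)
    tags_2 = tag_sequence(document_2)
    return SequenceMatcher(None, tags_1, tags_2).ratio()


doc_a = "<h1>First</h1><ul><li>Documents</li><li>Extra</li></ul>"
doc_b = "<h1>Second</h1><ul><li>Extra Documents</li></ul>"
score = structural_similarity_sketch(doc_a, doc_b)
```

The text content is ignored entirely: only the order of tags matters, which is why two pages built from the same template score highly even when their wording differs.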

Style Similarity

Extracts the CSS classes of each HTML document and calculates the Jaccard similarity of the two sets of classes.
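
The idea can be sketched with the standard library alone. ClassCollector, css_classes, jaccard_similarity and style_similarity_sketch are hypothetical names used for illustration, not the package's API. Returning 1.0 when neither document has any classes is an assumption of this sketch.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Hypothetical helper: gathers every CSS class name used in a document."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                # A class attribute may hold several space-separated names.
                self.classes.update(value.split())


def css_classes(document):
    """Return the set of class names found in an HTML string."""
    collector = ClassCollector()
    collector.feed(document)
    collector.close()
    return collector.classes


def jaccard_similarity(set_1, set_2):
    """Size of the intersection divided by the size of the union."""
    union = set_1 | set_2
    if not union:
        # Assumption: two documents without classes count as identical in style.
        return 1.0
    return len(set_1 & set_2) / len(union)


def style_similarity_sketch(document_1, document_2):
    return jaccard_similarity(css_classes(document_1), css_classes(document_2))


page_a = '<h1 class="title">A</h1><ul class="menu"><li class="active">x</li></ul>'
page_b = '<h1 class="title">B</h1><ul class="menu footer"><li>y</li></ul>'
style_score = style_similarity_sketch(page_a, page_b)
```

Here page_a uses {title, menu, active} and page_b uses {title, menu, footer}, so the two sets share 2 of 4 distinct classes.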

Joint Similarity (Structural Similarity and Style Similarity)

The joint similarity metric is calculated as:

k * structural_similarity(document_1, document_2) + (1 - k) * style_similarity(document_1, document_2)

All the similarity metrics take values between 0 and 1.
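
To make the weighting concrete, here is a small sketch of the formula applied to two precomputed scores. The function name joint_similarity and the range check on k are assumptions made for illustration. The input values are the structural and style scores shown in the Examples section of this document.

```python
def joint_similarity(structural, style, k=0.5):
    """Weighted combination: k * structural + (1 - k) * style.

    Assumption: k is restricted to [0, 1] so the result stays in [0, 1]
    whenever both inputs are in [0, 1].
    """
    if not 0.0 <= k <= 1.0:
        raise ValueError("k must be between 0 and 1")
    return k * structural + (1 - k) * style


structural_score = 0.9090909090909091  # structural score from the Examples section
style_score = 1.0                      # style score from the Examples section

equal_weight = joint_similarity(structural_score, style_score, k=0.5)
style_heavy = joint_similarity(structural_score, style_score, k=0.3)
```

With k=0.5 the result is about 0.9545, which agrees with the joint score shown in the Examples section. With k=0.3 the style score carries more weight and the result rises to about 0.9727.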

Recommendations for joint similarity

Using k=0.3 gives us better results. Style similarity carries more information about how similar two pages are than structural similarity does.

Examples

Here is an example:

In [1]: html_1 = '''
<h1 class="title">First Document</h1>
<ul class="menu">
    <li class="active">Documents</li>
    <li>Extra</li>
</ul>
'''

In [2]: html_2 = '''
<h1 class="title">Second document Document</h1>
<ul class="menu">
    <li class="active">Extra Documents</li>
</ul>
'''

In [3]: from html_similarity import style_similarity, structural_similarity, similarity

In [4]: style_similarity(html_1, html_2)
Out[4]: 1.0

In [7]: structural_similarity(html_1, html_2)
Out[7]: 0.9090909090909091

In [8]: similarity(html_1, html_2)
Out[8]: 0.9545454545454546
