A set of similarity metrics to compare HTML files.

Project description

This package provides a set of functions to measure the similarity between web pages.


The quick way:

pip install html-similarity

How does it work?

Structural Similarity

Uses sequence comparison of the HTML tags to compute the similarity.

We do not implement similarity based on tree edit distance because it is slower than sequence comparison.
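To illustrate the idea of comparing tag sequences, here is a minimal sketch that uses only the Python standard library. It is not the package's own implementation: the names TagCollector, tag_sequence and structural_similarity_sketch are hypothetical, and the choice of html.parser and difflib.SequenceMatcher is an assumption made for this example.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Records the name of every opening tag in document order (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html):
    """Return the list of opening tag names found in an HTML string."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def structural_similarity_sketch(html_a, html_b):
    """Similarity in [0, 1] between the two tag sequences.

    SequenceMatcher.ratio() returns 2 * M / T, where M is the number of
    matching elements and T is the total length of both sequences.
    """
    matcher = SequenceMatcher(None, tag_sequence(html_a), tag_sequence(html_b))
    return matcher.ratio()
```

Only the tag names are compared here, so two pages with the same markup skeleton but different text score 1.0.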

Style Similarity

Extracts the CSS classes of each HTML document and calculates the Jaccard similarity of the two sets of classes.
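The Jaccard similarity of two sets is the size of their intersection divided by the size of their union. The sketch below shows one way to apply it to CSS classes with the standard library. The names ClassCollector, css_classes, jaccard and style_similarity_sketch are hypothetical, and returning 1.0 when neither document has any class is an assumption of this example, not documented behaviour of the package.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Gathers every CSS class name used in a document (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # A class attribute may hold several space-separated names.
            if name == "class" and value:
                self.classes.update(value.split())


def css_classes(html):
    """Return the set of CSS class names found in an HTML string."""
    collector = ClassCollector()
    collector.feed(html)
    collector.close()
    return collector.classes


def jaccard(set_a, set_b):
    """Jaccard similarity: |A & B| / |A | B|."""
    union = set_a | set_b
    if not union:
        # Assumption: two documents without any classes count as identical.
        return 1.0
    return len(set_a & set_b) / len(union)


def style_similarity_sketch(html_a, html_b):
    return jaccard(css_classes(html_a), css_classes(html_b))
```

For instance, documents using the classes {a, b} and {b, c} share one class out of three distinct ones, so they score 1/3.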

Joint Similarity (Structural Similarity and Style Similarity)

The joint similarity metric is calculated as:

k * structural_similarity(document_1, document_2) + (1 - k) * style_similarity(document_1, document_2)

All the similarity metrics take values between 0 and 1.
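The formula is a weighted average of the two scores, so as long as k lies between 0 and 1 the result also stays between 0 and 1. A minimal sketch of the combination step follows; the function name joint_similarity and the explicit range check on k are assumptions made for illustration.

```python
def joint_similarity(structural, style, k):
    """Weighted average of two scores that each lie in [0, 1].

    k is the weight of the structural score; (1 - k) is the weight of
    the style score.
    """
    if not 0.0 <= k <= 1.0:
        raise ValueError("k must lie between 0 and 1")
    return k * structural + (1 - k) * style
```

With k=1 only the structural score counts, and with k=0 only the style score counts.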

Recommendations for joint similarity

Using k=0.3 gives us better results: it weights style similarity at 0.7 and structural similarity at 0.3, because style similarity gives more information about how similar two pages are than structural similarity does.


Here is an example:

In [1]: html_1 = '''
<h1 class="title">First Document</h1>
<ul class="menu">
    <li class="active">Documents</li>
</ul>
'''

In [2]: html_2 = '''
<h1 class="title">Second document Document</h1>
<ul class="menu">
    <li class="active">Extra Documents</li>
</ul>
'''

In [3]: from html_similarity import style_similarity, structural_similarity, similarity

In [4]: style_similarity(html_1, html_2)
Out[4]: 1.0

In [7]: structural_similarity(html_1, html_2)
Out[7]: 0.9090909090909091

In [8]: similarity(html_1, html_2)
Out[8]: 0.9545454545454546
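The three outputs above are consistent with the joint formula when k=0.5. That value is inferred from the numbers and is not stated in the text, so treat it as an assumption. The short check below plugs the printed scores into the formula and also shows what the recommended k=0.3 would give for the same pair; the helper name combine is hypothetical.

```python
import math

structural = 0.9090909090909091  # Out[7] from the session above
style = 1.0                      # Out[4] from the session above


def combine(k):
    """Joint similarity for the two scores printed above."""
    return k * structural + (1 - k) * style


# k = 0.5 (inferred, see the note above) reproduces Out[8], about 0.9545.
print(math.isclose(combine(0.5), 0.9545454545454546))

# k = 0.3 gives more weight to style similarity: about 0.9727.
print(round(combine(0.3), 4))
```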

Download files

Files for html-similarity, version 0.3.3
Filename                                 Size     File type   Python version
html_similarity-0.3.3-py3-none-any.whl   5.3 kB   Wheel       py3
html-similarity-0.3.3.tar.gz             3.5 kB   Source      None
