html-similarity

A set of similarity metrics to compare HTML files.

This package provides a set of functions to measure the similarity between web pages.

Install

The quick way:

pip install html-similarity

How does it work?

Structural Similarity

By default, computes the similarity by comparing the sequences of HTML tags in the two documents.

We do not implement similarity based on exact tree edit distance because it is slower than sequence comparison.
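To make the idea of tag-sequence comparison concrete, here is a minimal standard-library sketch. It is not the package's implementation: the names `TagCollector`, `tag_sequence` and `sequence_similarity` are hypothetical, and the parser (`html.parser`) is an assumption, so scores may differ from what `structural_similarity` returns.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects the names of opening tags in document order (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html):
    """Return the flat list of opening tag names found in `html`."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def sequence_similarity(html_a, html_b):
    """Similarity in [0, 1] between the two flat tag sequences."""
    matcher = SequenceMatcher(None, tag_sequence(html_a), tag_sequence(html_b))
    return matcher.ratio()


# Both fragments produce the tag sequence ['div', 'p', 'span'],
# so a flat comparison cannot tell them apart.
nested_a = "<div><p></p></div><span></span>"
nested_b = "<div><p></p><span></span></div>"
print(sequence_similarity(nested_a, nested_b))  # 1.0
```

The last two fragments differ only in where the `span` is nested, which shows why a flat sequence is fast but insensitive to parent/child changes.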

structural_similarity accepts an algorithm keyword to pick the comparison strategy:

indel (default): flat tag-sequence comparison using rapidfuzz’s bit-parallel Indel/LCS implementation. Fastest option, but blind to nesting (e.g. moving an element to a different parent without changing the overall tag order won’t affect the score).
pq_gram: tree-structure aware. Compares pq-gram profiles, which approximate Tree Edit Distance in roughly linear time while still capturing parent/child relationships. Slower than indel but catches structural changes that a flat sequence misses.
difflib: legacy flat tag-sequence comparison (the original implementation), kept mainly for benchmarking against indel.
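The pq-gram idea can be sketched with the standard library alone. This is an illustration under stated assumptions, not the package's code: the helper names (`Node`, `TreeBuilder`, `pq_gram_profile`, `pq_gram_similarity`), the synthetic `#root` node, the `*` padding label, and the defaults `p=2`, `q=3` are all hypothetical choices made for this example.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have children or a closing tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class Node:
    def __init__(self, label):
        self.label = label
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a simple tag tree under a synthetic '#root' node."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest matching open element; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].label == tag:
                del self.stack[i:]
                break


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def pq_gram_profile(root, p=2, q=3):
    """Bag of label tuples: p ancestors (stem) plus q consecutive children (base)."""
    profile = Counter()
    pad = ("*",) * (q - 1)

    def visit(node, ancestors):
        stem = (ancestors + (node.label,))[-p:]
        labels = tuple(child.label for child in node.children)
        if labels:
            row = pad + labels + pad
            for i in range(len(labels) + q - 1):
                profile[stem + row[i:i + q]] += 1
        else:
            profile[stem + ("*",) * q] += 1  # a leaf contributes one all-padding base
        for child in node.children:
            visit(child, stem)

    visit(root, ("*",) * (p - 1))
    return profile


def pq_gram_similarity(html_a, html_b, p=2, q=3):
    """2 * |shared grams| / (|profile A| + |profile B|), a value in [0, 1]."""
    pa = pq_gram_profile(parse(html_a), p, q)
    pb = pq_gram_profile(parse(html_b), p, q)
    shared = sum((pa & pb).values())
    return 2 * shared / (sum(pa.values()) + sum(pb.values()))
```

With this sketch, `pq_gram_similarity("<div><p></p></div><span></span>", "<div><p></p><span></span></div>")` is below 1, because moving the `span` under the `div` changes which parent/child tuples appear in the profile, even though the flat tag order is unchanged.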

See notebooks/structural_similarity_benchmark.ipynb for a notebook that compares the speed and the structural sensitivity of all three.

Style Similarity

Extracts the CSS classes of each HTML document and calculates the Jaccard similarity of the two sets of classes.
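As an illustration, the Jaccard computation over class sets could look like the sketch below. The names `ClassCollector`, `css_classes`, `jaccard` and `style_jaccard` are hypothetical, the parser is the standard library's `html.parser`, and returning 1.0 when neither document has any class is an assumed convention, not confirmed behaviour of the package.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collects every CSS class name used in a document (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def css_classes(html):
    collector = ClassCollector()
    collector.feed(html)
    collector.close()
    return collector.classes


def jaccard(set_a, set_b):
    """|A ∩ B| / |A ∪ B|; two empty sets are treated as identical (assumption)."""
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def style_jaccard(html_a, html_b):
    return jaccard(css_classes(html_a), css_classes(html_b))


print(jaccard({"title", "menu"}, {"menu", "footer"}))  # one shared class out of three
```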

Joint Similarity (Structural Similarity and Style Similarity)

The joint similarity metric is calculated as:

k * structural_similarity(document_1, document_2) + (1 - k) * style_similarity(document_1, document_2)

All the similarity metrics take values between 0 and 1.
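The weighting can be checked by hand with a small sketch. `joint_similarity` is a hypothetical helper, not the package's function, and the default `k=0.5` is an assumption inferred from the example outputs in this README (0.9545... is the plain average of 0.9090... and 1.0).

```python
def joint_similarity(structural, style, k=0.5):
    """Weighted mix of two scores in [0, 1]; k is the weight of the structural score."""
    if not 0.0 <= k <= 1.0:
        raise ValueError("k must be between 0 and 1")
    return k * structural + (1 - k) * style


# Scores taken from the example session in this README.
structural = 0.9090909090909091
style = 1.0

print(joint_similarity(structural, style))         # equal weights
print(joint_similarity(structural, style, k=0.3))  # style weighted more heavily
```

Because both inputs lie in [0, 1] and the weights sum to 1, the result also stays in [0, 1].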

Recommendations for joint similarity

Using k=0.3 gives us better results. The style similarity carries more information about how similar two pages are than the structural similarity does.

Examples

Here is an example:

In [1]: html_1 = '''
<h1 class="title">First Document</h1>
<ul class="menu">
    <li class="active">Documents</li>
    <li>Extra</li>
</ul>
'''

In [2]: html_2 = '''
<h1 class="title">Second document Document</h1>
<ul class="menu">
    <li class="active">Extra Documents</li>
</ul>
'''

In [3]: from html_similarity import style_similarity, structural_similarity, similarity

In [4]: style_similarity(html_1, html_2)
Out[4]: 1.0

In [7]: structural_similarity(html_1, html_2)
Out[7]: 0.9090909090909091

In [8]: similarity(html_1, html_2)
Out[8]: 0.9545454545454546

References

The idea of sequence comparison was taken from Page Compare.
The other ideas were taken from T. Gowda and C. A. Mattmann, Clustering Web Pages Based on Structure and Style Similarity, 2016 IEEE 17th International Conference on Information Reuse and Integration (IRI), Pittsburgh, PA, 2016, pp. 175-180.
Use case: Clustering web pages based on structure and style similarity

Thanks

Release history

0.5.0 (this version): Jul 4, 2026
0.4.1: Jun 30, 2026
0.4.0: Jun 30, 2026
0.3.3: Sep 22, 2020
0.3.2: Oct 27, 2017
0.3.1: Oct 26, 2017

Download files

Source Distribution

html_similarity-0.5.0.tar.gz (11.5 kB)

Uploaded Jul 4, 2026 Source

Built Distribution

html_similarity-0.5.0-py3-none-any.whl (10.7 kB)

Uploaded Jul 4, 2026 Python 3

File details

Details for the file html_similarity-0.5.0.tar.gz.

File metadata

Download URL: html_similarity-0.5.0.tar.gz
Upload date: Jul 4, 2026
Size: 11.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.0

File hashes

Hashes for html_similarity-0.5.0.tar.gz
Algorithm	Hash digest
SHA256	`ccbee139dd27d213546af10e608438b3c698258e000fa8f3ea9d2d9dbaf87e8d`
MD5	`0f8bc9103009d2b7ce7b05fd5f2a482a`
BLAKE2b-256	`b1f7ff70237f0707da0f50bced78566f36f2ab6eee645a1723517c519fa433dd`

File details

Details for the file html_similarity-0.5.0-py3-none-any.whl.

File metadata

Download URL: html_similarity-0.5.0-py3-none-any.whl
Upload date: Jul 4, 2026
Size: 10.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.0

File hashes

Hashes for html_similarity-0.5.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`1dd4f3156558038416946e66d04342036561dea1dd69b8e818e9cda0157c5254`
MD5	`61f0b5dbf174db2aa8e7d6507da68462`
BLAKE2b-256	`6495b044bc81ad7a3281b1803c71688288b38044abbea432db257161381f487d`
