A set of similarity metrics to compare HTML files.

Project description

This package provides a set of functions to measure the similarity between web pages.

Install

The quick way:

pip install html-similarity

How it works

Structural Similarity

We use sequence comparison of the HTML tags to compute the structural similarity instead of tree edit distance, because tree edit distance is slower than sequence comparison.

The idea of sequence comparison was taken from Page Compare.
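
As an illustration of the idea, here is a minimal sketch that uses only the Python standard library. It is not the package's own implementation: the names `TagCollector`, `tag_sequence` and `structural_similarity_sketch` are hypothetical, and the choice of `difflib.SequenceMatcher` as the sequence comparison is an assumption made for this example.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects the names of opening tags in document order (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Only the tag name matters for structure; attributes and text are ignored.
        self.tags.append(tag)


def tag_sequence(html):
    """Return the list of opening tag names found in an HTML string."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def structural_similarity_sketch(document_1, document_2):
    """Compare two documents by the similarity of their tag sequences.

    Assumption: SequenceMatcher.ratio() stands in for the sequence comparison.
    """
    matcher = SequenceMatcher(None, tag_sequence(document_1), tag_sequence(document_2))
    return matcher.ratio()


if __name__ == "__main__":
    page_a = "<html><body><p>a</p></body></html>"
    page_b = "<html><body><div>a</div></body></html>"
    print(structural_similarity_sketch(page_a, page_b))
```

Comparing flat tag sequences ignores nesting depth, which is the trade-off made for speed relative to tree edit distance.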

Style Similarity

Extracts the CSS classes of each HTML document and calculates the Jaccard similarity of the two sets of classes. The idea was taken from [1].
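
The class extraction and Jaccard step could look roughly like the sketch below. It uses only the standard library and is not the package's own code: `ClassCollector`, `css_classes` and `style_similarity_sketch` are hypothetical names, and returning 1.0 when neither document has any classes is an assumption made for this example.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collects every CSS class name used in a document (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # A class attribute may hold several whitespace-separated names.
            if name == "class" and value:
                self.classes.update(value.split())


def css_classes(html):
    """Return the set of CSS class names found in an HTML string."""
    collector = ClassCollector()
    collector.feed(html)
    collector.close()
    return collector.classes


def style_similarity_sketch(document_1, document_2):
    """Jaccard similarity |A & B| / |A | B| of the two class sets."""
    classes_1 = css_classes(document_1)
    classes_2 = css_classes(document_2)
    if not classes_1 and not classes_2:
        # Assumption: two documents without any classes count as identical.
        return 1.0
    return len(classes_1 & classes_2) / len(classes_1 | classes_2)


if __name__ == "__main__":
    page_a = '<div class="a b"><p class="c"></p></div>'
    page_b = '<div class="a"><p class="d"></p></div>'
    print(style_similarity_sketch(page_a, page_b))
```

In this example the class sets are {a, b, c} and {a, d}, which share one class out of four distinct ones.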

Joint Similarity (Structural Similarity and Style Similarity)

The joint similarity metric is calculated as:

k * structural_similarity(document_1, document_2) + (1 - k) * style_similarity(document_1, document_2)

This was taken from [1].

The value lies between 0 and 1.

Recommendations for joint similarity

Using k=0.3 gives better results. The style similarity gives more information about the overall similarity than the structural similarity does.
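
To make the weighting concrete, the sketch below applies the formula to two scores that have already been computed. The function name `joint_similarity_sketch` and the range check on `k` are assumptions made for this example, not part of the package.

```python
def joint_similarity_sketch(structural, style, k=0.3):
    """Weighted combination k * structural + (1 - k) * style.

    Both inputs are assumed to be scores in [0, 1]; the default k=0.3
    follows the recommendation in this README.
    """
    if not 0.0 <= k <= 1.0:
        # Assumption: k outside [0, 1] would push the result out of range.
        raise ValueError("k must be between 0 and 1")
    return k * structural + (1 - k) * style


if __name__ == "__main__":
    # Structure matches only halfway, style matches fully.
    print(joint_similarity_sketch(0.5, 1.0))
```

With k=0.3 the style score carries 70% of the weight, so two pages with identical classes but different markup still score high.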

Development

See the CONTRIBUTING.md file.

TODO

  • [ ] Add information about the package on PyPI
  • [ ] Add documentation
  • [ ] Add examples

