Skip to main content

Algorithms to find similarity between HTML pages.

Project description

HTML Matcher

The pypi package in Python contains a series of functions for calculating the similarity ratio between pages in websites or web applications.

How to INSTALL?

pip install html-matcher

How to USE?

By comparing the HTML structure, style, or both, the similarity ratio can be computed. Two techniques are available for structure comparison: Matching Subsequences (MS) and All Path Tree Edit Distance (APTED). One algorithm that uses the jaccard similarity metric is offered for style comparison.

Structure

MS

Example

ms = MatchingSubsequences()
ratio = ms.similarity(page1,page2)

or you can use our improved method of MS that provides better results

ms = MatchingSubsequencesOptimized()
ratio = ms.similarity(page1,page2)

APTED

Example

apted = AllPathTreeEditDistance()
ratio = apted.similarity(page1,page2)

or you can use our improved method of APTED that reduces computational time

apted = AllPathTreeEditDistanceOptimized()
ratio = apted.similarity(page1,page2)

Style

Each html document's css classes are extracted, and the jaccard similarity of the sets of classes is calculated.

Jaccard Similarity

J(A,B) = |A ∩ B| / |A U B|

Example

style = StyleSimilarity()
ratio = style.similarity(page1,page2)

Structure & Style

We must pass a weight for each metric when combining similarity measures (k). The default value is 0.5, but in our experiments, we found that, when comparing web pages based on their similarity ratio, structure takes precedence over style, so 0.7 produces better results.

k * similarity(page_1, page_2) + (1 - k) * similarity(page_1, page_2)

Example

style = StyleSimilarity()
apted = AllPathTreeEditDistanceOptimized()
method = MixedSimilarity(apted, style, k=0.7)
ratio = method.similarity(page1,page2)

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

html-matcher-0.1.1.tar.gz (7.9 kB view hashes)

Uploaded Source

Built Distribution

html_matcher-0.1.1-py3-none-any.whl (8.6 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page