
Algorithms to find similarity between HTML pages.


HTML Matcher

This Python package, available on PyPI, provides a set of functions for computing the similarity ratio between pages of websites or web applications.

How to INSTALL?

pip install html-matcher

How to USE?

The similarity ratio can be computed by comparing the HTML structure, the style, or both. Two techniques are available for structure comparison: Matching Subsequences (MS) and All Path Tree Edit Distance (APTED). For style comparison, one algorithm is offered, based on the Jaccard similarity metric.

Structure

MS

Example

ms = MatchingSubsequences()
ratio = ms.similarity(page1, page2)

Alternatively, use our optimized variant of MS, which provides better results:

ms = MatchingSubsequencesOptimized()
ratio = ms.similarity(page1, page2)
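
To give an intuition for subsequence-based structure matching, here is a minimal sketch that compares the ordered lists of opening tags of two pages with the standard library's difflib.SequenceMatcher. This is an assumption made for illustration, not necessarily how MatchingSubsequences is implemented; the names TagCollector, tag_sequence and sequence_similarity are hypothetical and not part of html-matcher.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects opening-tag names in document order (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html):
    """Return the list of opening tags found in an HTML string."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def sequence_similarity(html_a, html_b):
    """Ratio in [0, 1] based on matching tag subsequences (illustrative only)."""
    matcher = SequenceMatcher(None, tag_sequence(html_a), tag_sequence(html_b))
    return matcher.ratio()


page_a = "<html><body><div><p>Hi</p></div></body></html>"
page_b = "<html><body><div><span>Hi</span></div></body></html>"
print(sequence_similarity(page_a, page_a))  # identical structure
print(sequence_similarity(page_a, page_b))  # one tag differs
```

In this sketch only tag names matter, so two pages with different text but the same markup skeleton score 1.0.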

APTED

Example

apted = AllPathTreeEditDistance()
ratio = apted.similarity(page1, page2)

Alternatively, use our optimized variant of APTED, which reduces computation time:

apted = AllPathTreeEditDistanceOptimized()
ratio = apted.similarity(page1, page2)

Style

The CSS classes of each HTML document are extracted, and the Jaccard similarity of the two sets of classes is calculated.

Jaccard Similarity

J(A, B) = |A ∩ B| / |A ∪ B|

Example

style = StyleSimilarity()
ratio = style.similarity(page1, page2)
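
The class extraction and Jaccard computation described in this section could be sketched as follows, using only the standard library. This is an illustrative approximation, not the package's own code: ClassCollector, css_classes and jaccard are hypothetical names, and returning 1.0 when neither page has any classes is an assumption.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Gathers every CSS class name used in class attributes (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                # A class attribute may hold several space-separated names.
                self.classes.update(value.split())


def css_classes(html):
    """Return the set of CSS classes found in an HTML string."""
    collector = ClassCollector()
    collector.feed(html)
    collector.close()
    return collector.classes


def jaccard(a, b):
    """J(A, B) = |A ∩ B| / |A ∪ B| for two sets."""
    union = a | b
    if not union:
        return 1.0  # assumption: two pages without classes count as identical
    return len(a & b) / len(union)


page_a = '<div class="card wide"><p class="title">A</p></div>'
page_b = '<div class="card"><p class="title">B</p><span class="badge"></span></div>'
print(jaccard(css_classes(page_a), css_classes(page_b)))
```

Here the two pages share the classes card and title out of four distinct classes overall, so the ratio is 2 / 4 = 0.5.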

Structure & Style

When combining the two similarity measures, a weight k must be passed for the structural metric; the style metric receives the remaining weight, 1 - k. The default value is 0.5, but in our experiments structure proved more important than style when comparing web pages, so k = 0.7 produced better results.

k * structural_similarity(page_1, page_2) + (1 - k) * style_similarity(page_1, page_2)

Example

style = StyleSimilarity()
apted = AllPathTreeEditDistanceOptimized()
method = MixedSimilarity(apted, style, k=0.7)
ratio = method.similarity(page1, page2)
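
The weighting itself is simple arithmetic. A minimal sketch, assuming k applies to the structural ratio and 1 - k to the style ratio (mixed_similarity is a hypothetical helper, not the package's MixedSimilarity class, and the range check on k is an added assumption):

```python
def mixed_similarity(structure_ratio, style_ratio, k=0.5):
    """Weighted combination: k * structure + (1 - k) * style."""
    if not 0.0 <= k <= 1.0:
        raise ValueError("k must be between 0 and 1")
    return k * structure_ratio + (1 - k) * style_ratio


# Made-up ratios purely for illustration.
print(mixed_similarity(0.9, 0.5, k=0.7))
print(mixed_similarity(0.9, 0.5))  # default k = 0.5
```

With a structural ratio of 0.9 and a style ratio of 0.5, k = 0.7 gives roughly 0.78, while the default k = 0.5 gives roughly 0.70, which shows how a larger k pulls the result toward the structural score.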
