Skip to main content

Compare web page and evaluate the level of similarity.

Project description

Similarius

Similarius is a Python library to compare web page and evaluate the level of similarity.

The tool can be used as a stand-alone tool or to feed other systems.

Requirements

Installation

Source install

Similarius can be install with poetry. If you don't have poetry installed, you can do the following curl -sSL https://raw.githubusercontent.com/python-poetry/poetry/master/get-poetry.py | python.

$ poetry install
$ poetry shell
$ similarius -h

pip installation

$ pip3 install similarius

Usage

dacru@dacru:~/git/Similarius/similarius$ similarius --help
usage: similarius.py [-h] [-o ORIGINAL] [-w WEBSITE [WEBSITE ...]]

optional arguments:
  -h, --help            show this help message and exit
  -o ORIGINAL, --original ORIGINAL
                        Website to compare
  -w WEBSITE [WEBSITE ...], --website WEBSITE [WEBSITE ...]
                        Website to compare

Usage example

dacru@dacru:~/git/Similarius/similarius$ similarius -o circl.lu -w europa.eu circl.eu circl.lu

Used as a library

import argparse
from similarius import get_website, extract_text_ressource, sk_similarity, ressource_difference, ratio

parser = argparse.ArgumentParser()
parser.add_argument("-w", "--website", nargs="+", help="Website to compare")
parser.add_argument("-o", "--original", help="Website to compare")
args = parser.parse_args()

# Original
original = get_website(args.original)

if not original:
    print("[-] The original website is unreachable...")
    exit(1)

original_text, original_ressource = extract_text_ressource(original.text)

for website in args.website:
    print(f"\n********** {args.original} <-> {website} **********")

    # Compare
    compare = get_website(website)

    if not compare:
        print(f"[-] {website} is unreachable...")
        continue

    compare_text, compare_ressource = extract_text_ressource(compare.text)

    # Calculate
    sim = str(sk_similarity(compare_text, original_text))
    print(f"\nSimilarity: {sim}")

    ressource_diff = ressource_difference(original_ressource, compare_ressource)
    print(f"Ressource Difference: {ressource_diff}")

    ratio_compare = ratio(ressource_diff, sim)
    print(f"Ratio: {ratio_compare}")

Acknowledgment

The project has been co-funded by CEF-TC-2020-2 - 2020-EU-IA-0260 - JTAN - Joint Threat Analysis Network.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

similarius-0.0.2.tar.gz (4.5 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

similarius-0.0.2-py3-none-any.whl (5.9 kB view details)

Uploaded Python 3

File details

Details for the file similarius-0.0.2.tar.gz.

File metadata

  • Download URL: similarius-0.0.2.tar.gz
  • Upload date:
  • Size: 4.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.1.2 CPython/3.12.3 Linux/6.8.0-57-generic

File hashes

Hashes for similarius-0.0.2.tar.gz
Algorithm Hash digest
SHA256 beacbcab8c29dde55b2fccc3cee69e01a5a69f19def923fa37ca0ae4dd9ad4bb
MD5 4c3e8fbc5d873f9402a9e7b2ad46ecb7
BLAKE2b-256 b5dc0f174acdb919951d8985a8f858b10a8575f955d98d63558c715dfe560cf0

See more details on using hashes here.

File details

Details for the file similarius-0.0.2-py3-none-any.whl.

File metadata

  • Download URL: similarius-0.0.2-py3-none-any.whl
  • Upload date:
  • Size: 5.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.1.2 CPython/3.12.3 Linux/6.8.0-57-generic

File hashes

Hashes for similarius-0.0.2-py3-none-any.whl
Algorithm Hash digest
SHA256 6b9a2eafaa01b47775c4caede8f41c90a3d05c385ff1d79b52a3c072ab969a77
MD5 ab030b82006b05c3ebc89ddf62802ccf
BLAKE2b-256 0f92f9b589cd9c895421922f6bd4603e8847fcf9bd497783f935eef57bc77814

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page