
Compare web pages and evaluate their level of similarity.

Project description

Similarius

Similarius is a Python library to compare web pages and evaluate their level of similarity.

It can be used as a stand-alone tool or to feed other systems.

Requirements

Installation

Source install

Similarius can be installed with poetry. If you don't have poetry installed, you can install it first:

$ curl -sSL https://raw.githubusercontent.com/python-poetry/poetry/master/get-poetry.py | python

$ poetry install
$ poetry shell
$ similarius -h

pip installation

$ pip3 install similarius
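
To confirm that the package is visible to the interpreter you plan to use, a small check along these lines may help. This is only a sketch: `installed_version` is a hypothetical helper name, not part of Similarius, and it relies solely on the standard library's `importlib.metadata`.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("similarius")
    print(f"similarius version: {found}" if found else "similarius is not installed")
```

Running it with the same `python3` that backs `pip3` avoids confusion when several interpreters are installed.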

Usage

dacru@dacru:~/git/Similarius/similarius$ similarius --help
usage: similarius.py [-h] [-o ORIGINAL] [-w WEBSITE [WEBSITE ...]]

optional arguments:
  -h, --help            show this help message and exit
  -o ORIGINAL, --original ORIGINAL
                        Website to compare
  -w WEBSITE [WEBSITE ...], --website WEBSITE [WEBSITE ...]
                        Website to compare

Usage example

dacru@dacru:~/git/Similarius/similarius$ similarius -o circl.lu -w europa.eu circl.eu circl.lu

Used as a library

import argparse
import sys

from similarius import get_website, extract_text_ressource, sk_similarity, ressource_difference, ratio

parser = argparse.ArgumentParser()
parser.add_argument("-w", "--website", nargs="+", help="Website to compare")
parser.add_argument("-o", "--original", help="Website to compare")
args = parser.parse_args()

# Original
original = get_website(args.original)

if not original:
    print("[-] The original website is unreachable...")
    sys.exit(1)

original_text, original_ressource = extract_text_ressource(original.text)

for website in args.website:
    print(f"\n********** {args.original} <-> {website} **********")

    # Compare
    compare = get_website(website)

    if not compare:
        print(f"[-] {website} is unreachable...")
        continue

    compare_text, compare_ressource = extract_text_ressource(compare.text)

    # Calculate
    sim = str(sk_similarity(compare_text, original_text))
    print(f"\nSimilarity: {sim}")

    ressource_diff = ressource_difference(original_ressource, compare_ressource)
    print(f"Ressource Difference: {ressource_diff}")

    ratio_compare = ratio(ressource_diff, sim)
    print(f"Ratio: {ratio_compare}")
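
The script above prints a text similarity score and a resource difference for each pair of pages. To give an intuition for what such numbers can mean, here is a minimal standard-library sketch. It is an assumption-laden illustration, not Similarius's implementation: `tokenize`, `cosine_similarity` and `shared_resource_percent` are hypothetical names, the text score is a plain cosine similarity over word counts, and the resource score is simply the share of the original page's resources that also appear on the compared page.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between the word-count vectors of two texts (0.0 to 1.0)."""
    counts_a = Counter(tokenize(text_a))
    counts_b = Counter(tokenize(text_b))
    # Dot product over the words the two texts share.
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text shares nothing with any other text
    return dot / (norm_a * norm_b)


def shared_resource_percent(original: set[str], candidate: set[str]) -> float:
    """Percentage of the original page's resources that also appear on the candidate page."""
    if not original:
        return 0.0
    return 100.0 * len(original & candidate) / len(original)


if __name__ == "__main__":
    # Illustrative inputs only; real pages would be fetched and parsed first.
    print(cosine_similarity("Welcome to our site", "Welcome to our portal"))
    print(shared_resource_percent({"/style.css", "/app.js"}, {"/style.css"}))
```

A page that copies both the wording and the resources of the original would score high on both measures, which is the kind of signal a lookalike-site check is after.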

Acknowledgment

The project has been co-funded by CEF-TC-2020-2 - 2020-EU-IA-0260 - JTAN - Joint Threat Analysis Network.
