Skip to main content

Compare web page and evaluate the level of similarity.

Project description

Similarius

Similarius is a Python library to compare web page and evaluate the level of similarity.

The tool can be used as a stand-alone tool or to feed other systems.

Requirements

Installation

Source install

Similarius can be install with poetry. If you don't have poetry installed, you can do the following curl -sSL https://raw.githubusercontent.com/python-poetry/poetry/master/get-poetry.py | python.

$ poetry install
$ poetry shell
$ similarius -h

pip installation

$ pip3 install similarius

Usage

dacru@dacru:~/git/Similarius/similarius$ similarius --help
usage: similarius.py [-h] [-o ORIGINAL] [-w WEBSITE [WEBSITE ...]]

optional arguments:
  -h, --help            show this help message and exit
  -o ORIGINAL, --original ORIGINAL
                        Website to compare
  -w WEBSITE [WEBSITE ...], --website WEBSITE [WEBSITE ...]
                        Website to compare

Usage example

dacru@dacru:~/git/Similarius/similarius$ similarius -o circl.lu -w europa.eu circl.eu circl.lu

Used as a library

import argparse
from similarius import get_website, extract_text_ressource, sk_similarity, ressource_difference, ratio

parser = argparse.ArgumentParser()
parser.add_argument("-w", "--website", nargs="+", help="Website to compare")
parser.add_argument("-o", "--original", help="Website to compare")
args = parser.parse_args()

# Original
original = get_website(args.original)

if not original:
    print("[-] The original website is unreachable...")
    exit(1)

original_text, original_ressource = extract_text_ressource(original.text)

for website in args.website:
    print(f"\n********** {args.original} <-> {website} **********")

    # Compare
    compare = get_website(website)

    if not compare:
        print(f"[-] {website} is unreachable...")
        continue

    compare_text, compare_ressource = extract_text_ressource(compare.text)

    # Calculate
    sim = str(sk_similarity(compare_text, original_text))
    print(f"\nSimilarity: {sim}")

    ressource_diff = ressource_difference(original_ressource, compare_ressource)
    print(f"Ressource Difference: {ressource_diff}")

    ratio_compare = ratio(ressource_diff, sim)
    print(f"Ratio: {ratio_compare}")

Acknowledgment

The project has been co-funded by CEF-TC-2020-2 - 2020-EU-IA-0260 - JTAN - Joint Threat Analysis Network.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

similarius-0.0.5.tar.gz (5.2 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

similarius-0.0.5-py3-none-any.whl (6.5 kB view details)

Uploaded Python 3

File details

Details for the file similarius-0.0.5.tar.gz.

File metadata

  • Download URL: similarius-0.0.5.tar.gz
  • Upload date:
  • Size: 5.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.2.1 CPython/3.12.3 Linux/6.17.0-14-generic

File hashes

Hashes for similarius-0.0.5.tar.gz
Algorithm Hash digest
SHA256 a9b02d9e91428700ec2ea2ef7eac3e683b0d25968876f7690a627200cbfef381
MD5 cab5273558b2bd4c0fddfc7b86c5fbdc
BLAKE2b-256 02a9b000fb8a228024655277364f5aef0aef86f3c218a698f280b585737a316e

See more details on using hashes here.

File details

Details for the file similarius-0.0.5-py3-none-any.whl.

File metadata

  • Download URL: similarius-0.0.5-py3-none-any.whl
  • Upload date:
  • Size: 6.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.2.1 CPython/3.12.3 Linux/6.17.0-14-generic

File hashes

Hashes for similarius-0.0.5-py3-none-any.whl
Algorithm Hash digest
SHA256 439cd34b2cbf24a89e714bcf9c9aebbe1a7c7e0d2e47de3daa5053f0b69b2e69
MD5 49cae898aa48d6112fdf665241e25428
BLAKE2b-256 92e05fc831458d3d48591eac8cc8e225189c21b0dadacf2a48b8a157573c8946

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page