Compare web pages and evaluate their level of similarity.
Similarius
Similarius is a Python library to compare web pages and evaluate their level of similarity.
It can be used as a stand-alone tool or to feed other systems.
Requirements
- Python 3.8+
- Requests
- Scikit-learn
- Beautifulsoup4
- nltk
Installation
Source install
Similarius can be installed with poetry. If you don't have poetry installed, you can install it with the following command:
$ curl -sSL https://raw.githubusercontent.com/python-poetry/poetry/master/get-poetry.py | python
$ poetry install
$ poetry shell
$ similarius -h
pip installation
$ pip3 install similarius
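After either installation method, one way to confirm that the package is visible to the current interpreter is to ask the standard library for its installed version. This is a minimal sketch; `installed_version` is a hypothetical helper name, not part of Similarius.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string if Similarius is installed, otherwise None.
print(installed_version("similarius"))
```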
Usage
dacru@dacru:~/git/Similarius/similarius$ similarius --help
usage: similarius.py [-h] [-o ORIGINAL] [-w WEBSITE [WEBSITE ...]]

optional arguments:
  -h, --help            show this help message and exit
  -o ORIGINAL, --original ORIGINAL
                        Website to compare
  -w WEBSITE [WEBSITE ...], --website WEBSITE [WEBSITE ...]
                        Website to compare
Usage example
dacru@dacru:~/git/Similarius/similarius$ similarius -o circl.lu -w europa.eu circl.eu circl.lu
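To make the command line above concrete, the sketch below shows how those arguments would be parsed: `-o` takes a single reference site and `-w` takes one or more sites to compare against it. The parser is reconstructed from the help text, not copied from the package source, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Mirrors the two options listed in the help output.
    parser = argparse.ArgumentParser(prog="similarius")
    parser.add_argument("-o", "--original", help="Website to compare")
    parser.add_argument("-w", "--website", nargs="+", help="Website to compare")
    return parser


# Same arguments as the usage example, passed as a list instead of via the shell.
args = build_parser().parse_args(
    ["-o", "circl.lu", "-w", "europa.eu", "circl.eu", "circl.lu"]
)
print(args.original)  # circl.lu
print(args.website)   # ['europa.eu', 'circl.eu', 'circl.lu']
```

Because `--website` uses `nargs="+"`, every value after `-w` is collected into one list, which is why the tool can compare several sites against the original in a single run.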
Used as a library
import argparse
from similarius import get_website, extract_text_ressource, sk_similarity, ressource_difference, ratio

parser = argparse.ArgumentParser()
parser.add_argument("-w", "--website", nargs="+", help="Website to compare")
parser.add_argument("-o", "--original", help="Website to compare")
args = parser.parse_args()

# Original
original = get_website(args.original)
if not original:
    print("[-] The original website is unreachable...")
    exit(1)
original_text, original_ressource = extract_text_ressource(original.text)

for website in args.website:
    print(f"\n********** {args.original} <-> {website} **********")

    # Compare
    compare = get_website(website)
    if not compare:
        print(f"[-] {website} is unreachable...")
        continue
    compare_text, compare_ressource = extract_text_ressource(compare.text)

    # Calculate
    sim = str(sk_similarity(compare_text, original_text))
    print(f"\nSimilarity: {sim}")
    ressource_diff = ressource_difference(original_ressource, compare_ressource)
    print(f"Ressource Difference: {ressource_diff}")
    ratio_compare = ratio(ressource_diff, sim)
    print(f"Ratio: {ratio_compare}")
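The library example prints a similarity score from `sk_similarity`, but this page does not describe how that score is computed. One common way to score the similarity of two texts is cosine similarity over term-count vectors. The sketch below illustrates that idea with the standard library only. It is not Similarius's implementation, and `term_counts` and `cosine_similarity` are hypothetical names.

```python
import math
import re
from collections import Counter


def term_counts(text: str) -> Counter:
    # Lowercase and split on non-word characters; a fuller pipeline would
    # probably also drop stop words or stem the tokens.
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(text_a: str, text_b: str) -> float:
    a, b = term_counts(text_a), term_counts(text_b)
    # Dot product over the terms that both texts share.
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # at least one text has no tokens
    return dot / (norm_a * norm_b)


same = cosine_similarity("incident response team", "incident response team")
partial = cosine_similarity("incident response team", "incident response portal")
disjoint = cosine_similarity("incident response team", "european union website")
print(same, partial, disjoint)
```

Identical texts score close to 1, texts with no terms in common score 0, and partially overlapping texts fall in between.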