Compare web pages and evaluate their level of similarity.
Similarius
Similarius is a Python library that compares web pages and evaluates their level of similarity.
It can be used as a stand-alone command-line tool or as a library that feeds other systems.
Requirements
- Python 3.8+
- Requests
- Scikit-learn
- Beautifulsoup4
- nltk
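The requirements above can be checked before installing. The following is a minimal sketch, not part of Similarius: the helper names `missing_requirements` and `python_is_supported` are hypothetical, and the import names (`sklearn` for Scikit-learn, `bs4` for Beautifulsoup4) are assumed from how those packages are usually imported.

```python
import sys
from importlib.util import find_spec

# Display name -> import name. The import names are assumptions based on
# how each package is commonly imported, not taken from Similarius itself.
REQUIRED_MODULES = {
    "Requests": "requests",
    "Scikit-learn": "sklearn",
    "Beautifulsoup4": "bs4",
    "nltk": "nltk",
}


def missing_requirements(modules=REQUIRED_MODULES):
    """Return the display names of the requirements that cannot be imported."""
    return [name for name, module in modules.items() if find_spec(module) is None]


def python_is_supported(minimum=(3, 8)):
    """Return True if the running interpreter is at least `minimum`."""
    return sys.version_info >= minimum


if __name__ == "__main__":
    print("Python 3.8+:", python_is_supported())
    print("Missing requirements:", missing_requirements())
```

An empty list means every requirement is importable in the current environment.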
Installation
Source install
Similarius can be installed from source with Poetry. If Poetry is not installed yet, install it first:
$ curl -sSL https://raw.githubusercontent.com/python-poetry/poetry/master/get-poetry.py | python
Then, from the root of the repository:
$ poetry install
$ poetry shell
$ similarius -h
pip installation
$ pip3 install similarius
Usage
$ similarius --help
usage: similarius.py [-h] [-o ORIGINAL] [-w WEBSITE [WEBSITE ...]]

optional arguments:
  -h, --help            show this help message and exit
  -o ORIGINAL, --original ORIGINAL
                        Website to compare
  -w WEBSITE [WEBSITE ...], --website WEBSITE [WEBSITE ...]
                        Website to compare
Usage example
$ similarius -o circl.lu -w europa.eu circl.eu circl.lu
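To feed another system from the stand-alone tool, the same command can be launched from a script. This is a rough sketch under stated assumptions: `build_command` and `run_similarius` are hypothetical helpers, only the `-o` and `-w` options shown in the help output are used, and the output is returned as raw text because its format is not documented here.

```python
import shutil
import subprocess


def build_command(original, websites):
    """Build the argument list for the `similarius` command line."""
    if not websites:
        raise ValueError("at least one website to compare is required")
    return ["similarius", "-o", original, "-w", *websites]


def run_similarius(original, websites, timeout=120):
    """Run the CLI and return its standard output, or None if it is not installed."""
    if shutil.which("similarius") is None:
        return None
    result = subprocess.run(
        build_command(original, websites),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=False,  # leave exit-code handling to the caller
    )
    return result.stdout


if __name__ == "__main__":
    print(build_command("circl.lu", ["europa.eu", "circl.eu", "circl.lu"]))
```

Passing the arguments as a list avoids shell quoting issues when domain names come from untrusted input.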
Used as a library
import argparse

from similarius import get_website, extract_text_ressource, sk_similarity, ressource_difference, ratio

parser = argparse.ArgumentParser()
parser.add_argument("-w", "--website", nargs="+", help="Website to compare")
parser.add_argument("-o", "--original", help="Website to compare")
args = parser.parse_args()

# Fetch the original website and stop if it cannot be reached
original = get_website(args.original)
if not original:
    print("[-] The original website is unreachable...")
    exit(1)

# Extract the text and the resources of the original page
original_text, original_ressource = extract_text_ressource(original.text)

for website in args.website:
    print(f"\n********** {args.original} <-> {website} **********")

    # Fetch the website to compare and skip it if it cannot be reached
    compare = get_website(website)
    if not compare:
        print(f"[-] {website} is unreachable...")
        continue

    compare_text, compare_ressource = extract_text_ressource(compare.text)

    # Compute the text similarity, the resource difference and the ratio
    sim = str(sk_similarity(compare_text, original_text))
    print(f"\nSimilarity: {sim}")

    ressource_diff = ressource_difference(original_ressource, compare_ressource)
    print(f"Ressource Difference: {ressource_diff}")

    ratio_compare = ratio(ressource_diff, sim)
    print(f"Ratio: {ratio_compare}")
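The example above combines two signals: how similar the page texts are and how much the referenced resources differ. The sketch below illustrates that idea with the standard library only. It is not how Similarius is implemented: `PageExtractor`, `extract`, `text_similarity` and `resource_overlap` are hypothetical names, `difflib` stands in for the scikit-learn based similarity, and a Jaccard overlap stands in for the resource difference.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class PageExtractor(HTMLParser):
    """Collect visible text and the URLs found in src/href attributes."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.resources = set()
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        for name, value in attrs:
            if name in ("src", "href") and value:
                self.resources.add(value)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore script/style bodies and whitespace-only chunks
        if not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def extract(html):
    """Return (visible text, set of resource URLs) for an HTML string."""
    parser = PageExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.text_parts), parser.resources


def text_similarity(a, b):
    """Similarity of two strings in [0, 1], using difflib as a stand-in."""
    return SequenceMatcher(None, a, b).ratio()


def resource_overlap(a, b):
    """Jaccard overlap of two resource sets in [0, 1]."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    page_a = '<html><head><script src="app.js"></script></head><body><p>Welcome</p></body></html>'
    page_b = '<html><head><script src="other.js"></script></head><body><p>Welcome back</p></body></html>'
    text_a, res_a = extract(page_a)
    text_b, res_b = extract(page_b)
    print("Text similarity:", text_similarity(text_a, text_b))
    print("Resource overlap:", resource_overlap(res_a, res_b))
```

Keeping the two scores separate makes it easier to spot pages that copy the text of the original while loading different scripts or stylesheets.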
Acknowledgment
The project has been co-funded by CEF-TC-2020-2 - 2020-EU-IA-0260 - JTAN - Joint Threat Analysis Network.