
Website to vector representation library

Project description

Web2Vec: Website to Vector Library

Overview

Web2Vec is a comprehensive library designed to convert websites into vectors of parameters. It provides ready-to-use web crawler implementations built on Scrapy, making it accessible to less experienced researchers. This tool is invaluable for website analysis tasks, including SEO, disinformation detection, and phishing identification.

Website analysis is crucial in various fields, such as SEO, where it helps improve website ranking, and in security, where it aids in identifying phishing sites. By building datasets based on known safe and malicious websites, Web2Vec facilitates the collection and analysis of their parameters, making it an ideal solution for these tasks.

Crucial factors:

  • All-in-One Solution: Web2Vec collects a wide range of information about websites in a single package.
  • Efficiency and Expertise: Building a similar solution independently would be very time-consuming and require specialized knowledge. Web2Vec not only integrates with available APIs but also scrapes results from services like Google Index using Selenium.
  • Open Source Advantage: Publishing this tool as open source will facilitate many studies, making them simpler and allowing researchers and industry professionals to focus on more advanced tasks.
  • Continuous Improvement: New techniques will be added successively, ensuring continuous growth in this area.

Features

  • Crawler Implementation: Easily crawl specified websites with customizable depth and allowed domains.
  • Network Analysis: Build and analyze networks of connected websites.
  • Parameter Extraction: Extract a wide range of features for detailed analysis. Each provider returns a Python dataclass, which keeps the code maintainable and makes adding new parameters easier. Extracted feature groups include:
  • HTML Content
  • DNS
  • HTTP Response
  • SSL Certificate
  • URL-related geographical location
  • URL Lexical Analysis
  • WHOIS Integration
  • Google Index
  • Open Page Rank
  • Open Phish
  • Phish Tank
  • Similar Web
  • URL Haus

By using this library, you can easily collect and analyze almost 200 parameters to describe a website comprehensively.

HTML Content parameters

@dataclass
class HtmlBodyFeatures:
    contains_forms: bool
    contains_obfuscated_scripts: bool
    contains_suspicious_keywords: bool
    body_length: int
    num_titles: int
    num_images: int
    num_links: int
    script_length: int
    special_characters: int
    script_to_special_chars_ratio: float
    script_to_body_ratio: float
    body_to_special_char_ratio: float
    iframe_redirection: int
    mouse_over_effect: int
    right_click_disabled: int
    num_scripts_http: int
    num_styles_http: int
    num_iframes_http: int
    num_external_scripts: int
    num_external_styles: int
    num_external_iframes: int
    num_meta_tags: int
    num_forms: int
    num_forms_post: int
    num_forms_get: int
    num_forms_external_action: int
    num_hidden_elements: int
    num_safe_anchors: int
    num_media_http: int
    num_media_external: int
    num_email_forms: int
    num_internal_links: int
    favicon_url: Optional[str]
    logo_url: Optional[str]
    found_forms: List[Dict[str, Any]] = field(default_factory=list)
    found_images: List[Dict[str, Any]] = field(default_factory=list)
    found_anchors: List[Dict[str, Any]] = field(default_factory=list)
    found_media: List[Dict[str, Any]] = field(default_factory=list)
    copyright: Optional[str] = None
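
For intuition, here is a minimal, illustrative sketch of how a few of these fields could be derived from raw HTML with BeautifulSoup; it is not the library's own implementation, and the sample HTML is made up.

from bs4 import BeautifulSoup

# Hypothetical page used only for illustration.
html = "<html><body><form method='post'></form><img src='a.png'><a href='#'>x</a></body></html>"
soup = BeautifulSoup(html, "html.parser")

forms = soup.find_all("form")
num_forms = len(forms)
num_forms_post = sum(1 for f in forms if f.get("method", "").lower() == "post")
num_images = len(soup.find_all("img"))
num_links = len(soup.find_all("a"))
body_length = len(soup.get_text())
print(num_forms, num_forms_post, num_images, num_links, body_length)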

DNS parameters

@dataclass
class DNSRecordFeatures:
    record_type: str
    ttl: int
    values: List[str]
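
As a rough illustration, the same record-level fields can be gathered with dnspython (an assumption; web2vec may use a different resolver internally):

import dns.resolver

def dns_record_features(domain: str, record_type: str):
    answer = dns.resolver.resolve(domain, record_type)
    ttl = answer.rrset.ttl  # TTL of the returned record set
    values = [rdata.to_text() for rdata in answer]  # textual record values
    return record_type, ttl, values

print(dns_record_features("example.com", "A"))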

HTTP Response parameters

@dataclass
class HttpResponseFeatures:
    redirects: bool
    redirect_count: int
    contains_forms: bool
    contains_obfuscated_scripts: bool
    contains_suspicious_keywords: bool
    uses_https: bool
    missing_x_frame_options: bool
    missing_x_xss_protection: bool
    missing_content_security_policy: bool
    missing_strict_transport_security: bool
    missing_x_content_type_options: bool
    is_live: bool
    server_version: Optional[str] = None
    body_length: int = 0
    num_titles: int = 0
    num_images: int = 0
    num_links: int = 0
    script_length: int = 0
    special_characters: int = 0
    script_to_special_chars_ratio: float = 0.0
    script_to_body_ratio: float = 0.0
    body_to_special_char_ratio: float = 0.0
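
Several of these fields map directly onto a plain requests call; the sketch below is illustrative only, not the extractor's actual code:

import requests

resp = requests.get("http://quotes.toscrape.com/", timeout=10)

redirect_count = len(resp.history)  # responses traversed before the final one
redirects = redirect_count > 0
uses_https = resp.url.startswith("https://")
missing_x_frame_options = "X-Frame-Options" not in resp.headers
missing_content_security_policy = "Content-Security-Policy" not in resp.headers
is_live = resp.status_code < 400
server_version = resp.headers.get("Server")
print(redirects, uses_https, server_version)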

SSL Certificate parameters

@dataclass
class CertificateFeatures:
    subject: Dict[str, Any]
    issuer: Dict[str, Any]
    not_before: datetime
    not_after: datetime
    is_valid: bool
    validity_message: str
    is_trusted: bool
    trust_message: str
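
The certificate fields can be obtained with the standard library alone. This is a simplified sketch assuming a direct TLS handshake on port 443, not necessarily what web2vec does internally:

import socket
import ssl
from datetime import datetime

hostname = "example.com"
ctx = ssl.create_default_context()
with socket.create_connection((hostname, 443), timeout=10) as sock:
    with ctx.wrap_socket(sock, server_hostname=hostname) as tls:
        cert = tls.getpeercert()

fmt = "%b %d %H:%M:%S %Y %Z"  # e.g. "Jun  1 12:00:00 2025 GMT"
not_before = datetime.strptime(cert["notBefore"], fmt)
not_after = datetime.strptime(cert["notAfter"], fmt)
is_valid = not_before <= datetime.utcnow() <= not_after
print(cert["issuer"], is_valid)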

URL-related geographical location

@dataclass
class URLGeoFeatures:
    url: str
    country_code: str
    asn: int

URL Lexical Analysis

@dataclass
class URLLexicalFeatures:
    count_dot_url: int
    count_dash_url: int
    count_underscore_url: int
    count_slash_url: int
    count_question_url: int
    count_equals_url: int
    count_at_url: int
    count_ampersand_url: int
    count_exclamation_url: int
    count_space_url: int
    count_tilde_url: int
    count_comma_url: int
    count_plus_url: int
    count_asterisk_url: int
    count_hash_url: int
    count_dollar_url: int
    count_percent_url: int
    url_length: int
    tld_amount_url: int
    count_dot_domain: int
    count_dash_domain: int
    count_underscore_domain: int
    count_slash_domain: int
    count_question_domain: int
    count_equals_domain: int
    count_at_domain: int
    count_ampersand_domain: int
    count_exclamation_domain: int
    count_space_domain: int
    count_tilde_domain: int
    count_comma_domain: int
    count_plus_domain: int
    count_asterisk_domain: int
    count_hash_domain: int
    count_dollar_domain: int
    count_percent_domain: int
    domain_length: int
    vowel_count_domain: int
    domain_in_ip_format: bool
    domain_contains_keywords: bool
    count_dot_directory: int
    count_dash_directory: int
    count_underscore_directory: int
    count_slash_directory: int
    count_question_directory: int
    count_equals_directory: int
    count_at_directory: int
    count_ampersand_directory: int
    count_exclamation_directory: int
    count_space_directory: int
    count_tilde_directory: int
    count_comma_directory: int
    count_plus_directory: int
    count_asterisk_directory: int
    count_hash_directory: int
    count_dollar_directory: int
    count_percent_directory: int
    directory_length: int
    count_dot_parameters: int
    count_dash_parameters: int
    count_underscore_parameters: int
    count_slash_parameters: int
    count_question_parameters: int
    count_equals_parameters: int
    count_at_parameters: int
    count_ampersand_parameters: int
    count_exclamation_parameters: int
    count_space_parameters: int
    count_tilde_parameters: int
    count_comma_parameters: int
    count_plus_parameters: int
    count_asterisk_parameters: int
    count_hash_parameters: int
    count_dollar_parameters: int
    count_percent_parameters: int
    parameters_length: int
    tld_presence_in_arguments: int
    number_of_parameters: int
    email_present_in_url: bool
    domain_entropy: float
    url_depth: int
    uses_shortening_service: Optional[str]
    is_ip: bool = False
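
Most of these counters are simple string statistics over the URL's parts. A minimal sketch with the standard library, covering only a handful of the fields above:

import math
from collections import Counter
from urllib.parse import urlparse

url = "http://login.example.com/secure/update?user=abc&id=1"  # hypothetical URL
parsed = urlparse(url)
domain = parsed.netloc

count_dot_url = url.count(".")
count_slash_url = url.count("/")
url_length = len(url)
domain_length = len(domain)
number_of_parameters = len(parsed.query.split("&")) if parsed.query else 0

# Shannon entropy of the domain string, a common phishing signal.
counts = Counter(domain)
domain_entropy = -sum(
    (n / domain_length) * math.log2(n / domain_length) for n in counts.values()
)
print(url_length, number_of_parameters, round(domain_entropy, 3))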

WHOIS Integration

@dataclass
class WhoisFeatures:
    domain_name: List[str]
    registrar: Optional[str]
    whois_server: Optional[str]
    referral_url: Optional[str]
    updated_date: Optional[datetime]
    creation_date: Optional[datetime]
    expiration_date: Optional[datetime]
    name_servers: List[str]
    status: List[str]
    emails: List[str]
    dnssec: Optional[str]
    name: Optional[str]
    org: Optional[str]
    address: Optional[str]
    city: Optional[str]
    state: Optional[str]
    zipcode: Optional[str]
    country: Optional[str]
    raw: Dict = field(default_factory=dict)
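
These fields mirror what the python-whois package exposes; assuming that package as the underlying source, a direct lookup looks like this:

import whois

record = whois.whois("example.com")
print(record.registrar)
print(record.creation_date)  # may be a single datetime or a list of datetimes
print(record.name_servers)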

Google Index

@dataclass
class GoogleIndexFeatures:
    is_indexed: Optional[bool]
    position: Optional[int] = None

Open Page Rank

@dataclass
class OpenPageRankFeatures:
    domain: str
    page_rank_decimal: Optional[float]
    updated_date: Optional[str]
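
The Open Page Rank data comes from the public openpagerank.com API, which is where the WEB2VEC_OPEN_PAGE_RANK_API_KEY setting is used. A direct call, sketched with requests:

import requests

API_KEY = "XXXXX"  # placeholder; use your own key
resp = requests.get(
    "https://openpagerank.com/api/v1.0/getPageRank",
    params={"domains[]": ["example.com"]},
    headers={"API-OPR": API_KEY},
    timeout=10,
)
for item in resp.json()["response"]:
    print(item["domain"], item.get("page_rank_decimal"))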

Open Phish

@dataclass
class OpenPhishFeatures:
    is_phishing: bool

Phish Tank

@dataclass
class PhishTankFeatures:
    phish_id: str
    url: str
    phish_detail_url: str
    submission_time: str
    verified: str
    verification_time: str
    online: str
    target: str

Similar Web

@dataclass
class SimilarWebFeatures:
    Version: int
    SiteName: str
    Description: str
    TopCountryShares: List[TopCountryShare]
    Title: str
    Engagements: Engagements
    EstimatedMonthlyVisits: List[EstimatedMonthlyVisit]
    GlobalRank: int
    CountryRank: int
    CountryCode: str
    CategoryRank: str
    Category: str
    LargeScreenshot: str
    TrafficSources: TrafficSource
    TopKeywords: List[TopKeyword]
    RawData: dict = field(default_factory=dict)

URL Haus

@dataclass
class URLHausFeatures:
    id: str
    date_added: str
    url: str
    url_status: str
    last_online: str
    threat: str
    tags: str
    urlhaus_link: str
    reporter: str

Why Web2Vec?

While many scripts and solutions exist that perform some of the tasks offered by Web2Vec, none provide a complete all-in-one package. Web2Vec not only offers comprehensive functionality but also ensures maintainability and ease of use.

Integration and Configuration

Web2Vec focuses on integration with free services, leveraging their APIs or scraping their responses. Configuration is handled via Python settings, making it easily configurable through traditional methods (environment variables, configuration files, etc.). Its integration with dedicated phishing detection services makes it a robust tool for building phishing detection datasets.

How to use

The library can be installed using pip:

pip install web2vec

Code usage

Configuration

Configure the library using environment variables or configuration files.

export WEB2VEC_CRAWLER_SPIDER_DEPTH_LIMIT=2
export WEB2VEC_DEFAULT_OUTPUT_PATH=/home/admin/crawler/output
export WEB2VEC_OPEN_PAGE_RANK_API_KEY=XXXXX
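
The exported values are read back through the library's config object; the attribute names below are the ones used in the crawling example that follows:

from src.web2vec.config import config

print(config.crawler_spider_depth_limit)  # 2, from the variable exported above
print(config.crawler_output_path)  # output directory for crawl results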

Crawling websites and extracting parameters

import os

from scrapy.crawler import CrawlerProcess

from src.web2vec.config import config
from src.web2vec.crawlers.extractors import ALL_EXTRACTORS
from src.web2vec.crawlers.spiders import Web2VecSpider

process = CrawlerProcess(
    settings={
        "FEEDS": {
            os.path.join(config.crawler_output_path, "output.json"): {
                "format": "json",
                "encoding": "utf8",
            }
        },
        "DEPTH_LIMIT": config.crawler_spider_depth_limit,
        "LOG_LEVEL": "INFO",
    }
)

process.crawl(
    Web2VecSpider,
    start_urls=["http://quotes.toscrape.com/"], # pages to process
    allowed_domains=["quotes.toscrape.com"], # domains to process for links
    extractors=ALL_EXTRACTORS, # extractors to use
)
process.start()

As a result, each processed page is stored in a separate JSON file with the following keys (see the loading sketch after this list):

  • url - processed url
  • title - website title extracted from HTML
  • html - HTTP response text attribute
  • response_headers - HTTP response headers
  • status_code - HTTP response status code
  • extractors - dictionary with extractors results
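
In addition to any per-page files, the FEEDS setting above writes a single aggregated output.json; assuming that feed, the results can be loaded back for inspection:

import json
import os

from src.web2vec.config import config

with open(os.path.join(config.crawler_output_path, "output.json"), encoding="utf8") as fh:
    pages = json.load(fh)  # Scrapy's json feed is a single JSON array

for page in pages:
    print(page["url"], page["status_code"], list(page["extractors"].keys()))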


Website analysis

Websites can be analysed without the crawling process by using the extractors directly. For example, to get SimilarWeb data for a given domain, you just call the appropriate method:

from src.web2vec.extractors.external_api.similar_web_features import \
    get_similar_web_features

domain_to_check = "down.pcclear.com"
entry = get_similar_web_features(domain_to_check)
print(entry)
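
The returned entry is the SimilarWebFeatures dataclass described above, so individual fields such as entry.GlobalRank or entry.Category can be accessed directly.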

Contributing

To contribute, refer to the project's CONTRIBUTING.md file. We are a welcoming community; just follow the Code of Conduct.

Maintainers

Project maintainers are:

  • Damian Frąszczak
  • Edyta Frąszczak

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

web2vec-0.1.0.tar.gz (27.9 kB)

Uploaded Source

Built Distribution

web2vec-0.1.0-py3-none-any.whl (32.2 kB)

Uploaded Python 3

File details

Details for the file web2vec-0.1.0.tar.gz.

File metadata

  • Download URL: web2vec-0.1.0.tar.gz
  • Upload date:
  • Size: 27.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.9.19

File hashes

Hashes for web2vec-0.1.0.tar.gz
Algorithm Hash digest
SHA256 32641131d3c06fa9a5718f4c8925389937b945aa0f8a36da908c4164137c03f9
MD5 1ba68405ea9d7a3f6f36dfb0cffb4b19
BLAKE2b-256 4ce27ddc0f87b213842e2a7a5446adffbc74fb352204c108ce054349bebad5e9

See more details on using hashes here.

File details

Details for the file web2vec-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: web2vec-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 32.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.9.19

File hashes

Hashes for web2vec-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 16aae1bcba692ba885b116207d3df0e4db3033ccdd6b19032c80262582e971dc
MD5 e433e48e01966ae0e7b72c78b061b626
BLAKE2b-256 a3e74906ecfd3077e585a00fc21b08450bc306c48be3108d77bd9e200b8eeef6

See more details on using hashes here.
