Project description

HTML2Vec

Converts a list of URLs into salient features for ML tasks.

THIS LIBRARY WILL DOWNLOAD THE ENTIRE INTERNET. IT CAN GET YOU BANNED, ARRESTED, DEPORTED, ETC

Files/Pipeline

Raw dataset -- pull a CSV file from Alexa (Kaggle dataset), PhishTank, etc.

google_canonical_result.py -- takes a list of URLs (host names) and finds a canonical full URL to associate with each host. Necessary if the dataset consists only of host names. A secondary purpose is to drop suspicious URLs from datasets. Not supported or endorsed by Google, nor does this project endorse Google.

get_raw_html.py -- takes a list of URLs and attempts to download the HTML for each one. Stores the results as a CSV and records every status code and failure. Warning: the output file can be large.
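
The download step can be sketched with the standard library alone. This is an illustrative sketch, not the code in get_raw_html.py: the names fetch_row and write_rows, the column layout, and the convention that status 0 marks a failed connection are all assumptions made for the example.

```python
import csv
import io
import urllib.error
import urllib.request
from datetime import datetime, timezone

FIELDS = ["url", "status", "datetime", "html"]  # assumed column layout


def fetch_row(url, timeout=10, opener=urllib.request.urlopen):
    """Download one URL and return a record; status 0 means no response (assumed)."""
    row = {
        "url": url,
        "status": 0,
        "datetime": datetime.now(timezone.utc).isoformat(),
        "html": "",
    }
    try:
        with opener(url, timeout=timeout) as resp:
            row["status"] = resp.status
            row["html"] = resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # The server answered, but with an error code such as 404 or 503.
        row["status"] = err.code
    except (urllib.error.URLError, ValueError, OSError):
        # DNS failure, timeout, refused connection, malformed URL: keep status 0.
        pass
    return row


def write_rows(rows, handle):
    """Write the records as CSV to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
```

The opener argument exists only so the function can be exercised without touching the network. In real use, open the output file with newline="" and pass the handle to write_rows.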

remove_zero.py -- utility script that drops failed connections from the dataset. Skip it if NAs are needed in the dataset.
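
A minimal sketch of that filter, assuming failed connections are stored with a status of 0 (or an empty or unparseable status). The function name drop_failed and the "status" key are hypothetical, and the real script may work on a DataFrame rather than a list of dicts.

```python
def drop_failed(rows, status_key="status"):
    """Keep only the rows whose status parses to a non-zero integer."""
    kept = []
    for row in rows:
        try:
            status = int(float(row.get(status_key) or 0))
        except (TypeError, ValueError, OverflowError):
            status = 0  # treat unparseable values such as "NaN" as failures
        if status != 0:
            kept.append(row)
    return kept
```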

html2vec.py -- takes the CSV with HTML and generates summary features. Saves two CSVs: one that still contains the original HTML and one that is trimmed. Features: document length, script length, style length, body length, script-to-body ratio, number of title tags.
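
The summary features listed above could be computed roughly as follows. This is a sketch built on the standard library's html.parser, not the parsing approach html2vec.py actually uses. It assumes lengths are measured in characters of text content and that text inside a script nested in the body counts toward both totals; the output keys borrow the xml_ names from the aggregate feature set.

```python
from html.parser import HTMLParser


class _Counter(HTMLParser):
    """Accumulate character counts for script, style, and body content."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = {"script": 0, "style": 0, "body": 0}
        self.chars = {"script": 0, "style": 0, "body": 0}
        self.titles = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.depth:
            self.depth[tag] += 1
        elif tag == "title":
            self.titles += 1

    def handle_endtag(self, tag):
        if tag in self.depth and self.depth[tag] > 0:
            self.depth[tag] -= 1

    def handle_data(self, data):
        # Credit the text to every tracked element that is currently open.
        for tag, depth in self.depth.items():
            if depth > 0:
                self.chars[tag] += len(data)


def html_features(html):
    """Return the summary features for one HTML document."""
    counter = _Counter()
    counter.feed(html)
    counter.close()
    body = counter.chars["body"]
    script = counter.chars["script"]
    return {
        "xml_doc_length": len(html),
        "xml_script_length": script,
        "xml_style_length": counter.chars["style"],
        "xml_body_length": body,
        "xml_scriptbody_ratio": script / body if body else 0.0,
        "xml_num_titles": counter.titles,
    }
```

A page with no body text gets a ratio of 0.0 here instead of a division error; whether the original records NA in that case is not stated.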

url_preproc.py -- given a dataset with URLs, generates a feature vector that summarizes the core syntax of each URL. Inherits from the HTML-level feature vector. Generates features from both the host name (base URL) and the full URL. Features: number of periods, presence of special symbols (@, -), URL length, IP address (if the site responded), number of anchors (#), number of URL parameters, number of queries, number of digits, Shannon entropy score.
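
The syntax features can be sketched with urllib.parse and a character-level Shannon entropy, H = -sum(p * log2(p)) over the character frequencies p. The exact definitions in url_preproc.py are not given, so this example assumes that "parameters" means key=value pairs in the query string and "queries" means the number of "?" characters. The function names are hypothetical, and the IP lookup is left out because it needs a network call.

```python
import math
from collections import Counter
from urllib.parse import parse_qsl, urlsplit


def shannon_entropy(text):
    """Character-level Shannon entropy in bits; 0.0 for an empty string."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def syntax_features(text, prefix):
    """Syntax counts for one string, keyed with the given prefix (base or full)."""
    query = urlsplit(text).query  # may raise ValueError on malformed input
    return {
        f"{prefix}_num_periods": text.count("."),
        f"{prefix}_spec_symbols": int(any(ch in text for ch in "@-")),
        f"{prefix}_length": len(text),
        f"{prefix}_anchors": text.count("#"),
        f"{prefix}_params": len(parse_qsl(query, keep_blank_values=True)),
        f"{prefix}_queries": text.count("?"),
        f"{prefix}_digits": sum(ch.isdigit() for ch in text),
        f"{prefix}_entropy": shannon_entropy(text),
    }


def url_features(url):
    """Features for the host name (base) and for the full URL."""
    base = urlsplit(url).hostname or ""
    return {
        "base_url": base,
        **syntax_features(base, "base"),
        **syntax_features(url, "full"),
    }
```

For example, url_features("http://a-b.example.com/p?x=1&y=22#top") yields base_url "a-b.example.com" with two periods in the base, two parameters and one anchor in the full URL.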

Jupyter notebook with examples

Aggregate feature set as of 14 Oct 2020: url, status, datetime, flag, dataset, batch, xml_doc_length, xml_script_length, xml_style_length, xml_body_length, xml_scriptbody_ratio, xml_num_titles, base_url, base_num_periods, full_num_periods, base_spec_symbols, full_spec_symbols, base_length, full_length, ip, full_anchors, base_anchors, full_params, base_params, full_queries, base_queries, full_digits, base_digits, full_entropy, base_entropy

Download files

Source Distribution

m77-0.0.0.1.tar.gz (14.5 kB)

Uploaded Source

Built Distribution

m77-0.0.0.1-py3-none-any.whl (14.4 kB)

Uploaded Python 3

File details

Details for the file m77-0.0.0.1.tar.gz.

File metadata

  • Download URL: m77-0.0.0.1.tar.gz
  • Upload date:
  • Size: 14.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.8

File hashes

Hashes for m77-0.0.0.1.tar.gz
Algorithm Hash digest
SHA256 a3c05e2ad7d69838ad7f0a9393f6dbdf847518f475bdef48fdd85b4c07f6211b
MD5 e324f4be9be2a6377d7fde4fa31b3d80
BLAKE2b-256 62690b6752a96180b405daed318f96d7094770fce4839c453a097e90bb8a7ef9

File details

Details for the file m77-0.0.0.1-py3-none-any.whl.

File metadata

  • Download URL: m77-0.0.0.1-py3-none-any.whl
  • Upload date:
  • Size: 14.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.8

File hashes

Hashes for m77-0.0.0.1-py3-none-any.whl
Algorithm Hash digest
SHA256 27e23bb264f546484fcec2e47d07982a591aed18e6921cfedb9a3e9bd837a622
MD5 df8da045ef54ac539e44e31b0ae27a3c
BLAKE2b-256 041471a20d3a1df92890b2292ff2174e43d1a168a9d786322adc42aca32490a8
