# HTML Parser
This package extracts and parses information from source HTML.
# build a PyPI package
python3 setup.py sdist bdist_wheel
twine upload dist/*
# install the CLI from a local dist file (if you have one)
python3 -m pip install /home/yaxiong/html_parsing/dist/htmlparsingbs4based-1.1.0.tar.gz
# install package and CLI
pip install htmlparsingbs4based
or
python3 -m pip install htmlparsingbs4based
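To confirm that the install worked, the standard library can report which version of the distribution is present. This is a minimal sketch: `installed_version` is a hypothetical helper name, and only the distribution name `htmlparsingbs4based` comes from the commands above.

```python
from importlib import metadata
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("htmlparsingbs4based")
    if version is None:
        print("htmlparsingbs4based is not installed in this environment")
    else:
        print(f"htmlparsingbs4based {version} is installed")
```

Run it with the same interpreter that was used for `pip install`, since each environment keeps its own set of installed distributions.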
# run from script
from htmlparsingbs4based.html_parsing.html_parser_custombs4_script import parse_single_page
parse_single_page(input_url='https://bryansfuel.on.ca/about/', path_to_crawled_files='/home/yaxiong/data_crawled_websites/crawled_websites_first_batch', min_length=1, prefix="")
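When several pages need parsing, the single call above can be wrapped in a small loop. The sketch below is an illustration, not part of the package: `parse_many` and `parse_fn` are hypothetical names, and the only assumption about the parser is that it accepts the four keyword arguments shown above. What `parse_single_page` returns is not documented here, so the results are stored as-is.

```python
from typing import Any, Callable, Dict, Iterable


def parse_many(
    parse_fn: Callable[..., Any],
    urls: Iterable[str],
    path_to_crawled_files: str,
    min_length: int = 1,
    prefix: str = "",
) -> Dict[str, Any]:
    """Call parse_fn once per URL; map each URL to its result or its exception."""
    results: Dict[str, Any] = {}
    for url in urls:
        try:
            results[url] = parse_fn(
                input_url=url,
                path_to_crawled_files=path_to_crawled_files,
                min_length=min_length,
                prefix=prefix,
            )
        except Exception as exc:
            # Record the failure and continue, so one bad page does not stop the batch.
            results[url] = exc
    return results
```

With the package installed, pass the imported function as the first argument, for example `parse_many(parse_single_page, urls, '/home/yaxiong/data_crawled_websites/crawled_websites_first_batch')`.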
# run CLI (examples)
mode_1: elasticsearch
PARSE -gpf elasticsearch -i 'http://www.mineracamargo.com/MCA_Investors.html' -esusr readwrite -espw ''
mode_2: local
PARSE -gpf local -i 'https://bryansfuel.on.ca/about/' -fo /home/yaxiong/data_crawled_websites/crawled_websites_first_batch
PARSE -gpf local -i 'http://www.mineracamargo.com/MCA_Investors.html' -fo /home/yaxiong/data_crawled_websites/crawled_websites_first_batch
PARSE -gpf local -i 'https://www.conpak.com/About-Conpak/' -fo /home/yaxiong/data_crawled_websites/crawled_websites_first_batch
mode_3: html
PARSE -gpf html -fi /home/yaxiong/html_parsing/html_example/parsed_html.json
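The CLI examples above can also be assembled from Python, which helps when the same command is repeated for many URLs. This is a rough sketch: the flags (`-gpf`, `-i`, `-fo`, `-fi`) are copied verbatim from the examples and their meaning is only inferred from them, while `local_command` and `html_command` are hypothetical helper names.

```python
import shlex
from typing import List


def local_command(url: str, folder: str) -> List[str]:
    """Argument list for the 'local' mode, mirroring the examples above."""
    return ["PARSE", "-gpf", "local", "-i", url, "-fo", folder]


def html_command(json_file: str) -> List[str]:
    """Argument list for the 'html' mode, mirroring the example above."""
    return ["PARSE", "-gpf", "html", "-fi", json_file]


if __name__ == "__main__":
    folder = "/home/yaxiong/data_crawled_websites/crawled_websites_first_batch"
    urls = (
        "https://bryansfuel.on.ca/about/",
        "https://www.conpak.com/About-Conpak/",
    )
    for url in urls:
        # shlex.join quotes each argument where needed, so the line can be pasted into a shell.
        print(shlex.join(local_command(url, folder)))
```

Once `PARSE` is on the `PATH`, each list can be handed to `subprocess.run(command, check=True)` directly. Passing a list rather than a single string avoids shell quoting problems with URLs.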
Hashes for htmlparsingbs4based-1.1.0.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 9b7b9cebb0be84fab1358213e760f6598038aec5d50ebf987d4e0261e8ce2ff8
MD5 | 4667ddefdf38b65b0137426fcc1edd90
BLAKE2b-256 | 8aaf5ef26eec33ebb2ef690026c6b96a8f536ef2379076459334cb7df1f17653
Hashes for htmlparsingbs4based-1.1.0-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | bd63e5250acfb85e0088faa75965d6dc481104e3d3b92423aad9b192b7f856e4
MD5 | f5a6e0cc4c5c47bd3caa8c674ee74d7d
BLAKE2b-256 | bd81534aa32f4d8e0de77fdee3da28f80028c9d49db09c4b5d89f60ef8d2001f
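A downloaded file can be checked against the digests listed above with Python's `hashlib`. The sketch below is one way to do it: `file_digest` and `matches` are hypothetical helper names, and the label `blake2b-256` is treated here as BLAKE2b with a 32-byte digest.

```python
import hashlib
from pathlib import Path


def file_digest(path: str, algorithm: str = "sha256") -> str:
    """Hex digest of a file, read in chunks so large archives are not loaded at once."""
    if algorithm == "blake2b-256":
        # BLAKE2b configured for a 32-byte (256-bit) digest.
        hasher = hashlib.blake2b(digest_size=32)
    else:
        hasher = hashlib.new(algorithm)
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def matches(path: str, expected_hex: str, algorithm: str = "sha256") -> bool:
    """True if the file's digest equals the published hex digest (case-insensitive)."""
    return file_digest(path, algorithm) == expected_hex.strip().lower()
```

For example, `matches("htmlparsingbs4based-1.1.0.tar.gz", "9b7b9cebb0be84fab1358213e760f6598038aec5d50ebf987d4e0261e8ce2ff8")` should return `True` for an intact copy of the source distribution. SHA256 or BLAKE2b-256 is the better choice for this check, since MD5 is no longer considered collision resistant.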