A lightweight pre-parser that fetches and parses web page content or API responses from URLs. It supports Windows, macOS, and Unix.

Project description

Description

This is a lightweight parser that pre-parses data from the website URLs or API endpoints you specify. It removes the repetitive code needed to request each URL and speeds up the process with a thread pool, so you only need to write the business logic that runs once the response for a page or API URL arrives.
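To make the saved boilerplate concrete, here is a minimal standard-library sketch of the pattern described above: request each URL in a thread pool, then hand the response to a callback. It is not the package's implementation. The names `fetch_all`, `fetch`, `handle`, and `fake_fetch` are hypothetical, and the fetcher is injected so the sketch runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def fetch_all(
    urls: List[str],
    fetch: Callable[[str], str],
    handle: Callable[[str, str], object],
    max_workers: int = 3,
) -> Dict[str, object]:
    """Fetch every URL in a thread pool and pass each response to `handle`."""

    def task(url: str) -> object:
        return handle(url, fetch(url))

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with `urls`.
        return dict(zip(urls, pool.map(task, urls)))


def fake_fetch(url: str) -> str:
    # Stand-in for a real HTTP GET; swap in a real client when needed.
    return f"body of {url}"


results = fetch_all(
    ["https://example.com/api/1", "https://example.com/api/2"],
    fetch=fake_fetch,
    handle=lambda url, body: len(body),
)
```

With PreParser, the fetching and the pool are handled for you, and only the equivalent of `handle` remains to be written.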

Attention

Version 1.0.0 of this pre-parser could only pre-parse static HTML or API data. Since version 2.0.0 there is a new html_dynamic mode, which retrieves the full page content, including content generated by JavaScript.
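The difference between the static and dynamic modes can be illustrated without the package. The sketch below parses raw markup with Python's built-in `html.parser` (standing in for a static parse; the package itself returns BeautifulSoup objects) and shows that text written by a script is absent, because no JavaScript is executed. The `PAGE` markup and the `DivText` class are made up for this illustration.

```python
from html.parser import HTMLParser

PAGE = """
<html><body>
  <div id="app"></div>
  <script>document.getElementById("app").textContent = "loaded by JS";</script>
</body></html>
"""


class DivText(HTMLParser):
    """Collect the text found inside <div id="app"> in the raw markup."""

    def __init__(self) -> None:
        super().__init__()
        self.inside = False
        self.text = ""

    def handle_starttag(self, tag, attrs):
        if tag == "div" and ("id", "app") in attrs:
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "div":
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.text += data


parser = DivText()
parser.feed(PAGE)
# The div is empty in the static markup: the script never ran.
static_text = parser.text.strip()
```

A browser-backed mode such as html_dynamic is what lets the script run before the page is parsed, so the div would contain its generated text.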

Python version >= 3.9

How to use

install

$ pip install preparser

GitHub Resource ➡️ GitHub Repos

Feel free to fork and modify the code. If you like this project, please star ⭐ it, uwu.

PyPI: ➡️ PyPI Publish

parameters

Below are the parameters you can use to initialize a PreParser object from the preparser package:

Parameter Type Description
url_list list The list of URLs to parse from. Default is an empty list.
request_call_back_func Callable or None A callback function that handles the BeautifulSoup object or the JSON object of the response, depending on parser_mode. To signal that your business logic failed, return None; otherwise return any object that is not None.
parser_mode 'html', 'api' or 'html_dynamic' The pre-parsing mode. Default is 'html'.
html: parse the content of static HTML and return a BeautifulSoup object.
api: parse the data from an API and return the JSON object.
html_dynamic: parse the whole web page HTML, including content generated by dynamic JS code, and return a BeautifulSoup object.
Note: these objects are passed to request_call_back_func when you define one; otherwise, get them via PreParser(....).cached_request_datas
cached_data bool Whether to cache the parsed data. Default is False.
start_threading bool Whether to use a thread pool for parsing the data. Default is False.
threading_mode 'map' or 'single' How tasks are distributed to the thread pool. Default is 'single'.
map: use the map function of the thread pool to distribute the tasks.
single: use the submit function to add the tasks to the thread pool one by one.
stop_when_task_failed bool Whether to stop when a request to a URL fails. Default is True.
threading_numbers int The maximum number of threads in the thread pool. Default is 3.
checked_same_site bool Whether to add extra headers that make the request look like a same-site request, to work around CORS blocking. Default is True.
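The two threading_mode values correspond to the two usual ways of feeding a thread pool. The following sketch uses Python's `concurrent.futures` to show the difference; it assumes the package's pool behaves like `ThreadPoolExecutor`, which the source does not confirm, and `work` is a hypothetical stand-in for "request the URL and run the callback".

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def work(url: str) -> str:
    # Stand-in for "request the URL and run the callback".
    return url.upper()


urls = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]

# 'map'-style: hand the whole list to the pool; results come back in input order.
with ThreadPoolExecutor(max_workers=3) as pool:
    mapped = list(pool.map(work, urls))

# 'single'-style: submit tasks one by one and keep a future per URL,
# which allows reacting to each task as soon as it finishes.
submitted = {}
with ThreadPoolExecutor(max_workers=3) as pool:
    futures = {pool.submit(work, url): url for url in urls}
    for future in as_completed(futures):
        submitted[futures[future]] = future.result()
```

Submitting tasks individually makes it easier to stop or inspect work per URL, while map is the shorter form when every URL is treated the same way.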

example

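#  test.py
# postponed annotation evaluation keeps the `X | Y` type hints below valid on Python 3.9
from __future__ import annotations

from preparser import PreParser, BeautifulSoup, Json_Data, Filer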


def handle_preparser_result(url: str, preparser_object: BeautifulSoup | Json_Data) -> BeautifulSoup | Json_Data | None:
    # write your business logic here

    # note:
    # the type of preparser_object depends on the `parser_mode` of the `PreParser`:
    #               'api' : preparser_object is a Json_Data
    #               'html' : preparser_object is a BeautifulSoup
    #               'html_dynamic' : preparser_object is a BeautifulSoup

    # ........

    # return value:
    # return None to mark the current result as failed; return any object that is not None to mark it as successful.
    return preparser_object


if __name__ == "__main__":
    
    #  start the parser
    url_list = [
        'https://example.com/api/1',
        'https://example.com/api/2',
        # .....
    ]
  
    parser = PreParser(
        url_list=url_list,
        request_call_back_func=handle_preparser_result,
        parser_mode='api',    # choose the mode you need: 'html', 'api' or 'html_dynamic'
        start_threading=True,
        threading_mode='single',
        cached_data=True,
        stop_when_task_failed=False,
        threading_numbers=3,
        checked_same_site=True
    )
    
    #  start parse
    parser.start_parse()

    # when all tasks have finished, you can get all the task results like this:
    all_result = parser.cached_request_datas

    # if you want to terminate the parser, call the function below
    # parser.stop_parse()

    # you can also use the Filer to save the final result above;
    # the data is then written to `result/test.json`
    filer = Filer('json')
    filer.write_data_into_file('result/test',[all_result])
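The callback's return value doubles as a success flag, and stop_when_task_failed decides what happens after a failure. The sketch below models that contract sequentially with plain Python. It is an assumption about the semantics, not the package's code: `run_callbacks` and `keep_items` are hypothetical, and the source does not say exactly which results are cached after a failure.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple


def run_callbacks(
    responses: Iterable[Tuple[str, dict]],
    callback: Callable[[str, dict], Optional[object]],
    stop_when_task_failed: bool = True,
) -> Dict[str, object]:
    """Apply `callback` to (url, data) pairs; a None result counts as a failure."""
    cached: Dict[str, object] = {}
    for url, data in responses:
        result = callback(url, data)
        if result is None:
            if stop_when_task_failed:
                break  # give up on the remaining URLs
            continue  # skip this URL and keep going
        cached[url] = result
    return cached


def keep_items(url: str, data: dict) -> Optional[list]:
    # Return None to flag a failure, anything else to flag success.
    return data.get("items")


responses = [
    ("https://example.com/api/1", {"items": [1, 2]}),
    ("https://example.com/api/2", {"error": "not found"}),
    ("https://example.com/api/3", {"items": [3]}),
]

strict = run_callbacks(responses, keep_items, stop_when_task_failed=True)
lenient = run_callbacks(responses, keep_items, stop_when_task_failed=False)
```

In this model, `strict` stops at the second URL, while `lenient` skips it and still collects the third.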

Get Help

Get help ➡️ GitHub issues

Update logs

  • version 2.0.5 : move the browser core installation for dynamic mode from setup into the package call.

  • version 2.0.4 : test the installation command.

  • version 2.0.3 : improve the error messages for html_dynamic.

  • version 2.0.2 : correct the parser_mode section of the README.

  • version 2.0.1 : update the README.

  • version 2.0.0 : add the new html_dynamic parser_mode, which pre-parses all of the HTML content, even if it is generated by JS code.

  • version 1.0.0 : basic version; only pre-parses static HTML and API content.
