A lightweight pre-parser that helps parse web page content or fetch responses from URLs. It supports Windows, macOS, and Unix.

These details have not been verified by PyPI

Project links

Homepage

Project description

Description

This is a lightweight parser that pre-parses data from specified website URLs or APIs. It removes the repetitive code needed to request each URL and speeds up the process with a thread pool, so you only need to focus on your business logic once you have the response from the specified web pages or API URLs.
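
The repetitive code this package is meant to replace looks roughly like the sketch below: fetch each URL on a thread pool and hand each response to a callback. It uses only the standard library. `fetch_all`, `fake_fetch`, and `handle` are hypothetical names chosen for illustration and are not part of preparser's API.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Any, Callable, Dict, List, Optional


def fetch_all(
    urls: List[str],
    fetch: Callable[[str], Any],
    callback: Callable[[str, Any], Optional[Any]],
    max_workers: int = 3,
) -> Dict[str, Any]:
    """Fetch every URL on a thread pool and pass each response to `callback`.

    Results for which the callback returns None are treated as failed and skipped.
    """
    results: Dict[str, Any] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its URL so results can be keyed by URL.
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            outcome = callback(url, future.result())
            if outcome is not None:
                results[url] = outcome
    return results


def fake_fetch(url: str) -> dict:
    # Stand-in for a real HTTP request so the sketch runs offline.
    return {"url": url, "ok": not url.endswith("/bad")}


def handle(url: str, response: dict) -> Optional[dict]:
    # Business logic goes here; returning None marks the task as failed.
    return response if response["ok"] else None


if __name__ == "__main__":
    print(fetch_all(["https://example.com/api/1", "https://example.com/bad"], fake_fetch, handle))
```

With a real HTTP client in place of `fake_fetch`, this is the kind of plumbing you would otherwise rewrite for every project.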

Attention

Version 1.0.0 of this pre-parser could only pre-parse static HTML or API data. Starting with 2.0.0, a new html_dynamic mode retrieves all page content, including content generated by JS code.

Python version >= 3.9

How to use

install

$ pip install preparser

GitHub Resource ➡️ GitHub Repos

Feel free to fork and modify this code. If you like the project, please star ⭐ it, uwu.

PyPI: ➡️ PyPI Publish

parameters

Below are some of the parameters you can use to initialize the PreParser object from the preparser package:

Parameters	Type	Description
url_list	list	The list of URLs to parse from. Default is an empty list.
request_call_back_func	Callable or None	A callback function that, depending on `parser_mode`, handles the `BeautifulSoup` object or the response `json` object. To signal that your business process failed, return `None`; otherwise return any object that is not `None`.
parser_mode	`'html'`, `'api'` or `'html_dynamic'`	The pre-parsing mode. Default is `'html'`. `html`: parse static HTML content and return a `BeautifulSoup` object. `api`: parse data from an API and return the `json` object. `html_dynamic`: parse the whole page's HTML, including content generated by dynamic JS code, and return a `BeautifulSoup` object. In every mode the object is passed to `request_call_back_func` if you defined one; otherwise get it via `PreParser(...).cached_request_datas`.
cached_data	bool	Whether to cache the parsed data. Default is `False`.
start_threading	bool	Whether to use a thread pool for parsing the data. Default is `False`.
threading_mode	`'map'` or `'single'`	How tasks are distributed. Default is `single`. `map`: use the `map` function of the thread pool to distribute tasks. `single`: use the `submit` function to add tasks to the thread pool one by one.
stop_when_task_failed	bool	Whether to stop when a request to a URL fails. Default is `True`.
threading_numbers	int	The maximum number of threads in the thread pool. Default is `3`.
checked_same_site	bool	Whether to add extra header information so that requests appear to come from the same site. Default is `True`; this is intended to work around `CORS` blocking.
html_dynamic_scope	list or None	Selects a specific DOM scope within the whole page HTML. Default is `None`, which stands for the whole page. If set, the value must be a list of two items. 1. The first item is a tag selector. For example, 'div#main' means a div tag with 'id=main', and 'div.test' gets the first matched div tag with 'class=test'. Don't make the selector too complex or let it match multiple parent DOM nodes; otherwise the inner_html() can't be retrieved correctly or the call times out. The BeautifulSoup object built from the inner_html of the selected tag is what you receive in `request_call_back_func`. 2. The second item must be one of the values below: `attached`: wait for the element to be present in the DOM. `detached`: wait for the element to not be present in the DOM. `visible`: wait for the element to have a non-empty bounding box and no 'visibility:hidden'. Note that an element without any content or with 'display:none' has an empty bounding box and is not considered visible. `hidden`: wait for the element to be either detached from the DOM, or to have an empty bounding box or 'visibility:hidden'. This is the opposite of the 'visible' option.
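
Because `html_dynamic_scope` has the most moving parts, a small pre-flight check can catch a malformed value before any parsing starts. The sketch below is a hypothetical helper (`check_html_dynamic_scope` is not part of preparser); it only encodes the rules from the table above: `None`, or a two-item list made of a selector and one of the four wait states.

```python
from typing import List, Optional

# The four wait states listed in the parameter table.
WAIT_STATES = ("attached", "detached", "visible", "hidden")


def check_html_dynamic_scope(scope: Optional[List[str]]) -> Optional[List[str]]:
    """Return `scope` unchanged if it is well-formed, else raise ValueError."""
    if scope is None:
        return None  # None means "use the whole page"
    if not isinstance(scope, list) or len(scope) != 2:
        raise ValueError("html_dynamic_scope must be None or a list of 2 items")
    selector, state = scope
    if not isinstance(selector, str) or not selector.strip():
        raise ValueError("the first item must be a non-empty selector string")
    if state not in WAIT_STATES:
        raise ValueError(f"the second item must be one of {WAIT_STATES}")
    return scope


if __name__ == "__main__":
    print(check_html_dynamic_scope(["div#main", "attached"]))
```

A value that passes this check would then be given as `html_dynamic_scope=...` together with `parser_mode='html_dynamic'`.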

example

#  test.py
from __future__ import annotations  # lets the `X | Y` annotations below work on Python 3.9

from preparser import PreParser, BeautifulSoup, Json_Data, Filer


def handle_preparser_result(url: str, preparser_object: BeautifulSoup | Json_Data) -> BeautifulSoup | Json_Data | None:
    # Write your business logic here.

    # Note:
    # the type of preparser_object depends on the `parser_mode` passed to `PreParser`:
    #               'api' : preparser_object is a Json_Data
    #               'html' or 'html_dynamic' : preparser_object is a BeautifulSoup

    # ........

    # Return value:
    # return None to mark the current result as failed; otherwise return any object that is not None.
    return preparser_object


if __name__ == "__main__":

    # URLs to parse
    url_list = [
        'https://example.com/api/1',
        'https://example.com/api/2',
        # .....
    ]

    parser = PreParser(
        url_list=url_list,
        request_call_back_func=handle_preparser_result,
        parser_mode='api',    # set this to match your source: 'api', 'html', or 'html_dynamic'
        start_threading=True,
        threading_mode='single',
        cached_data=True,
        stop_when_task_failed=False,
        threading_numbers=3,
        checked_same_site=True
    )

    # start parsing
    parser.start_parse()

    # when all tasks have finished, you can get all the results like this:
    all_result = parser.cached_request_datas

    # to terminate parsing, call the function below
    # parser.stop_parse()

    # you can also use the Filer to save the final result above;
    # the data is then written to `result/test.json`
    filer = Filer('json')
    filer.write_data_into_file('result/test', [all_result])
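
The example above uses `threading_mode='single'`. The difference between the two modes can be illustrated with the standard library's `ThreadPoolExecutor`, which offers both a `map` and a `submit` method. This is a sketch of the general technique, not preparser's internal code; `work`, `run_with_map`, and `run_with_submit` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import List


def work(n: int) -> int:
    # Hypothetical stand-in for fetching and parsing one URL.
    return n * n


def run_with_map(items: List[int], workers: int = 3) -> List[int]:
    # `map` hands over the whole batch and yields results in input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(work, items))


def run_with_submit(items: List[int], workers: int = 3) -> List[int]:
    # `submit` queues tasks one by one; each future can be inspected
    # or collected individually, here in completion order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(work, item) for item in items]
        return [future.result() for future in as_completed(futures)]


if __name__ == "__main__":
    print(run_with_map([1, 2, 3]))
    print(sorted(run_with_submit([1, 2, 3])))
```

In general, `submit` gives per-task control, which is useful when a single failure should be handled on its own, while `map` is shorter when all results are wanted in input order.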

Get Help

Get help ➡️ Github issue

Update logs

version 2.0.6 : add the html_dynamic_scope parameter so users can restrict the scope of a dynamic parse, which speeds up pre-parsing when parser_mode is html_dynamic; also reorganize the additional tools into the ToolsHelper package.
version 2.0.5 : move the dynamic-mode browser core installation from setup into the package call.
version 2.0.4 : test the installation command.
version 2.0.3 : optimize the error alerts for html_dynamic.
version 2.0.2 : correct the README documentation of parser_mode.
version 2.0.1 : update the README.
version 2.0.0 : add the new html_dynamic parser_mode, which pre-parses all HTML content, even content generated by JS code.
version 1.0.0 : basic version; only pre-parses static HTML and API content.

Release history

2.0.8 : Jan 17, 2025
2.0.7 : Jan 15, 2025
2.0.6 (this version) : Jan 15, 2025
2.0.5 : Jan 12, 2025
1.0.0 : Jan 9, 2025

Download files

Source Distribution

preparser-2.0.6.tar.gz (17.8 kB)

Uploaded Jan 15, 2025 Source

Built Distribution

preparser-2.0.6-py3-none-any.whl (15.4 kB)

Uploaded Jan 15, 2025 Python 3

File details

Details for the file preparser-2.0.6.tar.gz.

File metadata

Download URL: preparser-2.0.6.tar.gz
Upload date: Jan 15, 2025
Size: 17.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.0.1 CPython/3.12.6

File hashes

Hashes for preparser-2.0.6.tar.gz
Algorithm	Hash digest
SHA256	`2b1318a1cb2b1c6b50778ff1c4246c7e9283681c4d104a4a52d31195a5f03b27`
MD5	`31e32d694cc29a24ad2bd3d82f65248a`
BLAKE2b-256	`dc63920f94dc7a0c66b73c644f5933fa1d638aa12e8f7d647ae2033f8ab7b751`
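
To check a downloaded archive against the SHA256 digest listed above, something like the following sketch can be used. It relies only on `hashlib`; `sha256_of_file` is a helper name chosen here, and the local path is an assumption about where the file was saved.

```python
import hashlib
import os


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # Digest copied from the table above; the path is an assumed download location.
    expected = "2b1318a1cb2b1c6b50778ff1c4246c7e9283681c4d104a4a52d31195a5f03b27"
    path = "preparser-2.0.6.tar.gz"
    if os.path.exists(path):
        print("match" if sha256_of_file(path) == expected else "MISMATCH")
    else:
        print(f"{path} not found; download it first")
```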

File details

Details for the file preparser-2.0.6-py3-none-any.whl.

File metadata

Download URL: preparser-2.0.6-py3-none-any.whl
Upload date: Jan 15, 2025
Size: 15.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.0.1 CPython/3.12.6

File hashes

Hashes for preparser-2.0.6-py3-none-any.whl
Algorithm	Hash digest
SHA256	`decc0e0037b4b5f54361a4d9a3d353866bb3bd530abdcef9a17c0bd1d56907a0`
MD5	`0bc47d5ed64ea947fca1edbeda4946f3`
BLAKE2b-256	`548e176214cce79a7b69d1cd83788e8a0f1b2d529a01e62d2e09fe34e372991c`
