
Quick Crawler

A toolkit for quickly performing crawler functions

Installation

pip install quick-crawler

Functions

  1. Fetch an HTML page and optionally save it to a file when a file path is given.
  2. Parse an HTML string into a queryable page object.
  3. Fetch or download a series of URLs that share a similar format, such as a page list.
  4. Remove Unicode characters from a string.
  5. Fetch a JSON object from an online URL.
  6. Read a series of objects from an online JSON list.
  7. Quickly save a list of JSON objects to a CSV file.
  8. Quickly read a CSV file into a list of selected fields.
  9. Quickly download a file.
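
As the examples below show, the page-level helpers (functions 1 to 9 above) live in quick_crawler.page, while browser-based fetching lives in quick_crawler.browser:

from quick_crawler.page import *   # HTML/JSON/CSV helpers used in Examples 1 and 2
from quick_crawler import browser  # browser-based fetching used in Examples 3 and 4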

Let the Code Speak

Example 1:

from quick_crawler.page import *

if __name__=="__main__":
    # Fetch an HTML page (a file path can also be passed to save it)
    url="https://learnersdictionary.com/3000-words/alpha/a"
    html_str=quick_html_page(url)
    print(html_str)

    # Parse the HTML string into a queryable object
    html_obj=quick_html_object(html_str)
    word_list=html_obj.find("ul",{"class":"a_words"}).findAll("li")
    print("word list: ")
    for word in word_list:
        print(word.find("a").text.replace("  ","").strip())

    # Fetch a series of URLs that share a similar format, like a page list
    url_range="https://learnersdictionary.com/3000-words/alpha/a/{pi}"
    list_html_str=quick_html_page_range(url_range,min_page=1,max_page=10)
    for idx,html in enumerate(list_html_str):
        html_obj = quick_html_object(html)
        word_list = html_obj.find("ul", {"class": "a_words"}).findAll("li")
        list_w=[]
        for word in word_list:
            list_w.append(word.find("a").text.replace("  ", "").strip())
        print(f"Page {idx+1}: ", ','.join(list_w))

Example 2:

from quick_crawler.page import *

if __name__=="__main__":
    # Remove Unicode characters from a string
    u_str = 'aà\xb9'
    u_str_removed = quick_remove_unicode(u_str)
    print("Removed str: ", u_str_removed)

    # Fetch a JSON object from a URL
    json_url="http://soundcloud.com/oembed?url=http%3A//soundcloud.com/forss/flickermood&format=json"
    json_obj=quick_json_obj(json_url)
    print(json_obj)
    for k in json_obj:
        print(k,json_obj[k])

    # Read a series of objects from an online JSON list
    json_list_url = "https://jsonplaceholder.typicode.com/posts"
    json_list = quick_json_obj(json_list_url)
    print(json_list)
    for obj in json_list:
        userId = obj["userId"]
        title = obj["title"]
        body = obj["body"]
        print(userId, title, body)

    # Save the list of JSON objects to a CSV file
    quick_save_csv("news_list.csv",['userId','id','title','body'],json_list)

    # Read the CSV file back into a list of selected fields
    list_result=quick_read_csv("news_list.csv",fields=['userId','title'])
    print(list_result)

    # Download a file
    quick_download_file("https://www.englishclub.com/images/english-club-C90.png",save_file_path="logo.png")
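
To sanity-check the CSV produced by quick_save_csv, the file can also be read back with Python's standard csv module (a minimal sketch, assuming the header row matches the field names passed above):

import csv

with open("news_list.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):            # each row is a dict keyed by the header: userId, id, title, body
        print(row["userId"], row["title"])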

Example 3: Obtain HTML text from a browser

from quick_crawler import browser
if __name__=="__main__":
    html_str = browser.get_html_str_with_browser(
        "https://pypi.org/project/quick-crawler/0.0.2/",
        driver_path='../../examples/browsers/chromedriver.exe')
    print(html_str)
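
The driver_path above points to a Windows chromedriver.exe; on Linux or macOS, pass the path to a chromedriver binary that matches the locally installed Chrome version.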

Example 4: Crawl a series of web pages from a group of websites

from quick_crawler import browser
import os
list_item = [
    ['CNN', 'https://edition.cnn.com/'],
    ['AP', 'https://apnews.com/'],
]
current_path = os.path.dirname(os.path.realpath(__file__))
browser.fetch_meta_info_from_sites(
    list_item,
    current_path + "/data",
    is_save_fulltext=True,
    use_plain_text=True)
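
The second argument appears to be the output directory for the fetched metadata (plus the page text when is_save_fulltext=True); if the library does not create it on its own, it can be created beforehand with plain Python:

# Precaution (assumption, not part of the library's documented behavior):
# make sure the output directory exists before calling fetch_meta_info_from_sites
os.makedirs(current_path + "/data", exist_ok=True)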

License

The quick-crawler project is provided by Donghua Chen.
