A toolkit for quickly performing crawler functions
Project description
Quick Crawler
A toolkit for quickly performing crawler functions
Installation
pip install quick-crawler
Functions
- get a html page and can save the file if the file path is assigned.
- get a json object from html string
- get or download a series of url with similar format, like a page list
- remove unicode str
- get json object online
- read a series of obj from a json list online
- quick save csv file from a list of json objects
- quick read csv file to a list of fields
- quick download a file
Let Codes Speak
Example 1:
from quick_crawler.page import *
if __name__=="__main__":
# get a html page and can save the file if the file path is assigned.
url="https://learnersdictionary.com/3000-words/alpha/a"
html_str=quick_html_page(url)
print(html_str)
# get a json object from html string
html_obj=quick_html_object(html_str)
word_list=html_obj.find("ul",{"class":"a_words"}).findAll("li")
print("word list: ")
for word in word_list:
print(word.find("a").text.replace(" ","").strip())
# get or download a series of url with similar format, like a page list
url_range="https://learnersdictionary.com/3000-words/alpha/a/{pi}"
list_html_str=quick_html_page_range(url_range,min_page=1,max_page=10)
for idx,html in enumerate(list_html_str):
html_obj = quick_html_object(html)
word_list = html_obj.find("ul", {"class": "a_words"}).findAll("li")
list_w=[]
for word in word_list:
list_w.append(word.find("a").text.replace(" ", "").strip())
print(f"Page {idx+1}: ", ','.join(list_w))
Example 2:
from quick_crawler.page import *
if __name__=="__main__":
# remove unicode str
u_str = 'aà\xb9'
u_str_removed = quick_remove_unicode(u_str)
print("Removed str: ", u_str_removed)
# get json object online
json_url="http://soundcloud.com/oembed?url=http%3A//soundcloud.com/forss/flickermood&format=json"
json_obj=quick_json_obj(json_url)
print(json_obj)
for k in json_obj:
print(k,json_obj[k])
# read a series of obj from a json list online
json_list_url = "https://jsonplaceholder.typicode.com/posts"
json_list = quick_json_obj(json_list_url)
print(json_list)
for obj in json_list:
userId = obj["userId"]
title = obj["title"]
body = obj["body"]
print(obj)
# quick save csv file from a list of json objects
quick_save_csv("news_list.csv",['userId','id','title','body'],json_list)
# quick read csv file to a list of fields
list_result=quick_read_csv("news_list.csv",fields=['userId','title'])
print(list_result)
# quick download a file
quick_download_file("https://www.englishclub.com/images/english-club-C90.png",save_file_path="logo.png")
Example 3: obtain html text from the Browser
from quick_crawler import browser
import os
if __name__=="__main__":
html_str=browser.get_html_str_with_browser("https://pypi.org/project/quick-crawler/0.0.2/",driver_path='../../examples/browsers/chromedriver.exe')
print(html_str)
Example 4: Crawl a series of web pages from a group of websites
from quick_crawler import browser
import os
list_item=[
['CNN','https://edition.cnn.com/'],
['AP','https://apnews.com/']
]
current_path = os.path.dirname(os.path.realpath(__file__))
browser.fetch_meta_info_from_sites(list_item,current_path+"/data",is_save_fulltext=True,use_plain_text=True)
License
The quick-crawler
project is provided by Donghua Chen.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
quick-crawler-0.0.3a1.tar.gz
(19.9 kB
view hashes)
Built Distribution
Close
Hashes for quick_crawler-0.0.3a1-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | a6806c5537fefd00b4b853a93f34311b63164f8ef9119f85e008433182ba8f75 |
|
MD5 | 8b9d61421f17bc4af3c403ac9d2c6c84 |
|
BLAKE2b-256 | e2aecb5d317d23c2e96f4e10c18bdf92d393f1e895359b29b33f88076149ea71 |