A toolkit for quickly performing crawler functions
Project description
Quick Crawler
A toolkit for quickly performing crawler functions
Installation
pip install quick-crawler
Functions
- get a html page and can save the file if the file path is assigned.
- get a json object from html string
- get or download a series of url with similar format, like a page list
- remove unicode str
- get json object online
- read a series of obj from a json list online
- quick save csv file from a list of json objects
- quick read csv file to a list of fields
- quick download a file
- quick crawler of a series of multi-lang websites
Let Codes Speak
Example 1:
from quick_crawler.page import *
if __name__=="__main__":
# get a html page and can save the file if the file path is assigned.
url="https://learnersdictionary.com/3000-words/alpha/a"
html_str=quick_html_page(url)
print(html_str)
# get a json object from html string
html_obj=quick_html_object(html_str)
word_list=html_obj.find("ul",{"class":"a_words"}).findAll("li")
print("word list: ")
for word in word_list:
print(word.find("a").text.replace(" ","").strip())
# get or download a series of url with similar format, like a page list
url_range="https://learnersdictionary.com/3000-words/alpha/a/{pi}"
list_html_str=quick_html_page_range(url_range,min_page=1,max_page=10)
for idx,html in enumerate(list_html_str):
html_obj = quick_html_object(html)
word_list = html_obj.find("ul", {"class": "a_words"}).findAll("li")
list_w=[]
for word in word_list:
list_w.append(word.find("a").text.replace(" ", "").strip())
print(f"Page {idx+1}: ", ','.join(list_w))
Example 2:
from quick_crawler.page import *
if __name__=="__main__":
# remove unicode str
u_str = 'aà\xb9'
u_str_removed = quick_remove_unicode(u_str)
print("Removed str: ", u_str_removed)
# get json object online
json_url="http://soundcloud.com/oembed?url=http%3A//soundcloud.com/forss/flickermood&format=json"
json_obj=quick_json_obj(json_url)
print(json_obj)
for k in json_obj:
print(k,json_obj[k])
# read a series of obj from a json list online
json_list_url = "https://jsonplaceholder.typicode.com/posts"
json_list = quick_json_obj(json_list_url)
print(json_list)
for obj in json_list:
userId = obj["userId"]
title = obj["title"]
body = obj["body"]
print(obj)
# quick save csv file from a list of json objects
quick_save_csv("news_list.csv",['userId','id','title','body'],json_list)
# quick read csv file to a list of fields
list_result=quick_read_csv("news_list.csv",fields=['userId','title'])
print(list_result)
# quick download a file
quick_download_file("https://www.englishclub.com/images/english-club-C90.png",save_file_path="logo.png")
Example 3: obtain html text from the Browser
from quick_crawler import browser
import os
if __name__=="__main__":
html_str=browser.get_html_str_with_browser("https://pypi.org/project/quick-crawler/0.0.2/",driver_path='../../examples/browsers/chromedriver.exe')
print(html_str)
Example 4: Crawl a series of web pages from a group of websites
from quick_crawler import browser
import os
list_item=[
['CNN','https://edition.cnn.com/'],
['AP','https://apnews.com/']
]
current_path = os.path.dirname(os.path.realpath(__file__))
browser.fetch_meta_info_from_sites(list_item,current_path+"/data",is_save_fulltext=True,use_plain_text=True)
Example 5: Crawl a series of websites with advanced settings
from quick_crawler import page,browser
import os
import pickle
list_item=pickle.load(open("list_news_site.pickle","rb"))[20:]
current_path = os.path.dirname(os.path.realpath(__file__))
browser.fetch_meta_info_from_sites(list_item,current_path+"/news_data1",
is_save_fulltext=True,
use_plain_text=False,
max_num_urls=100,
use_keywords=True
)
list_model=browser.summarize_downloaded_data("news_data1",
# save_path="news_data_list.csv"
)
Example 6: Multi-lang crawler
import os
from quick_crawler.multilang import get_sites_with_multi_lang_keywords
keywords="digital economy"
init_urls=[
["en-cnn","https://edition.cnn.com/"],
['jp-asahi', 'https://www.asahi.com/'],
['ru-mk', 'https://www.mk.ru/'],
['zh-xinhuanet', 'http://xinhuanet.com/'],
]
current_path = os.path.dirname(os.path.realpath(__file__))
list_item=get_sites_with_multi_lang_keywords(
init_urls=init_urls,
src_term=keywords,
src_language="en",
target_langs=["ja","zh","es","ru"],
save_data_folder=f"{current_path}/news_data3"
)
Example 7: get multiple translations based on a keyword
import pickle
from quick_crawler.language import *
terms = 'digital economy'
dict_lang=get_lang_dict_by_translation("en",terms)
pickle.dump(dict_lang,open(f"multi-lang-{terms}.pickle",'wb'))
Example 8: Pipeline for web page list processing
from quick_crawler.pipline.page_list import run_web_list_analysis_shell
if __name__=="__main__":
def find_list(html_obj):
return html_obj.find("div", {"class": "bd"}).findAll("li")
def get_item(item):
datetime = item.find("span").text
title = item.find("a").text
url = item.find("a")["href"]
return title, url, datetime
run_web_list_analysis_shell(
url_pattern="https://www.abc.com/index_{p}.html",
working_folder='test',
min_page=1,
max_page=2,
fn_find_list=find_list,
fn_get_item=get_item,
tag='xxxx'
)
License
The quick-crawler project is provided by Donghua Chen.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file quick-crawler-0.0.8.tar.gz.
File metadata
- Download URL: quick-crawler-0.0.8.tar.gz
- Upload date:
- Size: 34.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/34.0 requests/2.27.1 requests-toolbelt/0.9.1 urllib3/1.26.9 tqdm/4.63.0 importlib-metadata/4.11.3 keyring/23.5.0 rfc3986/2.0.0 colorama/0.4.4 CPython/3.9.11
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
c9900833b303781d2f1050372b6d301a5074cb1f5172905816d03a77b78f33b9
|
|
| MD5 |
e9cba5d8a3ab56f710f9e7765ca387b0
|
|
| BLAKE2b-256 |
e81e379c2e897974b79b62e0bbfb5d6e417e7d788499c3ffff395003a855a65a
|
File details
Details for the file quick_crawler-0.0.8-py3-none-any.whl.
File metadata
- Download URL: quick_crawler-0.0.8-py3-none-any.whl
- Upload date:
- Size: 42.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/34.0 requests/2.27.1 requests-toolbelt/0.9.1 urllib3/1.26.9 tqdm/4.63.0 importlib-metadata/4.11.3 keyring/23.5.0 rfc3986/2.0.0 colorama/0.4.4 CPython/3.9.11
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
b68991432332c316d45647aeb3228db04741178f8db6cb88b6142ec6f6f660b3
|
|
| MD5 |
bc6b9a2effd59d8f54c6f404be0b3e9b
|
|
| BLAKE2b-256 |
78a4859371941e6b5303c413ea4efc6dc8ac00cb2fc1c821623503a29fd6743a
|