The pocket of Doraemon: many tools
Doraemon
Doraemon is a toolkit of helper utilities and crawlers, including:
- Google Knowledge Graph [Deprecated]
- Google Translator
- Dianping (大众点评) [Deprecated]
- QQ music lyrics
- whois
- NetEase music comments
- Parse domain name to IP (in batch)
Tools
1. Robust Requests
from Doraemon import requests_dora
url = "https://www.baidu.com"
headers = requests_dora.get_default_headers()
headers["User-Agent"] = requests_dora.get_random_user_agent()
def get_proxies():
    proxy_str = "127.0.0.1:1080"
    proxies = {"http": "http://%s" % proxy_str,
               "https": "http://%s" % proxy_str, }
    return proxies
# max_times, get_proxies_fun, and invoked_by are optional; all other parameters are the same as for requests.get() and requests.post()
res1 = requests_dora.try_best_2_get(url, max_times=5, get_proxies_fun=get_proxies, invoked_by="parent_fun_name")
res2 = requests_dora.try_best_2_post(url, max_times=5, get_proxies_fun=get_proxies)
print(res1.status_code)
print(res2.status_code)
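The retry behaviour implied by max_times and get_proxies_fun can be pictured with a small stand-alone sketch. This is not the requests_dora implementation: try_best, flaky_fetch, and the delay parameter are hypothetical names used only for illustration, and the fetcher is simulated so the example runs without network access.

```python
import time


def try_best(fetch, max_times=5, get_proxies_fun=None, delay=0.0):
    """Call fetch(proxies) until it succeeds or max_times attempts are used."""
    last_error = None
    for _ in range(max_times):
        # Ask for a fresh proxies dict on every attempt, if a provider is given.
        proxies = get_proxies_fun() if get_proxies_fun else None
        try:
            return fetch(proxies)
        except Exception as err:  # a real client would catch narrower errors
            last_error = err
            time.sleep(delay)
    raise RuntimeError("all %d attempts failed" % max_times) from last_error


# Hypothetical flaky fetcher standing in for requests.get:
# it fails twice, then succeeds.
calls = {"n": 0}


def flaky_fetch(proxies):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated network failure")
    return {"status_code": 200, "proxies": proxies}


result = try_best(flaky_fetch, max_times=5,
                  get_proxies_fun=lambda: {"http": "http://127.0.0.1:1080"})
print(result["status_code"], calls["n"])
```

Requesting a new proxies dict on each attempt means a dead proxy does not doom every retry, which is presumably why the library accepts a function rather than a fixed dict.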
2. Proxy Kit
from Doraemon import proxies_dora
proxies1 = proxies_dora.get_proxies("127.0.0.1:223") # get a self-defined proxies dict
proxies2 = proxies_dora.get_data5u_proxies("your data5u api key") # input api key for crawling, get a proxies dict
pool = [
"127.0.0.1:233",
"123.123.123.123:123",
"...",
]
proxies_dora.set_pool(pool) # set a self-defined proxy pool
proxies3 = proxies_dora.get_proxies_fr_pool() # get a proxies dict from the pool
loc_info1 = proxies_dora.loc_proxy_ipv4(proxies1) # get location info of a given proxy, ipv4 only
loc_info2 = proxies_dora.loc_proxy(proxies2) # get location info of a given proxy, for both ipv4 and ipv6
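For readers unfamiliar with the proxies dict format used by requests, the following sketch shows one way a "host:port" string and a pool of such strings could be turned into proxies dicts. build_proxies and ProxyPool are hypothetical names, and random selection is an assumption; proxies_dora may choose from its pool differently.

```python
import random


def build_proxies(proxy_str):
    """Turn "host:port" into a requests-style proxies dict."""
    proxy_url = "http://%s" % proxy_str
    return {"http": proxy_url, "https": proxy_url}


class ProxyPool:
    """Hypothetical in-memory pool; not the proxies_dora implementation."""

    def __init__(self, addresses):
        self.addresses = list(addresses)

    def get_proxies(self):
        # Pick one address at random so load spreads across the pool.
        return build_proxies(random.choice(self.addresses))


pool = ProxyPool(["127.0.0.1:233", "123.123.123.123:123"])
print(pool.get_proxies())
```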
3. User-friendly Chrome
from Doraemon import chrome_dora
proxy = "127.0.0.1:1080"
baidu_url = "https://www.baidu.com"
# no_images: do not load images (pages respond more quickly)
# headless: run Chrome without a visible window
# proxy: set this if you need a proxy
# all three parameters are optional
chrome = chrome_dora.MyChrome(headless=False, proxy="127.0.0.1:1080", no_images=True)
chrome.get(baidu_url)
print(chrome.page_source)
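Options like these are commonly expressed as Chrome command-line switches. The sketch below maps the three flags to switches that Chrome is generally known to accept; whether MyChrome configures the browser this way is an assumption, and build_chrome_args is a hypothetical helper.

```python
def build_chrome_args(headless=False, proxy=None, no_images=False):
    """Return a list of Chrome command-line switches for the given options."""
    args = []
    if headless:
        # Run without a visible browser window.
        args.append("--headless")
    if proxy:
        # Route traffic through an HTTP proxy given as "host:port".
        args.append("--proxy-server=http://%s" % proxy)
    if no_images:
        # Ask the rendering engine not to load images.
        args.append("--blink-settings=imagesEnabled=false")
    return args


print(build_chrome_args(headless=True, proxy="127.0.0.1:1080", no_images=True))
```

With Selenium, each string could be passed to ChromeOptions.add_argument before the driver is created.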
Crawlers
1. Google Knowledge Graph [Deprecated]
from Doraemon import google_KG
def get_proxies():
    proxy_str = "127.0.0.1:1080"
    proxies = {"http": "http://%s" % proxy_str,
               "https": "http://%s" % proxy_str, }
    return proxies
res = google_KG.get_entity("alibaba", get_proxies_fun=get_proxies)
print(res)
2. Google Translator
from Doraemon import google_translator, proxies_dora
def get_proxies():
    proxy_str = "127.0.0.1:1080"
    proxies = {"http": "http://%s" % proxy_str,
               "https": "http://%s" % proxy_str, }
    return proxies
ori_text = "中华民国"
# sl, tl, and get_proxies_fun are optional; their default values are "auto", "en", and None
res1 = google_translator.trans(ori_text, sl="auto", tl="zh-TW", get_proxies_fun=get_proxies)
# alternatively, build the proxies dict with proxies_dora.get_proxies("127.0.0.1:1080") instead of a custom get_proxies function
res2 = google_translator.trans(ori_text, sl="auto", tl="zh-TW", get_proxies_fun=lambda: proxies_dora.get_proxies("127.0.0.1:1080"))
long_text = ori_text * 2500  # 10000 characters
res3 = google_translator.trans_long(long_text)  # use trans_long when len(text) > 5000
print(res1)
print(res2)
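The example above uses trans_long for text longer than 5000 characters. How the library splits such text is not documented here; the sketch below shows the simplest possible approach, fixed-size chunking, with split_text as a hypothetical helper.

```python
def split_text(text, limit=5000):
    """Split text into consecutive chunks of at most `limit` characters."""
    return [text[i:i + limit] for i in range(0, len(text), limit)]


long_text = "中华民国" * 2500  # 10000 characters
chunks = split_text(long_text)
print(len(chunks), [len(c) for c in chunks])
```

Fixed-size chunks can cut a sentence in half, which may hurt translation quality; splitting on sentence boundaries would be a natural refinement.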
Language Code:
Detect language: auto
Albanian: sq
Arabic: ar
Amharic: am
Azerbaijani: az
Irish: ga
Estonian: et
Basque: eu
Belarusian: be
Bulgarian: bg
Icelandic: is
Polish: pl
Bosnian: bs
Persian: fa
Afrikaans: af
Danish: da
German: de
Russian: ru
French: fr
Filipino: tl
Finnish: fi
Frisian: fy
Khmer: km
Georgian: ka
Gujarati: gu
Kazakh: kk
Haitian Creole: ht
Korean: ko
Hausa: ha
Dutch: nl
Kyrgyz: ky
Galician: gl
Catalan: ca
Czech: cs
Kannada: kn
Corsican: co
Croatian: hr
Kurdish: ku
Latin: la
Latvian: lv
Lao: lo
Lithuanian: lt
Luxembourgish: lb
Romanian: ro
Malagasy: mg
Maltese: mt
Marathi: mr
Malayalam: ml
Malay: ms
Macedonian: mk
Maori: mi
Mongolian: mn
Bengali: bn
Burmese: my
Hmong: hmn
Xhosa: xh
Zulu: zu
Nepali: ne
Norwegian: no
Punjabi: pa
Portuguese: pt
Pashto: ps
Chichewa: ny
Japanese: ja
Swedish: sv
Samoan: sm
Serbian: sr
Sesotho: st
Sinhala: si
Esperanto: eo
Slovak: sk
Slovenian: sl
Swahili: sw
Scottish Gaelic: gd
Cebuano: ceb
Somali: so
Tajik: tg
Telugu: te
Tamil: ta
Thai: th
Turkish: tr
Welsh: cy
Urdu: ur
Ukrainian: uk
Uzbek: uz
Spanish: es
Hebrew: iw
Greek: el
Hawaiian: haw
Sindhi: sd
Hungarian: hu
Shona: sn
Armenian: hy
Igbo: ig
Italian: it
Yiddish: yi
Hindi: hi
Sundanese: su
Indonesian: id
Javanese: jw
English: en
Yoruba: yo
Vietnamese: vi
Chinese (Traditional): zh-TW
Chinese (Simplified): zh-CN
3. Dianping [Deprecated: character decoding]
from Doraemon import dianping, proxies_dora
import json
# get_proxies_fun is optional, set if you want to use a proxy
shop_list = dianping.search_shops("2", "4s店", 1, get_proxies_fun=lambda: proxies_dora.get_proxies("127.0.0.1:1080")) # args: city id, keyword, page index
print(json.dumps(shop_list, indent=2, ensure_ascii=False))
# [{"name": "shopname1", "shop_id": "1245587"}, ...]
# get_proxies_fun is optional; set it if you want to use a proxy. This example uses a data5u proxy,
# documented at http://www.data5u.com/api/doc-dynamic.html
shop_list_around = dianping.get_around("2", "5724615", 2000, 1, get_proxies_fun=lambda: proxies_dora.get_data5u_proxies("your data5u api key")) # args: city id, shop id, max distance, page index
print(json.dumps(shop_list_around, indent=2, ensure_ascii=False))
'''
shop_list_around is like this:
[
  {
    "img_src": "https://img.meituan.net/msmerchant/2e5787325ba4579ec2e2e3f45038ade1149446.jpg%40340w_255h_1e_1c_1l%7Cwatermark%3D1%26%26r%3D1%26p%3D9%26x%3D2%26y%3D2%26relative%3D1%26o%3D20",
    "title": "速度披萨(华贸城店)",
    "star_level": 4.5,
    "review_num": 30,
    "mean_price": 89,
    "cat": "西餐",
    "region": "北苑家园",
    "addr": "清苑路13号",
    "rec_dish": [
      "黑芝麻沙拉",
      "蟹肉意面",
      "火腿榴莲披萨双拼"
    ],
    "score": {
      "taste": 8.5,
      "env": 8.4,
      "service": 8.4
    }
  },
]
'''
4. QQ music lyrics
import os
from Doraemon import qq_music_crawler_by_album as qm_album, qq_music_crawler_by_area as qm_area
# crawl lyrics of songs in specific areas
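area_list = ["港台", "内地"]  # Hong Kong/Taiwan, Mainland China
# area name -> id: {'全部': -100, '内地': 200, '港台': 2, '欧美': 5, '日本': 4, '韩国': 3, '其他': 6}
# (in order: all, Mainland China, Hong Kong/Taiwan, Europe/America, Japan, Korea, other)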
save_path = "./qq_music_songs_by_area"
if not os.path.exists(save_path):
    os.makedirs(save_path)
qm_area.crawl_songs(area_list, save_path)
# crawl lyrics by albums
import json
from tqdm import tqdm
save_path = "./qq_music_songs_by_album"
if not os.path.exists(save_path):
    os.makedirs(save_path)
for sin in range(0, 7050, 30):
    ein = sin + 29
    album_list = qm_album.get_album_list(sin, ein)  # get 30 albums
    for album in album_list:
        dissname = album["dissname"]
        song_list = qm_album.get_song_list(album["dissid"])
        chunk = []
        for song in tqdm(song_list, desc="getting songs in {}".format(dissname)):
            contributors, lyric = qm_album.get_lyric(song)
            song["lyric"] = lyric
            chunk.append(song)
        # write one JSON file per album; the with block closes the file
        with open("{}/lyric_{}.json".format(save_path, dissname), "w", encoding="utf-8") as f:
            json.dump(chunk, f, ensure_ascii=False)
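The album loop above steps through inclusive index windows: sin is the first index and ein = sin + 29 is the last, so each call covers 30 albums. The stand-alone sketch below reproduces just that windowing; index_windows is a hypothetical helper, not part of the library.

```python
def index_windows(start, stop, size):
    """Yield inclusive (first, last) index pairs covering [start, stop)."""
    for first in range(start, stop, size):
        yield first, first + size - 1


windows = list(index_windows(0, 7050, 30))
print(len(windows), windows[0], windows[-1])
```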
5. whois
from Doraemon import whois
ip_list = ["154.17.24.36", "154.17.24.37", "154.17.24.39", "154.17.21.36"] * 100
# "friendly" mode: takes block_size and sleep arguments (uncomment to use)
# res = whois.extract_org_names_friendly(ip_list, block_size = 100, sleep = 2)
# unlimited mode
res = whois.extract_org_names_no_limited(ip_list)
print(res)
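The block_size and sleep arguments of the friendly variant suggest that lookups are issued in blocks with a pause between them. The sketch below illustrates that pattern under this assumption; process_in_blocks and the lookup stand-in are hypothetical, and no real whois query is made.

```python
import time


def process_in_blocks(items, handler, block_size=100, sleep=2.0):
    """Apply handler to items block by block, pausing between blocks."""
    results = []
    for start in range(0, len(items), block_size):
        block = items[start:start + block_size]
        results.extend(handler(item) for item in block)
        # Pause only if another block follows.
        if start + block_size < len(items):
            time.sleep(sleep)
    return results


def lookup(ip):
    # Hypothetical handler standing in for a real whois lookup.
    return "org-for-" + ip


ip_list = ["154.17.24.36", "154.17.24.37"] * 3
res = process_in_blocks(ip_list, lookup, block_size=4, sleep=0)
print(res)
```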
6. NetEase music comments
Run the following command from the netease_music directory:
scrapy crawl comments
7. domain2ip
from Doraemon import domain2ip
threads = 100
max_fail_num = 0
domain_name2ip = {} # results
url_list = ["https://www.baidu.com", "https://www.qq.com"]
domain2ip.gethostbyname_fast(url_list, domain_name2ip, threads, max_fail_num)
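As a rough picture of batch resolution, the sketch below extracts hostnames from URLs and resolves them concurrently with the standard library. It is not the domain2ip implementation: resolve_all and its resolver parameter are hypothetical, and a fake resolver is injected so the example runs offline.

```python
import socket
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse


def resolve_all(url_list, threads=8, resolver=socket.gethostbyname):
    """Map each URL's hostname to an IP address, or None if lookup fails."""
    # Deduplicate hostnames and skip URLs that have none.
    hosts = sorted({urlparse(url).hostname for url in url_list
                    if urlparse(url).hostname})

    def safe_resolve(host):
        try:
            return resolver(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            return None

    with ThreadPoolExecutor(max_workers=threads) as executor:
        return dict(zip(hosts, executor.map(safe_resolve, hosts)))


# A fake resolver keeps this demo offline; omit it to use real DNS.
fake_dns = {"www.baidu.com": "10.0.0.1", "www.qq.com": "10.0.0.2"}
result = resolve_all(["https://www.baidu.com", "https://www.qq.com"],
                     resolver=fake_dns.get)
print(result)
```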