A collection of utilities for building crawlers and processing documents.
PaperCrawlerUtil
A collection of utilities for crawling papers. This project is a utility package for building crawlers; it contains many tools, each covering one part of the workflow. Here is an example:
from PaperCrawlerUtil.crawler_util import *

basic_config(logs_style="print")
for times in ["2019", "2020", "2021"]:
    # random_proxy_header_access visits a URL and returns the HTML as a string;
    # it can be configured to use a proxy, among other options
    html = random_proxy_header_access("https://openaccess.thecvf.com/CVPR{}".format(times), random_proxy=False)
    # get_attribute_of_html extracts the desired elements from the HTML string.
    # You can pass a dict whose keys are the strings to match and whose values are the matching rules,
    # and you can choose which kinds of elements to extract.
    # By default only <a> tags are extracted.
    attr_list = get_attribute_of_html(html, {'href': IN, 'CVPR': IN, "py": IN, "day": IN})
    for ele in attr_list:
        path = ele.split("<a href=\"")[1].split("\">")[0]
        path = "https://openaccess.thecvf.com/" + path
        # Visit each page again to get the paper links
        html = random_proxy_header_access(path, random_proxy=False)
        # Extract the elements from the page, as above
        attr_list = get_attribute_of_html(html,
                                          {'href': "in", 'CVPR': "in", "content": "in", "papers": "in"})
        for eles in attr_list:
            pdf_path = eles.split("<a href=\"")[1].split("\">")[0]
            # local_path_generate builds an absolute file path; a folder name is required,
            # and if no file name is given the current time is used as the file name
            work_path = local_path_generate("cvpr{}".format(times))
            # retrieve_file downloads the file; it can also be configured to use a proxy
            retrieve_file("https://openaccess.thecvf.com/" + pdf_path, work_path)
# You can also download a paper from Sci-Hub by its DOI, for example:
get_pdf_url_by_doi(doi="xxxx", work_path=local_path_generate("./"))
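If several papers are needed, the same call can simply sit in a loop. A minimal sketch, assuming the DOI strings below are placeholders to be replaced with real ones:

from PaperCrawlerUtil.crawler_util import *

basic_config(logs_style="print")
# Placeholder DOIs; substitute the DOIs of the papers you actually want
dois = ["10.xxxx/placeholder-1", "10.xxxx/placeholder-2"]
for doi in dois:
    # Each paper is saved into the ./papers folder; with no file name given,
    # local_path_generate uses the current time as the file name
    get_pdf_url_by_doi(doi=doi, work_path=local_path_generate("./papers"))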
# This module uses a self-hosted proxy pool; the code comes from https://github.com/Germey/ProxyPool.git
# You can also run such a proxy server locally and switch to it with the following call
basic_config(proxy_pool_url="http://localhost:xxxx")
# Other settings can be replaced in the same way, as shown below.
# Note that the log level can only be configured once; later changes take no effect.
basic_config(log_file_name="1.log",
             log_level=logging.WARNING,
             proxy_pool_url="http://xxx",
             logs_style=LOG_STYLE_LOG)
# As shown below, you can extract information from the PDFs at a given path.
# The path may point to a single PDF or to a folder; the type is detected automatically.
# If it is a folder, every file in it is traversed and one combined string is returned,
# with a configurable separator between the parts.
# The extraction itself is driven by two markers: the field is cut out between a start and an end marker.
title_and_abstract = get_para_from_pdf(path="E:\\git-code\\paper-crawler\\CVPR\\CVPR_2021\\3\\3", ranges=(0, 2))
write_file(path=local_path_generate("E:\\git-code\\paper-crawler\\CVPR\\CVPR_2021\\3\\3", "title_and_abstract.txt"),
           mode="w+", string=title_and_abstract)
from PaperCrawlerUtil.document_util import *
from PaperCrawlerUtil.crawler_util import *
from PaperCrawlerUtil.common_util import *

basic_config(logs_style=LOG_STYLE_PRINT)
# Obtained by registering on the Baidu Translate API platform
appid = "20200316xxxx99558"
secret_key = "BK6xxxxxDGBwaZgr4F"
# Translate text; this can be combined with the previous block to translate the text extracted from PDFs.
# Both Baidu and Google Translate are used, so if Google Translate is enabled a proxy must be provided;
# by default the address http://127.0.0.1:1080 is tried.
text_translate("", appid, secret_key, is_google=True)
You can also install this package directly:
pip install PaperCrawlerUtil
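After installing, a quick way to confirm the package is importable is to run the imports and logging setup used throughout the examples above:

from PaperCrawlerUtil.crawler_util import *

basic_config(logs_style="print")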