A collection of utilities for building crawlers and processing documents.
Project description
PaperCrawlerUtil
A toolkit for crawling papers. This project is a utility package for building crawlers; it contains many small tools, each covering part of the workflow. Here is an example:
from PaperCrawlerUtil.crawler_util import *

basic_config(logs_style="print")
for times in ["2019", "2020", "2021"]:
    # Fetch the CVPR index page for each year through the proxy/header helper.
    html = random_proxy_header_access("https://openaccess.thecvf.com/CVPR{}".format(times), random_proxy=False)
    # Keep only the <a> elements whose markup contains all of these substrings.
    attr_list = get_attribute_of_html(html, {'href': IN, 'CVPR': IN, "py": IN, "day": IN})
    for ele in attr_list:
        # Pull the href target out of the anchor tag and build the day-page URL.
        path = ele.split("<a href=\"")[1].split("\">")[0]
        path = "https://openaccess.thecvf.com/" + path
        html = random_proxy_header_access(path, random_proxy=False)
        # On each day page, keep the links that point at paper content.
        paper_list = get_attribute_of_html(html, {'href': IN, 'CVPR': IN, "content": IN, "papers": IN})
        for paper in paper_list:
            pdf_path = paper.split("<a href=\"")[1].split("\">")[0]
            # Download every PDF into a per-year working directory.
            work_path = local_path_generate("cvpr{}".format(times))
            retrieve_file("https://openaccess.thecvf.com/" + pdf_path, work_path)
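For comparison, the bare download step needs nothing beyond the standard library; judging by their names, the helpers above add proxy rotation, randomized headers, and generated output paths on top of it. A minimal sketch with urllib, using an illustrative URL:

import urllib.request

# Illustrative URL only; real PDF paths come from the crawl above.
pdf_url = "https://openaccess.thecvf.com/example_paper.pdf"
urllib.request.urlretrieve(pdf_url, "example_paper.pdf")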
# Papers can also be downloaded from sci-hub by DOI, as shown below:
get_pdf_url_by_doi(doi="xxxx", work_path=local_path_generate("./"))
# This module uses a self-hosted proxy pool; the pool's code comes from https://github.com/Germey/ProxyPool.git
# You can host the same proxy server locally and switch this package to it with the following call:
basic_config(proxy_pool_url="http://localhost:xxxx")
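If you host the pool from that repository yourself, it serves a random proxy over plain HTTP; the default deployment answers at /random on port 5555, though your port and path may differ. A hedged sketch of querying it directly:

import requests

# Assumes a local ProxyPool (https://github.com/Germey/ProxyPool) is running;
# the endpoint and port depend on your deployment.
POOL_URL = "http://localhost:5555/random"

proxy = requests.get(POOL_URL).text.strip()  # e.g. "127.0.0.1:8080"
proxies = {"http": "http://" + proxy, "https": "http://" + proxy}
print(requests.get("https://openaccess.thecvf.com/", proxies=proxies, timeout=10).status_code)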
# Other settings can be replaced as well, as shown below; note that the log level can only be set once, and later changes take no effect.
basic_config(log_file_name="1.log",
             log_level=logging.WARNING,
             proxy_pool_url="http://xxx",
             logs_style=LOG_STYLE_LOG)
# As shown below, information can be extracted from PDFs at the given path; the path may be a single PDF or a directory, and the type is detected automatically.
# For a directory, every file in it is traversed and one combined string is returned; the separator between entries is configurable.
# Extraction is driven by two markers: the text between a start marker and an end marker is captured.
title_and_abstract = get_para_from_pdf(path="E:\\git-code\\paper-crawler\\CVPR\\CVPR_2021\\3\\3", ranges=(0, 2))
write_file(path=local_path_generate("E:\\git-code\\paper-crawler\\CVPR\\CVPR_2021\\3\\3", "title_and_abstract.txt"),
           mode="w+", string=title_and_abstract)
You can also install the package directly:
pip install PaperCrawlerUtil
Project details
Download files
Source Distribution
PaperCrawlerUtil-0.0.21.tar.gz (11.0 kB)
Built Distribution
PaperCrawlerUtil-0.0.21-py3-none-any.whl (17.3 kB)
File details
Details for the file PaperCrawlerUtil-0.0.21.tar.gz
File metadata
- Download URL: PaperCrawlerUtil-0.0.21.tar.gz
- Upload date:
- Size: 11.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.0 CPython/3.9.7
File hashes
Algorithm | Hash digest
---|---
SHA256 | 68e63c264bfbb4757fc8ed416c5233f0fbbfc8c62267fb66b4b2a975b435357e
MD5 | d9dc5bcfee264c59dd573ddece29aaec
BLAKE2b-256 | de38fcf548c54e4538f912811da100be7c63ac87d3d9efbc1b919715db0cdba3
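A downloaded file can be checked against the digests above; a minimal sketch with hashlib, assuming the sdist sits in the current directory:

import hashlib

# SHA256 digest listed above for PaperCrawlerUtil-0.0.21.tar.gz.
expected = "68e63c264bfbb4757fc8ed416c5233f0fbbfc8c62267fb66b4b2a975b435357e"
with open("PaperCrawlerUtil-0.0.21.tar.gz", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()
print("OK" if digest == expected else "MISMATCH")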
File details
Details for the file PaperCrawlerUtil-0.0.21-py3-none-any.whl
File metadata
- Download URL: PaperCrawlerUtil-0.0.21-py3-none-any.whl
- Upload date:
- Size: 17.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.0 CPython/3.9.7
File hashes
Algorithm | Hash digest
---|---
SHA256 | da2c9f27ddafd5c7a1b43cf5fe7eb283dedc9286361936360671cb0134bef748
MD5 | 095229f045021923f1914115a318690a
BLAKE2b-256 | 440163c52757c7b3ab0a5dc8a0ff1080eac01eb116c96cb633223c85d5f9ab25