A small paper crawler
PaperCrawlerUtil
A collection of utilities for crawling papers. This project is a utility package for building crawlers: it bundles many small tools, each handling one part of the job, such as fetching pages through proxies, extracting attributes from HTML, and downloading files. Here is an example that crawls the CVPR open-access site and downloads the papers from 2019 to 2021:
from PaperCrawlerUtil.util import *

basic_config(style="print")
for times in ["2019", "2020", "2021"]:
    # Fetch the CVPR open-access index page for the given year.
    html = random_proxy_header_access("https://openaccess.thecvf.com/CVPR{}".format(times), random_proxy=False)
    # Keep only the links to the per-day listing pages.
    attr_list = get_attribute_of_html(html, {'href': "in", 'CVPR': "in", "py": "in", "day": "in"})
    for ele in attr_list:
        path = ele.split("<a href=\"")[1].split("\">")[0]
        path = "https://openaccess.thecvf.com/" + path
        html = random_proxy_header_access(path, random_proxy=False)
        # Keep only the links that point to paper PDFs.
        pdf_list = get_attribute_of_html(html,
                                         {'href': "in", 'CVPR': "in", "content": "in", "papers": "in"})
        for eles in pdf_list:
            pdf_path = eles.split("<a href=\"")[1].split("\">")[0]
            work_path = local_path_generate("cvpr{}".format(times))
            retrieve_file("https://openaccess.thecvf.com/" + pdf_path, work_path)
You can also install the package directly:
pip install PaperCrawlerUtil
This module uses a self-hosted proxy pool; the pool code comes from https://github.com/Germey/ProxyPool.git.
You can also run such a proxy server locally yourself and then point the package at your own pool with the following call:
basic_config(proxy_pool_url="http://localhost:xxxx")
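If you host ProxyPool yourself, it is worth confirming that the pool actually serves proxies before wiring it into PaperCrawlerUtil. A minimal sketch, assuming ProxyPool's default port 5555 and its /random endpoint (both taken from the ProxyPool project's defaults, not from this package):

import requests
from PaperCrawlerUtil.util import *

# Assumption: ProxyPool's default address and /random endpoint; adjust to your deployment.
proxy = requests.get("http://localhost:5555/random", timeout=5).text.strip()
print("proxy from pool:", proxy)
# Point PaperCrawlerUtil at the locally hosted pool.
basic_config(proxy_pool_url="http://localhost:5555")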