A small paper crawler
Project description
This project is a utility package for building paper crawlers. It contains many helper functions, each covering one part of the crawling workflow. Here is an example:
from PaperCrawlerUtil import util as u
import os
import time

for times in ["2019", "2020", "2021"]:
    # Create a local directory for each CVPR year.
    if os.path.exists("CVPR_{}".format(times)):
        print("directory already exists")
    else:
        os.makedirs("CVPR_{}".format(times))
    # Fetch the index page for this year (no proxy required here).
    html = u.random_proxy_header_access("https://openaccess.thecvf.com/CVPR{}".format(times), require_proxy=False)
    # Collect anchor elements whose attributes match all the given filters.
    attr_list = u.get_attribute_of_html(html, {'href': "in", 'CVPR': "in", "py": "in", "day": "in"})
    for ele in attr_list:
        # Extract the href value from the anchor tag.
        path = ele.split("<a href=\"")[1].split("\">")[0]
        path = "https://openaccess.thecvf.com/" + path
        html = u.random_proxy_header_access(path)
        pdf_list = u.get_attribute_of_html(html,
                                           {'href': "in", 'CVPR': "in", "content": "in", "papers": "in"})
        for eles in pdf_list:
            pdf_path = eles.split("<a href=\"")[1].split("\">")[0]
            # Save each PDF under CVPR_<year>, named by the current time.
            save_dir = os.path.abspath("CVPR_{}".format(times))
            work_path = os.path.join(save_dir, '{}.pdf'.format(time.strftime("%H_%M_%S", time.localtime())))
            u.retrieve_file("https://openaccess.thecvf.com/" + pdf_path, work_path)
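The example above extracts href values by splitting raw anchor strings, which is brittle. As a minimal sketch of the same extraction step using only the Python standard library (the `HrefCollector` class is a hypothetical helper for illustration, not part of PaperCrawlerUtil):

```python
from html.parser import HTMLParser

class HrefCollector(HTMLParser):
    """Collect href attribute values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

collector = HrefCollector()
collector.feed('<a href="content/CVPR2021/papers/example_paper.pdf">PDF</a>')
print(collector.hrefs)  # ['content/CVPR2021/papers/example_paper.pdf']
```

Parsing the HTML rather than splitting strings avoids `IndexError` when an element does not contain the expected `<a href="...">` pattern.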
Hashes for PaperCrawlerUtil-0.0.12-py3-none-any.whl

Algorithm | Hash digest
---|---
SHA256 | 54d6d999e8f6d4564d10e52da00048aad6b0d3c9a8e3322bdc814d814ba9fc5f
MD5 | 34bbed21c8d8032c87301c5a81e3ec0f
BLAKE2b-256 | 2c5543eafe48d9d65a2c4c600ccb5942b8018869b3e49702323025afdeda943d