A small paper crawler
Project description
This project is a utility package for building paper crawlers. It provides a collection of helper functions, each covering one part of the crawling workflow. Here is an example:
from PaperCrawlerUtil.util import *

for times in ["2019", "2020", "2021"]:
    # Fetch the CVPR index page for each year, without a proxy.
    html = random_proxy_header_access("https://openaccess.thecvf.com/CVPR{}".format(times), require_proxy=False)
    # Collect anchor tags whose text contains every listed substring.
    attr_list = get_attribute_of_html(html, {'href': "in", 'CVPR': "in", "py": "in", "day": "in"})
    for ele in attr_list:
        # Extract the relative link from the raw anchor tag.
        path = ele.split("<a href=\"")[1].split("\">")[0]
        path = "https://openaccess.thecvf.com/" + path
        html = random_proxy_header_access(path)
        paper_list = get_attribute_of_html(html,
                                           {'href': "in", 'CVPR': "in", "content": "in", "papers": "in"})
        for eles in paper_list:
            pdf_path = eles.split("<a href=\"")[1].split("\">")[0]
            # Save each PDF into a per-year working directory.
            work_path = local_path_generate("CVPR_{}".format(times))
            retrieve_file("https://openaccess.thecvf.com/" + pdf_path, work_path)
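The example above leans on two steps that are easy to miss: `get_attribute_of_html` appears to keep only tags containing every key marked `"in"` (an assumption inferred from the call pattern, not a documented contract), and the link is then cut out of the raw anchor tag with two `str.split` calls. A minimal self-contained sketch of both steps, using made-up sample tags:

```python
# Sample anchor tags, standing in for what get_attribute_of_html returns
# (these tags are made up for illustration).
tags = [
    '<a href="content/CVPR2021/papers/example_paper.pdf">pdf</a>',
    '<a href="index.html">home</a>',
]

def keep(tag, conditions):
    # Assumed filter semantics: every key whose value is "in"
    # must occur as a substring of the tag text.
    return all(key in tag for key, op in conditions.items() if op == "in")

hits = [t for t in tags if keep(t, {"href": "in", "CVPR": "in", "papers": "in"})]

# Cut the link out of the raw tag, exactly as the example does.
pdf_path = hits[0].split('<a href="')[1].split('">')[0]
print(pdf_path)  # content/CVPR2021/papers/example_paper.pdf
```

This splitting is brittle against attribute order and whitespace; a real crawler would typically use an HTML parser, but it mirrors what the example code does.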
Hashes for PaperCrawlerUtil-0.0.14-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | e5b6759ea0cca363b7a00240cb3b20e1a6fe1df0cd2ad074048d2c64f36d7482
MD5 | 7dff6a857be79488c472abeb64c0f54f
BLAKE2b-256 | 8294612d0eb6e2cd33b54052cad0f503f26cb856ab9d8f809bc8c33b6158c001