A small paper crawler
Project description
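This project is a utility package for building crawlers. It provides a set of tools, each of which handles one part of the crawling job. The following example crawls the CVPR 2019, 2020, and 2021 pages on openaccess.thecvf.com and downloads the linked paper files: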
from PaperCrawlerUtil.util import *

for times in ["2019", "2020", "2021"]:
    # Fetch the CVPR index page for this year without using a proxy.
    html = random_proxy_header_access("https://openaccess.thecvf.com/CVPR{}".format(times), require_proxy=False)
    # Select the link elements that point to the per-day listing pages.
    attr_list = get_attribute_of_html(html, {'href': "in", 'CVPR': "in", "py": "in", "day": "in"})
    for ele in attr_list:
        # Cut the href value out of the element and make it an absolute URL.
        path = ele.split("<a href=\"")[1].split("\">")[0]
        path = "https://openaccess.thecvf.com/" + path
        html = random_proxy_header_access(path)
        # Select the link elements that point to the paper files.
        attr_list = get_attribute_of_html(html,
                                          {'href': "in", 'CVPR': "in", "content": "in", "papers": "in"})
        for eles in attr_list:
            pdf_path = eles.split("<a href=\"")[1].split("\">")[0]
            # Generate a local path for this year's files, then download the file to it.
            work_path = local_path_generate("CVPR_{}".format(times))
            retrieve_file("https://openaccess.thecvf.com/" + pdf_path, work_path)
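The `split` calls in the example assume that every selected element is an HTML string containing exactly `<a href="...">`; any other markup raises an `IndexError`. The following is a minimal sketch of a more defensive way to read the link, using only the Python standard library. It is not part of PaperCrawlerUtil: `extract_href` and `_FirstHrefParser` are hypothetical names introduced here for illustration.

```python
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urljoin


class _FirstHrefParser(HTMLParser):
    """Remember the href of the first <a> tag that is encountered."""

    def __init__(self) -> None:
        super().__init__()
        self.href: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "a" and self.href is None:
            self.href = dict(attrs).get("href")


def extract_href(element: str,
                 base_url: str = "https://openaccess.thecvf.com/") -> Optional[str]:
    """Return the absolute URL of the first link in `element`, or None.

    Hypothetical helper: assumes `element` is an HTML fragment such as
    '<a href="...">text</a>', as the split-based code above implies.
    """
    parser = _FirstHrefParser()
    parser.feed(element)
    parser.close()
    if parser.href is None:
        return None
    # urljoin copes with relative paths that do or do not start with "/".
    return urljoin(base_url, parser.href)
```

With a helper like this, an element without a link yields `None` and can be skipped instead of stopping the crawl.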
Download files
Source Distribution
Built Distribution
File details
Details for the file PaperCrawlerUtil-0.0.14.tar.gz.
File metadata
- Download URL: PaperCrawlerUtil-0.0.14.tar.gz
- Upload date:
- Size: 5.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.0 CPython/3.9.7
File hashes
Algorithm | Hash digest
---|---
SHA256 | 8168a34e7703fce2d1dda1956575162e1329b87ebbb1469e0327c72fb33f9990
MD5 | b4bc94e0315ff3a74ee39a7db781c2f1
BLAKE2b-256 | aa17326789ed6aec5b1901470b2a6532f7d373b8862123c7424a4464977f3dd6
File details
Details for the file PaperCrawlerUtil-0.0.14-py3-none-any.whl.
File metadata
- Download URL: PaperCrawlerUtil-0.0.14-py3-none-any.whl
- Upload date:
- Size: 6.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.0 CPython/3.9.7
File hashes
Algorithm | Hash digest
---|---
SHA256 | e5b6759ea0cca363b7a00240cb3b20e1a6fe1df0cd2ad074048d2c64f36d7482
MD5 | 7dff6a857be79488c472abeb64c0f54f
BLAKE2b-256 | 8294612d0eb6e2cd33b54052cad0f503f26cb856ab9d8f809bc8c33b6158c001
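The SHA256 digests listed for the two files can be used to check a download. Here is a minimal sketch using the standard `hashlib` module; the helper names `sha256_of_file` and `matches_published_hash` are hypothetical, and the sketch assumes the downloaded file keeps its published file name.

```python
import hashlib
from pathlib import Path

# SHA256 digests copied from the file details listed above.
EXPECTED_SHA256 = {
    "PaperCrawlerUtil-0.0.14.tar.gz":
        "8168a34e7703fce2d1dda1956575162e1329b87ebbb1469e0327c72fb33f9990",
    "PaperCrawlerUtil-0.0.14-py3-none-any.whl":
        "e5b6759ea0cca363b7a00240cb3b20e1a6fe1df0cd2ad074048d2c64f36d7482",
}


def sha256_of_file(path, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path) -> bool:
    """Return True only if the file name is known and its digest matches."""
    expected = EXPECTED_SHA256.get(Path(path).name)
    return expected is not None and sha256_of_file(path) == expected
```

For example, `matches_published_hash("PaperCrawlerUtil-0.0.14.tar.gz")` would return `True` only when the local file has the digest published for that file name.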