A small paper crawler

Project description

This project is a utility package for building crawlers. It contains a number of tools, each of which handles one part of a crawl. Here is an example that downloads the CVPR papers from 2019 to 2021:

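from PaperCrawlerUtil.util import *

BASE_URL = "https://openaccess.thecvf.com/"

for year in ["2019", "2020", "2021"]:
    # Fetch the CVPR index page for this year without going through a proxy.
    index_html = random_proxy_header_access(BASE_URL + "CVPR{}".format(year), require_proxy=False)
    # Select the elements containing all of "href", "CVPR", "py" and "day" (the per-day listing links).
    day_links = get_attribute_of_html(index_html, {'href': "in", 'CVPR': "in", "py": "in", "day": "in"})
    for day_link in day_links:
        # Cut the href value out of the <a href="..."> tag and make it absolute.
        day_url = BASE_URL + day_link.split("<a href=\"")[1].split("\">")[0]
        day_html = random_proxy_header_access(day_url)
        # Select the elements containing all of "href", "CVPR", "content" and "papers" (the PDF links).
        paper_links = get_attribute_of_html(day_html,
                                            {'href': "in", 'CVPR': "in", "content": "in", "papers": "in"})
        for paper_link in paper_links:
            pdf_path = paper_link.split("<a href=\"")[1].split("\">")[0]
            # Generate a local path for this year's downloads and save the PDF there.
            work_path = local_path_generate("CVPR_{}".format(year))
            retrieve_file(BASE_URL + pdf_path, work_path)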
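
The example pulls each link out of an element string by splitting on `<a href="` and `">`. The sketch below isolates that step using only the Python standard library, and shows a parser-based variant that may be more tolerant when an anchor tag carries extra attributes. The function names, the `HrefCollector` class, and the sample element are hypothetical and are not part of PaperCrawlerUtil.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin

BASE_URL = "https://openaccess.thecvf.com/"


def extract_href_by_split(element: str) -> str:
    """Same string split as the example: text between '<a href="' and '">'."""
    return element.split('<a href="')[1].split('">')[0]


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def extract_hrefs(element: str) -> List[str]:
    """Parser-based variant: works regardless of attribute order."""
    collector = HrefCollector()
    collector.feed(element)
    collector.close()
    return collector.hrefs


# Hypothetical element, shaped like the strings the example splits.
sample = '<a href="content_CVPR_2019/papers/Example_Paper.pdf">pdf</a>'
print(extract_href_by_split(sample))
# urljoin builds the absolute URL from the base and the relative path.
print(urljoin(BASE_URL, extract_hrefs(sample)[0]))
```

The split approach assumes `href` is the only attribute of the tag and is immediately followed by `">`; if a page adds a `class` or `target` attribute, the split returns extra text, while the parser-based variant still returns only the link.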

Download files

Source Distribution

PaperCrawlerUtil-0.0.14.tar.gz (5.8 kB)

Uploaded Source

Built Distribution

PaperCrawlerUtil-0.0.14-py3-none-any.whl (6.0 kB)

Uploaded Python 3

File details

Details for the file PaperCrawlerUtil-0.0.14.tar.gz.

File metadata

  • Download URL: PaperCrawlerUtil-0.0.14.tar.gz
  • Upload date:
  • Size: 5.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.0 CPython/3.9.7

File hashes

Hashes for PaperCrawlerUtil-0.0.14.tar.gz
Algorithm    Hash digest
SHA256       8168a34e7703fce2d1dda1956575162e1329b87ebbb1469e0327c72fb33f9990
MD5          b4bc94e0315ff3a74ee39a7db781c2f1
BLAKE2b-256  aa17326789ed6aec5b1901470b2a6532f7d373b8862123c7424a4464977f3dd6

File details

Details for the file PaperCrawlerUtil-0.0.14-py3-none-any.whl.

File hashes

Hashes for PaperCrawlerUtil-0.0.14-py3-none-any.whl
Algorithm    Hash digest
SHA256       e5b6759ea0cca363b7a00240cb3b20e1a6fe1df0cd2ad074048d2c64f36d7482
MD5          7dff6a857be79488c472abeb64c0f54f
BLAKE2b-256  8294612d0eb6e2cd33b54052cad0f503f26cb856ab9d8f809bc8c33b6158c001
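
The SHA256 digests listed for each file can be checked locally after a download. The following is a minimal sketch using Python's standard `hashlib` module; the helper names are hypothetical, and it assumes the wheel has been saved in the current directory under its published file name.

```python
import hashlib
from pathlib import Path

# SHA256 digest published on this page for the wheel file.
EXPECTED_WHEEL_SHA256 = "e5b6759ea0cca363b7a00240cb3b20e1a6fe1df0cd2ad074048d2c64f36d7482"


def sha256_of_file(path, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    # Assumed location of the downloaded wheel; adjust as needed.
    wheel = Path("PaperCrawlerUtil-0.0.14-py3-none-any.whl")
    if wheel.exists():
        print(matches_published_digest(wheel, EXPECTED_WHEEL_SHA256))
    else:
        print("wheel not found; download it first")
```

A mismatch means the local file differs from the one that was uploaded, for example because the download was truncated.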
