A small paper crawler

Project description

PaperCrawlerUtil

A set of tools for crawling academic papers. This project is a utility package for building crawlers: it bundles many small helpers, each handling one step of the job, such as proxied page fetching, HTML attribute extraction, and file downloading. Here is an example that downloads CVPR papers from the Open Access archive:

from PaperCrawlerUtil.util import *


basic_config(style="print")
for year in ["2019", "2020", "2021"]:
    # Fetch the CVPR index page for the year (random headers, no proxy).
    html = random_proxy_header_access("https://openaccess.thecvf.com/CVPR{}".format(year), random_proxy=False)
    # Keep the <a> tags that point at the per-day listing pages.
    day_links = get_attribute_of_html(html, {'href': "in", 'CVPR': "in", "py": "in", "day": "in"})
    for day_link in day_links:
        # Pull the href target out of the raw <a> tag.
        path = day_link.split("<a href=\"")[1].split("\">")[0]
        path = "https://openaccess.thecvf.com/" + path
        day_html = random_proxy_header_access(path, random_proxy=False)
        # On each day page, keep the links to the individual paper pages.
        paper_links = get_attribute_of_html(day_html,
                                            {'href': "in", 'CVPR': "in", "content": "in", "papers": "in"})
        for paper_link in paper_links:
            pdf_path = paper_link.split("<a href=\"")[1].split("\">")[0]
            # Download each PDF into a generated local directory such as cvpr2019.
            work_path = local_path_generate("cvpr{}".format(year))
            retrieve_file("https://openaccess.thecvf.com/" + pdf_path, work_path)

You can also install this package directly:
pip install PaperCrawlerUtil
This module uses a self-hosted proxy pool; the pool code comes from https://github.com/Germey/ProxyPool.git. You can also deploy such a proxy server locally yourself and then switch the package to it with the following call:
basic_config(proxy_pool_url="http://localhost:xxxx")
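As a minimal sketch of the full round trip, assuming ProxyPool's usual default API port 5555 and assuming that passing random_proxy=True makes the request draw an address from the configured pool (both are assumptions, not documented behavior):

from PaperCrawlerUtil.util import *

# Hypothetical local deployment: 5555 is ProxyPool's usual default API port.
basic_config(proxy_pool_url="http://localhost:5555")
# Assumed: random_proxy=True routes the request through a pooled proxy.
html = random_proxy_header_access("https://openaccess.thecvf.com/CVPR2021",
                                  random_proxy=True)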

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

PaperCrawlerUtil-0.0.17.tar.gz (6.3 kB)

Built Distribution

PaperCrawlerUtil-0.0.17-py3-none-any.whl (6.6 kB)

File details

Details for the file PaperCrawlerUtil-0.0.17.tar.gz.

File metadata

  • Download URL: PaperCrawlerUtil-0.0.17.tar.gz
  • Size: 6.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.0 CPython/3.9.7

File hashes

Hashes for PaperCrawlerUtil-0.0.17.tar.gz
Algorithm Hash digest
SHA256 0b39e4ffb44859196eccd651ff2f062e83997a12a374406f4946d8e375f29e59
MD5 5220e6896efc0db12a2bf75969c93f79
BLAKE2b-256 d7de1e20c2a811d14ebb893691f02656736487850830ed8ce4925f038e89027e

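To verify a download yourself, here is a minimal sketch using Python's standard hashlib, assuming the sdist file sits in the current directory:

import hashlib

# Compare the local file's SHA256 against the digest published above.
expected = "0b39e4ffb44859196eccd651ff2f062e83997a12a374406f4946d8e375f29e59"
with open("PaperCrawlerUtil-0.0.17.tar.gz", "rb") as f:
    actual = hashlib.sha256(f.read()).hexdigest()
print("OK" if actual == expected else "hash mismatch")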

File details

Details for the file PaperCrawlerUtil-0.0.17-py3-none-any.whl.

File hashes

Hashes for PaperCrawlerUtil-0.0.17-py3-none-any.whl
Algorithm Hash digest
SHA256 2b16f8f0628c20e3d43f163db951ce1ca071740e9fb964c2e0158ed88a17e553
MD5 587fc31a8ecae021e5a28f66ce53a92a
BLAKE2b-256 be6b81a686138f3f36ca9b2042de27cf29c0adf1d2aa1ab6e6888067500b9d25

