A small paper crawler
PaperCrawlerUtil
A collection of utilities for crawling papers. This project is a utility package for building crawlers: it bundles many small tools, each handling one part of the job, such as fetching pages through proxies, extracting attributes from HTML, and downloading files. Here is an example that crawls the CVPR open-access site and downloads the papers from 2019 to 2021:
from PaperCrawlerUtil.util import *

basic_config(style="print")
for times in ["2019", "2020", "2021"]:
    # Fetch the CVPR open-access index page for the given year.
    html = random_proxy_header_access("https://openaccess.thecvf.com/CVPR{}".format(times), random_proxy=False)
    # Keep only the links to the per-day listing pages.
    attr_list = get_attribute_of_html(html, {'href': "in", 'CVPR': "in", "py": "in", "day": "in"})
    for ele in attr_list:
        path = ele.split("<a href=\"")[1].split("\">")[0]
        path = "https://openaccess.thecvf.com/" + path
        html = random_proxy_header_access(path, random_proxy=False)
        # Keep only the links that point to paper PDFs.
        pdf_list = get_attribute_of_html(html,
                                         {'href': "in", 'CVPR': "in", "content": "in", "papers": "in"})
        for eles in pdf_list:
            pdf_path = eles.split("<a href=\"")[1].split("\">")[0]
            work_path = local_path_generate("cvpr{}".format(times))
            retrieve_file("https://openaccess.thecvf.com/" + pdf_path, work_path)
You can also install the package directly:
pip install PaperCrawlerUtil
This module uses a self-hosted proxy pool; the pool code comes from https://github.com/Germey/ProxyPool.git.
You can also run such a proxy server locally yourself and then point the package at your own pool with the following call:
basic_config(proxy_pool_url="http://localhost:xxxx")
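If you host ProxyPool yourself, it is worth confirming that the pool actually serves proxies before wiring it into PaperCrawlerUtil. A minimal sketch, assuming ProxyPool's default port 5555 and its /random endpoint (both taken from the ProxyPool project's defaults, not from this package):

import requests
from PaperCrawlerUtil.util import *

# Assumption: ProxyPool's default address and /random endpoint; adjust to your deployment.
proxy = requests.get("http://localhost:5555/random", timeout=5).text.strip()
print("proxy from pool:", proxy)
# Point PaperCrawlerUtil at the locally hosted pool.
basic_config(proxy_pool_url="http://localhost:5555")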