
A small paper crawler

Project description

PaperCrawlerUtil

A collection of utilities for crawling papers. This project is a utility package for building a crawler; it contains a number of tools, each handling part of the job. Here is an example:

from PaperCrawlerUtil.util import *


basic_config(style="print")
# Crawl the CVPR open-access index for each of the three years.
for times in ["2019", "2020", "2021"]:
    html = random_proxy_header_access("https://openaccess.thecvf.com/CVPR{}".format(times), random_proxy=False)
    # Collect the anchors that point to the per-day listing pages.
    day_links = get_attribute_of_html(html, {'href': "in", 'CVPR': "in", "py": "in", "day": "in"})
    for ele in day_links:
        path = ele.split("<a href=\"")[1].split("\">")[0]
        path = "https://openaccess.thecvf.com/" + path
        day_html = random_proxy_header_access(path, random_proxy=False)
        # On each listing page, collect the anchors that point to the paper PDFs.
        pdf_links = get_attribute_of_html(day_html,
                                          {'href': "in", 'CVPR': "in", "content": "in", "papers": "in"})
        for eles in pdf_links:
            pdf_path = eles.split("<a href=\"")[1].split("\">")[0]
            work_path = local_path_generate("cvpr{}".format(times))
            # Download each PDF into a per-year local directory.
            retrieve_file("https://openaccess.thecvf.com/" + pdf_path, work_path)
This module uses a self-hosted proxy pool; its code comes from https://github.com/Germey/ProxyPool.git. You can also run such a proxy server locally yourself and switch the crawler to it with the following call:

basic_config(proxy_pool_url="http://localhost:xxxx")
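
As a quick sanity check before pointing the crawler at your own pool, you can fetch one proxy from it directly. This is a minimal sketch, assuming the pool is the upstream ProxyPool project running locally and exposing its usual /random endpoint; the port 5555 is only a placeholder for whatever port you configured.

import requests

# Assumed deployment: a local ProxyPool instance; 5555 is a placeholder port.
POOL_URL = "http://localhost:5555"

# The upstream ProxyPool serves one usable proxy as plain text from /random,
# e.g. "127.0.0.1:8888". Any successful response means the pool is reachable.
proxy = requests.get(POOL_URL + "/random", timeout=5).text.strip()
print("proxy pool is up, sample proxy:", proxy)

# If that works, hand the same base URL to PaperCrawlerUtil:
# basic_config(proxy_pool_url=POOL_URL)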

Other settings can be changed as well, as shown below. Note that the log level can only be configured once; later changes to it do not take effect.

basic_config(log_file_name="1.log",
                 log_level=logging.WARNING,
                 proxy_pool_url="http://xxx",
                 logs_style=LOG_STYLE_LOG)
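
Because the log level is only honoured on the first call, a later reconfiguration will not change it. Below is a minimal sketch of that behaviour, assuming basic_config may be called again with only the parameters you want to change (the WARNING/DEBUG values are purely illustrative):

import logging

# First call: the logger is set up at WARNING level.
basic_config(log_file_name="1.log",
             log_level=logging.WARNING,
             logs_style=LOG_STYLE_LOG)

# Second call: per the note above, the new DEBUG level is ignored;
# the level from the first configuration stays in effect.
basic_config(log_level=logging.DEBUG)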
As shown below, you can extract information from the PDFs at a given path. The path may point to a single PDF file or to a directory; which one it is gets detected automatically. If it is a directory, every file in it is traversed and one combined string is returned, using a separator of your choice. The extraction itself relies on two markers: the wanted field is cut out between a start marker and an end marker.

title_and_abstract = get_para_from_pdf(path="E:\\git-code\\paper-crawler\\CVPR\\CVPR_2021\\3\\3", ranges=(0, 2))
write_file(path=local_path_generate("E:\\git-code\\paper-crawler\\CVPR\\CVPR_2021\\3\\3", "title_and_abstract.txt"),
               mode="w+", string=title_and_abstract)
You can also install this package directly:
pip install PaperCrawlerUtil



Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

PaperCrawlerUtil-0.0.18.tar.gz (8.4 kB)

Uploaded Source

Built Distribution

PaperCrawlerUtil-0.0.18-py3-none-any.whl (8.9 kB)

Uploaded Python 3

File details

Details for the file PaperCrawlerUtil-0.0.18.tar.gz.

File metadata

  • Download URL: PaperCrawlerUtil-0.0.18.tar.gz
  • Upload date:
  • Size: 8.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.0 CPython/3.9.7

File hashes

Hashes for PaperCrawlerUtil-0.0.18.tar.gz

  • SHA256: a8f588a1356b0076a735f1c08114ab909af819e8a224ac16720bd4216376debe
  • MD5: 87bf30139622c915af47202b8c204a3f
  • BLAKE2b-256: 2bd7838909b6526f79867497758ef9894a5a5a67541e5d7775c261ab8be58eb9


File details

Details for the file PaperCrawlerUtil-0.0.18-py3-none-any.whl.

File metadata

File hashes

Hashes for PaperCrawlerUtil-0.0.18-py3-none-any.whl

  • SHA256: d6390db44d1cd1356a79ac15916a67acf178bfe5fa9e81dff41937e2677a3755
  • MD5: f2baa493d2a22d0a1d48256175d23079
  • BLAKE2b-256: 436d0ea75d623078ac43d25ea135e87bd210660643e9f10ebc321c2e7f85c468

