
A collection of utilities for building crawlers and processing documents

Project description

PaperCrawlerUtil

A set of tools for crawling academic papers. This project is a utility package for building crawlers; it bundles many small tools, each handling one part of the job. Here is an example:

import logging

from PaperCrawlerUtil.crawler_util import *


basic_config(logs_style="print")
for times in ["2019", "2020", "2021"]:
    # Fetch the CVPR open-access index page for the given year (random_proxy=False disables the proxy pool).
    html = random_proxy_header_access("https://openaccess.thecvf.com/CVPR{}".format(times), random_proxy=False)
    # Pick out the <a href=...> elements that point to the per-day paper listings.
    attr_list = get_attribute_of_html(html, {'href': IN, 'CVPR': IN, "py": IN, "day": IN})
    for ele in attr_list:
        path = ele.split("<a href=\"")[1].split("\">")[0]
        path = "https://openaccess.thecvf.com/" + path
        html = random_proxy_header_access(path, random_proxy=False)
        # Pick out the links to the individual paper pages.
        paper_list = get_attribute_of_html(html,
                                           {'href': "in", 'CVPR': "in", "content": "in", "papers": "in"})
        for eles in paper_list:
            pdf_path = eles.split("<a href=\"")[1].split("\">")[0]
            # Download each PDF into a local folder named after the year.
            work_path = local_path_generate("cvpr{}".format(times))
            retrieve_file("https://openaccess.thecvf.com/" + pdf_path, work_path)
# A paper can also be downloaded from Sci-Hub by its DOI, for example:
get_pdf_url_by_doi(doi="xxxx", work_path=local_path_generate("./"))
# This module uses a self-hosted proxy pool; the code comes from https://github.com/Germey/ProxyPool.git
# You can also run such a proxy server locally and switch to it as follows:
basic_config(proxy_pool_url="http://localhost:xxxx")
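# For reference, such a pool is usually queried over HTTP: the crawler asks the pool
# service for one proxy address per request and routes traffic through it. A minimal
# sketch (the "random" endpoint is an assumption based on the linked ProxyPool project,
# not part of this package):
#   import requests
#   proxy = requests.get("http://localhost:xxxx/random").text.strip()
#   proxies = {"http": "http://" + proxy, "https": "http://" + proxy}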

# Other settings can be replaced in the same way, as shown below. Note that the log level
# can only be configured once; changing it again later has no effect.
basic_config(log_file_name="1.log",
             log_level=logging.WARNING,
             proxy_pool_url="http://xxx",
             logs_style=LOG_STYLE_LOG)
# As shown below, information can be extracted from the PDFs at a given path. The path may
# point to a single PDF or to a directory; which one it is gets detected automatically.
# For a directory, every file inside is traversed and one combined string is returned; the
# separator between entries can be chosen freely.
# Extraction works with two markers: the text between a start marker and an end marker is cut out.
title_and_abstract = get_para_from_pdf(path="E:\\git-code\\paper-crawler\\CVPR\\CVPR_2021\\3\\3", ranges=(0, 2))
write_file(path=local_path_generate("E:\\git-code\\paper-crawler\\CVPR\\CVPR_2021\\3\\3", "title_and_abstract.txt"),
           mode="w+", string=title_and_abstract)
You can also install this package directly:
pip install PaperCrawlerUtil



Download files

Download the file for your platform.

Source Distribution

PaperCrawlerUtil-0.0.21.tar.gz (11.0 kB, Source)

Built Distribution

PaperCrawlerUtil-0.0.21-py3-none-any.whl (17.3 kB, Python 3)

File details

Details for the file PaperCrawlerUtil-0.0.21.tar.gz.

File metadata

  • Download URL: PaperCrawlerUtil-0.0.21.tar.gz
  • Upload date:
  • Size: 11.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.0 CPython/3.9.7

File hashes

Hashes for PaperCrawlerUtil-0.0.21.tar.gz:

  • SHA256: 68e63c264bfbb4757fc8ed416c5233f0fbbfc8c62267fb66b4b2a975b435357e
  • MD5: d9dc5bcfee264c59dd573ddece29aaec
  • BLAKE2b-256: de38fcf548c54e4538f912811da100be7c63ac87d3d9efbc1b919715db0cdba3

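If you want to check a downloaded archive against the SHA256 digest above, a few lines of Python are enough. This is a minimal sketch and assumes the archive was saved in the current directory under its original file name:

import hashlib

EXPECTED_SHA256 = "68e63c264bfbb4757fc8ed416c5233f0fbbfc8c62267fb66b4b2a975b435357e"

with open("PaperCrawlerUtil-0.0.21.tar.gz", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()

print("hash matches:", digest == EXPECTED_SHA256)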

File details

Details for the file PaperCrawlerUtil-0.0.21-py3-none-any.whl.


File hashes

Hashes for PaperCrawlerUtil-0.0.21-py3-none-any.whl:

  • SHA256: da2c9f27ddafd5c7a1b43cf5fe7eb283dedc9286361936360671cb0134bef748
  • MD5: 095229f045021923f1914115a318690a
  • BLAKE2b-256: 440163c52757c7b3ab0a5dc8a0ff1080eac01eb116c96cb633223c85d5f9ab25

