
A collection of utilities for building crawlers and processing documents

Project description

PaperCrawlerUtil

A set of tools for crawling papers. This project is a utility package for building crawlers; it contains many tools, each covering part of the workflow. Here is an example:

import logging

from PaperCrawlerUtil.crawler_util import *


basic_config(logs_style="print")
for times in ["2019", "2020", "2021"]:
    # random_proxy_header_access fetches a URL and returns the HTML as a string; whether to use a proxy (and other options) can be configured
    html = random_proxy_header_access("https://openaccess.thecvf.com/CVPR{}".format(times), random_proxy=False)
    # get_attribute_of_html extracts the required elements from an HTML string. It takes a dict whose keys are the strings to match
    # and whose values are the matching rules; you can also choose which kinds of elements to collect.
    # By default only <a> tags are collected.
    attr_list = get_attribute_of_html(html, {'href': IN, 'CVPR': IN, "py": IN, "day": IN})
    for ele in attr_list:
        path = ele.split("<a href=\"")[1].split("\">")[0]
        path = "https://openaccess.thecvf.com/" + path
        # Visit each result page to get the paper's URL
        html = random_proxy_header_access(path, random_proxy=False)
        # Extract the page elements, as above
        attr_list = get_attribute_of_html(html,
                                          {'href': "in", 'CVPR': "in", "content": "in", "papers": "in"})
        for eles in attr_list:
            pdf_path = eles.split("<a href=\"")[1].split("\">")[0]
            # local_path_generate builds an absolute file path; a folder name is required,
            # and if no file name is given, the current time is used as the file name
            work_path = local_path_generate("cvpr{}".format(times))
            # retrieve_file downloads a file; whether to use a proxy (and other options) can be configured
            retrieve_file("https://openaccess.thecvf.com/" + pdf_path, work_path)
# Papers can also be downloaded from sci-hub by DOI, for example:
get_pdf_url_by_doi(doi="xxxx", work_path=local_path_generate("./"))
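# Likewise, a minimal sketch (the DOI strings and the folder name "sci_hub_papers" below are
# placeholders, just like the "xxxx" above) for fetching several papers by DOI into one folder:
for doi in ["xxxx1", "xxxx2", "xxxx3"]:
    get_pdf_url_by_doi(doi=doi, work_path=local_path_generate("sci_hub_papers"))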
# This module uses a self-hosted proxy pool; the code comes from https://github.com/Germey/ProxyPool.git
# You can also run such a proxy server locally and then switch the proxy pool with the following code
basic_config(proxy_pool_url="http://localhost:xxxx")

# Other settings can be changed as well, as shown below. Note that the log level can only be configured once; later changes will not take effect
basic_config(log_file_name="1.log",
             log_level=logging.WARNING,
             proxy_pool_url="http://xxx",
             logs_style=LOG_STYLE_LOG)
# As shown below, information can be extracted from the PDFs at a path. The path can be a single PDF or a directory; this is detected automatically.
# If it is a directory, every file in it is traversed and a single combined string is returned; the separator can be customized.
# The information is extracted via two markers, i.e. the field is cut out between a start marker and an end marker
title_and_abstract = get_para_from_pdf(path="E:\\git-code\\paper-crawler\\CVPR\\CVPR_2021\\3\\3", ranges=(0, 2))
write_file(path=local_path_generate("E:\\git-code\\paper-crawler\\CVPR\\CVPR_2021\\3\\3", "title_and_abstract.txt"),
           mode="w+", string=title_and_abstract)
from PaperCrawlerUtil.document_util import *
from PaperCrawlerUtil.crawler_util import *
from PaperCrawlerUtil.common_util import *

basic_config(logs_style=LOG_STYLE_PRINT)
# Obtained by registering on the Baidu Translate API platform
appid = "20200316xxxx99558"
secret_key = "BK6xxxxxDGBwaZgr4F"
# Translates text; this can be combined with the previous snippet to translate the text extracted from PDFs. Note that both Baidu
# and Google Translate are used, so when Google Translate is used a proxy must be provided; by default the address http://127.0.0.1:1080 is tried
text_translate("", appid, secret_key, is_google=True)
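# A minimal sketch of the combination mentioned above (not part of the original example):
# extract the text from the PDFs with get_para_from_pdf and pass it to text_translate.
# Assumptions: is_google=False falls back to the Baidu API (so no proxy is needed), and
# text_translate returns the translated string.
pdf_text = get_para_from_pdf(path="E:\\git-code\\paper-crawler\\CVPR\\CVPR_2021\\3\\3", ranges=(0, 2))
translated = text_translate(pdf_text, appid, secret_key, is_google=False)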
You can also install this package directly:
pip install PaperCrawlerUtil

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

PaperCrawlerUtil-0.0.23.tar.gz (11.9 kB)

Uploaded Source

Built Distribution

PaperCrawlerUtil-0.0.23-py3-none-any.whl (17.8 kB)

Uploaded Python 3

File details

Details for the file PaperCrawlerUtil-0.0.23.tar.gz.

File metadata

  • Download URL: PaperCrawlerUtil-0.0.23.tar.gz
  • Upload date:
  • Size: 11.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.0 CPython/3.9.7

File hashes

Hashes for PaperCrawlerUtil-0.0.23.tar.gz
  • SHA256: f76c26d4742bb7f28226a6133ed11cf00fcf3d70509e6bd9220d1cb39b699a32
  • MD5: 06f0d764dffa6707c73f5890be29895d
  • BLAKE2b-256: ff17a11ec1ab1a5df329683087a9e1b7d2c1288f10d9957cd2f64d85be293f36


File details

Details for the file PaperCrawlerUtil-0.0.23-py3-none-any.whl.

File metadata

File hashes

Hashes for PaperCrawlerUtil-0.0.23-py3-none-any.whl
  • SHA256: fddf0931409d2365b7e6e3f5c63f96dbadb969a5ef643b5c047ff18ed93d0613
  • MD5: 7f1bac9dc3c6148e73e4ff20c7421a90
  • BLAKE2b-256: bc118014afe81b304aacc770f18bf6aceb06ed8661d1aa4242866fe502f5f02a

