A package for crawling Markdown-formatted articles from a given webpage and storing them locally.
Article Crawler
✨ Introduction
Article Crawler is a package that crawls Markdown-formatted articles from a specified webpage and stores them locally in HTML / Markdown formats.
🚀 Quick Start
- Install through pip:

      pip install article-crawler
- Usage

      Usage: python3 -m article_crawler -u [url] -t [type] -o [output_folder] -c [class_] -i [id]

      Options:
        --version             show program's version number and exit
        -h, --help            show this help message and exit
        -u URL, --url=URL     crawled url (required)
        -t TYPE, --type=TYPE  crawled article type [csdn] | [juejin] | [zhihu] | [jianshu]
        -o OUTPUT_FOLDER, --output_folder=OUTPUT_FOLDER
                              output html / markdown / pdf folder (required)
        -w WEBSITE_TAG, --website_tag=WEBSITE_TAG
                              position of the article content in HTML (not required if 'type' is specified)
        -c CLASS_, --class=CLASS_
                              position of the article content in HTML (not required if 'type' is specified)
        -i ID, --id=ID        position of the article content in HTML (not required if 'type' is specified)
- type: Specific websites; currently supported are CSDN, Zhihu, Juejin, and Jianshu.
- website_tag / class_ / id:

  e.g.

      <div id="article_content" class="article_content clearfix"></div>

  - In this element, website_tag, class_, and id are div, article_content clearfix, and article_content, respectively.
  - You don't need to specify type when you specify website_tag / class_ / id.
  - You need to use the web console to locate the position of the article.

  website_tag / class_ / id are used to locate the position of the article in the HTML. It is possible to use only one or two of them instead of all three.
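To illustrate how a tag, a class, and an id can each narrow down one element, here is a minimal sketch using only Python's standard `html.parser`. It is not the package's actual implementation, which this page does not show. The names `ElementFinder` and `find_element` are hypothetical, the parameter `id_` stands in for the `id` option, and treating `class_` as a set of class names that must all be present is an assumption.

```python
from html.parser import HTMLParser


class ElementFinder(HTMLParser):
    """Record the first start tag that satisfies every criterion that was given."""

    def __init__(self, website_tag=None, class_=None, id_=None):
        super().__init__()
        self.wanted_tag = website_tag.lower() if website_tag else None
        # Assumption: every class name listed in class_ must appear on the element.
        self.wanted_classes = set(class_.split()) if class_ else set()
        self.wanted_id = id_
        self.match = None  # (tag, attribute dict) of the first matching element

    def handle_starttag(self, tag, attrs):
        if self.match is not None:
            return  # keep only the first match
        attrs = dict(attrs)
        if self.wanted_tag and tag != self.wanted_tag:
            return
        if self.wanted_id and attrs.get("id") != self.wanted_id:
            return
        element_classes = set((attrs.get("class") or "").split())
        if not self.wanted_classes <= element_classes:
            return
        self.match = (tag, attrs)


def find_element(html, website_tag=None, class_=None, id_=None):
    """Return (tag, attrs) for the first element matching the given criteria, or None."""
    finder = ElementFinder(website_tag, class_, id_)
    finder.feed(html)
    return finder.match


# A made-up page containing the example element from the text above.
page = (
    '<html><body><div id="sidebar"></div>'
    '<div id="article_content" class="article_content clearfix"></div>'
    "</body></html>"
)

# All three criteria, or only one of them, can point at the same element.
print(find_element(page, website_tag="div", class_="article_content clearfix", id_="article_content"))
print(find_element(page, id_="article_content"))
```

Giving only `website_tag="div"` would match the sidebar element first in this made-up page, which is why a class or id found through the web console is usually the more reliable locator.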
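The documented command line can also be assembled from a script. The sketch below builds the argument list using only the flags listed under Options; `build_command` is a hypothetical helper, and the URL and output folder are placeholders rather than values from the project.

```python
def build_command(url, output_folder, type_=None, website_tag=None, class_=None, id_=None):
    """Assemble the argv list for `python3 -m article_crawler` from the documented options."""
    # -u and -o are marked as required in the help text.
    command = ["python3", "-m", "article_crawler", "-u", url, "-o", output_folder]
    # The remaining flags are optional; add each one only when a value is given.
    optional = [("-t", type_), ("-w", website_tag), ("-c", class_), ("-i", id_)]
    for flag, value in optional:
        if value is not None:
            command += [flag, value]
    return command


# Placeholder URL and folder, using a supported site type.
print(build_command("https://example.com/article/1", "./output", type_="csdn"))

# Placeholder URL and folder, locating the article by class and id instead of type.
print(build_command(
    "https://example.com/article/1",
    "./output",
    class_="article_content clearfix",
    id_="article_content",
))
```

With the package installed, the resulting list could be passed to `subprocess.run(command, check=True)` to start the crawl.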
- Open Source License

  MIT License, see https://opensource.org/license/mit/
Source Distribution
article_crawler-0.0.3.tar.gz (7.4 kB)
Built Distribution
Hashes for article_crawler-0.0.3-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | c86e089579806cdffbd421fc5e2f6339e393b6fd2ba81c37c5225ec7710886dd
MD5 | c48693207f226a542787dfd08f7c7e6f
BLAKE2b-256 | cd5fef573e2d1271d2d455527dd1e4c1e41fb483c3d9ebe1a56df2db2644b1c0