A package for crawling Markdown-formatted articles from a given webpage and storing them locally.
Article Crawler
✨ Introduction
Article Crawler is a package for crawling Markdown-formatted articles from a specific webpage and storing them locally in HTML / Markdown formats.
🚀 Quick Start
- Install through pip:
    pip install article-crawler
- Usage:
    python3 -m article_crawler -u [url] -t [type] -o [output_folder] -c [class_] -i [id]
  Options:
    --version                 show program's version number and exit
    -h, --help                show this help message and exit
    -u URL, --url=URL         crawled url (required)
    -t TYPE, --type=TYPE      crawled article type [csdn] | [juejin] | [zhihu] | [jianshu]
    -o OUTPUT_FOLDER, --output_folder=OUTPUT_FOLDER
                              output html / markdown / pdf folder (required)
    -w WEBSITE_TAG, --website_tag=WEBSITE_TAG
                              position of the article content in HTML (not required if 'type' is specified)
    -c CLASS_, --class=CLASS_
                              position of the article content in HTML (not required if 'type' is specified)
    -i ID, --id=ID            position of the article content in HTML (not required if 'type' is specified)
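As a rough illustration of how the flags above fit together, the following sketch assembles the command line from Python. The helper name `build_command`, the example URL, and the output folder are hypothetical and only serve the illustration; the flags themselves are the ones listed in the options.

```python
import sys


def build_command(url, output_folder, article_type=None,
                  website_tag=None, class_=None, id_=None):
    """Assemble an argv list for `python -m article_crawler`.

    Hypothetical helper: it only maps arguments onto the documented flags.
    """
    # -u and -o are marked as required in the options list.
    cmd = [sys.executable, "-m", "article_crawler", "-u", url, "-o", output_folder]
    # The remaining flags are optional and are added only when a value is given.
    optional = [("-t", article_type), ("-w", website_tag), ("-c", class_), ("-i", id_)]
    for flag, value in optional:
        if value is not None:
            cmd += [flag, value]
    return cmd


# Placeholder URL and folder, assumed for illustration only.
cmd = build_command("https://example.com/some-article", "./output", article_type="csdn")
print(" ".join(cmd))

# To actually run the crawl (needs the package installed and network access):
# import subprocess
# subprocess.run(cmd, check=True)
```

Building the arguments as a list rather than a single string avoids shell quoting issues when a URL contains characters such as `&` or `?`.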
- type: Specific websites; currently supported are CSDN, Zhihu, Juejin, and Jianshu.
- website_tag / class_ / id: used to locate the position of the article in HTML. It is possible to use only one or two of them instead of all three.
  e.g. <div id="article_content" class="article_content clearfix"></div>
  - In this element, website_tag, class_, and id are div, article_content clearfix, and article_content respectively.
  - You don't need to specify type when you specify website_tag / class_ / id.
  - You need to use the web console to locate the position of the article.
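To make the role of website_tag / class_ / id more concrete, here is a minimal sketch of locating an element by any combination of the three, using only Python's standard `html.parser`. It is not the package's actual implementation; the `ArticleLocator` class and `locate` function are hypothetical names, and matching the class attribute as an exact string is an assumption made for simplicity.

```python
from html.parser import HTMLParser


class ArticleLocator(HTMLParser):
    """Collects start tags that match the given tag / class / id criteria."""

    def __init__(self, website_tag=None, class_=None, id_=None):
        super().__init__()
        self.website_tag = website_tag
        self.class_ = class_
        self.id_ = id_
        self.matches = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Each criterion is checked only when it was supplied,
        # so one or two of the three are enough.
        if self.website_tag is not None and tag != self.website_tag:
            return
        # Assumption: the class attribute is compared as a whole string.
        if self.class_ is not None and attrs.get("class") != self.class_:
            return
        if self.id_ is not None and attrs.get("id") != self.id_:
            return
        self.matches.append((tag, attrs))


def locate(html, **criteria):
    """Return all (tag, attributes) pairs matching the criteria."""
    locator = ArticleLocator(**criteria)
    locator.feed(html)
    return locator.matches


page = (
    '<html><body>'
    '<div id="article_content" class="article_content clearfix"></div>'
    '<div class="sidebar"></div>'
    '</body></html>'
)

print(locate(page, id_="article_content"))
print(len(locate(page, website_tag="div")))
```

In this sample page, the id alone singles out the article container, while the tag alone matches both `div` elements, which is why combining criteria can help on pages with many similar elements.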
- Open Source License: MIT License, see https://opensource.org/license/mit/