A package for crawling Markdown-formatted articles from certain webpages and storing them locally.
Article Crawler
✨ Introduction
Article Crawler is a package that crawls Markdown-formatted articles from a specific webpage and stores them locally in HTML / Markdown format.
🚀 Quick Start
- Install through `pip`:

      pip install article-crawler
- Usage

      Usage: python3 -m article_crawler -u [url] -t [type] -o [output_folder] -c [class_] -i [id]

      Options:
        --version             show program's version number and exit
        -h, --help            show this help message and exit
        -u URL, --url=URL     crawled url (required)
        -t TYPE, --type=TYPE  crawled article type [csdn] | [juejin] | [zhihu] | [jianshu]
        -o OUTPUT_FOLDER, --output_folder=OUTPUT_FOLDER
                              output html / markdown / pdf folder (required)
        -w WEBSITE_TAG, --website_tag=WEBSITE_TAG
                              position of the article content in HTML (not required if 'type' is specified)
        -c CLASS_, --class=CLASS_
                              position of the article content in HTML (not required if 'type' is specified)
        -i ID, --id=ID        position of the article content in HTML (not required if 'type' is specified)
- `type`: the specific website to crawl. Currently supported: CSDN, Zhihu, Juejin, and Jianshu.
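The options above can also be assembled from a script. The following is a minimal sketch, not part of the package: `build_command` is a hypothetical helper, the URL is a placeholder, and the rule that either a type or at least one locator must be given is an assumption read from the help text. It uses only the flags listed in the help output.

```python
import sys

# Site types listed in the help text for -t / --type.
SUPPORTED_TYPES = ("csdn", "juejin", "zhihu", "jianshu")


def build_command(url, output_folder, type_=None, website_tag=None, class_=None, id_=None):
    """Assemble an argv list for `python -m article_crawler` (hypothetical helper)."""
    if type_ is not None and type_ not in SUPPORTED_TYPES:
        raise ValueError(f"unsupported type: {type_!r}")
    # Assumption: without a type, at least one locator option is needed.
    if type_ is None and not (website_tag or class_ or id_):
        raise ValueError("give either a type or at least one of website_tag / class_ / id")
    cmd = [sys.executable, "-m", "article_crawler", "-u", url, "-o", output_folder]
    # Append only the optional flags that were actually supplied.
    for flag, value in (("-t", type_), ("-w", website_tag), ("-c", class_), ("-i", id_)):
        if value:
            cmd += [flag, value]
    return cmd


if __name__ == "__main__":
    # Placeholder URL, not a real article.
    cmd = build_command("https://example.com/article", "./output", type_="csdn")
    print(" ".join(cmd))
    # To actually run the crawler: subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems when a URL contains characters such as `&` or `?`.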
- `website_tag` / `class_` / `id`: used to locate the position of the article in the HTML. You can use only one or two of them instead of all three.
  e.g. `<div id="article_content" class="article_content clearfix"></div>`
  - In this element, `website_tag`, `class_`, and `id` are `div`, `article_content clearfix`, and `article_content` respectively.
  - You don't need to specify `type` when you specify `website_tag` / `class_` / `id`.
  - You need to use the web console to locate the position of the article.
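To illustrate how a tag name, class, and id together pin down one element, here is a small sketch using only the standard library's `html.parser`. It is not the package's own lookup code, which may match differently; `ElementFinder` and `find_elements` are hypothetical names, and the class comparison here is an exact match on the whole `class` attribute.

```python
from html.parser import HTMLParser


class ElementFinder(HTMLParser):
    """Collect start tags matching the given tag / class / id (illustrative only)."""

    def __init__(self, website_tag=None, class_=None, id_=None):
        super().__init__()
        self.website_tag = website_tag
        self.class_ = class_
        self.id_ = id_
        self.matches = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Each criterion is optional; a criterion left as None is not checked.
        if self.website_tag and tag != self.website_tag:
            return
        if self.class_ and attrs.get("class") != self.class_:
            return
        if self.id_ and attrs.get("id") != self.id_:
            return
        self.matches.append((tag, attrs))


def find_elements(html, website_tag=None, class_=None, id_=None):
    """Return (tag, attributes) pairs for every matching start tag."""
    finder = ElementFinder(website_tag, class_, id_)
    finder.feed(html)
    return finder.matches


if __name__ == "__main__":
    page = (
        '<html><body>'
        '<div id="article_content" class="article_content clearfix"><p>Hello</p></div>'
        '</body></html>'
    )
    # Using only the id is enough to find the element in this page.
    print(find_elements(page, id_="article_content"))
```

In this sample page, any one of the three values identifies the same `div`, which is why supplying only one or two of them can be sufficient.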
- Open Source License: MIT License, see https://opensource.org/license/mit/