A package for crawling Markdown-formatted articles from a given webpage and storing them locally.
Article Crawler
✨ Introduction
Article Crawler is a package that crawls Markdown-formatted articles from a given webpage and stores them locally in HTML / Markdown format.
🚀 Quick Start
- Install via pip:

  ```shell
  pip install article-crawler
  ```
- Usage:

  ```
  python3 -m article_crawler -u [url] -t [type] -o [output_folder] -c [class_] -i [id]
  ```

  Options:

  ```
  --version                 show program's version number and exit
  -h, --help                show this help message and exit
  -u URL, --url=URL         crawled url (required)
  -t TYPE, --type=TYPE      crawled article type: [csdn] | [juejin] | [zhihu] | [jianshu]
  -o OUTPUT_FOLDER, --output_folder=OUTPUT_FOLDER
                            output html / markdown / pdf folder (required)
  -w WEBSITE_TAG, --website_tag=WEBSITE_TAG
                            position of the article content in HTML (not required if 'type' is specified)
  -c CLASS_, --class=CLASS_
                            position of the article content in HTML (not required if 'type' is specified)
  -i ID, --id=ID            position of the article content in HTML (not required if 'type' is specified)
  ```
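For example (illustrative commands only; the URLs are placeholders, and the package must be installed first):

```shell
# Crawl using a built-in site profile (-t selects the right HTML selectors):
python3 -m article_crawler -u https://example.com/some-csdn-post -t csdn -o ./articles

# Or point the crawler at the article element yourself instead of passing -t:
python3 -m article_crawler -u https://example.com/some-post \
    -o ./articles -w div -c "article_content clearfix" -i article_content
```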
- `type`: the target website. Currently supported: CSDN, Zhihu, Juejin, and Jianshu.
- `website_tag` / `class_` / `id`: locate the article content within the page's HTML. For example, in the element

  `<div id="article_content" class="article_content clearfix"></div>`

  `website_tag` is `div`, `class_` is `article_content clearfix`, and `id` is `article_content`.
  - You don't need to specify `type` when you specify `website_tag` / `class_` / `id`.
  - Use your browser's web console to find the element that holds the article; you may supply only one or two of the three options rather than all of them.
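As an illustration of what `website_tag`, `class_`, and `id` select, here is a minimal sketch using Python's standard-library HTML parser. The `ArticleLocator` class is hypothetical and not part of article-crawler; it only shows how the three values identify one element:

```python
from html.parser import HTMLParser

class ArticleLocator(HTMLParser):
    """Collect start tags matching an optional tag name, class attribute, and id."""

    def __init__(self, tag=None, class_=None, id_=None):
        super().__init__()
        self.tag, self.class_, self.id_ = tag, class_, id_
        self.matches = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.tag and tag != self.tag:
            return
        if self.class_ and attrs.get("class") != self.class_:
            return
        if self.id_ and attrs.get("id") != self.id_:
            return
        self.matches.append((tag, attrs))

html = '<body><div id="article_content" class="article_content clearfix"></div></body>'
locator = ArticleLocator(tag="div", class_="article_content clearfix", id_="article_content")
locator.feed(html)
tag, attrs = locator.matches[0]   # the element the three options point at
```

Any subset of the three filters works the same way, which is why the CLI lets you pass only one or two of them.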
Open Source License
MIT License (see https://opensource.org/license/mit/).
File details
Details for the file article_crawler-0.0.4.tar.gz.
File metadata
- Download URL: article_crawler-0.0.4.tar.gz
- Upload date:
- Size: 7.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `bae481259fbc896014cf02c79bb169519451bc4dfa5ef29ee5f35d0492e6aadb` |
| MD5 | `2e65b55346ea59609f7041ed219e1045` |
| BLAKE2b-256 | `b1e84f3811daa9fea55b4e483e3073ab059d79202fbc8b6ebdb0d797987a6437` |
File details
Details for the file article_crawler-0.0.4-py3-none-any.whl.
File metadata
- Download URL: article_crawler-0.0.4-py3-none-any.whl
- Upload date:
- Size: 9.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `f388713e3d22526bb68404044f16af0652536ba817db3a125c48c9a4a69578a8` |
| MD5 | `fc00089c77356a6519d8ca9fc377e4e1` |
| BLAKE2b-256 | `fe758abcf946d3c03680a6ee1ff46ca855e82d20fca8da13b7879d1af1f51470` |