A package for crawling Markdown-formatted articles from a given webpage and storing them locally.

Project description

Article Crawler

✨ Introduction

Article Crawler is a package that crawls Markdown-formatted articles from a given webpage and stores them locally in HTML / Markdown formats.

🚀 Quick Start

  1. Install through pip

    pip install article-crawler
    
  2. Usage

    Usage: python3 -m article_crawler -u [url] -t [type] -o [output_folder] -c [class_] -i [id]

    Options:
      --version             show program's version number and exit
      -h, --help            show this help message and exit
      -u URL, --url=URL     crawled url (required)
      -t TYPE, --type=TYPE  crawled article type [csdn] | [juejin] | [zhihu] | [jianshu]
      -o OUTPUT_FOLDER, --output_folder=OUTPUT_FOLDER
                            output html / markdown / pdf folder (required)
      -w WEBSITE_TAG, --website_tag=WEBSITE_TAG
                            position of the article content in HTML (not required if 'type' is specified)
      -c CLASS_, --class=CLASS_
                            position of the article content in HTML (not required if 'type' is specified)
      -i ID, --id=ID        position of the article content in HTML (not required if 'type' is specified)
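
The options above can also be assembled from a script. The following is a minimal sketch, not part of the package: `build_crawler_command` is a hypothetical helper, `https://example.com/post` is a placeholder URL, and the rule that either a type or at least one locator must be given is an assumption drawn from the help text. Only flags listed in the help output are used.

```python
import sys
from typing import List, Optional


def build_crawler_command(
    url: str,
    output_folder: str,
    article_type: Optional[str] = None,
    website_tag: Optional[str] = None,
    class_: Optional[str] = None,
    id_: Optional[str] = None,
) -> List[str]:
    """Assemble an argv list for `python -m article_crawler` (hypothetical helper)."""
    # Assumption: the article is located either by a known site type
    # or by at least one of website_tag / class_ / id.
    if article_type is None and not any((website_tag, class_, id_)):
        raise ValueError("give a type or at least one of website_tag / class_ / id")

    # -u and -o are marked as required in the help text.
    command = [sys.executable, "-m", "article_crawler", "-u", url, "-o", output_folder]

    # Append each optional flag only when a value was supplied.
    optional_flags = (
        ("-t", article_type),
        ("-w", website_tag),
        ("-c", class_),
        ("-i", id_),
    )
    for flag, value in optional_flags:
        if value:
            command += [flag, value]
    return command


if __name__ == "__main__":
    # Placeholder URL; replace it with a real article address.
    print(build_crawler_command("https://example.com/post", "./output", article_type="csdn"))
```

The returned list can be passed to `subprocess.run(command, check=True)` once the package is installed.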
    
    • type: The website the article comes from. CSDN, Zhihu, Juejin, and Jianshu are currently supported.

    • website_tag / class_ / id:

      e.g. <div id="article_content" class="article_content clearfix"></div>

      • In this element, website_tag, class_, and id are div, article_content clearfix, and article_content respectively.
      1. You don't need to specify type when you specify website_tag / class_ / id.
      2. Use the browser's developer tools to find the element that contains the article.
      3. website_tag / class_ / id locate the article in the HTML. You can supply only one or two of them instead of all three.
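
To illustrate how a tag name, class, and id can pin down the article element, here is a minimal sketch built on Python's standard `html.parser`. It is not the package's implementation: `ArticleLocator` and `locate_article` are hypothetical names, and treating `class_` as a space-separated set that must all be present on the element is an assumption.

```python
from html.parser import HTMLParser
from typing import List, Optional


class ArticleLocator(HTMLParser):
    """Collect the text of the first element matching tag / class / id."""

    def __init__(self, website_tag=None, class_=None, id_=None):
        super().__init__()
        self.website_tag = website_tag
        self.class_ = class_
        self.id_ = id_
        self.found = False
        self.text_parts: List[str] = []
        self._matched_tag: Optional[str] = None  # tag name of the open match
        self._depth = 0  # nesting depth of same-named tags inside the match

    def _matches(self, tag, attrs) -> bool:
        attributes = dict(attrs)
        if self.website_tag and tag != self.website_tag:
            return False
        if self.id_ and attributes.get("id") != self.id_:
            return False
        if self.class_:
            wanted = set(self.class_.split())
            present = set((attributes.get("class") or "").split())
            if not wanted <= present:
                return False
        return True

    def handle_starttag(self, tag, attrs):
        if self._matched_tag is None:
            if not self.found and self._matches(tag, attrs):
                self.found = True
                self._matched_tag = tag
                self._depth = 1
        elif tag == self._matched_tag:
            self._depth += 1

    def handle_endtag(self, tag):
        if self._matched_tag is not None and tag == self._matched_tag:
            self._depth -= 1
            if self._depth == 0:
                self._matched_tag = None

    def handle_data(self, data):
        if self._matched_tag is not None:
            self.text_parts.append(data)


def locate_article(html, website_tag=None, class_=None, id_=None) -> Optional[str]:
    """Return the matched element's text, or None if nothing matches."""
    if not any((website_tag, class_, id_)):
        raise ValueError("give at least one of website_tag / class_ / id")
    locator = ArticleLocator(website_tag, class_, id_)
    locator.feed(html)
    locator.close()
    return "".join(locator.text_parts).strip() if locator.found else None


if __name__ == "__main__":
    page = '<div id="article_content" class="article_content clearfix"><p>Hello</p></div>'
    print(locate_article(page, website_tag="div", id_="article_content"))
```

Any one of the three locators is enough when it is unique on the page; combining them narrows the match when a class or tag name is shared by several elements.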

Open Source License

MIT License, see https://opensource.org/license/mit/

Download files

Source Distribution

article_crawler-0.0.4.tar.gz (7.2 kB)

Uploaded Source

Built Distribution

article_crawler-0.0.4-py3-none-any.whl (9.0 kB)

Uploaded Python 3

File details

Details for the file article_crawler-0.0.4.tar.gz.

File metadata

  • Download URL: article_crawler-0.0.4.tar.gz
  • Upload date:
  • Size: 7.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.3

File hashes

Hashes for article_crawler-0.0.4.tar.gz
Algorithm     Hash digest
SHA256        bae481259fbc896014cf02c79bb169519451bc4dfa5ef29ee5f35d0492e6aadb
MD5           2e65b55346ea59609f7041ed219e1045
BLAKE2b-256   b1e84f3811daa9fea55b4e483e3073ab059d79202fbc8b6ebdb0d797987a6437

File details

Details for the file article_crawler-0.0.4-py3-none-any.whl.

File hashes

Hashes for article_crawler-0.0.4-py3-none-any.whl
Algorithm     Hash digest
SHA256        f388713e3d22526bb68404044f16af0652536ba817db3a125c48c9a4a69578a8
MD5           fc00089c77356a6519d8ca9fc377e4e1
BLAKE2b-256   fe758abcf946d3c03680a6ee1ff46ca855e82d20fca8da13b7879d1af1f51470