PTT crawler using asyncio

AioPTTCrawler (PTT web crawler)

This is a Python package for crawling PTT article data using asyncio.

Documentation

PyPi Page

pip install AioPTTCrawler
from AioPTTCrawler import AioPTTCrawler
ptt_crawler = AioPTTCrawler()

Usage

Get data from PTT

ptt_crawler = AioPTTCrawler()
BOARD = "Gossiping"

# Crawl the latest pages of a board
ptt_data = ptt_crawler.get_board_latest_articles(board=BOARD, page_count=10)

# Or crawl a specific index range of a board
ptt_data = ptt_crawler.get_board_articles(board=BOARD, start_index=100, end_index=200)

ptt_data is a PTTData object. To extract data from it, use methods such as get_article_dict(), get_article_dataframe(), or get_article_list().

get dict from PTTData

article_dict = ptt_data.get_article_dict()
comment_dict = ptt_data.get_comment_dict()

article's dict format

[
    {
        "article" : "Article's ID. ex:M.1663144920.A.A6E",
        "article_title" : "Article's title. ex:[公告] 批踢踢27週年活動宣導公告更新",
        "user_id" : "Author's ID. ex: ubcs",
        "user_name" : "Author's name. ex:(覺★青年超冒險蓋)",
        "board" : "BBS Board ex: Gossiping",
        "datetime" : "Post time. ex: Wed Sep 14 16:41:58 2022.",
        "context" : "Content of the article. ex: PTT 27 周年活動開始囉,本篇為置底宣導,詳情參閱下面資料...",
        "ip_address" : "IP address. ex: 59.120.192.119",
        "comment_list" : [
            {"comment_dict"},
            {"comment_dict"},
        ]
    }, {"..."}
]
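As a minimal sketch of working with this format, the snippet below walks a list of article dicts and summarizes it with the standard library only. The sample data is hand-written to match the keys documented above; it is not real crawler output. The helper names (comment_counts, posts_per_author) and the second commenter's ID are hypothetical and not part of AioPTTCrawler.

```python
from collections import Counter

# Hypothetical sample shaped like the documented article format
# (hand-written for illustration, not real crawl output).
sample_articles = [
    {
        "article": "M.1663144920.A.A6E",
        "article_title": "[公告] 批踢踢27週年活動宣導公告更新",
        "user_id": "ubcs",
        "board": "Gossiping",
        "comment_list": [
            {"tag": "推", "user_id": "bill403777", "context": "錢"},
            {"tag": "→", "user_id": "example_user", "context": "..."},
        ],
    },
]


def comment_counts(articles):
    """Map each article title to the number of comments attached to it."""
    return {a["article_title"]: len(a.get("comment_list", [])) for a in articles}


def posts_per_author(articles):
    """Count how many articles each author ID posted."""
    return Counter(a["user_id"] for a in articles)
```

In real use, the list returned by ptt_data.get_article_dict() would presumably take the place of sample_articles.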

comment's dict format

[
    {
        "article_id" : "Article's ID. ex:M.1663144920.A.A6E",
        "tag" : "comment's reaction. ex: 推 噓 →",
        "user_id" : "User's ID. ex: bill403777",
        "comment_order" : "order of comment. ex: 1",
        "context" : "Content of the comment. ex: 錢",
        "datetime" : "Post time. ex: 09/14 16:42",
        "ip_address" : "IP address. ex: 27.53.96.42",
    }, {"..."}
]
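Because each comment carries its article ID and a reaction tag, a flat comment list can be grouped per article. The sketch below tallies tags (推, 噓, →) per article using only the standard library. The sample records are hand-written to match the keys documented above, and the function name tag_counts_by_article is a hypothetical helper, not part of AioPTTCrawler.

```python
from collections import Counter, defaultdict

# Hypothetical sample shaped like the documented comment format
# (the second and third user IDs are made up for illustration).
sample_comments = [
    {"article_id": "M.1663144920.A.A6E", "tag": "推", "user_id": "bill403777", "context": "錢"},
    {"article_id": "M.1663144920.A.A6E", "tag": "推", "user_id": "example_user_a", "context": "..."},
    {"article_id": "M.1663144920.A.A6E", "tag": "噓", "user_id": "example_user_b", "context": "..."},
]


def tag_counts_by_article(comments):
    """Return {article_id: Counter({tag: count})} for a flat comment list."""
    counts = defaultdict(Counter)
    for comment in comments:
        # Strip whitespace in case the tag was stored with padding.
        counts[comment["article_id"]][comment["tag"].strip()] += 1
    return counts
```

The same function could presumably be applied to the list returned by ptt_data.get_comment_dict().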

The examples above are taken from this article.

Comparison

Difference in elapsed time between the normal (synchronous) method and the async method:

[Figure: time diff (unit: seconds)]
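To illustrate why an asyncio-based crawler tends to finish sooner, here is a minimal sketch that simulates page fetches with asyncio.sleep instead of real network requests. It does not use AioPTTCrawler or reproduce its measurements. The names fake_fetch, fetch_sequential, fetch_concurrent, and timed are hypothetical, and the 0.05-second delay is an arbitrary stand-in for network latency.

```python
import asyncio
import time


async def fake_fetch(page, delay=0.05):
    """Stand-in for one HTTP request: wait, then return the page number."""
    await asyncio.sleep(delay)
    return page


async def fetch_sequential(pages):
    """Await each fetch before starting the next one."""
    return [await fake_fetch(p) for p in pages]


async def fetch_concurrent(pages):
    """Start all fetches at once and wait for them together."""
    return await asyncio.gather(*(fake_fetch(p) for p in pages))


def timed(coro_fn, pages):
    """Run a coroutine function and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro_fn(pages))
    return result, time.perf_counter() - start
```

Calling timed(fetch_sequential, pages) and timed(fetch_concurrent, pages) on the same list of pages should show the concurrent version taking roughly one delay in total, while the sequential version takes about one delay per page.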

Support

You may report bugs, ask for help, and discuss other issues on the issues page.
