
Project description

zhihu_crawler


This program supports scraping Zhihu keyword search results, hot-list questions, user profiles, answers, column articles, comments, and more.

Project layout


  • __init__.py is the unified public entry point of the package

  • constants.py constants

  • exceptions.py custom exceptions

  • extractors.py data extraction and cleaning

  • page_iterators.py simple page handling

  • zhihu_scraper.py page requests and cookie setup

  • zhihu_types.py type hints and checks; project-defined types

  • Note: the project performs some asynchronous operations, so a monkey patch must be applied before its modules are imported (see the sketch below); the project also does no special handling for IP rate limits or login.
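
The README does not name the library being patched; as a minimal sketch, assuming the asynchronous parts are gevent-based, the patch would look like this:

    # Assumption: gevent is the async backend; the note above only says
    # "monkey patch". Patch before any zhihu_crawler import.
    from gevent import monkey
    monkey.patch_all()

    from zhihu_crawler import search_crawl  # safe to import after patching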

Installation


pip install zhihu_crawler

Usage


# __init__.py is the package's unified entry point; apply any monkey
# patch (see the note above) before this import.
from zhihu_crawler import (set_proxy, set_cookie, search_crawl,
                           user_crawler, hot_questions_crawl)

if __name__ == '__main__':

    # Set a proxy; for large crawls, switching proxies on every request
    # is recommended (see the rotation sketch below).
    set_proxy({'http': 'http://127.0.0.1:8125', 'https': 'http://127.0.0.1:8125'})

    # Set cookies.
    set_cookie({'d_c0': 'AIBfvRMxmhSPTk1AffR--QLwm-gDM5V5scE=|1646725014'})

    # Keyword search example:
    for info in search_crawl(key_word='天空', count=10):
        print(info)

    # Pass data_type to restrict the search to one content type:
    for info in search_crawl(key_word='天空', count=10, data_type='answer'):
        print(info)

    # User answers example (fetch the user's profile and 50 answers,
    # each answer carrying 50 comments):
    for info in user_crawler('wo-men-de-tai-kong',
                             answer_count=50,
                             comment_count=50):
        print(info)

    # User questions example (fetch the user's profile and 10 questions,
    # each question with 10 answers, each answer with 50 comments):
    for info in user_crawler('wo-men-de-tai-kong',
                             question_count=10,
                             drill_down_count=10,
                             comment_count=50):
        print(info)

    # Hot-question example: crawl the top 10 questions,
    # with 10 answers per question.
    for info in hot_questions_crawl(question_count=10, drill_down_count=10):
        print(info)

    # Pass period to pick a board (hour, day, week, or month) and
    # domains to crawl questions from specific topics:
    for info in hot_questions_crawl(question_count=10, period='day', domains=['1001', 1003]):
        print(info)
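
Since the project does no IP-limit handling itself, the per-request proxy advice above amounts to rotating proxies from your own pool between calls. A minimal sketch, where the pool endpoints are hypothetical placeholders:

    import itertools

    from zhihu_crawler import search_crawl, set_proxy

    # Hypothetical proxy pool; replace with real endpoints.
    PROXY_POOL = itertools.cycle([
        {'http': 'http://127.0.0.1:8125', 'https': 'http://127.0.0.1:8125'},
        {'http': 'http://127.0.0.1:8126', 'https': 'http://127.0.0.1:8126'},
    ])

    for keyword in ['天空', '大海']:
        set_proxy(next(PROXY_POOL))  # switch to the next proxy before each batch
        for info in search_crawl(key_word=keyword, count=10):
            print(info)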



Download files

Download the file for your platform.

Source Distribution

zhihu_crawler-0.0.2.tar.gz (3.1 MB)


Built Distribution


zhihu_crawler-0.0.2-py3-none-any.whl (4.4 MB)


File details

Details for the file zhihu_crawler-0.0.2.tar.gz.

File metadata

  • Download URL: zhihu_crawler-0.0.2.tar.gz
  • Upload date:
  • Size: 3.1 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.0 CPython/3.7.9

File hashes

Hashes for zhihu_crawler-0.0.2.tar.gz

  • SHA256: 917fd4c687cd0cb01b3c95ad240d386f06ac776aa6fdeec5b44916398328a3b8
  • MD5: ad9cbc1f277e979f182e1b0c71a39d3b
  • BLAKE2b-256: 98d7abf98bcb4c21d91c6bfc711f4239d28ea9311bd61102f3ed00ae0110c8c2
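
To verify a download against the digests above, the standard-library hashlib is enough; a small sketch (the local file path is an assumption):

    import hashlib

    EXPECTED_SHA256 = '917fd4c687cd0cb01b3c95ad240d386f06ac776aa6fdeec5b44916398328a3b8'

    # Assumes the sdist was downloaded into the current directory.
    with open('zhihu_crawler-0.0.2.tar.gz', 'rb') as f:
        digest = hashlib.sha256(f.read()).hexdigest()

    assert digest == EXPECTED_SHA256, 'SHA256 mismatch; discard the file'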


File details

Details for the file zhihu_crawler-0.0.2-py3-none-any.whl.

File metadata

  • Download URL: zhihu_crawler-0.0.2-py3-none-any.whl
  • Upload date:
  • Size: 4.4 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.0 CPython/3.7.9

File hashes

Hashes for zhihu_crawler-0.0.2-py3-none-any.whl

  • SHA256: 8f90144d2ed2668f785889a77908fc518b4cd6857aeef173bf44686ea7f4586e
  • MD5: 65022539b7aa072591c57505afad53c6
  • BLAKE2b-256: 0f306bf5a1e8d490df3b1f0fddcfcdf7bf5c68da7080c2ba965eee4418839548

