A scraper for Zhihu keyword search, the hot list, user profiles, answers, column articles, comments, and more
Project description
zhihu_crawler
This program supports scraping keyword search results, the hot list, user profiles, answers, column articles, comments, and more.
Project layout
__init__.py  unified external entry point of the package
constants.py  constants
exceptions.py  custom exceptions
extractors.py  data cleaning
page_iterators.py  simple page handling
zhihu_types.py  type hints/checks and project-specific types
zhihu_scraper.py  page requests and cookie handling
Notes: parts of the project run asynchronously, so the monkey patch must be applied before the project's modules are imported (see the sketch below); the project also does no dedicated handling of IP rate limits or login.
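The note above only states that a monkey patch has to run before any project module is imported. The snippet below is a minimal sketch of that import ordering, assuming the asynchronous layer is gevent-based (an assumption, not confirmed by this page; adjust to whatever async library the installed version actually uses):

# Assumption: the async operations use gevent, so patch the standard library
# first, before any zhihu_crawler module is imported.
from gevent import monkey
monkey.patch_all()

from zhihu_crawler import search_crawl  # import only after patching

for info in search_crawl(key_word='天空', count=10):
    print(info)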
Installation
pip install zhihu_crawler
Usage
# The top-level package re-exports the crawling helpers (see __init__.py above).
from zhihu_crawler import set_proxy, set_cookie, search_crawl, user_crawler, hot_questions_crawl

if __name__ == '__main__':
    # Set a proxy; for larger crawls it is recommended to switch proxies on every request.
    set_proxy({'http': 'http://127.0.0.1:8125', 'https': 'http://127.0.0.1:8125'})
    # Set cookies.
    set_cookie({'d_c0': 'AIBfvRMxmhSPTk1AffR--QLwm-gDM5V5scE=|1646725014'})

    # Keyword search example:
    for info in search_crawl(key_word='天空', count=10):
        print(info)

    # data_type can be passed to restrict the search to one result type:
    for info in search_crawl(key_word='天空', count=10, data_type='answer'):
        print(info)

    # User profile + answer list example (collects the user's profile and 50 answers,
    # each answer with 50 comments):
    for info in user_crawler('wo-men-de-tai-kong',
                             answer_count=50,
                             comment_count=50):
        print(info)

    # User profile + question list example (collects the user's profile and 10 questions,
    # each question with 10 answers, each answer with 50 comments):
    for info in user_crawler('wo-men-de-tai-kong',
                             question_count=10,
                             drill_down_count=10,
                             comment_count=50):
        print(info)

    # Hot questions example:
    # collect the top 10 questions, 10 answers per question.
    for info in hot_questions_crawl(question_count=10, drill_down_count=10):
        print(info)

    # period selects the hot-list window (hour, day, week, or month);
    # domains restricts the crawl to questions under specific topics.
    for info in hot_questions_crawl(question_count=10, period='day', domains=['1001', 1003]):
        print(info)
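The first comment in the example recommends switching proxies on every request when crawling at scale. Below is a minimal sketch of one way to rotate proxies with set_proxy between crawl tasks; the proxy endpoints are placeholders, and whether a mid-iteration set_proxy call also affects later page requests depends on the scraper internals, so this version simply rotates once per task:

import itertools

from zhihu_crawler import set_proxy, search_crawl

# Placeholder proxy endpoints; substitute your own pool.
PROXY_POOL = itertools.cycle([
    'http://127.0.0.1:8125',
    'http://127.0.0.1:8126',
])

def rotate_proxy():
    # Point both HTTP and HTTPS traffic at the next proxy in the pool.
    proxy = next(PROXY_POOL)
    set_proxy({'http': proxy, 'https': proxy})

for key_word in ['天空', '海洋']:
    rotate_proxy()  # switch proxy before each crawl task
    for info in search_crawl(key_word=key_word, count=10):
        print(info)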
File details
Details for the file zhihu_crawler-0.0.2.tar.gz.
File metadata
- Download URL: zhihu_crawler-0.0.2.tar.gz
- Upload date:
- Size: 3.1 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.0 CPython/3.7.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 917fd4c687cd0cb01b3c95ad240d386f06ac776aa6fdeec5b44916398328a3b8 |
| MD5 | ad9cbc1f277e979f182e1b0c71a39d3b |
| BLAKE2b-256 | 98d7abf98bcb4c21d91c6bfc711f4239d28ea9311bd61102f3ed00ae0110c8c2 |
File details
Details for the file zhihu_crawler-0.0.2-py3-none-any.whl.
File metadata
- Download URL: zhihu_crawler-0.0.2-py3-none-any.whl
- Upload date:
- Size: 4.4 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.0 CPython/3.7.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8f90144d2ed2668f785889a77908fc518b4cd6857aeef173bf44686ea7f4586e |
| MD5 | 65022539b7aa072591c57505afad53c6 |
| BLAKE2b-256 | 0f306bf5a1e8d490df3b1f0fddcfcdf7bf5c68da7080c2ba965eee4418839548 |