Code for crawling Zhihu and GitHub; the scraped data is stored in MongoDB.
scrapy-zhihu-github
Code for crawling Zhihu and GitHub; the scraped data is stored in MongoDB.
Install
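For installing Scrapy, see the article "使用Scrapy抓取数据" (Using Scrapy to Scrape Data).
MongoDB is installed on the local machine and listens on the default port. The database is named zhihu and contains the following collections:
zh_user: Zhihu users
zh_ask: Zhihu questions
zh_answer: Zhihu answers
zh_followee: Zhihu following lists
zh_follower: Zhihu follower lists
gh_user: GitHub users
gh_repo: GitHub repositories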
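As a quick sanity check of the setup described above, the following minimal sketch compares the collections present in a database against the names listed in this section. The helper names (`missing_collections`, `open_database`) are illustrative and not part of this project; `open_database` assumes the `pymongo` package is installed and that MongoDB listens on its standard default port, 27017.

```python
# Collection names listed in this README, with a short description of each.
EXPECTED_COLLECTIONS = {
    "zh_user": "Zhihu users",
    "zh_ask": "Zhihu questions",
    "zh_answer": "Zhihu answers",
    "zh_followee": "Zhihu following lists",
    "zh_follower": "Zhihu follower lists",
    "gh_user": "GitHub users",
    "gh_repo": "GitHub repositories",
}


def missing_collections(existing_names):
    """Return the expected collection names that are absent from existing_names."""
    existing = set(existing_names)
    return sorted(name for name in EXPECTED_COLLECTIONS if name not in existing)


def open_database(host="localhost", port=27017, name="zhihu"):
    """Connect to a local MongoDB instance (hypothetical helper, needs pymongo)."""
    # Imported lazily so missing_collections() works without pymongo installed.
    from pymongo import MongoClient

    return MongoClient(host, port)[name]
```

With a running MongoDB instance, `missing_collections(open_database().list_collection_names())` would list the collections that no spider has populated yet.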
zhihu
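Zhihu data is crawled with Scrapy; for details, see the article "使用Scrapy爬取知乎网站" (Using Scrapy to Crawl the Zhihu Website).
The structure of the Zhihu user collection (db.zhihu.zh_user) is: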
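_id int, # user ID
url string,
username string, # username, e.g. zhouyuan
nickname string, # display name, e.g. 周源
location string, # place of residence
industry string, # industry, e.g. 互联网 (Internet)
sex int, # gender: 1 = male, 2 = female, 0 = unknown
jobs [],
educations [],
description string, # self-introduction
sinaweibo string, # Sina Weibo account
tencentweibo string, # Tencent Weibo account
# qq string, # QQ number
ask_num int, # number of questions asked, e.g. 590
answer_num int, # number of answers, e.g. 340
post_num int, # number of column articles, e.g. 3
collection_num int, # number of collections, e.g. 9
log_num int, # number of edits, e.g. 14980
agree_num int, # upvotes received, e.g. 15316
thank_num int, # thanks received, e.g. 3500
fav_num int, # times favorited by others, e.g. 9424
share_num int, # times shared, e.g. 922
followee_num int, # number of users followed, e.g. 1515
follower_num int, # number of followers, e.g. 319529
update_time datetime # time the record was last updated, e.g. 2014-05-17 11:15:00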
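To make the structure above concrete, here is a minimal sketch of a helper that builds a dictionary shaped like a zh_user document. The function `make_zh_user`, its defaults (empty strings, empty lists, zero counters), and the `SEX_LABELS` mapping are assumptions for illustration; the project's own spiders may populate these fields differently.

```python
from datetime import datetime

# Counter fields from the zh_user structure; all are stored as ints.
COUNTER_FIELDS = (
    "ask_num", "answer_num", "post_num", "collection_num", "log_num",
    "agree_num", "thank_num", "fav_num", "share_num",
    "followee_num", "follower_num",
)

# Text fields that default to an empty string in this sketch.
TEXT_FIELDS = (
    "url", "location", "industry", "description", "sinaweibo", "tencentweibo",
)

# Meaning of the sex field as documented above.
SEX_LABELS = {1: "male", 2: "female", 0: "unknown"}


def make_zh_user(user_id, username, nickname, **fields):
    """Build a dict shaped like a zh_user document (hypothetical helper)."""
    sex = int(fields.get("sex", 0))
    if sex not in SEX_LABELS:
        raise ValueError("sex must be 0, 1 or 2")

    doc = {"_id": int(user_id), "username": username, "nickname": nickname, "sex": sex}
    for name in TEXT_FIELDS:
        doc[name] = fields.get(name, "")
    doc["jobs"] = list(fields.get("jobs", []))
    doc["educations"] = list(fields.get("educations", []))
    for name in COUNTER_FIELDS:
        doc[name] = int(fields.get(name, 0))
    # Record when this document was produced unless the caller supplies a time.
    doc["update_time"] = fields.get("update_time") or datetime.now()
    return doc
```

For example, `make_zh_user(1, "zhouyuan", "周源", ask_num=590, sex=1)` yields a document whose unset counters are 0 and whose `update_time` is the current time.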
First run the following command to collect user information along with each user's following and follower lists:
scrapy crawl zhihu_user
Then collect the questions and answers:
scrapy crawl zhihu_ask
scrapy crawl zhihu_answer
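The ordering above (users first, then questions and answers) can be scripted. The following is a minimal sketch that runs the three spiders one after another through the `scrapy crawl` command line, assuming it is started from the Scrapy project directory; the helper names `crawl_command` and `run_spiders` are illustrative and not part of this project.

```python
import subprocess

# Spider names taken from the commands above, in the order the README gives.
ZHIHU_SPIDERS = ["zhihu_user", "zhihu_ask", "zhihu_answer"]


def crawl_command(spider):
    """Return the argument list for running one spider."""
    return ["scrapy", "crawl", spider]


def run_spiders(spiders, runner=subprocess.run):
    """Run each spider in turn; check=True stops at the first failing crawl."""
    for spider in spiders:
        runner(crawl_command(spider), check=True)
```

Calling `run_spiders(ZHIHU_SPIDERS)` waits for each crawl to finish before starting the next, so the question and answer spiders only start once the user crawl has exited successfully.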
github
The structure of the GitHub user collection (db.zhihu.gh_user) is:
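_id, # user ID
url, # profile URL
username, # username
nickname, # display name
user_id, # user ID
type, # type: 1 = organization, 0 = individual
company, # company
location, # location
website, # website
email, # email address
update_time, # time the crawler last updated the record
join_date, # date the user joined
followee_num, # number of users followed
follower_num, # number of followers
star_num, # number of stars
organizations, # organizations the user belongs to
member_num, # number of organization members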
First run the following command to collect user information:
scrapy crawl github_user
Crawl user information together with each user's followers:
scrapy crawl github_follower
Check the crawl results in the mongo shell:
> use zhihu
switched to db zhihu
> db.gh_user.count()
126135
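The same check can be made from Python. This is a minimal sketch, assuming the `pymongo` package and a MongoDB instance on the local default port; the helper names are illustrative. It uses `count_documents({})` because `Collection.count()` was removed in PyMongo 4.

```python
def count_documents_in(db, collection_name):
    """Return the number of documents in one collection of a pymongo Database."""
    return db[collection_name].count_documents({})


def collection_sizes(db, names):
    """Map each collection name to its document count."""
    return {name: count_documents_in(db, name) for name in names}


def main():
    # Imported here so the helpers above can be used without pymongo installed.
    from pymongo import MongoClient

    db = MongoClient("localhost", 27017)["zhihu"]
    for name, size in collection_sizes(db, ["gh_user", "gh_repo"]).items():
        print(name, size)
```

Calling `main()` prints one line per collection; the counts depend on how far the crawl has progressed.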