
Code for crawling Zhihu and GitHub, with the data stored in MongoDB.


scrapy-zhihu-github

Code for crawling Zhihu and GitHub, with the data stored in MongoDB.

Install

For Scrapy installation, see "Scraping data with Scrapy" (使用Scrapy抓取数据).

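MongoDB is installed on the local machine. The database is named zhihu, listens on the default port, and contains the following collections:

  • zh_user: Zhihu users
  • zh_ask: Zhihu questions
  • zh_answer: Zhihu answers
  • zh_followee: Zhihu followee lists (users someone follows)
  • zh_follower: Zhihu follower lists
  • gh_user: GitHub users
  • gh_repo: GitHub repositories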
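
As a rough sketch of how the layout above might be checked from Python, the snippet below records the collection names and opens the zhihu database. The use of pymongo, the helper names (`COLLECTIONS`, `open_db`, `missing_collections`), and the `localhost` address with port 27017 (MongoDB's default) are assumptions made for illustration; the project's own connection code may differ.

```python
# Collection names and meanings, as listed above.
COLLECTIONS = {
    "zh_user": "Zhihu users",
    "zh_ask": "Zhihu questions",
    "zh_answer": "Zhihu answers",
    "zh_followee": "Zhihu followee lists",
    "zh_follower": "Zhihu follower lists",
    "gh_user": "GitHub users",
    "gh_repo": "GitHub repositories",
}


def open_db(host="localhost", port=27017, name="zhihu"):
    """Return a handle to the zhihu database (assumes pymongo is installed)."""
    # Imported lazily so the rest of this sketch works without pymongo.
    from pymongo import MongoClient

    return MongoClient(host, port)[name]


def missing_collections(existing_names):
    """Return the expected collections absent from existing_names, sorted."""
    return sorted(set(COLLECTIONS) - set(existing_names))
```

With a running MongoDB instance, `missing_collections(open_db().list_collection_names())` would report which of the collections have not been created yet.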

zhihu

Zhihu data is crawled with Scrapy; for details, see "Crawling the Zhihu website with Scrapy" (使用Scrapy爬取知乎网站).

The Zhihu user collection (db.zhihu.zh_user) has the following structure:

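_id int, # user id
url string,
username string, # username, e.g. zhouyuan
nickname string, # display name, e.g. 周源
location string, # place of residence
industry string, # industry, e.g. 互联网 (Internet)
sex int, # gender, 1: male, 2: female, 0: unknown
jobs [],
educations [],
description string, # self-introduction
sinaweibo string, # Sina Weibo account
tencentweibo string, # Tencent Weibo account
# qq string, # QQ number
ask_num int, # number of questions asked, e.g. 590
answer_num int, # number of answers, e.g. 340
post_num int, # number of column articles, e.g. 3
collection_num int, # number of collections, e.g. 9
log_num int, # number of edits, e.g. 14980
agree_num int, # upvotes received, e.g. 15316
thank_num int, # thanks received, e.g. 3500
fav_num int, # times favorited by others, e.g. 9424
share_num int, # times shared, e.g. 922
followee_num int, # number of users followed, e.g. 1515
follower_num int, # number of followers, e.g. 319529
update_time datetime # time the record was last updated, e.g. 2014-05-17 11:15:00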
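
To make the structure above concrete, here is a minimal sketch of building one zh_user document as a Python dict. The helper `make_user_doc`, the defaults it fills in, and the user id `1` are hypothetical; the other example values are the ones given in the field comments. The project's own item and pipeline code is not shown in this description and may look different.

```python
from datetime import datetime

# Meaning of the `sex` codes, as documented in the schema.
SEX_LABELS = {1: "male", 2: "female", 0: "unknown"}

# Counter fields that the schema declares as int.
COUNT_FIELDS = (
    "ask_num", "answer_num", "post_num", "collection_num", "log_num",
    "agree_num", "thank_num", "fav_num", "share_num",
    "followee_num", "follower_num",
)


def make_user_doc(user_id, username, nickname, **fields):
    """Build a zh_user-shaped dict; defaults here are assumptions."""
    doc = {
        "_id": int(user_id),
        "username": username,
        "nickname": nickname,
        "sex": 0,
        "jobs": [],
        "educations": [],
        "update_time": datetime.now(),
    }
    for name in COUNT_FIELDS:
        doc[name] = 0
    doc.update(fields)
    # Coerce counters to int and reject undocumented sex codes.
    for name in COUNT_FIELDS:
        doc[name] = int(doc[name])
    if doc["sex"] not in SEX_LABELS:
        raise ValueError(f"unknown sex code: {doc['sex']!r}")
    return doc


example = make_user_doc(
    1,  # hypothetical id, not taken from the source
    "zhouyuan",
    "周源",
    industry="互联网",
    ask_num=590,
    answer_num=340,
    follower_num=319529,
    update_time=datetime(2014, 5, 17, 11, 15, 0),
)
```

A dict of this shape could then be written with a pymongo call such as `db["zh_user"].replace_one({"_id": example["_id"]}, example, upsert=True)`, assuming `db` is a pymongo database handle.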

First, run the following command to collect user information together with each user's followee and follower lists:

scrapy crawl zhihu_user

Then collect the questions and answers:

scrapy crawl zhihu_ask

scrapy crawl zhihu_answer
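
The three commands above are meant to run in order, users first. One way to script that sequence is sketched below; the helper names `crawl_commands` and `run_all` and the `project_dir` parameter are hypothetical, and the sketch assumes the `scrapy` executable is on the PATH and that it is run from the Scrapy project directory.

```python
import subprocess

# Spider names in the order given above: users first, then questions and answers.
ZHIHU_SPIDERS = ["zhihu_user", "zhihu_ask", "zhihu_answer"]


def crawl_commands(spiders=ZHIHU_SPIDERS):
    """Return one `scrapy crawl <name>` argument list per spider."""
    return [["scrapy", "crawl", name] for name in spiders]


def run_all(spiders=ZHIHU_SPIDERS, project_dir="."):
    """Run each spider in turn, stopping at the first failure."""
    for cmd in crawl_commands(spiders):
        # check=True raises CalledProcessError if a crawl exits non-zero,
        # so later spiders do not run against incomplete user data.
        subprocess.run(cmd, cwd=project_dir, check=True)
```

Calling `run_all()` from the project directory would then start the three crawls one after another.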

github

The GitHub user collection (db.zhihu.gh_user) has the following structure:

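_id, # user id
url, # profile URL
username, # username
nickname, # display name
user_id, # user id
type, # type: 1 = organization, 0 = individual

company, # company
location, # location
website, # website
email, # email address
update_time, # time the crawler last updated the record

join_date, # date joined
followee_num, # number of users followed
follower_num, # number of followers
star_num, # number of stars
organizations, # organizations joined

member_num, # number of organization members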

First, run the following command to collect user information:

scrapy crawl github_user

Crawl user information together with each user's followers:

scrapy crawl github_follower

Check the crawl results in the mongo shell:

> use zhihu
switched to db zhihu
> db.gh_user.count()
126135
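
The same check can be sketched in Python. This is an illustration only: it assumes pymongo is installed, and `count_github_users` and `open_db` are hypothetical helper names. It uses `count_documents({})` because `Collection.count()` was removed in PyMongo 4.

```python
def open_db(host="localhost", port=27017, name="zhihu"):
    """Return a handle to the zhihu database (assumes pymongo is installed)."""
    # Imported lazily so the counting helper can be used without pymongo.
    from pymongo import MongoClient

    return MongoClient(host, port)[name]


def count_github_users(db):
    """Return the number of documents in the gh_user collection.

    `db` is expected to behave like a pymongo Database: indexing by
    collection name yields an object with a count_documents() method.
    """
    return db["gh_user"].count_documents({})
```

With MongoDB running locally, `count_github_users(open_db())` would return the same number as `db.gh_user.count()` in the shell session above.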

