
Project description

kcrawler


A python crawler authored by Ken.

1. Installation

1.1 Requirements

  • python>=3.0
  • pip>=19.0
python -V
pip install --upgrade pip
pip -V

1.2 Check the latest version

pip search kcrawler

1.3 First-time installation

pip install kcrawler
# or
pip install --index-url https://pypi.org/simple kcrawler

1.4 Upgrade an existing installation

pip install --upgrade kcrawler
# or
pip install --upgrade --index-url https://pypi.org/simple kcrawler

1.5 Uninstall

pip uninstall -y kcrawler

2. Command-line usage

2.1 Invocation

After a successful pip installation, executables are automatically created on the system search path: kcrawler, kcanjuke, kcjuejin.

They usually end up in the bin subdirectory of the Python or conda installation directory, e.g. /anaconda3/bin/kcrawler. On Windows, .exe files are created instead.
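
A quick way to confirm the executables landed on the search path (standard shell lookups, nothing kcrawler-specific):

which kcrawler
which kcanjuke
which kcjuejin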

kcrawler is the entry point for crawling all supported web apps. The command format is:

kcrawler <webapp> [webapp-data] [--options]

which is equivalent to:

kc<webapp> [webapp-data] [--options]

For example:

kcrawler juejin books --url "https://..."
kcjuejin books --url "https://..."

2.2 Examples

The examples below run in the kcrawler <webapp> [webapp-data] [--options] form.

2.2.1 Crawl Juejin book data

Run the following command:

kcrawler juejin book

When the command succeeds, a statistics chart is displayed.

The detailed data is also saved in the current directory, as both .csv and .xls files, with names of the form:

juejin_books_YYYY-MM-DD.csv
juejin_books_YYYY-MM-DD.xls
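
To work with the saved details afterwards, the files can be loaded back with pandas; a minimal sketch (the date placeholder stands in for the actual date in your file name):

import pandas as pd

# Load the book details saved by the crawler; substitute the
# date placeholder with the date in your generated file name.
df = pd.read_csv("juejin_books_YYYY-MM-DD.csv")
print(df.head())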

2.2.2 Crawl Juejin post view counts

Format:

kcrawler juejin post --name <username> --limit 100 --url '<user_post_url>'
  • name: a name for the target user; it can be anything and is used only to distinguish users and as the name of the folder where the crawled data is saved
  • limit: the maximum number of recent posts to crawl
  • url: the target user's API address; this parameter actually determines whose posts are crawled

How to obtain the url is shown in a screenshot in the original post.

For a quick trial of the crawler, the url may also be omitted, in which case the posts of user ken are crawled:

kcrawler juejin post --name ken --limit 100

The detailed crawl data is saved under the ken directory, in .csv and .xls files named with the crawl date and time.

2.2.3 Crawl Anjuke community housing prices for a given city

First, you need to obtain the site cookie. For how to do this, see section 2.4 of 《python 自动抓取分析房价数据——安居客版》.

Replace <anjuke_cookie> with your own cookie and run the following command:

kcrawler anjuke --city shenzhen --limit 50 --cookie "<anjuke_cookie>"

Alternatively, save the cookie in a file named anjuke_cookie (no extension) in the current directory, then run:

kcrawler anjuke --city shenzhen --limit 50
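
For reference, the cookie file itself can be created straight from the shell; a sketch in which <anjuke_cookie> remains a placeholder for your real cookie string:

echo '<anjuke_cookie>' > anjuke_cookie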

After the command completes successfully, it displays the average, maximum, and minimum prices and plots a histogram of the price distribution. Once the histogram window is closed, the detailed data is saved in the current directory in a file such as anjuke_shenzhen_community_price_20xx-xx-xx.csv.
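
The same summary statistics can be recomputed later from the saved file. A minimal pandas/matplotlib sketch; the price column name is an assumption, so check the actual CSV header:

import pandas as pd
import matplotlib.pyplot as plt

# Load the saved details; the date placeholder matches the
# file-name pattern shown above.
df = pd.read_csv("anjuke_shenzhen_community_price_20xx-xx-xx.csv")

# "price" is an assumed column name; inspect df.columns for the real one.
print(df["price"].mean(), df["price"].max(), df["price"].min())

# Histogram of the price distribution, as the command itself plots.
df["price"].hist(bins=50)
plt.xlabel("price")
plt.ylabel("count")
plt.show()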

To crawl prices for other cities, simply set the city parameter to the pinyin name of any city Anjuke covers. Open https://www.anjuke.com/sy-city.html, click the desired city, and copy the city's second-level domain from the browser address bar; for example, for beijing.anjuke.com, use just beijing as the city parameter.
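
The city parameter is just the first label of that second-level domain; a minimal sketch that extracts it:

from urllib.parse import urlparse

# URL copied from the browser address bar after picking a city:
url = "https://beijing.anjuke.com"
city = urlparse(url).hostname.split(".")[0]
print(city)  # beijing -> run as: kcrawler anjuke --city beijing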

3. Importing the Python module

3.1 Boss API

from kcrawler import Boss

boss = Boss()

# Dictionary-style lookups (purposes inferred from the method names):
boss_positions = boss.position()    # position categories
boss_cities = boss.city()           # city list
boss_hotcities = boss.hotcity()     # popular cities
boss_industries = boss.industry()   # industries
boss_user_city = boss.userCity()    # the current user's city
boss_expects = boss.expect()        # the current user's job expectations

# Job queries:
jobs = boss.job(0, 1)
tencent_jobs = boss.queryjob(query='腾讯', city=101280600, industry=None, position=101301)
tencent_jobs = boss.queryjobpage(query='腾讯', city=101280600, industry=None, position=101301, page=2)

# Detail card for a single job posting:
jobcard = boss.jobcard('3c2016bbf8413f3b1XR63t-1FVI~', '505ee74b-504b-4aea-921c-a3dc2016be80.f1:common-155-GroupA--157-GroupA.15')
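
What to do with the returned data is left to the caller. A sketch that tabulates and saves a query result, assuming the query methods return a list of per-job dicts (verify against the kcrawler source):

import pandas as pd

# Assumption: queryjob returns an iterable of per-job records.
tencent_jobs = boss.queryjob(query='腾讯', city=101280600, industry=None, position=101301)

df = pd.DataFrame(tencent_jobs)                  # tabulate the records
df.to_csv("boss_tencent_jobs.csv", index=False)  # hypothetical output file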

Release history

https://pypi.org/project/kcrawler/#history

License

MIT

Copyright (c) 2019 kenblikylee

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

kcrawler-1.1.tar.gz (12.9 kB)

Uploaded Source

Built Distribution

kcrawler-1.1-py3-none-any.whl (17.4 kB)

Uploaded Python 3

File details

Details for the file kcrawler-1.1.tar.gz.

File metadata

  • Download URL: kcrawler-1.1.tar.gz
  • Upload date:
  • Size: 12.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/2.0.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.6.0 requests-toolbelt/0.9.1 tqdm/4.37.0 CPython/3.6.0

File hashes

Hashes for kcrawler-1.1.tar.gz
Algorithm    Hash digest
SHA256       f569987ce1e16f5e2a2f65a27ae84cf6183ac09e9869a2d67d75741a659247ed
MD5          6888a1c1c75c78d42e2f4fabd1209eb4
BLAKE2b-256  07c7c5c0c2fe2232203e3b3bd60029f03b4b135c77dc5c0b80dd57b464b67ea8

See the pip documentation for more details on using hashes.

File details

Details for the file kcrawler-1.1-py3-none-any.whl.

File metadata

  • Download URL: kcrawler-1.1-py3-none-any.whl
  • Upload date:
  • Size: 17.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/2.0.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.6.0 requests-toolbelt/0.9.1 tqdm/4.37.0 CPython/3.6.0

File hashes

Hashes for kcrawler-1.1-py3-none-any.whl
Algorithm    Hash digest
SHA256       5f07579d961d0c5dd3abe981f90ea8a0ffc50c3e77c8c5a256b928263de754fd
MD5          8dd316183db75805eed3e0ffd2938fbb
BLAKE2b-256  9bd137060caa40c4d78fd0ea0fbe8c680a73a6f131186ff318b7839a4407a284

See the pip documentation for more details on using hashes.
