A spider bot (crawler) written in Python, using Selenium and ChromeDriver.

spiderbot

A web crawler bot.

Note: this repo does not ship working XPath expressions. The xpaths in the configuration file are examples only.

How to deploy

1. Clone the source

git clone https://github.com/liujuanjuan1984/spiderbot.git
cd spiderbot

2. Install dependencies

pip install spiderbot
pip install selenium

Install the chromedriver release that matches your installed Chrome version, and place the executable in a directory on the system PATH.
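
Before running the bot, it can help to confirm that the driver is actually reachable. The following is a minimal sketch using only the standard library; `find_chromedriver` is a hypothetical helper, not part of spiderbot, and it only checks that an executable is on PATH, not that its version matches Chrome.

```python
import shutil


def find_chromedriver(name="chromedriver"):
    """Return the full path of the named executable on PATH, or None if absent."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_chromedriver()
    if path is None:
        print("chromedriver not found on PATH")
    else:
        print(f"chromedriver found at {path}")
```

If the lookup returns None, either the driver is not installed or its directory has not been added to PATH.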

3. Edit the configuration

Create config_private.py, using config_private_sample.py as a reference, and update the relevant fields.
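
One way to start is to copy the sample file and then edit the copy by hand. This is a sketch under the assumption that both files live in the current directory; `ensure_private_config` is a hypothetical helper, not something spiderbot provides, and it never overwrites an existing config_private.py.

```python
import shutil
from pathlib import Path


def ensure_private_config(sample="config_private_sample.py", target="config_private.py"):
    """Copy the sample config to the private config unless the target exists.

    Returns True if a new file was created, False if the target was already there.
    """
    sample_path, target_path = Path(sample), Path(target)
    if target_path.exists():
        return False  # never clobber a config that was already edited
    shutil.copyfile(sample_path, target_path)
    return True
```

The copied file still contains the sample values, so the fields must be updated afterwards.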

4. How to run

4.1 The first time you initialize the bot, pass init=True to generate the database. On success, a spiderbot.db file is created in the current directory.

from spiderbot import SpiderBot

bot = SpiderBot(init=True)
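
To avoid deciding by hand whether this is the first run, the presence of the database file can be checked first. A minimal sketch: `needs_init` is a hypothetical helper, and the commented usage assumes that SpiderBot also accepts init=False on later runs, which the source does not state.

```python
from pathlib import Path


def needs_init(db_path="spiderbot.db"):
    """Return True when the database file does not exist yet."""
    return not Path(db_path).is_file()


# Hypothetical usage (assumes SpiderBot accepts init=False on later runs):
# from spiderbot import SpiderBot
# bot = SpiderBot(init=needs_init())
```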

4.2 Add users. Pass True if you are sure these users should be crawled, or None if that is still to be confirmed.

urls = ["https://example.com/user_a_homepage", "https://example.com/user_b_homepage"]

bot.add_users(*urls, working_status=True)
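
The call above unpacks the URL list into positional arguments and passes working_status by keyword. The sketch below illustrates that calling pattern with a stand-in function; its signature is an assumption made for illustration and is not taken from the spiderbot source.

```python
def add_users(*urls, working_status=None):
    """Hypothetical stand-in, NOT the real SpiderBot.add_users implementation."""
    return [(url, working_status) for url in urls]


urls = ["https://example.com/user_a_homepage", "https://example.com/user_b_homepage"]

confirmed = add_users(*urls, working_status=True)  # users that should be crawled
pending = add_users(*urls, working_status=None)    # users awaiting confirmation
same = add_users(working_status=True, *urls)       # binds identically to `confirmed`
```

Python binds unpacked positional arguments before keyword arguments regardless of the order in which they are written, so the two spellings are equivalent. Placing `*urls` first is the more conventional form.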

4.3 Crawl content as needed

bot.get_profiles()
bot.get_new_posturls()
bot.get_history_posturls(1, 9)
bot.get_posts()

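Historical content and profiles only need to be crawled once. If anything was missed, they can be crawled again.

The latest content needs to be crawled continuously.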
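
Continuous crawling can be approximated with a simple polling loop. This is a sketch, not something spiderbot ships: `crawl_repeatedly` and its parameters are hypothetical, and the commented usage assumes that calling get_new_posturls() followed by get_posts() is the right sequence for new content.

```python
import time


def crawl_repeatedly(crawl_once, rounds, interval_seconds=0.0, sleep=time.sleep):
    """Call crawl_once() `rounds` times, pausing between rounds.

    Exceptions raised by a round are collected and returned, so one
    failed round does not prevent the later ones from running.
    """
    errors = []
    for round_index in range(rounds):
        try:
            crawl_once()
        except Exception as exc:  # record the failure and keep going
            errors.append((round_index, exc))
        if round_index < rounds - 1:
            sleep(interval_seconds)  # no pause after the final round
    return errors


# Hypothetical usage with an existing bot instance:
# def crawl_once():
#     bot.get_new_posturls()
#     bot.get_posts()
# crawl_repeatedly(crawl_once, rounds=24, interval_seconds=3600)
```

For long-running deployments, a system scheduler such as cron may be a better fit than an in-process loop.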

Code formatting and linting

isort .
black .
pylint spiderbot > pylint_spiderbot.log
