A spider bot (crawler) written in Python, using Selenium and ChromeDriver.

Project description

spiderbot

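A web crawler bot.

Note: this repo does not ship working XPath expressions. The xpaths currently in the configuration file are examples only.

How to deploy

1. Clone the source code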

git clone https://github.com/liujuanjuan1984/spiderbot.git
cd spiderbot

2. Install dependencies

pip install spiderbot
pip install selenium

Install the chromedriver release that matches your installed Chrome version, and place the executable in a directory on the system PATH.
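A quick way to confirm the PATH requirement is met is to look the executable up from Python. This is a minimal sketch using only the standard library; the helper name `find_chromedriver` is hypothetical and not part of spiderbot. It only checks that the executable can be found, not that its version matches Chrome.

```python
import shutil


def find_chromedriver(name="chromedriver"):
    """Return the full path of the named executable on PATH, or None if absent.

    Hypothetical helper for illustration; not part of the spiderbot package.
    """
    return shutil.which(name)


if __name__ == "__main__":
    path = find_chromedriver()
    if path is None:
        print("chromedriver was not found on PATH")
    else:
        print(f"chromedriver found at {path}")
```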

3. Edit the configuration

Create a config_private.py file based on config_private_sample.py and update the relevant fields.
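The copy step can be scripted so that an existing config_private.py is never overwritten. The sketch below assumes both files live in the same directory; the function name `ensure_private_config` is hypothetical. The fields inside the file still need to be edited by hand afterwards.

```python
import shutil
from pathlib import Path


def ensure_private_config(directory=".",
                          sample="config_private_sample.py",
                          target="config_private.py"):
    """Copy the sample config to the private config unless one already exists.

    Hypothetical helper for illustration. Returns True if a new file was
    created, False if an existing private config was left untouched.
    """
    base = Path(directory)
    target_path = base / target
    if target_path.exists():
        return False  # keep the user's existing settings
    shutil.copyfile(base / sample, target_path)
    return True
```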

4. How to run

4.1 The first time you initialize the bot, pass init=True to generate the database. On success, a spiderbot.db file is created in the current directory.

from spiderbot import SpiderBot

bot = SpiderBot(init=True)
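Since init=True is only needed on the first run, one option is to derive the flag from whether spiderbot.db already exists. This is a sketch under assumptions: the helper `needs_init` is hypothetical, and it assumes that SpiderBot accepts init=False on later runs, which the text above does not state explicitly.

```python
from pathlib import Path


def needs_init(db_path="spiderbot.db"):
    """Return True when the database file does not exist yet.

    Hypothetical helper for illustration; not part of the spiderbot package.
    """
    return not Path(db_path).exists()


# Usage, left as comments so the sketch runs without spiderbot installed.
# Assumption: SpiderBot accepts init=False when the database already exists.
# from spiderbot import SpiderBot
# bot = SpiderBot(init=needs_init())
```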

4.2 Add users. Pass True as working_status if you are sure you want to crawl these users, or None if that is still to be confirmed.

urls = ["https://example.com/user_a_homepage", "https://example.com/user_b_homepage"]

bot.add_users(*urls, working_status=True)
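Before handing a list of homepages to add_users, it may help to drop malformed or duplicated entries. The sketch below is an optional pre-processing step, not something spiderbot is known to require; `clean_urls` is a hypothetical helper built on the standard library.

```python
from urllib.parse import urlparse


def clean_urls(urls):
    """Keep http(s) URLs that have a host, dropping duplicates in order.

    Hypothetical helper for illustration; not part of the spiderbot package.
    """
    seen = set()
    result = []
    for url in urls:
        url = url.strip()
        parts = urlparse(url)
        if parts.scheme in ("http", "https") and parts.netloc and url not in seen:
            seen.add(url)
            result.append(url)
    return result
```

The cleaned list can then be unpacked into the call in the same way as the urls list above.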

4.3 Crawl content as needed

bot.get_profiles()
bot.get_new_posturls()
bot.get_history_posturls(1, 9)
bot.get_posts()

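Historical content and profiles only need to be crawled once. If anything was missed, they can be crawled again.

The latest content needs to be crawled continuously.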
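Continuous crawling of the latest content could be driven by a simple polling loop. The sketch below is one possible arrangement: the function `crawl_new_content`, the one-hour default interval, and the fixed number of rounds are all assumptions. It relies only on the get_new_posturls and get_posts methods shown above.

```python
import time


def crawl_new_content(bot, rounds, interval_seconds=3600, sleep=time.sleep):
    """Collect new post URLs and fetch posts `rounds` times, pausing in between.

    Hypothetical helper for illustration. `bot` is any object exposing
    get_new_posturls() and get_posts(); `sleep` is injectable for testing.
    """
    for i in range(rounds):
        bot.get_new_posturls()
        bot.get_posts()
        if i < rounds - 1:
            sleep(interval_seconds)  # no pause after the final round
```

For an unattended deployment, a scheduler such as cron calling a single round may be a sturdier choice than a long-running loop.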

Code formatting and linting

isort .
black .
pylint spiderbot > pylint_spiderbot.log

Release history

0.3.3 (Sep 27, 2022)
0.3.2 (Sep 27, 2022)
0.3.1 (Sep 25, 2022)
0.3.0 (Sep 24, 2022)
0.2.4 (Sep 24, 2022)
0.2.3 (Sep 24, 2022)
0.2.2 (Sep 24, 2022)
0.2.1 (Sep 24, 2022)
0.2.0 (Sep 23, 2022)
0.1.1 (Sep 23, 2022), this version
0.1.0 (Sep 23, 2022)

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

spiderbot-0.1.1.tar.gz (21.4 kB)

Uploaded Sep 23, 2022 Source

Built Distribution

spiderbot-0.1.1-py3-none-any.whl (22.6 kB)

Uploaded Sep 23, 2022 Python 3

Hashes for spiderbot-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`46ed016aea1d1b3428510b5294a956837d52efd5f8575c19b3667e76167fabd0`
MD5	`862e2d25799629bb6f8f1ca1f455d7eb`
BLAKE2b-256	`88118387517116530f10a5590ad8ceff84103d04e7394e28c5749b4e78e19214`

Hashes for spiderbot-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`636a69fff037d98d0ec4f9350682533f7ed98c092ecd53f1f4a8ff3c9ace5aca`
MD5	`19fe2c6a1a236daa31dcb736f498a78c`
BLAKE2b-256	`c437200195c255e6850627d259a2a47b60a2c0c94fe73df3254d07281a640b22`