SSDB-based components for Scrapy.
scrapy-ssdb-spider
- Modeled closely on scrapy-redis
- A distributed crawling solution for Scrapy backed by SSDB queues
Dependencies
- Python 3.6 (test environment)
- SSDB 1.9.7
- scrapy
- pyssdb
Usage
shell:
git clone https://github.com/PickledFish/scrapy-ssdb-spider
cd scrapy-ssdb-spider
python3 setup.py install
or
pip install scrapy-ssdb-spider
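A quick way to confirm the installation is to import the package (module name scrapy_ssdb_spider, matching the class paths used in the settings below) together with its pyssdb dependency:
python3 -c "import scrapy_ssdb_spider, pyssdb"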
In your Scrapy project:
# settings.py
# SSDB server
SSDB_HOST = '127.0.0.1'
SSDB_PORT = 8888
# SSDB password (optional)
#SSDB_PWD = 'your password'
# Scheduler
SCHEDULER = 'scrapy_ssdb_spider.scheduler.Scheduler'
# Duplicate filter
DUPEFILTER_CLASS = 'scrapy_ssdb_spider.dupefilter.SSDBDupeFilter'
# Scheduler queue key (optional)
#SCHEDULER_QUEUE_KEY = ''
# Scheduler queue class (optional)
#SCHEDULER_QUEUE_CLASS = ''
# Dupefilter key (optional)
#SCHEDULER_DUPEFILTER_KEY = ''
# The next two settings: if spider A starts first and spider B starts half an hour
# later, does the queue get cleared? I am not sure; scrapy-redis has this feature,
# so it is offered here as well. By default the queues are NOT cleared.
# Clear the dupefilter and scheduler queues before the spider starts (boolean)
#SCHEDULER_OPEN_CLEAR_QUEUE = False
# Clear the dupefilter and scheduler queues after the spider finishes (boolean)
#SCHEDULER_CLOSE_CLEAR_QUEUE = False
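Before pointing the scheduler and dupefilter at SSDB, it is worth checking that the server is reachable with the same host, port, and (if set) password. A minimal check using the pyssdb dependency, assuming the values above:

import pyssdb

# Same values as SSDB_HOST / SSDB_PORT in settings.py
client = pyssdb.Client(host='127.0.0.1', port=8888)

# If SSDB_PWD is configured, authenticate first with SSDB's auth command
# client.auth('your password')

# 'info' returns server statistics; any reply means the connection works
print(client.info())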
# Writing the spider
from scrapy_ssdb_spider.spiders import SsdbSpider

class TestSpider(SsdbSpider):
    name = 'test'
    # key of the seed (start URL) queue in SSDB
    ssdb_key = 'start_key'

    def parse(self, response):
        pass
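Like a scrapy-redis spider, the class above defines no start_urls; it waits for seeds in the SSDB queue named by ssdb_key. Assuming the scheduler pops plain URLs from that queue (the scrapy-redis convention; the exact format expected by scrapy-ssdb-spider may differ), seeding it from a separate script with pyssdb could look like this:

import pyssdb

client = pyssdb.Client(host='127.0.0.1', port=8888)

# Push one seed URL onto the queue the spider reads from (ssdb_key = 'start_key')
client.qpush('start_key', 'https://example.com/')

Once the queue holds a URL, start the crawl as usual with scrapy crawl test.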
- Everything is very similar to scrapy-redis, even the code.
- I am sure you will have no trouble with it; suggestions are welcome.
Differences
Although the code was written with scrapy-redis as the reference, some features are not implemented:
- An SSDB-based item Pipeline is not provided (a hand-rolled sketch is given below)
- There is no setting to clear the queues when the spider starts or finishes (I forgot to add it)
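Since the package ships no SSDB pipeline, one can be written by hand. The class below is a hypothetical sketch (not part of scrapy-ssdb-spider): it reuses the SSDB_HOST / SSDB_PORT settings and appends JSON-serialized items to a per-spider SSDB queue via pyssdb.

import json

import pyssdb

class SsdbItemPipeline:
    """Hypothetical pipeline: push serialized items onto an SSDB queue."""

    def open_spider(self, spider):
        # Reuse the same SSDB connection settings as the scheduler/dupefilter
        settings = spider.crawler.settings
        self.client = pyssdb.Client(
            host=settings.get('SSDB_HOST', '127.0.0.1'),
            port=settings.getint('SSDB_PORT', 8888),
        )
        self.key = '%s:items' % spider.name

    def process_item(self, item, spider):
        # Serialize the item and append it to the spider's item queue
        self.client.qpush(self.key, json.dumps(dict(item)))
        return item

Enable it like any other Scrapy pipeline, e.g. ITEM_PIPELINES = {'myproject.pipelines.SsdbItemPipeline': 300}, where the module path is a placeholder for wherever the class lives in your project.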