SSDB-based components for Scrapy.
scrapy-ssdb-spider
- A project modeled closely on scrapy-redis
- A distributed crawling solution for Scrapy built on SSDB queues
Requirements
- Python 3.6 (the tested environment)
- SSDB 1.9.7
- scrapy
- pyssdb
Usage
shell:
git clone https://github.com/PickledFish/scrapy-ssdb-spider
cd scrapy-ssdb-spider
python3 setup.py install
or
pip install scrapy-ssdb-spider
In your Scrapy project:
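# settings
# SSDB server
SSDB_HOST = '127.0.0.1'
SSDB_PORT = 8888
# SSDB password (optional)
#SSDB_PWD = 'your password'
# Scheduler
SCHEDULER = 'scrapy_ssdb_spider.scheduler.Scheduler'
# Duplicate filter class
DUPEFILTER_CLASS = 'scrapy_ssdb_spider.dupefilter.SSDBDupeFilter'
# Key of the scheduler queue (optional)
#SCHEDULER_QUEUE_KEY = ''
# Scheduler queue class (optional)
#SCHEDULER_QUEUE_CLASS = ''
# Key of the duplicate filter set
#SCHEDULER_DUPEFILTER_KEY = ''
# About the two settings below: if spider A is started first and spider B
# half an hour later, does B wipe the queue that A is still using? I do not
# fully understand the use case, but scrapy-redis has this feature, so it is
# provided here as well. By default the queues are not cleared.
# Clear the dupefilter and scheduler queues before the spider starts (boolean)
#SCHEDULER_OPEN_CLEAR_QUEUE = False
# Clear the dupefilter and scheduler queues after the spider finishes (boolean)
#SCHEDULER_CLOSE_CLEAR_QUEUE = False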
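The question raised in the comments about the two clear-queue settings can be illustrated with a small in-memory model. This is only a sketch under stated assumptions, not the library's scheduler: `InMemoryStore`, `SketchScheduler`, the `demo:*` keys and the `clear_on_open` / `clear_on_close` parameters are hypothetical names standing in for an SSDB server, the real scheduler and the two settings. It suggests why clearing on open is risky when several spiders share the same keys.

```python
class InMemoryStore:
    """Hypothetical stand-in for an SSDB server: each key maps to a list."""

    def __init__(self):
        self.queues = {}

    def push(self, key, value):
        self.queues.setdefault(key, []).append(value)

    def size(self, key):
        return len(self.queues.get(key, []))

    def clear(self, key):
        self.queues.pop(key, None)


class SketchScheduler:
    """Illustrative scheduler honouring two clear flags (not the library's code)."""

    def __init__(self, store, queue_key, dupefilter_key,
                 clear_on_open=False, clear_on_close=False):
        self.store = store
        self.queue_key = queue_key
        self.dupefilter_key = dupefilter_key
        self.clear_on_open = clear_on_open
        self.clear_on_close = clear_on_close

    def _clear(self):
        # Drop both the pending requests and the seen-request fingerprints.
        self.store.clear(self.queue_key)
        self.store.clear(self.dupefilter_key)

    def open(self):
        if self.clear_on_open:
            self._clear()

    def close(self):
        if self.clear_on_close:
            self._clear()


def pending_after_second_spider_opens(clear_on_open):
    """Spider A queues two requests, then spider B opens on the same keys."""
    store = InMemoryStore()
    spider_a = SketchScheduler(store, "demo:requests", "demo:dupefilter")
    spider_a.open()
    store.push("demo:requests", "https://example.com/page-1")
    store.push("demo:requests", "https://example.com/page-2")

    spider_b = SketchScheduler(store, "demo:requests", "demo:dupefilter",
                               clear_on_open=clear_on_open)
    spider_b.open()
    return store.size("demo:requests")


print(pending_after_second_spider_opens(False))  # 2: A's requests survive
print(pending_after_second_spider_opens(True))   # 0: B wiped A's requests
```

In this model, leaving both flags off (the default described above) is the safe choice whenever more than one spider process shares the same queue keys.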
# Write the spider
from scrapy_ssdb_spider.spiders import SsdbSpider


class TestSpider(SsdbSpider):
    # Scrapy requires every spider to have a name
    name = 'test_spider'
    # Key of the seed queue
    ssdb_key = 'start_key'

    def parse(self, response):
        pass
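A spider that reads from a seed queue does nothing until something pushes URLs under `ssdb_key`. The sketch below shows one way that seeding step might look. It assumes the spider consumes plain URL strings from an SSDB queue, as scrapy-redis does with Redis lists; the README does not confirm this, nor which end of the queue is popped. `qpush_back` and `qsize` are SSDB queue commands, while `FakeSsdbClient` and `seed_urls` are hypothetical helpers so the example runs without a server.

```python
class FakeSsdbClient:
    """Hypothetical in-memory client exposing only the two queue calls used here."""

    def __init__(self):
        self.queues = {}

    def qpush_back(self, name, item):
        # Append to the tail of the named queue and return the new length.
        self.queues.setdefault(name, []).append(item)
        return len(self.queues[name])

    def qsize(self, name):
        return len(self.queues.get(name, []))


def seed_urls(client, key, urls):
    """Push each non-blank URL onto the queue `key`; return how many were pushed."""
    pushed = 0
    for url in urls:
        url = url.strip()
        if not url:
            continue  # skip blank lines, e.g. from a seed file
        client.qpush_back(key, url)
        pushed += 1
    return pushed


client = FakeSsdbClient()
count = seed_urls(
    client,
    "start_key",  # must match the spider's ssdb_key
    ["https://example.com/", "   ", "https://example.com/about"],
)
print(count, client.qsize("start_key"))  # 2 2
```

Against a real server, an object with the same `qpush_back` method would be passed instead of `FakeSsdbClient`. A pyssdb client connected to `SSDB_HOST` and `SSDB_PORT` is the likely candidate, though that is an assumption to check against the pyssdb documentation.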
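- Everything closely resembles scrapy_redis, down to the code itself
- Anyone familiar with scrapy_redis should have no trouble; feedback is welcome
Differences
Although the code follows scrapy-redis, some features are not implemented:
- An SSDB-based Pipeline is not implemented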
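- (Resolved) There was no setting for clearing the queues when a spider starts or finishes; it had been forgotten. See SCHEDULER_OPEN_CLEAR_QUEUE and SCHEDULER_CLOSE_CLEAR_QUEUE.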
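Because an SSDB-based Pipeline is not provided, a project that wants scraped items stored in SSDB has to supply its own. The following is a minimal sketch of what such a pipeline could look like, not part of this package: `SsdbItemPipeline`, `FakeSsdbClient` and the `<spider name>:items` key layout are hypothetical. Only the `process_item(self, item, spider)` method follows Scrapy's item pipeline interface.

```python
import json
from types import SimpleNamespace


class FakeSsdbClient:
    """Hypothetical in-memory client exposing only qpush_back."""

    def __init__(self):
        self.queues = {}

    def qpush_back(self, name, item):
        self.queues.setdefault(name, []).append(item)
        return len(self.queues[name])


class SsdbItemPipeline:
    """Hypothetical pipeline: serialise each item to JSON and append it to a queue."""

    def __init__(self, client, key_template="%(spider)s:items"):
        self.client = client
        self.key_template = key_template

    def process_item(self, item, spider):
        # One queue per spider, named after the spider (an assumed layout).
        key = self.key_template % {"spider": spider.name}
        self.client.qpush_back(key, json.dumps(dict(item), ensure_ascii=False))
        # Return the item so later pipelines still receive it.
        return item


client = FakeSsdbClient()
pipeline = SsdbItemPipeline(client)
spider = SimpleNamespace(name="test")  # stand-in for a running spider
pipeline.process_item({"title": "hello"}, spider)
print(client.queues["test:items"])  # ['{"title": "hello"}']
```

In a real project the client would be created from `SSDB_HOST` and `SSDB_PORT` (for example in a `from_crawler` class method) and the class registered under `ITEM_PIPELINES`; both steps are left out here to keep the sketch free of external dependencies.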
File details
Details for the file scrapy_ssdb_spider-0.1.1.tar.gz.
File metadata
- Download URL: scrapy_ssdb_spider-0.1.1.tar.gz
- Upload date:
- Size: 7.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.31.1 CPython/3.7.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | be4c0a5bdde5b7c562b4a22de8c767f5494e66d3bc4ececc7dc8cc4e09b9cc26 |
| MD5 | 490aced4a1d40fe499c6c6575b8ac656 |
| BLAKE2b-256 | 83da93c5ba5a3f22ffc53a57fc66611b3e4ebabb7212874d0ee10d57f649bac9 |
File details
Details for the file scrapy_ssdb_spider-0.1.1-py3-none-any.whl.
File metadata
- Download URL: scrapy_ssdb_spider-0.1.1-py3-none-any.whl
- Upload date:
- Size: 8.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.31.1 CPython/3.7.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 05e060876bcdf95a7fefca1a37d359e61c1cc37a0b49f5c232e84d24b03d3065 |
| MD5 | c30b7375489cd02e5b93e9af84b7b683 |
| BLAKE2b-256 | 2713fa2751c2d40f59320086bb5fb015c5535ab5e32f34b8e8439f79fd534d55 |