
SSDB-based components for Scrapy.

Project description

scrapy-ssdb-spider

  • Written by closely imitating scrapy-redis
  • A distributed crawling solution for Scrapy built on SSDB queues

Dependencies

  • Python 3.6 (test environment)
  • SSDB 1.9.7
  • scrapy
  • pyssdb

Usage

shell:

git clone https://github.com/PickledFish/scrapy-ssdb-spider
cd scrapy-ssdb-spider
python3 setup.py install

or

pip install scrapy-ssdb-spider

In your Scrapy project:

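# settings
# SSDB server
SSDB_HOST = '127.0.0.1'
SSDB_PORT = 8888
# SSDB password (optional)
#SSDB_PWD = 'your password'
# Scheduler
SCHEDULER = 'scrapy_ssdb_spider.scheduler.Scheduler'
# Duplicate filter class
DUPEFILTER_CLASS = 'scrapy_ssdb_spider.dupefilter.SSDBDupeFilter'
# Scheduler queue key (optional)
#SCHEDULER_QUEUE_KEY = ''
# Scheduler queue class (optional)
#SCHEDULER_QUEUE_CLASS = ''
# Duplicate filter key
#SCHEDULER_DUPEFILTER_KEY = ''

# About the two settings below: if spider A is started first and spider B is
# started half an hour later, would the queues get cleared? I do not fully
# understand the use case, but scrapy-redis has this feature, so it is provided
# here as well. By default the queues are not cleared.
# Clear the duplicate filter and scheduler queues before the spider starts (boolean)
#SCHEDULER_OPEN_CLEAR_QUEUE = 
# Clear the duplicate filter and scheduler queues after the spider finishes (boolean)
#SCHEDULER_CLOSE_CLEAR_QUEUE = 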
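# Write the spider
from scrapy_ssdb_spider.spiders import SsdbSpider

class TestSpider(SsdbSpider):
    # Scrapy requires every spider to have a name ('test' is a placeholder)
    name = 'test'
    # Key of the seed queue
    ssdb_key = 'start_key'

    def parse(self, response):
        pass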
  • Everything closely resembles scrapy_redis, the code included
  • You should have no trouble getting started; feedback is welcome
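
The spider above takes its seeds from the key named by ssdb_key, so something has to put URLs there before the crawl can start. The sketch below shows one way that could look. It assumes the seeds live in an SSDB queue (list) and are added with the SSDB qpush_back command, mirroring how scrapy-redis reads start URLs from a Redis list; the source does not confirm this. push_seeds and InMemoryQueueClient are hypothetical names, and the in-memory client only stands in for a real one so the example runs without a server.

```python
class InMemoryQueueClient:
    """Hypothetical stand-in for an SSDB client, so the sketch runs without a server."""

    def __init__(self):
        self.queues = {}

    def qpush_back(self, key, value):
        # Mimics the SSDB `qpush_back` command: append to the queue, return its new length.
        self.queues.setdefault(key, []).append(value)
        return len(self.queues[key])


def push_seeds(client, key, urls):
    """Push each seed URL onto the queue `key` and return how many were pushed."""
    count = 0
    for url in urls:
        client.qpush_back(key, url)
        count += 1
    return count


# 'start_key' matches the ssdb_key of the spider above; the URLs are placeholders.
client = InMemoryQueueClient()
pushed = push_seeds(client, 'start_key', ['https://example.com/a', 'https://example.com/b'])
print(pushed)
```

Against a real server, the stand-in would presumably be replaced by a pyssdb client connected to SSDB_HOST and SSDB_PORT, provided it exposes qpush_back as a method. If the library stores seeds in a different structure, the push command would need to change accordingly.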

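Differences

Although the code is modeled on scrapy-redis, some features are not implemented:

  • No SSDB-based Pipeline
  • No settings for clearing the queues when a spider starts or finishes
  • Others that I have forgotten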
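
Since an SSDB-based Pipeline is not provided, a project that wants scraped items stored in SSDB would have to supply its own. The following is a minimal sketch of what such a pipeline might look like, not part of scrapy-ssdb-spider. SsdbItemPipeline, ListClient, and the '<spider name>:items' key layout are all hypothetical; the only Scrapy convention relied on is that a pipeline exposes process_item(item, spider) and returns the item.

```python
import json
from types import SimpleNamespace


class ListClient:
    """Hypothetical stand-in for an SSDB client, so the sketch runs without a server."""

    def __init__(self):
        self.queues = {}

    def qpush_back(self, key, value):
        # Mimics the SSDB `qpush_back` command: append to the queue, return its new length.
        self.queues.setdefault(key, []).append(value)
        return len(self.queues[key])


class SsdbItemPipeline:
    """Hypothetical item pipeline that serializes each item to JSON and queues it."""

    def __init__(self, client, key_template='{spider}:items'):
        self.client = client
        self.key_template = key_template  # assumed key layout, not defined by the library

    def process_item(self, item, spider):
        key = self.key_template.format(spider=spider.name)
        self.client.qpush_back(key, json.dumps(dict(item), ensure_ascii=False))
        # Return the item so any later pipelines still receive it.
        return item


# Exercise the pipeline with a fake spider object and a plain dict as the item.
client = ListClient()
pipeline = SsdbItemPipeline(client)
result = pipeline.process_item({'title': 'hello'}, SimpleNamespace(name='test'))
print(client.queues)
```

In a real project the client would be created from SSDB_HOST and SSDB_PORT (for example in a from_crawler class method) and the class would be registered under ITEM_PIPELINES in the settings.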
