Redis Cluster for Scrapy.

scrapy-redis, cluster edition

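This project is a modified version of the original scrapy-redis project. The changes are:

  1. Added support for Redis Sentinel connections
  2. Added support for Redis Cluster connections
  3. Added Bloom filter deduplication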

Installation

pip install scrapy-redis-sentinel --user

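Configuration example

All settings from the original scrapy-redis are supported. Connection mode priority: Sentinel mode > Cluster mode > standalone mode.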

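# ----------------------------------------Bloomfilter settings-------------------------------------
# Number of hash functions to use; default is 6
BLOOMFILTER_HASH_NUMBER = 6

# Size of the Bloom filter bitmap in Redis, as a power of two: 30 means 2 ^ 30 bits = 128 MB; default is 30
# (the original note says 2 ^ 22 = 1 MB can deduplicate about 1.3 million URLs; strictly, 2 ^ 22 bits is 512 KB)
BLOOMFILTER_BIT = 30

# Whether to enable dupefilter debug mode; default is False (off)
DUPEFILTER_DEBUG = False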

# ----------------------------------------Redis standalone mode-------------------------------------
# Redis standalone address
REDIS_HOST = "172.25.2.25"
REDIS_PORT = 6379

# Connection parameters for Redis standalone mode
REDIS_PARAMS = {
    "password": "password",
    "db": 0
}

# ----------------------------------------Redis Sentinel mode-------------------------------------

# Redis Sentinel addresses
REDIS_SENTINELS = [
    ('172.25.2.25', 26379),
    ('172.25.2.26', 26379),
    ('172.25.2.27', 26379)
]

# REDIS_SENTINEL_PARAMS: connection parameters for Sentinel mode
REDIS_SENTINEL_PARAMS = {
    "service_name": "mymaster",
    "password": "password",
    "db": 0
}

# ----------------------------------------Redis Cluster mode-------------------------------------

# Redis Cluster startup node addresses
REDIS_STARTUP_NODES = [
    {"host": "172.25.2.25", "port": "6379"},
    {"host": "172.25.2.26", "port": "6379"},
    {"host": "172.25.2.27", "port": "6379"},
]

# REDIS_CLUSTER_PARAMS: connection parameters for Cluster mode
REDIS_CLUSTER_PARAMS = {
    "password": "password"
}

# ----------------------------------------Other Scrapy settings-------------------------------------

# Keep the queues used by scrapy-redis in Redis so that a crawl can be paused and resumed later; that is, do not clear the Redis queues
SCHEDULER_PERSIST = True
# Scheduler
SCHEDULER = "scrapy_redis_sentinel.scheduler.Scheduler"
# Basic deduplication
DUPEFILTER_CLASS = "scrapy_redis_sentinel.dupefilter.RedisDupeFilter"
# Bloom filter deduplication
# DUPEFILTER_CLASS = "scrapy_redis_sentinel.dupefilter.RedisBloomFilter"

# Enable the Redis-based stats collector
STATS_CLASS = "scrapy_redis_sentinel.stats.RedisStatsCollector"

# Queue class that determines the order in which requests are crawled
# Default: ordered by priority (Scrapy's default), implemented with a sorted set; neither FIFO nor LIFO.
# SCHEDULER_QUEUE_CLASS = 'scrapy_redis_sentinel.queue.SpiderPriorityQueue'
# Optional: first in, first out (FIFO)
# SCHEDULER_QUEUE_CLASS = 'scrapy_redis_sentinel.queue.SpiderQueue'
# Optional: last in, first out (LIFO)
# SCHEDULER_QUEUE_CLASS = 'scrapy_redis_sentinel.queue.SpiderStack'
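
As a rough check on the BLOOMFILTER_BIT and BLOOMFILTER_HASH_NUMBER values in the configuration above, the sketch below computes the bitmap size and the textbook false-positive estimate for an ideal Bloom filter. The function names are hypothetical and not part of scrapy-redis-sentinel, and the estimate assumes independent, uniform hash functions, which the package's own hashing may or may not approximate.

```python
import math


def bloom_memory_bytes(bit_exponent: int) -> int:
    """Bytes occupied by a bitmap of 2 ** bit_exponent bits."""
    return (1 << bit_exponent) // 8


def bloom_false_positive_rate(bit_exponent: int, hash_count: int, items: int) -> float:
    """Textbook estimate (1 - e^(-k*n/m)) ** k for an ideal Bloom filter.

    m = number of bits, k = number of hash functions, n = items inserted.
    """
    m = 1 << bit_exponent
    return (1.0 - math.exp(-hash_count * items / m)) ** hash_count


# Example: the default settings (BLOOMFILTER_BIT = 30, BLOOMFILTER_HASH_NUMBER = 6)
default_size_mib = bloom_memory_bytes(30) // (1024 * 1024)
rate_at_100m_urls = bloom_false_positive_rate(30, 6, 100_000_000)
```

With the defaults, the bitmap occupies 128 MiB. The estimated false-positive rate rises as more URLs are inserted, so a larger BLOOMFILTER_BIT trades Redis memory for fewer URLs wrongly treated as already seen.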

Note: when Cluster mode is used, the standalone settings have no effect.
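
The stated priority (Sentinel > Cluster > standalone) can be pictured as a simple check on which settings are present. This is a minimal sketch under that assumption, not the package's actual implementation; `select_redis_mode` is a hypothetical helper, and only the setting names come from the example configuration.

```python
from typing import Any, Mapping


def select_redis_mode(settings: Mapping[str, Any]) -> str:
    """Pick a connection mode following Sentinel > Cluster > standalone."""
    if settings.get("REDIS_SENTINELS"):
        return "sentinel"
    if settings.get("REDIS_STARTUP_NODES"):
        return "cluster"
    # Fall back to REDIS_HOST / REDIS_PORT when neither is configured.
    return "standalone"


# Cluster nodes are configured alongside a standalone host:
# under the stated priority, the standalone address is ignored.
example_settings = {
    "REDIS_HOST": "172.25.2.25",
    "REDIS_PORT": 6379,
    "REDIS_STARTUP_NODES": [{"host": "172.25.2.25", "port": "6379"}],
}
chosen_mode = select_redis_mode(example_settings)
```

Under this reading, leaving several modes configured at once is harmless, but only the highest-priority one is used.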

Using spiders

Change how RedisSpider is imported.

Usage with the original scrapy-redis:

from scrapy_redis.spiders import RedisSpider


class Spider(RedisSpider):
    ...

Usage with scrapy-redis-sentinel:

from scrapy_redis_sentinel.spiders import RedisSpider


class Spider(RedisSpider):
    ...
