Distribution Support for Scrapy & Gerapy using Redis
Project description
Gerapy Redis
This is a package for supporting distribution in Scrapy using Redis, also this package is a module in Gerapy.
This package is almost copied from https://github.com/rmax/scrapy-redis.
Change
Removed RedisSpider, move the logic to Scheduler. It will pre enqueue all start requests to Redis Queue instead of adding one start request when crawler is idle.
Arg: SCHEDULER_PRE_ENQUEUE_ALL_START_REQUESTS
, default to True
.
Installation
pip3 install gerapy-redis
Usage
# Enables scheduling storing requests queue in redis.
SCHEDULER = "gerapy_redis.scheduler.Scheduler"
# Ensure all spiders share same duplicates filter through redis.
DUPEFILTER_CLASS = "gerapy_redis.dupefilter.RFPDupeFilter"
# Default requests serializer is pickle, but it can be changed to any module
# with loads and dumps functions. Note that pickle is not compatible between
# python versions.
# Caveat: In python 3.x, the serializer must return strings keys and support
# bytes as values. Because of this reason the json or msgpack module will not
# work by default. In python 2.x there is no such issue and you can use
# 'json' or 'msgpack' as serializers.
#SCHEDULER_SERIALIZER = "gerapy_redis.picklecompat"
# Don't cleanup redis queues, allows to pause/resume crawls.
#SCHEDULER_PERSIST = True
# Pre enqueue all start requests to queue, (default True)
#SCHEDULER_PRE_ENQUEUE_ALL_START_REQUESTS = True
# Schedule requests using a priority queue. (default)
#SCHEDULER_QUEUE_CLASS = 'gerapy_redis.queue.PriorityQueue'
# Alternative queues.
#SCHEDULER_QUEUE_CLASS = 'gerapy_redis.queue.FifoQueue'
#SCHEDULER_QUEUE_CLASS = 'gerapy_redis.queue.LifoQueue'
# Max idle time to prevent the spider from being closed when distributed crawling.
# This only works if queue class is SpiderQueue or SpiderStack,
# and may also block the same time when your spider start at the first time (because the queue is empty).
#SCHEDULER_IDLE_BEFORE_CLOSE = 10
# Store scraped item in redis for post-processing.
ITEM_PIPELINES = {
'gerapy_redis.pipelines.RedisPipeline': 300
}
# The item pipeline serializes and stores the items in this redis key.
#REDIS_ITEMS_KEY = '%(spider)s:items'
# The items serializer is by default ScrapyJSONEncoder. You can use any
# importable path to a callable object.
#REDIS_ITEMS_SERIALIZER = 'json.dumps'
# Specify the host and port to use when connecting to Redis (optional).
#REDIS_HOST = 'localhost'
#REDIS_PORT = 6379
# Specify the full Redis URL for connecting (optional).
# If set, this takes precedence over the REDIS_HOST and REDIS_PORT settings.
#REDIS_URL = 'redis://user:pass@hostname:9001'
# Custom redis client parameters (i.e.: socket timeout, etc.)
#REDIS_PARAMS = {}
# Use custom redis client class.
#REDIS_PARAMS['redis_cls'] = 'myproject.RedisClient'
# If True, it uses redis' ``SPOP`` operation. You have to use the ``SADD``
# command to add URLs to the redis queue. This could be useful if you
# want to avoid duplicates in your start urls list and the order of
# processing does not matter.
#REDIS_START_URLS_AS_SET = False
# If True, it uses redis ``zrevrange`` and ``zremrangebyrank`` operation. You have to use the ``zadd``
# command to add URLS and Scores to redis queue. This could be useful if you
# want to use priority and avoid duplicates in your start urls list.
#REDIS_START_URLS_AS_ZSET = False
# Default start urls key for RedisSpider and RedisCrawlSpider.
#REDIS_START_URLS_KEY = '%(name)s:start_urls'
# Use other encoding than utf-8 for redis.
#REDIS_ENCODING = 'latin1'
For more information, please refer to https://github.com/rmax/scrapy-redis.
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file gerapy-redis-0.1.1.tar.gz
.
File metadata
- Download URL: gerapy-redis-0.1.1.tar.gz
- Upload date:
- Size: 12.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.2.0 pkginfo/1.6.0 requests/2.22.0 setuptools/51.1.1 requests-toolbelt/0.9.1 tqdm/4.49.0 CPython/3.8.7
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | b4e4a273c9ed12d8bbd86b299a32f672b30e6397b7be045e3ae49fa712e48338 |
|
MD5 | 45322e11c12e700ad50d692a148009db |
|
BLAKE2b-256 | 55521a89f0b0697afdb25364556009464b63a59b1fd3becc905b15cc4de0700a |
File details
Details for the file gerapy_redis-0.1.1-py2.py3-none-any.whl
.
File metadata
- Download URL: gerapy_redis-0.1.1-py2.py3-none-any.whl
- Upload date:
- Size: 12.4 kB
- Tags: Python 2, Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.2.0 pkginfo/1.6.0 requests/2.22.0 setuptools/51.1.1 requests-toolbelt/0.9.1 tqdm/4.49.0 CPython/3.8.7
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 2f4cea8785e170b3458a1e3d57042ecc0350f85b3923e74271e4b68cbaa7dd61 |
|
MD5 | c7a3df26d1fd073788f08d6c82dcfaa3 |
|
BLAKE2b-256 | 2ee2e287af8d55726ce558834d439ed8767945aa8d7e356495a1542a6633d1ce |