Distribution Support for Scrapy & Gerapy using Redis

Project description

Gerapy Redis

This package adds Redis-based distributed crawling support to Scrapy; it is also used as a module in Gerapy.

This package is largely derived from https://github.com/rmax/scrapy-redis.

Changes

Removed RedisSpider and moved its logic into the Scheduler. The Scheduler now pre-enqueues all start requests into the Redis queue up front, instead of adding one start request each time the crawler goes idle.

Setting: SCHEDULER_PRE_ENQUEUE_ALL_START_REQUESTS, defaults to True.
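
If the original idle-time behavior is preferred, the flag can be turned off in settings.py (a minimal sketch using the setting documented below):

# settings.py: enqueue one start request when the crawler is idle,
# instead of pre-enqueueing everything up front.
SCHEDULER_PRE_ENQUEUE_ALL_START_REQUESTS = False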

Installation

pip3 install gerapy-redis

Usage

# Enable scheduling by storing the requests queue in Redis.
SCHEDULER = "gerapy_redis.scheduler.Scheduler"

# Ensure all spiders share the same duplicates filter through Redis.
DUPEFILTER_CLASS = "gerapy_redis.dupefilter.RFPDupeFilter"

# The default requests serializer is pickle, but it can be changed to any
# module that provides loads and dumps functions. Note that pickle is not
# compatible between Python versions.
# Caveat: in Python 3.x, the serializer must return string keys and support
# bytes as values. For this reason, the json and msgpack modules will not
# work by default. In Python 2.x there is no such issue and you can use
# 'json' or 'msgpack' as serializers.
#SCHEDULER_SERIALIZER = "gerapy_redis.picklecompat"
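# A minimal sketch of that loads/dumps contract (hypothetical module name),
# wrapping pickle with a fixed protocol:
#
#   # myproject/serializer.py
#   import pickle
#
#   def dumps(obj):
#       # Serialize a request to bytes; protocol 2 works on Python 2 and 3.
#       return pickle.dumps(obj, protocol=2)
#
#   def loads(data):
#       # Restore a request from bytes.
#       return pickle.loads(data)
#
#SCHEDULER_SERIALIZER = 'myproject.serializer'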

# Don't clean up Redis queues; this allows pausing/resuming crawls.
#SCHEDULER_PERSIST = True

# Pre-enqueue all start requests to the queue (default: True).
#SCHEDULER_PRE_ENQUEUE_ALL_START_REQUESTS = True

# Schedule requests using a priority queue (default).
#SCHEDULER_QUEUE_CLASS = 'gerapy_redis.queue.PriorityQueue'

# Alternative queues.
#SCHEDULER_QUEUE_CLASS = 'gerapy_redis.queue.FifoQueue'
#SCHEDULER_QUEUE_CLASS = 'gerapy_redis.queue.LifoQueue'

# Max idle time (in seconds) to prevent the spider from being closed during
# distributed crawling. This only works if the queue class is SpiderQueue or
# SpiderStack, and it may also block for the same amount of time when the
# spider starts for the first time (because the queue is empty).
#SCHEDULER_IDLE_BEFORE_CLOSE = 10

# Store scraped items in Redis for post-processing.
ITEM_PIPELINES = {
    'gerapy_redis.pipelines.RedisPipeline': 300
}

# The item pipeline serializes and stores the items in this redis key.
#REDIS_ITEMS_KEY = '%(spider)s:items'

# The items serializer is ScrapyJSONEncoder by default. You can use any
# importable path to a callable object.
#REDIS_ITEMS_SERIALIZER = 'json.dumps'
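
# To read the stored items back for post-processing (a sketch, assuming the
# default JSON serializer, the default items key, and a spider named
# 'myspider'; uses the redis-py client):
#
#   import json
#   import redis
#
#   r = redis.StrictRedis(host='localhost', port=6379)
#   while True:
#       data = r.lpop('myspider:items')  # one JSON blob per item
#       if data is None:
#           break
#       print(json.loads(data))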

# Specify the host and port to use when connecting to Redis (optional).
#REDIS_HOST = 'localhost'
#REDIS_PORT = 6379

# Specify the full Redis URL for connecting (optional).
# If set, this takes precedence over the REDIS_HOST and REDIS_PORT settings.
#REDIS_URL = 'redis://user:pass@hostname:9001'

# Custom Redis client parameters (e.g. socket timeout).
#REDIS_PARAMS = {}
# Use custom redis client class.
#REDIS_PARAMS['redis_cls'] = 'myproject.RedisClient'
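# For instance (a sketch; these keys are standard redis-py client keyword
# arguments, not specific to this package):
#REDIS_PARAMS = {
#    'socket_timeout': 30,          # give up blocking socket reads after 30 s
#    'socket_connect_timeout': 30,  # time allowed for the initial connection
#    'retry_on_timeout': True,      # retry a command once after a timeout
#}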

# If True, it uses Redis' ``SPOP`` operation. You have to use the ``SADD``
# command to add URLs to the Redis queue. This can be useful if you want to
# avoid duplicates in your start URLs list and the order of processing does
# not matter.
#REDIS_START_URLS_AS_SET = False

# If True, it uses Redis' ``ZREVRANGE`` and ``ZREMRANGEBYRANK`` operations.
# You have to use the ``ZADD`` command to add URLs and scores to the Redis
# queue. This can be useful if you want to use priorities and avoid
# duplicates in your start URLs list.
#REDIS_START_URLS_AS_ZSET = False
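# Feeding start URLs by hand with redis-cli in these modes (a sketch;
# 'myspider' is a hypothetical spider name):
#   SADD myspider:start_urls http://example.com       # set mode
#   ZADD myspider:start_urls 100 http://example.com   # zset mode, score 100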

# Default start urls key.
#REDIS_START_URLS_KEY = '%(name)s:start_urls'

# Use an encoding other than utf-8 for Redis.
#REDIS_ENCODING = 'latin1'
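
Because the queue and the dupefilter live in Redis, the same spider can then be launched on several machines pointing at one shared Redis instance, and the workers will divide the requests between them. A minimal sketch, assuming a hypothetical spider named myspider and the scrapy-redis start-urls convention:

# on every worker machine, with REDIS_URL pointing at the shared instance
scrapy crawl myspider

# optionally, inject more start URLs at runtime (default list mode)
redis-cli lpush myspider:start_urls http://example.com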

For more information, please refer to https://github.com/rmax/scrapy-redis.

