
Project description

Scrapy-Redis

Redis-based components for Scrapy.

Features

  • Distributed crawling/scraping

    You can start multiple spider instances that share a single Redis queue. Best suited for broad multi-domain crawls.

  • Distributed post-processing

    Scraped items get pushed into a Redis queue, meaning that you can start as many post-processing processes as needed, all sharing the same items queue (see the worker sketch after this list).

  • Scrapy plug-and-play components

    Scheduler + Duplication Filter, Item Pipeline, Base Spiders (the configuration sketch after this list shows how to enable them).

  • In this forked version: added support for JSON-formatted data in Redis

    The data contains url, `meta` and other optional parameters. `meta` is a nested JSON object that carries sub-data. This feature extracts the data and sends another FormRequest built from the url, the meta, and the additional form data.

    For example:

    { "url": "https://exaple.com", "meta": {"job-id":"123xsd", "start-date":"dd/mm/yy"}, "url_cookie_key":"fertxsas" }

    This data can be accessed in the Scrapy spider through the response, e.g. request.url, request.meta, request.cookies. A spider sketch follows this list.
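
Taken together, the pieces above wire up as follows. This is a minimal sketch rather than the project's official example: the spider name myspider, its Redis keys, and the job-id field are placeholders taken from the JSON example above, while the setting names and the RedisSpider base class come from scrapy-redis itself.

# settings.py -- enable the plug-and-play components
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
ITEM_PIPELINES = {"scrapy_redis.pipelines.RedisPipeline": 300}
REDIS_URL = "redis://localhost:6379"

# myspider.py -- consumes requests from the myspider:start_urls key
from scrapy_redis.spiders import RedisSpider

class MySpider(RedisSpider):
    name = "myspider"
    redis_key = "myspider:start_urls"

    def parse(self, response):
        # meta pushed with the JSON payload rides along on the request
        yield {"url": response.url, "job_id": response.meta.get("job-id")}

A JSON payload like the one above is fed to the spider by pushing it onto the key, e.g.:

redis-cli lpush myspider:start_urls '{"url": "https://example.com", "meta": {"job-id": "123xsd"}}'

A post-processing worker can then drain the items queue from any number of processes; the sketch below assumes the default items key (%(spider)s:items, JSON-serialized by the pipeline) and a local Redis instance:

import json

import redis

r = redis.Redis()
while True:
    # BLPOP blocks until the item pipeline pushes a scraped item
    _, raw = r.blpop("myspider:items")
    item = json.loads(raw)
    print(item)  # replace with real post-processing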

Requirements

  • Python 3.7+

  • Redis >= 5.0

  • Scrapy >= 2.0

  • redis-py >= 4.0

Installation

From pip

pip install scrapy-redis

From GitHub

git clone https://github.com/darkrho/scrapy-redis.git
cd scrapy-redis
pip uninstall scrapy-redis  # remove any pip-installed copy first so it cannot shadow this build
python setup.py install

Alternative Choice

Frontera is a web crawling framework consisting of a crawl frontier and distribution/scaling primitives, allowing you to build a large-scale online web crawler.

History

0.7.3 (2022-07-21)

  • Move docs to GitHub Wiki

  • Update tox and support dynamic tests

  • Update support for json data

  • Refactor max idle time

  • Add support for Python 3.7 through Python 3.10

  • Deprecate Python 2.x support

0.7.2 (2021-12-27)

  • Fix RedisStatsCollector._get_key()

  • Fix redis-py dependency version

  • Added maximum idle waiting time MAX_IDLE_TIME_BEFORE_CLOSE

0.7.1 (2021-03-27)

  • Fixes datetime parse error for redis-py 3.x.

  • Add support for stats extensions.

0.7.1-rc1 (2021-03-27)

  • Fixes datetime parse error for redis-py 3.x.

0.7.1-b1 (2021-03-22)

  • Add support for stats extensions.

0.7.0-dev (unreleased)

  • Unreleased.

0.6.8 (2017-02-14)

  • Fixed automated release due to not matching registered email.

0.6.7 (2016-12-27)

  • Fixes bad formatting in logging message.

0.6.6 (2016-12-20)

  • Fixes wrong message on dupefilter duplicates.

0.6.5 (2016-12-19)

  • Fixed typo in default settings.

0.6.4 (2016-12-18)

  • Fixed data decoding in Python 3.x.

  • Added REDIS_ENCODING setting (default utf-8).

  • Default to CONCURRENT_REQUESTS value for REDIS_START_URLS_BATCH_SIZE.

  • Renamed queue classes to a proper naming convention (backwards compatible).

0.6.3 (2016-07-03)

  • Added REDIS_START_URLS_KEY setting.

  • Fixed spider method from_crawler signature.

0.6.2 (2016-06-26)

  • Support redis_cls parameter in REDIS_PARAMS setting.

  • Python 3.x compatibility fixed.

  • Added SCHEDULER_SERIALIZER setting.

0.6.1 (2016-06-25)

  • Backwards incompatible change: Require explicit DUPEFILTER_CLASS setting.

  • Added SCHEDULER_FLUSH_ON_START setting.

  • Added REDIS_START_URLS_AS_SET setting.

  • Added REDIS_ITEMS_KEY setting.

  • Added REDIS_ITEMS_SERIALIZER setting.

  • Added REDIS_PARAMS setting.

  • Added REDIS_START_URLS_BATCH_SIZE spider attribute to read start URLs in batches.

  • Added RedisCrawlSpider.

0.6.0 (2015-07-05)

  • Updated code to be compatible with Scrapy 1.0.

  • Added -a domain=… option for example spiders.

0.5.0 (2013-09-02)

  • Added REDIS_URL setting to support Redis connection string.

  • Added SCHEDULER_IDLE_BEFORE_CLOSE setting to prevent the spider from closing too quickly when the queue is empty. The default value is zero, keeping the previous behavior.

  • Preemptively schedule requests when an item is scraped.

  • This version is the latest release compatible with Scrapy 0.24.x.

0.4.0 (2013-04-19)

  • Added RedisSpider and RedisMixin classes as building blocks for spiders to be fed through a redis queue.

  • Added redis queue stats.

  • Let the encoder handle the item as it comes instead of converting it to a dict.

0.3.0 (2013-02-18)

  • Added support for different queue classes.

  • Changed requests serialization from marshal to cPickle.

0.2.0 (2013-02-17)

  • Improved backward compatibility.

  • Added example project.

0.1.0 (2011-09-01)

  • First release on PyPI.


Download files

Download the file for your platform.

Source Distribution

latest-scrapy-redis-0.7.3.tar.gz (38.7 kB, Source)

Built Distribution

latest_scrapy_redis-0.7.3-py2.py3-none-any.whl (18.1 kB, Python 2 / Python 3)

File details

Details for the file latest-scrapy-redis-0.7.3.tar.gz.

File metadata

  • Download URL: latest-scrapy-redis-0.7.3.tar.gz
  • Upload date:
  • Size: 38.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.10.12

File hashes

Hashes for latest-scrapy-redis-0.7.3.tar.gz

SHA256: e1d5ddd1de50704e1a66ba8bb22bc731196f7ae9ee64b888f2536da60c4a1b7a
MD5: 441e62cc3190055de31aa5d3b609f070
BLAKE2b-256: 763e65b0b51b969806ebe1d1d0f71650202085dcf60a76490344634c9dee85e7

File details

Details for the file latest_scrapy_redis-0.7.3-py2.py3-none-any.whl.

File metadata

File hashes

Hashes for latest_scrapy_redis-0.7.3-py2.py3-none-any.whl

SHA256: cba254a3c5f955d5ed1c61313a8efc8e5d770b8d2923b7d8427090881c93a911
MD5: 60743461de5cd109a6d3d4558a677a39
BLAKE2b-256: f1102620c98cf98050533b4bf974346c332f3e3d312a4d33152d18151a4a8787
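
Either digest can be checked locally before installing. Below is a minimal sketch using Python's hashlib; the filename and expected value are the sdist's published SHA256 from above:

import hashlib

# Compare the downloaded sdist against the published SHA256 digest
expected = "e1d5ddd1de50704e1a66ba8bb22bc731196f7ae9ee64b888f2536da60c4a1b7a"
with open("latest-scrapy-redis-0.7.3.tar.gz", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()
assert digest == expected, "hash mismatch -- do not install this file"

pip can also enforce digests at install time with hash-checking mode (pip install --require-hashes -r requirements.txt).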
