Redis-based components for Scrapy.

Project description

Features

  • Distributed crawling/scraping

    You can start multiple spider instances that share a single Redis queue. Best suited for broad multi-domain crawls.

  • Distributed post-processing

    Scraped items get pushed into a Redis queue, meaning that you can start as many post-processing processes as needed, all sharing the items queue.

  • Scrapy plug-and-play components

    Scheduler + Duplication Filter, Item Pipeline, Base Spiders.

  • In this forked version: support for JSON data in Redis

    The data contains url, `meta` and other optional parameters. meta is a nested JSON object that holds sub-data. This feature extracts the data and sends another FormRequest with the url, meta and the additional formdata.

    For example:

    { "url": "https://example.com", "meta": {"job-id":"123xsd", "start-date":"dd/mm/yy"}, "url_cookie_key":"fertxsas" }

    This data can be accessed in a Scrapy spider through the response, for example: request.url, request.meta, request.cookies
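The JSON handling described above can be sketched in plain Python. This is an illustration under stated assumptions, not the fork's actual implementation: the helper name `split_payload` is hypothetical, and treating every key other than url and meta as form data is an assumption based on the description above.

```python
import json


def split_payload(raw):
    """Split a JSON payload into (url, meta, remaining fields).

    Hypothetical helper: `raw` is the string or bytes value read from Redis.
    """
    if isinstance(raw, bytes):
        raw = raw.decode("utf-8")
    data = json.loads(raw)
    url = data.pop("url")          # required key
    meta = data.pop("meta", {})    # nested JSON object, optional
    return url, meta, data         # whatever is left is treated as form data


payload = (
    '{ "url": "https://example.com", '
    '"meta": {"job-id":"123xsd", "start-date":"dd/mm/yy"}, '
    '"url_cookie_key":"fertxsas" }'
)
url, meta, formdata = split_payload(payload)

# Inside a spider, these values could then be passed on, for example:
#   scrapy.FormRequest(url, meta=meta, formdata=formdata)
```

Keeping the parsing step separate from request construction makes it easy to check the payload format without a running Redis server or crawler.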

Requirements

  • Python 3.7+

  • Redis >= 5.0

  • Scrapy >= 2.0

  • redis-py >= 4.0

Installation

From pip

pip install scrapy-redis

From GitHub

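git clone https://github.com/darkrho/scrapy-redis.git
cd scrapy-redis
python setup.py install

Note: to use the JSON data feature of this fork, make sure scrapy-redis has not also been installed through pip. If it has, uninstall that copy first:

pip uninstall scrapy-redis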
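Once the package is installed, the components listed under Features are typically enabled through Scrapy settings. The following is a minimal sketch of a project's settings.py, assuming a Redis server on localhost at the default port; the URL and the pipeline priority of 300 are assumptions to adjust for your setup.

```python
# settings.py (sketch)

# Store the request queue in Redis so several spider instances can share it.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Use the Redis-backed duplicate filter so all instances share seen requests.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the queue and dupefilter in Redis between runs (allows pause/resume).
SCHEDULER_PERSIST = True

# Push scraped items into a Redis list for distributed post-processing.
ITEM_PIPELINES = {
    "scrapy_redis.pipelines.RedisPipeline": 300,  # priority is an assumption
}

# Assumed connection URL; point this at your own Redis server.
REDIS_URL = "redis://localhost:6379"
```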

Alternative Choice

Frontera is a web crawling framework consisting of a crawl frontier and distribution/scaling primitives, which allows building a large-scale online web crawler.

Download files

Source distribution: kk-scrapy-redis-0.0.1.tar.gz (14.4 kB)

Built distribution: kk_scrapy_redis-0.0.1-py2.py3-none-any.whl (16.4 kB, Python 2 and Python 3)
