Redis-based components for Scrapy.
Documentation: https://github.com/rmax/scrapy-redis/wiki.
Contribution: https://github.com/rmax/scrapy-redis/wiki/Getting-Started
LICENSE: MIT license
Features
Distributed crawling/scraping
You can start multiple spider instances that share a single Redis queue. Best suited to broad multi-domain crawls.
Distributed post-processing
Scraped items are pushed into a Redis queue, so you can start as many post-processing processes as needed, all sharing the items queue.
Scrapy plug-and-play components
Scheduler + Duplication Filter, Item Pipeline, Base Spiders.
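As a rough illustration of how these components are typically wired together, the following is a minimal settings.py sketch. It assumes a Redis server reachable at redis://localhost:6379 and a pipeline priority of 300; both values are placeholders for illustration, not recommendations from this project.

```python
# settings.py (sketch): enable the Redis-backed scrapy-redis components.

# Store the request queue in Redis so several spider instances can share it.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Deduplicate requests through Redis so all instances see the same fingerprints.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the queue and fingerprints in Redis after the spider closes,
# so a crawl can be paused and resumed.
SCHEDULER_PERSIST = True

# Push scraped items into a Redis list for post-processing workers.
# The priority 300 is an arbitrary placeholder.
ITEM_PIPELINES = {
    "scrapy_redis.pipelines.RedisPipeline": 300,
}

# Assumed connection URL; adjust to your own Redis instance.
REDIS_URL = "redis://localhost:6379"
```

With settings along these lines, every spider instance started against the same Redis server draws requests from the shared queue.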
In this forked version: added support for JSON data in Redis
The data contains `url`, `meta`, and other optional parameters. `meta` is a nested JSON object that contains sub-data. The feature extracts this data and sends another FormRequest with the `url`, `meta`, and additional `formdata`.
For example:
{ "url": "https://example.com", "meta": {"job-id":"123xsd", "start-date":"dd/mm/yy"}, "url_cookie_key":"fertxsas" }
This data can be accessed in a Scrapy spider through the response's request, for example: response.request.url, response.request.meta, response.request.cookies
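The splitting step described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the fork's actual code: the helper name `split_payload` is hypothetical, and treating every key other than `url` and `meta` as form data is an assumption based on the description.

```python
import json
from typing import Any, Dict, Tuple


def split_payload(data: bytes) -> Tuple[str, Dict[str, Any], Dict[str, str]]:
    """Split a JSON payload (as read from Redis) into url, meta and form data.

    Hypothetical helper: the remaining keys are assumed to become form fields.
    """
    params = json.loads(data.decode("utf-8"))
    url = params.pop("url")          # required: the page to request
    meta = params.pop("meta", {})    # optional nested JSON object
    # Whatever is left is treated as form data; values are stringified
    # because form fields are sent as text.
    formdata = {key: str(value) for key, value in params.items()}
    return url, meta, formdata


# The example payload from the text, encoded as it would be stored in Redis.
payload = json.dumps({
    "url": "https://example.com",
    "meta": {"job-id": "123xsd", "start-date": "dd/mm/yy"},
    "url_cookie_key": "fertxsas",
}).encode("utf-8")

url, meta, formdata = split_payload(payload)
print(url)
print(meta)
print(formdata)
```

In a spider, the three values would then be passed to a FormRequest as its url, meta and formdata arguments.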
Requirements
Python 3.7+
Redis >= 5.0
Scrapy >= 2.0
redis-py >= 4.0
Installation
From pip
pip install scrapy-redis
From GitHub
git clone https://github.com/darkrho/scrapy-redis.git
cd scrapy-redis
python setup.py install
pip uninstall scrapy-redis
Alternative Choice
Frontera is a web crawling framework consisting of a crawl frontier and distribution/scaling primitives that allow you to build a large-scale online web crawler.