Redis-based components for Scrapy
This is initial work on Scrapy-Redis integration and is not production-tested. Use it at your own risk!
Features:

* Distributed crawling/scraping
* Distributed post-processing
Requirements:

* Scrapy >= 0.13 (development version)
* redis-py (tested on 2.4.9)
* redis server (tested on 2.2-2.4)
Available Scrapy components:

* Scheduler
* Duplication Filter (see the sketch below)
* Item Pipeline
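The duplication filter works by sharing "seen" request fingerprints across all crawler processes through a single redis set. The sketch below only illustrates the idea; the class name, key name, and fingerprinting shown here are hypothetical, not the package's actual API:

import hashlib

import redis


class RedisDupeFilterSketch(object):
    """Illustrative only: share seen-request fingerprints via a redis set."""

    def __init__(self, server, key):
        self.server = server  # redis.Redis connection shared by all crawlers
        self.key = key        # hypothetical key name, e.g. 'dmoz:dupefilter'

    def request_seen(self, url):
        # SADD returns 0 when the member was already in the set, i.e. some
        # crawler process has scheduled this request before.
        fp = hashlib.sha1(url.encode('utf-8')).hexdigest()
        return self.server.sadd(self.key, fp) == 0

Because every process checks the same set, a request scheduled by one crawler is never re-scheduled by another.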
Usage
In your settings.py:
# Enable the redis-backed scheduler, which stores the requests queue in redis.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Don't clean up redis queues; this allows pausing and resuming crawls.
SCHEDULER_PERSIST = True
# Pick one of the three queue orders below (sketched after this settings block).
# Schedule requests using a priority queue (default):
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.SpiderPriorityQueue'
# Or schedule requests using a FIFO queue:
#SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.SpiderQueue'
# Or schedule requests using a LIFO stack:
#SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.SpiderStack'
# Store scraped items in redis for post-processing.
ITEM_PIPELINES = [
'scrapy_redis.pipelines.RedisPipeline',
]
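The three SCHEDULER_QUEUE_CLASS options map naturally onto redis primitives: a list popped from the opposite end it is pushed to gives FIFO order, popping from the same end gives LIFO, and a sorted set scored by priority gives the priority queue. A rough sketch of the FIFO variant, assuming requests have already been serialized to strings (the class and key names are illustrative, not the package's actual internals):

import redis


class SpiderQueueSketch(object):
    """Illustrative FIFO request queue on a redis list (LPUSH in, RPOP out)."""

    def __init__(self, server, key):
        self.server = server  # redis.Redis connection
        self.key = key        # hypothetical key name, e.g. 'dmoz:requests'

    def push(self, serialized_request):
        # New requests enter on the left...
        self.server.lpush(self.key, serialized_request)

    def pop(self):
        # ...and leave on the right, so the oldest request is crawled first.
        return self.server.rpop(self.key)

A LIFO stack would pop from the same end it pushes to, and the priority queue would store requests in a sorted set with the request priority as the score.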
Running the example project
You can test the functionality by following these steps:
1. Set up the scrapy_redis package in your PYTHONPATH.
2. Run the crawler for the first time, then stop it:

   $ cd example-project
   $ scrapy crawl dmoz
   ... [dmoz] ...
   ^C
3. Run the crawler again to resume the stopped crawl:

   $ scrapy crawl dmoz
   ... [dmoz] DEBUG: Resuming crawl (9019 requests scheduled)
4. Start one or more additional scrapy crawlers:

   $ scrapy crawl dmoz
   ... [dmoz] DEBUG: Resuming crawl (8712 requests scheduled)
5. Start one or more post-processing workers:

   $ python process_items.py
   Processing: Kilani Giftware (http://www.dmoz.org/Computers/Shopping/Gifts/)
   Processing: NinjaGizmos.com (http://www.dmoz.org/Computers/Shopping/Gifts/)
   ...
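process_items.py is the read side of the item pipeline: it pops serialized items from redis and handles them. A minimal worker along these lines, assuming the pipeline stores items as JSON on a per-spider list (the 'dmoz:items' key name and the name/url item fields are assumptions based on the output above):

import json

import redis


def main():
    # Connect to the same redis server the crawlers write to.
    server = redis.Redis(host='localhost', port=6379)
    while True:
        # BLPOP blocks until an item is available, then returns (key, value).
        _, data = server.blpop('dmoz:items')  # hypothetical key name
        item = json.loads(data)
        print('Processing: %(name)s (%(url)s)' % item)


if __name__ == '__main__':
    main()

Because redis lists hand each popped item to exactly one client, you can run as many of these workers in parallel as you like.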
That’s it.