Redis-based components for Scrapy
=================================
This is initial work on Scrapy-Redis integration and has not been production-tested.
Use it at your own risk!
Features:
* Distributed crawling/scraping
* Distributed post-processing
Requirements:
* Scrapy >= 0.13 (development version)
* redis-py (tested on 2.4.9)
* Redis server (tested on 2.2-2.4)
Available Scrapy components:
* Scheduler
* Duplication Filter
* Item Pipeline
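As a rough illustration of why a shared scheduler and duplication filter allow several crawlers to cooperate, the sketch below keeps a request queue in a Redis list and request fingerprints in a Redis set. It is a conceptual sketch, not the scrapy_redis implementation: the key names (``<spider>:requests``, ``<spider>:dupefilter``), the helper functions, and the ``InMemoryRedis`` stand-in are all assumptions made for this example. The stand-in mimics only the ``sadd``, ``rpush`` and ``lpop`` commands, so the code runs without a server; a real ``redis.Redis()`` client should be usable in its place.

```python
import hashlib


class InMemoryRedis:
    """Hypothetical stand-in for the three Redis commands used below."""

    def __init__(self):
        self.sets = {}
        self.lists = {}

    def sadd(self, key, member):
        # Like Redis SADD: return 1 if the member was new, 0 otherwise.
        members = self.sets.setdefault(key, set())
        if member in members:
            return 0
        members.add(member)
        return 1

    def rpush(self, key, value):
        # Like Redis RPUSH: append and return the new list length.
        items = self.lists.setdefault(key, [])
        items.append(value)
        return len(items)

    def lpop(self, key):
        # Like Redis LPOP: pop from the head, or None when empty.
        items = self.lists.get(key)
        return items.pop(0) if items else None


def fingerprint(url):
    """Reduce a URL to a fixed-size fingerprint (assumed scheme: SHA-1)."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


def schedule(client, spider, url):
    """Queue a URL unless any crawler has already seen it."""
    if client.sadd("%s:dupefilter" % spider, fingerprint(url)) == 0:
        return False  # duplicate: another process already scheduled it
    client.rpush("%s:requests" % spider, url)
    return True


def next_url(client, spider):
    """Take the next queued URL, or None when the queue is empty."""
    return client.lpop("%s:requests" % spider)
```

Because both structures live in Redis rather than in one process, every crawler that points at the same server sees the same queue and the same set of fingerprints, and leaving the keys in place between runs is what makes pausing and resuming possible.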
Usage
-----
In your ``settings.py``:

.. code-block:: python

    # Enable the scheduler that stores the request queue in Redis.
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"

    # Don't clean up the Redis queues; this allows crawls to be paused and resumed.
    SCHEDULER_PERSIST = True

    # Store scraped items in Redis for post-processing.
    ITEM_PIPELINES = [
        'scrapy_redis.pipelines.RedisPipeline',
    ]
Running the example project
---------------------------
You can test the functionality by following these steps:

1. Set up the scrapy_redis package in your PYTHONPATH.

2. Run the crawler for the first time, then stop it::

    $ cd example-project
    $ scrapy crawl dmoz
    ... [dmoz] ...
    ^C

3. Run the crawler again to resume the stopped crawl::

    $ scrapy crawl dmoz
    ... [dmoz] DEBUG: Resuming crawl (9019 requests scheduled)

4. Start one or more additional Scrapy crawlers::

    $ scrapy crawl dmoz
    ... [dmoz] DEBUG: Resuming crawl (8712 requests scheduled)

5. Start one or more post-processing workers::

    $ python process_items.py
    Processing: Kilani Giftware (http://www.dmoz.org/Computers/Shopping/Gifts/)
    Processing: NinjaGizmos.com (http://www.dmoz.org/Computers/Shopping/Gifts/)
    ...
That's it.
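For readers who want to write their own post-processing worker, a minimal sketch follows. It is not the ``process_items.py`` shipped with the example project. It assumes that the pipeline stores each item as a JSON string in a Redis list named ``dmoz:items`` and that items carry ``name`` and ``url`` fields. Those key and field names are guesses for illustration, so check them against your own setup. The only redis-py call used is ``blpop``, which returns a ``(key, value)`` pair, or ``None`` once the timeout expires.

```python
import json


def process_items(client, key, handle, timeout=5, limit=None):
    """Pop JSON items from the Redis list `key` and pass each to `handle`.

    Stops when the list stays empty for `timeout` seconds, or after
    `limit` items if a limit is given. Returns the number processed.
    """
    processed = 0
    while limit is None or processed < limit:
        popped = client.blpop(key, timeout=timeout)
        if popped is None:
            break  # nothing arrived within the timeout
        _source_key, data = popped
        handle(json.loads(data))
        processed += 1
    return processed


def format_item(item):
    """Render an item in the style of the example output above."""
    # The "name" and "url" field names are assumptions for this sketch.
    return "Processing: %s (%s)" % (item.get("name"), item.get("url"))


def main():
    # Imported here so the helpers above can be used without redis-py.
    import redis

    client = redis.Redis()  # assumes a server on localhost:6379
    # "dmoz:items" is an assumed key name, not confirmed by the README.
    process_items(client, "dmoz:items",
                  lambda item: print(format_item(item)))
```

Calling ``main()`` from several processes at once should be safe, because each ``blpop`` removes an item from the list atomically and so each item reaches only one worker.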