# RabbitMQ Plug-in for Scrapy


## A RabbitMQ Scheduler for the Scrapy Framework

Scrapy-rabbitmq lets you feed URLs into RabbitMQ and consume them with spiders built on the [Scrapy framework](http://doc.scrapy.org/en/latest/index.html).

Inspired by and modeled after [scrapy-redis](https://github.com/darkrho/scrapy-redis).

## Installation

Using pip, type at your command-line prompt:

```
pip install scrapy-rabbitmq
```

Or clone the repo and, inside the scrapy-rabbitmq directory, type:

```
python setup.py install
```

## Usage

### Step 1: In your Scrapy settings, add the following config values:

```
# Schedule requests through a queue stored in RabbitMQ.
SCHEDULER = "scrapy_rabbitmq.scheduler.Scheduler"

# Don't clean up RabbitMQ queues; allows pausing/resuming crawls.
SCHEDULER_PERSIST = True

# Queue class used to schedule requests. (default)
SCHEDULER_QUEUE_CLASS = 'scrapy_rabbitmq.queue.SpiderQueue'

# RabbitMQ queue used to store requests.
RABBITMQ_QUEUE_NAME = 'scrapy_queue'

# Host and port of the RabbitMQ daemon.
RABBITMQ_CONNECTION_PARAMETERS = {'host': 'localhost', 'port': 6666}

# Store scraped items in RabbitMQ for post-processing.
ITEM_PIPELINES = {
    'scrapy_rabbitmq.pipelines.RabbitMQPipeline': 1,
}
```
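
To sanity-check the connection settings before running a crawl, the dictionary above can be unpacked straight into pika's `ConnectionParameters`. The snippet below is an illustrative sketch, not part of scrapy-rabbitmq itself; note that RabbitMQ's stock port is 5672, so adjust `port` to match your broker:

```
#!/usr/bin/env python
# check_connection.py - illustrative sketch; not part of scrapy-rabbitmq.
import pika

# Same shape as RABBITMQ_CONNECTION_PARAMETERS in settings.py.
RABBITMQ_CONNECTION_PARAMETERS = {'host': 'localhost', 'port': 6666}

# pika.ConnectionParameters accepts host and port keyword arguments,
# so the settings dict can be unpacked directly.
connection = pika.BlockingConnection(
    pika.ConnectionParameters(**RABBITMQ_CONNECTION_PARAMETERS))
print('Connected:', connection.is_open)
connection.close()
```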

### Step 2: Add RabbitMQMixin to Spider.

#### Example: multidomain_spider.py

```
from scrapy.contrib.spiders import CrawlSpider
from scrapy_rabbitmq.spiders import RabbitMQMixin

class MultiDomainSpider(RabbitMQMixin, CrawlSpider):
    name = 'multidomain'

    def parse(self, response):
        # parse all the things
        pass
```
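
As a sketch of what `parse` might actually do, the variant below yields an item for each page and follows its links using standard Scrapy APIs; the item fields are hypothetical, not something scrapy-rabbitmq prescribes:

```
import scrapy
from scrapy.contrib.spiders import CrawlSpider
from scrapy_rabbitmq.spiders import RabbitMQMixin

class MultiDomainSpider(RabbitMQMixin, CrawlSpider):
    name = 'multidomain'

    def parse(self, response):
        # Hypothetical fields: yield a plain dict for the item pipeline...
        yield {'url': response.url,
               'title': response.xpath('//title/text()').extract_first()}
        # ...then follow every link on the page back through the scheduler.
        for href in response.xpath('//a/@href').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse)
```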

### Step 3: Run the spider using [scrapy runspider](http://doc.scrapy.org/en/1.0/topics/commands.html)

```
scrapy runspider multidomain_spider.py
```
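
Scrapy's standard `-s` flag can override any of the settings above for a single run, which is handy for pointing a spider at a different queue (the queue name below is illustrative):

```
scrapy runspider multidomain_spider.py -s RABBITMQ_QUEUE_NAME=another_queue
```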

### Step 4: Push URLs to RabbitMQ

#### Example: push_web_page_to_queue.py

```
#!/usr/bin/env python
import pika
import settings

connection = pika.BlockingConnection(
    pika.ConnectionParameters(**settings.RABBITMQ_CONNECTION_PARAMETERS))
channel = connection.channel()

# If the spider has not run yet, the queue may not exist; declare it first:
# channel.queue_declare(queue=settings.RABBITMQ_QUEUE_NAME)

channel.basic_publish(exchange='',
                      routing_key=settings.RABBITMQ_QUEUE_NAME,
                      body='<html>raw html contents<a href="http://twitter.com/roycehaynes">extract url</a></html>')

connection.close()
```
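
To confirm the message landed, a passive `queue_declare` reports the queue depth without modifying the queue. This is an illustrative sketch using plain pika, not part of scrapy-rabbitmq:

```
#!/usr/bin/env python
import pika
import settings

connection = pika.BlockingConnection(
    pika.ConnectionParameters(**settings.RABBITMQ_CONNECTION_PARAMETERS))
channel = connection.channel()

# passive=True only checks the queue; it raises if the queue doesn't exist.
result = channel.queue_declare(queue=settings.RABBITMQ_QUEUE_NAME, passive=True)
print('%d message(s) waiting' % result.method.message_count)

connection.close()
```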

## Contributing and Forking

See the [Contributing Guidelines](CONTRIBUTING.MD).

## Releases

See the [changelog](CHANGELOG.md) for release details.

| Version | Release Date |
| :-----: | :----------: |
| 0.1.0 | 2014-11-14 |
| 0.1.1 | 2015-07-02 |



## Copyright & License

Copyright (c) 2015 Royce Haynes - Released under The MIT License.
