rabbitmq-spider

rabbitmq-spider is an open-source tool that helps with web scraping by using RabbitMQ and Scrapy to distribute and scale scraping tasks across multiple instances.

Inspired by scrapy-redis.

Features

  1. RabbitMQ is used only as the source of messages from which requests are generated; it is not used to implement Scrapy’s queue.
  2. Messages are automatically acknowledged (ack) or negatively acknowledged (nack) based on the response results.
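
The second feature can be pictured with a small, library-independent sketch. The rule used here (ack on a 2xx status, nack otherwise) and the name `decide_delivery_action` are assumptions for illustration; this description does not state which response results lead to which acknowledgement.

```python
def decide_delivery_action(status: int) -> str:
    """Return 'ack' for a 2xx HTTP status and 'nack' for anything else.

    Hypothetical rule for illustration only; the criteria that
    rabbitmq-spider actually applies may differ.
    """
    return "ack" if 200 <= status < 300 else "nack"


if __name__ == "__main__":
    # Show the decision for a few representative status codes.
    for status in (200, 404, 503):
        print(status, decide_delivery_action(status))
```

A nacked message can be redelivered or dead-lettered by the broker depending on how the queue is configured, which is what makes this kind of decision useful for retrying failed pages.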

Installation

pip install rabbitmq_spider

Usage

1. Add the following values to your Scrapy settings:

RABBITMQ_HOST = 'localhost'
RABBITMQ_PORT = '5672'
RABBITMQ_USERNAME = 'guest'
RABBITMQ_PASSWORD = 'guest'
RABBITMQ_VIRTUAL_HOST = '/'

SPIDER_MIDDLEWARES = {
    'rabbitscrape.middlewares.RabbitmqSpiderMiddleware': 49,
}
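
For orientation, the five connection settings correspond to the parts of a standard AMQP URI. The helper below is a hypothetical illustration: `build_amqp_url` is not part of rabbitmq-spider, and the package may well pass these values to its client separately. It only shows how the values fit together, including percent-encoding the default virtual host `/` as `%2F`.

```python
from urllib.parse import quote


def build_amqp_url(host, port, username, password, virtual_host):
    """Combine connection settings into an AMQP URI (illustrative helper)."""
    return "amqp://{user}:{pwd}@{host}:{port}/{vhost}".format(
        # Percent-encode credentials and vhost so special characters are safe.
        user=quote(username, safe=""),
        pwd=quote(password, safe=""),
        host=host,
        port=port,
        vhost=quote(virtual_host, safe=""),
    )


if __name__ == "__main__":
    print(build_amqp_url("localhost", "5672", "guest", "guest", "/"))
```

A URI like this can be handy for checking that the broker is reachable with the same credentials before starting the spider.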

2. Make your spider inherit from RabbitMQSpider:

import json

from rabbitmq_spider.spiders import RabbitMQSpider
from scrapy import Request


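class YourSpider(RabbitMQSpider):
    """Demo spider that turns RabbitMQ messages into Scrapy requests."""
    name = 'demo'
    routing_key = 'demo.queue'

    def make_request_from_data(self, data):
        # Each message body is expected to be JSON with a 'url' field.
        msg_dict = json.loads(data)
        url = msg_dict['url']

        return Request(url)

    def parse(self, response, **kwargs):
        # Log the HTTP status of every response received.
        self.logger.debug(response.status)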
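
The spider above expects each message body to be a JSON object with a `url` key. Below is a minimal sketch of the producer side, assuming UTF-8 encoded JSON bodies. `build_message` and `extract_url` are hypothetical helper names, and actually publishing the bytes with the `demo.queue` routing key is left to whichever RabbitMQ client you use.

```python
import json


def build_message(url: str) -> bytes:
    """Serialize a URL into the JSON body shape the demo spider reads."""
    return json.dumps({"url": url}).encode("utf-8")


def extract_url(data) -> str:
    """Mirror the parsing step of make_request_from_data in the demo spider."""
    return json.loads(data)["url"]


if __name__ == "__main__":
    body = build_message("https://example.com/page")
    # Round trip: the consumer side recovers the URL that was published.
    print(extract_url(body))
```

Keeping the body a JSON object rather than a bare URL string leaves room for extra fields later without changing the message format.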

License

This project is licensed under the MIT License - see the LICENSE file for details.
