rabbitmq-spider

rabbitmq-spider is an open-source tool that helps with web scraping by using RabbitMQ and Scrapy to distribute and scale scraping tasks across multiple instances.

Inspired by scrapy-redis.

Features

  1. It uses RabbitMQ only as the source of messages from which requests are generated; it does not use RabbitMQ to implement Scrapy’s own queue.
  2. It automatically acknowledges (ack) or negatively acknowledges (nack) each message based on the result of the response.
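The second feature can be pictured as a simple decision rule. The sketch below is a hypothetical illustration, not the library's actual code: the names `Outcome` and `decide_outcome`, and the choice to ack only 2xx responses, are assumptions made for the example.

```python
from enum import Enum


class Outcome(Enum):
    ACK = "ack"    # message was handled; the broker can discard it
    NACK = "nack"  # message was not handled; the broker may requeue it


def decide_outcome(status: int) -> Outcome:
    """Hypothetical rule: ack on 2xx responses, nack on anything else."""
    return Outcome.ACK if 200 <= status < 300 else Outcome.NACK
```

Under this assumed rule, a message whose request returns 200 would be acknowledged, while one that returns 503 would be negatively acknowledged so it can be retried.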

Installation

pip install rabbitmq_spider

Usage

1. Add config values:

RABBITMQ_HOST = 'localhost'
RABBITMQ_PORT = '5672'
RABBITMQ_USERNAME = 'guest'
RABBITMQ_PASSWORD = 'guest'
RABBITMQ_VIRTUAL_HOST = '/'

SPIDER_MIDDLEWARES = {
    'rabbitmq_spider.middlewares.RabbitmqSpiderMiddleware': 49,
}
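To make the role of each connection setting concrete, the following sketch combines them into a standard AMQP URI. This only illustrates what the values mean; `build_amqp_url` is a hypothetical helper, and the library may well build its connection differently.

```python
from urllib.parse import quote

RABBITMQ_HOST = 'localhost'
RABBITMQ_PORT = '5672'
RABBITMQ_USERNAME = 'guest'
RABBITMQ_PASSWORD = 'guest'
RABBITMQ_VIRTUAL_HOST = '/'


def build_amqp_url(host, port, username, password, virtual_host):
    """Hypothetical helper: assemble an AMQP URI from the settings above."""
    # Credentials and the virtual host are percent-encoded, so '/' becomes '%2F'.
    return "amqp://{}:{}@{}:{}/{}".format(
        quote(username, safe=''),
        quote(password, safe=''),
        host,
        int(port),  # accepts the port as a string or an integer
        quote(virtual_host, safe=''),
    )


url = build_amqp_url(
    RABBITMQ_HOST,
    RABBITMQ_PORT,
    RABBITMQ_USERNAME,
    RABBITMQ_PASSWORD,
    RABBITMQ_VIRTUAL_HOST,
)
print(url)  # amqp://guest:guest@localhost:5672/%2F
```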

2. Subclass RabbitMQSpider in your spider:

import json

from rabbitmq_spider.spiders import RabbitMQSpider
from scrapy import Request


class YourSpider(RabbitMQSpider):
    """Demo"""
    name = 'demo'
    api = 'demo.queue'

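    def make_request_from_data(self, data):
        # data is the raw message body; this demo expects JSON with a 'url' key.
        msg_dict = json.loads(data)
        url = msg_dict['url']

        return Request(url)

    def parse(self, response, **kwargs):
        # Log the HTTP status code of each response.
        self.logger.debug(response.status)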
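The demo spider expects each message to be a JSON object with a `url` key. A minimal sketch of the producer side might look like the following. It assumes that the `api` attribute names the queue the spider reads from and that messages are published with the pika client; neither is confirmed by this README, and `build_message`, `extract_url`, and `publish` are hypothetical helpers.

```python
import json


def build_message(url: str) -> bytes:
    """Encode a task as the JSON body the demo spider parses."""
    return json.dumps({"url": url}).encode("utf-8")


def extract_url(data) -> str:
    """Mirror the parsing done in make_request_from_data."""
    return json.loads(data)["url"]


def publish(url, queue="demo.queue", host="localhost", port=5672):
    """Publish one task with pika (assumed client; needs a running broker)."""
    import pika  # imported lazily so the helpers above work without pika

    connection = pika.BlockingConnection(
        pika.ConnectionParameters(host=host, port=port)
    )
    try:
        channel = connection.channel()
        # With the default exchange, the routing key is the queue name.
        # The queue must already exist, or the message is dropped.
        channel.basic_publish(
            exchange="", routing_key=queue, body=build_message(url)
        )
    finally:
        connection.close()
```

With a broker running and the queue declared, calling `publish("https://example.com")` would enqueue one task for the spider to pick up.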

License

This project is licensed under the MIT License - see the LICENSE file for details.

Download files

Source Distribution

rabbitmq_spider-0.0.1.tar.gz (5.0 kB)

Uploaded Source

Built Distribution

rabbitmq_spider-0.0.1-py3-none-any.whl (5.0 kB)

Uploaded Python 3

File details

Details for the file rabbitmq_spider-0.0.1.tar.gz.

File metadata

  • Download URL: rabbitmq_spider-0.0.1.tar.gz
  • Upload date:
  • Size: 5.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.1

File hashes

Hashes for rabbitmq_spider-0.0.1.tar.gz
Algorithm    Hash digest
SHA256       72c70a5dc4ffa167985e59d5674a0948383997a3d19263de0f6d7dca5b10ec40
MD5          9217295abfebe428f159b81b59f23313
BLAKE2b-256  f9e955197f4b10d57a055d313d4c3eda4a333e64c68df2243c2860972f922763

File details

Details for the file rabbitmq_spider-0.0.1-py3-none-any.whl.

File hashes

Hashes for rabbitmq_spider-0.0.1-py3-none-any.whl
Algorithm    Hash digest
SHA256       11fd1be89205b92933a8b1560318114aa90e35bb66e9c4ffd7fa6548826dd2c8
MD5          2846d71f11c9d180d8d923d0aa82d944
BLAKE2b-256  1cac9ade27280f1e345aeabb6dfdbe387c54cf5756d415127e5a4406b7816fff
