
RabbitMQ for Distributed Scraping

Project description

A distributed RabbitMQ scheduler for Scrapy.

Installation

Install with pip:

pip install scrapy-rabbitmq-task

Or clone this repository and install it via setup.py:

python setup.py install

Usage

Step 1: add the following options to your project's settings.py

# Use this package's scheduler for the project
SCHEDULER = "scrapy_rabbitmq_scheduler.scheduler.SaaS"

# DSN for the RabbitMQ connection
RABBITMQ_CONNECTION_PARAMETERS = 'amqp://guest:guest@localhost:5672/'

# HTTP status codes that trigger a retry (the request is pushed back onto the queue)
SCHEDULER_REQUEUE_ON_STATUS = [500]

# Downloader middleware that acknowledges a task once it has succeeded
DOWNLOADER_MIDDLEWARES = {
    'scrapy_rabbitmq_scheduler.middleware.RabbitMQMiddleware': 999
}
# Item pipeline that publishes scraped items to RabbitMQ
ITEM_PIPELINES = {
    'scrapy_rabbitmq_scheduler.pipelines.RabbitmqPipeline': 300,
}

Step 2: make your Spider inherit from RabbitSpider

import scrapy
from scrapy_rabbitmq_scheduler.spiders import RabbitSpider

class CustomSpider(RabbitSpider):
    name = 'custom_spider'
    queue_name = 'test_urls'  # queue the spider pulls requests from
    items_key = 'test_item'   # queue scraped items are published to

    def parse(self, response):
        item = ...  # parse the item
        yield item

Step 3: publish tasks to the RabbitMQ queue

#!/usr/bin/env python
import pika
import settings

connection = pika.BlockingConnection(
    pika.URLParameters(settings.RABBITMQ_CONNECTION_PARAMETERS))
channel = connection.channel()

queue_key = 'test_urls'
# Make sure the queue exists before publishing; otherwise messages sent to the
# default exchange with this routing key are silently dropped. The declaration
# is idempotent as long as its arguments match the existing queue.
channel.queue_declare(queue=queue_key, durable=True)

# Read URLs from a file and publish each one to the queue
with open('urls.txt') as f:
    for url in f:
        url = url.strip(' \n\r')
        channel.basic_publish(exchange='',
                              routing_key=queue_key,
                              body=url,
                              properties=pika.BasicProperties(
                                  content_type='text/plain',
                                  delivery_mode=2  # persistent message
                              ))

connection.close()

urls.txt

http://www.baidu.com
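Since RabbitmqPipeline publishes every scraped item to the items queue (items_key = 'test_item' above), you will usually want a standalone consumer to drain that queue. The sketch below is illustrative: the function names consume_items and decode_item are hypothetical, and the item serialization format depends on RabbitmqPipeline — JSON is assumed here purely for the example.

```python
import json

RABBITMQ_DSN = 'amqp://guest:guest@localhost:5672/'
ITEMS_QUEUE = 'test_item'


def decode_item(body):
    """Decode one message body into a dict.

    Assumes items were serialized as UTF-8 JSON; adjust to match the
    pipeline's actual serialization.
    """
    return json.loads(body.decode('utf-8'))


def consume_items(dsn=RABBITMQ_DSN, queue=ITEMS_QUEUE, handle=print):
    """Drain the items queue, acking each message after it is handled."""
    # pika is imported lazily so decode_item stays usable without a broker
    # client installed.
    import pika

    connection = pika.BlockingConnection(pika.URLParameters(dsn))
    channel = connection.channel()
    # Stop after 5 seconds with no new messages instead of blocking forever.
    for method, properties, body in channel.consume(queue,
                                                    inactivity_timeout=5):
        if body is None:  # inactivity timeout hit, queue is drained
            break
        handle(decode_item(body))
        channel.basic_ack(method.delivery_tag)
    channel.cancel()
    connection.close()
```

Calling consume_items() against the same broker the spider uses prints each item as it is consumed; pass a different handle callable to store items instead.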
