RabbitMQ for Distributed Scraping
Project description
A distributed RabbitMQ scheduler for Scrapy.
Installation
Install with pip:
pip install scrapy-rabbitmq-task
Or clone this repository and run setup.py to install:
python setup.py install
Usage
Step 1: Add the following configuration options to your project's settings.py
# Scheduler used by the project
SCHEDULER = "scrapy_rabbitmq_scheduler.scheduler.SaaS"
# RabbitMQ connection DSN
RABBITMQ_CONNECTION_PARAMETERS = 'amqp://guest:guest@localhost:5672/'
# HTTP status codes that trigger a retry (the request is pushed back onto the queue)
SCHEDULER_REQUEUE_ON_STATUS = [500]
# Downloader middleware that acknowledges whether a task succeeded
DOWNLOADER_MIDDLEWARES = {
    'scrapy_rabbitmq_scheduler.middleware.RabbitMQMiddleware': 999
}
# Item pipeline: scraped items are published to RabbitMQ
ITEM_PIPELINES = {
    'scrapy_rabbitmq_scheduler.pipelines.RabbitmqPipeline': 300,
}
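RabbitmqPipeline publishes whatever the spider yields to the item queue, so an ordinary Scrapy item definition is all the project needs. A minimal sketch, assuming a hypothetical TestItem with illustrative fields (none of these names are prescribed by the library):

# items.py -- hypothetical item definition; the class name and
# fields are examples only, not required by scrapy_rabbitmq_scheduler.
import scrapy


class TestItem(scrapy.Item):
    url = scrapy.Field()    # example field: page URL
    title = scrapy.Field()  # example field: page title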
Step 2: Make your spider inherit from RabbitSpider
import scrapy
from scrapy_rabbitmq_scheduler.spiders import RabbitSpider


class CustomSpider(RabbitSpider):
    name = 'custom_spider'
    queue_name = 'test_urls'  # name of the task (request) queue
    items_key = 'test_item'   # name of the item queue

    def parse(self, response):
        item = ...  # parse the item here
        yield item
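Because SCHEDULER points at the RabbitMQ-backed scheduler, follow-up requests yielded from a callback should be scheduled through the broker as well, so link following works as in a plain Scrapy spider. A sketch under that assumption; the spider name, CSS selectors, and the dict item fields are illustrative only (Scrapy also accepts plain dicts as items):

from scrapy_rabbitmq_scheduler.spiders import RabbitSpider


class LinkFollowingSpider(RabbitSpider):
    name = 'link_following_spider'  # hypothetical spider name
    queue_name = 'test_urls'
    items_key = 'test_item'

    def parse(self, response):
        # Yield a plain dict item; RabbitmqPipeline publishes it to items_key.
        yield {
            'url': response.url,
            'title': response.css('title::text').get(),  # assumed page structure
        }
        # Follow-up requests go back through the configured scheduler.
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse)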
Step 3: Push tasks (URLs) into the RabbitMQ queue
#!/usr/bin/env python
import pika
import settings

connection = pika.BlockingConnection(
    pika.URLParameters(settings.RABBITMQ_CONNECTION_PARAMETERS))
channel = connection.channel()
queue_key = 'test_urls'

# Read URLs from the file and publish each one to the task queue
with open('urls.txt') as f:
    for url in f:
        url = url.strip(' \n\r')
        channel.basic_publish(exchange='',
                              routing_key=queue_key,
                              body=url,
                              properties=pika.BasicProperties(
                                  content_type='text/plain',
                                  delivery_mode=2  # persistent message
                              ))

connection.close()
urls.txt (one URL per line):
http://www.baidu.com
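On the consuming side, scraped items end up in the queue named by items_key ('test_item' above). The message format is whatever RabbitmqPipeline produces, so this hypothetical consumer only acknowledges each message and prints its raw body; treat it as a starting point, not part of the library:

#!/usr/bin/env python
# Hypothetical item consumer: print raw message bodies from the item queue.
# No deserialization is attempted because the wire format is defined by
# RabbitmqPipeline, not by this script.
import pika
import settings

connection = pika.BlockingConnection(
    pika.URLParameters(settings.RABBITMQ_CONNECTION_PARAMETERS))
channel = connection.channel()


def on_message(ch, method, properties, body):
    print(body)  # raw serialized item
    ch.basic_ack(delivery_tag=method.delivery_tag)


# The 'test_item' queue is assumed to already exist (e.g. after the spider has run).
channel.basic_consume(queue='test_item', on_message_callback=on_message)
channel.start_consuming()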
File details
Details for the file scrapy_rabbitmq_task-1.0.1.tar.gz.
File metadata
- Download URL: scrapy_rabbitmq_task-1.0.1.tar.gz
- Upload date:
- Size: 9.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.8.5
File hashes
Algorithm | Hash digest
---|---
SHA256 | eee3134a57dae7d6480b9621f7e417f740819a1adcdfe13840fc1af5c800a0a4
MD5 | 939bbba16d02f5235017c2584f5845cc
BLAKE2b-256 | 7b7b2f5c870da1595c7655cac73dd10e7077b5d9400af6a9eeef93df55185701
File details
Details for the file scrapy_rabbitmq_task-1.0.1-py3-none-any.whl.
File metadata
- Download URL: scrapy_rabbitmq_task-1.0.1-py3-none-any.whl
- Upload date:
- Size: 10.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.8.5
File hashes
Algorithm | Hash digest
---|---
SHA256 | 3ed0cccd1e13f9b0bdd75408a5e2fd1341b1af2f26b8218bd9ed9dac0f42e28b
MD5 | fb34fbacd49d732942fe0dae0d69bae7
BLAKE2b-256 | 8ff4826074267c46c6ae58e92bfa1f8001c76781c58828411a4c659a4ea003d9