A series of distributed components for the Scrapy framework
Scrapy-Distributed
Scrapy-Distributed is a series of components that makes it easy to develop a distributed crawler based on Scrapy.
Scrapy-Distributed currently supports a RabbitMQ Scheduler, a Kafka Scheduler, and a RedisBloom DupeFilter. You can use any of them in your Scrapy project.
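The idea behind combining a shared scheduler queue with a shared dupefilter can be sketched in plain Python. This is only an illustration, not how Scrapy-Distributed is implemented: the `deque` stands in for the broker queue (RabbitMQ or Kafka), the `set` stands in for the RedisBloom filter, and `LINKS`, `schedule`, and `run` are hypothetical names invented for this sketch.

```python
from collections import deque

# Hypothetical link graph standing in for real HTTP responses.
LINKS = {
    "/": ["/a", "/b"],
    "/a": ["/b", "/c"],
    "/b": ["/c"],
    "/c": ["/"],
}

shared_queue = deque()  # stands in for the broker queue (RabbitMQ / Kafka)
shared_seen = set()     # stands in for the shared dupefilter (RedisBloom)


def schedule(url):
    """Enqueue url unless some worker has already scheduled it."""
    if url in shared_seen:
        return False
    shared_seen.add(url)
    shared_queue.append(url)
    return True


def run(workers):
    """Let workers take turns pulling from the shared queue until it is empty."""
    crawled = {name: [] for name in workers}
    schedule("/")
    turn = 0
    while shared_queue:
        name = workers[turn % len(workers)]
        url = shared_queue.popleft()
        crawled[name].append(url)
        # Links discovered by one worker become visible to all workers.
        for link in LINKS[url]:
            schedule(link)
        turn += 1
    return crawled


result = run(["worker-1", "worker-2"])
for name, urls in result.items():
    print(name, urls)
```

Because both the queue and the seen-set are shared, every URL is crawled exactly once no matter which worker picks it up.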
Features
- RabbitMQ Scheduler
  - Supports custom declaration of a RabbitMQ queue, with options such as passive, durable, exclusive, auto_delete, and all other options.
- RabbitMQ Pipeline
  - Supports custom declaration of a RabbitMQ queue for the spider's items, with options such as passive, durable, exclusive, auto_delete, and all other options.
- Kafka Scheduler
  - Supports custom declaration of a Kafka topic, with options such as num_partitions and replication_factor; other options will be supported later.
- RedisBloom DupeFilter
  - Supports customizing the key, errorRate, capacity, expansion, and auto-scaling (noScale) of a Bloom filter.
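To give a feel for what the errorRate and capacity options trade off, here is a minimal in-memory Bloom filter sized with the textbook formulas. It is a sketch under stated assumptions: `bloom_size` and `TinyBloom` are hypothetical names for this illustration, and RedisBloom may size and scale its filters differently.

```python
import hashlib
import math


def bloom_size(capacity, error_rate):
    """Return (bits, hash_count) from the textbook Bloom filter formulas."""
    bits = math.ceil(-capacity * math.log(error_rate) / (math.log(2) ** 2))
    hashes = max(1, round(bits / capacity * math.log(2)))
    return bits, hashes


class TinyBloom:
    """Illustrative Bloom filter; not the RedisBloom implementation."""

    def __init__(self, capacity, error_rate):
        self.bits, self.hashes = bloom_size(capacity, error_rate)
        self.array = bytearray((self.bits + 7) // 8)

    def _positions(self, value):
        # Derive all bit positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.bits for i in range(self.hashes)]

    def add(self, value):
        """Insert value; return True if it was (probably) already present."""
        seen = True
        for pos in self._positions(value):
            byte, bit = divmod(pos, 8)
            if not self.array[byte] & (1 << bit):
                seen = False
                self.array[byte] |= 1 << bit
        return seen


seen_urls = TinyBloom(capacity=1000, error_rate=0.001)
print(seen_urls.add("https://example.com/a"))  # first visit
print(seen_urls.add("https://example.com/a"))  # repeat visit
```

With an error rate of 0.001 and a capacity of one million, `bloom_size` gives about 14.4 million bits (roughly 1.8 MB) and 10 hash functions. A lower error rate or a higher capacity costs more memory.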
Requirements
- Python >= 3.6
- Scrapy >= 1.8.0
- Pika >= 1.0.0
- RedisBloom >= 0.2.0
- Redis >= 3.0.1
- kafka-python >= 1.4.7
TODO
- RabbitMQ Item Pipeline
- Support Delayed Message in RabbitMQ Scheduler
- Support Scheduler Serializer
- Custom Interface for DupeFilter
- RocketMQ Scheduler
- RocketMQ Item Pipeline
- SQLAlchemy Item Pipeline
- MongoDB Item Pipeline
- Kafka Scheduler
- Kafka Item Pipeline
Usage
Step 0:
pip install scrapy-distributed
OR
git clone https://github.com/Insutanto/scrapy-distributed.git && cd scrapy-distributed && python setup.py install
There is a simple demo in examples/simple_example. Here is the quickest way to use Scrapy-Distributed.
Examples of RabbitMQ
# pull and run a RabbitMQ container.
docker run -d --name rabbitmq -p 0.0.0.0:15672:15672 -p 0.0.0.0:5672:5672 rabbitmq:3
# enable rabbitmq_management
docker exec -it <rabbitmq-container-id> /bin/bash
cd /etc/rabbitmq/
rabbitmq-plugins enable rabbitmq_management
# pull and run a RedisBloom container.
docker run -d --name redis-redisbloom -p 6379:6379 redislabs/rebloom:latest
cd examples/rabbitmq_example
python run_simple_example.py
Examples of Kafka
# make sure you have a Kafka running on localhost:9092
# pull and run a RedisBloom container.
docker run -d --name redis-redisbloom -p 6379:6379 redislabs/rebloom:latest
cd examples/kafka_example
python run_simple_example.py
RabbitMQ Support
If you don't have the required environment for tests:
# pull and run a RabbitMQ container.
docker run -d --name rabbitmq -p 0.0.0.0:15672:15672 -p 0.0.0.0:5672:5672 rabbitmq:3
# enable rabbitmq_management
docker exec -it <rabbitmq-container-id> /bin/bash
cd /etc/rabbitmq/
rabbitmq-plugins enable rabbitmq_management
# pull and run a RedisBloom container.
docker run -d --name redis-redisbloom -p 6379:6379 redislabs/rebloom:latest
Step 1:
Just change SCHEDULER and DUPEFILTER_CLASS and add a few settings, and you get a distributed crawler right away.
SCHEDULER = "scrapy_distributed.schedulers.DistributedScheduler"
SCHEDULER_QUEUE_CLASS = "scrapy_distributed.queues.amqp.RabbitQueue"
RABBITMQ_CONNECTION_PARAMETERS = "amqp://guest:guest@localhost:5672/example/?heartbeat=0"
DUPEFILTER_CLASS = "scrapy_distributed.dupefilters.redis_bloom.RedisBloomDupeFilter"
BLOOM_DUPEFILTER_REDIS_URL = "redis://:@localhost:6379/0"
BLOOM_DUPEFILTER_REDIS_HOST = "localhost"
BLOOM_DUPEFILTER_REDIS_PORT = 6379
REDIS_BLOOM_PARAMS = {
    "redis_cls": "redisbloom.client.Client"
}
BLOOM_DUPEFILTER_ERROR_RATE = 0.001
BLOOM_DUPEFILTER_CAPACITY = 1_000_000
# Disable RedirectMiddleware, because RabbitMiddleware handles redirected requests.
DOWNLOADER_MIDDLEWARES = {
    ...
    "scrapy.downloadermiddlewares.redirect.RedirectMiddleware": None,
    "scrapy_distributed.middlewares.amqp.RabbitMiddleware": 542
}
# Add RabbitPipeline, which pushes your items to a RabbitMQ queue.
ITEM_PIPELINES = {
    ...
    'scrapy_distributed.pipelines.amqp.RabbitPipeline': 301,
}
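The RABBITMQ_CONNECTION_PARAMETERS value above is an AMQP URL. As a rough aid for adapting it to your own broker, the sketch below splits it into its parts using only the standard library. `describe_amqp_url` is a hypothetical helper written for this example; how the path is mapped to a virtual host is decided by the AMQP client, not by this code, and a heartbeat of 0 typically disables heartbeats in clients such as Pika.

```python
from urllib.parse import parse_qs, urlsplit


def describe_amqp_url(url):
    """Split an AMQP URL into its components (illustrative helper)."""
    parts = urlsplit(url)
    # parse_qs returns a list per key; keep the first value of each.
    query = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return {
        "scheme": parts.scheme,
        "username": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,
        "path": parts.path,  # the client derives the virtual host from this
        "query": query,
    }


info = describe_amqp_url("amqp://guest:guest@localhost:5672/example/?heartbeat=0")
for key, value in info.items():
    print(key, "=", value)
```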
Step 2:
scrapy crawl <your_spider>
Kafka Support
Step 1:
SCHEDULER = "scrapy_distributed.schedulers.DistributedScheduler"
SCHEDULER_QUEUE_CLASS = "scrapy_distributed.queues.kafka.KafkaQueue"
KAFKA_CONNECTION_PARAMETERS = "localhost:9092"
DUPEFILTER_CLASS = "scrapy_distributed.dupefilters.redis_bloom.RedisBloomDupeFilter"
BLOOM_DUPEFILTER_REDIS_URL = "redis://:@localhost:6379/0"
BLOOM_DUPEFILTER_REDIS_HOST = "localhost"
BLOOM_DUPEFILTER_REDIS_PORT = 6379
REDIS_BLOOM_PARAMS = {
    "redis_cls": "redisbloom.client.Client"
}
BLOOM_DUPEFILTER_ERROR_RATE = 0.001
BLOOM_DUPEFILTER_CAPACITY = 1_000_000
DOWNLOADER_MIDDLEWARES = {
    ...
    "scrapy_distributed.middlewares.kafka.KafkaMiddleware": 542
}
Step 2:
scrapy crawl <your_spider>
Reference Projects
- scrapy-rabbitmq-link
- scrapy-redis