A series of distributed components for the Scrapy framework
Project description
Scrapy-Distributed
Scrapy-Distributed is a series of components that let you build a distributed crawler based on Scrapy in an easy way.
Scrapy-Distributed currently supports a RabbitMQ Scheduler, a Kafka Scheduler, and a RedisBloom DupeFilter. You can use any of them in your Scrapy project very easily.
Features
- RabbitMQ Scheduler
  - Supports custom declaration of a RabbitMQ queue, including passive, durable, exclusive, auto_delete, and all other options.
- RabbitMQ Pipeline
  - Supports custom declaration of a RabbitMQ queue for the spider's items, with the same options (passive, durable, exclusive, auto_delete, and so on).
- Kafka Scheduler
  - Supports custom declaration of a Kafka topic, including num_partitions and replication_factor; other options will be supported later.
- RedisBloom DupeFilter
  - Supports customizing the key, errorRate, capacity, expansion, and auto-scaling (noScale) of a bloom filter.
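To see what the errorRate and capacity options of the dupe filter mean in practice, here is a minimal in-memory sketch of how a bloom filter decides whether a URL was already seen. This is a toy illustration using only the standard library, not the RedisBloom client or Scrapy-Distributed's actual fingerprinting:

```python
import hashlib
import math


class ToyBloomFilter:
    """In-memory toy bloom filter (illustration only, not the RedisBloom client)."""

    def __init__(self, capacity: int, error_rate: float):
        # Standard sizing: m bits and k hash functions for n items at error rate p.
        self.num_bits = math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2)
        self.num_hashes = max(1, round(self.num_bits / capacity * math.log(2)))
        self.bits = 0  # a big int used as a bit array

    def _positions(self, key: str):
        # Derive k positions from salted SHA-1 digests of the key.
        for i in range(self.num_hashes):
            digest = hashlib.sha1(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> bool:
        """Add a key; return True if it was (probably) already present."""
        seen = True
        for pos in self._positions(key):
            if not (self.bits >> pos) & 1:
                seen = False
                self.bits |= 1 << pos
        return seen


bf = ToyBloomFilter(capacity=1_000_000, error_rate=0.001)
assert bf.add("https://example.com/page/1") is False  # first sight
assert bf.add("https://example.com/page/1") is True   # duplicate
```

A bloom filter never yields false negatives (a seen URL is always reported as seen), only false positives at roughly the configured error rate, which is why it is a good fit for crawl deduplication.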
Requirements
- Python >= 3.6
- Scrapy >= 1.8.0
- Pika >= 1.0.0
- RedisBloom >= 0.2.0
- Redis >= 3.0.1
- kafka-python >= 1.4.7
TODO
- RabbitMQ Item Pipeline (done)
- Kafka Scheduler (done)
- Kafka Item Pipeline (done)
- Support Delayed Message in RabbitMQ Scheduler
- Support Scheduler Serializer
- Custom Interface for DupeFilter
- RocketMQ Scheduler
- RocketMQ Item Pipeline
- SQLAlchemy Item Pipeline
- MongoDB Item Pipeline
Usage
Step 0:
pip install scrapy-distributed
OR
git clone https://github.com/Insutanto/scrapy-distributed.git && cd scrapy-distributed && python setup.py install
There is a simple demo in examples/simple_example. Here is the fast way to use Scrapy-Distributed.
Examples of RabbitMQ
If you don't have the required environment for tests:
# pull and run a RabbitMQ container.
docker run -d --name rabbitmq -p 0.0.0.0:15672:15672 -p 0.0.0.0:5672:5672 rabbitmq:3-management
# pull and run a RedisBloom container.
docker run -d --name redisbloom -p 6379:6379 redis/redis-stack
cd examples/rabbitmq_example
python run_simple_example.py
Or you can use docker compose:
docker compose -f ./docker-compose.dev.yaml up -d
cd examples/rabbitmq_example
python run_simple_example.py
Examples of Kafka
If you don't have the required environment for tests:
# make sure you have a Kafka running on localhost:9092
# pull and run a RedisBloom container.
docker run -d --name redisbloom -p 6379:6379 redis/redis-stack
cd examples/kafka_example
python run_simple_example.py
Or you can use docker compose:
docker compose -f ./docker-compose.dev.yaml up -d
cd examples/kafka_example
python run_simple_example.py
RabbitMQ Support
If you don't have the required environment for tests:
# pull and run a RabbitMQ container.
docker run -d --name rabbitmq -p 0.0.0.0:15672:15672 -p 0.0.0.0:5672:5672 rabbitmq:3-management
# pull and run a RedisBloom container.
docker run -d --name redisbloom -p 6379:6379 redis/redis-stack
Or you can use docker compose:
docker compose -f ./docker-compose.dev.yaml up -d
Step 1:
Just change SCHEDULER and DUPEFILTER_CLASS and add a few settings, and you get a distributed crawler in a moment.
SCHEDULER = "scrapy_distributed.schedulers.DistributedScheduler"
SCHEDULER_QUEUE_CLASS = "scrapy_distributed.queues.amqp.RabbitQueue"
RABBITMQ_CONNECTION_PARAMETERS = "amqp://guest:guest@localhost:5672/example/?heartbeat=0"
DUPEFILTER_CLASS = "scrapy_distributed.dupefilters.redis_bloom.RedisBloomDupeFilter"
BLOOM_DUPEFILTER_REDIS_URL = "redis://:@localhost:6379/0"
BLOOM_DUPEFILTER_REDIS_HOST = "localhost"
BLOOM_DUPEFILTER_REDIS_PORT = 6379
REDIS_BLOOM_PARAMS = {
"redis_cls": "redisbloom.client.Client"
}
BLOOM_DUPEFILTER_ERROR_RATE = 0.001
BLOOM_DUPEFILTER_CAPACITY = 100_0000
# disable the RedirectMiddleware, because the RabbitMiddleware can handle redirect requests.
DOWNLOADER_MIDDLEWARES = {
...
"scrapy.downloadermiddlewares.redirect.RedirectMiddleware": None,
"scrapy_distributed.middlewares.amqp.RabbitMiddleware": 542
}
# add RabbitPipeline; it will push your items to RabbitMQ's queue.
ITEM_PIPELINES = {
...
'scrapy_distributed.pipelines.amqp.RabbitPipeline': 301,
}
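As a sanity check on BLOOM_DUPEFILTER_ERROR_RATE and BLOOM_DUPEFILTER_CAPACITY, the standard bloom-filter sizing formulas give a back-of-the-envelope estimate of how much memory such a filter needs. This is only an approximation; RedisBloom's internal allocation may differ:

```python
import math


def bloom_sizing(capacity: int, error_rate: float):
    """Rough bit count and hash count for a bloom filter.

    Uses the standard formulas m = -n * ln(p) / (ln 2)^2 and k = (m / n) * ln 2.
    """
    bits = math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2)
    hashes = round(bits / capacity * math.log(2))
    return bits, hashes


bits, hashes = bloom_sizing(capacity=1_000_000, error_rate=0.001)
print(f"{bits} bits (~{bits / 8 / 1024 / 1024:.1f} MiB), {hashes} hash functions")
# → about 14.4 million bits (~1.7 MiB) and 10 hash functions
```

So the defaults above (one million URLs at a 0.1% false-positive rate) cost on the order of a couple of megabytes in Redis, which is cheap compared with storing every fingerprint exactly.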
Step 2:
scrapy crawl <your_spider>
Kafka Support
Step 1:
SCHEDULER = "scrapy_distributed.schedulers.DistributedScheduler"
SCHEDULER_QUEUE_CLASS = "scrapy_distributed.queues.kafka.KafkaQueue"
KAFKA_CONNECTION_PARAMETERS = "localhost:9092"
DUPEFILTER_CLASS = "scrapy_distributed.dupefilters.redis_bloom.RedisBloomDupeFilter"
BLOOM_DUPEFILTER_REDIS_URL = "redis://:@localhost:6379/0"
BLOOM_DUPEFILTER_REDIS_HOST = "localhost"
BLOOM_DUPEFILTER_REDIS_PORT = 6379
REDIS_BLOOM_PARAMS = {
"redis_cls": "redisbloom.client.Client"
}
BLOOM_DUPEFILTER_ERROR_RATE = 0.001
BLOOM_DUPEFILTER_CAPACITY = 100_0000
DOWNLOADER_MIDDLEWARES = {
...
"scrapy_distributed.middlewares.kafka.KafkaMiddleware": 542
}
Step 2:
scrapy crawl <your_spider>
Reference Projects
- scrapy-rabbitmq-link
- scrapy-redis
File details
Details for the file Scrapy-Distributed-2023.4.23.tar.gz.
File metadata
- Download URL: Scrapy-Distributed-2023.4.23.tar.gz
- Upload date:
- Size: 15.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.3
File hashes
Algorithm | Hash digest
---|---
SHA256 | c5e7ef8f9b19a9e65db89d0b44073cfb26e57f7c77b01a78b862839cfb2a7ac7
MD5 | d37a12aaf0d429bcfac6432b416392c4
BLAKE2b-256 | ea4f45752f122a7eef6e33ebae677a355a17267fe86c1cf7a69431259d62a6af
File details
Details for the file Scrapy_Distributed-2023.4.23-py3-none-any.whl.
File metadata
- Download URL: Scrapy_Distributed-2023.4.23-py3-none-any.whl
- Upload date:
- Size: 24.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.3
File hashes
Algorithm | Hash digest
---|---
SHA256 | 39ceeaabffb14cdbad7e73dc7e46d4c73bb2d2e252c3d60c01761b75ffd87096
MD5 | 6d072ffeba861b61e06f9c62e17819e5
BLAKE2b-256 | 09cf1b35f0be93d52d49a560eaa3c73d26e8a6975058a2b2cc3733e2e91ae7f7