Run Scrapy Distributed
======================

Scrapy Eagle is a tool for running any project based on Scrapy_ in a distributed fashion, and for monitoring its progress and the resources it consumes on each server.

**This project is under development; do not use it yet.**


Scrapy Eagle uses Redis_ as its distributed queue, so you will need a running Redis instance.

Scrapy Eagle itself can be installed with pip:

.. code-block:: console

    $ pip install scrapy-eagle

Create a ``configparser``-style configuration file (e.g. ``/etc/scrapy-eagle.ini``) containing:

.. code-block:: ini

    host =
    port = 6379
    db = 0

    debug = True
    cookie_secret_key = ha74h3hdh42a
    host =
    port = 5000

    base_dir = /project_venv/project/project
    binary = /project_venv/bin/scrapy
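
A file of this kind can be read with Python's standard ``configparser`` module. The sketch below is illustrative only: ``configparser`` expects every key to sit under a ``[section]`` header, and the section names ``[redis]``, ``[server]`` and ``[scrapy]``, the ``127.0.0.1`` host values and the ``load_settings`` helper are all assumptions made for the example, not names confirmed by Scrapy Eagle.

.. code-block:: python

    import configparser

    # Hypothetical file contents: section names and host values are assumptions.
    EXAMPLE_INI = """
    [redis]
    host = 127.0.0.1
    port = 6379
    db = 0

    [server]
    debug = True
    cookie_secret_key = ha74h3hdh42a
    host = 127.0.0.1
    port = 5000

    [scrapy]
    base_dir = /project_venv/project/project
    binary = /project_venv/bin/scrapy
    """


    def load_settings(text):
        """Parse INI text and return typed values grouped by section."""
        parser = configparser.ConfigParser()
        parser.read_string(text)
        return {
            "redis": {
                "host": parser.get("redis", "host"),
                "port": parser.getint("redis", "port"),
                "db": parser.getint("redis", "db"),
            },
            "server": {
                "debug": parser.getboolean("server", "debug"),
                "host": parser.get("server", "host"),
                "port": parser.getint("server", "port"),
            },
            # Paths are kept as plain strings.
            "scrapy": dict(parser["scrapy"]),
        }


    settings = load_settings(EXAMPLE_INI)

Reading the values through ``getint`` and ``getboolean`` surfaces a malformed port or flag as an error at load time rather than later, when the server starts.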

You can then run the ``eagle_server`` command:

.. code-block:: console

    eagle_server --config-file=/etc/scrapy-eagle.ini

Changes to your Scrapy project
------------------------------

Enable the components in the settings of your Scrapy project:

.. code-block:: python

    # Enable scheduling by storing the requests queue in Redis.
    SCHEDULER = "scrapy_eagle.worker.scheduler.DistributedScheduler"

    # Ensure all spiders share the same duplicates filter through Redis.
    DUPEFILTER_CLASS = "scrapy_eagle.worker.dupefilter.RFPDupeFilter"

    # Schedule requests using a priority queue. (default)
    SCHEDULER_QUEUE_CLASS = "scrapy_eagle.worker.queue.SpiderPriorityQueue"

    # Alternative: schedule requests using a queue (FIFO).
    # SCHEDULER_QUEUE_CLASS = "scrapy_eagle.worker.queue.SpiderQueue"

    # Alternative: schedule requests using a stack (LIFO).
    # SCHEDULER_QUEUE_CLASS = "scrapy_eagle.worker.queue.SpiderStack"

    # Max idle time to prevent the spider from being closed when distributed crawling.
    # This only works if the queue class is SpiderQueue or SpiderStack,
    # and may also block for the same time when your spider starts for the first time (because the queue is empty).

    # Specify the host and port to use when connecting to Redis (optional).
    REDIS_HOST = 'localhost'
    REDIS_PORT = 6379

    # Specify the full Redis URL for connecting (optional).
    # If set, this takes precedence over the REDIS_HOST and REDIS_PORT settings.
    REDIS_URL = "redis://user:pass@hostname:9001"
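
The precedence rule stated in the last comment can be pictured with a small helper. This is a sketch of the described behaviour, not Scrapy Eagle's actual code: ``resolve_redis_target`` is a hypothetical function, and the fallback values ``localhost`` and ``6379`` are assumptions.

.. code-block:: python

    from urllib.parse import urlparse


    def resolve_redis_target(settings):
        """Return (host, port) for a settings dict, letting REDIS_URL win.

        Hypothetical helper: it only mirrors the precedence rule described
        in the settings comments.
        """
        url = settings.get("REDIS_URL")
        if url:
            parsed = urlparse(url)
            # Fall back to the assumed default port when the URL omits one.
            return parsed.hostname, parsed.port or 6379
        return (
            settings.get("REDIS_HOST", "localhost"),
            settings.get("REDIS_PORT", 6379),
        )


    # Only the host is set: the assumed default port is used.
    only_host = resolve_redis_target({"REDIS_HOST": "localhost"})

    # Both are set: the URL overrides the host setting.
    with_url = resolve_redis_target({
        "REDIS_HOST": "localhost",
        "REDIS_URL": "redis://user:pass@hostname:9001",
    })

In practice this means that a leftover ``REDIS_URL`` silently overrides any host or port you change later, so it is worth keeping only one of the two styles in a given settings file.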

Once the configuration is finished, adapt each spider to use the mixin:

.. code-block:: python

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy_eagle.worker.spiders import DistributedMixin


    class YourSpider(DistributedMixin, CrawlSpider):

        name = ""

        # start_urls = ['']
        redis_key = ''

        rules = (
            # Your Rule(...) definitions go here.
        )

        def _set_crawler(self, crawler):
            CrawlSpider._set_crawler(self, crawler)

Dashboard Development
---------------------

To change the client side you need NPM_ installed, because the interface is built with ReactJS_. Install all dependencies locally:

.. code-block:: console

    cd scrapy-eagle/dashboard
    npm install

Then run ``npm start`` to compile the dashboard and watch for changes, recompiling automatically.

To make the dashboard easier to test, you can use a simple HTTP server instead of running ``eagle_server``:

.. code-block:: console

    sudo npm install -g http-server
    cd scrapy-eagle/dashboard
    http-server templates/

It would be available for you at

**Note**: So far, Scrapy Eagle is mostly based on

