Scrapy spider middleware to use Scrapinghub's Hub Crawl Frontier as a backend for URLs
Project description
This Scrapy spider middleware uses the HCF backend from Scrapinghub's Scrapy Cloud service to retrieve new URLs to crawl and to store the extracted links back in the frontier.
Installation
Install scrapy-hcf using pip:
$ pip install scrapy-hcf
Configuration
To activate this middleware, add it to the SPIDER_MIDDLEWARES dict:
SPIDER_MIDDLEWARES = { 'scrapy_hcf.HcfMiddleware': 543, }
The following settings must also be defined:
- HS_AUTH
- Scrapy Cloud API key
- HS_PROJECTID
- Scrapy Cloud project ID (not needed if the spider is run on Dash).
- HS_FRONTIER
- Frontier name.
- HS_CONSUME_FROM_SLOT
- Slot from where the spider will read new URLs.
Note that HS_FRONTIER and HS_CONSUME_FROM_SLOT can be overridden from inside a spider using the spider attributes hs_frontier and hs_consume_from_slot respectively.
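As a rough sketch, the required settings could look like this in a project's settings.py. Every value below is a placeholder chosen for illustration (the key, project ID, frontier name and slot name are assumptions, not values from this package):

```python
# settings.py (sketch): required scrapy-hcf settings.
# All values are hypothetical placeholders; substitute your own.

HS_AUTH = "YOUR_SCRAPY_CLOUD_API_KEY"  # Scrapy Cloud API key
HS_PROJECTID = "12345"                 # Scrapy Cloud project ID
HS_FRONTIER = "my-frontier"            # frontier name
HS_CONSUME_FROM_SLOT = "0"             # slot the spider reads new URLs from
```

A spider that needs a different frontier or slot can set the hs_frontier and hs_consume_from_slot attributes on the spider class instead.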
The following optional Scrapy settings can be defined:
- HS_ENDPOINT
- URL of the API endpoint, e.g. http://localhost:8003. The default value is provided by the python-hubstorage package.
- HS_MAX_LINKS
- Number of links to be read from the HCF. The default is 1000.
- HS_START_JOB_ENABLED
- Whether to start a new job when the spider finishes. The default is False.
- HS_START_JOB_ON_REASON
- List of closing reasons. If the spider ends with any of these reasons, a new job will be started for the same slot. The default is ['finished'].
- HS_NUMBER_OF_SLOTS
- Number of slots that the middleware will use to store the new links. The default is 8.
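For reference, the optional settings might be written out as below. HS_ENDPOINT uses the example URL from the description above, and the remaining values are the documented defaults, so setting them explicitly is only needed when changing them:

```python
# settings.py (sketch): optional scrapy-hcf settings.

HS_ENDPOINT = "http://localhost:8003"  # example endpoint; omit to use the python-hubstorage default
HS_MAX_LINKS = 1000                    # number of links read from the HCF
HS_START_JOB_ENABLED = False           # start a new job when the spider finishes
HS_START_JOB_ON_REASON = ["finished"]  # closing reasons that trigger a new job
HS_NUMBER_OF_SLOTS = 8                 # slots used to store new links
```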
Usage
The following keys can be defined in a Scrapy Request's meta to control the behavior of the HCF middleware:
- 'use_hcf'
- If set to True the request will be stored in the HCF.
- 'hcf_params'
- Dictionary of parameters to be stored in the HCF with the request fingerprint. It accepts the following keys:
  - 'qdata'
  - Data to be stored along with the fingerprint in the request queue.
  - 'fdata'
  - Data to be stored along with the fingerprint in the fingerprint set.
  - 'p'
  - Priority. Lower priority numbers are returned first. The default is 0.
The value of the 'qdata' parameter can be retrieved later using response.meta['hcf_params']['qdata'].
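The meta keys above can be assembled as in the following sketch. The helper build_hcf_meta and the sample qdata payload are hypothetical names introduced here for illustration; they are not part of scrapy-hcf:

```python
def build_hcf_meta(qdata=None, fdata=None, priority=0):
    """Build a Request meta dict asking the middleware to store the request in the HCF.

    Hypothetical helper for illustration; not provided by scrapy-hcf.
    """
    hcf_params = {"p": priority}  # lower numbers are returned first
    if qdata is not None:
        hcf_params["qdata"] = qdata  # stored in the request queue
    if fdata is not None:
        hcf_params["fdata"] = fdata  # stored in the fingerprint set
    return {"use_hcf": True, "hcf_params": hcf_params}


meta = build_hcf_meta(qdata={"depth": 2}, priority=1)
# In a spider callback the dict would be passed along with the request,
# e.g. scrapy.Request(url, meta=meta)
```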
The spider can override the default slot assignment function by setting its slot_callback attribute to a function with the following signature:
def slot_callback(request):
    ...
    return slot
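One possible slot function, shown only as a sketch, keeps every URL from the same host in the same slot by hashing the hostname. It assumes slots are named by the strings '0' to '7' (matching the default HS_NUMBER_OF_SLOTS of 8) and that the request object exposes a url attribute, as Scrapy requests do. NUMBER_OF_SLOTS is a name introduced here for illustration:

```python
import hashlib
from urllib.parse import urlparse

NUMBER_OF_SLOTS = 8  # assumption: kept in sync with HS_NUMBER_OF_SLOTS


def slot_callback(request):
    """Map a request to a slot name based on the hostname of its URL."""
    hostname = urlparse(request.url).hostname or ""
    digest = hashlib.md5(hostname.encode("utf-8")).hexdigest()
    # Reduce the hash to a slot index and return it as a string
    return str(int(digest, 16) % NUMBER_OF_SLOTS)
```

Grouping by hostname is a design choice rather than a requirement. Any function that takes a request and returns a slot fits the signature above.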