
Scrapy spider middleware to use Scrapinghub's Hub Crawl Frontier as a backend for URLs

Project description


This Scrapy spider middleware uses the HCF backend of Scrapinghub’s Scrapy Cloud service to retrieve new URLs to crawl and to store the extracted links back into the frontier.

Installation

Install scrapy-hcf using pip:

$ pip install scrapy-hcf

Configuration

To activate this middleware, add it to the SPIDER_MIDDLEWARES setting:

SPIDER_MIDDLEWARES = {
    'scrapy_hcf.HcfMiddleware': 543,
}

And the following settings need to be defined:

HS_AUTH

Scrapy Cloud API key

HS_PROJECTID

Scrapy Cloud project ID (not needed if the spider is run on dash)

HS_FRONTIER

Frontier name.

HS_CONSUME_FROM_SLOT

Slot from which the spider will read new URLs.

Note that HS_FRONTIER and HS_CONSUME_FROM_SLOT can be overridden from inside a spider using the spider attributes hs_frontier and hs_consume_from_slot respectively.
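
Put together, a minimal configuration might look like the sketch below (all values are placeholders; substitute your own API key, project ID, frontier name and slot):

SPIDER_MIDDLEWARES = {
    'scrapy_hcf.HcfMiddleware': 543,
}

HS_AUTH = '<your Scrapy Cloud API key>'   # placeholder
HS_PROJECTID = '12345'                    # placeholder project ID
HS_FRONTIER = 'test'                      # placeholder frontier name
HS_CONSUME_FROM_SLOT = '0'                # slot this spider reads new URLs from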

The following optional Scrapy settings can be defined:

HS_ENDPOINT

URL of the API endpoint, e.g. http://localhost:8003. The default value is provided by the python-hubstorage package.

HS_MAX_LINKS

Maximum number of links to read from the HCF. The default is 1000.

HS_START_JOB_ENABLED

Whether to start a new job when the spider finishes. The default is False.

HS_START_JOB_ON_REASON

List of closing reasons; if the spider ends with any of these reasons, a new job will be started for the same slot. The default is ['finished'].

HS_NUMBER_OF_SLOTS

This is the number of slots that the middleware will use to store the new links. The default is 8.
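
The optional settings are defined in the same way. The sketch below simply restates the defaults described above, except HS_ENDPOINT, whose value here is only an example:

HS_ENDPOINT = 'http://localhost:8003'   # example endpoint; omit to use the python-hubstorage default
HS_MAX_LINKS = 1000                     # number of links read from the HCF per batch
HS_START_JOB_ENABLED = False            # start a new job when the spider finishes
HS_START_JOB_ON_REASON = ['finished']   # closing reasons that trigger a new job
HS_NUMBER_OF_SLOTS = 8                  # slots used to store new links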

Usage

The following keys can be defined in a Scrapy Request meta in order to control the behavior of the HCF middleware:

'use_hcf'

If set to True the request will be stored in the HCF.

'hcf_params'

Dictionary of parameters to be stored in the HCF with the request fingerprint. It may contain the following keys:

'qdata'

Data to be stored along with the fingerprint in the request queue.

'fdata'

Data to be stored along with the fingerprint in the fingerprint set.

'p'

Priority: lower priority numbers are returned first. The default is 0.

The value of the 'qdata' parameter can be retrieved later using response.meta['hcf_params']['qdata'].
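
As an illustration, a spider callback could schedule extracted links through the HCF as shown below (the spider name, URL handling and the qdata/fdata payloads are made up for the example):

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def parse(self, response):
        # Data queued with this request, if it came from the HCF.
        qdata = response.meta.get('hcf_params', {}).get('qdata')

        for href in response.css('a::attr(href)').extract():
            yield scrapy.Request(
                response.urljoin(href),
                callback=self.parse,
                meta={
                    'use_hcf': True,              # store this request in the HCF
                    'hcf_params': {
                        'qdata': {'depth': 1},    # stored with the fingerprint in the queue
                        'fdata': {'seen': True},  # stored with the fingerprint set
                        'p': 0,                   # priority; lower numbers come back first
                    },
                },
            )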

The spider can override the default slot assignment function by setting its slot_callback method to a function with the following signature:

def slot_callback(request):
    ...
    return slot
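
For instance, a spider could spread requests across slots by hashing the request URL. This is a minimal sketch; the hashing scheme and the slot count of 8 (matching the default HS_NUMBER_OF_SLOTS) are arbitrary choices for the example:

import hashlib

def slot_callback(request):
    # Deterministically map the request URL to one of 8 slots ('0' .. '7').
    digest = hashlib.md5(request.url.encode('utf-8')).hexdigest()
    return str(int(digest, 16) % 8)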

Download files

Download the file for your platform.

Source Distribution

scrapy-hcf-1.0.0.tar.gz (4.4 kB)


Built Distribution

scrapy_hcf-1.0.0-py2.py3-none-any.whl (4.8 kB)


File details

Details for the file scrapy-hcf-1.0.0.tar.gz.

File metadata

  • Download URL: scrapy-hcf-1.0.0.tar.gz
  • Upload date:
  • Size: 4.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for scrapy-hcf-1.0.0.tar.gz

  • SHA256: ef36065bba23571af7a6a3a770c710664a5f86447cedee6b476e2c846c63013f
  • MD5: 3cc8d2e1352891fc40b5fa9f3d4b433d
  • BLAKE2b-256: 2f7cb39ebb49f0e0dbc04d41b32e5252346c3b2a84b3dfc0cbd2c4c56fc8c701


File details

Details for the file scrapy_hcf-1.0.0-py2.py3-none-any.whl.

File metadata

File hashes

Hashes for scrapy_hcf-1.0.0-py2.py3-none-any.whl

  • SHA256: a5ae573a655a3efe31af7262238a33f294192e0f18a7fa2a5590ed729e29d493
  • MD5: c5722a8e5372dda845b2f66a3b30d53b
  • BLAKE2b-256: 6ebb709957580f8bd2e5163094776d650f9ea91dc2450c67b1b7db4003a0b99f

