Scrapy middleware to ignore previously crawled pages

Project description

This is a Scrapy spider middleware to ignore requests to pages seen in previous crawls of the same spider, thus producing a “delta crawl” containing only new requests.

This also speeds up the crawl by reducing the number of requests that need to be fetched and processed (typically, item requests are the most CPU-intensive).

The DeltaFetch middleware uses Python’s dbm package to store request fingerprints.
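To make the idea of a dbm-backed "seen" store concrete, here is a minimal sketch using only the standard library. It is not the middleware's actual code: the helper names (open_store, unseen, mark_seen) and the choice to store a timestamp as the value are assumptions made for illustration.

```python
import dbm
import os
import tempfile
import time


def open_store(path, reset=False):
    """Open a dbm database; flag 'n' always starts empty, 'c' creates if missing."""
    return dbm.open(path, "n" if reset else "c")


def unseen(db, keys):
    """Return only the keys that are not yet recorded in the store."""
    return [key for key in keys if key.encode("utf-8") not in db]


def mark_seen(db, key):
    """Record a key; the value (a timestamp here) is an illustrative choice."""
    db[key.encode("utf-8")] = str(time.time()).encode("utf-8")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "example.db")  # hypothetical file name

        db = open_store(path)
        mark_seen(db, "fingerprint-a")
        db.close()

        # A later "crawl" reopens the same store and skips what it has seen.
        db = open_store(path)
        print(unseen(db, ["fingerprint-a", "fingerprint-b"]))
        db.close()
```

Because the store lives on disk, the set of seen keys survives between runs, which is what makes a later crawl a "delta" of the earlier ones.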

Installation

Install scrapy-deltafetch using pip:

$ pip install scrapy-deltafetch

Configuration

  1. Add DeltaFetch middleware by including it in SPIDER_MIDDLEWARES in your settings.py file:

    SPIDER_MIDDLEWARES = {
        'scrapy_deltafetch.DeltaFetch': 100,
    }

    Here, priority 100 is just an example. Set its value depending on other middlewares you may have enabled already.

  2. Enable the middleware using DELTAFETCH_ENABLED in your settings.py:

    DELTAFETCH_ENABLED = True

Usage

The following options control the behavior of the DeltaFetch middleware.

Supported Scrapy settings

  • DELTAFETCH_ENABLED — to enable (or disable) this extension

  • DELTAFETCH_DIR — directory where to store state

  • DELTAFETCH_RESET — reset the state, clearing out all seen requests

These usually go in your Scrapy project’s settings.py.
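As an illustration, the three settings listed above might appear together in settings.py as follows. The directory name is a hypothetical value chosen for this example, not a documented default.

```python
# settings.py (fragment)

# Turn the middleware on.
DELTAFETCH_ENABLED = True

# Where DeltaFetch keeps its state; "deltafetch-state" is a made-up example value.
DELTAFETCH_DIR = "deltafetch-state"

# Leave this off for normal delta crawls; set it to True to clear all seen requests.
DELTAFETCH_RESET = False
```

For a one-off reset, the deltafetch_reset spider argument described below avoids having to edit the settings file.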

Supported Scrapy spider arguments

  • deltafetch_reset — same effect as DELTAFETCH_RESET setting

Example:

$ scrapy crawl example -a deltafetch_reset=1

Supported Scrapy request meta keys

  • deltafetch_key — defines the lookup key for that request. By default it is the value of Scrapy’s default request fingerprint function, but it can be changed to contain an item id, for example. This requires support from the spider, but makes the extension more efficient for sites that have many URLs for the same item.

  • deltafetch_enabled - if set to False, disables DeltaFetch for that specific request
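The sketch below shows one way a spider could build these meta keys. The helper deltafetch_meta and the assumption that the item id lives in an "id" query parameter are both hypothetical; adapt them to the URL scheme of the site being crawled.

```python
from urllib.parse import parse_qs, urlparse


def deltafetch_meta(url, id_param="id"):
    """Build a Request meta dict for DeltaFetch (hypothetical helper).

    If the URL carries an item id in its query string, use it as the lookup
    key so that different URLs for the same item share one key. Otherwise,
    opt this request out of DeltaFetch.
    """
    values = parse_qs(urlparse(url).query).get(id_param)
    if not values:
        return {"deltafetch_enabled": False}
    return {"deltafetch_key": values[0]}


# Inside a spider callback, the result would be passed as the request meta:
#     yield scrapy.Request(url, callback=self.parse_item, meta=deltafetch_meta(url))
```

With this helper, two URLs that differ only in tracking parameters but share the same id produce the same deltafetch_key, so the item page is only requested once across crawls.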
