
Scrapy middleware to ignore previously crawled pages

Project description

This is a Scrapy spider middleware to ignore requests to pages containing items seen in previous crawls of the same spider, thus producing a “delta crawl” containing only new items.

This also speeds up the crawl by reducing the number of requests that need to be fetched and processed (typically, item requests are the most CPU-intensive).


DeltaFetch middleware depends on Python’s bsddb3 package.

On Ubuntu/Debian, you may need to install libdb-dev if it’s not installed already.


Install scrapy-deltafetch using pip:

$ pip install scrapy-deltafetch


  1. Add DeltaFetch middleware by including it in SPIDER_MIDDLEWARES in your project's settings file:

        'scrapy_deltafetch.DeltaFetch': 100,

    Here, priority 100 is just an example. Set its value depending on other middlewares you may have enabled already.

  2. Enable the middleware by setting DELTAFETCH_ENABLED in your project's settings file.
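
The two steps above can be sketched together as a settings fragment. This is a minimal sketch, assuming the conventional settings.py module of a Scrapy project; the priority value 100 is the same example value used above, not a recommendation.

```python
# settings.py (assumed file name: the conventional Scrapy settings module)

SPIDER_MIDDLEWARES = {
    # 100 is only an example priority; choose a value that fits
    # alongside any other spider middlewares you have enabled.
    "scrapy_deltafetch.DeltaFetch": 100,
}

# Step 2: switch the middleware on.
DELTAFETCH_ENABLED = True
```

If the project already defines SPIDER_MIDDLEWARES, the DeltaFetch entry would be added to the existing dictionary rather than replacing it.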



The following options control the DeltaFetch middleware's behavior.

Supported Scrapy settings

  • DELTAFETCH_ENABLED — to enable (or disable) this extension
  • DELTAFETCH_DIR — directory where to store state
  • DELTAFETCH_RESET — reset the state, clearing out all seen requests

These usually go in your Scrapy project’s settings file.

Supported Scrapy spider arguments

  • deltafetch_reset — same effect as DELTAFETCH_RESET setting


$ scrapy crawl example -a deltafetch_reset=1

Supported Scrapy request meta keys

  • deltafetch_key — defines the lookup key for that request. By default the key is Scrapy’s default request fingerprint, but it can be changed to contain an item id, for example. This requires support from the spider, but makes the middleware more efficient for sites that have many URLs for the same item.
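
As an illustration of keying on an item id, here is a minimal sketch that derives a deltafetch_key from a URL. The helper names (item_id_from_url, deltafetch_meta), the example.com URLs, and the assumption that the item id is the last path segment are all hypothetical; a real site will need its own rule for extracting the id.

```python
from urllib.parse import urlsplit


def item_id_from_url(url: str) -> str:
    """Return the last non-empty path segment, treated here as the item id.

    Assumption: the (hypothetical) site puts the item id at the end of the path.
    """
    path = urlsplit(url).path.rstrip("/")
    return path.rsplit("/", 1)[-1]


def deltafetch_meta(url: str) -> dict:
    """Build a request meta dict whose deltafetch_key is the item id."""
    return {"deltafetch_key": item_id_from_url(url)}


# Two hypothetical URLs for the same item produce the same lookup key,
# so the item page would be treated as already seen under either URL.
meta_a = deltafetch_meta("https://example.com/products/123?ref=home")
meta_b = deltafetch_meta("https://example.com/sale/products/123/")
```

In a spider callback, the resulting dict would be passed as the meta argument when building the request, for example scrapy.Request(url, meta=deltafetch_meta(url)).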

Project details

Files for scrapy-deltafetch, version 1.2.1
Filename                                        Size    File type  Python version
scrapy_deltafetch-1.2.1-py2.py3-none-any.whl    3.9 kB  Wheel      py2.py3
scrapy-deltafetch-1.2.1.tar.gz                  3.2 kB  Source     None
