Scrapy middleware which allows crawling only new content

scrapy-crawl-once


This package provides a Scrapy middleware that avoids re-crawling pages which were already downloaded in previous crawls.

License is MIT.

Installation

pip install scrapy-crawl-once

Usage

To enable it, modify your settings.py:

SPIDER_MIDDLEWARES = {
    # ...
    'scrapy_crawl_once.CrawlOnceMiddleware': 100,
    # ...
}

DOWNLOADER_MIDDLEWARES = {
    # ...
    'scrapy_crawl_once.CrawlOnceMiddleware': 50,
    # ...
}

By default it does nothing. To avoid crawling a particular page multiple times, set request.meta['crawl_once'] = True. When a response is received and the callback completes successfully, the fingerprint of the request is stored in a database. When the spider schedules a new request, the middleware first checks whether its fingerprint is already in the database, and drops the request if it is.
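The store-and-check logic can be sketched with the standard library's sqlite3 module. This is an illustration of the idea, not the middleware's actual code; the class name, table name, and keys below are made up:

```python
import sqlite3
import time

class SeenRequests:
    """Sketch of a per-spider store of fingerprints of completed requests."""

    def __init__(self, path=':memory:'):
        self.db = sqlite3.connect(path)
        self.db.execute(
            'CREATE TABLE IF NOT EXISTS requests (key TEXT PRIMARY KEY, value TEXT)')

    def mark_seen(self, key, value=None):
        # Called after a response arrived and the callback succeeded.
        # By default a timestamp is stored as the value.
        if value is None:
            value = str(time.time())
        self.db.execute('INSERT OR REPLACE INTO requests VALUES (?, ?)', (key, value))
        self.db.commit()

    def is_seen(self, key):
        # Called when the spider schedules a new request: drop it if seen.
        row = self.db.execute(
            'SELECT 1 FROM requests WHERE key = ?', (key,)).fetchone()
        return row is not None

store = SeenRequests()
store.mark_seen('fingerprint-abc')
print(store.is_seen('fingerprint-abc'))  # True: request would be dropped
print(store.is_seen('fingerprint-def'))  # False: request goes through
```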

Other request.meta keys:

  • crawl_once_value - a value to store in the DB. By default, a timestamp is stored.
  • crawl_once_key - a unique id for the request; by default, request_fingerprint is used.
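For example, a request could override both keys. The meta entries are shown here as a plain dict for illustration; in a real spider they would be set on a scrapy.Request, and the key and value below are made up:

```python
# Meta entries as they would appear on request.meta; the id and date
# are hypothetical examples, not values the middleware prescribes.
meta = {
    'crawl_once': True,                 # opt this request into the middleware
    'crawl_once_key': 'article-12345',  # use a stable id instead of the fingerprint
    'crawl_once_value': '2017-03-03',   # stored in the DB instead of a timestamp
}
print(meta['crawl_once'])  # True
```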

Settings

  • CRAWL_ONCE_ENABLED - set it to False to disable the middleware. Default is True.
  • CRAWL_ONCE_PATH - a path to a folder with the crawled-requests database. By default the .scrapy/crawl_once/ path inside a project dir is used; this folder contains <spider_name>.sqlite files with databases of seen requests.
  • CRAWL_ONCE_DEFAULT - the default value for the crawl_once meta key (False by default). When True, all requests are handled by this middleware unless disabled explicitly with request.meta['crawl_once'] = False.
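Putting the settings together, a settings.py that tracks every request by default might look like this (an illustrative sketch; the path shown is the documented default):

```python
# settings.py -- example configuration for scrapy-crawl-once
SPIDER_MIDDLEWARES = {
    'scrapy_crawl_once.CrawlOnceMiddleware': 100,
}
DOWNLOADER_MIDDLEWARES = {
    'scrapy_crawl_once.CrawlOnceMiddleware': 50,
}

CRAWL_ONCE_ENABLED = True              # the default; set False to turn it off
CRAWL_ONCE_PATH = '.scrapy/crawl_once/'  # where <spider_name>.sqlite files live
CRAWL_ONCE_DEFAULT = True              # track all requests unless a request
                                       # sets meta['crawl_once'] = False
```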

Alternatives

https://github.com/scrapy-plugins/scrapy-deltafetch is a similar package that solves almost the same problem. Differences:

  • scrapy-deltafetch chooses whether to discard a request or not based on yielded items; scrapy-crawl-once uses an explicit request.meta['crawl_once'] flag.
  • scrapy-deltafetch uses bsddb3, scrapy-crawl-once uses sqlite.

Another alternative is a built-in Scrapy HTTP cache. Differences:

  • scrapy cache stores all pages on disk; scrapy-crawl-once only keeps request fingerprints;
  • scrapy cache allows more fine-grained invalidation, consistent with how browsers work;
  • with scrapy cache, all pages are still processed (though not all pages are downloaded).

Contributing

To run tests, install tox and run tox from the source checkout.

CHANGES

0.1 (2017-03-03)

Initial release.
