A Scrapy middleware that makes it possible to crawl only new content
scrapy-crawl-once
This package provides a Scrapy middleware that helps avoid re-crawling pages that were already downloaded in previous crawls.
License is MIT.
Installation
pip install scrapy-crawl-once
Usage
To enable it, modify your settings.py:

    SPIDER_MIDDLEWARES = {
        # ...
        'scrapy_crawl_once.CrawlOnceMiddleware': 100,
        # ...
    }

    DOWNLOADER_MIDDLEWARES = {
        # ...
        'scrapy_crawl_once.CrawlOnceMiddleware': 50,
        # ...
    }
By default it does nothing. To avoid crawling a particular page multiple times, set request.meta['crawl_once'] = True. When a response is received and the callback completes successfully, the fingerprint of the request is stored in a database. When the spider schedules a new request, the middleware first checks whether its fingerprint is in the database, and drops the request if it is.
Other request.meta keys:
crawl_once_value - a value to store in the database. By default, the current timestamp is stored.
crawl_once_key - a unique request ID; by default, Scrapy's request_fingerprint is used.
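The flow above can be sketched in plain Python. This is an illustrative simplification, not the package's actual implementation: the real middleware uses Scrapy's request_fingerprint and one SQLite file per spider, but the dedup idea is the same — record a fingerprint on success, drop any later request whose fingerprint is already recorded.

```python
import hashlib
import sqlite3
import time

class CrawlOnceDB:
    """Simplified stand-in for the middleware's per-spider SQLite store."""

    def __init__(self, path=':memory:'):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS requests (key TEXT PRIMARY KEY, value TEXT)')

    @staticmethod
    def fingerprint(url, method='GET'):
        # Stand-in for Scrapy's request_fingerprint(); any stable hash works here.
        return hashlib.sha1(f'{method} {url}'.encode()).hexdigest()

    def seen(self, key):
        # Would the middleware drop a request with this key?
        cur = self.conn.execute('SELECT 1 FROM requests WHERE key = ?', (key,))
        return cur.fetchone() is not None

    def record(self, key, value=None):
        # Called after a successful callback; stores a timestamp by default,
        # mirroring the crawl_once_value behaviour.
        value = str(time.time()) if value is None else value
        self.conn.execute('INSERT OR REPLACE INTO requests VALUES (?, ?)',
                          (key, value))
        self.conn.commit()

db = CrawlOnceDB()
fp = CrawlOnceDB.fingerprint('https://example.com/page1')
print(db.seen(fp))   # → False: first crawl, request goes through
db.record(fp)
print(db.seen(fp))   # → True: a later crawl would drop this request
```

A custom crawl_once_key in request.meta would simply replace the fingerprint as the primary key, and crawl_once_value would replace the stored timestamp.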
Settings
CRAWL_ONCE_ENABLED - set it to False to disable the middleware. Default is True.
CRAWL_ONCE_PATH - a path to a folder with the crawled-requests database. By default, the .scrapy/crawl_once/ path inside the project dir is used; this folder contains <spider_name>.sqlite files with databases of seen requests.
CRAWL_ONCE_DEFAULT - the default value for the crawl_once meta key (False by default). When True, all requests are handled by this middleware unless disabled explicitly using request.meta['crawl_once'] = False.
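For instance, a hypothetical settings.py for an "opt-out" setup might combine these settings so every request is tracked by default, with individual requests excluded in the spider:

```python
# settings.py (sketch; the values shown are illustrative assumptions)
CRAWL_ONCE_ENABLED = True                 # the default, shown for clarity
CRAWL_ONCE_DEFAULT = True                 # track all requests by default
CRAWL_ONCE_PATH = '.scrapy/crawl_once/'   # the default database location

# A frequently-changing page (e.g. a listing page that must be
# re-checked on every run) can then opt out per-request in the spider:
#     yield scrapy.Request(url, meta={'crawl_once': False})
```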
Alternatives
https://github.com/scrapy-plugins/scrapy-deltafetch is a similar package; it solves almost the same problem. Differences:
scrapy-deltafetch decides whether to discard a request based on yielded items; scrapy-crawl-once uses an explicit request.meta['crawl_once'] flag.
scrapy-deltafetch uses bsddb3; scrapy-crawl-once uses SQLite.
Another alternative is a built-in Scrapy HTTP cache. Differences:
the Scrapy cache stores all pages on disk, while scrapy-crawl-once only keeps request fingerprints;
the Scrapy cache allows more fine-grained invalidation, consistent with how browsers work;
with the Scrapy cache, all pages are still processed (though not all pages are downloaded).
Contributing
source code: https://github.com/TeamHG-Memex/scrapy-crawl-once
bug tracker: https://github.com/TeamHG-Memex/scrapy-crawl-once/issues
To run tests, install tox and run tox from the source checkout.
CHANGES
0.1.1 (2017-03-04)
a new 'crawl_once/initial' value in Scrapy stats - it contains the initial size (number of records) of the crawl_once database.
0.1 (2017-03-03)
Initial release.
Hashes for scrapy_crawl_once-0.1.1-py2.py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 60ea4e7529f99ad1ec6cacbad53828fbfa5959cc4dddfe8047557e7c189e920c
MD5 | 7cbd808e48d307faf08a88e23e87d7b3
BLAKE2b-256 | 49978684f7a85d6be3a52f50cce2411eaaaf6c4e0d6c1598fa7b4e99578ba2cb