Scrapy downloader middleware that stores response HTML files to disk.
Project description
This is a Scrapy downloader middleware that stores response HTML to disk.
Usage
Enable the middleware, e.g. by adding it to settings.py:
    DOWNLOADER_MIDDLEWARES = {
        'scrapy_html_storage.HtmlStorageMiddleware': 10,
    }
By default, no responses are saved to disk. You must select the requests whose response HTML will be saved by setting save_html to True in the request meta:
    def parse(self, response):
        """Processes start urls.

        Args:
            response (HtmlResponse): scrapy HTML response object.
        """
        yield scrapy.Request(
            'http://target.com',
            callback=self.parse_target,
            meta={
                'save_html': True,
            }
        )
The file path where the HTML is stored is resolved by the spider method response_html_path, e.g.:
    class TargetSpider(scrapy.Spider):
        def response_html_path(self, request):
            """
            Args:
                request (scrapy.http.request.Request): request that produced
                    the response.
            """
            return 'html/last_response.html'
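Returning a fixed path, as above, means every saved response is written to the same file. A minimal sketch of deriving a distinct path per request follows. It assumes a hypothetical helper, html_path_for_url, which is not part of scrapy-html-storage, and an html/ base directory chosen only for illustration:

```python
import hashlib
from urllib.parse import urlparse


def html_path_for_url(url, base_dir='html'):
    """Build a per-URL file path (hypothetical helper, not part of the library)."""
    # Use the host as a subdirectory; replace ':' so ports do not end up in paths.
    host = (urlparse(url).netloc or 'unknown-host').replace(':', '_')
    # Hash the full URL so that distinct URLs map to distinct file names.
    digest = hashlib.sha1(url.encode('utf-8')).hexdigest()[:16]
    return '{}/{}/{}.html'.format(base_dir, host, digest)
```

Inside the spider, response_html_path could then return html_path_for_url(request.url). Whether the middleware creates missing directories is not stated here, so creating them beforehand may be necessary.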
Configuration
The HTML storage downloader middleware supports the following options:
gzip_output (bool) - if True, the HTML output is stored in gzip format. Default is False.
save_html_on_status (list) - if not empty, the list of response status codes for which HTML is saved. If the list is empty or not provided, responses with any status code are saved.
Sample:
    HTML_STORAGE = {
        'gzip_output': True,
        'save_html_on_status': [200, 202]
    }
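To make the meaning of the two options concrete, here is a small sketch. It does not reproduce the middleware's own code: status_allowed and read_saved_html are hypothetical helpers that mirror the behaviour described above, and the UTF-8 default encoding is an assumption.

```python
import gzip


def status_allowed(status, save_html_on_status=None):
    """Return True if a response with this status code would be saved."""
    # An empty or missing whitelist means every status code is allowed.
    if not save_html_on_status:
        return True
    return status in save_html_on_status


def read_saved_html(path, gzip_output=False, encoding='utf-8'):
    """Read back a stored HTML file, decompressing it if gzip was used."""
    opener = gzip.open if gzip_output else open
    with opener(path, 'rb') as f:
        return f.read().decode(encoding)
```

With the sample configuration above, status_allowed(404, [200, 202]) is False, and files written with gzip_output enabled would be read back with read_saved_html(path, gzip_output=True).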
Download files
Source Distribution
File details
Details for the file scrapy-html-storage-0.4.0.tar.gz.
File metadata
- Download URL: scrapy-html-storage-0.4.0.tar.gz
- Size: 3.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest
---|---
SHA256 | b7b55bb5025efe8c545b4d9fdcf357bb9a4bbaa17265f68ec093615dd5672fdc
MD5 | 81fff2eec7d59dcc8770ca857e0ed40a
BLAKE2b-256 | 7815f7a99fbfa63298323b8288695f34c21095410d4739a5368fd0b0bb0f7f1c