Skip to main content

Scrapy downloader middleware that stores response HTML files to disk.

Project description

https://travis-ci.org/povilasb/scrapy-html-storage.svg?branch=master https://coveralls.io/repos/github/povilasb/scrapy-html-storage/badge.svg?branch=master:target:https://coveralls.io/github/povilasb/scrapy-html-storage?branch=master

This is Scrapy downloader middleware that stores response HTMLs to disk.

Usage

Turn downloader on, e.g. specifying it in settings.py:

DOWNLOADER_MIDDLEWARES = {
    'scrapy_html_storage.HtmlStorageMiddleware': 10,
}

None of responses by default are saved to disk. You must select for which requests the response HTMLs will be saved:

def parse(self, response):
     """Processes start urls.

     Args:
         response (HtmlResponse): scrapy HTML response object.
     """
     yield scrapy.Request(
         'http://target.com',
         callback=self.parse_target,
         meta={
           'save_html': True,
         }
     )

The file path where HTML will be stored is resolved with spider method response_html_path. E.g.:

class TargetSpider(scrapy.Spider):
    def response_html_path(self, request):
        """
        Args:
            request (scrapy.http.request.Request): request that produced the
                response.
        """
        return 'html/last_response.html'

Configuration

HTML storage downloader middleware supports such options:

  • gzip_output (bool) - if True, HTML output will be stored in gzip format. Default is False.

  • save_html_on_status (list) - if not empty, sets list of response codes whitelisted for html saving. If list is empty or not provided, all response codes will be allowed for html saving.

Sample:

HTML_STORAGE = {
    'gzip_output': True,
    'save_html_on_status': [200, 202]
}

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

scrapy-html-storage-0.4.0.tar.gz (3.0 kB view details)

Uploaded Source

File details

Details for the file scrapy-html-storage-0.4.0.tar.gz.

File metadata

File hashes

Hashes for scrapy-html-storage-0.4.0.tar.gz
Algorithm Hash digest
SHA256 b7b55bb5025efe8c545b4d9fdcf357bb9a4bbaa17265f68ec093615dd5672fdc
MD5 81fff2eec7d59dcc8770ca857e0ed40a
BLAKE2b-256 7815f7a99fbfa63298323b8288695f34c21095410d4739a5368fd0b0bb0f7f1c

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page