Scrapy downloader middleware that stores response HTML files to disk.

Project description

This is a Scrapy downloader middleware that stores response HTML to disk.

Usage

Enable the middleware, e.g. by adding it to settings.py:

DOWNLOADER_MIDDLEWARES = {
    'scrapy_html_storage.HtmlStorageMiddleware': 10,
}

By default, no responses are saved to disk. You must mark the requests whose response HTML should be saved:

def parse(self, response):
    """Process start URLs.

    Args:
        response (HtmlResponse): scrapy HTML response object.
    """
    yield scrapy.Request(
        'http://target.com',
        callback=self.parse_target,
        meta={
            # Ask the middleware to store the HTML of this response.
            'save_html': True,
        }
    )
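When only some requests should be stored, the meta dict can be built conditionally. The following is a minimal sketch, not part of the package: the helper name request_meta and the '/products/' prefix are hypothetical and chosen only for illustration. It sets 'save_html' for URLs whose path starts with one of the given prefixes and leaves the meta empty otherwise.

```python
from urllib.parse import urlparse


def request_meta(url, save_prefixes=("/products/",)):
    """Return a Request meta dict that sets 'save_html' only for matching URLs.

    `request_meta` and `save_prefixes` are hypothetical names used for this
    sketch; only the 'save_html' key comes from the middleware's documentation.
    """
    path = urlparse(url).path
    meta = {}
    if any(path.startswith(prefix) for prefix in save_prefixes):
        # Mark this request so the middleware stores its response HTML.
        meta["save_html"] = True
    return meta
```

Inside a spider callback, the result would be passed as meta=request_meta(url) when constructing scrapy.Request.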

The file path where the HTML is stored is resolved by the spider method response_html_path, e.g.:

class TargetSpider(scrapy.Spider):
    def response_html_path(self, request):
        """Return the file path where the response HTML is stored.

        Args:
            request (scrapy.http.request.Request): request that produced the
                response.
        """
        return 'html/last_response.html'
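The example above returns the same path for every request, so each saved response presumably replaces the previous one. A minimal sketch of deriving a distinct path per URL follows. The helper html_path_for_url and its base_dir parameter are hypothetical and not part of the package. The sketch keeps all files directly under html/, as in the example above.

```python
import hashlib
from urllib.parse import urlparse


def html_path_for_url(url, base_dir="html"):
    """Build a stable, per-URL file path such as 'html/target.com-<sha1>.html'.

    Hypothetical helper for illustration; the middleware only requires that
    response_html_path returns a path string.
    """
    # hostname drops any port, which avoids ':' in file names.
    host = urlparse(url).hostname or "unknown-host"
    # Hash the full URL so different pages on one host get different files.
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return f"{base_dir}/{host}-{digest}.html"
```

In a spider, response_html_path could then return html_path_for_url(request.url), since Scrapy requests expose the URL as request.url.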

Configuration

The HTML storage downloader middleware supports the following options:

  • gzip_output (bool) - if True, HTML output will be stored in gzip format. Default is False.

Sample:

HTML_STORAGE = {
    'gzip_output': True
}
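With gzip_output enabled, stored files can no longer be opened as plain text. The sketch below reads a stored file back using Python's standard gzip module. It assumes the file is an ordinary gzip stream and that the HTML is UTF-8 encoded; neither detail is stated in this description, and the helper name read_stored_html is hypothetical.

```python
import gzip


def read_stored_html(path, gzipped=True, encoding="utf-8"):
    """Read stored HTML back as text, decompressing it when gzipped is True.

    Assumption: the file is a standard gzip stream and the text encoding is
    known; adjust `encoding` to match the pages being crawled.
    """
    opener = gzip.open if gzipped else open
    # "rt" opens in text mode for both gzip.open and the built-in open.
    with opener(path, "rt", encoding=encoding) as f:
        return f.read()
```

For example, read_stored_html('html/last_response.html') would return the decompressed HTML as a string if the file was written with gzip_output set to True.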

Download files

Source distribution: scrapy-html-storage-0.2.0.tar.gz (2.6 kB)
