scrapy-html-storage

Scrapy downloader middleware that stores response HTML files to disk.

Usage

Enable the middleware, e.g. by adding it to DOWNLOADER_MIDDLEWARES in settings.py:

DOWNLOADER_MIDDLEWARES = {
    'scrapy_html_storage.HtmlStorageMiddleware': 10,
}

By default, no responses are saved to disk. You must mark the requests whose response HTML should be saved by setting save_html in the request meta:

def parse(self, response):
    """Processes start urls.

    Args:
        response (HtmlResponse): scrapy HTML response object.
    """
    yield scrapy.Request(
        'http://target.com',
        callback=self.parse_target,
        meta={
            # Ask the middleware to store the HTML of this request's response.
            'save_html': True,
        }
    )

The path where the HTML is stored is resolved by the spider's response_html_path method, which receives the request and returns the file path. For example:

import scrapy


class TargetSpider(scrapy.Spider):
    def response_html_path(self, request):
        """Return the file path where the response HTML is stored.

        Args:
            request (scrapy.http.request.Request): request that produced the
                response.
        """
        return 'html/last_response.html'
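
A fixed path such as the one above is overwritten by every saved response. As a minimal sketch of one way to avoid that, the path could be derived from the request URL. The helper name html_path_for_url, the base_dir parameter, and the directory layout are assumptions made for illustration and are not part of this package. Whether the middleware creates missing directories is not stated here, so they may need to exist beforehand.

```python
import hashlib
import os
from urllib.parse import urlparse


def html_path_for_url(url, base_dir='html'):
    """Build a per-URL file path (hypothetical helper, not part of the package)."""
    # Use the host as a subdirectory; replace ':' so a port number stays path-safe.
    host = (urlparse(url).netloc or 'unknown-host').replace(':', '_')
    # Hash the full URL so different pages on the same host get different files.
    digest = hashlib.sha256(url.encode('utf-8')).hexdigest()[:16]
    return os.path.join(base_dir, host, digest + '.html')
```

Inside a spider, response_html_path could then return html_path_for_url(request.url).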

Configuration

The HTML storage downloader middleware supports the following options, set in the HTML_STORAGE setting:

gzip_output (bool) - if True, the HTML output is stored in gzip format. Default is False.
save_html_on_status (list) - response status codes for which HTML is saved. If the list is empty or not provided, HTML is saved for all status codes.
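
The save_html_on_status rule described above can be sketched as a small predicate. This is an illustration of the documented behaviour only, not the middleware's actual code; the function name should_save_html is hypothetical. The per-request save_html flag still has to be set for a response to be considered at all.

```python
def should_save_html(status, save_html_on_status=None):
    """Return True if a response with this status code passes the whitelist."""
    # An empty or missing whitelist allows every status code.
    if not save_html_on_status:
        return True
    return status in save_html_on_status
```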

Sample:

HTML_STORAGE = {
    'gzip_output': True,
    'save_html_on_status': [200, 202]
}
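
When gzip_output is enabled, the stored files have to be decompressed before they can be inspected. The following is a minimal sketch for reading a stored file back with the standard library. The helper name read_saved_html is hypothetical, and the UTF-8 encoding is an assumption, since the encoding of the stored HTML is not specified here.

```python
import gzip


def read_saved_html(path, gzipped=False, encoding='utf-8'):
    """Read a stored HTML file as text (hypothetical helper; encoding is assumed)."""
    # gzip.open in text mode decompresses and decodes in one step.
    opener = gzip.open if gzipped else open
    with opener(path, 'rt', encoding=encoding) as f:
        return f.read()
```

For example, read_saved_html('html/last_response.html', gzipped=True) would return the page source as a string, provided the file at that path was written with gzip_output set to True.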

Project details

Release history

0.4.0 (this version) - Jun 24, 2018
0.3.0 - Nov 11, 2016
0.2.0 - Apr 19, 2016
0.1.0 - Mar 29, 2016

Download files

Source Distribution

scrapy-html-storage-0.4.0.tar.gz (3.0 kB)

Uploaded Jun 24, 2018 Source

File details

Details for the file scrapy-html-storage-0.4.0.tar.gz.

File metadata

Download URL: scrapy-html-storage-0.4.0.tar.gz
Upload date: Jun 24, 2018
Size: 3.0 kB
Tags: Source
Uploaded using Trusted Publishing? No

File hashes

Hashes for scrapy-html-storage-0.4.0.tar.gz
Algorithm	Hash digest
SHA256	`b7b55bb5025efe8c545b4d9fdcf357bb9a4bbaa17265f68ec093615dd5672fdc`
MD5	`81fff2eec7d59dcc8770ca857e0ed40a`
BLAKE2b-256	`7815f7a99fbfa63298323b8288695f34c21095410d4739a5368fd0b0bb0f7f1c`
