Scrapy middleware with wayback machine support for more robust scrapers.

scrapy-wayback

Dependencies :globe_with_meridians:

Installation :inbox_tray:

This is a Python package hosted on PyPI. To install it, run the following command:

pip install scrapy-wayback

Settings

WAYBACK_MACHINE_FALLBACK_ENABLED (Optional)

Whether to fall back to the wayback machine after a request fails (defaults to true).

Meta field to enable/disable this per request is: wayback_machine_fallback_enabled

WAYBACK_MACHINE_PROXY_ENABLED (Optional)

Whether to proxy each request through the wayback machine before it is sent to the original server (defaults to false).

Meta field to enable/disable this per request is: wayback_machine_proxy_enabled

WAYBACK_MACHINE_PROXY_FALLTHROUGH_ENABLED (Optional)

Whether a request should continue to the original URL as normal when proxying to the wayback machine results in an error (defaults to true). This setting has no effect unless the wayback machine proxy is enabled.

Meta field to enable/disable this per request is: wayback_machine_proxy_fallthrough_enabled
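Since each of the three settings has a matching meta field, the overrides for a single request can be collected in one dict. The following is a minimal sketch: `wayback_meta` is a hypothetical helper and not part of scrapy-wayback; only the three key names come from the settings above. It assumes that leaving a key out lets the project-wide setting apply.

```python
from typing import Optional


def wayback_meta(
    fallback: Optional[bool] = None,
    proxy: Optional[bool] = None,
    fallthrough: Optional[bool] = None,
) -> dict:
    """Build a meta dict that overrides the wayback settings for one request.

    Hypothetical helper: arguments left as None are omitted from the dict,
    on the assumption that the project-wide setting then applies.
    """
    candidates = {
        "wayback_machine_fallback_enabled": fallback,
        "wayback_machine_proxy_enabled": proxy,
        "wayback_machine_proxy_fallthrough_enabled": fallthrough,
    }
    return {key: value for key, value in candidates.items() if value is not None}


# Overrides for a request that should be served from the archive only
archive_only = wayback_meta(proxy=True, fallthrough=False)
```

The resulting dict would be passed as the `meta` argument of a request, for example `scrapy.Request(url, meta=archive_only)`.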

Usage example :eyes:

To use this plugin, add the following to your Scrapy settings:

DOWNLOADER_MIDDLEWARES = {
    "waybackmiddleware.middleware.WaybackMachineDownloaderMiddleware": 630
}

This will immediately allow you to begin using the wayback machine as a fallback when one of your requests fails. In order to use it as a proxy you can add the following to your settings:

WAYBACK_MACHINE_PROXY_ENABLED = True

This will make every request hit the wayback machine for a response first, before hitting the original server. If you want to avoid hitting the original server entirely, put the following in your settings (as well as the above):

WAYBACK_MACHINE_PROXY_FALLTHROUGH_ENABLED = False

This will ensure that your scraper never hits the original servers, just what has been recorded by the wayback machine.
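If only one spider in a project should run in this archive-only mode, the same values can be grouped into a dict instead of the project settings. This is a sketch under that assumption; the name `ARCHIVE_ONLY_SETTINGS` is made up for illustration, while the keys and values are the ones shown above.

```python
# Illustrative settings fragment for an archive-only spider.
# ARCHIVE_ONLY_SETTINGS is a hypothetical name; the keys come from this README.
ARCHIVE_ONLY_SETTINGS = {
    "DOWNLOADER_MIDDLEWARES": {
        "waybackmiddleware.middleware.WaybackMachineDownloaderMiddleware": 630,
    },
    # Ask the wayback machine first for every request...
    "WAYBACK_MACHINE_PROXY_ENABLED": True,
    # ...and do not continue to the original server if that fails.
    "WAYBACK_MACHINE_PROXY_FALLTHROUGH_ENABLED": False,
}
```

Scrapy spiders accept a `custom_settings` class attribute, so assigning this dict there (`custom_settings = ARCHIVE_ONLY_SETTINGS`) should scope the behaviour to that spider.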

Whenever you receive a response from the wayback machine middleware, it will use the class WaybackMachineResponse. It subclasses scrapy.http.HtmlResponse, so you can use it like a normal response, but it also has some other goodies:

def parse(self, response):
    # Walk backwards through the snapshots until there are none left
    while response is not None:
        print(f"Response {response.request.url} at {response.timestamp.isoformat()}")
        response = response.earlier_response()

This will allow you to go through the history one by one to get the earlier snapshots of the page. If you are interested in the response that the wayback middleware recovered, use the original_response attribute.
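The history walk can also be wrapped in a generator so that callers can loop over snapshots or cap how many they visit. The sketch below is not part of the package: `iter_snapshots` is a hypothetical helper that relies only on the `earlier_response()` method described above, and `FakeSnapshot` is a stand-in class used purely so the example runs without a crawl.

```python
from itertools import islice
from typing import Any, Iterator, Optional


def iter_snapshots(response: Any, limit: Optional[int] = None) -> Iterator[Any]:
    """Yield a response and then each earlier snapshot, newest first.

    Hypothetical helper: stops when earlier_response() returns None,
    or after `limit` snapshots if a limit is given.
    """
    def walk() -> Iterator[Any]:
        current = response
        while current is not None:
            yield current
            current = current.earlier_response()

    return islice(walk(), limit)


class FakeSnapshot:
    """Stand-in for WaybackMachineResponse, used only to demo the helper."""

    def __init__(self, timestamp: str, earlier: Optional["FakeSnapshot"] = None):
        self.timestamp = timestamp
        self._earlier = earlier

    def earlier_response(self) -> Optional["FakeSnapshot"]:
        return self._earlier


oldest = FakeSnapshot("2019")
middle = FakeSnapshot("2020", oldest)
newest = FakeSnapshot("2021", middle)
timestamps = [snapshot.timestamp for snapshot in iter_snapshots(newest)]
```

Inside a real `parse` callback the call would look like `for snapshot in iter_snapshots(response, limit=5): ...`.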

In order to perform a request that will yield the whole archived contents of a site, you can do the following:

import scrapy
from waybackmiddleware.request import WaybackMachineRequest
from waybackmiddleware.response import WaybackMachineResponse


class ArchiveScraper(scrapy.Spider):
    name = "archive_scraper"  # Scrapy requires every spider to have a name

    def start_requests(self):
        yield WaybackMachineRequest("http://www.walmart.com")

    def parse(self, response):
        print(f"Archive of {response.url} at {response.timestamp}")
        if isinstance(response, WaybackMachineResponse):
            # Queue the next-older snapshot with this same callback
            next_response = response.earlier_response()
            if next_response is not None:
                yield next_response.request_for_response(self.parse)

This will send all archived contents of walmart.com to the parse callback (called multiple times).
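A full archive can be large, so it may be useful to stop once snapshots are older than some cutoff. The following is a sketch only: `is_within_window` and `CUTOFF` are hypothetical names, and it assumes `response.timestamp` is a `datetime` (the earlier example calls `.isoformat()` on it). Timestamps without timezone information are treated as UTC, which is also an assumption.

```python
from datetime import datetime, timezone


def is_within_window(timestamp: datetime, oldest: datetime) -> bool:
    """Return True if a snapshot timestamp is at or after the cutoff.

    Hypothetical helper: naive datetimes are assumed to be UTC so that
    naive and aware values can be compared without raising TypeError.
    """
    def as_utc(value: datetime) -> datetime:
        if value.tzinfo is None:
            return value.replace(tzinfo=timezone.utc)
        return value

    return as_utc(timestamp) >= as_utc(oldest)


# Example cutoff: ignore snapshots taken before 2020
CUTOFF = datetime(2020, 1, 1, tzinfo=timezone.utc)
```

In the spider above, the request for the next snapshot would then only be yielded when `is_within_window(next_response.timestamp, CUTOFF)` is true.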

License :memo:

The project is available under the MIT License.
