Scrapy middleware for submitting URLs to the Internet Archive Wayback Machine

Scrapy Wayback Middleware

Middleware for submitting all scraped response URLs to the Internet Archive Wayback Machine for archival.

Installation

pip install scrapy-wayback-middleware

Setup

Add scrapy_wayback_middleware.WaybackMiddleware to your project's SPIDER_MIDDLEWARES setting. By default, the middleware makes GET requests to web.archive.org/save/{URL}. If the WAYBACK_MIDDLEWARE_POST setting is True, it makes POST requests to pragma.archivelab.org instead.
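A minimal settings.py sketch of this setup. The order value 543 is an assumed placeholder rather than a value taken from this project; pick whatever ordering suits your other spider middlewares.

```python
# settings.py (sketch)

# Register the middleware. Scrapy expects a dict mapping the class path to an
# integer order; 543 is an assumed placeholder, not a project recommendation.
SPIDER_MIDDLEWARES = {
    "scrapy_wayback_middleware.WaybackMiddleware": 543,
}

# Default behavior: GET requests to web.archive.org/save/{URL}.
# Set this to True to send POST requests to pragma.archivelab.org instead.
WAYBACK_MIDDLEWARE_POST = False
```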

Configuration

To customize behavior, subclass WaybackMiddleware and override get_item_urls to pull additional links to archive from individual items, or override handle_wayback to change how responses from the Wayback Machine are handled. Set WAYBACK_MIDDLEWARE_POST to True to switch from GET requests to POST requests.
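A hedged sketch of such a subclass. It assumes get_item_urls receives a single scraped item and returns a list of URL strings, and that items are dict-like with extra links stored under a "links" key; both are assumptions for illustration, so check the package source for the exact signature. The class name DocumentWaybackMiddleware is hypothetical, and the fallback base class exists only so the sketch runs without the package installed.

```python
try:
    from scrapy_wayback_middleware import WaybackMiddleware
except ImportError:
    # Stand-in so this sketch runs without the package installed.
    class WaybackMiddleware:
        def get_item_urls(self, item):
            return []


class DocumentWaybackMiddleware(WaybackMiddleware):
    """Hypothetical subclass that also archives links stored on each item."""

    def get_item_urls(self, item):
        # Assumption: the item is dict-like and keeps extra links under "links".
        links = item.get("links") or []
        # Keep only absolute HTTP(S) URLs; skip mailto:, relative paths, etc.
        return [
            link for link in links
            if link.startswith(("http://", "https://"))
        ]
```

To use a subclass like this, list its import path (for example, a hypothetical myproject.middlewares.DocumentWaybackMiddleware) in SPIDER_MIDDLEWARES in place of the base class.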

Duplicate Filtering

To avoid sending duplicate requests when WAYBACK_MIDDLEWARE_POST is False, either include web.archive.org in your spider's allowed_domains property (if specified) or disable scrapy.spidermiddlewares.offsite.OffsiteMiddleware in your settings.
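The second option might look like the following in settings.py. In Scrapy, setting a middleware's entry to None disables it; the 543 order for the Wayback middleware is an assumed placeholder. The first option needs no settings change: add "web.archive.org" to the spider's allowed_domains list instead.

```python
# settings.py (sketch)

SPIDER_MIDDLEWARES = {
    # Disable offsite filtering so requests to web.archive.org are not dropped.
    "scrapy.spidermiddlewares.offsite.OffsiteMiddleware": None,
    # Assumed placeholder order for the Wayback middleware.
    "scrapy_wayback_middleware.WaybackMiddleware": 543,
}
```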

Rate Limits

While neither endpoint returns headers indicating specific rate limits, the GET endpoint at web.archive.org/save has a rate limit of 25 requests/minute, resetting each minute. The middleware is configured to wait for 60 seconds whenever it sees a 429 error code to handle this.
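To make the numbers concrete, the sketch below works out the average spacing implied by 25 requests/minute and mirrors the wait-on-429 rule described above. It only illustrates the arithmetic; the function names are hypothetical and this is not the middleware's own implementation.

```python
RATE_LIMIT_PER_MINUTE = 25  # GET endpoint limit stated above
RETRY_WAIT_SECONDS = 60     # wait applied after a 429 response, as stated above


def min_interval_seconds(limit_per_minute=RATE_LIMIT_PER_MINUTE):
    """Average spacing between requests that stays within the limit."""
    return 60 / limit_per_minute


def wait_after_response(status_code):
    """Seconds to pause before the next request, given the last status code."""
    return RETRY_WAIT_SECONDS if status_code == 429 else 0


print(min_interval_seconds())    # 2.4 seconds between requests
print(wait_after_response(429))  # 60
print(wait_after_response(200))  # 0
```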
