Scrapy middleware for submitting URLs to the Internet Archive Wayback Machine

Scrapy Wayback Middleware

Middleware for submitting all scraped response URLs to the Internet Archive Wayback Machine for archival.

Installation

pip install scrapy-wayback-middleware

Setup

Add scrapy_wayback_middleware.WaybackMiddleware to your project's SPIDER_MIDDLEWARES setting. By default, the middleware makes GET requests to web.archive.org/save/{URL}. If the WAYBACK_MIDDLEWARE_POST setting is True, it makes POST requests to pragma.archivelab.org instead.
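As a minimal sketch of that setup, a project's settings.py might contain something like the following. The order value 543 is an arbitrary number chosen for illustration, not a value prescribed by the middleware. WAYBACK_MIDDLEWARE_POST is written out explicitly only to show where it goes.

```python
# settings.py (sketch)

SPIDER_MIDDLEWARES = {
    # The order value 543 is an assumption for illustration; pick one that
    # fits with the other spider middlewares in your project.
    "scrapy_wayback_middleware.WaybackMiddleware": 543,
}

# False: GET requests to web.archive.org/save/{URL} (the default behavior).
# True: POST requests to pragma.archivelab.org instead.
WAYBACK_MIDDLEWARE_POST = False
```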

Configuration

To customize behavior, subclass WaybackMiddleware and override get_item_urls to pull additional links to archive from individual items, or override handle_wayback to change how responses from the Wayback Machine are handled. Set WAYBACK_MIDDLEWARE_POST to True to send POST requests to pragma.archivelab.org instead of GET requests to web.archive.org.
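A subclass along these lines could look roughly like the sketch below. The method signatures (get_item_urls taking a single item, handle_wayback taking a single response) are assumptions based on the description above, so check them against the installed version. The "links" field, the links_from_item helper, and the class name are hypothetical.

```python
try:
    from scrapy_wayback_middleware import WaybackMiddleware
except ImportError:
    # Stand-in so this sketch still runs where the package is not installed.
    class WaybackMiddleware:
        def get_item_urls(self, item):
            return []

        def handle_wayback(self, response):
            pass


def links_from_item(item):
    """Return the http(s) URLs stored in a hypothetical "links" field."""
    links = item.get("links") or []
    return [url for url in links if url.startswith(("http://", "https://"))]


class ItemLinksWaybackMiddleware(WaybackMiddleware):
    def get_item_urls(self, item):
        # Assumed signature: receives one scraped item and returns
        # the extra URLs that should be archived.
        return links_from_item(item)

    def handle_wayback(self, response):
        # Assumed signature: receives the response from the Wayback Machine.
        print(f"Wayback Machine returned {response.status} for {response.url}")
```

To use a subclass like this, list its import path in SPIDER_MIDDLEWARES in place of scrapy_wayback_middleware.WaybackMiddleware.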

Duplicate Filtering

To avoid sending duplicate requests when WAYBACK_MIDDLEWARE_POST is False, either include web.archive.org in your spider's allowed_domains attribute (if specified) or disable scrapy.spidermiddlewares.offsite.OffsiteMiddleware in your settings.
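The two options can be sketched as follows. Only one of them is needed. The domain example.com and the order value 543 are placeholders; setting a middleware's value to None is the standard Scrapy way to disable it.

```python
# Option 1 (spider class attribute): allow the Wayback Machine domain
# alongside the site being scraped. "example.com" is a placeholder.
allowed_domains = ["example.com", "web.archive.org"]

# Option 2 (settings.py): disable the offsite spider middleware entirely.
SPIDER_MIDDLEWARES = {
    "scrapy_wayback_middleware.WaybackMiddleware": 543,  # illustrative order
    "scrapy.spidermiddlewares.offsite.OffsiteMiddleware": None,
}
```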

Rate Limits

While neither endpoint returns headers indicating specific rate limits, the GET endpoint at web.archive.org/save has a rate limit of 25 requests/minute, resetting each minute. To handle this, the middleware waits 60 seconds whenever it receives a 429 status code.
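The wait-on-429 behavior can be illustrated with a small standalone sketch. This is not the middleware's own code: save_with_wait, its send callable, and the max_attempts parameter are hypothetical names used only to show the idea of pausing for a full minute before retrying.

```python
import time

RETRY_WAIT_SECONDS = 60  # the rate limit window resets each minute


def save_with_wait(send, url, max_attempts=3, sleep=time.sleep):
    """Call send(url) and wait out the rate limit window after a 429.

    send is any callable that returns an HTTP status code.
    """
    status = None
    for attempt in range(max_attempts):
        status = send(url)
        if status != 429:
            return status
        if attempt < max_attempts - 1:
            # Rate limited: pause before the next attempt.
            sleep(RETRY_WAIT_SECONDS)
    return status
```

Passing sleep as a parameter keeps the sketch easy to exercise without actually waiting a minute.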
