Scrapy middleware for submitting URLs to the Internet Archive Wayback Machine
Scrapy Wayback Middleware
Middleware for submitting all scraped response URLs to the Internet Archive Wayback Machine for archival.
Installation
pip install scrapy-wayback-middleware
Setup
Add scrapy_wayback_middleware.WaybackMiddleware to your project's SPIDER_MIDDLEWARES setting. By default, the middleware makes GET requests to web.archive.org/save/{URL}. If the WAYBACK_MIDDLEWARE_POST setting is True, it makes POST requests to pragma.archivelab.org instead.
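As a minimal sketch of the setup described above, a project's settings.py might contain something like the following. The priority value 500 is an assumption chosen for illustration; pick whatever order suits your other spider middlewares.

```python
# settings.py (sketch)

# Register the middleware. The integer is the middleware's position in the
# spider middleware chain; 500 is an illustrative value, not a requirement.
SPIDER_MIDDLEWARES = {
    "scrapy_wayback_middleware.WaybackMiddleware": 500,
}

# Optional: leave unset or False to use GET requests to web.archive.org/save/{URL};
# set to True to use POST requests to pragma.archivelab.org instead.
WAYBACK_MIDDLEWARE_POST = False
```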
Configuration
To customize behavior, subclass WaybackMiddleware and override one of its methods: override get_item_urls to pull additional links to archive from individual items, or override handle_wayback to change how responses from the Wayback Machine are handled. The WAYBACK_MIDDLEWARE_POST setting can be set to True to switch from GET to POST requests.
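The subclassing approach could look roughly like the sketch below. It assumes that get_item_urls receives a single scraped item and returns a list of URL strings; that signature is not spelled out here, so check it against the installed version. The class name DocumentLinksMiddleware and the "links" item field are hypothetical. The fallback base class exists only so the sketch runs without the package installed.

```python
try:
    from scrapy_wayback_middleware import WaybackMiddleware
except ImportError:
    # Stand-in used only when the package is not installed, so that this
    # sketch stays runnable. It is not the real implementation.
    class WaybackMiddleware:
        def get_item_urls(self, item):
            return []


class DocumentLinksMiddleware(WaybackMiddleware):
    """Hypothetical subclass that also archives links stored on each item."""

    def get_item_urls(self, item):
        # Assumed signature: takes one item, returns a list of URLs to archive.
        # "links" is a hypothetical field holding URLs found on the page.
        return [url for url in item.get("links", []) if url.startswith("http")]
```

To use a subclass like this, register its import path in SPIDER_MIDDLEWARES in place of scrapy_wayback_middleware.WaybackMiddleware.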
Duplicate Filtering
To avoid sending duplicate requests when WAYBACK_MIDDLEWARE_POST is False, either include web.archive.org in your spider's allowed_domains property (if specified) or disable scrapy.spidermiddlewares.offsite.OffsiteMiddleware in your settings.
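The two options above can be sketched as follows. The domain example.com and the priority 500 are placeholders for illustration. Mapping a middleware path to None is Scrapy's usual way of disabling a built-in middleware.

```python
# Option 1: list web.archive.org alongside your own domains. Assign this list
# to the spider's allowed_domains attribute ("example.com" is a placeholder).
ALLOWED_DOMAINS = ["example.com", "web.archive.org"]

# Option 2: in settings.py, disable the offsite spider middleware entirely by
# mapping its path to None.
SPIDER_MIDDLEWARES = {
    "scrapy_wayback_middleware.WaybackMiddleware": 500,  # illustrative priority
    "scrapy.spidermiddlewares.offsite.OffsiteMiddleware": None,
}
```

Option 1 keeps offsite filtering in place for every other domain, so it is usually the narrower change.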
Rate Limits
Neither endpoint returns headers indicating specific rate limits, but the GET endpoint at web.archive.org/save has a rate limit of 25 requests per minute, resetting each minute. To handle this, the middleware waits 60 seconds whenever it receives a 429 status code.
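To make the numbers above concrete, here is a small standalone sketch of a fixed one-minute window limiter plus a 429 backoff rule. It is not the middleware's own implementation; the names MinuteWindowLimiter and backoff_for_status are hypothetical, and the clock is injectable only to make the behavior easy to check.

```python
import time


class MinuteWindowLimiter:
    """Allow at most `limit` requests per fixed window of `window` seconds."""

    def __init__(self, limit=25, window=60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock
        self.window_start = clock()
        self.count = 0

    def seconds_until_allowed(self):
        """Return 0.0 and count the request if allowed, else the wait time."""
        now = self.clock()
        if now - self.window_start >= self.window:
            # The window has elapsed, so start a new one with a fresh count.
            self.window_start = now
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return 0.0
        # Limit reached: wait for the remainder of the current window.
        return self.window - (now - self.window_start)


def backoff_for_status(status, wait=60.0):
    """Return how long to pause after a response: `wait` seconds on 429."""
    return wait if status == 429 else 0.0
```

A client-side limiter like this only reduces how often a 429 occurs; the 60-second pause on 429 remains the fallback when the server's window does not line up with the local one.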
File details
Details for the file scrapy-wayback-middleware-0.3.3b0.tar.gz.
File metadata
- Download URL: scrapy-wayback-middleware-0.3.3b0.tar.gz
- Upload date:
- Size: 4.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.1.12 CPython/3.9.9 Linux/5.11.0-1025-azure
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 70f35263f4b0d17d5a2896c4b31e7627c8ceb6fe0e928b3f4509d9c002faec64 |
| MD5 | 2682508a0dbcc32511ee008af981182c |
| BLAKE2b-256 | 2c983f507fab492e3609072d0f579bab51398f6d936f006095901cc3df4fcc77 |
File details
Details for the file scrapy_wayback_middleware-0.3.3b0-py3-none-any.whl.
File metadata
- Download URL: scrapy_wayback_middleware-0.3.3b0-py3-none-any.whl
- Upload date:
- Size: 4.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.1.12 CPython/3.9.9 Linux/5.11.0-1025-azure
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0dbcdf161b7653787dc250d5217a44422498ba119a982ffa53c4bef48d4a45f4 |
| MD5 | 38847d24a9cafdd6908ccc4cabb064a6 |
| BLAKE2b-256 | f6aaa4d6aae0f9db05cfa994568ce4e64aa4f6ba25a7ce24c67ec2b1fc2c33a9 |