A Python script to submit web pages to the Wayback Machine for archiving.
Wayback Machine Archiver
Wayback Machine Archiver (Archiver for short) is a command-line utility written in Python to back up web pages using the Internet Archive.
Installation
The best way to install Archiver is with pip:
pip install wayback-machine-archiver
This will give you access to the script simply by calling:
archiver --help
You can also install it directly from a local clone of this repository:
git clone https://github.com/agude/wayback-machine-archiver.git
cd wayback-machine-archiver
pip install .
All dependencies are handled automatically. Archiver supports Python 3.8+.
Usage
The archiver is simple to use from the command line. The examples below work regardless of which execution mode you are using.
Command-Line Examples
Archive a single page:
archiver https://alexgude.com
Archive all pages from a sitemap:
archiver --sitemaps https://alexgude.com/sitemap.xml
Archive from a local sitemap file:
(Note the file:// prefix is required)
archiver --sitemaps file://sitemap.xml
Archive from a text file of URLs: (The file should contain one URL per line)
archiver --file urls.txt
Combine multiple sources:
archiver https://radiokeysmusic.com --sitemaps https://charles.uno/sitemap.xml
Archive the pages listed in a sitemap and also the sitemap URL itself:
archiver --sitemaps https://alexgude.com/sitemaps.xml --archive-sitemap-also
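To make the sitemap examples above more concrete, here is a minimal sketch of how page URLs could be pulled out of a sitemap document. It is not taken from Archiver's source: the function name `extract_sitemap_urls` and the example.com URLs are illustrative assumptions, and only the namespace comes from the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_sitemap_urls(xml_text):
    """Return the text of every <url><loc> entry in a sitemap document.

    Hypothetical helper for illustration; Archiver's own parser may differ.
    """
    root = ET.fromstring(xml_text)
    urls = []
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        # Skip empty <loc> elements rather than returning blank strings.
        if loc.text and loc.text.strip():
            urls.append(loc.text.strip())
    return urls


# Made-up sitemap used only to exercise the helper.
EXAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about/</loc></url>
</urlset>"""

if __name__ == "__main__":
    for url in extract_sitemap_urls(EXAMPLE_SITEMAP):
        print(url)
```

Each extracted URL corresponds to one page that would be submitted for archiving.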
Execution Modes
The script runs in one of two modes, which it selects automatically based on whether it finds Internet Archive credentials.
Authenticated Mode (Recommended)
This is the preferred mode. The script uses the Internet Archive's Save Page Now 2 (SPN2) API to submit a capture job, wait for it to complete, and confirm the final success or failure.
To enable this mode:
- Get your S3-style API keys from your Internet Archive account settings: https://archive.org/account/s3.php
- Create a .env file in the directory where you run the archiver command. Add your keys to it:
INTERNET_ARCHIVE_ACCESS_KEY="YOUR_ACCESS_KEY_HERE"
INTERNET_ARCHIVE_SECRET_KEY="YOUR_SECRET_KEY_HERE"
The script will automatically detect this file (or the equivalent environment variables) and use the authenticated API.
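The detection step can be pictured roughly as follows. This is a sketch under stated assumptions, not Archiver's actual code: the helper names `parse_dotenv` and `select_mode` are hypothetical, and the .env parsing is deliberately simplistic. Only the two variable names come from the instructions above.

```python
import os

ACCESS_KEY_VAR = "INTERNET_ARCHIVE_ACCESS_KEY"
SECRET_KEY_VAR = "INTERNET_ARCHIVE_SECRET_KEY"


def parse_dotenv(text):
    """Parse simple KEY="value" lines (hypothetical, minimal .env reader)."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Ignore blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def select_mode(environ=None):
    """Return "authenticated" only when both keys are present and non-empty."""
    env = os.environ if environ is None else environ
    access = env.get(ACCESS_KEY_VAR, "").strip()
    secret = env.get(SECRET_KEY_VAR, "").strip()
    if access and secret:
        return "authenticated"
    return "unauthenticated"


if __name__ == "__main__":
    print(select_mode())
```

Requiring both keys keeps a half-configured .env file from silently attempting authenticated requests.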
Unauthenticated Mode
If no credentials are found, the script falls back to the public, unauthenticated API. This is a "fire-and-forget" method that submits the capture request but does not wait to confirm if it was successful.
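As an illustration of what "fire-and-forget" means here, the sketch below sends a single request to the Wayback Machine's public save endpoint and never polls for the capture result. It assumes the `https://web.archive.org/save/` URL form; the function names and the User-Agent string are hypothetical, and Archiver's own request logic may differ.

```python
import urllib.request

# Public Save Page Now endpoint; the page URL is appended to it.
SAVE_ENDPOINT = "https://web.archive.org/save/"


def build_save_url(page_url):
    """Prefix the page URL with the public save endpoint."""
    return SAVE_ENDPOINT + page_url.strip()


def submit_capture_request(page_url, timeout=30):
    """Send one capture request and return only the HTTP status code.

    Nothing here checks whether the capture itself later succeeds,
    which is the limitation of the unauthenticated mode.
    """
    request = urllib.request.Request(
        build_save_url(page_url),
        headers={"User-Agent": "example-archiver-sketch"},  # illustrative value
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    print(build_save_url("https://alexgude.com"))
```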
Help
For a full list of command-line flags, run archiver --help, which prints the built-in help:
usage: archiver [-h] [--version] [--file FILE]
                [--sitemaps SITEMAPS [SITEMAPS ...]]
                [--log {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
                [--log-to-file LOG_FILE]
                [--archive-sitemap-also]
                [--rate-limit-wait RATE_LIMIT_IN_SEC]
                [--random-order]
                [urls ...]

A script to backup a web pages with Internet Archive

positional arguments:
  urls                  the URLs of the pages to archive

options:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --file FILE           path to a file containing urls to save (one url per
                        line)
  --sitemaps SITEMAPS [SITEMAPS ...]
                        one or more URIs to sitemaps listing pages to
                        archive; local paths must be prefixed with 'file://'
  --log {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                        set the logging level, defaults to WARNING
  --log-to-file LOG_FILE
                        redirect logs to a file
  --archive-sitemap-also
                        also submit the URL of the sitemap to be archived
  --rate-limit-wait RATE_LIMIT_IN_SEC
                        number of seconds to wait between page requests to
                        avoid flooding the archive site, defaults to 5; also
                        used as the backoff factor for retries
  --random-order        randomize the order of pages before archiving
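The interplay of --rate-limit-wait and --random-order can be sketched as below. The help text only says the wait value doubles as the backoff factor; the doubling schedule, the function names `backoff_delays` and `plan_requests`, and the `seed` parameter are assumptions made for illustration, not Archiver's documented behavior.

```python
import random


def backoff_delays(backoff_factor, retries):
    """Delays before each retry, assuming the wait doubles every attempt."""
    return [backoff_factor * (2 ** attempt) for attempt in range(retries)]


def plan_requests(urls, rate_limit_wait=5, random_order=False, seed=None):
    """Pair each URL with the number of seconds after start to submit it.

    Hypothetical planner: it only computes a schedule and sends nothing.
    """
    ordered = list(urls)
    if random_order:
        # A seeded generator keeps the shuffle reproducible in examples.
        random.Random(seed).shuffle(ordered)
    return [(index * rate_limit_wait, url) for index, url in enumerate(ordered)]


if __name__ == "__main__":
    print(backoff_delays(5, 3))
    for offset, url in plan_requests(["https://example.com/a", "https://example.com/b"]):
        print(offset, url)
```

With the default wait of 5 seconds, this schedule spaces submissions 5 seconds apart, and under the doubling assumption retries would wait 5, 10, then 20 seconds.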
Setting Up a Sitemap.xml for GitHub Pages
It is easy to automatically generate a sitemap for a GitHub Pages Jekyll site using the jekyll/jekyll-sitemap plugin.
Setup instructions are in that project's documentation; they require changing just a single line of your site's _config.yml.
Download files
File details
Details for the file wayback_machine_archiver-2.1.0.tar.gz.
File metadata
- Download URL: wayback_machine_archiver-2.1.0.tar.gz
- Upload date:
- Size: 14.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 30907cdd225ea5a870744aa83f42e9d7a6ed7477af0ad0a4b5e21fe361114660 |
| MD5 | add7575e6fcba81d2d8a80f3e5978e62 |
| BLAKE2b-256 | 6c01bc6df318f9d0d8b8d8db17f0533a776ef54f566cae5b5ed93508417fcd51 |
Provenance
The following attestation bundles were made for wayback_machine_archiver-2.1.0.tar.gz:
Publisher: release.yml on agude/wayback-machine-archiver
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: wayback_machine_archiver-2.1.0.tar.gz
- Subject digest: 30907cdd225ea5a870744aa83f42e9d7a6ed7477af0ad0a4b5e21fe361114660
- Sigstore transparency entry: 458283188
- Sigstore integration time:
- Permalink: agude/wayback-machine-archiver@9462a8beab50dd186413bf6b253f8b85fcf4d870
- Branch / Tag: refs/tags/v2.1.0
- Owner: https://github.com/agude
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@9462a8beab50dd186413bf6b253f8b85fcf4d870
- Trigger Event: release
File details
Details for the file wayback_machine_archiver-2.1.0-py3-none-any.whl.
File metadata
- Download URL: wayback_machine_archiver-2.1.0-py3-none-any.whl
- Upload date:
- Size: 10.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 313b19c5de468aecda87f70b0abc243dccdbd83ab0bd1a1603b877da71b7b00a |
| MD5 | 55e5f86cd96d37cc667358b6ec72086e |
| BLAKE2b-256 | 6af504e27cab91f9fa5cf0af9fa74ce0e5ec937f62834186edab685031bd4d4d |
Provenance
The following attestation bundles were made for wayback_machine_archiver-2.1.0-py3-none-any.whl:
Publisher: release.yml on agude/wayback-machine-archiver
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: wayback_machine_archiver-2.1.0-py3-none-any.whl
- Subject digest: 313b19c5de468aecda87f70b0abc243dccdbd83ab0bd1a1603b877da71b7b00a
- Sigstore transparency entry: 458283190
- Sigstore integration time:
- Permalink: agude/wayback-machine-archiver@9462a8beab50dd186413bf6b253f8b85fcf4d870
- Branch / Tag: refs/tags/v2.1.0
- Owner: https://github.com/agude
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@9462a8beab50dd186413bf6b253f8b85fcf4d870
- Trigger Event: release