A Python script to submit web pages to the Wayback Machine for archiving.
Wayback Machine Archiver
Wayback Machine Archiver (Archiver for short) is a command-line utility written in Python to back up web pages using the Internet Archive.
Installation
The best way to install Archiver is with pip:
pip install wayback-machine-archiver
This will give you access to the script simply by calling:
archiver --help
You can also install it directly from a local clone of this repository:
git clone https://github.com/agude/wayback-machine-archiver.git
cd wayback-machine-archiver
pip install .
All dependencies are handled automatically. Archiver supports Python 3.8+.
Usage
Archiver is simple to use from the command line.
Command-Line Examples
Archive a single page:
archiver https://alexgude.com
Archive all pages from a sitemap:
archiver --sitemaps https://alexgude.com/sitemap.xml
Archive from a local sitemap file:
(Note the file:// prefix is required)
archiver --sitemaps file://sitemap.xml
Archive from a text file of URLs: (The file should contain one URL per line)
archiver --file urls.txt
Combine multiple sources:
archiver https://radiokeysmusic.com --sitemaps https://charles.uno/sitemap.xml
Use advanced API options: (Capture a screenshot and skip if archived in the last 10 days)
archiver https://alexgude.com --capture-screenshot --if-not-archived-within 10d
Archive the sitemap URL itself:
archiver --sitemaps https://alexgude.com/sitemaps.xml --archive-sitemap-also
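When the list of pages comes from another program, the text file used with --file can be generated rather than written by hand. The following is a minimal sketch, assuming a hypothetical helper named write_url_file; it only produces the one-URL-per-line file described above and does not call Archiver itself.

```python
from pathlib import Path


def write_url_file(urls, path="urls.txt"):
    """Write URLs to a text file, one per line, dropping blanks and duplicates.

    Hypothetical helper for illustration; it is not part of Archiver.
    """
    seen = set()
    lines = []
    for url in urls:
        url = url.strip()
        # Keep the first occurrence of each non-empty URL, preserving order.
        if url and url not in seen:
            seen.add(url)
            lines.append(url)
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines
```

The resulting file can then be passed on the command line as archiver --file urls.txt.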
Authentication (Required)
As of version 3.0.0, this tool requires authentication with the Internet Archive's SPN2 API. This change was made to ensure all archiving jobs are reliable and their final success or failure status can be confirmed. The previous, less reliable method for unauthenticated users has been removed.
If you run the script without credentials, it will exit with an error message.
To set up authentication:
- Get your S3-style API keys from your Internet Archive account settings: https://archive.org/account/s3.php
- Create a .env file in the directory where you run the archiver command. Add your keys to it:
INTERNET_ARCHIVE_ACCESS_KEY="YOUR_ACCESS_KEY_HERE"
INTERNET_ARCHIVE_SECRET_KEY="YOUR_SECRET_KEY_HERE"
The script will automatically detect this file (or the equivalent environment variables) and use the authenticated API.
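Before starting a long run, it can help to confirm that both keys are visible. The sketch below is an illustration under stated assumptions, not Archiver's own loading code: it reads simple KEY="value" lines from a .env file and also checks environment variables. The helper names read_env_file and find_credentials are hypothetical, and giving environment variables precedence over the file is an assumption of this sketch.

```python
import os
from pathlib import Path

# The two variable names documented in this README.
REQUIRED_KEYS = ("INTERNET_ARCHIVE_ACCESS_KEY", "INTERNET_ARCHIVE_SECRET_KEY")


def read_env_file(path=".env"):
    """Parse simple KEY=value lines; quotes around the value are stripped."""
    values = {}
    env_path = Path(path)
    if not env_path.is_file():
        return values
    for line in env_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        # Skip blanks, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("\"'")
    return values


def find_credentials(path=".env", environ=None):
    """Return (found, missing) for the required keys.

    Assumption of this sketch: environment variables win over the .env file.
    """
    environ = os.environ if environ is None else environ
    file_values = read_env_file(path)
    found = {}
    for key in REQUIRED_KEYS:
        value = environ.get(key) or file_values.get(key)
        if value:
            found[key] = value
    missing = [key for key in REQUIRED_KEYS if key not in found]
    return found, missing
```

If the second element of the returned pair is non-empty, the keys it lists still need to be set before running archiver.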
Help
For a full list of command-line flags, Archiver has built-in help displayed
with archiver --help:
usage: archiver [-h] [--version] [--file FILE]
                [--sitemaps SITEMAPS [SITEMAPS ...]]
                [--log {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
                [--log-to-file LOG_FILE]
                [--archive-sitemap-also]
                [--rate-limit-wait RATE_LIMIT_IN_SEC]
                [--random-order] [--capture-all]
                [--capture-outlinks] [--capture-screenshot]
                [--delay-wb-availability] [--force-get]
                [--skip-first-archive] [--email-result]
                [--if-not-archived-within <timedelta>]
                [--js-behavior-timeout <seconds>]
                [--capture-cookie <cookie>]
                [--user-agent <string>]
                [urls ...]
A script to backup a web pages with Internet Archive
positional arguments:
  urls                  Specifies the URLs of the pages to archive.
options:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --file FILE           Specifies the path to a file containing URLs to save,
                        one per line.
  --sitemaps SITEMAPS [SITEMAPS ...]
                        Specifies one or more URIs to sitemaps listing pages
                        to archive. Local paths must be prefixed with
                        'file://'.
  --log {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                        Sets the logging level. Defaults to WARNING
                        (case-insensitive).
  --log-to-file LOG_FILE
                        Redirects logs to a specified file instead of the
                        console.
  --archive-sitemap-also
                        Submits the URL of the sitemap itself to be archived.
  --rate-limit-wait RATE_LIMIT_IN_SEC
                        Specifies the number of seconds to wait between
                        submissions. A minimum of 5 seconds is enforced for
                        authenticated users. Defaults to 15.
  --random-order        Randomizes the order of pages before archiving.
SPN2 API Options:
  Control the behavior of the Internet Archive capture API.
  --capture-all         Captures a web page even if it returns an error (e.g.,
                        404, 500).
  --capture-outlinks    Captures web page outlinks automatically. Note: this
                        can significantly increase the total number of
                        captures and runtime.
  --capture-screenshot  Captures a full page screenshot.
  --delay-wb-availability
                        Reduces load on Internet Archive systems by making the
                        capture publicly available after ~12 hours instead of
                        immediately.
  --force-get           Bypasses the headless browser check, which can speed
                        up captures for non-HTML content (e.g., PDFs, images).
  --skip-first-archive  Speeds up captures by skipping the check for whether
                        this is the first time a URL has been archived.
  --email-result        Sends an email report of the captured URLs to the
                        user's registered email.
  --if-not-archived-within <timedelta>
                        Captures only if the latest capture is older than
                        <timedelta> (e.g., '3d 5h').
  --js-behavior-timeout <seconds>
                        Runs JS code for <N> seconds after page load to
                        trigger dynamic content. Defaults to 5, max is 30. Use
                        0 to disable for static pages.
  --capture-cookie <cookie>
                        Uses an extra HTTP Cookie value when capturing the
                        target page.
  --user-agent <string>
                        Uses a custom HTTP User-Agent value when capturing the
                        target page.
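The <timedelta> value for --if-not-archived-within combines numbers with unit letters, as in the '3d 5h' example in the help text. As a rough illustration of how such a string maps to a duration, the sketch below parses it into a Python timedelta. The accepted unit letters (d, h, m, s) and the helper name parse_timedelta are assumptions made for this example; the exact grammar accepted by the SPN2 API may differ, and this is not Archiver's own parser.

```python
import re
from datetime import timedelta

# Assumed unit letters for this sketch; the real API may accept others.
_UNITS = {"d": "days", "h": "hours", "m": "minutes", "s": "seconds"}
_TOKEN = re.compile(r"(\d+)([dhms])")


def parse_timedelta(text):
    """Convert a string such as '3d 5h' into a datetime.timedelta."""
    kwargs = {}
    for token in text.split():
        match = _TOKEN.fullmatch(token)
        if match is None:
            raise ValueError(f"unrecognized duration token: {token!r}")
        amount, unit = match.groups()
        name = _UNITS[unit]
        # Repeated units are summed rather than overwritten.
        kwargs[name] = kwargs.get(name, 0) + int(amount)
    if not kwargs:
        raise ValueError("empty duration")
    return timedelta(**kwargs)
```

Under these assumptions, the value 10d used in the earlier command-line example corresponds to timedelta(days=10).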
Setting Up a Sitemap.xml for GitHub Pages
It is easy to automatically generate a sitemap for a GitHub Pages Jekyll site. Simply use the jekyll/jekyll-sitemap plugin.
Setup instructions can be found in that project's documentation; they require changing just
a single line of your site's _config.yml.
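To preview which pages a generated sitemap would contribute, the <loc> entries can be listed before handing the sitemap to Archiver. This is a minimal sketch using only the Python standard library; the function name sitemap_urls is hypothetical, and the code is not how Archiver itself reads sitemaps. It assumes the sitemap uses the standard sitemaps.org 0.9 namespace.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files (sitemaps.org protocol 0.9).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_urls(xml_text):
    """Return the text of every <loc> element in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.iter(f"{SITEMAP_NS}loc")
        if loc.text and loc.text.strip()
    ]
```

Passing the contents of a sitemap.xml file to sitemap_urls returns the page URLs in document order.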