A Python script to submit web pages to the Wayback Machine for archiving.

Wayback Machine Archiver

Wayback Machine Archiver (Archiver for short) is a command-line utility written in Python to back up web pages using the Internet Archive.

Installation

The best way to install Archiver is with pip:

pip install wayback-machine-archiver

This installs the archiver command. To confirm it is available, run:

archiver --help

You can also install it directly from a local clone of this repository:

git clone https://github.com/agude/wayback-machine-archiver.git
cd wayback-machine-archiver
pip install .

All dependencies are handled automatically. Archiver supports Python 3.8+.

Usage

The archiver is simple to use from the command line.

Command-Line Examples

Archive a single page:

archiver https://alexgude.com

Archive all pages from a sitemap:

archiver --sitemaps https://alexgude.com/sitemap.xml

Archive from a local sitemap file (the file:// prefix is required):

archiver --sitemaps file://sitemap.xml

Archive from a text file of URLs (the file should contain one URL per line):

archiver --file urls.txt

Combine multiple sources:

archiver https://radiokeysmusic.com --sitemaps https://charles.uno/sitemap.xml

Use advanced API options (capture a screenshot, and skip the page if it was archived in the last 10 days):

archiver https://alexgude.com --capture-screenshot --if-not-archived-within 10d

Archive the sitemap URL itself:

archiver --sitemaps https://alexgude.com/sitemap.xml --archive-sitemap-also
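
If you call Archiver from another Python script, the command lines above can be assembled as an argument list. The following is a minimal sketch and not part of the project: build_archiver_command is a hypothetical helper, and it uses only flags that appear in the examples and in archiver --help.

```python
import shlex
import subprocess


def build_archiver_command(urls=(), sitemaps=(), url_file=None,
                           capture_screenshot=False,
                           if_not_archived_within=None):
    """Build an argv list for the archiver CLI (hypothetical helper)."""
    cmd = ["archiver"]
    # Positional URLs go first so they are not consumed by --sitemaps,
    # which accepts one or more values.
    cmd.extend(urls)
    if url_file:
        cmd += ["--file", url_file]
    if capture_screenshot:
        cmd.append("--capture-screenshot")
    if if_not_archived_within:
        cmd += ["--if-not-archived-within", if_not_archived_within]
    if sitemaps:
        cmd.append("--sitemaps")
        cmd.extend(sitemaps)
    return cmd


cmd = build_archiver_command(
    urls=["https://alexgude.com"],
    capture_screenshot=True,
    if_not_archived_within="10d",
)
print(shlex.join(cmd))
# prints: archiver https://alexgude.com --capture-screenshot --if-not-archived-within 10d

# To actually submit the pages (requires an installed archiver and credentials):
# subprocess.run(cmd, check=True)
```

Passing a list to subprocess.run avoids shell quoting problems with URLs that contain characters such as & or ?.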

Authentication (Required)

As of version 3.0.0, this tool requires authentication with the Internet Archive's SPN2 API. This change was made to ensure all archiving jobs are reliable and their final success or failure status can be confirmed. The previous, less reliable method for unauthenticated users has been removed.

If you run the script without credentials, it will exit with an error message.

To set up authentication:

  1. Get your S3-style API keys from your Internet Archive account settings: https://archive.org/account/s3.php

  2. Create a .env file in the directory where you run the archiver command. Add your keys to it:

    INTERNET_ARCHIVE_ACCESS_KEY="YOUR_ACCESS_KEY_HERE"
    INTERNET_ARCHIVE_SECRET_KEY="YOUR_SECRET_KEY_HERE"
    

The script will automatically detect this file (or the equivalent environment variables) and use the authenticated API.
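
Before starting a long run, it can help to confirm that both keys are actually visible. The sketch below is an illustration only, not Archiver's own loading code: read_env_file and missing_credentials are hypothetical helpers, and the parser handles only simple KEY="VALUE" lines like the ones shown above.

```python
import os
from pathlib import Path

REQUIRED_KEYS = ("INTERNET_ARCHIVE_ACCESS_KEY", "INTERNET_ARCHIVE_SECRET_KEY")


def read_env_file(path=".env"):
    """Parse simple KEY="VALUE" lines; this is not a full .env parser."""
    values = {}
    env_path = Path(path)
    if not env_path.is_file():
        return values
    for line in env_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_credentials(path=".env", environ=None):
    """Return required key names set in neither the environment nor the file."""
    environ = os.environ if environ is None else environ
    file_values = read_env_file(path)
    return [key for key in REQUIRED_KEYS
            if not environ.get(key) and not file_values.get(key)]


if __name__ == "__main__":
    missing = missing_credentials()
    if missing:
        print("Missing credentials:", ", ".join(missing))
    else:
        print("Both Internet Archive keys are set.")
```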

Help

For a full list of command-line flags, Archiver has built-in help displayed with archiver --help:

usage: archiver [-h] [--version] [--file FILE]
                [--sitemaps SITEMAPS [SITEMAPS ...]]
                [--log {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
                [--log-to-file LOG_FILE]
                [--archive-sitemap-also]
                [--rate-limit-wait RATE_LIMIT_IN_SEC]
                [--random-order] [--capture-all]
                [--capture-outlinks] [--capture-screenshot]
                [--delay-wb-availability] [--force-get]
                [--skip-first-archive] [--email-result]
                [--if-not-archived-within <timedelta>]
                [--js-behavior-timeout <seconds>]
                [--capture-cookie <cookie>]
                [--user-agent <string>]
                [urls ...]

A script to backup a web pages with Internet Archive

positional arguments:
  urls                  Specifies the URLs of the pages to archive.

options:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --file FILE           Specifies the path to a file containing URLs to save,
                        one per line.
  --sitemaps SITEMAPS [SITEMAPS ...]
                        Specifies one or more URIs to sitemaps listing pages
                        to archive. Local paths must be prefixed with
                        'file://'.
  --log {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                        Sets the logging level. Defaults to WARNING
                        (case-insensitive).
  --log-to-file LOG_FILE
                        Redirects logs to a specified file instead of the
                        console.
  --archive-sitemap-also
                        Submits the URL of the sitemap itself to be archived.
  --rate-limit-wait RATE_LIMIT_IN_SEC
                        Specifies the number of seconds to wait between
                        submissions. A minimum of 5 seconds is enforced for
                        authenticated users. Defaults to 15.
  --random-order        Randomizes the order of pages before archiving.

SPN2 API Options:
  Control the behavior of the Internet Archive capture API.

  --capture-all         Captures a web page even if it returns an error (e.g.,
                        404, 500).
  --capture-outlinks    Captures web page outlinks automatically. Note: this
                        can significantly increase the total number of
                        captures and runtime.
  --capture-screenshot  Captures a full page screenshot.
  --delay-wb-availability
                        Reduces load on Internet Archive systems by making the
                        capture publicly available after ~12 hours instead of
                        immediately.
  --force-get           Bypasses the headless browser check, which can speed
                        up captures for non-HTML content (e.g., PDFs, images).
  --skip-first-archive  Speeds up captures by skipping the check for whether
                        this is the first time a URL has been archived.
  --email-result        Sends an email report of the captured URLs to the
                        user's registered email.
  --if-not-archived-within <timedelta>
                        Captures only if the latest capture is older than
                        <timedelta> (e.g., '3d 5h').
  --js-behavior-timeout <seconds>
                        Runs JS code for <N> seconds after page load to
                        trigger dynamic content. Defaults to 5, max is 30. Use
                        0 to disable for static pages.
  --capture-cookie <cookie>
                        Uses an extra HTTP Cookie value when capturing the
                        target page.
  --user-agent <string>
                        Uses a custom HTTP User-Agent value when capturing the
                        target page.
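
Because submissions are spaced out by --rate-limit-wait, the wait alone sets a lower bound on how long a large run takes. The sketch below works through that arithmetic using the default of 15 seconds and the 5-second minimum from the help text. estimate_min_runtime is a hypothetical helper, and treating the minimum as a clamp rather than an error is an assumption. Real runs take longer, since the time each capture needs is not included.

```python
from datetime import timedelta

DEFAULT_WAIT_SEC = 15  # documented default for --rate-limit-wait
MIN_WAIT_SEC = 5       # documented minimum for authenticated users


def estimate_min_runtime(num_urls, rate_limit_wait=DEFAULT_WAIT_SEC):
    """Lower bound on run time from the waits between submissions alone."""
    # Assumption: values below the minimum are raised to the minimum.
    wait = max(rate_limit_wait, MIN_WAIT_SEC)
    # N submissions are separated by N - 1 waits.
    gaps = max(num_urls - 1, 0)
    return timedelta(seconds=gaps * wait)


print(estimate_min_runtime(500))     # 499 * 15 s -> 2:04:45
print(estimate_min_runtime(500, 2))  # clamped to 5 s -> 0:41:35
```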

Setting Up a Sitemap.xml for GitHub Pages

It is easy to automatically generate a sitemap for a GitHub Pages Jekyll site: use the jekyll/jekyll-sitemap plugin.

Setup instructions are in that plugin's documentation; they require changing just a single line of your site's _config.yml.
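
To preview which pages a generated sitemap would contribute before passing it to --sitemaps, you can list its URLs with the standard library. This is a minimal sketch, not Archiver's own parser: urls_from_sitemap is a hypothetical helper, the example.com URLs are placeholders, and the code assumes a plain urlset sitemap in the standard sitemaps.org namespace (not a sitemap index).

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text):
    """Return the <loc> value of every <url> entry in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip()
            for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
            if loc.text]


example = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about/</loc></url>
</urlset>
""".strip()

for url in urls_from_sitemap(example):
    print(url)
```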
