
A Python script to submit web pages to the Wayback Machine for archiving.


Wayback Machine Archiver

Wayback Machine Archiver (Archiver for short) is a command-line utility written in Python to back up web pages using the Internet Archive.

Installation

The best way to install Archiver is with pip:

pip install wayback-machine-archiver

This installs the archiver command, which you can check by running:

archiver --help

You can also install it directly from a local clone of this repository:

git clone https://github.com/agude/wayback-machine-archiver.git
cd wayback-machine-archiver
pip install .

All dependencies are handled automatically. Archiver supports Python 3.8+.

Usage

The archiver is simple to use from the command line. The examples below work regardless of which execution mode you are using.

Command-Line Examples

Archive a single page:

archiver https://alexgude.com

Archive all pages from a sitemap:

archiver --sitemaps https://alexgude.com/sitemap.xml
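
To make "all pages from a sitemap" concrete, the sketch below pulls the page URLs out of a sitemap document with the Python standard library. It illustrates the sitemap format only and is not Archiver's own parsing code; the function name extract_sitemap_urls and the sample XML are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_sitemap_urls(xml_text):
    """Return the text of every <url><loc> entry in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Hypothetical sample sitemap, for illustration only.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about/</loc></url>
</urlset>"""

if __name__ == "__main__":
    for url in extract_sitemap_urls(SAMPLE):
        print(url)
```

Each URL found this way corresponds to one page that would be submitted for archiving.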

Archive from a local sitemap file (the file:// prefix is required):

archiver --sitemaps file://sitemap.xml
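
If you build the --sitemaps arguments in a script, a small helper can add the prefix for you. This is a minimal sketch; to_sitemap_argument is a hypothetical name and is not part of Archiver.

```python
def to_sitemap_argument(location):
    """Prefix bare local paths with file://; leave existing URIs untouched."""
    # Anything that already names a scheme is passed through as-is.
    if location.startswith(("http://", "https://", "file://")):
        return location
    # A bare path such as "sitemap.xml" gets the required prefix.
    return "file://" + location


if __name__ == "__main__":
    print(to_sitemap_argument("sitemap.xml"))
    print(to_sitemap_argument("https://alexgude.com/sitemap.xml"))
```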

Archive from a text file of URLs (the file should contain one URL per line):

archiver --file urls.txt
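
A file in this format is easy to generate from a script. The sketch below writes one URL per line and drops blanks and duplicates along the way; write_url_file and the example URLs are hypothetical, and the cleanup is a convenience of this sketch rather than something Archiver requires.

```python
from pathlib import Path


def write_url_file(urls, path):
    """Write URLs to a text file, one per line; return how many were written."""
    seen = set()
    lines = []
    for url in urls:
        url = url.strip()
        # Skip empty entries and URLs that were already written.
        if url and url not in seen:
            seen.add(url)
            lines.append(url)
    Path(path).write_text("".join(line + "\n" for line in lines), encoding="utf-8")
    return len(lines)


if __name__ == "__main__":
    count = write_url_file(
        ["https://example.com/", "https://example.com/about/"],
        "urls.txt",
    )
    print(f"wrote {count} URLs to urls.txt")
```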

Combine multiple sources:

archiver https://radiokeysmusic.com --sitemaps https://charles.uno/sitemap.xml

Archive the pages listed in a sitemap and also the sitemap URL itself:

archiver --sitemaps https://alexgude.com/sitemaps.xml --archive-sitemap-also
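
The examples above can also be assembled programmatically, for instance from a scheduled job. The following is a minimal sketch that uses only flags shown in this document; build_archiver_command is a hypothetical helper, not part of the package.

```python
def build_archiver_command(urls=(), sitemaps=(), url_file=None,
                           archive_sitemap_also=False):
    """Assemble an argument list for the archiver command."""
    # Positional URLs go first: --sitemaps accepts one or more values, so a
    # URL placed directly after it would be read as another sitemap.
    command = ["archiver", *urls]
    if url_file is not None:
        command += ["--file", url_file]
    if sitemaps:
        command += ["--sitemaps", *sitemaps]
    if archive_sitemap_also:
        command.append("--archive-sitemap-also")
    return command


if __name__ == "__main__":
    print(build_archiver_command(
        urls=["https://radiokeysmusic.com"],
        sitemaps=["https://charles.uno/sitemap.xml"],
    ))
```

The resulting list can be passed to subprocess.run(command, check=True) once archiver is installed.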

Execution Modes

The script runs in one of two modes, which it selects automatically based on whether it finds Internet Archive credentials.

Authenticated Mode (Recommended)

This is the preferred mode. The script uses the Internet Archive's Save Page Now 2 (SPN2) API to submit a capture job, wait for it to complete, and confirm the final success or failure.
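
The submit, wait, and confirm cycle can be pictured as a simple polling loop. The sketch below is not Archiver's implementation: submit and check_status are hypothetical stand-ins for the real API calls, and the status strings, poll interval, and poll limit are assumptions chosen for illustration.

```python
import time


def run_capture_job(submit, check_status, url,
                    poll_interval=5.0, max_polls=60, sleep=time.sleep):
    """Submit a capture, poll until it finishes, and return the final status.

    submit(url) returns a job id; check_status(job_id) returns "pending",
    "success", or "error". Both callables are hypothetical placeholders.
    """
    job_id = submit(url)
    for _ in range(max_polls):
        status = check_status(job_id)
        if status != "pending":
            return status  # the job finished, successfully or not
        sleep(poll_interval)
    return "timeout"  # gave up waiting for a final answer


if __name__ == "__main__":
    # Fake API: the job is pending twice, then succeeds.
    answers = iter(["pending", "pending", "success"])
    result = run_capture_job(
        submit=lambda url: "job-1",
        check_status=lambda job_id: next(answers),
        url="https://example.com",
        poll_interval=0.0,
    )
    print(result)
```

Passing the API calls in as functions keeps the loop testable without network access.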

To enable this mode:

  1. Get your S3-style API keys from your Internet Archive account settings: https://archive.org/account/s3.php

  2. Create a .env file in the directory where you run the archiver command. Add your keys to it:

    INTERNET_ARCHIVE_ACCESS_KEY="YOUR_ACCESS_KEY_HERE"
    INTERNET_ARCHIVE_SECRET_KEY="YOUR_SECRET_KEY_HERE"
    

The script will automatically detect this file (or the equivalent environment variables) and use the authenticated API.
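
To check which mode your environment would select before a long run, something like the following sketch can help. It only inspects environment variables (it does not read a .env file), and the rule that both keys must be non-empty is an assumption of this sketch, not a statement about Archiver's internals; select_mode is a hypothetical name.

```python
import os


def select_mode(env=None):
    """Return "authenticated" if both keys are set, else "unauthenticated"."""
    env = os.environ if env is None else env
    access = env.get("INTERNET_ARCHIVE_ACCESS_KEY", "").strip()
    secret = env.get("INTERNET_ARCHIVE_SECRET_KEY", "").strip()
    # Assumption: a usable credential pair needs both values present.
    return "authenticated" if access and secret else "unauthenticated"


if __name__ == "__main__":
    print(select_mode())
```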

Unauthenticated Mode

If no credentials are found, the script falls back to the public, unauthenticated API. This is a "fire-and-forget" method: it submits the capture request but does not wait to confirm whether the capture succeeded.
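
For a sense of what a fire-and-forget submission looks like, the sketch below builds a request against the public Save Page Now address, https://web.archive.org/save/ followed by the page URL. That Archiver sends exactly this request is an assumption; build_save_request is a hypothetical helper, and the request is only constructed here, never sent.

```python
import urllib.request

# Public Save Page Now address; the page URL is appended to it.
SAVE_ENDPOINT = "https://web.archive.org/save/"


def build_save_request(page_url):
    """Build (but do not send) a GET request asking for a capture of page_url."""
    return urllib.request.Request(SAVE_ENDPOINT + page_url, method="GET")


if __name__ == "__main__":
    request = build_save_request("https://example.com")
    print(request.get_method(), request.full_url)
```

Sending the request with urllib.request.urlopen(request) would submit the capture, but the response would not tell you whether the page was eventually archived.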

Help

For a full list of command-line flags, run archiver --help:

usage: archiver [-h] [--version] [--file FILE]
                [--sitemaps SITEMAPS [SITEMAPS ...]]
                [--log {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
                [--log-to-file LOG_FILE]
                [--archive-sitemap-also]
                [--rate-limit-wait RATE_LIMIT_IN_SEC]
                [--random-order]
                [urls ...]

A script to backup a web pages with Internet Archive

positional arguments:
  urls                  the URLs of the pages to archive

options:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --file FILE           path to a file containing urls to save (one url per
                        line)
  --sitemaps SITEMAPS [SITEMAPS ...]
                        one or more URIs to sitemaps listing pages to
                        archive; local paths must be prefixed with 'file://'
  --log {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                        set the logging level, defaults to WARNING
  --log-to-file LOG_FILE
                        redirect logs to a file
  --archive-sitemap-also
                        also submit the URL of the sitemap to be archived
  --rate-limit-wait RATE_LIMIT_IN_SEC
                        number of seconds to wait between page requests to
                        avoid flooding the archive site, defaults to 5; also
                        used as the backoff factor for retries
  --random-order        randomize the order of pages before archiving
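
The help text says the rate-limit wait doubles as the backoff factor for retries but does not give the formula. As a rough guide, the sketch below assumes a common exponential scheme in which each retry waits twice as long as the previous one; the formula, the retry count, and the name backoff_delays are assumptions, not Archiver's documented behavior.

```python
def backoff_delays(backoff_factor=5.0, retries=3):
    """Seconds to wait before each retry, doubling every time."""
    # Assumed scheme: factor * 2**n for retry number n = 0, 1, 2, ...
    return [backoff_factor * (2 ** n) for n in range(retries)]


if __name__ == "__main__":
    # With the default wait of 5 seconds and three retries.
    print(backoff_delays())
```

Under this assumption, raising --rate-limit-wait slows down both normal requests and every retry after a failure.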

Setting Up a sitemap.xml for GitHub Pages

It is easy to automatically generate a sitemap for a GitHub Pages Jekyll site: use the jekyll/jekyll-sitemap plugin.

Setup instructions can be found in that project's documentation; they require changing just a single line of your site's _config.yml.
