
A Python script to submit web pages to the Wayback Machine for archiving.


Wayback Machine Archiver (Archiver for short) is a command-line utility written in Python to back up GitHub Pages using the Internet Archive.

Installation

The best way to install Archiver is with pip:

pip install wayback-machine-archiver

This will give you access to the script simply by calling:

archiver --help

You can also clone this repository:

git clone https://github.com/agude/wayback-machine-archiver.git
cd wayback-machine-archiver
python ./wayback_machine_archiver/archiver.py --help

If you clone the repository, Archiver can be installed as a local application using the setup.py script:

git clone https://github.com/agude/wayback-machine-archiver.git
cd wayback-machine-archiver
./setup.py install

Like installing with pip, this gives you access to the script by calling archiver.

Archiver requires the `requests library <https://github.com/kennethreitz/requests>`__ by Kenneth Reitz. Archiver supports Python 2.7 and Python 3.4+.

Usage

The simplest way to schedule a backup is by specifying the URL of a web page, like so:

archiver https://alexgude.com

This will submit the main page of my blog, alexgude.com, to the Wayback Machine for archiving.
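As a rough illustration of what a submission involves, the sketch below builds the kind of request URL used by the Wayback Machine's public "Save Page Now" endpoint. The endpoint constant and the helper name build_save_url are assumptions made for this example; Archiver's own internals may differ, and the sketch only builds the URL without sending anything.

```python
# Assumption: the public "Save Page Now" endpoint of the Wayback Machine.
# This is an illustrative sketch, not Archiver's actual code.
SAVE_ENDPOINT = "https://web.archive.org/save/"


def build_save_url(page_url):
    """Return the URL that asks the Wayback Machine to capture page_url."""
    return SAVE_ENDPOINT + page_url


print(build_save_url("https://alexgude.com"))
```

Requesting the resulting URL (for example with the requests library) is what triggers a capture; Archiver adds rate limiting and retries on top of this.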

You can also archive all the URLs specified in a `sitemap.xml <https://en.wikipedia.org/wiki/Sitemaps>`__ as follows:

archiver --sitemaps https://alexgude.com/sitemap.xml

This will back up every page listed in the sitemap of my website, alexgude.com.
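To make "every page listed in the sitemap" concrete, here is a minimal Python 3 sketch of pulling page URLs out of a sitemap document. It assumes a standard sitemap using the sitemaps.org namespace; the function name extract_urls and the example document are hypothetical, and this is not necessarily how Archiver parses sitemaps.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

# Hypothetical two-page sitemap used only for illustration.
EXAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about/</loc></url>
</urlset>"""


def extract_urls(sitemap_xml):
    """Return the text of every <loc> element in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]


print(extract_urls(EXAMPLE_SITEMAP))
```

Each extracted URL is then a candidate for submission, just as if it had been passed on the command line.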

You can also pass a sitemap.xml file (requires the file:// prefix) to the archiver:

archiver --sitemaps file://sitemap.xml

You can back up multiple pages by specifying multiple URLs or sitemaps:

archiver https://radiokeysmusic.com --sitemaps https://charles.uno/sitemap.xml https://alexgude.com/sitemaps.xml

You can also back up multiple URLs by writing them to a file (for example, urls.txt), one URL per line, and passing that file to archiver:

archiver --file urls.txt
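If the list of URLs comes from another program, a file in this format is easy to generate. The following Python 3 sketch writes one URL per line, dropping blanks and duplicates; the helper name write_url_file and the example URLs are assumptions for illustration and are not part of Archiver.

```python
from pathlib import Path


def write_url_file(urls, path="urls.txt"):
    """Write one URL per line, skipping blanks and duplicates while keeping order."""
    seen = set()
    lines = []
    for url in urls:
        url = url.strip()
        if url and url not in seen:
            seen.add(url)
            lines.append(url)
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines


if __name__ == "__main__":
    write_url_file(["https://alexgude.com", "https://charles.uno"])
```

The resulting urls.txt can then be passed to archiver with the --file flag.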

Sitemaps often exclude themselves, so you can request that the sitemap itself be backed up using the flag --archive-sitemap-also:

archiver --sitemaps https://alexgude.com/sitemaps.xml --archive-sitemap-also

Help

For a full list of command-line flags, Archiver has built-in help, displayed with archiver --help:

usage: Github Pages Archiver [-h] [--version] [--file FILE]
                             [--sitemaps SITEMAPS [SITEMAPS ...]]
                             [--log {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
                             [--log-to-file LOG_FILE] [--archive-sitemap-also]
                             [--jobs JOBS]
                             [--rate-limit-wait RATE_LIMIT_IN_SEC]
                             [urls [urls ...]]

A script to backup a web pages with Internet Archive

positional arguments:
  urls                  the URLs of the pages to archive

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --file FILE           path to a file containing urls to save (one url per
                        line)
  --sitemaps SITEMAPS [SITEMAPS ...]
                        one or more URLs to sitemaps listing pages to archive
  --log {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                        set the logging level, defaults to WARNING
  --log-to-file LOG_FILE
                        redirect logs to a file
  --archive-sitemap-also
                        also submit the URL of the sitemap to be archived
  --jobs JOBS, -j JOBS  run this many concurrent URL submissions, defaults to
                        1
  --rate-limit-wait RATE_LIMIT_IN_SEC
                        number of seconds to wait between page requests to
                        avoid flooding the archive site, defaults to 5; also
                        used as the backoff factor for retries
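The help text says the rate-limit wait doubles as the backoff factor for retries but does not spell out the schedule. As a sketch only, assuming a doubling schedule similar in spirit to the one offered by urllib3's Retry class, the waits before successive retries would grow as shown below. The function name backoff_delays is hypothetical, and the exact delays Archiver uses may differ.

```python
def backoff_delays(backoff_factor, retries):
    """Return the assumed wait (in seconds) before each retry, doubling each time."""
    return [backoff_factor * 2 ** attempt for attempt in range(retries)]


# With the default --rate-limit-wait of 5 and four retries:
print(backoff_delays(5, 4))  # [5, 10, 20, 40]
```

Under this assumption, raising --rate-limit-wait slows down both normal submissions and the recovery from failed ones.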

Setting Up a Sitemap.xml for GitHub Pages

It is easy to automatically generate a sitemap for a GitHub Pages Jekyll site: use the jekyll/jekyll-sitemap plugin.

Setup instructions can be found in the jekyll-sitemap repository; they require changing just a single line of your site's _config.yml.

