A Python script to submit web pages to the Wayback Machine for archiving.
Project description
Wayback Machine Archiver (Archiver for short) is a command-line utility written in Python to back up GitHub Pages using the Internet Archive.
Installation
The best way to install Archiver is with pip:
pip install wayback-machine-archiver
This will give you access to the script simply by calling:
archiver --help
You can also clone this repository:
git clone https://github.com/agude/wayback-machine-archiver.git
cd wayback-machine-archiver
python ./wayback_machine_archiver/archiver.py --help
If you clone the repository, Archiver can be installed as a local application using the setup.py script:
git clone https://github.com/agude/wayback-machine-archiver.git
cd wayback-machine-archiver
./setup.py install
Like installing with pip, this gives you access to the script by calling archiver.
Archiver requires the `requests library <https://github.com/kennethreitz/requests>`__ by Kenneth Reitz. Archiver supports Python 2.7 and Python 3.4+.
Usage
The simplest way to schedule a backup is by specifying the URL of a web page, like so:
archiver https://alexgude.com
This will submit the main page of my blog, alexgude.com, to the Wayback Machine for archiving.
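As a rough illustration of what "submitting" a page means, the sketch below builds the request URL for the Wayback Machine's public Save Page Now endpoint. The function name `build_save_url` and the constant `SAVE_ENDPOINT` are hypothetical names chosen for this example; how Archiver itself assembles and sends the request is not described here.

```python
# Assumption: the Wayback Machine's public "Save Page Now" endpoint, which
# takes the target URL appended to this prefix.
SAVE_ENDPOINT = "https://web.archive.org/save/"


def build_save_url(page_url):
    """Return the URL that asks the Wayback Machine to capture page_url."""
    # Strip stray whitespace so URLs read from files or shells stay clean.
    return SAVE_ENDPOINT + page_url.strip()


# Hypothetical usage: an HTTP GET on this URL (for example with the
# requests library) would request a capture of the page.
request_url = build_save_url("https://alexgude.com")
```

Keeping the URL construction separate from the network call makes it easy to check without contacting the archive.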
You can also archive all the URLs specified in a `sitemap.xml <https://en.wikipedia.org/wiki/Sitemaps>`__ as follows:
archiver --sitemaps https://alexgude.com/sitemap.xml
This will back up every page listed in the sitemap of my website, alexgude.com.
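To show what "every page listed in the sitemap" amounts to, here is a minimal sketch that pulls the `<loc>` entries out of a sitemap document using only the standard library. The function name `extract_urls` and the example.com URLs are placeholders for this example, and Archiver's own parsing may differ.

```python
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_urls(sitemap_xml):
    """Return the text of every <loc> element found in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.iterfind(".//sm:loc", SITEMAP_NS)
        if loc.text  # skip empty <loc/> elements
    ]


# Placeholder sitemap used only for illustration.
EXAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about/</loc></url>
</urlset>"""

page_urls = extract_urls(EXAMPLE_SITEMAP)
```

Each URL in `page_urls` would then be submitted for archiving in the same way as a single page.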
You can also pass a sitemap.xml file (requires the file:// prefix) to the archiver:
archiver --sitemaps file://sitemap.xml
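The file:// prefix is what lets a tool tell a local sitemap from a remote one. The sketch below shows one simple way to make that distinction. The name `split_sitemap_uri` is hypothetical, and plain prefix stripping is an assumption about the approach, not a description of Archiver's internals.

```python
LOCAL_PREFIX = "file://"


def split_sitemap_uri(uri):
    """Classify a sitemap URI as ("local", path) or ("remote", uri)."""
    if uri.startswith(LOCAL_PREFIX):
        # Everything after the prefix is treated as a filesystem path.
        return ("local", uri[len(LOCAL_PREFIX):])
    # Anything else is assumed to be fetched over the network.
    return ("remote", uri)


kind, location = split_sitemap_uri("file://sitemap.xml")
```

With this split, a local sitemap can be read from disk while a remote one is downloaded first.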
You can back up multiple pages by specifying multiple URLs or sitemaps:
archiver https://radiokeysmusic.com --sitemaps https://charles.uno/sitemap.xml https://alexgude.com/sitemaps.xml
You can also back up multiple URLs by writing them to a file (for example, urls.txt), one URL per line, and passing that file to archiver:
archiver --file urls.txt
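The one-URL-per-line format is simple enough to sketch. The helper below, with the hypothetical name `read_urls`, accepts any iterable of lines (such as an open file) and returns the URLs. Skipping blank lines is an assumption made for this example.

```python
import io


def read_urls(lines):
    """Collect one URL per line, ignoring surrounding whitespace."""
    urls = []
    for line in lines:
        url = line.strip()
        if url:  # assumption: blank lines are skipped
            urls.append(url)
    return urls


# An in-memory stand-in for urls.txt; a real file object works the same way.
example_file = io.StringIO("https://alexgude.com\n\nhttps://charles.uno\n")
urls_from_file = read_urls(example_file)
```

Reading from a real file would look like `with open("urls.txt") as handle: read_urls(handle)`.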
Sitemaps often exclude themselves, so you can request that the sitemap itself be backed up using the flag --archive-sitemap-also:
archiver --sitemaps https://alexgude.com/sitemaps.xml --archive-sitemap-also
Help
For a full list of command-line flags, Archiver has built-in help, displayed with archiver --help:
usage: archiver [-h] [--version] [--file FILE]
                [--sitemaps SITEMAPS [SITEMAPS ...]]
                [--log {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
                [--log-to-file LOG_FILE] [--archive-sitemap-also]
                [--jobs JOBS] [--rate-limit-wait RATE_LIMIT_IN_SEC]
                [urls [urls ...]]

A script to backup a web pages with Internet Archive

positional arguments:
  urls                  the URLs of the pages to archive

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --file FILE           path to a file containing urls to save (one url per
                        line)
  --sitemaps SITEMAPS [SITEMAPS ...]
                        one or more URIs to sitemaps listing pages to archive;
                        local paths must be prefixed with 'file://'
  --log {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                        set the logging level, defaults to WARNING
  --log-to-file LOG_FILE
                        redirect logs to a file
  --archive-sitemap-also
                        also submit the URL of the sitemap to be archived
  --jobs JOBS, -j JOBS  run this many concurrent URL submissions, defaults to
                        1
  --rate-limit-wait RATE_LIMIT_IN_SEC
                        number of seconds to wait between page requests to
                        avoid flooding the archive site, defaults to 5; also
                        used as the backoff factor for retries
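For readers who want to see how such an interface fits together, the following is a sketch of an argparse parser that mirrors the flags and defaults listed in the help text. It is a reconstruction from that text, not Archiver's actual source: the function name `build_parser`, the integer types, and the destination names are assumptions, and the --version flag is left out because it needs a version string.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented flags (a reconstruction)."""
    parser = argparse.ArgumentParser(
        prog="archiver",
        description="A script to backup a web pages with Internet Archive",
    )
    parser.add_argument("urls", nargs="*", default=[],
                        help="the URLs of the pages to archive")
    parser.add_argument("--file",
                        help="path to a file containing urls to save")
    parser.add_argument("--sitemaps", nargs="+", default=[],
                        help="one or more URIs to sitemaps")
    parser.add_argument("--log", default="WARNING",
                        choices=["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"],
                        help="set the logging level")
    parser.add_argument("--log-to-file", dest="log_file", metavar="LOG_FILE",
                        help="redirect logs to a file")
    parser.add_argument("--archive-sitemap-also", action="store_true",
                        help="also submit the URL of the sitemap")
    # Assumption: both numeric options are parsed as integers.
    parser.add_argument("--jobs", "-j", type=int, default=1,
                        help="number of concurrent URL submissions")
    parser.add_argument("--rate-limit-wait", dest="rate_limit_in_sec",
                        metavar="RATE_LIMIT_IN_SEC", type=int, default=5,
                        help="seconds to wait between page requests")
    return parser


# Parse the multi-source example from the Usage section.
args = build_parser().parse_args([
    "https://radiokeysmusic.com",
    "--sitemaps",
    "https://charles.uno/sitemap.xml",
    "https://alexgude.com/sitemaps.xml",
])
```

Here `args.urls` holds the single page URL and `args.sitemaps` holds both sitemap URIs, while unspecified options fall back to the documented defaults.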
Setting Up a Sitemap.xml for GitHub Pages
It is easy to automatically generate a sitemap for a GitHub Pages Jekyll site. Simply use jekyll/jekyll-sitemap.
Setup instructions can be found on the above site; they require changing just a single line of your site’s _config.yml.
Hashes for wayback-machine-archiver-1.9.1.tar.gz
Algorithm | Hash digest
---|---
SHA256 | bda20104ac7aa1be5318e221133913297acafc10fc9e4e532bd40022d3ce3fcc
MD5 | 81378a840200364e5d6f6e28baa7db72
BLAKE2b-256 | feea28bfaa458d332ea206ebb211aafe11160c9fe2f33d8303103791c2a68c00
Hashes for wayback_machine_archiver-1.9.1-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 18b0727f966e2502fb755fab65b7be8d8f5bcf520c7966910c258f48f7eca32a
MD5 | 4e62b1edd7356af12609ea6259abfe31
BLAKE2b-256 | 4442f443d57dcf87ee609d01695d4b8dc3031ccec1b7107ab3f4b5df46975a0c