A Python script to submit web pages to the Wayback Machine for archiving.
Wayback Machine Archiver (Archiver for short) is a command-line utility written in Python to back up GitHub Pages using the Internet Archive.
Installation
The best way to install Archiver is with pip:
pip install wayback-machine-archiver
This will give you access to the script simply by calling:
archiver --help
You can also clone this repository:
git clone https://github.com/agude/wayback-machine-archiver.git
cd wayback-machine-archiver
python ./wayback_machine_archiver/archiver.py --help
If you clone the repository, Archiver can be installed as a local application using the setup.py script:
git clone https://github.com/agude/wayback-machine-archiver.git
cd wayback-machine-archiver
./setup.py install
Like installing with pip, this gives you access to the script by calling archiver.
Archiver requires the requests library (https://github.com/kennethreitz/requests) by Kenneth Reitz. Archiver supports Python 2.7 and Python 3.4+.
Usage
The simplest way to schedule a backup is by specifying the URL of a web page, like so:
archiver https://alexgude.com
This will submit the main page of my blog, alexgude.com, to the Wayback Machine for archiving.
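For readers curious about what such a submission involves, here is a minimal sketch. It assumes the Wayback Machine's public save endpoint, which is the prefix https://web.archive.org/save/ followed by the page URL. The helper names format_archive_url and submit are hypothetical and are not taken from Archiver's source, and the sketch uses only the standard library rather than requests.

```python
import urllib.request

# Assumed endpoint: the Wayback Machine's public save-page URL prefix.
SAVE_PREFIX = "https://web.archive.org/save/"


def format_archive_url(url):
    """Build the save-request URL for a page (hypothetical helper)."""
    return SAVE_PREFIX + url


def submit(url, timeout=30):
    """Request archiving of one page and return the HTTP status code."""
    request = urllib.request.Request(
        format_archive_url(url),
        headers={"User-Agent": "archiver-sketch"},  # illustrative value
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    # Only prints the URL that would be requested; no network access here.
    print(format_archive_url("https://alexgude.com"))
```

Calling submit("https://alexgude.com") would perform the actual request; the main guard above deliberately avoids doing so.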
You can also archive all the URLs specified in a sitemap.xml (https://en.wikipedia.org/wiki/Sitemaps) as follows:
archiver --sitemaps https://alexgude.com/sitemap.xml
This will back up every page listed in the sitemap of my website, alexgude.com.
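To illustrate what "every page listed in the sitemap" means, the following sketch pulls the page URLs out of a sitemap document with the standard library. It assumes the sitemap follows the sitemaps.org schema, where each page is a url element containing a loc element. The function name extract_urls and the sample data are hypothetical, and this is not necessarily how Archiver parses sitemaps internally.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Made-up sample sitemap used only for demonstration.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about/</loc></url>
</urlset>"""


def extract_urls(sitemap_xml):
    """Return the text of every <loc> found inside a <url> entry."""
    root = ET.fromstring(sitemap_xml)
    locs = root.findall("sm:url/sm:loc", SITEMAP_NS)
    return [loc.text.strip() for loc in locs if loc.text]


if __name__ == "__main__":
    for page in extract_urls(SAMPLE):
        print(page)
```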
You can also pass a sitemap.xml file (requires the file:// prefix) to the archiver:
archiver --sitemaps file://sitemap.xml
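The file:// prefix lets a tool tell a local path apart from a remote URL. The sketch below shows one way that distinction could be made; it assumes the local path is whatever follows the prefix, which matches the relative-path example above. The names is_local_sitemap and load_sitemap are hypothetical and do not come from Archiver's source.

```python
import urllib.request

LOCAL_PREFIX = "file://"


def is_local_sitemap(uri):
    """Return True when the URI points at a local file."""
    return uri.startswith(LOCAL_PREFIX)


def load_sitemap(uri, timeout=30):
    """Return the raw bytes of a sitemap from disk or over HTTP(S)."""
    if is_local_sitemap(uri):
        # Assumption: everything after the prefix is the file path.
        path = uri[len(LOCAL_PREFIX):]
        with open(path, "rb") as handle:
            return handle.read()
    with urllib.request.urlopen(uri, timeout=timeout) as response:
        return response.read()
```

With this sketch, load_sitemap("file://sitemap.xml") reads sitemap.xml from the current directory, while an https:// URI is fetched over the network.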
You can back up multiple pages by specifying multiple URLs or sitemaps:
archiver https://radiokeysmusic.com --sitemaps https://charles.uno/sitemap.xml https://alexgude.com/sitemaps.xml
You can also back up multiple URLs by writing them to a file (for example, urls.txt), one URL per line, and passing that file to archiver:
archiver --file urls.txt
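The one-URL-per-line format is simple to read programmatically. The sketch below shows one possible reader; the function name read_urls is hypothetical, and skipping blank lines is an assumption of this sketch rather than documented Archiver behavior.

```python
def read_urls(path):
    """Return the URLs in a text file, one per line.

    Surrounding whitespace is stripped and blank lines are skipped
    (an assumption of this sketch).
    """
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]
```

A file containing https://alexgude.com on one line and https://charles.uno on the next would yield a two-item list in the same order.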
Sitemaps often exclude themselves, so you can request that the sitemap itself be backed up using the flag --archive-sitemap-also:
archiver --sitemaps https://alexgude.com/sitemaps.xml --archive-sitemap-also
Help
For a full list of command-line flags, use Archiver's built-in help, displayed with archiver --help:
usage: archiver [-h] [--version] [--file FILE]
                [--sitemaps SITEMAPS [SITEMAPS ...]]
                [--log {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
                [--log-to-file LOG_FILE] [--archive-sitemap-also]
                [--jobs JOBS] [--rate-limit-wait RATE_LIMIT_IN_SEC]
                [urls [urls ...]]

A script to backup a web pages with Internet Archive

positional arguments:
  urls                  the URLs of the pages to archive

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --file FILE           path to a file containing urls to save (one url per
                        line)
  --sitemaps SITEMAPS [SITEMAPS ...]
                        one or more URIs to sitemaps listing pages to
                        archive; local paths must be prefixed with 'file://'
  --log {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                        set the logging level, defaults to WARNING
  --log-to-file LOG_FILE
                        redirect logs to a file
  --archive-sitemap-also
                        also submit the URL of the sitemap to be archived
  --jobs JOBS, -j JOBS  run this many concurrent URL submissions, defaults to
                        1
  --rate-limit-wait RATE_LIMIT_IN_SEC
                        number of seconds to wait between page requests to
                        avoid flooding the archive site, defaults to 5; also
                        used as the backoff factor for retries
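The --jobs and --rate-limit-wait flags interact: the first controls how many submissions run at once, the second how long to pause between requests. The sketch below is one possible interpretation of that combination, not Archiver's actual scheduling logic. The function submit_all and its submit callback are hypothetical; the defaults mirror the values stated in the help text.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def submit_all(urls, submit, jobs=1, rate_limit_wait=5):
    """Call submit(url) for each URL using `jobs` worker threads.

    Each worker pauses `rate_limit_wait` seconds after a submission
    (an assumption about how the wait is applied). Results come back
    in the same order as `urls`.
    """
    def worker(url):
        result = submit(url)
        time.sleep(rate_limit_wait)
        return result

    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return list(pool.map(worker, urls))
```

Passing the submit step in as a callback keeps the sketch testable without network access; any function that accepts a URL can stand in for the real request.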
Setting Up a Sitemap.xml for GitHub Pages
It is easy to automatically generate a sitemap for a GitHub Pages Jekyll site. Simply use the jekyll/jekyll-sitemap plugin.
Setup instructions can be found in that project's documentation; they require changing just a single line of your site’s _config.yml.