Python WayBack for web archive replay and live web proxy
pywb is a Python implementation of web archive replay tools, sometimes also known as a ‘Wayback Machine’.
Additionally, pywb includes an extensive index query API for querying information about archived content.
The software can run as a traditional web application or an HTTP or HTTPS proxy server, and has been tested on Linux, OS X and Windows platforms.
With release 0.9.0, pywb provides a new, simplified, directory-based init system to create and run your own web archive replay system (wayback machine) directly from archive collections on disk.
A new utility, wb-manager, performs the most common collection management tasks from the command line.
If you do not have any web archive files (WARCs), you can easily create one from any page by using the free https://webrecorder.io/ service.
For example, you may visit https://webrecorder.io/record/http://example.com, then (after a few seconds) click Download -> Web Archive (WARC) to get the WARC file (.warc.gz).
Everything you have seen in your browser during the recording session was archived.
Each collection contains an arbitrary number of WARC files.
Once you have at least one WARC/ARC file, you can set up a quick collection as follows, including installing pywb:
    pip install pywb
    wb-manager init my_coll
    wb-manager add my_coll <path/to/warc>
    wayback
Point your browser to http://localhost:8080/my_coll/<url>/ where <url> is a URL you previously recorded into your WARC/ARC file. (If you just recorded http://example.com/, you should be able to view http://localhost:8080/my_coll/http://example.com/.)
If all worked well, you should see your archived version of <url>. Congrats, you are now running your own web archive!
Existing archives of WARC/ARC files can be used with pywb with a minimal amount of setup. When you use wb-manager add, WARC/ARC files are automatically placed in the collection archive directory and indexed.
If you have a large number of existing CDX index files, pywb can read them as well, without reindexing. It is recommended that any index files be converted to the latest JSON-based format, which can be done by running: wb-manager cdx-convert <path/to/cdx>
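As a rough illustration of what a JSON-based index line looks like, the sketch below converts one classic CDX line into a key, a timestamp, and a JSON object. It is not pywb's converter: the 11-field column order, the JSON field names, and the helper name cdx_line_to_json_line are assumptions made for this example, and the output of wb-manager cdx-convert may differ in detail.

```python
import json

# Assumed column order of an 11-field CDX line (header "CDX N b a m s k r M S V g").
CDX11_FIELDS = ("urlkey", "timestamp", "url", "mime", "status", "digest",
                "redirect", "metatags", "length", "offset", "filename")


def cdx_line_to_json_line(line):
    """Turn one space-separated CDX line into 'urlkey timestamp {json}'."""
    parts = line.split()
    if len(parts) != len(CDX11_FIELDS):
        raise ValueError("expected %d fields, got %d" % (len(CDX11_FIELDS), len(parts)))
    record = dict(zip(CDX11_FIELDS, parts))
    # The sort key and timestamp stay outside the JSON so lines remain sortable as text.
    urlkey = record.pop("urlkey")
    timestamp = record.pop("timestamp")
    # Fields holding the "-" placeholder carry no information, so they are dropped.
    payload = {name: value for name, value in record.items() if value != "-"}
    return "%s %s %s" % (urlkey, timestamp, json.dumps(payload))
```

For real collections, wb-manager cdx-convert remains the supported way to convert index files; the sketch only shows the general shape of the transformation.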
To set up a collection with existing ARC/WARCs and CDX index files, you can:
This will fully migrate your archive and index the collection. Any new WARCs added with wb-manager add will be indexed and added to the existing collection. You may use the auto-indexing features (explained below) to add new content to the existing collection.
Legacy installation instructions contain additional information and testing examples, and use a custom config.yaml file. These instructions are from previous releases but are still compatible with pywb 0.9.x.
pywb makes it easy to customize most aspects of the UI around archived content, including a custom banner insert, query calendar, search and home pages, via HTML Jinja2 templates.
You can see a list of all available UI templates by running: wb-manager template --list
To copy a default template to the file system (for modification), you can run wb-manager template --add <template_name> <collection>
pywb now supports custom user metadata for each collection. The metadata may be specified in the metadata.yaml file in each collection’s directory.
The metadata is accessible to all UI templates and may be displayed to the user as needed.
pywb now also includes support for automatic indexing of any web archive files (WARC or ARC).
Whenever a WARC/ARC file is added or changed, pywb will update the internal index automatically and make the archived content instantly available for replay, without manual intervention or restart. (Of course, indexing will take some time if adding many gigabytes of data all at once, but is quite useful for smaller archive updates).
To enable auto-indexing, start wayback with the -a flag on the command line, or run wb-manager autoindex <path/to/coll> as a separate program.
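Conceptually, auto-indexing means noticing which archive files are new or have changed since the last check. The sketch below shows one simple way to do that by comparing directory snapshots. It is an illustration only, not how pywb implements the feature; the function names and the suffix list are hypothetical.

```python
import os

# Hypothetical list of file suffixes treated as web archive files.
ARCHIVE_SUFFIXES = (".warc", ".warc.gz", ".arc", ".arc.gz")


def snapshot(archive_dir):
    """Map each archive file name in archive_dir to its (mtime_ns, size)."""
    state = {}
    with os.scandir(archive_dir) as entries:
        for entry in entries:
            if entry.is_file() and entry.name.lower().endswith(ARCHIVE_SUFFIXES):
                info = entry.stat()
                state[entry.name] = (info.st_mtime_ns, info.st_size)
    return state


def changed_files(previous, current):
    """Return the sorted names that are new or whose (mtime, size) differs."""
    return sorted(name for name, signature in current.items()
                  if previous.get(name) != signature)
```

A watcher built on this idea would take a snapshot at a fixed interval, pass the names returned by changed_files to an indexer, and keep the newest snapshot for the next comparison.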
To run with the bundled sample and test suite, you’ll need to clone pywb locally:
To run tests on your system, you may run python setup.py test
(The HTTPS proxy tests require the optional certauth package and are skipped if the package is not installed)
There is now also a downloadable point-and-click Web Archive Player which provides a native OS X and Windows desktop client application for browsing web archives, built using pywb.
You can use this tool to quickly check the contents of any WARC or ARC file through a simple point-and-click GUI, with no command-line tools needed.
In addition to the standard Wayback Machine, the pywb tool suite includes a number of useful command-line and web server tools. The tools should be available to use after installing with pip install pywb:
See CHANGES.rst for an up-to-date changelist.
In addition to replaying archived web content, pywb can serve as a rewriting proxy to the live web. This allows pywb to serve live content and to inject customized code into any web page on the fly, which opens up a variety of use cases beyond archive replay.
For example, the pywb-webrecorder project demonstrates a way to use pywb live web rewriting together with a recording proxy (warcprox) to record content while browsing.
The via.hypothes.is project provides an example of using pywb to inject annotations into any live web page.
pywb can also be used as an actual HTTP and/or HTTPS proxy server. See pywb Proxy Mode Usage for more details on configuring proxy mode.
To run as an HTTPS proxy server, pywb uses the certauth tool for generating a custom self-signed root certificate, which can be used to replay HTTPS content from the archive. (The certificate should be used with caution within a controlled setting).
Using these features requires an extra dependency: install certauth with pip install certauth. (This will also install the pyOpenSSL package, which is used to handle the SSL functionality.)
When running in proxy mode, the current collection and current timestamp are not included in the page URL and need to be set separately. pywb provides several options for ‘resolving’ the collection and timestamp:
For more info, see Proxy Mode Usage.
The pywb-proxy-demo project also contains a working configuration of proxy mode deployment.
The command-line wayback utility starts pywb using the wsgiref server from the Python standard library. This should be sufficient for basic usage and testing, but is not recommended for production. In the future, a different default option will be provided.
Since pywb conforms to the Python WSGI specification, it can be run with any standard WSGI container/server and can be embedded in larger applications.
When running with a different container, specify pywb.apps.wayback as the WSGI application module.
For production deployments, uWSGI with gevent is the recommended container. The uwsgi.ini and run-uwsgi.sh scripts in this repo provide examples of running pywb with uWSGI.
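To make the WSGI point concrete, here is a minimal sketch of hosting a WSGI callable with the standard library's reference server, plus a helper that calls the app in-process without opening a socket. The placeholder_app function is a hypothetical stand-in so the example runs without pywb installed. In a real setup you would pass the application object exposed by the pywb.apps.wayback module instead; its exact attribute name is not stated here, so check your installed version.

```python
from wsgiref.simple_server import make_server
from wsgiref.util import setup_testing_defaults


def placeholder_app(environ, start_response):
    """Hypothetical stand-in for the pywb WSGI application."""
    body = b"replay would be served here\n"
    start_response("200 OK", [("Content-Type", "text/plain"),
                              ("Content-Length", str(len(body)))])
    return [body]


def call_app(app, path="/"):
    """Invoke a WSGI app once, in-process, and return (status, body)."""
    environ = {"PATH_INFO": path}
    setup_testing_defaults(environ)  # fills in the remaining required WSGI keys
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status

    body = b"".join(app(environ, start_response))
    return captured["status"], body


def serve(app, port=8080):
    """Serve any WSGI app with wsgiref; suitable for testing, not production."""
    with make_server("", port, app) as httpd:
        httpd.serve_forever()
```

Calling serve(placeholder_app) blocks and listens on port 8080. Because the server only depends on the WSGI interface, swapping in a production container such as uWSGI requires no change to the application itself.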
pywb is compatible with the standard Wayback Machine URL format, which was developed by the Internet Archive:
Replay: http://<host>/<collection>/<timestamp>/<original url>
Query Listing: http://<host>/<collection>/*/<original url>
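The two URL patterns above can be assembled with a couple of small helpers. This is only a sketch: the function names are hypothetical, and the 14-digit YYYYMMDDhhmmss timestamp is the conventional Wayback form, assumed here rather than stated in this document. When no timestamp is given, the replay URL falls back to the shorter http://<host>/<collection>/<original url> form used in the quick-start example.

```python
from datetime import datetime
from typing import Optional


def replay_url(host: str, collection: str, original_url: str,
               timestamp: Optional[datetime] = None) -> str:
    """Build a replay URL, omitting the timestamp segment when none is given."""
    base = "http://%s/%s" % (host, collection)
    if timestamp is None:
        return "%s/%s" % (base, original_url)
    # Assumed convention: 14-digit timestamp, YYYYMMDDhhmmss.
    return "%s/%s/%s" % (base, timestamp.strftime("%Y%m%d%H%M%S"), original_url)


def query_url(host: str, collection: str, original_url: str) -> str:
    """Build a query-listing URL; '*' takes the place of the timestamp."""
    return "http://%s/%s/*/%s" % (host, collection, original_url)
```

For example, replay_url("localhost:8080", "my_coll", "http://example.com/") gives the same address as the quick-start instructions.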
Users are encouraged to fork and contribute to this project to improve any and all aspects of web archival replay and web proxy services.
Please take a look at the list of current issues and feel free to open new ones.