Python WayBack Machine for web archive replay

** Note: version 0.2.2 has been re-versioned as 0.3.0 to reflect the number of changes. Future releases will be on the 0.3.x line. **

pywb is a Python implementation of web archival replay tools, sometimes also known as a ‘Wayback Machine’.

pywb allows high-fidelity replay (browsing) of archived web data stored in the standardized ARC and WARC formats.

Latest Changes

See CHANGES.rst for the up-to-date change list.

Quick Install & Run Samples

  1. git clone https://github.com/ikreymer/pywb.git

  2. python setup.py install

  3. Run wayback to serve the samples

  4. Browse to http://localhost:8080/pywb/*/example.com to see the capture of http://example.com

(The installation page contains additional installation and testing examples.)

Configure with Archived Content

If you have existing WARC or ARC files (.warc, .warc.gz, .arc, .arc.gz), you should be able to view their contents in pywb after creating sorted .cdx index files of their contents. This process can be done by running the cdx-indexer script and only needs to be done once.

(See the note below if you already have .cdx files for your archives)

Given an archive of WARCs at myarchive/warcs:

  1. Create a dir for indexes, e.g. myarchive/cdx

  2. Run cdx-indexer --sort myarchive/cdx myarchive/warcs to generate .cdx files for each warc/arc file in myarchive/warcs

  3. Edit config.yaml to contain the following. You may replace pywb with a name of your choice – it will be the path to your collection. (Multiple collections can be added for different sets of .cdx files as well)

collections:
   pywb: ./myarchive/cdx/

archive_paths: ./myarchive/warcs/

  4. Run wayback to start a session. If your archives contain http://my-archive-page.example.com, all captures should be accessible by browsing to http://localhost:8080/pywb/*/my-archive-page.example.com

    (You can also use run-uwsgi.sh or run-gunicorn.sh to launch using those WSGI containers)

See INSTALL.rst for additional installation info.
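
The steps above can be scripted. The following is a minimal sketch, not part of pywb: the helper names indexer_command and render_config are hypothetical, and the paths assume the myarchive layout from the steps above. It only builds the cdx-indexer command line from step 2 and the config.yaml text from step 3; running the command and writing the file are left to the caller.

```python
def indexer_command(cdx_dir, warc_dir):
    """Return the argv list for the cdx-indexer call shown in step 2."""
    return ["cdx-indexer", "--sort", cdx_dir, warc_dir]


def render_config(collection, cdx_dir, warc_dir):
    """Return config.yaml text with the two keys shown in step 3."""
    return (
        "collections:\n"
        "   {0}: {1}\n"
        "\n"
        "archive_paths: {2}\n"
    ).format(collection, cdx_dir, warc_dir)


# Hypothetical example values; adjust to your own archive layout.
command = indexer_command("myarchive/cdx", "myarchive/warcs")
config_text = render_config("pywb", "./myarchive/cdx/", "./myarchive/warcs/")

print(" ".join(command))
print(config_text)
```

The command list can be passed to subprocess.check_call once cdx-indexer is installed, and config_text can be written to config.yaml in the directory where wayback is started.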

Use existing .cdx index files

If you already have .cdx files for your archive, you can skip the first two steps above.

pywb recommends using SURT (Sort-friendly URI Reordering Transform) sorted urls, and the cdx-indexer automatically generates indexes in this format.

However, pywb is also compatible with indexes keyed on regular urls. If you would like to use non-SURT ordered .cdx files, simply add this field to the config:

surt_ordered: false
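
To illustrate why SURT keys sort well, here is a simplified sketch of the transform. It is an assumption-laden approximation, not pywb's canonicalizer: the function name simple_surt is hypothetical, and real SURT canonicalization applies further normalization that is omitted here. The idea shown is only the core one: the host labels are reversed so that captures from the same domain sort next to each other.

```python
from urllib.parse import urlsplit


def simple_surt(url):
    """Return a simplified SURT-style key: reversed host labels, then path."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # "www.example.com" becomes "com,example,www"
    reversed_host = ",".join(reversed(host.split(".")))
    path = parts.path or "/"
    query = "?" + parts.query if parts.query else ""
    return reversed_host + ")" + path + query


urls = [
    "http://b.example.com/",
    "http://a.example.org/",
    "http://a.example.com/",
]

# Sorting by the key groups the two example.com hosts together.
for key in sorted(simple_surt(u) for u in urls):
    print(key)
```

With plain url keys, a.example.org would sort between the two example.com hosts; with the reversed-host key it sorts after both.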

About Wayback Machine

pywb is compatible with the standard Wayback Machine url format:

http://<host>/<collection>/<timestamp>/<original url>

Some examples of this url from other wayback machines (not implemented via pywb):

http://web.archive.org/web/20140312103519/http://www.example.com

http://www.webarchive.org.uk/wayback/archive/20100513010014/http://www.example.com/

A listing of archived content, often in calendar form, is available when a * is used instead of timestamp.
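
The url format can be taken apart and rebuilt with a few lines of Python. This is an illustrative sketch only: the helper names are hypothetical, the parser assumes a single-segment collection name (so it fits the first example above but not the two-segment wayback/archive prefix of the second), and the timestamp is assumed to be the 14-digit YYYYMMDDhhmmss form seen in the examples.

```python
from datetime import datetime


def replay_url(host, collection, timestamp, original):
    """Build http://<host>/<collection>/<timestamp>/<original url>."""
    return "http://{0}/{1}/{2}/{3}".format(host, collection, timestamp, original)


def parse_replay_url(url):
    """Split a replay url into (host, collection, timestamp, original url)."""
    _scheme, _sep, rest = url.partition("://")
    # maxsplit=3 keeps the slashes inside the original url intact
    host, collection, timestamp, original = rest.split("/", 3)
    return host, collection, timestamp, original


def capture_time(timestamp):
    """Convert a 14-digit timestamp to a datetime."""
    return datetime.strptime(timestamp, "%Y%m%d%H%M%S")


parsed = parse_replay_url(
    "http://web.archive.org/web/20140312103519/http://www.example.com"
)
print(parsed)
print(capture_time(parsed[2]))

# Using * in place of the timestamp gives the listing url.
print(replay_url("localhost:8080", "pywb", "*", "example.com"))
```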

The Wayback Machine often uses an HTML parser to rewrite relative and absolute links, as well as absolute links found in JavaScript, CSS and some XML.

pywb provides these features as a starting point.
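
As a rough illustration of what link rewriting means, the sketch below resolves a link against the url of the archived page and prefixes it with the replay path, so that following the link stays inside the archive. This is not pywb's rewriter: the function name rewrite_link and the prefix value are hypothetical, and a real rewriter also handles the JavaScript, CSS and XML cases.

```python
from urllib.parse import urljoin

# Link forms that should be left untouched (not an exhaustive list).
SKIP_PREFIXES = ("#", "javascript:", "mailto:", "data:")


def rewrite_link(link, page_url, replay_prefix):
    """Return the link rewritten to point back into the archive."""
    if link.startswith(SKIP_PREFIXES):
        return link
    # Relative links are resolved against the original page url;
    # absolute links pass through urljoin unchanged.
    absolute = urljoin(page_url, link)
    return replay_prefix + absolute


# Hypothetical values: a local pywb collection and a capture timestamp.
prefix = "http://localhost:8080/pywb/20140312103519/"
page = "http://example.com/dir/page.html"

for link in ["../img.png", "http://other.example.org/x", "#top"]:
    print(rewrite_link(link, page, prefix))
```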

Additional Documentation

  • For additional/up-to-date configuration details, consult the current config.yaml

  • The wiki will have additional technical documentation about various aspects of pywb

Contributions

You are encouraged to fork and contribute to this project to improve web archiving replay!

Please take a look at the list of current issues and feel free to open new ones.

