InterPlanetary Wayback (ipwb): Web Archive integration with IPFS

InterPlanetary Wayback (ipwb)

Peer-To-Peer Permanence of Web Archives

InterPlanetary Wayback (ipwb) facilitates permanence and collaboration in web archives by disseminating the contents of WARC files into the IPFS network. IPFS is a peer-to-peer content-addressable file system that inherently allows deduplication and facilitates opt-in replication. ipwb splits the header and payload of WARC response records before disseminating them into IPFS to leverage this deduplication, builds a CDXJ index with references to the IPFS hashes returned, and combines the header and payload from IPFS at the time of replay.

InterPlanetary Wayback primarily consists of two scripts:

  • ipwb/indexer.py - archival indexing script that takes the path to a WARC input, extracts the HTTP headers, the HTTP payload (response body), and the relevant parts of the WARC response record header from that WARC, and creates byte string representations. The indexer then pushes the byte strings into IPFS using a locally running ipfs daemon and creates a CDXJ file with this metadata for replay.py.

  • ipwb/replay.py - rudimentary replay script to resolve requests for archival content contained in IPFS for replay in the browser.
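
The split-and-recombine idea behind the two scripts can be illustrated with a minimal sketch. It is not ipwb's code: ToyContentStore is a hypothetical in-memory stand-in for IPFS that keys content by its SHA-256 digest, and split_http_response and reassemble are illustrative helper names. It shows why storing headers and payloads separately lets two captures with identical bodies share one stored payload.

```python
import hashlib
from typing import Dict, List, Tuple

SEPARATOR = b"\r\n\r\n"  # blank line between HTTP headers and body


def split_http_response(raw: bytes) -> Tuple[bytes, bytes]:
    """Split raw HTTP response bytes into (header block, payload)."""
    header, sep, payload = raw.partition(SEPARATOR)
    if not sep:
        raise ValueError("no blank line separating headers from payload")
    return header, payload


class ToyContentStore:
    """Hypothetical stand-in for IPFS: content is keyed by its own hash."""

    def __init__(self) -> None:
        self.blocks: Dict[str, bytes] = {}

    def add(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self.blocks[key] = data  # identical content maps to the same key
        return key

    def cat(self, key: str) -> bytes:
        return self.blocks[key]


def reassemble(store: ToyContentStore, header_key: str, payload_key: str) -> bytes:
    """Recombine header and payload at replay time."""
    return store.cat(header_key) + SEPARATOR + store.cat(payload_key)


# Two made-up captures: different Date headers, identical body.
capture_1 = b"HTTP/1.1 200 OK\r\nDate: Mon, 01 Feb 2016 00:00:00 GMT\r\n\r\n<html>hello</html>"
capture_2 = b"HTTP/1.1 200 OK\r\nDate: Tue, 02 Feb 2016 00:00:00 GMT\r\n\r\n<html>hello</html>"

store = ToyContentStore()
refs: List[Tuple[str, str]] = []
for raw in (capture_1, capture_2):
    header, payload = split_http_response(raw)
    refs.append((store.add(header), store.add(payload)))

print(len(store.blocks), "blocks stored for 2 captures")
```

Because only the headers differ between the two captures, the store ends up holding two header blocks and a single shared payload block. Had each response been stored whole, the differing Date header would have produced two distinct objects and no deduplication.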

A pictorial representation of the ipwb indexing and replay process:

https://raw.githubusercontent.com/oduwsdl/ipwb/master/docs/diagram_72.png

Installing

The latest release of ipwb can be installed using pip:

$ pip install ipwb

The latest development version containing changes not yet released can be installed from source:

$ git clone https://github.com/oduwsdl/ipwb
$ cd ipwb
$ pip install -r requirements.txt
$ pip install ./

Setup

The InterPlanetary Filesystem (ipfs) daemon must be installed and running before starting ipwb. See the Install IPFS page to accomplish this. In the future, we hope to make this more automated. Once ipfs is installed, start the daemon:

$ ipfs daemon

If the default API port of 5001 conflicts with another service when you start the daemon, run the following before launching the daemon to change the API port to one of your choosing (here, 5002):

$ ipfs config Addresses.API /ip4/127.0.0.1/tcp/5002
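
Before changing the port, it can help to confirm which ports are actually taken. The following is a small sketch using only Python's standard library; port_is_free and api_multiaddr are hypothetical helper names, not part of ipwb or ipfs. It probes a local TCP port and formats an address string in the same /ip4/.../tcp/... form used in the command above.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if nothing accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.settimeout(1.0)
        # connect_ex returns 0 when a listener accepted the connection.
        return probe.connect_ex((host, port)) != 0


def api_multiaddr(port: int, host: str = "127.0.0.1") -> str:
    """Format the value passed to `ipfs config Addresses.API`."""
    return f"/ip4/{host}/tcp/{port}"


if __name__ == "__main__":
    for candidate in (5001, 5002):
        state = "free" if port_is_free(candidate) else "in use"
        print(candidate, state, api_multiaddr(candidate))
```

A port reported as free here can still be claimed by another process before the daemon starts, so treat the result as a hint rather than a guarantee.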

Indexing

In a separate terminal session (or the same if you started the daemon in the background), instruct ipwb to push a WARC into IPFS:

$ ipwb index (path to warc or warc.gz)

…for example, from the root of the ipwb repository:

$ ipwb index ipwb/samples/warcs/sample-1.warc.gz

indexer.py, the default script called by the ipwb binary, partitions the WARC into WARC records and extracts the WARC response headers, HTTP response headers, and HTTP response body (payload). Relevant information is extracted from the WARC response headers, temporary byte strings are created for the HTTP response headers and payload, and these two byte strings are pushed into IPFS. The resulting CDXJ data is written to stdout by default but can be redirected to a file, e.g.,

$ ipwb index (path to warc or warc.gz) >> myArchiveIndex.cdxj
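
To give a feel for what an index entry carries, here is a minimal sketch of writing and reading a CDXJ-style line: a lookup key, a 14-digit timestamp, and a JSON object. The JSON field names (header_hash, payload_hash), the hand-written key, and the hash values are placeholders chosen for illustration; the exact fields ipwb emits may differ, so inspect the output of ipwb index for the real layout.

```python
import json
from typing import Dict, Tuple


def cdxj_line(key: str, timestamp: str, header_hash: str, payload_hash: str) -> str:
    """Build one CDXJ-style line: '<key> <timestamp> <json>'."""
    fields = {
        "header_hash": header_hash,    # hypothetical field name
        "payload_hash": payload_hash,  # hypothetical field name
    }
    return f"{key} {timestamp} {json.dumps(fields, sort_keys=True)}"


def parse_cdxj_line(line: str) -> Tuple[str, str, Dict[str, str]]:
    """Split a CDXJ-style line back into key, timestamp, and JSON fields."""
    key, timestamp, blob = line.rstrip("\n").split(" ", 2)
    return key, timestamp, json.loads(blob)


# Placeholder values; real entries hold hashes returned by the ipfs daemon.
line = cdxj_line(
    key="edu,odu,cs)/~salam/",
    timestamp="20160301000000",
    header_hash="EXAMPLE_HEADER_HASH",
    payload_hash="EXAMPLE_PAYLOAD_HASH",
)
print(line)
key, timestamp, fields = parse_cdxj_line(line)
```

Splitting on the first two spaces only keeps any spaces inside the JSON object intact, which is why the parser passes a maximum split count of 2.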

Replaying

An archival replay system is also included with ipwb to re-experience the content disseminated to IPFS. The replay system can be launched with:

$ ipwb replay

Once started, the replay system’s web interface can be accessed through a web browser, e.g., http://127.0.0.1:5000/http://www.cs.odu.edu/~salam/ with the sample CDXJ file.
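
The example address above has the original URI appended to the replay system's own address. As a sketch of that convention only (replay_url and archived_uri are hypothetical helpers, and the default host and port are simply taken from the example), the mapping in both directions can be written as:

```python
from urllib.parse import urlsplit


def replay_url(original_uri: str, host: str = "127.0.0.1", port: int = 5000) -> str:
    """Prefix an archived URI with the local replay system's address."""
    return f"http://{host}:{port}/{original_uri}"


def archived_uri(replay: str) -> str:
    """Recover the archived URI from a replay URL."""
    parts = urlsplit(replay)
    uri = parts.path.lstrip("/")  # path holds everything after host:port
    if parts.query:
        uri += "?" + parts.query  # restore a query string, if any
    return uri


url = replay_url("http://www.cs.odu.edu/~salam/")
print(url)
print(archived_uri(url))
```

This only manipulates strings; whether a given URI can be replayed still depends on it being present in the CDXJ index the replay system is using.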

Debugging

The ipwb indexing and replay system can also be run from source using a virtualenv and calling the indexer.py and replay.py scripts in the module's ipwb directory directly from the project's root.

Project History

This repo contains the code for integrating WARCs and IPFS as developed at the Archives Unleashed Hackathon in Toronto, Canada in March 2016. The project was also presented at:

License

MIT
