
InterPlanetary Wayback (ipwb): Web Archive integration with IPFS

Project description

This repo contains the code for the initial integration between WARCs and IPFS as developed at the Archives Unleashed Hackathon in Toronto, Canada in March 2016.

Two main components exist in the prototype:

  • ipwbindexer.py - takes the path to a WARC file as input and extracts the HTTP headers, the HTTP payload (response body), and the relevant parts of the WARC response header from each record. It writes the HTTP headers and payload to temp files, pushes those temp files into IPFS using a locally running ipfs daemon, and creates a CDXJ file with the resulting metadata for replay.py.

  • replay.py - a very rudimentary replay script to resolve fetches for IPFS-content for on-demand replay in the browser. Plagued with zombies. A placeholder until we get more familiar with modifying the pywb codebase for a truer replay system.
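To make the hand-off between the two components concrete, here is a minimal sketch of what one CDXJ line might look like: a sortable URL key, a 14-digit timestamp, and a JSON block pointing at the two IPFS objects. The function names, the JSON field names (`header_digest`, `payload_digest`), and the simplified key format are assumptions for illustration only and are not taken from the ipwb code.

```python
import json
from urllib.parse import urlsplit


def simple_surt(url):
    """Return a simplified SURT-style key: reversed host labels, then the path.

    This is a rough approximation for illustration, not a full SURT implementation.
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    reversed_host = ",".join(reversed(host.split(".")))
    return reversed_host + ")" + (parts.path or "/")


def make_cdxj_line(url, timestamp, header_hash, payload_hash):
    """Build one CDXJ line: sortable key, timestamp, JSON metadata."""
    metadata = {
        "header_digest": header_hash,    # hypothetical field name
        "payload_digest": payload_hash,  # hypothetical field name
    }
    return "{} {} {}".format(
        simple_surt(url), timestamp, json.dumps(metadata, sort_keys=True)
    )
```

A replay script could then look up a line by key and timestamp, read the two hashes from the JSON, and fetch the headers and payload from IPFS.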

Running

Before running the code, ipfs must be installed. See the Install IPFS page to accomplish this. In the future, we hope to make this more automated. Once ipfs is installed, start the daemon:

ipfs daemon
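Before indexing, it can help to confirm that the daemon is actually reachable. The sketch below assumes the daemon exposes its HTTP API at the default address 127.0.0.1:5001 and answers a POST to /api/v0/version; the helper name `daemon_is_running` is hypothetical, and the address should be adjusted if your configuration differs.

```python
import urllib.error
import urllib.request


def daemon_is_running(api_base="http://127.0.0.1:5001", timeout=2.0):
    """Return True if an IPFS HTTP API answers at api_base, else False."""
    # Assumed endpoint: the version call of the daemon's HTTP API.
    request = urllib.request.Request(
        api_base + "/api/v0/version", data=b"", method="POST"
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False
```

Calling `daemon_is_running()` with no arguments checks the default address; a False result usually means the daemon has not been started yet.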

Indexing

In a separate terminal session (or the same if you started the daemon in the background), instruct ipwb to push a WARC into IPFS:

./ipwbindexer.py (path to warc or warc.gz)

…for example, from the root of the ipwb repository:

./ipwbindexer.py samples/warcs/sample-1.warc.gz

ipwbindexer.py partitions the WARC into WARC records and, for each record, extracts the WARC response headers, the HTTP response headers, and the HTTP response body (payload). Relevant information is extracted from the WARC response headers, temp files are created for the HTTP response headers and payload, and these two temp files are pushed into IPFS.
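The per-record steps described above can be sketched roughly as follows. This is not the ipwbindexer.py code: the function names are hypothetical, the record is assumed to be available as raw HTTP response bytes, and the push step assumes the `ipfs` command-line tool is on the PATH with a daemon running.

```python
import subprocess
import tempfile


def split_http_response(raw):
    """Split raw HTTP response bytes into (headers, payload) at the blank line."""
    head, _, body = raw.partition(b"\r\n\r\n")
    return head, body


def write_temp_file(data):
    """Write bytes to a temp file that persists after closing; return its path."""
    with tempfile.NamedTemporaryFile(delete=False) as handle:
        handle.write(data)
        return handle.name


def push_to_ipfs(path):
    """Add a file to IPFS via the CLI and return the hash it prints."""
    # "ipfs add -q" prints only the hash of the added file.
    result = subprocess.run(
        ["ipfs", "add", "-q", path], stdout=subprocess.PIPE, check=True
    )
    return result.stdout.decode().strip()
```

For one record, the flow would be to call `split_http_response`, pass each half to `write_temp_file`, and then call `push_to_ipfs` on both paths to obtain the two hashes that are recorded in the CDXJ file.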

Replaying

(TODO)


Download files

Source Distribution: ipwb-0.2016.5.10.1423.tar.gz (5.4 kB)
