Skip to main content

InterPlanetary Wayback (ipwb): Web Archive integration with IPFS

Project description

image

InterPlanetary Wayback (ipwb)

Peer-To-Peer Permanence of Web Archives

Build Status pypi codecov

InterPlanetary Wayback (ipwb) facilitates permanence and collaboration in web archives by disseminating the contents of WARC files into the IPFS network. IPFS is a peer-to-peer content-addressable file system that inherently allows deduplication and facilitates opt-in replication. ipwb splits the header and payload of WARC response records before disseminating into IPFS to leverage the deduplication, builds a CDXJ index with references to the IPFS hashes returned, and combines the header and payload from IPFS at the time of replay.

InterPlanetary Wayback primarily consists of two scripts:

  • ipwb/indexer.py - archival indexing script that takes the path to a WARC input, extracts the HTTP headers, HTTP payload (response body), and relevant parts of the WARC-response record header from the WARC specified and creates byte string representations. The indexer then pushes the byte strings into IPFS using a locally running IPFS daemon then creates a CDXJ file with this metadata for replay.py.
  • ipwb/replay.py - rudimentary replay script to resolve requests for archival content contained in IPFS for replay in the browser.

A pictorial representation of the ipwb indexing and replay process:

image

Installing

InterPlanetary Wayback requires Python 2.7+ though we are working on having it work on Python 3 as well (see #51).

The latest release of ipwb can be installed using pip:

$ pip install ipwb

The latest development version containing changes not yet released can be installed from source:

$ git clone https://github.com/oduwsdl/ipwb
$ cd ipwb
$ pip install -r requirements.txt
$ pip install ./

Setup

The InterPlanetary Filesystem (ipfs) daemon must be installed and running before starting ipwb. See the Install IPFS page to accomplish this. In the future, we hope to make this more automated. Once ipfs is installed, start the daemon:

$ ipfs daemon

If you encounter a conflict with the default API port of 5001 when starting the daemon, running the following prior to launching the daemon will change the API port to access to one of your choosing (here, shown to be 5002):

$ ipfs config Addresses.API /ip4/127.0.0.1/tcp/5002

Indexing

In a separate terminal session (or the same if you started the daemon in the background), instruct ipwb to push a WARC into IPFS:

$ ipwb index (path to warc or warc.gz)

...for example, from the root of the ipwb repository:

$ ipwb index ipwb/samples/warcs/salam-home.warc

The ipwb indexer parititions the WARC into WARC Records and extracts the WARC Response headers, HTTP response headers, and the HTTP response bodies (payloads). Relevant information is extracted from the WARC Response headers, temporary byte strings are created for the HTTP response headers and payload, and these two bytes strings are pushed into IPFS. The resulting CDXJ data is written to STDOUT by default but can be redirected to a file, e.g.,

$ ipwb index (path to warc or warc.gz) >> myArchiveIndex.cdxj

Replaying

An archival replay system is also included with ipwb to re-experience the content disseminated to IPFS . The replay system can be launched using the provided sample data with:

$ ipwb replay

A CDXJ index can also be provided and used by the ipwb replay system by specifying the path of the index file as a parameter to the replay system:

$ ipwb replay <path/to/cdxj>

ipwb also supports using an IPFS hash or any HTTP location as the source of the CDXJ:

$ ipwb replay http://myDomain/files/myIndex.cdxj
$ ipwb replay QmYwAPJzv5CZsnANOTaREALhashYgPpHdWEz79ojWnPbdG

Once started, the replay system's web interface can be accessed through a web browser, e.g., http://localhost:5000/ by default.

Using Docker

A pre-built Docker image is made available that can be run as following:

$ docker container run -it --rm -p 5000:5000 oduwsdl/ipwb

The container will run an IPFS daemon, index a sample WARC file, and replay it using the newly created index. It will take a few seconds to be ready, then the replay will be accessible at http://localhost:5000/ with a sample archived page.

To index and replay your own WARC file, bind mount your data folders inside the container using -v (or --volume) flag and run commands accordingly. The provided docker image has designated /data directory, inside which there are warc, cdxj, and ipfs folders where host folders can be mounted separately or as a single mount point at the parent /data directory. Assuming that the host machine has a /path/to/data folder under which there are warc, cdxj, and ipfs folders and a WARC file at /path/to/data/warc/custom.warc.gz.

$ docker container run -it --rm -v /path/to/data:/data oduwsdl/ipwb ipwb index -o /data/cdxj/custom.cdxj /data/warc/custom.warc.gz
$ docker container run -it --rm -v /path/to/data:/data -p 5000:5000 oduwsdl/ipwb ipwb replay /data/cdxj/custom.cdxj

If the host folder structure is something other than /some/path/{warc,cdxj,ipfs} then these volumes need to be mounted separately.

To build an image from the source, run the following command from the directory where the source code is checked out. The name of the locally built image could be anything, but we use oduwsdl/ipwb to be consistent with the above commands.

$ docker image build -t oduwsdl/ipwb .

Help

Usage of sub-commands in ipwb can be accessed through providing the -h or --help flag, like any of the below.

$ ipwb -h
usage: ipwb [-h] [-d DAEMON_ADDRESS] [-v] {index,replay} ...

InterPlanetary Wayback (ipwb)

optional arguments:
  -h, --help            show this help message and exit
  -d DAEMON_ADDRESS, --daemon DAEMON_ADDRESS
                        Location of ipfs daemon (default 127.0.0.1:5001)
  -v, --version         Report the version of ipwb

ipwb commands:
  Invoke using "ipwb <command>", e.g., ipwb replay

  {index,replay}
    index               Index a WARC file for replay in ipwb
    replay              Start the ipwb replay system
$ ipwb index -h
usage: ipwb [-h] [-e] [-c] [--compressFirst] [-o OUTFILE] [--debug]
            index <warcPath> [index <warcPath> ...]

Index a WARC file for replay in ipwb

positional arguments:
  index <warcPath>      Path to a WARC[.gz] file

optional arguments:
  -h, --help            show this help message and exit
  -e                    Encrypt WARC content prior to adding to IPFS
  -c                    Compress WARC content prior to adding to IPFS
  --compressFirst       Compress data before encryption, where applicable
  -o OUTFILE, --outfile OUTFILE
                        Path to an output CDXJ file, defaults to STDOUT
  --debug               Convenience flag to help with testing and debugging
$ ipwb replay -h
usage: ipwb replay [-h] [-P [<host:port>]] [index]

Start the ipwb relay system

positional arguments:
  index                 path, URI, or multihash of file to use for replay

optional arguments:
  -h, --help            show this help message and exit
  -P [<host:port>], --proxy [<host:port>]
                        Proxy URL

Project History

This repo contains the code for integrating WARCs and IPFS as developed at the Archives Unleashed: Web Archive Hackathon in Toronto, Canada in March 2016. The project was also presented at:

License

MIT

Project details


Release history Release notifications | RSS feed

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ipwb-0.2018.7.28.1813.tar.gz (3.9 MB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

ipwb-0.2018.7.28.1813-py2-none-any.whl (3.9 MB view details)

Uploaded Python 2

File details

Details for the file ipwb-0.2018.7.28.1813.tar.gz.

File metadata

  • Download URL: ipwb-0.2018.7.28.1813.tar.gz
  • Upload date:
  • Size: 3.9 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.11.0 pkginfo/1.4.2 requests/2.18.4 setuptools/39.2.0 requests-toolbelt/0.8.0 tqdm/4.14.0 CPython/2.7.15

File hashes

Hashes for ipwb-0.2018.7.28.1813.tar.gz
Algorithm Hash digest
SHA256 d9f59dba1b87ce20a5542ab27623595c4302ae8c6b1b1131f20271515cb19c8e
MD5 8555e2b2dde317af491a3dc769b25803
BLAKE2b-256 b6c29bf14f77a981f3554f415a68a26491f7d3d3cde25e78c83254bc31da52dc

See more details on using hashes here.

File details

Details for the file ipwb-0.2018.7.28.1813-py2-none-any.whl.

File metadata

  • Download URL: ipwb-0.2018.7.28.1813-py2-none-any.whl
  • Upload date:
  • Size: 3.9 MB
  • Tags: Python 2
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.11.0 pkginfo/1.4.2 requests/2.18.4 setuptools/39.2.0 requests-toolbelt/0.8.0 tqdm/4.14.0 CPython/2.7.15

File hashes

Hashes for ipwb-0.2018.7.28.1813-py2-none-any.whl
Algorithm Hash digest
SHA256 97153c5c47f4eabf4f0ad410f2413dba2f7623a5ed3b92f044208445d830fc8d
MD5 b3a8e13d1744e656ef3709c060665e6a
BLAKE2b-256 3992fc82223548440c8d33202f1107aab748aacb4435cd19e4453ec7c234b96e

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page