
InterPlanetary Wayback (ipwb): Web Archive integration with IPFS

Project Description

InterPlanetary Wayback (ipwb)

Peer-To-Peer Permanence of Web Archives

InterPlanetary Wayback (ipwb) facilitates permanence and collaboration in web archives by disseminating the contents of WARC files into the IPFS network. IPFS is a peer-to-peer content-addressable file system that inherently allows deduplication and facilitates opt-in replication. ipwb splits the header and payload of WARC response records before disseminating them into IPFS to take advantage of this deduplication, builds a CDXJ index with references to the IPFS hashes returned, and combines the header and payload from IPFS at the time of replay.
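
To illustrate why storing header and payload separately helps deduplication, here is a minimal Python sketch. It is not ipwb code: the `ContentStore` class is hypothetical, and SHA-256 digests stand in for IPFS hashes, which are computed differently. The two sample captures are made up; they share a payload and differ only in the Date header.

```python
import hashlib


class ContentStore:
    """Toy content-addressed store: the key is the SHA-256 digest of the value."""

    def __init__(self):
        self.blocks = {}

    def add(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        self.blocks[digest] = data  # identical data always maps to the same key
        return digest


# Two captures of the same page: the headers differ (Date), the payload does not.
payload = b"<html><body>Hello</body></html>"
header_a = b"HTTP/1.1 200 OK\r\nDate: Mon, 01 Feb 2016 10:00:00 GMT\r\n"
header_b = b"HTTP/1.1 200 OK\r\nDate: Tue, 02 Feb 2016 10:00:00 GMT\r\n"

# Storing each response as one blob: nothing is shared between captures.
whole = ContentStore()
whole.add(header_a + b"\r\n" + payload)
whole.add(header_b + b"\r\n" + payload)

# Storing header and payload separately: the payload is stored only once.
split = ContentStore()
refs = [(split.add(h), split.add(payload)) for h in (header_a, header_b)]

stored_whole = sum(len(b) for b in whole.blocks.values())
stored_split = sum(len(b) for b in split.blocks.values())
```

In the split store both captures reference the same payload key, so `stored_split` is smaller than `stored_whole`. The saving grows with payload size and with the number of captures that share it.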

InterPlanetary Wayback primarily consists of two scripts:

  • ipwb/indexer.py - archival indexing script that takes the path to a WARC file as input; extracts the HTTP headers, the HTTP payload (response body), and the relevant parts of the WARC response record header; and creates byte string representations of them. The indexer then pushes the byte strings into IPFS using a locally running ipfs daemon and creates a CDXJ file with this metadata for replay.py.
  • ipwb/replay.py - rudimentary replay script to resolve requests for archival content contained in IPFS for replay in the browser.
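
The header/payload separation performed at indexing time, and the recombination performed at replay time, can be sketched as follows. This is an illustration only, assuming the HTTP response is available as a single byte string; the function names are hypothetical and are not taken from indexer.py or replay.py.

```python
def split_http_response(raw: bytes):
    """Split a raw HTTP response into (header_bytes, payload_bytes).

    The header block ends at the first blank line (CRLF CRLF).
    """
    header, sep, payload = raw.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("no header/body separator found")
    return header, payload


def join_http_response(header: bytes, payload: bytes) -> bytes:
    """Inverse of split_http_response, as would be needed at replay time."""
    return header + b"\r\n\r\n" + payload


# Made-up response used only to exercise the two helpers.
raw = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>Hi</html>"
header, payload = split_http_response(raw)
restored = join_http_response(header, payload)
```

Splitting on the first blank line only means that a payload which itself contains blank lines is left untouched.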

A pictorial representation of the ipwb indexing and replay process:

Installing

The latest release of ipwb can be installed using pip:

$ pip install ipwb

The latest development version containing changes not yet released can be installed from source:

$ git clone https://github.com/oduwsdl/ipwb
$ cd ipwb
$ pip install -r requirements.txt
$ pip install ./

Setup

The InterPlanetary Filesystem (ipfs) daemon must be installed and running before starting ipwb. See the Install IPFS page to accomplish this. In the future, we hope to make this more automated. Once ipfs is installed, start the daemon:

$ ipfs daemon

If you encounter a conflict with the default API port of 5001 when starting the daemon, run the following before launching the daemon to change the API port to one of your choosing (5002 in this example):

$ ipfs config Addresses.API /ip4/127.0.0.1/tcp/5002
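
Before picking a replacement port, it can help to check which ports are already taken. The following Python sketch, which is not part of ipwb or ipfs, tests whether anything accepts TCP connections on a local port; the helper name `port_in_use` is hypothetical.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    for candidate in (5001, 5002):
        state = "in use" if port_in_use(candidate) else "free"
        print(f"port {candidate}: {state}")
```

Note that a running ipfs daemon on the default port will also be reported as "in use", so run the check before starting the daemon.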

Indexing

In a separate terminal session (or the same if you started the daemon in the background), instruct ipwb to push a WARC into IPFS:

$ ipwb index (path to warc or warc.gz)

…for example, from the root of the ipwb repository:

$ ipwb index ipwb/samples/warcs/salam-home.warc

indexer.py, the default script called by the ipwb binary, partitions the WARC into WARC records and extracts the WARC response headers, HTTP response headers, and HTTP response body (payload) from each. Relevant information is extracted from the WARC response headers, temporary byte strings are created for the HTTP response headers and payload, and these two byte strings are pushed into IPFS. The resulting CDXJ data is written to stdout by default but can be redirected to a file, e.g.,

$ ipwb index (path to warc or warc.gz) >> myArchiveIndex.cdxj
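
Once the index is in a file, its records can be read back with a few lines of Python. The sketch below assumes each record has the general CDXJ shape of a lookup key, a timestamp, and a JSON block on one line. The sample record, including its JSON field names and placeholder values, is made up for illustration and does not claim to match the exact fields ipwb writes.

```python
import json


def parse_cdxj_line(line: str):
    """Split one CDXJ-style line into (lookup key, timestamp, JSON fields)."""
    prefix, brace, rest = line.strip().partition("{")
    if not brace:
        raise ValueError("no JSON block found")
    key, timestamp = prefix.split()
    return key, timestamp, json.loads(brace + rest)


# Hypothetical record: the field names and values are placeholders.
sample = 'edu,odu,cs)/~salam 20160101000000 {"header": "<hash-1>", "payload": "<hash-2>"}'
key, timestamp, fields = parse_cdxj_line(sample)
```

Partitioning on the first opening brace keeps the JSON block intact even if it contains spaces.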

Replaying

An archival replay system is also included with ipwb to re-experience the content disseminated to IPFS. The replay system can be launched using the provided sample data with:

$ ipwb replay

A CDXJ index can also be provided and used by the ipwb replay system by specifying the path of the index file as a parameter to the replay system:

$ ipwb replay <path/to/cdxj>

ipwb also supports using an IPFS hash or any HTTP location as the source of the CDXJ:

$ ipwb replay http://myDomain/files/myIndex.cdxj
$ ipwb replay QmYwAPJzv5CZsnANOTaREALhashYgPpHdWEz79ojWnPbdG
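
One way to tell these three kinds of index source apart is sketched below in Python. This is an assumption about how such a check could be written, not a description of ipwb's own logic; the function name is hypothetical. It relies on the fact that the commonly used IPFS hashes are 46 characters long and begin with "Qm".

```python
from urllib.parse import urlparse


def classify_index_source(source: str) -> str:
    """Guess whether a CDXJ source is an HTTP URL, an IPFS hash, or a local path."""
    if urlparse(source).scheme in ("http", "https"):
        return "http"
    # Loose check: 46 alphanumeric characters starting with "Qm".
    if len(source) == 46 and source.startswith("Qm") and source.isalnum():
        return "ipfs"
    return "file"
```

The hash check is deliberately loose and does not validate the encoding, so a 46-character file name starting with "Qm" would be misclassified.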

Once started, the replay system’s web interface can be accessed through a web browser, e.g., http://127.0.0.1:5000/http://www.cs.odu.edu/~salam/ with the sample CDXJ file.

Help

Usage of sub-commands in ipwb can be displayed by providing the -h or --help flag, as in the examples below.

$ ipwb -h
usage: ipwb [-h] [-d DAEMON_ADDRESS] [-o OUTFILE] [-v] {index,replay} ...

InterPlanetary Wayback (ipwb)

optional arguments:
  -h, --help            show this help message and exit
  -d DAEMON_ADDRESS, --daemon DAEMON_ADDRESS
                        Location of ipfs daemon (default 127.0.0.1:5001)
  -o OUTFILE, --outfile OUTFILE
                        Filename of newly created CDXJ index file
  -v, --version         Report the version of ipwb


ipwb commands:
  Invoke using "ipwb <command>", e.g., ipwb replay

  {index,replay}
    index               Index a WARC file for replay in ipwb
    replay              Start the ipwb replay system
$ ipwb index -h
usage: ipwb [-h] [-e] index <warcPath>

Index a WARC file for replay in ipwb

positional arguments:
  index <warcPath>  Path to a WARC[.gz] file

optional arguments:
  -h, --help        show this help message and exit
  -e                Encrypt WARC content prior to disseminating to IPFS
$ ipwb replay -h
usage: ipwb replay [-h] [index]

positional arguments:
  index       CDXJ file to use for replay

optional arguments:
  -h, --help  show this help message and exit
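
For readers who want to script against this interface, the following Python sketch builds an argparse parser with the same flags and sub-commands as the help output above. It is a reconstruction from that output, not ipwb's actual source; the function name `build_parser` and the `encrypt` and `command` destination names are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the flags shown in the help output."""
    parser = argparse.ArgumentParser(
        prog="ipwb", description="InterPlanetary Wayback (ipwb)")
    parser.add_argument(
        "-d", "--daemon", metavar="DAEMON_ADDRESS", default="127.0.0.1:5001",
        help="Location of ipfs daemon (default 127.0.0.1:5001)")
    parser.add_argument(
        "-o", "--outfile", metavar="OUTFILE",
        help="Filename of newly created CDXJ index file")
    parser.add_argument(
        "-v", "--version", action="store_true",
        help="Report the version of ipwb")

    commands = parser.add_subparsers(dest="command", title="ipwb commands")

    index = commands.add_parser(
        "index", help="Index a WARC file for replay in ipwb")
    index.add_argument("warcPath", help="Path to a WARC[.gz] file")
    index.add_argument(
        "-e", dest="encrypt", action="store_true",
        help="Encrypt WARC content prior to disseminating to IPFS")

    replay = commands.add_parser("replay", help="Start the ipwb replay system")
    # The index argument is optional, matching "ipwb replay [-h] [index]".
    replay.add_argument("index", nargs="?", help="CDXJ file to use for replay")
    return parser
```

Calling `build_parser().parse_args(["replay"])` leaves `index` unset, which corresponds to launching replay with the provided sample data.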

Debugging

The ipwb indexing and replay system can also be run from source using a virtualenv by calling the indexer.py and replay.py scripts in the module's ipwb directory directly from the project's root.

Project History

This repo contains the code for integrating WARCs and IPFS as developed at the Archives Unleashed Hackathon in Toronto, Canada in March 2016. The project was also presented at:

License

MIT

Release History

  • 0.2017.6.23.1142 (this version)
  • 0.2017.6.15.114
  • 0.2017.6.14.2129
  • 0.2017.6.8.2056
  • 0.2017.6.8.2031
  • 0.2017.6.7.1721
  • 0.2017.6.7.1533
  • 0.2017.6.7.1527
  • 0.2017.6.6.2230
  • 0.2017.5.31.1427
  • 0.2017.5.31.1350
  • 0.2017.5.31.1322
  • 0.2017.5.22.1457
  • 0.2017.5.19.1553
  • 0.2017.5.18.1306
  • 0.2017.5.14.2222
  • 0.2017.5.13.2201
  • 0.2017.3.24.2307
  • 0.2017.3.13.2024
  • 0.2017.3.13.1653
  • 0.2017.3.13.1642
  • 0.2017.3.13.1602
  • 0.2017.3.13.1253
  • 0.2017.3.6.1829
  • 0.2017.3.6.1411
  • 0.2017.3.6.1347
  • 0.2017.3.4.1951
  • 0.2017.3.1.2245
  • 0.2017.2.24.53
  • 0.2017.2.23.146
  • 0.2017.2.18.2104
  • 0.2017.2.18.2033
  • 0.2017.2.18.2010
  • 0.2017.2.18.1853
  • 0.2017.2.15.1431
  • 0.2017.2.13.2214
  • 0.2017.2.8.2316
  • 0.2017.2.8.2239
  • 0.2017.2.8.1714
  • 0.2017.2.8.1626
  • 0.2017.2.7.1051
  • 0.2017.2.7.910
  • 0.2017.1.30.1719
  • 0.2017.1.30.1451
  • 0.2017.1.30.1427
  • 0.2017.1.30.1411
  • 0.2017.1.19.1415
  • 0.2017.1.19.1332
  • 0.2017.1.9.127
  • 0.2017.1.7.2059
  • 0.2017.1.7.2052
  • 0.2017.1.7.2037
  • 0.2017.1.5.1145
  • 0.2017.1.5.1114
  • 0.2017.1.5.23
  • 0.2017.1.4.1542
  • 0.2017.1.4.1538
  • 0.2017.1.4.1526
  • 0.2016.12.10.1711
  • 0.2016.12.10.1701
  • 0.2016.12.9.412
  • 0.2016.12.9.221
  • 0.2016.12.7.1631
  • 0.2016.12.7.1500
  • 0.2016.12.4.2336
  • 0.2016.12.4.2311
  • 0.2016.12.4.2257
  • 0.2016.12.4.2128
  • 0.2016.12.4.2114
  • 0.2016.12.4.2102
  • 0.2016.11.28.2153
  • 0.2016.11.28.1745
  • 0.2016.11.14.7
  • 0.2016.11.2.1321
  • 0.2016.11.2.1312
  • 0.2016.11.2.1307
  • 0.2016.10.13.1544
  • 0.2016.10.13.1533
  • 0.2016.10.12.2144
  • 0.2016.10.10.1649
  • 0.2016.9.28.1747
  • 0.2016.9.28.1741
  • 0.2016.9.28.1511
  • 0.2016.9.28.1453
  • 0.2016.9.28.1428
  • 0.2016.9.28.1424
  • 0.2016.9.19.1310
  • 0.2016.9.19.1225
  • 0.2016.9.19.1217
  • 0.2016.9.19.1140
  • 0.2016.9.19.1118
  • 0.2016.9.19.1048
  • 0.2016.9.14.2210
  • 0.2016.6.27.1409
  • 0.2016.6.27.1222
  • 0.2016.6.8.1651
  • 0.2016.6.8.1639
  • 0.2016.5.10.1615
  • 0.2016.5.10.1600
  • 0.2016.5.10.1513
  • 0.2016.5.10.1423
  • 0.2016.5.10.1416
  • 0.2016.5.10.1411
  • 0.2016.5.10.1407
  • 0.2016.5.10.1400
  • 0.2016.5.10.1334
  • 0.2016.5.10
  • 0.2016.5.9
  • 0.1

Download Files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

File Name                                 Size     Version   File Type   Upload Date
ipwb-0.2017.6.23.1142-py2-none-any.whl    3.9 MB   py2       Wheel       Jun 23, 2017
ipwb-0.2017.6.23.1142.tar.gz              3.9 MB   -         Source      Jun 23, 2017
