
Streaming WARC (and ARC) IO library

Background

This library provides a fast, standalone way to read and write the WARC format commonly used in web archives. It supports Python 2.7+ and Python 3.3+ (using six, the only external dependency).

Install with: pip install warcio

This library is a spin-off of the WARC reading and writing component of the pywb high-fidelity replay library, a key component of Webrecorder.

The library is designed for fast, low-level access to web archival content, oriented around a stream of WARC records rather than files.

Reading WARC Records

A key feature of the library is the ability to iterate over a stream of WARC records using the ArchiveIterator.

It includes the following features:

- Reading a WARC/ARC stream
- On the fly ARC to WARC record conversion
- Decompressing and de-chunking HTTP payload content stored in WARC/ARC files

For example, the following prints the URL for each WARC response record:

from warcio.archiveiterator import ArchiveIterator

with open('path/to/file', 'rb') as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == 'response':
            print(record.rec_headers.get_header('WARC-Target-URI'))

The stream object could be a file on disk or a remote network stream. The ArchiveIterator reads the WARC content in a single pass. Each record is represented by an ArcWarcRecord object, which contains the format (ARC or WARC), the record type, the record headers, the HTTP headers (if any), and a raw stream for reading the payload.

class ArcWarcRecord(object):
    def __init__(self, *args):
        (self.format, self.rec_type, self.rec_headers, self.raw_stream,
         self.http_headers, self.content_type, self.length) = args

Reading WARC Content

The raw_stream can be used to read the rest of the payload directly. A special ArcWarcRecord.content_stream() function provides a stream that automatically decompresses and de-chunks the HTTP payload, if it is compressed and/or transfer-encoding chunked.

ARC Files

The library provides support for reading (but not writing) ARC files. The ARC format is legacy, but it is important to support it in a consistent manner. The ArchiveIterator can iterate over ARC and WARC files alike, emitting ArcWarcRecord objects for both. The special arc2warc option converts ARC records to WARC records on the fly, allowing them to be accessed using the same API.

(Special WARCIterator and ARCIterator subclasses of ArchiveIterator are also available to read only WARC or only ARC files).

WARC and ARC Streaming

For example, here is a snippet for reading an ARC and a WARC using the same API.

The example streams a WARC and ARC file over HTTP using requests, printing the warcinfo record (or ARC header) and any response records (or all ARC records) that contain HTML:

import requests
from warcio.archiveiterator import ArchiveIterator

def print_records(url):
    resp = requests.get(url, stream=True)

    for record in ArchiveIterator(resp.raw, arc2warc=True):
        if record.rec_type == 'warcinfo':
            print(record.raw_stream.read())

        elif record.rec_type == 'response':
            if record.http_headers.get_header('Content-Type') == 'text/html':
                print(record.rec_headers.get_header('WARC-Target-URI'))
                print(record.content_stream().read())
                print('')

# WARC
print_records('https://archive.org/download/ExampleArcAndWarcFiles/IAH-20080430204825-00000-blackbook.warc.gz')


# ARC with arc2warc
print_records('https://archive.org/download/ExampleArcAndWarcFiles/IAH-20080430204825-00000-blackbook.arc.gz')

Writing WARC Records

The library provides a simple and extensible interface for writing WARC records conformant to the WARC 1.0 ISO standard (see draft).

The library comes with a basic WARCWriter class for writing to a single WARC file and BufferWARCWriter for writing to an in-memory buffer. The BaseWARCWriter can be extended to support more complex operations.

(There is no support for writing legacy ARC files.)

The following example loads http://example.com/, creates a WARC response record, and writes it, gzip compressed, to example.warc.gz. The block and payload digests are computed automatically.

from warcio.warcwriter import WARCWriter
from warcio.statusandheaders import StatusAndHeaders

import requests

with open('example.warc.gz', 'wb') as output:
    writer = WARCWriter(output, gzip=True)

    resp = requests.get('http://example.com/',
                        headers={'Accept-Encoding': 'identity'},
                        stream=True)

    # get raw headers from urllib3
    headers_list = resp.raw.headers.items()

    http_headers = StatusAndHeaders('200 OK', headers_list, protocol='HTTP/1.0')

    record = writer.create_warc_record('http://example.com/', 'response',
                                        payload=resp.raw,
                                        http_headers=http_headers)

    writer.write_record(record)

The library also includes additional semantics for:

- Creating warcinfo and revisit records
- Writing response and request records together
- Writing custom WARC records
- Reading a full WARC record from a stream

Please refer to warcwriter.py and test/test_writer.py for additional examples.

WARCIO CLI: Indexing and Recompression

The library currently ships with two simple command line tools.

Index

The warcio index command prints a simple index of the records in a WARC file as newline-delimited JSON (NDJSON).

WARC header fields to include in the index can be specified via the -f flag, and are included in the JSON block (in order, for convenience).

warcio index ./test/data/example-iana.org-chunked.warc -f warc-type,warc-target-uri,content-length
{"warc-type": "warcinfo", "content-length": "137"}
{"warc-type": "response", "warc-target-uri": "http://www.iana.org/", "content-length": "7566"}
{"warc-type": "request", "warc-target-uri": "http://www.iana.org/", "content-length": "76"}

HTTP header fields can be included by prefixing them with http:. The special field offset refers to the record's byte offset within the WARC file.

warcio index ./test/data/example-iana.org-chunked.warc -f offset,content-type,http:content-type,warc-target-uri
{"offset": "0", "content-type": "application/warc-fields"}
{"offset": "405", "content-type": "application/http;msgtype=response", "http:content-type": "text/html; charset=UTF-8", "warc-target-uri": "http://www.iana.org/"}
{"offset": "8379", "content-type": "application/http;msgtype=request", "warc-target-uri": "http://www.iana.org/"}

(Note: this library does not produce the CDX or CDXJ format indexes often associated with web archives. To create these indexes, please see the cdxj-indexer tool, which extends warcio indexing to provide this functionality.)

Recompress

The recompress command allows for re-compressing or normalizing WARC (or ARC) files to a record-compressed, gzipped WARC file.

Each WARC record is compressed individually and concatenated. This is the ‘canonical’ WARC storage format used by Webrecorder and other web archiving institutions, and usually stored with a .warc.gz extension.
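The idea behind record-level compression can be sketched with the standard library alone. This is not how warcio implements it internally; it is only an illustration of why the layout is useful. The record contents and the helper `read_member` are made up for the example.

```python
import gzip
import zlib

# Two stand-in "records" (not valid WARC records, just placeholder bytes).
records = [b'record one\r\n\r\n', b'record two\r\n\r\n']

# Compress each record as its own gzip member, then concatenate the members.
members = [gzip.compress(r) for r in records]
archive = b''.join(members)

# Byte offset at which each member starts within the archive.
offsets = []
pos = 0
for member in members:
    offsets.append(pos)
    pos += len(member)


def read_member(data, offset):
    """Decompress the single gzip member that starts at the given offset."""
    # wbits = 16 + MAX_WBITS selects the gzip container; decompression stops
    # at the end of that member, leaving later members untouched.
    decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
    return decomp.decompress(data[offset:])


second = read_member(archive, offsets[1])   # random access to record two
everything = gzip.decompress(archive)       # the file is still valid gzip
print(second, everything)
```

Since every member is independent, a reader that knows a record's offset can seek straight to it without decompressing anything before it. A WARC gzipped as one single stream loses this property.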

It can be used to:

- Compress an uncompressed WARC
- Convert any ARC file to a compressed WARC
- Fix an improperly compressed WARC file (e.g. a WARC compressed entirely instead of by record)

warcio recompress ./input.arc.gz ./output.warc.gz

License

warcio is licensed under the Apache 2.0 License and is part of the Webrecorder project.

See NOTICE and LICENSE for details.
