Command line tools and libraries for handling and manipulating WARC files (and HTTP contents)

Warctools

WARC (Web ARChive) file tools for Python 2/3, based on the WARC 1.0 spec and compatible with the Internet Archive's ARC file format. Originally developed by Hanzo Archives.

Install

pip install warctools

Python Usage

from hanzo import warctools

Python Examples

Write a WARC file:

import os

from hanzo import warctools


def write():
    # WARC headers are (name, value) pairs of bytes.
    headers = [
        (b'WARC-Type', b'warcinfo'),
        (b'WARC-Date', b'2019-11-19T23:08:51.182451Z'),
        (b'WARC-Filename', b'CRAWL-20191119230851-00000-hostname.warc.gz'),
        (b'WARC-Record-ID', b'<urn:uuid:8cc5dcae-0b21-11ea-842b-525476278032>')
    ]
    content_type = b'application/warc-fields'
    payload = 'This\nis\nonly\na\ntest\n'.encode()
    fname = 'test.warc.gz'

    # Append to an existing archive, otherwise create a new one.
    mode = 'ab'
    if not os.path.exists(fname):
        mode = 'wb'

    with open(fname, mode) as _fh:
        # The record content is a (content type, body) pair.
        content = (content_type, payload)
        record = warctools.WarcRecord(headers=headers, content=content)
        record.write_to(_fh, gzip="record")
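
The example above appends to an existing file and passes gzip="record". Assuming that option writes each record as its own gzip member (which matches the "record by record" wording of the -Z option below), the standard library alone can show why appending members still yields a readable file. This sketch does not use warctools, and the helper name append_gzip_member is hypothetical:

```python
import gzip
import io


def append_gzip_member(stream, payload):
    """Compress payload as a separate gzip member and append it to stream."""
    stream.write(gzip.compress(payload))


# Two independently compressed members, written one after the other.
buffer = io.BytesIO()
append_gzip_member(buffer, b"first record\n")
append_gzip_member(buffer, b"second record\n")

# A multi-member gzip stream decompresses to the concatenated payloads.
combined = gzip.decompress(buffer.getvalue())
```

Because every member is self-contained, a reader can also seek to the start of one member and decompress only that record.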

Command-line Usage

warcvalid

Returns 0 if the arguments are all valid W/ARC files, non-zero on error.

[warctools] $ warcvalid -h
Usage: warcvalid [options] warc warc warc

Options:
  -h, --help            show this help message and exit
  -l LIMIT, --limit=LIMIT
  -I INPUT_FORMAT, --input=INPUT_FORMAT
  -L LOG_LEVEL, --log-level=LOG_LEVEL
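
Since warcvalid reports its result through the exit status, it can be driven from a script. A minimal sketch, assuming warcvalid is on the PATH; the function name all_valid and its command parameter are hypothetical and not part of warctools:

```python
import subprocess


def all_valid(paths, command=("warcvalid",)):
    """Return True if the validator exits with status 0 for the given files."""
    # The command is a parameter so that a different executable can be swapped in.
    result = subprocess.run([*command, *paths])
    return result.returncode == 0
```

A call such as all_valid(["test.warc.gz"]) would then return False for any non-zero exit status.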

warcdump

Writes a human-readable summary of WARC files. Autodetects the input format when filenames are passed, i.e. record-gzip vs. plaintext, WARC vs. ARC. Assumes uncompressed WARC on stdin if no arguments are given.

[warctools] $ warcdump -h
Usage: warcdump [options] warc warc warc

Options:
  -h, --help            show this help message and exit
  -l LIMIT, --limit=LIMIT
  -I INPUT_FORMAT, --input=INPUT_FORMAT
  -L LOG_LEVEL, --log-level=LOG_LEVEL

warcfilter

Searches all headers for a regex pattern. Autodetects the input format and reads stdin like warcdump. Prints matching records in WARC format by default. Use -i to invert the search, -U to constrain it to the URL, -T to constrain it to the record type, and -C to constrain it to the content type.

$ warcfilter -h
Usage: warcfilter [options] pattern warc warc warc

Options:
  -h, --help            show this help message and exit
  -l LIMIT, --limit=LIMIT
                        limit (ignored)
  -I INPUT_FORMAT, --input=INPUT_FORMAT
                        input format (ignored)
  -i, --invert          invert match
  -U, --url             match on url
  -T, --type            match on (warc) record type
  -C, --content-type    match on (warc) record content type
  -H, --http-content-type
                        match on http payload content type
  -D, --warc-date       match on WARC-Date header
  -L LOG_LEVEL, --log-level=LOG_LEVEL
                        log level(ignored)
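
The matching rule described above (search the headers for a regex, optionally inverted) can be sketched in a few lines. This is an illustration only, not the warcfilter implementation: it assumes headers are (name, value) pairs of bytes as in the Python example, and that only the values are searched. The function name record_matches is hypothetical:

```python
import re


def record_matches(headers, pattern, invert=False):
    """Return True if any header value matches pattern (or none does, if inverted)."""
    regex = re.compile(pattern)
    found = any(regex.search(value) for _name, value in headers)
    # With invert=True the result flips, like the -i option.
    return found != invert
```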

warc2warc

Autodetects compression on file arguments. Assumes uncompressed stdin if none are given. Use -Z to write compressed output, e.g. warc2warc -Z input > input.gz. Should ignore buggy records in input.

[warctools] $ warc2warc -h
Usage: warc2warc [options] url (url ...)

Options:
  -h, --help            show this help message and exit
  -o OUTPUT, --output=OUTPUT
                        output warc file
  -l LIMIT, --limit=LIMIT
  -I INPUT_FORMAT, --input=INPUT_FORMAT
                        (ignored)
  -Z, --gzip            compress output, record by record
  -D, --decode_http     decode http messages (strip chunks, gzip)
  -L LOG_LEVEL, --log-level=LOG_LEVEL
  --wget-chunk-fix      skip transfer-encoding headers in http records, when
                        decoding them (-D)
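
The -D option is described as stripping chunks from HTTP messages. As a rough illustration of what removing HTTP chunked transfer encoding involves, here is a standalone sketch. It is not the warc2warc code, it ignores trailers, and the function name dechunk is hypothetical:

```python
def dechunk(body):
    """Decode an HTTP chunked-encoded body (bytes) into the plain payload."""
    out = bytearray()
    pos = 0
    while True:
        # Each chunk starts with a hex size line, optionally with ";extensions".
        eol = body.index(b"\r\n", pos)
        size = int(body[pos:eol].split(b";", 1)[0].strip(), 16)
        if size == 0:
            break  # a zero-length chunk ends the body
        start = eol + 2
        out += body[start:start + size]
        pos = start + size + 2  # skip the CRLF that follows the chunk data
    return bytes(out)
```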

arc2warc

Creates a crappy WARC file from ARC files on input. A handful of headers are preserved. Use -Z to write compressed output, e.g. arc2warc -Z input.arc > input.warc.gz

[warctools] $ arc2warc -h
Usage: arc2warc [options] arc (arc ...)

Options:
  -h, --help            show this help message and exit
  -o OUTPUT, --output=OUTPUT
                        output warc file
  -l LIMIT, --limit=LIMIT
  -Z, --gzip            compress
  -L LOG_LEVEL, --log-level=LOG_LEVEL
  --description=DESCRIPTION
  --operator=OPERATOR
  --publisher=PUBLISHER
  --audience=AUDIENCE
  --resource=RESOURCE
  --response=RESPONSE

warcindex

DEPRECATED, use CDX-writer branch.

#WARC-filename offset warc-type warc-subject-uri warc-record-id content-type content-length
warccrap/mywarc.warc 1196018 request /images/slides/hanzo_markm__wwwoh.pdf <urn:uuid:fd1255a8-d07c-11df-b125-12313b0a18c6> application/http;msgtype=request 193
warccrap/mywarc.warc 1196631 response http://www.hanzoarchives.com/images/slides/hanzo_markm__wwwoh.pdf <urn:uuid:fd2614f8-d07c-11df-b125-12313b0a18c6> application/http;msgtype=response 3279474
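
The index lines above follow the seven columns named in the header comment. A small parsing sketch, assuming fields are separated by whitespace and that only the content-type column may itself contain spaces; the names IndexEntry and parse_index_line are hypothetical:

```python
from typing import NamedTuple


class IndexEntry(NamedTuple):
    filename: str
    offset: int
    warc_type: str
    subject_uri: str
    record_id: str
    content_type: str
    content_length: int


def parse_index_line(line):
    """Split one warcindex output line into its seven columns."""
    # The first five columns are taken from the left, the length from the right.
    filename, offset, warc_type, uri, record_id, rest = line.strip().split(None, 5)
    content_type, length = rest.rsplit(None, 1)
    return IndexEntry(filename, int(offset), warc_type, uri, record_id,
                      content_type, int(length))
```

The offset column could then be used to seek directly to a record in the named file.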

Notes

  1. arc2warc uses the conversion rules from the earlier arc2warc.c as a starter for converting the headers
  2. I haven't profiled the code yet (and don't plan to until it falls over)
  3. warcvalid only loosely checks files against the ISO standard. Missing checks:
    • strict whitespace
    • required headers check
    • mime quoted printable header encoding
    • treating headers as utf8
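
As an illustration of the missing "required headers check", the sketch below tests for the four fields that WARC 1.0 marks as mandatory on every record. It is not part of warcvalid, it assumes headers are (name, value) pairs of bytes as in the Python example, and the names REQUIRED_HEADERS and missing_required_headers are hypothetical:

```python
# Fields that WARC 1.0 requires on every record.
REQUIRED_HEADERS = (b"WARC-Type", b"WARC-Date", b"WARC-Record-ID", b"Content-Length")


def missing_required_headers(headers):
    """Return the required header names that are absent from headers."""
    # Field names are compared case-insensitively.
    present = {name.lower() for name, _value in headers}
    return [name for name in REQUIRED_HEADERS if name.lower() not in present]
```

Applied to the header list from the write example, which is given before any Content-Length is known, this would report only b"Content-Length".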

ToDo

  1. Lots more testing
  2. Support pre-1.0 WARC files
  3. Add more documentation
  4. Support more commandline options for output and filenames
  5. S3 urls

Credits

Originally developed by "tef" thomas.figg@hanzoarchives.com.

@internetarchive
