Library and command line tool for WARC file reporting and processing

Project description

WarcRead is a library and command line tool for WARC file reporting and processing.

Latest release: 0.4.0 (2025-07-01)

Release notes: CHANGELOG.rst

License: LICENSE

Quick Start

# Install with pipx
pipx install lockss-warcread

# Verify installation and discover all the commands
warcread --help

# Get all URLs and content types in a pile of WARC files
warcread report --warc mywarcs*.warc.gz --url --content-type

# Get all URLs and content types in a set of WARC files listed in mylist.txt
warcread report --warcs mylist.txt --url --content-type

# Extract the payload ("contents") of https://example.com/about.html
# from within a pile of WARC files
warcread extract --warc mywarcs*.warc.gz --target-url 'https://example.com/about.html'

Installation

WarcRead is available from the Python Package Index (PyPI) as lockss-warcread (https://pypi.org/project/lockss-warcread).

  • To install WarcRead in your own non-virtual environment, we recommend using pipx:

    pipx install lockss-warcread
  • To install WarcRead globally for every user, you can use pipx as root with the --global flag (provided you are running a recent enough pipx):

    pipx install --global lockss-warcread
  • To install WarcRead in a Python virtual environment, simply use pip:

    pip install lockss-warcread

The installation process adds a lockss.warcread Python Library and a warcread Command Line Tool. You can check at the command line that the installation is functional by running warcread version or warcread --help.

Command Line Tool

WarcRead is invoked at the command line as:

warcread

or as a Python module:

python -m lockss.warcread

Help messages and this document use warcread throughout, but the two invocation styles are interchangeable.

Synopsis

WarcRead uses Commands, in the style of programs like git, dnf/yum, apt/apt-get, and the like. You can see the list of available Commands by invoking warcread --help:

Usage: warcread [-h] {copyright,ext,extract,license,rep,report,version} ...

Tool for WARC file reporting and processing

Commands:
  {copyright,ext,extract,license,rep,report,version}
    copyright           print the copyright and exit
    ext                 synonym for: extract
    extract             extract parts of response records
    license             print the software license and exit
    rep                 synonym for: report
    report              output tab-separated report over response records
    version             print the version number and exit

Help:
  -h, --help            show this help message and exit

WARC File Options

Commands expect one or more WARC files to process. The set of WARC files to process is derived from:

  • The WARC files listed as --warc/-w options.

  • The WARC files found in the files listed as --warcs/-W options.

Examples:

warcread report --warc mywarc01.warc.gz --warc mywarc02.warc.gz --warc mywarc03.warc.gz ... --url

warcread report -w mywarc01.warc.gz -w mywarc02.warc.gz -w mywarc03.warc.gz ... --url

warcread report --warc mywarc01.warc.gz mywarc02.warc.gz mywarc03.warc.gz ... --url

warcread report -w mywarc01.warc.gz mywarc02.warc.gz mywarc03.warc.gz ... --url

warcread report --warcs mylist1.txt --warcs mylist2.txt --warcs mylist3.txt ... --url

warcread report -W mylist1.txt -W mylist2.txt -W mylist3.txt ... --url

warcread report --warcs mylist1.txt mylist2.txt mylist3.txt ... --url

warcread report -W mylist1.txt mylist2.txt mylist3.txt ... --url
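
The list files passed to --warcs/-W can be generated rather than written by hand. The following is a minimal Python sketch, not part of WarcRead itself. It assumes that a list file holds one WARC path per line, which this document does not spell out; the helper name write_warc_list and the default file pattern are hypothetical.

```python
from pathlib import Path


def write_warc_list(directory, list_path, pattern="*.warc.gz"):
    """Write the WARC files found under `directory` to `list_path`.

    Assumption: a list file holds one WARC path per line.
    Returns the sorted list of paths that were written.
    """
    # Search the directory tree recursively and sort for a stable order.
    warcs = sorted(str(p) for p in Path(directory).rglob(pattern))
    # One path per line, each terminated by a newline.
    Path(list_path).write_text("".join(w + "\n" for w in warcs), encoding="utf-8")
    return warcs
```

Calling write_warc_list("mywarcs", "mylist.txt") would produce a file that can then be passed as warcread report --warcs mylist.txt --url, provided the one-path-per-line assumption holds.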

Commands

The available commands are:

Command    Abbreviation    Purpose
extract    ext             extract parts of response records
report     rep             output tab-separated report over response records

Top-Level Program

The top-level executable alone does not perform any action or default to a given command:

Usage: warcread [-h] {copyright,ext,extract,license,rep,report,version} ...
warcread: error: the following arguments are required: {copyright,ext,extract,license,rep,report,version}

extract (ext)

The extract (or alternatively ext) command can be used to look for the WARC response record for a given target URL in a set of WARC files, and extract the WARC record’s headers (--warc-headers/--wh/-A), the HTTP response’s headers (--http-headers/--hh/-H), or the HTTP response’s payload (--http-payload/--hp/-P):

Usage: warcread extract [-h] [-w WARC [WARC ...]] [-W WARCS [WARCS ...]] [-t TARGET_URL] [-H] [-P] [-A]

Required Arguments:
  -t, --target-url TARGET_URL
                        target URL

Optional Arguments:
  -w, --warc WARC [WARC ...]
                        (WARCs) add one or more WARC files to the set of WARC files to process (default: [])
  -W, --warcs WARCS [WARCS ...]
                        (WARCs) add the WARC files listed in one or more files to the set of WARC files to process (default: [])
  -H, --hh, --http-headers
                        (action) extract HTTP headers for target URL (default: False)
  -P, --hp, --http-payload
                        (action) extract HTTP payload for target URL (default: False)
  -A, --wh, --warc-headers
                        (action) extract WARC headers for target URL (default: False)

Help:
  -h, --help            show this help message and exit

The command needs:

  • One or more WARC files, from the WARC File Options (--warc/-w options, --warcs/-W options).

  • A target URL, from the --target-url/-t option.
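
When extract is driven from a script, it can help to assemble the argument list in one place. The sketch below uses only the options shown in the usage message above; the function names build_extract_command and extract are hypothetical, and the second function assumes that the extracted part is written to standard output, which this document does not state.

```python
import subprocess

# The three action flags documented in the usage message, without the dashes.
PARTS = {"warc-headers", "http-headers", "http-payload"}


def build_extract_command(warcs, target_url, part="http-payload"):
    """Build the argument list for `warcread extract`."""
    if part not in PARTS:
        raise ValueError(f"part must be one of {sorted(PARTS)}")
    if not warcs:
        raise ValueError("at least one WARC file is required")
    # --warc accepts several files; the next option ends the list.
    return ["warcread", "extract", "--warc", *warcs,
            "--target-url", target_url, f"--{part}"]


def extract(warcs, target_url, part="http-payload"):
    """Run the command and return its standard output as bytes.

    Assumption: the extracted part is written to standard output.
    """
    cmd = build_extract_command(warcs, target_url, part)
    return subprocess.run(cmd, check=True, capture_output=True).stdout
```

Passing the arguments as a list avoids shell quoting problems with URLs that contain characters such as & or ?.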

report (rep)

The report (or alternatively rep) command can be used to produce a tabular (tab-separated) report over a set of WARC files, listing one or more columns of information about each response record:

Usage: warcread report [-h] [-w WARC [WARC ...]] [-W WARCS [WARCS ...]] [-c] [-n] [-d] [-p] [-r] [-s] [-m] [-u] [-D] [-F]

Optional Arguments:
  -w, --warc WARC [WARC ...]
                        (WARCs) add one or more WARC files to the set of WARC files to process (default: [])
  -W, --warcs WARCS [WARCS ...]
                        (WARCs) add the WARC files listed in one or more files to the set of WARC files to process (default: [])
  -c, --content-type    (column) output HTTP Content-Type (e.g. text/xml; charset=UTF-8) (default: False)
  -n, --http-code       (column) output HTTP response code (e.g. 404) (default: False)
  -d, --http-date       (column) output HTTP Date (default: False)
  -p, --http-protocol   (column) output HTTP protocol (e.g. HTTP/1.1) (default: False)
  -r, --http-reason     (column) output HTTP reason (e.g. Not Found) (default: False)
  -s, --http-status     (column) output HTTP status (e.g. HTTP/1.1 404 Not Found) (default: False)
  -m, --media-type      (column) output media type of HTTP Content-Type (e.g. text/xml) (default: False)
  -u, --url             (column) output URL of WARC record (default: False)
  -D, --warc-date       (column) output date of WARC record (default: False)
  -F, --warc-file       (column) output name of WARC file of origin (default: False)

Help:
  -h, --help            show this help message and exit

The command needs:

  • One or more WARC files, from the WARC File Options (--warc/-w options, --warcs/-W options).

  • One or more column options, chosen among --content-type/-c, --http-code/-n, --http-date/-d, --http-protocol/-p, --http-reason/-r, --http-status/-s, --media-type/-m, --url/-u, --warc-date/-D, and --warc-file/-F. Note that --url/-u is currently not enabled by default; pass it explicitly to include the URL column.
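
Because the report is tab-separated, it is straightforward to post-process. The following Python sketch tallies media types from a report with a URL column and a Content-Type column. It is illustrative only: the sample data is made up, the names parse_report and media_type are hypothetical, and the code assumes that the output has no header row and that fields appear in the order given by the caller, neither of which is stated in this document.

```python
import csv
import io
from collections import Counter


def parse_report(text, columns):
    """Parse tab-separated report text into a list of dicts.

    Assumptions: no header row, one record per line, and fields in
    the order given by `columns`.
    """
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return [dict(zip(columns, row)) for row in reader if row]


def media_type(content_type):
    """Reduce a Content-Type value to its media type, e.g. text/xml."""
    return content_type.split(";", 1)[0].strip().lower()


# Made-up sample standing in for the output of
# `warcread report ... --url --content-type`.
SAMPLE = (
    "https://example.com/\ttext/html; charset=UTF-8\n"
    "https://example.com/feed.xml\ttext/xml; charset=UTF-8\n"
    "https://example.com/about.html\ttext/html\n"
)

rows = parse_report(SAMPLE, ["url", "content_type"])
counts = Counter(media_type(row["content_type"]) for row in rows)
print(counts.most_common())
```

The media_type helper mirrors the distinction the options above draw between --content-type (the full header value) and --media-type (the value without parameters); using --media-type directly makes the helper unnecessary.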

Library

The lockss.warcread.warcutil module contains a variety of utilities for WARC file processing. The module is documented inline with Python docstrings, which can be viewed with pydoc.
