Library and command line tool for WARC file reporting and processing
WarcRead is a library and command line tool for WARC file reporting and processing.
- Latest release: 0.4.0 (2025-07-01)
Quick Start
# Install with pipx
pipx install lockss-warcread
# Verify installation and discover all the commands
warcread --help
# Get all URLs and content types in a pile of WARC files
warcread report --warc mywarcs*.warc.gz --url --content-type
# Get all URLs and content types in a set of WARC files listed in mylist.txt
warcread report --warcs mylist.txt --url --content-type
# Extract the payload ("contents") of https://example.com/about.html
# from within a pile of WARC files
warcread extract --warc mywarcs*.warc.gz --target-url 'https://example.com/about.html'
Installation
WarcRead is available from the Python Package Index (PyPI) as lockss-warcread (https://pypi.org/project/lockss-warcread).
To install WarcRead in your own non-virtual environment, we recommend using pipx:
pipx install lockss-warcread
To install WarcRead globally for every user, you can use pipx as root with the --global flag (provided you are running a recent enough pipx):
pipx install --global lockss-warcread
To install WarcRead in a Python virtual environment, simply use pip:
pip install lockss-warcread
The installation process adds a lockss.warcread Python Library and a warcread Command Line Tool. You can check at the command line that the installation is functional by running warcread version or warcread --help.
Command Line Tool
WarcRead is invoked at the command line as:
warcread
or as a Python module:
python -m lockss.warcread
Help messages and this document use warcread throughout, but the two invocation styles are interchangeable.
Synopsis
WarcRead uses Commands, in the style of programs like git, dnf/yum, apt/apt-get, and the like. You can see the list of available Commands by invoking warcread --help:
Usage: warcread [-h] {copyright,ext,extract,license,rep,report,version} ...
Tool for WARC file reporting and processing
Commands:
{copyright,ext,extract,license,rep,report,version}
copyright print the copyright and exit
ext synonym for: extract
extract extract parts of response records
license print the software license and exit
rep synonym for: report
report output tab-separated report over response records
version print the version number and exit
Help:
-h, --help show this help message and exit
WARC File Options
Commands expect one or more WARC files to process. The set of WARC files to process is derived from:
The WARC files listed as --warc/-w options.
The WARC files found in the files listed as --warcs/-W options.
Examples:
warcread report --warc mywarc01.warc.gz --warc mywarc02.warc.gz --warc mywarc03.warc.gz ... --url
warcread report -w mywarc01.warc.gz -w mywarc02.warc.gz -w mywarc03.warc.gz ... --url
warcread report --warc mywarc01.warc.gz mywarc02.warc.gz mywarc03.warc.gz ... --url
warcread report -w mywarc01.warc.gz mywarc02.warc.gz mywarc03.warc.gz ... --url
warcread report --warcs mylist1.txt --warcs mylist2.txt --warcs mylist3.txt ... --url
warcread report -W mylist1.txt -W mylist2.txt -W mylist3.txt ... --url
warcread report --warcs mylist1.txt mylist2.txt mylist3.txt ... --url
warcread report -W mylist1.txt mylist2.txt mylist3.txt ... --url
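When the set of WARC files is large, it can be convenient to generate the list file passed to --warcs/-W rather than write it by hand. The following is a minimal sketch, not part of WarcRead: the helper name write_warc_list is hypothetical, and it assumes the list file holds one WARC file path per line, which this document does not spell out.

```python
from pathlib import Path


def write_warc_list(directory, list_path, pattern="*.warc.gz"):
    """Write the paths of WARC files found in `directory` to `list_path`.

    Assumption (not confirmed by the WarcRead documentation): the list
    file read by --warcs/-W contains one WARC file path per line.
    Returns the sorted list of paths that were written.
    """
    # Sort so that the list file is stable from one run to the next.
    warcs = sorted(str(path) for path in Path(directory).glob(pattern))
    Path(list_path).write_text(
        "".join(f"{warc}\n" for warc in warcs), encoding="utf-8"
    )
    return warcs
```

Under these assumptions, calling write_warc_list("mywarcs", "mylist.txt") would produce a file that can then be used as warcread report --warcs mylist.txt --url.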
Commands
The available commands are:
| Command | Abbreviation | Purpose |
|---|---|---|
| extract | ext | extract parts of response records |
| report | rep | output tab-separated report over response records |
Top-Level Program
The top-level executable alone does not perform any action or default to a given command:
Usage: warcread [-h] {copyright,ext,extract,license,rep,report,version} ...
warcread: error: the following arguments are required: {copyright,ext,extract,license,rep,report,version}
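The usage and error messages above resemble those produced by Python's argparse module. As an illustration of the pattern only, and not WarcRead's actual source, the sketch below declares git-style commands with synonyms and makes the command mandatory. The program name demo and the names build_parser and canonical_command are hypothetical.

```python
import argparse

# Maps each synonym to its full command name (illustrative subset).
SYNONYMS = {"ext": "extract", "rep": "report"}


def build_parser():
    """Build a parser whose top level requires one of several commands."""
    parser = argparse.ArgumentParser(prog="demo")
    # required=True turns a missing command into a usage error
    # instead of silently doing nothing.
    commands = parser.add_subparsers(dest="command", required=True)
    commands.add_parser("extract", aliases=["ext"],
                        help="extract parts of response records")
    commands.add_parser("report", aliases=["rep"],
                        help="output tab-separated report over response records")
    return parser


def canonical_command(args):
    """Return the full command name, whichever spelling was typed."""
    return SYNONYMS.get(args.command, args.command)
```

With this parser, an empty argument list makes argparse print a usage message and exit with status 2, while rep and report both resolve to the same command.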
extract (ext)
The extract (or alternatively ext) command can be used to look for the WARC response record for a given target URL in a set of WARC files, and extract the WARC record’s headers (--warc-headers/--wh/-A), the HTTP response’s headers (--http-headers/--hh/-H), or the HTTP response’s payload (--http-payload/--hp/-P):
Usage: warcread extract [-h] [-w WARC [WARC ...]] [-W WARCS [WARCS ...]] [-t TARGET_URL] [-H] [-P] [-A]
Required Arguments:
-t, --target-url TARGET_URL
target URL
Optional Arguments:
-w, --warc WARC [WARC ...]
(WARCs) add one or more WARC files to the set of WARC files to process (default: [])
-W, --warcs WARCS [WARCS ...]
(WARCs) add the WARC files listed in one or more files to the set of WARC files to process (default: [])
-H, --hh, --http-headers
(action) extract HTTP headers for target URL (default: False)
-P, --hp, --http-payload
(action) extract HTTP payload for target URL (default: False)
-A, --wh, --warc-headers
(action) extract WARC headers for target URL (default: False)
Help:
-h, --help show this help message and exit
The command needs:
One or more WARC files, from the WARC File Options (--warc/-w options, --warcs/-W options).
A target URL, from the --target-url/-t option.
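To call the extract command from a Python script, one option is to assemble the argument list and run the command line tool as a subprocess. This is a hedged sketch rather than an official API: the function names extract_argv and run_extract are hypothetical, the sketch requests a single action per call, and it assumes the extracted content is written to standard output, which this document does not state.

```python
import subprocess

# Long-form action options documented for the extract command.
ACTIONS = ("--warc-headers", "--http-headers", "--http-payload")


def extract_argv(target_url, warcs=(), warc_lists=(), action="--http-payload"):
    """Build the argument list for one `warcread extract` invocation."""
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action}")
    if not warcs and not warc_lists:
        raise ValueError("at least one WARC file or list file is needed")
    argv = ["warcread", "extract", "--target-url", target_url, action]
    if warcs:
        argv += ["--warc", *warcs]          # WARC files given directly
    if warc_lists:
        argv += ["--warcs", *warc_lists]    # files that list WARC files
    return argv


def run_extract(target_url, **kwargs):
    """Run warcread and return whatever it wrote to standard output.

    Assumption: the extracted headers or payload go to standard output.
    """
    argv = extract_argv(target_url, **kwargs)
    result = subprocess.run(argv, check=True, stdout=subprocess.PIPE)
    return result.stdout
```

Passing the arguments as a list avoids shell quoting problems with URLs that contain characters such as ? or &.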
report (rep)
The report (or alternatively rep) command can be used to produce a tabular (tab-separated) report over a set of WARC files, listing one or more columns of information about each response record:
Usage: warcread report [-h] [-w WARC [WARC ...]] [-W WARCS [WARCS ...]] [-c] [-n] [-d] [-p] [-r] [-s] [-m] [-u] [-D] [-F]
Optional Arguments:
-w, --warc WARC [WARC ...]
(WARCs) add one or more WARC files to the set of WARC files to process (default: [])
-W, --warcs WARCS [WARCS ...]
(WARCs) add the WARC files listed in one or more files to the set of WARC files to process (default: [])
-c, --content-type (column) output HTTP Content-Type (e.g. text/xml; charset=UTF-8) (default: False)
-n, --http-code (column) output HTTP response code (e.g. 404) (default: False)
-d, --http-date (column) output HTTP Date (default: False)
-p, --http-protocol (column) output HTTP protocol (e.g. HTTP/1.1) (default: False)
-r, --http-reason (column) output HTTP reason (e.g. Not Found) (default: False)
-s, --http-status (column) output HTTP status (e.g. HTTP/1.1 404 Not Found) (default: False)
-m, --media-type (column) output media type of HTTP Content-Type (e.g. text/xml) (default: False)
-u, --url (column) output URL of WARC record (default: False)
-D, --warc-date (column) output date of WARC record (default: False)
-F, --warc-file (column) output name of WARC file of origin (default: False)
Help:
-h, --help show this help message and exit
The command needs:
One or more WARC files, from the WARC File Options (--warc/-w options, --warcs/-W options).
One or more column options, chosen among --content-type/-c, --http-code/-n, --http-date/-d, --http-protocol/-p, --http-reason/-r, --http-status/-s, --media-type/-m, --url/-u, --warc-date/-D, and --warc-file/-F. Note that --url/-u is currently not enabled by default, so pass it explicitly when the report should include the URL column.
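Because the report is tab-separated, its output can be post-processed with Python's standard csv module. The sketch below is illustrative only: parse_report is a hypothetical helper, the caller supplies the column names in the order the columns appear in the output, and the sketch assumes the report has no header row. Neither the column order nor the absence of a header is confirmed by this document.

```python
import csv
import io


def parse_report(text, columns):
    """Turn tab-separated report text into a list of dictionaries.

    `columns` names the fields in the order they appear on each line.
    Assumption: the report has no header row.
    """
    # QUOTE_NONE keeps quote characters inside URLs from being
    # interpreted as CSV quoting.
    reader = csv.reader(io.StringIO(text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    # Blank lines come back as empty rows and are skipped.
    return [dict(zip(columns, row)) for row in reader if row]
```

For example, text captured from a report run with two column options could be passed as parse_report(text, ["url", "content_type"]), provided the columns are named in the order in which they actually appear.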
Library
The lockss.warcread.warcutil module contains a variety of utilities for WARC file processing. The module is documented inline with Python docstrings, which can be viewed with pydoc.