Tool and library for handling Web ARChive (WARC) files.

WARCAT: Web ARChive (WARC) Archiving Tool

Quick Start


Requirements:

  • Python 3

Install stable version:

pip3 install warcat

Or install latest version:

git clone git://
pip3 install -r requirements.txt
python3 setup.py install

Example Run:

python3 -m warcat --help
python3 -m warcat list example/at.warc.gz
python3 -m warcat verify megawarc.warc.gz --progress
python3 -m warcat extract megawarc.warc.gz --output-dir /tmp/megawarc/ --progress

Supported commands

concat
    Naively join archives into one
extract
    Extract files from archive
help
    List commands available
list
    List contents of archive
pass
    Load archive and write it back out
split
    Split archives into individual records
verify
    Verify digest and validate conformance
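
As an illustration of what a digest check involves, here is a minimal Python sketch. It is not Warcat's implementation. It assumes the label format seen in the sample WARC-Block-Digest header in this document ('sha1:' followed by a Base32-encoded SHA-1 hash), and the function names block_digest and digest_matches are hypothetical.

```python
import base64
import hashlib


def block_digest(block: bytes) -> str:
    """Return a 'sha1:<base32>' label for the given content block bytes."""
    encoded = base64.b32encode(hashlib.sha1(block).digest()).decode('ascii')
    return 'sha1:' + encoded


def digest_matches(block: bytes, header_value: str) -> bool:
    """Compare a block against a 'WARC-Block-Digest'-style header value."""
    algorithm, _, expected = header_value.partition(':')
    if algorithm.lower() != 'sha1':
        # Assumption: only SHA-1 is handled in this sketch.
        raise ValueError('unsupported digest algorithm: ' + algorithm)
    actual = block_digest(block).partition(':')[2]
    return actual == expected.upper()
```

A verifier would call digest_matches with the raw block bytes of each record and the value of that record's digest field, and report any record for which it returns False.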

Library

Example:

>>> import warcat.model
>>> warc = warcat.model.WARC()
>>> warc.load('example/at.warc.gz')
>>> len(warc.records)
>>> record = warc.records[0]
>>> record.warc_type
>>> record.content_length
>>> record.header.version
>>> record.header.fields.list()
[('WARC-Type', 'warcinfo'), ('Content-Type', 'application/warc-fields'), ('WARC-Date', '2013-04-09T00:11:14Z'), ('WARC-Record-ID', '<urn:uuid:972777d2-4177-4c63-9fde-3877dacc174e>'), ('WARC-Filename', 'at.warc.gz'), ('WARC-Block-Digest', 'sha1:3C6SPSGP5QN2HNHKPTLYDHDPFYKYAOIX'), ('Content-Length', '233')]
>>> record.header.fields['content-type']
>>> record.content_block.fields.list()
[('software', 'Wget/1.13.4-2608 (linux-gnu)'), ('format', 'WARC File Format 1.0'), ('conformsTo', ''), ('robots', 'classic'), ('wget-arguments', '"" "--warc-file=at" ')]
>>> record.content_block.fields['software']
'Wget/1.13.4-2608 (linux-gnu)'
>>> record.content_block.payload.length
>>> bytes(warc)[:60]
b'WARC/1.0\r\nWARC-Type: warcinfo\r\nContent-Type: application/war'
>>> bytes(record.content_block.fields)[:60]
b'software: Wget/1.13.4-2608 (linux-gnu)\r\nformat: WARC File Fo'


The library may not be entirely thread-safe yet.


The goal of the Warcat project is to create a tool and library that make manipulating WARC files as easy and fast as manipulating other archives such as tar and zip.

Warcat is designed to handle large gzipped files by extracting them partially, as needed.

Warcat is provided without warranty and cannot guarantee the safety of your files. Remember to make backups and test them!


This implementation is based loosely on the draft ISO 28500 papers WARC_ISO_28500_version1_latestdraft.pdf and warc_ISO_DIS_28500.pdf.

File format

Here’s a quick description:

A WARC file contains one or more Records concatenated together. Each Record consists of Named Fields, a newline, a Content Block, and two newlines. A Content Block is one of two types: {binary data} or {Named Fields, a newline, and binary data}. Each Named Field consists of a string, a colon, a string, and a newline.
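
The layout described above can be sketched as a small parser. This is a simplified illustration, not Warcat's code: it assumes CRLF newlines and a leading "WARC/1.0" version line (both visible in the byte output of the library example), reads the block size from the Content-Length field, and ignores folded continuation lines. The names parse_named_fields, parse_record, and sample are hypothetical.

```python
def parse_named_fields(data: bytes) -> list:
    """Split 'name: value' lines into a list of (name, value) tuples."""
    fields = []
    for line in data.split(b'\r\n'):
        if not line:
            continue
        name, _, value = line.partition(b':')
        fields.append((name.decode('utf-8'), value.strip().decode('utf-8')))
    return fields


def parse_record(data: bytes):
    """Parse one uncompressed record; return (version, fields, block, rest)."""
    head, separator, rest = data.partition(b'\r\n\r\n')
    if not separator:
        raise ValueError('end of Named Fields not found')
    version_line, _, field_data = head.partition(b'\r\n')
    fields = parse_named_fields(field_data)
    # Look the length up case-insensitively.
    lookup = {name.lower(): value for name, value in fields}
    length = int(lookup['content-length'])
    block = rest[:length]
    if rest[length:length + 4] != b'\r\n\r\n':
        raise ValueError('record is not terminated by two newlines')
    return version_line.decode('ascii'), fields, block, rest[length + 4:]


# A tiny hand-built record whose Content Block is itself Named Fields.
sample_block = b'robots: classic\r\n'
sample = (
    b'WARC/1.0\r\n'
    b'WARC-Type: warcinfo\r\n'
    b'Content-Type: application/warc-fields\r\n'
    b'Content-Length: ' + str(len(sample_block)).encode('ascii') + b'\r\n'
    b'\r\n' + sample_block + b'\r\n\r\n'
)
version, fields, block, remaining = parse_record(sample)
```

Calling parse_record repeatedly on the returned remainder walks through a file of concatenated records until the remainder is empty.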

A Record may be compressed with gzip. A filename ending in .warc.gz indicates one or more gzip-compressed files concatenated together.
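
To make the concatenation concrete, the following Python sketch splits a byte string into its individual gzip members using only the standard library. It is an illustration under simple assumptions (the whole input is held in memory and every member is complete), not a description of how Warcat reads files. The name iter_gzip_members is hypothetical.

```python
import gzip
import zlib


def iter_gzip_members(data: bytes):
    """Yield the decompressed bytes of each gzip member in `data`."""
    while data:
        # Adding 16 to the window size tells zlib to expect a gzip header.
        decompressor = zlib.decompressobj(zlib.MAX_WBITS | 16)
        member = decompressor.decompress(data)
        member += decompressor.flush()
        yield member
        # Bytes after the end of this member belong to the next one.
        data = decompressor.unused_data


# Two independently compressed payloads, concatenated like a .warc.gz file.
archive = gzip.compress(b'record one') + gzip.compress(b'record two')
members = list(iter_gzip_members(archive))
```

Because each member is a complete gzip stream, a reader that knows a member's byte offset can decompress that one record without touching the rest of the file.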



Always remember to test. To run the test suite:

python3 -m unittest discover -p '*'


To do

  • Smart archive join
  • Regex filtering of records
  • Generate index to disk (e.g., for fast resume)
  • Grab files like wget and archive them
  • See TODO and FIXME markers in code
  • etc.

Files for Warcat, version 2.2.5

Filename: Warcat-2.2.5.tar.gz (57.9 kB)
File type: Source
