Skip to main content

Tool and library for handling Web ARChive (WARC) files.

Project description

WARCAT: Web ARChive (WARC) Archiving Tool

Tool and library for handling Web ARChive (WARC) files.

Quick Start

Requirements:

  • Python 3

Install stable version:

pip-3 install warcat

Or install latest version:

git clone git://github.com/chfoo/warcat.git
pip-3 install -r requirements.txt
python3 setup.py install

Example Run:

python3 -m warcat --help
python3 -m warcat list example/at.warc.gz
python3 -m warcat verify megawarc.warc.gz --progress
python3 -m warcat extract megawarc.warc.gz --output-dir /tmp/megawarc/ --progress

Supported commands

concat

Naively join archives into one

extract

Extract files from archive

help

List commands available

list

List contents of archive

pass

Load archive and write it back out

split

Split archives into individual records

verify

Verify digest and validate conformance

Library

Example:

>>> import warcat.model
>>> warc = warcat.model.WARC()
>>> warc.load('example/at.warc.gz')
>>> len(warc.records)
8
>>> record = warc.records[0]
>>> record.warc_type
'warcinfo'
>>> record.content_length
233
>>> record.header.version
'1.0'
>>> record.header.fields.list()
[('WARC-Type', 'warcinfo'), ('Content-Type', 'application/warc-fields'), ('WARC-Date', '2013-04-09T00:11:14Z'), ('WARC-Record-ID', '<urn:uuid:972777d2-4177-4c63-9fde-3877dacc174e>'), ('WARC-Filename', 'at.warc.gz'), ('WARC-Block-Digest', 'sha1:3C6SPSGP5QN2HNHKPTLYDHDPFYKYAOIX'), ('Content-Length', '233')]
>>> record.header.fields['content-type']
'application/warc-fields'
>>> record.content_block.fields.list()
[('software', 'Wget/1.13.4-2608 (linux-gnu)'), ('format', 'WARC File Format 1.0'), ('conformsTo', 'http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf'), ('robots', 'classic'), ('wget-arguments', '"http://www.archiveteam.org/" "--warc-file=at" ')]
>>> record.content_block.fields['software']
'Wget/1.13.4-2608 (linux-gnu)'
>>> record.content_block.payload.length
0
>>> bytes(warc)[:60]
b'WARC/1.0\r\nWARC-Type: warcinfo\r\nContent-Type: application/war'
>>> bytes(record.content_block.fields)[:60]
b'software: Wget/1.13.4-2608 (linux-gnu)\r\nformat: WARC File Fo'

About

The goal of the Warcat project is to create a tool and library as easy and fast as manipulating any other archive such as tar and zip archives.

Warcat is designed to handle large, gzip-ed files by partially extracting them as needed.

Warcat is provided without warranty and cannot guarantee the safety of your files. Remember to make backups and test them!

Specification

This implementation is based loosely on draft ISO 28500 papers WARC_ISO_28500_version1_latestdraft.pdf and warc_ISO_DIS_28500.pdf which can be found at http://bibnum.bnf.fr/WARC/ .

File format

Here’s a quick description:

A WARC file contains one or more Records concatenated together. Each Record contains Named Fields, newline, a Content Block, newline, and newline. A Content Block may be two types: {binary data} or {Named Fields, newline, and binary data}. Named Fields consists of string, colon, string, and newline.

A Record may be compressed with gzip. Filenames ending with .warc.gz indicate one or more gzip compressed files concatenated together.

Alternatives

Warcat is inspired by

Development

Travis build status

Testing

Always remember to test. Continue testing:

python3 -m unittest discover -p '*_test.py'
nosetests3

To-do

  • Smart archive join

  • Regex filtering of records

  • Generate index to disk (eg, for fast resume)

  • Grab files like wget and archive them

  • See TODO and FIXME markers in code

  • etc.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

Warcat-2.2.5.tar.gz (57.9 kB view details)

Uploaded Source

File details

Details for the file Warcat-2.2.5.tar.gz.

File metadata

  • Download URL: Warcat-2.2.5.tar.gz
  • Upload date:
  • Size: 57.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for Warcat-2.2.5.tar.gz
Algorithm Hash digest
SHA256 a3d7af6b4f1cbc6244833e19904932647c8f57d46a2b770fc499a1ec5ca8a8bd
MD5 30393815f749e7ae79b2cf14639c8cc7
BLAKE2b-256 51783abb1702eae1ac1dec44a0d1d366ff10394679894b7a2acc6b6efd0db898

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page