WACZ Format Tools

py-wacz

The py-wacz repository contains a Python module and command line utility for working with web archive data using the WACZ format specification. Web Archive Collection Zipped (WACZ) allows web archives to be shared and distributed by providing a predictable way of packaging up web archive data and metadata as a ZIP file. The wacz command line utility supports converting any WARC files into WACZ files, and optionally generating full-text search indices of pages.

Install

Use pip to install the module and the command line utility:

pip install wacz

Once installed you can use the wacz command line utility to create and validate WACZ files.

Create

To create a WACZ package you can point wacz at a WARC file and tell it where to write the WACZ with the -o option:

wacz create -o myfile.wacz <path/to/WARC>

The resulting myfile.wacz should be loadable via ReplayWeb.page.
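If you want to drive the command from a Python script, a minimal sketch along the following lines may help. It only shells out to the wacz command line utility with flags documented in this README; the helper names build_create_command and create_wacz are hypothetical and are not part of the wacz module.

```python
import subprocess
from typing import List


def build_create_command(
    warc_path: str,
    output: str,
    detect_pages: bool = False,
    text: bool = False,
) -> List[str]:
    """Build the argument list for `wacz create` (hypothetical helper)."""
    cmd = ["wacz", "create", "-o", output]
    if detect_pages:
        cmd.append("--detect-pages")
    if text:
        cmd.append("--text")
    cmd.append(warc_path)
    return cmd


def create_wacz(warc_path: str, output: str, **options: bool) -> None:
    """Run `wacz create`; raises CalledProcessError if the command fails."""
    subprocess.run(build_create_command(warc_path, output, **options), check=True)


# Show the command that would be run, without executing it.
print(build_create_command("tests/fixtures/example-collection.warc", "myfile.wacz"))
```

Passing the arguments as a list rather than a single string avoids shell quoting problems with paths that contain spaces.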

wacz accepts the following options for customizing how the WACZ file is assembled.

-f --file

Explicitly declare the file being passed to the create function.

wacz create -f tests/fixtures/example-collection.warc

-o --output

Explicitly declare the name of the WACZ file being created.

wacz create tests/fixtures/example-collection.warc -o mywacz.wacz

-t --text

Generates a pages.jsonl page index with a full-text index. Must be used together with --detect-pages; on its own it has no effect.

wacz create tests/fixtures/example-collection.warc --detect-pages -t

--detect-pages

Generates pages.jsonl page index without a full-text index.

wacz create tests/fixtures/example-collection.warc --detect-pages

-p --pages

Uses the passed JSONL pages file instead of generating the pages index.

wacz create tests/fixtures/example-collection.warc -p passed_pages.jsonl
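As a rough illustration of what a passed pages file might contain, the sketch below builds a small JSONL document in Python. The layout is an assumption based on the WACZ specification's description of pages.jsonl (a header line followed by one JSON object per page with url, ts and title fields) and is not stated in this README; pages_to_jsonl is a hypothetical helper, so check the spec before relying on it.

```python
import json


def pages_to_jsonl(pages, title="All Pages"):
    """Serialize page records as JSONL (assumed layout, hypothetical helper)."""
    # Assumed header line; the exact fields come from the WACZ spec, not this README.
    header = {"format": "json-pages-1.0", "id": "pages", "title": title}
    lines = [json.dumps(header)]
    # One JSON object per line for each page.
    lines += [json.dumps(page) for page in pages]
    return "\n".join(lines) + "\n"


example = pages_to_jsonl(
    [
        {
            "url": "https://example.com/",
            "ts": "2020-10-07T21:22:36Z",
            "title": "Example Domain",
        }
    ]
)
print(example, end="")
```

Writing the returned string to passed_pages.jsonl would give a file in the shape assumed here.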

-e --extra-pages

Uses the passed extra JSONL pages file instead of generating the extra pages index.

wacz create tests/fixtures/example-collection.warc -p passed_pages.jsonl -e extra_pages.jsonl

-c --copy-pages

Overrides the behavior of the --pages and --extra-pages options to copy an existing pages.jsonl and/or extraPages.jsonl as-is directly into the WACZ, rather than attempting to match each page to a WARC record. The files are still parsed for basic correctness.

wacz create tests/fixtures/example-collection.warc --pages pages/pages.jsonl --extra-pages pages/extraPages.jsonl --copy-pages
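The README does not say what "basic correctness" covers. One plausible pre-flight check, sketched below, is that every non-empty line parses as a JSON object. This is an assumption for illustration only, not a description of what wacz itself does, and find_bad_jsonl_lines is a hypothetical helper.

```python
import json


def find_bad_jsonl_lines(text):
    """Return 1-based numbers of lines that are not JSON objects (hypothetical check)."""
    bad = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            bad.append(number)
            continue
        if not isinstance(record, dict):
            bad.append(number)  # valid JSON, but not an object
    return bad


sample = '{"url": "https://example.com/"}\nnot json\n[1, 2]\n'
print(find_bad_jsonl_lines(sample))
```

Running a check like this before using --copy-pages can surface malformed lines early.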

-t --text

When passing pages with --pages, you can add a full-text index by also including the --text flag.

wacz create tests/fixtures/example-collection.warc -p passed_pages.jsonl --text

-l --log-directory

Adds the log files in the specified directory to the WACZ.

wacz create tests/fixtures/example-collection.warc -l tests/fixtures/logs

--ts

Overrides the ts metadata value in the datapackage.json file.

wacz create tests/fixtures/example-collection.warc --ts TIMESTAMP

--url

Overrides the url metadata value in the datapackage.json file.

wacz create tests/fixtures/example-collection.warc --url URL

--title

Overrides the title metadata value in the datapackage.json file.

wacz create tests/fixtures/example-collection.warc --title TITLE

--desc

Overrides the desc metadata value in the datapackage.json file.

wacz create tests/fixtures/example-collection.warc --desc DESC
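To see which metadata values ended up in a WACZ, you can read datapackage.json back out of the ZIP with the standard library. This is a minimal sketch that assumes datapackage.json sits at the root of the archive, which this README does not state; read_datapackage is a hypothetical helper name.

```python
import json
import zipfile


def read_datapackage(wacz_path):
    """Return (member names, parsed datapackage.json) for a WACZ file.

    Assumes datapackage.json is stored at the root of the ZIP.
    """
    with zipfile.ZipFile(wacz_path) as zf:
        names = zf.namelist()
        with zf.open("datapackage.json") as fh:
            metadata = json.load(fh)
    return names, metadata
```

For example, `names, metadata = read_datapackage("myfile.wacz")` followed by `print(metadata)` would show the stored values. Which keys appear depends on the options passed to wacz create.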

--hash-type

Allows the user to specify the hash type used (sha256 or md5).

wacz create tests/fixtures/example-collection.warc --hash-type md5
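To illustrate the difference between the two hash types, the following sketch computes a digest of a byte stream with Python's hashlib. The "type:hexdigest" output format and the digest_stream helper are assumptions chosen for illustration; they are not taken from this README and may not match what wacz writes.

```python
import hashlib
import io


def digest_stream(stream, hash_type="sha256", chunk_size=65536):
    """Hash a binary stream in chunks and return 'type:hexdigest' (assumed format)."""
    if hash_type not in ("sha256", "md5"):
        raise ValueError(f"unsupported hash type: {hash_type}")
    hasher = hashlib.new(hash_type)
    # Read fixed-size chunks until the stream is exhausted.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        hasher.update(chunk)
    return f"{hash_type}:{hasher.hexdigest()}"


print(digest_stream(io.BytesIO(b"example WARC bytes"), "sha256"))
print(digest_stream(io.BytesIO(b"example WARC bytes"), "md5"))
```

Reading in chunks keeps memory use flat for large WARC files. For a file on disk, open it in binary mode and pass the file object as the stream.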

--signing-url

An optional URL for a WACZ signing server, which will be used to add a signature to the new WACZ.

This URL should point to an authsign /sign API endpoint.

See the section on --verify-auth for more info on signing and verification.

--signing-token

An optional secret token passed to the signing server to allow access. See authsign for more details.

Validate

You can also validate an existing WACZ file by running:

wacz validate myfile.wacz

-f --file

Explicitly declare the file being passed to the validate function.

wacz validate -f myfile.wacz

--verify-auth

New in 0.4.0: this option also verifies that the WACZ is signed, using authsign.

The verification can be done locally or via a remote signing/verification server.

To use a remote server, add --verifier-url, which should be a URL pointing to the authsign /verify endpoint.

To run locally, authsign must be installed, which can be done by running pip install wacz[signing].

See the WACZ Authentication Spec for details on WACZ authentication.

This feature and the specification are still in development (alpha-quality) and are subject to change.
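Validation can also be scripted. The sketch below builds a wacz validate command from the options described above and treats a zero exit status as success. That exit-status behavior is an assumption, since this README does not document it, and validate_command and is_valid are hypothetical helper names.

```python
import subprocess
from typing import List, Optional


def validate_command(
    wacz_path: str,
    verify_auth: bool = False,
    verifier_url: Optional[str] = None,
) -> List[str]:
    """Build the argument list for `wacz validate` (hypothetical helper)."""
    cmd = ["wacz", "validate", "-f", wacz_path]
    if verify_auth:
        cmd.append("--verify-auth")
        if verifier_url:
            # Remote verification via an authsign /verify endpoint.
            cmd += ["--verifier-url", verifier_url]
    return cmd


def is_valid(wacz_path: str, run=subprocess.run, **options) -> bool:
    """Run validation; assumes a zero exit status means the WACZ is valid."""
    result = run(validate_command(wacz_path, **options))
    return result.returncode == 0


print(validate_command("myfile.wacz", verify_auth=True))
```

The run parameter exists only so the runner can be swapped out in tests; by default the real command is executed.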

Testing

If you are developing wacz you can run the unit tests with pytest:

pytest tests

Download files

Source Distribution

wacz-0.5.0.tar.gz (23.1 kB)

Uploaded Source

Built Distribution

wacz-0.5.0-py3-none-any.whl (27.3 kB)

Uploaded Python 3

File details

Details for the file wacz-0.5.0.tar.gz.

File metadata

  • Download URL: wacz-0.5.0.tar.gz
  • Upload date:
  • Size: 23.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.9.19

File hashes

Hashes for wacz-0.5.0.tar.gz
Algorithm Hash digest
SHA256 5feb272b192ad954a66ccb50b417255d79eac573204b2471ced3f038fcd24d2a
MD5 b93778564c96aa168385dd817d99cb6d
BLAKE2b-256 c36f65c5aa43de50c9c780da521092d432d59bd26d390587f9d89483147a6eed

File details

Details for the file wacz-0.5.0-py3-none-any.whl.

File metadata

  • Download URL: wacz-0.5.0-py3-none-any.whl
  • Upload date:
  • Size: 27.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.9.19

File hashes

Hashes for wacz-0.5.0-py3-none-any.whl
Algorithm Hash digest
SHA256 f98d611b273c14d5403f86c299b30e3270a02a78e61d7bcd96af3820bf85e47a
MD5 43fdc9bdd3f788daa42f74f4787720a9
BLAKE2b-256 f85367166fdb21277c1e228f8b60102553a718a6b51a278baa2edac2efe604da
