metawarc: a command-line tool for metadata extraction from files inside WARC (web archive)

Project description

metawarc (pronounced me-ta-warc) is a command-line tool for processing WARC files. Its goal is to make CLI interaction with files inside WARC archives as easy as possible. It provides a simple metawarc command that extracts metadata from images, documents and other files inside WARC archives.

1   Main features

  • Built-in WARC support
  • Metadata extraction for a lot of file formats
  • Low memory footprint
  • Documentation
  • Test coverage

2   File formats supported

MS Office OLE: .doc, .xls, .ppt

MS Office XML: .docx, .xlsx, .pptx

Adobe PDF: .pdf

Images: .png, .jpg, .tiff, .jpeg, .jp2

3   Installation

3.1   Any OS

A universal installation method (that works on Windows, Mac OS X, Linux, …, and always provides the latest version) is to use pip:

# Make sure we have an up-to-date version of pip and setuptools:
$ pip install --upgrade pip setuptools

$ pip install --upgrade metawarc

(If pip installation fails for some reason, you can try easy_install metawarc as a fallback.)

3.2   Python version

Python version 3.6 or greater is required.

4   Usage

Synopsis:

$ metawarc [command] [flags] inputfile

See also metawarc --help for general help and metawarc [command] --help for help on a specific command.

4.1   Examples

Extract metadata of all supported file types from ‘digital.gov.ru.warc.gz’ and output results to default filename ‘metadata.jsonl’:

$ metawarc metadata digital.gov.ru.warc.gz

Extract metadata for .doc and .docx file types from ‘digital.gov.ru.warc.gz’ and output results to default filename ‘metadata.jsonl’:

$ metawarc metadata --filetypes doc,docx digital.gov.ru.warc.gz

Extract metadata for .doc and .docx file types from ‘digital.gov.ru.warc.gz’ and output results to filename ‘digital_meta.jsonl’:

$ metawarc metadata --filetypes doc,docx --output digital_meta.jsonl digital.gov.ru.warc.gz
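When metawarc is driven from a script rather than typed by hand, the examples above could be wrapped roughly as follows. This is only a sketch: the helper names build_metadata_command and run_metadata are hypothetical and not part of metawarc, and the only flags used are --filetypes and --output as shown in the examples. It assumes the metawarc executable is on PATH.

```python
import shlex
import subprocess


def build_metadata_command(inputfile, filetypes=None, output=None):
    """Build the argument list for a `metawarc metadata` call.

    Hypothetical helper: it only mirrors the flags shown in the examples.
    """
    argv = ["metawarc", "metadata"]
    if filetypes:
        # The examples pass file types as a comma-separated list, e.g. doc,docx.
        argv += ["--filetypes", ",".join(filetypes)]
    if output:
        argv += ["--output", output]
    argv.append(inputfile)
    return argv


def run_metadata(inputfile, filetypes=None, output=None):
    """Run the command; raises CalledProcessError on a non-zero exit code."""
    argv = build_metadata_command(inputfile, filetypes, output)
    return subprocess.run(argv, check=True)


if __name__ == "__main__":
    cmd = build_metadata_command(
        "digital.gov.ru.warc.gz",
        filetypes=["doc", "docx"],
        output="digital_meta.jsonl",
    )
    # Print the shell-quoted command instead of running it.
    print(" ".join(shlex.quote(part) for part in cmd))
```

Passing the arguments as a list avoids shell quoting problems with unusual file names.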

5   Commands

5.1   Metadata command

Extracts metadata from files inside WARC files and writes the results as JSON Lines, one record for each file found.

Extract metadata for .doc and .docx file types from ‘digital.gov.ru.warc.gz’ and output results to filename ‘digital_meta.jsonl’:

$ metawarc metadata --filetypes doc,docx --output digital_meta.jsonl digital.gov.ru.warc.gz
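Because the output is JSON Lines, it can be read back one line at a time without loading the whole file. The sketch below assumes nothing about the record fields, which are not documented here; it only parses each line and counts which top-level keys occur. The function names read_jsonl and count_top_level_keys are hypothetical.

```python
import json
from collections import Counter


def read_jsonl(path):
    """Yield one parsed JSON value per non-empty line of the file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def count_top_level_keys(records):
    """Count how often each top-level key appears across dict records."""
    counts = Counter()
    for record in records:
        if isinstance(record, dict):
            counts.update(record.keys())
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python inspect_jsonl.py digital_meta.jsonl
    if len(sys.argv) > 1:
        for key, n in count_top_level_keys(read_jsonl(sys.argv[1])).most_common():
            print(key, n)
```

Counting keys first is a cheap way to discover what the records actually contain before writing any field-specific processing.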

5.2   Analyze command

Returns a list of MIME types with the number of files and the total file size for each MIME type.

Analyze ‘digital.gov.ru.warc.gz’ and print the list of MIME types as a table to the console:

$ metawarc analyze digital.gov.ru.warc.gz
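The kind of per-type statistics described above can be illustrated with a small aggregation sketch. It is not metawarc's implementation: it assumes the (MIME type, size in bytes) pairs have already been obtained from the archive by some other means, and the function names and table layout are hypothetical.

```python
from collections import defaultdict


def summarize_mime_types(items):
    """Aggregate (mime_type, size_in_bytes) pairs into per-type statistics."""
    stats = defaultdict(lambda: {"files": 0, "size": 0})
    for mime, size in items:
        entry = stats[mime]
        entry["files"] += 1
        entry["size"] += size
    return dict(stats)


def format_table(stats):
    """Render the statistics as a plain-text table, largest total size first."""
    row = "{:<30} {:>8} {:>12}"
    lines = [row.format("mime type", "files", "size")]
    ordered = sorted(stats.items(), key=lambda kv: kv[1]["size"], reverse=True)
    for mime, entry in ordered:
        lines.append(row.format(mime, entry["files"], entry["size"]))
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up sample values, for illustration only.
    sample = [("text/html", 10), ("text/html", 5), ("image/png", 7)]
    print(format_table(summarize_mime_types(sample)))
```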

Download files

Files for metawarc, version 1.0.2:

Filename, size                    File type    Python version
metawarc-1.0.2.tar.gz (7.4 kB)    Source       None
