metawarc: a command-line tool for metadata extraction from files inside WARC (web archive) files
metawarc (pronounced me-ta-warc) is a command-line tool for processing WARC files. Its goal is to make CLI interaction with files inside WARC archives as easy as possible. It provides a simple metawarc command that extracts metadata from images, documents, and other files inside WARC archives.
1 Main features
Built-in WARC support
Metadata extraction for a lot of file formats
Low memory footprint
Documentation
Test coverage
2 File formats supported
MS Office OLE: .doc, .xls, .ppt
MS Office XML: .docx, .xlsx, .pptx
Adobe PDF: .pdf
Images: .png, .jpg, .tiff, .jpeg, .jp2
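The format list above can be captured as a small lookup table, for example to check whether a file's extension falls into one of the supported groups before processing an archive. This is only a sketch: the names SUPPORTED_FORMATS and format_family are hypothetical and not part of metawarc.

```python
import os

# Extension groups copied from the list above. The dictionary itself is
# illustrative only; it is not part of metawarc.
SUPPORTED_FORMATS = {
    "MS Office OLE": {"doc", "xls", "ppt"},
    "MS Office XML": {"docx", "xlsx", "pptx"},
    "Adobe PDF": {"pdf"},
    "Images": {"png", "jpg", "tiff", "jpeg", "jp2"},
}


def format_family(filename):
    """Return the format family for a filename, or None if it is not listed."""
    # splitext keeps the leading dot, so strip it and compare in lower case.
    ext = os.path.splitext(filename)[1].lstrip(".").lower()
    for family, extensions in SUPPORTED_FORMATS.items():
        if ext in extensions:
            return family
    return None
```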
3 Installation
3.1 Any OS
A universal installation method (that works on Windows, Mac OS X, Linux, …, and always provides the latest version) is to use pip:
# Make sure we have an up-to-date version of pip and setuptools:
$ pip install --upgrade pip setuptools
$ pip install --upgrade metawarc
(If pip installation fails for some reason, you can try easy_install metawarc as a fallback.)
3.2 Python version
Python version 3.6 or greater is required.
4 Usage
Synopsis:
$ metawarc [command] [flags] inputfile
See also metawarc --help and metawarc [command] --help for help on each command.
4.1 Examples
Extract metadata of all supported file types from ‘digital.gov.ru.warc.gz’ and output results to default filename ‘metadata.jsonl’:
$ metawarc metadata digital.gov.ru.warc.gz
Extract metadata for .doc and .docx file types from ‘digital.gov.ru.warc.gz’ and output results to default filename ‘metadata.jsonl’:
$ metawarc metadata --filetypes doc,docx digital.gov.ru.warc.gz
Extract metadata for .doc and .docx file types from ‘digital.gov.ru.warc.gz’ and output results to filename ‘digital_meta.jsonl’:
$ metawarc metadata --filetypes doc,docx --output digital_meta.jsonl digital.gov.ru.warc.gz
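Since the results are written as JSON lines, they can be read back with Python's standard json module. The sketch below assumes only that each non-empty line of the output file is one JSON value; the fields inside each record are not documented here, so the code does not rely on any of them. The helper name parse_jsonl is hypothetical.

```python
import json


def parse_jsonl(lines):
    """Yield one parsed JSON value per non-empty line.

    `lines` can be an open text file or any iterable of strings.
    """
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            yield json.loads(line)


# Usage with the default output filename from the examples above:
#
#     with open("metadata.jsonl", encoding="utf-8") as handle:
#         for record in parse_jsonl(handle):
#             print(record)
```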
5 Commands
Metadata command
Extracts metadata from files inside WARC files and writes the results as JSON lines output for each file found.
Extract metadata for .doc and .docx file types from ‘digital.gov.ru.warc.gz’ and output results to filename ‘digital_meta.jsonl’:
$ metawarc metadata --filetypes doc,docx --output digital_meta.jsonl digital.gov.ru.warc.gz
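To call the metadata command from a script, the same command line can be assembled and run with Python's subprocess module. This is a minimal sketch that assumes metawarc is installed and on PATH. It uses only the --filetypes and --output flags shown above, and the function names build_metadata_command and run_metadata are hypothetical.

```python
import subprocess


def build_metadata_command(inputfile, filetypes=None, output=None):
    """Build the argument list for `metawarc metadata`."""
    command = ["metawarc", "metadata"]
    if filetypes:
        # The examples above pass file types as a comma-separated list.
        command += ["--filetypes", ",".join(filetypes)]
    if output:
        command += ["--output", output]
    command.append(inputfile)
    return command


def run_metadata(inputfile, filetypes=None, output=None):
    """Run the command; check=True raises CalledProcessError on a non-zero exit."""
    subprocess.run(build_metadata_command(inputfile, filetypes, output), check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with unusual file names.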