
metawarc: a command-line tool for data extraction from WARC files (web archives)


metawarc: a command-line tool for metadata extraction from files inside WARC (Web ARChive) archives


metawarc (pronounced me-ta-warc) is a command-line tool for processing WARC files.

Its goal is to make CLI interaction with files inside WARC archives as easy as possible.

It provides a simple metawarc command that extracts metadata from images, documents and other files inside WARC archives.

Main features


  • Built-in WARC support

  • Metadata extraction for many file formats

  • Low memory footprint

  • Documentation

  • Test coverage

File formats supported


  • MS Office OLE: .doc, .xls, .ppt

  • MS Office XML: .docx, .xlsx, .pptx

  • Adobe PDF: .pdf

  • Images: .png, .jpg, .tiff, .jpeg, .jp2

Installation


Any OS


A universal installation method (that works on Windows, Mac OS X, Linux, …, and always provides the latest version) is to use pip:

# Make sure we have an up-to-date version of pip and setuptools:

$ pip install --upgrade pip setuptools



$ pip install --upgrade metawarc

(If pip installation fails for some reason, you can try easy_install metawarc as a fallback.)

Python version


Python version 3.6 or greater is required.
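
The version requirement above can be checked before installing. The following is a minimal sketch that uses only the standard library; the helper names python_is_supported and metawarc_on_path are hypothetical and are not part of metawarc.

```python
import shutil
import sys

# Minimum interpreter version stated in this README.
MIN_PYTHON = (3, 6)


def python_is_supported(version_info=sys.version_info):
    """Return True if the given version tuple meets the 3.6 minimum."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def metawarc_on_path():
    """Return the path of the metawarc executable, or None if it is not found."""
    return shutil.which("metawarc")


if __name__ == "__main__":
    print("Python OK:", python_is_supported())
    print("metawarc executable:", metawarc_on_path())
```

If metawarc_on_path() returns None after installation, the directory that pip installs scripts into is probably not on PATH.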

Usage


Synopsis:

$ metawarc [command] [flags] inputfile

See also metawarc --help and metawarc [command] --help for help on each command.

Examples


Extract metadata of all supported file types from ‘digital.gov.ru.warc.gz’ and output results to default filename ‘metadata.jsonl’:

$ metawarc metadata digital.gov.ru.warc.gz

Extract metadata for .doc and .docx file types from ‘digital.gov.ru.warc.gz’ and output results to default filename ‘metadata.jsonl’:

$ metawarc metadata --filetypes doc,docx digital.gov.ru.warc.gz

Extract metadata for .doc and .docx file types from ‘digital.gov.ru.warc.gz’ and output results to filename ‘digital_meta.jsonl’:

$ metawarc metadata --filetypes doc,docx --output digital_meta.jsonl digital.gov.ru.warc.gz
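
When metawarc is driven from a script, the command lines above can be assembled programmatically. The sketch below assumes metawarc is installed and on PATH; it uses only the --filetypes and --output flags shown in the examples, and the helper names build_metadata_command and run_metadata are hypothetical.

```python
import subprocess


def build_metadata_command(warc_path, filetypes=None, output=None):
    """Build the argument list for `metawarc metadata`.

    Only the --filetypes and --output flags shown in the examples are used;
    this helper itself is hypothetical and not part of metawarc.
    """
    cmd = ["metawarc", "metadata"]
    if filetypes:
        # File types are passed as one comma-separated value, e.g. "doc,docx".
        cmd += ["--filetypes", ",".join(filetypes)]
    if output:
        cmd += ["--output", output]
    cmd.append(warc_path)
    return cmd


def run_metadata(warc_path, filetypes=None, output=None):
    """Run the command and raise CalledProcessError on a non-zero exit code."""
    return subprocess.run(
        build_metadata_command(warc_path, filetypes, output), check=True
    )
```

Passing the arguments as a list avoids shell quoting problems with unusual file names.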

Commands


Metadata command


Extracts metadata from files inside .warc files. Returns JSON lines output for each file found.

Extract metadata for .doc and .docx file types from ‘digital.gov.ru.warc.gz’ and output results to filename ‘digital_meta.jsonl’:

$ metawarc metadata --filetypes doc,docx --output digital_meta.jsonl digital.gov.ru.warc.gz
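
Because the output is a JSON lines file, it can be read back one record per line. The sketch below makes no assumption about which fields metawarc writes; it only parses each line and counts the top-level keys it finds. The function names read_jsonl and count_keys are hypothetical.

```python
import json


def read_jsonl(path):
    """Yield one parsed JSON value per non-empty line of a JSON lines file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def count_keys(records):
    """Count how often each top-level key appears across dict records."""
    counts = {}
    for record in records:
        if isinstance(record, dict):
            for key in record:
                counts[key] = counts.get(key, 0) + 1
    return counts
```

For example, count_keys(read_jsonl("digital_meta.jsonl")) gives a quick overview of which fields are present in the extracted metadata.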

Analyze command


Returns a list of MIME types with statistics: the number of files and the total file size for each MIME type.

This command will be merged into or replaced by the ‘stats’ command, which uses an SQLite database to speed up data processing.

Analyzes ‘digital.gov.ru.warc.gz’ and prints the list of MIME types as a table to the console:

$ metawarc analyze digital.gov.ru.warc.gz
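
The kind of summary this command reports (file count and total size per MIME type) can be sketched in a few lines. This is an illustration of the aggregation only, not metawarc's implementation; the function names and the (mime_type, size) input format are assumptions.

```python
from collections import defaultdict


def summarize_by_mime(records):
    """Aggregate (mime_type, size_in_bytes) pairs into per-type totals.

    Returns a dict mapping each MIME type to (file_count, total_size).
    """
    totals = defaultdict(lambda: [0, 0])
    for mime, size in records:
        totals[mime][0] += 1
        totals[mime][1] += size
    return {mime: (count, size) for mime, (count, size) in totals.items()}


def print_table(summary):
    """Print the summary as a plain-text table, largest total size first."""
    print(f"{'mime':<30}{'files':>8}{'size':>14}")
    for mime, (count, size) in sorted(summary.items(), key=lambda kv: -kv[1][1]):
        print(f"{mime:<30}{count:>8}{size:>14}")
```

The aggregation keeps only one small entry per MIME type in memory, so it works on a stream of records of any length.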

Index command


Generates a ‘metawarc.db’ SQLite database with the HTTP metadata of the records. Required for the ‘stats’ command to calculate statistics quickly.

Analyzes ‘digital.gov.ru.warc.gz’ and writes ‘metawarc.db’ with HTTP metadata:

$ metawarc index digital.gov.ru.warc.gz
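
The schema of ‘metawarc.db’ is not described here, so the following sketch does not assume any table or column names. It only lists whatever tables and columns the database contains, using Python's standard sqlite3 module; the function name list_tables is hypothetical.

```python
import sqlite3


def list_tables(db_path):
    """Return {table_name: [column names]} for every table in the database."""
    schema = {}
    conn = sqlite3.connect(db_path)
    try:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        for name in names:
            # PRAGMA table_info returns one row per column; index 1 is the name.
            quoted = '"' + name.replace('"', '""') + '"'
            columns = conn.execute(f"PRAGMA table_info({quoted})")
            schema[name] = [column[1] for column in columns]
    finally:
        conn.close()
    return schema
```

Note that sqlite3.connect creates an empty database file if the path does not exist, so list_tables("metawarc.db") should be run in the directory where the index was generated.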

Stats command


Same as the ‘analyze’ command but uses ‘metawarc.db’ to speed up data processing. Returns the total length and the count of records for each MIME type or file extension.

Processes data in ‘metawarc.db’ and prints the total length and count for each MIME type:

$ metawarc stats -m mimes

Processes data in ‘metawarc.db’ and prints the total length and count for each file extension:

$ metawarc stats -m exts
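
A database speeds this up because the grouping can be done with a single SQL query instead of re-reading the archive. The sketch below shows the idea on an in-memory database. The table records(mime, length) and the rows in it are made up for illustration and are an assumption, not the actual schema of ‘metawarc.db’.

```python
import sqlite3


def totals_by_mime(conn):
    """Return (mime, record_count, total_length) rows, largest total first.

    Assumes a hypothetical table records(mime TEXT, length INTEGER).
    """
    query = (
        "SELECT mime, COUNT(*), SUM(length) FROM records "
        "GROUP BY mime ORDER BY SUM(length) DESC"
    )
    return conn.execute(query).fetchall()


def demo_connection():
    """Build an in-memory database with made-up rows for illustration."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE records (mime TEXT, length INTEGER)")
    conn.executemany(
        "INSERT INTO records VALUES (?, ?)",
        [("text/html", 100), ("text/html", 50), ("application/pdf", 2000)],
    )
    return conn
```

Calling totals_by_mime(demo_connection()) returns one row per MIME type with its record count and total length.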

Export command


Extracts HTTP headers, WARC headers or text content from a WARC file and saves them as an NDJSON (JSON lines) data file.

Exports HTTP headers from ‘digital.gov.ru.warc.gz’ and writes them to ‘headers.jsonl’:

$ metawarc export -t headers -o headers.jsonl digital.gov.ru.warc.gz

Exports the WarcIO index from ‘digital.gov.ru.warc.gz’ and writes it to ‘data.jsonl’ with the fields listed in the ‘-f’ option:

$ metawarc export -t warcio -f offset,length,filename,http:status,http:content-type,warc-type,warc-target-uri -o data.jsonl digital.gov.ru.warc.gz

Exports text (HTML) content from ‘digital.gov.ru.warc.gz’ and writes it to ‘content.jsonl’:

$ metawarc export -t content -o content.jsonl digital.gov.ru.warc.gz
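
Exported JSON lines files can be summarized with a short script. The sketch below counts the values of one field across all records. It assumes that each exported line is a JSON object and that its keys match the field names passed to ‘-f’ (for example warc-type); neither assumption is confirmed here, and the function name count_by_field is hypothetical.

```python
import json
from collections import Counter


def count_by_field(lines, field):
    """Count values of `field` across JSON lines; missing values count as None."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        value = record.get(field) if isinstance(record, dict) else None
        counts[value] += 1
    return counts
```

A typical call would open the exported file and pass the file object directly, for example count_by_field(open("data.jsonl", encoding="utf-8"), "warc-type").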


Source Distribution

metawarc-1.1.1.tar.gz (15.4 kB)

File details

Details for the file metawarc-1.1.1.tar.gz.

File metadata

  • Download URL: metawarc-1.1.1.tar.gz
  • Size: 15.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/32.0 requests/2.27.1 requests-toolbelt/0.9.1 urllib3/1.26.8 tqdm/4.64.0 importlib-metadata/4.11.1 keyring/23.5.0 rfc3986/1.5.0 colorama/0.4.5 CPython/3.10.0

File hashes

Hashes for metawarc-1.1.1.tar.gz

  • SHA256: c42da6fbfe6c8b562338ffbfc93516773e87c856f9603e6d37b27ab9b5014120
  • MD5: 2d308e8bd206cffdaacb6c34c51c7d48
  • BLAKE2b-256: a37a29700e38e3451bcadd00c096d93ef37735a99c4aec539e86a0cd8ad3c5b2
