Command-line tool for metadata extraction from WARC (Web ARChive) files

metawarc

A command-line tool and API for indexing, querying, and extracting metadata from WARC (Web ARChive) files.

metawarc (pronounced me-ta-warc) lets you work with archived web content without parsing WARC files by hand. Index once into a DuckDB catalog with Parquet sidecars, then list, filter, dump, or serve records through the CLI or the REST API.

Main features

WARC indexing with DuckDB catalog and Parquet sidecars
Metadata extraction for PDFs, Office documents, images, and HTML links
Filter records by MIME type, extension, URL pattern, or SQL fragment
REST API with ReDoc documentation and MCP server for agent integrations
Low memory footprint and resumable re-indexing

Supported file formats

Category	Extensions
MS Office OLE	`.doc`, `.xls`, `.ppt`
MS Office XML	`.docx`, `.xlsx`, `.pptx`
Adobe PDF	`.pdf`
Images	`.png`, `.jpg`, `.jpeg`, `.tiff`, `.jp2`
HTML (links)	`.html`, plus records with an HTML MIME type

Installation

pip install --upgrade pip setuptools
pip install --upgrade metawarc

For local development:

git clone https://github.com/ruarxive/metawarc.git
cd metawarc
pip install -e ".[dev]"

Requirements: Python 3.9+

How it works

1. index: scan WARC files and build warcindex.db plus data/*_records.parquet and data/*_headers.parquet
2. index-content (optional): extract typed metadata into additional Parquet tables (_links, _pdfs, _images, …)
3. Query: use list-files, stats, dump, get, or the HTTP API to explore and export data

sample.warc.gz  ──index──►  warcindex.db
                    └──►  data/<uuid>_records.parquet
                    └──►  data/<uuid>_headers.parquet
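
To check which sidecar tables exist for each indexed archive, the file names in the data directory can be grouped by prefix. The following is a minimal sketch, not part of metawarc: it assumes sidecars are named `<prefix>_<table>.parquet` as in the diagram above, and `group_sidecars` is a hypothetical helper name.

```python
from collections import defaultdict
from pathlib import PurePath


def group_sidecars(filenames):
    """Map each sidecar prefix to the sorted table names found for it.

    Assumes names of the form <prefix>_<table>.parquet; every other
    file name is ignored.
    """
    groups = defaultdict(list)
    for name in filenames:
        path = PurePath(name)
        if path.suffix != ".parquet":
            continue
        # Split on the last underscore: "<uuid>_records" -> ("<uuid>", "records").
        prefix, sep, table = path.stem.rpartition("_")
        if sep and prefix:
            groups[prefix].append(table)
    return {prefix: sorted(tables) for prefix, tables in groups.items()}
```

For a real index directory, pass it the directory listing, for example `group_sidecars(p.name for p in pathlib.Path("data").iterdir())`.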

Quickstart

Index a WARC file, inspect stats, and list PDFs:

# 1. Build the index
metawarc index sample.warc.gz

# 2. See what is in the archive
metawarc stats -m exts

# 3. List PDF records
metawarc list-files -e pdf

# 4. Export one file by URL
metawarc get "http://example.com/report.pdf" -o report.pdf

Index a whole directory of archives:

metawarc index 'archives/**/*.warc.gz' -o warcindex.db
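
The four numbered Quickstart steps can also be driven from a Python script. This is a sketch under two assumptions: the `metawarc` executable is on `PATH`, and it exits with a non-zero status on failure (the README does not state this). The function names are illustrative only.

```python
import subprocess


def quickstart_commands(warc_path, url, output):
    """Return the four Quickstart commands as argv lists, in order."""
    return [
        ["metawarc", "index", warc_path],        # 1. build the index
        ["metawarc", "stats", "-m", "exts"],     # 2. extension statistics
        ["metawarc", "list-files", "-e", "pdf"], # 3. list PDF records
        ["metawarc", "get", url, "-o", output],  # 4. export one file by URL
    ]


def run_all(commands):
    """Run each command in turn, stopping at the first failure."""
    for argv in commands:
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(argv, check=True)
```

Passing argv lists rather than a single shell string avoids quoting problems when a URL or path contains spaces or shell metacharacters.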

Usage examples

Indexing

Index a single file to a custom database path:

metawarc index crawl-2024.warc.gz -o /data/crawl.db

Re-scan and overwrite existing Parquet sidecars:

metawarc index sample.warc.gz --rescan

Index quietly (no progress bars):

metawarc index '*.warc.gz' --silent

Content metadata extraction

Extract links from all indexed WARC files:

metawarc index-content -t links

Extract PDF and image metadata for one archive:

metawarc index-content crawl-2024.warc.gz -t pdfs,images

Re-extract Office document metadata after a schema change:

metawarc index-content -t ooxmldocs,oledocs --rescan

Statistics

MIME type breakdown:

metawarc stats -m mimes -d warcindex.db

Extension breakdown:

metawarc stats -m exts

Listing records

All HTML pages:

metawarc list-files -m text/html

Spreadsheets by extension:

metawarc list-files -e xls,xlsx,csv

Large PDFs (SQL WHERE fragment):

metawarc list-files -q "ext = 'pdf' and content_length > 5000000"
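
When the `-q` fragment is assembled from variables, quoting mistakes are easy to make. The sketch below builds a fragment from (column, operator, value) triples; it is an illustration rather than metawarc code, and the only column names taken from this README are `ext` and `content_length`.

```python
def sql_literal(value):
    """Render a Python value as a SQL literal (strings are single-quoted)."""
    if isinstance(value, bool):
        raise TypeError("booleans are not handled in this sketch")
    if isinstance(value, (int, float)):
        return str(value)
    # Standard SQL escapes a single quote inside a string by doubling it.
    return "'" + str(value).replace("'", "''") + "'"


def where_fragment(conditions):
    """Join (column, operator, value) triples with 'and'."""
    allowed_ops = {"=", "!=", "<", "<=", ">", ">="}
    parts = []
    for column, op, value in conditions:
        if op not in allowed_ops:
            raise ValueError(f"unsupported operator: {op}")
        if not column.isidentifier():
            raise ValueError(f"suspicious column name: {column}")
        parts.append(f"{column} {op} {sql_literal(value)}")
    return " and ".join(parts)


# Reproduces the fragment used in the command above.
print(where_fragment([("ext", "=", "pdf"), ("content_length", ">", 5000000)]))
# → ext = 'pdf' and content_length > 5000000
```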

Export the listing to CSV:

metawarc list-files -e pdf -o pdf_records.csv

Limit to specific WARC file IDs (from the files table in DuckDB):

metawarc list-files -w abc123def456 -m application/pdf

Dumping payloads

Dump all ZIP files:

metawarc dump -m application/zip -o exports/zip

Dump images:

metawarc dump -e png,jpg,jpeg -o exports/images

Dump large PDFs:

metawarc dump -q "ext = 'pdf' and content_length > 10000000" -o exports/bigpdf

Each dump directory also contains records.csv with offsets, URLs, and WARC IDs.
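
A quick way to inspect that records.csv is to read its header and count its rows. The exact column names are not listed in this README, so this sketch does not assume any; `summarize_records` is a hypothetical helper.

```python
import csv


def summarize_records(csv_lines):
    """Return (column names, row count) for records.csv content.

    Accepts an open text file or any iterable of lines.
    """
    reader = csv.DictReader(csv_lines)
    count = sum(1 for _ in reader)
    # fieldnames is None when the input is completely empty.
    return reader.fieldnames or [], count
```

Typical use would be `with open("exports/zip/records.csv", newline="") as f: print(summarize_records(f))`, where the path matches the `-o` directory passed to `metawarc dump`.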

Exporting extracted metadata

PDF metadata as JSON Lines:

metawarc dump-metadata -t pdfs -o pdfs.jsonl

Image metadata for one WARC file:

metawarc dump-metadata -i 'crawl-2024.warc.gz' -t images -o images.jsonl

Print link metadata to stdout:

metawarc dump-metadata -t links
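
The JSON Lines output can be processed one record at a time. Since the field names of each metadata type are not documented here, the sketch below only counts which top-level keys occur; it assumes one JSON object per line, and `field_counts` is an illustrative name.

```python
import json
from collections import Counter


def field_counts(lines):
    """Count how often each top-level key appears across JSONL records."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        if isinstance(record, dict):
            counts.update(record.keys())
    return counts
```

For a file written with `-o pdfs.jsonl`, call it as `with open("pdfs.jsonl", encoding="utf-8") as f: print(field_counts(f))`.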

Fetching a single record

By URL:

metawarc get "http://example.com/page.html" -o page.html

By WARC record ID:

metawarc get "<urn:uuid:...>" -o record.bin

REST API

Start the server (default port 8000):

metawarc serve --dbfile warcindex.db
# or
METAWARC_DB_PATH=warcindex.db metawarc serve --port 8000

Open http://localhost:8000/ for interactive ReDoc documentation.

API examples

List indexed WARC files:

curl http://localhost:8000/warcs/list

List HTML records (paginated):

curl 'http://localhost:8000/records/list?exts=html&limit=50'

Filter by URL substring:

curl 'http://localhost:8000/records/list?url_pattern=example.com&limit=10'

Get record metadata:

curl http://localhost:8000/records/get/<wf_id>/record/<record_id>

Get HTTP headers as JSON dict:

curl 'http://localhost:8000/records/get/<wf_id>/headers/<record_id>?mode=dict'

Download record payload:

curl -OJ http://localhost:8000/records/get/<wf_id>/data/<record_id>
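
The same endpoints can be addressed from Python by building the URLs with the standard library. This sketch only constructs URLs and makes no requests. The paths and the `exts`, `url_pattern`, and `limit` parameters are taken from the curl examples above; percent-encoding the path segments is an assumption about what the server accepts, and the helper names are hypothetical.

```python
from urllib.parse import quote, urlencode

BASE_URL = "http://localhost:8000"


def records_list_url(base=BASE_URL, **params):
    """Build a /records/list URL, skipping parameters set to None."""
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{base}/records/list" + (f"?{query}" if query else "")


def record_part_url(wf_id, part, record_id, base=BASE_URL):
    """Build a /records/get/<wf_id>/<part>/<record_id> URL."""
    if part not in {"record", "headers", "data"}:
        raise ValueError(f"unknown part: {part}")
    # safe="" also encodes "/", so each ID stays a single path segment.
    return (
        f"{base}/records/get/{quote(wf_id, safe='')}"
        f"/{part}/{quote(record_id, safe='')}"
    )


print(records_list_url(exts="html", limit=50))
# → http://localhost:8000/records/list?exts=html&limit=50
```

The resulting strings can be passed to `urllib.request.urlopen` or any HTTP client once the server is running.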

MCP server

Expose the API to MCP-compatible agents (default port 8191):

metawarc mcp --dbfile warcindex.db --port 8191

Configuration

Variable	Default	Description
`METAWARC_DB_PATH`	`warcindex.db`	DuckDB index file for API/MCP
`METAWARC_PORT`	`8000`	REST API port
`METAWARC_MCP_PORT`	`8191`	MCP server port
`METAWARC_DEBUG`	`true`	Enable debug logging
`METAWARC_LOG_JSON`	`true`	Emit structured JSON logs
`METAWARC_TITLE`	`Metawarc API`	OpenAPI title
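
A deployment script can use the table above to report which settings are in effect before starting the server. The sketch below mirrors the documented defaults; it is not metawarc's own configuration code, and how metawarc parses these values internally is not covered here.

```python
import os

# Defaults copied from the configuration table above.
DEFAULTS = {
    "METAWARC_DB_PATH": "warcindex.db",
    "METAWARC_PORT": "8000",
    "METAWARC_MCP_PORT": "8191",
    "METAWARC_DEBUG": "true",
    "METAWARC_LOG_JSON": "true",
    "METAWARC_TITLE": "Metawarc API",
}


def effective_settings(environ=None):
    """Return each variable's value from the environment, else its default."""
    environ = os.environ if environ is None else environ
    return {name: environ.get(name, default) for name, default in DEFAULTS.items()}
```

Passing a plain dict instead of `os.environ` makes the function easy to check in isolation, for example `effective_settings({"METAWARC_PORT": "9000"})`.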

Command reference

Command	Description
`index`	Build DuckDB index and Parquet sidecars
`index-content`	Extract links, PDF, image, or Office metadata
`stats`	Print MIME or extension statistics
`list-files`	List matching records
`dump`	Export record payloads to disk
`dump-metadata`	Export extracted metadata as JSONL
`get`	Fetch a single record by URL or WARC ID
`serve`	Run the REST API server
`mcp`	Run the MCP server

Run metawarc <command> --help for all flags.

Development

pip install -e ".[dev]"
pytest
flake8 metawarc tests

License

MIT — see LICENSE.

Release history

Version	Release date
1.3.1 (this version)	Jul 9, 2026
1.1.1	Oct 28, 2022
1.0.2	May 11, 2020
1.0.1	May 10, 2020
