
WarcDB: Web crawl data as SQLite databases


WarcDB is an SQLite-based file format that makes web crawl data easier to share and query.

It is based on the standardized Web ARChive (WARC) format, used by web archives and defined in ISO 28500:2017.

Usage

pip install warcdb
# Load the `archive.warcdb` file with data.
warcdb import archive.warcdb ./tests/google.warc ./tests/frontpages.warc.gz "https://tselai.com/data/google.warc"

warcdb enable-fts ./archive.warcdb response payload

# Search for records that mention "stocks" in their response body
warcdb search ./archive.warcdb response "stocks" -c "WARC-Record-ID"

As these examples show, you can pass any mix of local/remote and raw/compressed archives.
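Under the hood, enable-fts and search delegate to sqlite-utils' full-text-search support, which builds on SQLite's FTS5 extension. The following standalone sketch illustrates that mechanism only; the table and column names are illustrative, not warcdb's actual schema:

```python
# Illustrative FTS5 example (not warcdb's real schema): index some text
# and run a MATCH query, which is what `warcdb search` boils down to.
import sqlite3

db = sqlite3.connect(":memory:")

# A contentless toy table with one indexed column.
db.execute("CREATE VIRTUAL TABLE response_fts USING fts5(payload)")
db.execute("INSERT INTO response_fts VALUES ('tech stocks rallied today')")
db.execute("INSERT INTO response_fts VALUES ('weather report for monday')")

# MATCH performs the full-text search over the indexed column.
hits = db.execute(
    "SELECT payload FROM response_fts WHERE response_fts MATCH 'stocks'"
).fetchall()
print(hits)  # -> [('tech stocks rallied today',)]
```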

For example, to import part of the Common Crawl January 2022 crawl archive in a streaming fashion:

warcdb import archive.warcdb "https://data.commoncrawl.org/crawl-data/CC-MAIN-2022-05/segments/1642320306346.64/warc/CC-MAIN-20220128212503-20220129002503-00719.warc.gz"

You can also import WARC files contained in WACZ files, which are created by tools like ArchiveWeb.Page, Browsertrix-Crawler, and Scoop.

warcdb import archive.warcdb tests/scoop.wacz

How It Works

Individual .warc files are read and parsed, and their records are inserted into an SQLite database using the relational schema shown below.
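The idea can be pictured with a minimal, self-contained sketch. This is an illustration only, not warcdb's actual parser or schema: a real import uses a proper WARC parser and handles payloads, compression, and all record types. The record text, table, and column names here are made up for the example:

```python
# Sketch of the import idea: parse the header block of one WARC record
# and insert it as a row in an SQLite table.
import sqlite3

# A toy WARC record header block (headers only, no payload).
RECORD = (
    "WARC/1.0\r\n"
    "WARC-Type: response\r\n"
    "WARC-Record-ID: <urn:uuid:1234>\r\n"
    "WARC-Target-URI: https://example.com/\r\n"
    "Content-Length: 0\r\n"
)

def parse_warc_headers(raw: str) -> dict:
    """Split 'Name: value' header lines after the WARC/1.0 version line."""
    lines = raw.strip().split("\r\n")
    headers = {}
    for line in lines[1:]:  # skip the version line
        name, _, value = line.partition(": ")
        headers[name] = value
    return headers

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE response (record_id TEXT PRIMARY KEY, target_uri TEXT)")

h = parse_warc_headers(RECORD)
db.execute(
    "INSERT INTO response (record_id, target_uri) VALUES (?, ?)",
    (h["WARC-Record-ID"], h["WARC-Target-URI"]),
)

row = db.execute("SELECT target_uri FROM response").fetchone()
print(row[0])  # -> https://example.com/
```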

Schema

If a new major or minor version of warcdb changes the database schema, existing databases may need to be migrated. To do this, first upgrade warcdb, then import into the database; importing ensures all pending migrations are run. To migrate a database explicitly, run:

warcdb migrate archive.warcdb

If there are no migrations to run, the migrate command does nothing.

Here's the relational schema of the .warcdb file.

[WarcDB schema diagram]

Motivation

From the WARC formal specification:

The WARC (Web ARChive) file format offers a convention for concatenating multiple resource records (data objects), each consisting of a set of simple text headers and an arbitrary data block into one long file.

Many organizations, such as Common Crawl, WebRecorder, Archive.org, and libraries around the world, use the WARC format to archive and store web data.

The full datasets of these services reach multiple pebibytes (PiB), making them impractical to query with non-distributed systems.

This project aims to make subsets of such data easier to access and query using SQL.

Currently, this is implemented on top of SQLite and is a wrapper around the excellent sqlite-utils utility.

Here, "wrapper" means that all existing sqlite-utils CLI commands can be called as expected:

sqlite-utils <command> archive.warcdb

or

warcdb <command> example.warcdb

Examples

Populate with wget

wget --warc-file tselai "https://tselai.com"

warcdb import archive.warcdb tselai.warc.gz

Get all response headers

sqlite3 archive.warcdb <<SQL
select  json_extract(h.value, '$.header') as header, 
        json_extract(h.value, '$.value') as value
from response,
     json_each(http_headers) h
SQL

Get Cookie Headers for requests and responses

sqlite3 archive.warcdb <<SQL
select json_extract(h.value, '$.header') as header, json_extract(h.value, '$.value') as value
from response,
     json_each(http_headers) h
where json_extract(h.value, '$.header') like '%Cookie%'
union
select json_extract(h.value, '$.header') as header, json_extract(h.value, '$.value') as value
from request,
     json_each(http_headers) h
where json_extract(h.value, '$.header') like '%Cookie%'
SQL
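The same extraction can also be driven from Python's built-in sqlite3 module. The toy row below assumes http_headers holds a JSON array of {"header": ..., "value": ...} objects, which is what the queries above rely on:

```python
# Run the JSON-header extraction from Python. The table here is a toy
# stand-in populated by hand, not a real warcdb database.
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE response (http_headers TEXT)")

# http_headers stored as a JSON array of {"header", "value"} objects.
db.execute(
    "INSERT INTO response VALUES (?)",
    (json.dumps([
        {"header": "Set-Cookie", "value": "sid=abc123"},
        {"header": "Content-Type", "value": "text/html"},
    ]),),
)

# json_each() expands the array into one row per header object.
rows = db.execute("""
    SELECT json_extract(h.value, '$.header'),
           json_extract(h.value, '$.value')
    FROM response, json_each(http_headers) h
    WHERE json_extract(h.value, '$.header') LIKE '%Cookie%'
""").fetchall()
print(rows)  # -> [('Set-Cookie', 'sid=abc123')]
```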

Develop

You can use poetry to install dependencies and run the tests:

$ git clone https://github.com/Florents-Tselai/WarcDB.git
$ cd WarcDB
$ poetry install
$ poetry run pytest

Then when you are ready to publish to PyPI:

$ poetry publish --build

Resources on WARC
