CDXJ Indexer for WARC and ARC files

CDXJ Indexer

A command-line tool for generating CDXJ (and CDX) indexes from WARC and ARC files. The indexer is a new tool, redesigned for fast and flexible indexing and based on the indexing functionality in pywb.

Install from PyPI with pip install cdxj-indexer, or install from a local source checkout with python setup.py install.

The indexer supports the classic CDX index format as well as the more flexible CDXJ. With CDXJ, the indexer supports custom fields and access to request records in WARC files. See the examples below and the command-line -h option for the latest features. (This is a work in progress.)

Usage examples

Generate CDXJ index:

> cdxj-indexer /path/to/archive-file.warc.gz
com,example)/ 20170730223850 {"url": "http://example.com/", "mime": "text/html", "status": "200", "digest": "G7HRM7BGOKSKMSXZAHMUQTTV53QOFSMK", "length": "1219", "offset": "771", "filename": "example-20170730223917.warc.gz"}
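
Each CDXJ line in the output above has three space-separated parts: a URL key, a 14-digit timestamp, and a JSON object holding the remaining fields. The following is a minimal sketch of reading such a line in Python, assuming that layout; parse_cdxj_line is a hypothetical helper written for this example and is not part of cdxj-indexer.

```python
import json


def parse_cdxj_line(line):
    """Split one CDXJ line into (key, timestamp, fields).

    Hypothetical helper: assumes the "key timestamp {json}" layout
    shown in the example output above.
    """
    # Split on the first two spaces only, since the JSON part contains spaces.
    key, timestamp, json_part = line.rstrip("\n").split(" ", 2)
    return key, timestamp, json.loads(json_part)


# Sample line copied from the example output above.
line = (
    'com,example)/ 20170730223850 {"url": "http://example.com/", '
    '"mime": "text/html", "status": "200", '
    '"digest": "G7HRM7BGOKSKMSXZAHMUQTTV53QOFSMK", "length": "1219", '
    '"offset": "771", "filename": "example-20170730223917.warc.gz"}'
)

key, timestamp, fields = parse_cdxj_line(line)

# Numeric fields are stored as strings in the JSON, so convert them when needed.
offset = int(fields["offset"])
length = int(fields["length"])
print(key, timestamp, fields["url"], offset, length)
```

Note that the numeric values (status, length, offset) are JSON strings in this output, so callers need to convert them explicitly.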

Generate CDX index (11 fields):

> cdxj-indexer -11 /path/to/archive-file.warc.gz
CDX N b a m s k r M S V g
com,example)/ 20170730223850 http://example.com/ text/html 200 G7HRM7BGOKSKMSXZAHMUQTTV53QOFSMK - - 1219 771 example-20170730223917.warc.gz
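
In the classic format, the header line names the columns with single-letter codes and every following line carries the values in the same order. The sketch below pairs the two up in Python. The descriptive names in CDX11_NAMES are labels chosen for this example (matching the CDXJ keys above where one exists), and parse_cdx_line is a hypothetical helper, not part of cdxj-indexer.

```python
# Labels for the 11 field letters in the header above. The names are
# chosen for this example; only the letters come from the CDX header.
CDX11_NAMES = {
    "N": "urlkey",
    "b": "timestamp",
    "a": "url",
    "m": "mime",
    "s": "status",
    "k": "digest",
    "r": "redirect",
    "M": "meta",
    "S": "length",
    "V": "offset",
    "g": "filename",
}


def parse_cdx_line(header, line):
    """Map the values of one CDX line onto the fields named in the header."""
    letters = header.split()[1:]  # drop the leading "CDX" marker
    values = line.split()
    if len(values) != len(letters):
        raise ValueError("expected %d fields, got %d" % (len(letters), len(values)))
    return {CDX11_NAMES.get(letter, letter): value
            for letter, value in zip(letters, values)}


# Header and data line copied from the example output above.
header = "CDX N b a m s k r M S V g"
line = ("com,example)/ 20170730223850 http://example.com/ text/html 200 "
        "G7HRM7BGOKSKMSXZAHMUQTTV53QOFSMK - - 1219 771 "
        "example-20170730223917.warc.gz")

entry = parse_cdx_line(header, line)
print(entry["url"], entry["status"], entry["offset"], entry["length"])
```

Fields with no value appear as "-" in the CDX line, as the two redirect and meta columns do here.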

More advanced use cases: add additional HTTP headers as fields. The http: prefix specifies headers of the current record, while req.http: specifies headers of the corresponding request record. The following adds the request method, the Date header, and the request Referer header to the index:

> cdxj-indexer -f req.http:method,http:date,req.http:referer /path/to/archive-file.warc.gz
com,example)/ 20170801032435 {"url": "http://example.com/", "mime": "text/html", "status": "200", "digest": "A6DESOVDZ3WLYF57CS5E4RIC4ARPWRK7", "length": "1207", "offset": "834", "filename": "temp-20170801032445.warc.gz", "req.http:method": "GET", "http:date": "Tue, 01 Aug 2017 03:24:35 GMT", "referrer": "https://webrecorder.io/temp-NU34HBNO/temp/recording-session/record/http://example.com/"}
org,iana)/domains/example 20170801032437 {"url": "http://www.iana.org/domains/example", "mime": "text/html", "status": "302", "digest": "RP3Y66FDBYBZKSFYQ4VJ4RMDA5BPDJX2", "length": "675", "offset": "2652", "filename": "temp-20170801032445.warc.gz", "req.http:method": "GET", "http:date": "Tue, 01 Aug 2017 02:35:05 GMT", "referrer": "http://example.com/"}
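
Custom fields land in the same JSON object as the standard ones, so they can be read back by name. As a hedged illustration, the sketch below compares the capture timestamp of an entry with its http:date field. It assumes the timestamp is in UTC, and it uses a shortened copy of the second line above that keeps only a few of its fields. capture_vs_server_date is a hypothetical helper, not part of cdxj-indexer.

```python
import json
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def capture_vs_server_date(line):
    """Return (capture_time, server_date) as UTC datetimes for one CDXJ line.

    Hypothetical helper: assumes the 14-digit timestamp is UTC and that
    the line was indexed with the http:date custom field.
    """
    _key, timestamp, json_part = line.rstrip("\n").split(" ", 2)
    fields = json.loads(json_part)

    capture_time = datetime.strptime(timestamp, "%Y%m%d%H%M%S").replace(
        tzinfo=timezone.utc
    )
    server_date = parsedate_to_datetime(fields["http:date"])
    if server_date.tzinfo is None:
        # Some date strings parse without a zone; treat those as UTC here.
        server_date = server_date.replace(tzinfo=timezone.utc)
    return capture_time, server_date


# Shortened copy of the second example line above (some fields omitted).
line = (
    'org,iana)/domains/example 20170801032437 '
    '{"url": "http://www.iana.org/domains/example", "status": "302", '
    '"req.http:method": "GET", "http:date": "Tue, 01 Aug 2017 02:35:05 GMT"}'
)

capture_time, server_date = capture_vs_server_date(line)
delta = capture_time - server_date
print("capture is", delta, "after the Date header")
```

A large gap between the two values, as in this entry, can hint that the response was served from a cache rather than generated at capture time.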

The CDXJ Indexer extends the Indexer functionality in warcio and is intended to be easy to extend further.

Download files

Source Distribution

cdxj_indexer-1.4.1.tar.gz (18.4 kB)

Uploaded Source

Built Distribution

cdxj_indexer-1.4.1-py3-none-any.whl (14.1 kB)

Uploaded Python 3

File details

Details for the file cdxj_indexer-1.4.1.tar.gz.

File metadata

  • Download URL: cdxj_indexer-1.4.1.tar.gz
  • Size: 18.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/4.5.0 pkginfo/1.7.0 requests/2.24.0 requests-toolbelt/0.9.1 tqdm/4.59.0 CPython/3.8.2

File hashes

Hashes for cdxj_indexer-1.4.1.tar.gz
Algorithm Hash digest
SHA256 c9aa4291853710613c9776423d7f375d32b236435af68acd486cab6d17d59c5e
MD5 a0ea0526dc747ca8ac85dfc567adcfe3
BLAKE2b-256 6b975767b83520d4480d7ce698bf799bb92ae74f1b0f67adbaf01418285a488e

File details

Details for the file cdxj_indexer-1.4.1-py3-none-any.whl.

File metadata

  • Download URL: cdxj_indexer-1.4.1-py3-none-any.whl
  • Size: 14.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/4.5.0 pkginfo/1.7.0 requests/2.24.0 requests-toolbelt/0.9.1 tqdm/4.59.0 CPython/3.8.2

File hashes

Hashes for cdxj_indexer-1.4.1-py3-none-any.whl
Algorithm Hash digest
SHA256 e161ed2364156ca7f13474f830bb4ae9b57dfa78b5f6597011749f3c1afe9724
MD5 9208a442dc0c0108dbe874685573b061
BLAKE2b-256 ba898e1bae68d0c21f4d42003c8b1dd9b1f9a099a5440d8d590c8ffac9972b35
