Summarize web archive capture index (CDX) files
Project description
CDX Summary
Summarize web archive capture index (CDX) files.
Installation
$ pip install cdxsummary
Alternatively, install from the source.
$ python3 setup.py install
To run the tool as a one-off Docker container, build the image as following, which will place the cdxsummary
executable as the entrypoint script of the container.
$ docker image build -t cdxsummary .
$ docker container run -it --rm cdxsummary
Features
- Summarize local CDX files or remote ones over HTTP
- Handle
gz
andbz2
compression seamlessly - Handle CDX data input to
STDIN
from pipe - Support Internet Archive Petabox web item summarization
- Support Wayback Machine CDX Server API summarization
- Seamless authorization to Internet Archive via the
ia
CLI tool - Human-friendly summary by default, but support summarized or detailed JSON reports
- Self-aware, as the input can be a previously generated JSON report in place of CDX data
- Summary includes:
- An overview of numbers of captures, consecutive unique URIs, unique hosts, accumulated WARC records size, and the first and last datetimes
- A grid of media types and status codes and their respective capture counts
- A grid of path and query segment length and their respective capture counts
- A grid of year and month and their respective capture counts
- Top-N (configurable) hosts and their capture counts
- A random sample of N (configurable) memento URIs for
200 OK
HTML pages
Usage
$ cdxsummary --help
usage: cdxsummary [-h] [-a [QUERY]] [-i] [-j] [-l] [-o [FILE]] [-r] [-s [N]] [-t [N]] [-v] [input]
Summarize web archive capture index (CDX) files.
positional arguments:
input CDX file path/URL (plain/gz/bz2) or an IA item ID to process (reads from the STDIN, if empty or '-')
optional arguments:
-h, --help show this help message and exit
-a [QUERY], --api [QUERY]
CDX API query parameters (default: 'matchType=exact'), treats the last argument as the lookup URL
-i, --item Treat the input argument as a Petabox item identifier instead of a file path
-j, --json Generate summary in JSON format
-l, --load Load JSON report instead of CDX
-o [FILE], --out [FILE]
Write output to the given file (default: STDOUT)
-r, --report Generate non-summarized JSON report
-s [N], --samples [N]
Number of sample memento URLs in summary (default: 10)
-t [N], --tophosts [N]
Number of hosts with maximum captures in summary (default: 10)
-v, --version Show version number
Sample Output
$ cdxsummary sample.cdx.gz
CDX Overview
────────────────────────────────────
Total Captures in CDX 74,460
Consecutive Unique URLs 71,599
Consecutive Unique Hosts 12,133
Total WARC Records Size 10.2 GB
First Memento Date Mar 18 2021
Last Memento Date Mar 18 2021
────────────────────────────────────
MIME Type and Status Code Distribution
───────────────────────────────────────────────
MIME 2XX 3XX 4XX 5XX Other TOTAL
───────────────────────────────────────────────
HTML 25,853 8,419 6,138 177 1 40,588
Image 9,337 8 39 0 0 9,384
CSS 4,027 0 0 0 0 4,027
JavaScript 4,219 0 0 0 0 4,219
JSON 192 1 24 1 0 218
XML 463 9 80 13 0 565
Text 5,729 185 128 5 0 6,047
PDF 3,282 12 1 0 0 3,295
Font 83 0 0 0 0 83
Audio 7 0 0 0 0 7
Video 36 0 0 0 0 36
Other 1,250 4,443 270 28 0 5,991
───────────────────────────────────────────────
TOTAL 54,478 13,077 6,680 224 1 74,460
───────────────────────────────────────────────
Path and Query Segments
─────────────────────────────────────────────
Path Q0 Q1 Q2 Q3 Q4 Other TOTAL
─────────────────────────────────────────────
P0 3,625 296 52 38 19 13 4,043
P1 22,874 1,309 625 151 48 110 25,117
P2 12,790 1,357 624 173 190 84 15,218
P3 9,558 809 231 110 61 113 10,882
P4 5,770 694 150 30 16 126 6,786
Other 8,515 3,375 252 36 94 142 12,414
─────────────────────────────────────────────
TOTAL 63,132 7,840 1,934 538 428 588 74,460
─────────────────────────────────────────────
Year and Month Distribution
───────────────────────────────────────────────────
Year 01 02 03 04 05 06 07 08 09 10 11 12 TOTAL
───────────────────────────────────────────────────
2021 0 0 74,460 0 0 0 0 0 0 0 0 0 74,460
───────────────────────────────────────────────────
Top 10 Out of 12,133 Hosts
───────────────────────────────
Host Captures
───────────────────────────────
cdc.gov 550
facebook.com 508
sec.gov 476
youtube.com 382
fws.gov 374
twitter.com 370
census.gov 317
online.star.bnl.gov 298
biomarkers.nlm.nih.gov 289
cancer.gov 248
───────────────────────────────
OTHERS (12,123 Hosts) 70,648
───────────────────────────────
Random Sample of 10 OK HTML Mementos
────────────────────────────────────────────────
* https://web.archive.org/web/20210318000647/https://www.anl.gov/argonne-impacts
* https://web.archive.org/web/20210318000929/http://www.usarmyjrotc.com/instructor/automation/jcims.php
* https://web.archive.org/web/20210318000243/https://loc.gov/help/
* https://web.archive.org/web/20210318000148/http://gp2.pawg.cap.gov/group-2-squadrons/reading-composite-sqdn-811
* https://web.archive.org/web/20210318001600/https://era.nih.gov/help-tutorials/iedison
* https://web.archive.org/web/20210318000451/https://www.ftc.gov/policy/hearings-competition-consumer-protection
* https://web.archive.org/web/20210318000124/https://asap.gov/
* https://web.archive.org/web/20210318001530/https://espfl.epa.gov/secondary/dataMap
* https://web.archive.org/web/20210318000510/https://roundme.com/embed/ro6VYzBNE5vePdZ3xyph
* https://web.archive.org/web/20210318000510/https://prevention.cancer.gov/news-and-events/videos-and-webinars
$ cdxsummary --json sample.cdx.gz
{
"captures": 74460,
"urls": 71599,
"hosts": 12133,
"bytes": 10237687828,
"first": "20210318000104",
"last": "20210318003748",
"tophosts": {
"cdc.gov": 550,
"facebook.com": 508,
"sec.gov": 476,
"youtube.com": 382,
"fws.gov": 374,
"twitter.com": 370,
"census.gov": 317,
"online.star.bnl.gov": 298,
"biomarkers.nlm.nih.gov": 289,
"cancer.gov": 248
},
"mimestatus": {
"HTML": {
"2XX": 25853,
"3XX": 8419,
"4XX": 6138,
"5XX": 177,
"Other": 1
},
"Image": {
"2XX": 9337,
"3XX": 8,
"4XX": 39,
"5XX": 0,
"Other": 0
},
"CSS": {
"2XX": 4027,
"3XX": 0,
"4XX": 0,
"5XX": 0,
"Other": 0
},
"JavaScript": {
"2XX": 4219,
"3XX": 0,
"4XX": 0,
"5XX": 0,
"Other": 0
},
"JSON": {
"2XX": 192,
"3XX": 1,
"4XX": 24,
"5XX": 1,
"Other": 0
},
"XML": {
"2XX": 463,
"3XX": 9,
"4XX": 80,
"5XX": 13,
"Other": 0
},
"Text": {
"2XX": 5729,
"3XX": 185,
"4XX": 128,
"5XX": 5,
"Other": 0
},
"PDF": {
"2XX": 3282,
"3XX": 12,
"4XX": 1,
"5XX": 0,
"Other": 0
},
"Font": {
"2XX": 83,
"3XX": 0,
"4XX": 0,
"5XX": 0,
"Other": 0
},
"Audio": {
"2XX": 7,
"3XX": 0,
"4XX": 0,
"5XX": 0,
"Other": 0
},
"Video": {
"2XX": 36,
"3XX": 0,
"4XX": 0,
"5XX": 0,
"Other": 0
},
"Revisit": {
"2XX": 0,
"3XX": 0,
"4XX": 0,
"5XX": 0,
"Other": 0
},
"Other": {
"2XX": 1250,
"3XX": 4443,
"4XX": 270,
"5XX": 28,
"Other": 0
}
},
"pathquery": {
"P0": {
"Q0": 3625,
"Q1": 296,
"Q2": 52,
"Q3": 38,
"Q4": 19,
"Other": 13
},
"P1": {
"Q0": 22874,
"Q1": 1309,
"Q2": 625,
"Q3": 151,
"Q4": 48,
"Other": 110
},
"P2": {
"Q0": 12790,
"Q1": 1357,
"Q2": 624,
"Q3": 173,
"Q4": 190,
"Other": 84
},
"P3": {
"Q0": 9558,
"Q1": 809,
"Q2": 231,
"Q3": 110,
"Q4": 61,
"Other": 113
},
"P4": {
"Q0": 5770,
"Q1": 694,
"Q2": 150,
"Q3": 30,
"Q4": 16,
"Other": 126
},
"Other": {
"Q0": 8515,
"Q1": 3375,
"Q2": 252,
"Q3": 36,
"Q4": 94,
"Other": 142
}
},
"yearmonth": {
"2021": {
"01": 0,
"02": 0,
"03": 74460,
"04": 0,
"05": 0,
"06": 0,
"07": 0,
"08": 0,
"09": 0,
"10": 0,
"11": 0,
"12": 0
}
},
"samples": [
[
"20210318000647",
"https://www.anl.gov/argonne-impacts"
],
[
"20210318000929",
"http://www.usarmyjrotc.com/instructor/automation/jcims.php"
],
[
"20210318000243",
"https://loc.gov/help/"
],
[
"20210318000148",
"http://gp2.pawg.cap.gov/group-2-squadrons/reading-composite-sqdn-811"
],
[
"20210318001600",
"https://era.nih.gov/help-tutorials/iedison"
],
[
"20210318000451",
"https://www.ftc.gov/policy/hearings-competition-consumer-protection"
],
[
"20210318000124",
"https://asap.gov/"
],
[
"20210318001530",
"https://espfl.epa.gov/secondary/dataMap"
],
[
"20210318000510",
"https://roundme.com/embed/ro6VYzBNE5vePdZ3xyph"
],
[
"20210318000510",
"https://prevention.cancer.gov/news-and-events/videos-and-webinars"
]
]
}
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
cdxsummary-0.1.1b5.tar.gz
(25.8 kB
view details)
Built Distribution
File details
Details for the file cdxsummary-0.1.1b5.tar.gz
.
File metadata
- Download URL: cdxsummary-0.1.1b5.tar.gz
- Upload date:
- Size: 25.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.5.0 importlib_metadata/4.8.2 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.7
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 95174f2358e3c57e6164501bd3ca9796400dcdb5d800bec3712759b5166e2e71 |
|
MD5 | c522a899134bc508c4999fdfcdfe8b83 |
|
BLAKE2b-256 | 04910e801561efc2a4e921945f98f6effafd5bc6348b12f6535334fd75d2dc34 |
File details
Details for the file cdxsummary-0.1.1b5-py3-none-any.whl
.
File metadata
- Download URL: cdxsummary-0.1.1b5-py3-none-any.whl
- Upload date:
- Size: 24.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.5.0 importlib_metadata/4.8.2 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.7
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | e4c5548f98dd021e8a2b2861bdd41d41bd3b7a1ea83bd11b61ccb19b5381262d |
|
MD5 | 2ea264a52cde46bbfeef66eecb293cbe |
|
BLAKE2b-256 | bcfe73dac95eb23cb2521975ebf0f6d6ee46c11ddc2331eb271db91ee5978897 |