A toolkit for working with CDX indices
cdx_toolkit is a set of tools for working with CDX indices of web crawls and archives, including those at Common Crawl and the Internet Archive’s Wayback Machine.
Common Crawl uses Ilya Kreymer’s pywb to serve the CDX API, which differs in some respects from the Internet Archive’s CDX API. cdx_toolkit hides these differences as best it can. It also knits together the monthly Common Crawl CDX indices into a single, virtual index.
https://github.com/webrecorder/pywb/wiki/CDX-Server-API
https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server
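One way to picture the "single, virtual index" is as a lazy chain over per-month indices. The sketch below is not cdx_toolkit's implementation: the index names, the sample records, and the `query_index` and `virtual_index` helpers are all hypothetical, and in-memory data stands in for real CDX server calls.

```python
from itertools import chain, islice

# Hypothetical in-memory stand-ins for monthly CDX indices (newest first).
# Real indices live behind a CDX server; these names and records are invented.
FAKE_INDICES = {
    "index-2018-03": [
        {"url": "http://example.com/", "timestamp": "20180310000000", "status": "200"},
    ],
    "index-2018-02": [
        {"url": "http://example.com/", "timestamp": "20180215000000", "status": "200"},
        {"url": "http://example.com/a", "timestamp": "20180201000000", "status": "301"},
    ],
    "index-2018-01": [],
}


def query_index(name, url_prefix):
    """Hypothetical per-index lookup: yield records whose URL starts with url_prefix."""
    for record in FAKE_INDICES[name]:
        if record["url"].startswith(url_prefix):
            yield record


def virtual_index(index_names, url_prefix, limit=None):
    """Present several indices as one stream, querying each lazily and in order."""
    stream = chain.from_iterable(
        query_index(name, url_prefix) for name in index_names
    )
    # islice with limit=None yields the whole stream.
    return islice(stream, limit)


records = list(virtual_index(FAKE_INDICES, "http://example.com/", limit=2))
```

Because each index is queried only when the consumer reaches it, a small `limit` can be satisfied without touching the older indices at all.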
Example
import cdx_toolkit

cdx = cdx_toolkit.CDXFetcher(source='cc', cc_duration='90d')
url = 'commoncrawl.org/*'

print(url, 'size estimate', cdx.get_size_estimate(url))

for obj in cdx.items(url, limit=10):
    print(obj)
at the moment will print:
size estimate 6000
http://commoncrawl.org/ 200
http://commoncrawl.org/ 200
http://commoncrawl.org/ 200
http://www.commoncrawl.org/ 301
https://www.commoncrawl.org/ 301
http://www.commoncrawl.org/ 301
http://commoncrawl.org/ 200
http://commoncrawl.org/2011/12/mapreduce-for-the-masses/ 200
http://commoncrawl.org/2012/03/data-2-0-summit/ 200
http://commoncrawl.org/2012/03/twelve-steps-to-running-your-ruby-code-across-five-billion-web-pages/ 200
Command-line tools
The above example can also be done as
$ cdx_size.py 'commoncrawl.org/*' --cc
$ cdx_iter.py 'commoncrawl.org/*' --cc --limit 10 --cc-duration='90d'
or
$ cdx_size.py 'commoncrawl.org/*' --ia
$ cdx_iter.py 'commoncrawl.org/*' --ia --limit 10
cdx_iter can generate jsonl or csv outputs; see
$ cdx_iter.py --help
for details.
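For readers unfamiliar with the two output formats, here is a minimal sketch of what jsonl and csv serialization of index records could look like using only the standard library. It does not reproduce cdx_iter.py's actual output: the sample records, the field names, and the `to_jsonl` and `to_csv` helpers are assumptions made for illustration.

```python
import csv
import io
import json

# Invented sample records; the field names are assumptions for illustration.
SAMPLE = [
    {"url": "http://commoncrawl.org/", "status": "200", "timestamp": "20180101000000"},
    {"url": "http://www.commoncrawl.org/", "status": "301", "timestamp": "20180102000000"},
]


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(r, sort_keys=True) + "\n" for r in records)


def to_csv(records, fields):
    """Serialize records as CSV with a header row, keeping only `fields`."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=fields, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


jsonl_text = to_jsonl(SAMPLE)
csv_text = to_csv(SAMPLE, ["url", "status"])
```

jsonl keeps every field of every record and is easy to stream line by line, while csv needs a fixed column list up front but loads directly into spreadsheets.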
Status
cdx_toolkit has reached the “I hacked this together out of some other code for a hackathon this weekend” stage of development.
License
Apache 2.0
Hashes for cdx_toolkit-0.9.2-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | f9a30e3a965f652f0636f528f037d37fc58d973ad37dc474d288822438433cbc
MD5 | b1737f04b6b4dff0a5b668fc8d339f22
BLAKE2b-256 | 643c1d23cf7be05dff4d9a602fb4b5b6a88b2bf87b543d6d9e303e825edb49b1