cdx-toolkit·PyPI

A toolkit for working with CDX indices

These details have not been verified by PyPI

Project links

Homepage

Development Status
- 2 - Pre-Alpha
Environment
- Console
Intended Audience
- Developers
- Information Technology
License
- OSI Approved :: Apache Software License
Natural Language
- English
Programming Language

Project description

cdx_toolkit is a set of tools for working with CDX indices of web crawls and archives, including those at CommonCrawl and the Internet Archive’s Wayback Machine.

CommonCrawl uses Ilya Kreymer’s pywb to serve the CDX API, which is somewhat different from the Internet Archive’s CDX API. cdx_toolkit hides these differences as best it can. cdx_toolkit also knits together the monthly Common Crawl CDX indices into a single, virtual index.

Installing

$ pip install cdx_toolkit

or clone this repo and use python setup.py install.

Example

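import cdx_toolkit

# Query Common Crawl ('cc'), looking back over the last 90 days of indices.
cdx = cdx_toolkit.CDXFetcher(source='cc', cc_duration='90d')
url = 'commoncrawl.org/*'

print(url, 'size estimate', cdx.get_size_estimate(url))

# Print the original url and HTTP status of the first 10 captures.
for obj in cdx.items(url, limit=10):
    print(obj['url'], obj['status'])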

At the moment, this prints:

commoncrawl.org/* size estimate 6000
http://commoncrawl.org/ 200
http://commoncrawl.org/ 200
http://commoncrawl.org/ 200
http://www.commoncrawl.org/ 301
https://www.commoncrawl.org/ 301
http://www.commoncrawl.org/ 301
http://commoncrawl.org/ 200
http://commoncrawl.org/2011/12/mapreduce-for-the-masses/ 200
http://commoncrawl.org/2012/03/data-2-0-summit/ 200
http://commoncrawl.org/2012/03/twelve-steps-to-running-your-ruby-code-across-five-billion-web-pages/ 200

Command-line tools

The example above can also be run from the command line:

$ cdx_size 'commoncrawl.org/*' --cc
$ cdx_iter 'commoncrawl.org/*' --cc --limit 10 --cc-duration='90d'

$ cdx_size 'commoncrawl.org/*' --ia
$ cdx_iter 'commoncrawl.org/*' --ia --limit 10

cdx_iter can generate jsonl or csv outputs; see

$ cdx_iter --help

for details.

CDX Jargon, Field Names, and such

A capture is a single crawled url, be it a copy of a webpage, a redirect to another page, an error such as 404 (page not found), or a revisit record (page identical to a previous capture).

The url used by cdx_toolkit can be wildcarded in two ways. One way is *.example.com, which in CDX jargon sets matchType=’domain’, and will return captures for blog.example.com, support.example.com, etc. The other, example.com/*, will return captures for any page on example.com.
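
The two wildcard forms can be told apart with a few string checks. The sketch below is only an illustration of the convention, not cdx_toolkit's own code: the helper name classify_pattern is hypothetical, and the 'prefix' and 'exact' labels are assumed from common CDX server usage (the text above only names 'domain').

```python
def classify_pattern(url):
    """Split a wildcarded url into (bare url, assumed matchType).

    Hypothetical helper for illustration; not part of cdx_toolkit.
    """
    if url.startswith('*.'):
        # *.example.com: every host under example.com
        return url[2:], 'domain'
    if url.endswith('/*'):
        # example.com/*: every page whose url starts with example.com/
        return url[:-1], 'prefix'
    # No wildcard: a single url
    return url, 'exact'


for pattern in ('*.example.com', 'example.com/*', 'example.com/about'):
    print(pattern, '->', classify_pattern(pattern))
```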

A timestamp represents year-month-day-time as a string of digits run together. Example: January 5, 2016 at 12:34:56 UTC is 20160105123456. These timestamps are a field in the index, and are also used to specify the dates passed to --from, --to, and --closest on the command line. (Programmatically, use from_ts=, to=, and closest=.)
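
Converting between Python datetimes and these 14-digit strings needs only the standard library. The helper names below are hypothetical; the sketch assumes naive datetimes are already in UTC.

```python
from datetime import datetime, timezone

CDX_FORMAT = '%Y%m%d%H%M%S'  # year, month, day, hour, minute, second


def to_cdx_timestamp(dt):
    """Format a datetime as a 14-digit timestamp (naive values are taken as UTC)."""
    if dt.tzinfo is not None:
        dt = dt.astimezone(timezone.utc)
    return dt.strftime(CDX_FORMAT)


def from_cdx_timestamp(ts):
    """Parse a 14-digit timestamp back into a UTC datetime."""
    return datetime.strptime(ts, CDX_FORMAT).replace(tzinfo=timezone.utc)


moment = datetime(2016, 1, 5, 12, 34, 56, tzinfo=timezone.utc)
print(to_cdx_timestamp(moment))  # 20160105123456
```

Because the digits run from most to least significant, comparing two such strings as text gives the same order as comparing the dates.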

A urlkey is a SURT, which is a munged-up url suitable for deduplication and sorting. This sort order is how CDX indices efficiently support queries like *.example.com. The SURTs for www.example.com and example.com are identical, which is handy when these two hosts actually have identical web content. The original url should be present in all records, if you want to know exactly what it is.
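
To make the idea concrete, here is a heavily simplified SURT-like transform. It is a sketch under stated assumptions, not the canonicalization real CDX servers use: it only lower-cases the host, drops a leading www., reverses the host labels, and appends the path and query. The function name simple_surt is hypothetical.

```python
from urllib.parse import urlsplit


def simple_surt(url):
    """Return a simplified SURT-style key (illustration only)."""
    if '://' not in url:
        url = 'http://' + url  # urlsplit needs a scheme to find the host
    parts = urlsplit(url)
    host = (parts.hostname or '').lower()
    if host.startswith('www.'):
        host = host[4:]
    # Reverse the labels so that hosts in the same domain sort together.
    key = ','.join(reversed(host.split('.'))) + ')' + (parts.path or '/')
    if parts.query:
        key += '?' + parts.query
    return key


urls = ['http://www.example.com/', 'example.com', 'https://blog.example.com/post']
for key in sorted(simple_surt(u) for u in urls):
    print(key)
```

In this sketch both www.example.com and example.com map to the same key, and the key for blog.example.com begins with the same com,example characters, which is what lets a sorted index answer a domain query with one contiguous scan.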

CDX indices support a paged interface for efficient access to large sets of URLs. cdx_toolkit uses this interface under the hood. cdx_toolkit is also polite to CDX servers by being single-threaded and serial. If it’s not fast enough for you, consider downloading Common Crawl’s index files directly.
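
A paged, serial client boils down to requesting one page after another. The sketch below only builds the request urls and does no networking. The endpoint is a placeholder, and the query parameter names (url, output, page) are assumptions modelled on what pywb-style CDX servers commonly accept; check the server you are using before relying on them.

```python
from urllib.parse import urlencode

# Placeholder endpoint for illustration; not a real CDX server.
CDX_ENDPOINT = 'https://cdx.example.org/search'


def page_urls(url_pattern, num_pages):
    """Yield one request url per page, in order (assumed parameter names)."""
    for page in range(num_pages):
        query = urlencode({'url': url_pattern, 'output': 'json', 'page': page})
        yield CDX_ENDPOINT + '?' + query


# A polite client would fetch these one at a time, never in parallel.
for request_url in page_urls('example.com/*', 3):
    print(request_url)
```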

A digest is a SHA-1 checksum of the contents of a capture. Its purpose is to make it easy to tell whether two captures have identical content.
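
The comparison itself is simple to sketch with the standard library. The base32 encoding used here is an assumption (it is a common way to write such digests in web archive records); the text above only specifies SHA-1. The function name content_digest is hypothetical.

```python
import base64
import hashlib


def content_digest(content):
    """Return the SHA-1 of some bytes as a base32 string (encoding is assumed)."""
    return base64.b32encode(hashlib.sha1(content).digest()).decode('ascii')


first = content_digest(b'<html>hello</html>')
second = content_digest(b'<html>hello</html>')
third = content_digest(b'<html>goodbye</html>')

print(first == second)  # True: identical content, identical digest
print(first == third)   # False: different content
```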

Common Crawl publishes a new index each month. cdx_toolkit will automatically start using new ones as they are published. The --cc-duration command-line flag (and cc_duration= constructor argument) specifies how many days back to look. The default is ‘365d’ (365 days).
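
One way to picture what a duration such as ‘90d’ means is to turn it into the earliest timestamp of interest. This is a sketch only: the helper names are hypothetical, and it assumes the value is always a whole number of days followed by d, as in the examples above.

```python
from datetime import datetime, timedelta, timezone


def parse_duration_days(text):
    """Turn a string such as '90d' into a timedelta (days only)."""
    if not text.endswith('d'):
        raise ValueError('expected a value like 90d, got ' + repr(text))
    return timedelta(days=int(text[:-1]))


def earliest_timestamp(duration, now):
    """Return the 14-digit timestamp that lies `duration` before `now`."""
    return (now - parse_duration_days(duration)).strftime('%Y%m%d%H%M%S')


now = datetime.now(timezone.utc)
print(earliest_timestamp('90d', now))
print(earliest_timestamp('365d', now))
```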

CDX implementations do not efficiently support reversed sort orders, so cdx_toolkit results are ordered by ascending SURT and then by ascending timestamp. However, since CC has an individual index for each month, and because most users want more recent results, cdx_toolkit defaults to querying CC’s CDX indices in decreasing month order; within each month, results are still in ascending SURT and ascending timestamp order. If you’d like pure ascending order, set --cc-sort or cc_sort= to ‘ascending’. You may also want to specify --from or from_ts= to set a starting timestamp.

The main problem with this ascending sort order is that it’s a pain to get the most recent N captures: --limit and limit= will return the oldest N captures.
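
A client-side workaround is to read every capture and keep only the N with the largest timestamps. The sketch below works on plain dictionaries and assumes each capture has a 'timestamp' field holding the 14-digit string; the function name most_recent is hypothetical. It still has to read every result, so narrowing the query with a starting timestamp first keeps the cost down.

```python
import heapq


def most_recent(captures, n):
    """Return the n captures with the largest timestamps, newest first."""
    # 14-digit timestamps compare correctly as plain strings.
    return heapq.nlargest(n, captures, key=lambda capture: capture['timestamp'])


captures = [
    {'timestamp': '20160105123456', 'url': 'http://example.com/'},
    {'timestamp': '20170203000000', 'url': 'http://example.com/'},
    {'timestamp': '20180101120000', 'url': 'http://example.com/'},
]

for capture in most_recent(captures, 2):
    print(capture['timestamp'], capture['url'])
```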

TODO

Add a call to download a capture from ia or cc, given a URL and a timestamp.

Status

cdx_toolkit has reached the “I hacked this together out of some other code for a hackathon this weekend” stage of development.

License

Apache 2.0

Release history

0.9.37

Sep 9, 2024

0.9.36

Sep 2, 2024

0.9.35

Feb 17, 2024

0.9.34

Mar 2, 2022

0.9.33

Jul 3, 2021

0.9.31

Mar 31, 2021

0.9.30

Feb 8, 2021

0.9.29

Nov 3, 2020

0.9.28

Oct 8, 2020

0.9.27

Aug 26, 2020

0.9.25

Nov 5, 2019

0.9.24

Jan 17, 2019

0.9.23

Dec 10, 2018

0.9.22

Oct 24, 2018

0.9.21

Oct 22, 2018

0.9.20

Oct 12, 2018

0.9.19

Sep 16, 2018

0.9.18

Sep 16, 2018

0.9.17

Sep 15, 2018

0.9.14

Jul 25, 2018

0.9.13

Jul 1, 2018

0.9.12

Jun 21, 2018

0.9.11

Jun 2, 2018

0.9.10

May 2, 2018

0.9.9

Apr 29, 2018

0.9.8

Mar 23, 2018

0.9.7

Mar 23, 2018

0.9.6

Mar 16, 2018

0.9.5

Mar 12, 2018

0.9.4

Mar 8, 2018

This version

0.9.3

Mar 6, 2018

0.9.2

Mar 3, 2018

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.

Built Distribution

cdx_toolkit-0.9.3-py3-none-any.whl (12.1 kB)

Uploaded Mar 6, 2018 Python 3

File details

Details for the file cdx_toolkit-0.9.3-py3-none-any.whl.

File metadata

Download URL: cdx_toolkit-0.9.3-py3-none-any.whl
Upload date: Mar 6, 2018
Size: 12.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No

File hashes

Hashes for cdx_toolkit-0.9.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`275e6dbeadd57cbc139550b7426dcb50471b9d8fdda341c0dea949419138dee4`
MD5	`4fb81283d27449a763e36cc0e200940b`
BLAKE2b-256	`96a60529a9cac8393d344fb7c27e566111dda92249e5c280d1201e47061789f1`


cdx-toolkit 0.9.3
