A polite and user-friendly downloader for Common Crawl data.

ccdown

A polite downloader for Common Crawl data, written in Rust.


Install

cargo install ccdown
Other methods

From source

git clone https://github.com/4thel00z/ccdown.git
cd ccdown
cargo install --path .

Pre-built binaries

Grab the latest release for your platform from the releases page.

Usage

1. Download the path manifest for a crawl

ccdown download-paths CC-MAIN-2025-08 warc ./paths

Supported subsets: segment, warc, wat, wet, robotstxt, non200responses, cc-index, cc-index-table

Crawl format: CC-MAIN-YYYY-WW or CC-NEWS-YYYY-MM
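
If you script around the CLI, it can help to check arguments before calling it. The following is a minimal sketch that only checks the shapes described above. The names `SUBSETS`, `CRAWL_ID` and `check_args` are hypothetical, and ccdown's own validation may be stricter or looser.

```python
import re

# Subsets as listed in this README.
SUBSETS = {
    "segment", "warc", "wat", "wet", "robotstxt",
    "non200responses", "cc-index", "cc-index-table",
}

# Shape check only: CC-MAIN-YYYY-WW or CC-NEWS-YYYY-MM.
# It does not verify that the crawl actually exists.
CRAWL_ID = re.compile(r"CC-(?:MAIN|NEWS)-\d{4}-\d{2}")


def check_args(crawl: str, subset: str) -> None:
    """Raise ValueError if the crawl ID or subset does not look valid."""
    if not CRAWL_ID.fullmatch(crawl):
        raise ValueError(f"unexpected crawl format: {crawl!r}")
    if subset not in SUBSETS:
        raise ValueError(f"unsupported subset: {subset!r}")


check_args("CC-MAIN-2025-08", "warc")  # passes silently
```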

2. Download the actual data

ccdown download ./paths/warc.paths.gz ./data
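
Common Crawl path manifests such as warc.paths.gz are gzip-compressed text files with one relative path per line. To inspect a manifest before downloading, a sketch like the one below may be useful. The helper names are hypothetical, and the base URL is Common Crawl's public data host; whether ccdown builds its URLs the same way is an assumption.

```python
import gzip

# Common Crawl's public data host (assumed, not taken from ccdown itself).
BASE_URL = "https://data.commoncrawl.org/"


def read_manifest(path_file: str) -> list[str]:
    """Return the non-empty lines of a gzip-compressed path manifest."""
    with gzip.open(path_file, "rt", encoding="utf-8") as fh:
        return [line.strip() for line in fh if line.strip()]


def to_urls(paths: list[str], base: str = BASE_URL) -> list[str]:
    """Prefix each relative path with the base URL."""
    return [base + p.lstrip("/") for p in paths]
```

For example, `len(read_manifest("./paths/warc.paths.gz"))` gives the number of files a download would fetch.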

Options

Flag  Description                                      Default
-t    Number of concurrent downloads                   10
-r    Max retries per file                             1000
-p    Show progress bars                               off
-f    Flat file output (no directory structure)        off
-n    Numbered output (for Ungoliant Pipeline)         off
-s    Abort on unrecoverable errors (401, 403, 404)    off

Example

ccdown download -p -t 5 ./paths/warc.paths.gz ./data

Note: Keep threads at 10 or below. Too many concurrent requests will get you 403'd by the server, and those errors are unrecoverable.
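
When driving the CLI from a script, one way to respect the thread limit is to clamp it while assembling the command. This is a sketch, not part of ccdown: `build_download_command` and `MAX_POLITE_THREADS` are hypothetical names, and it assumes `-r` takes its value the same way `-t` does in the example above.

```python
MAX_POLITE_THREADS = 10  # limit recommended in the note above


def build_download_command(paths_file, dest, threads=10, retries=1000,
                           progress=False, flat=False, numbered=False,
                           strict=False):
    """Build an argument list for `ccdown download` from the documented flags."""
    if threads < 1:
        raise ValueError("threads must be at least 1")
    threads = min(threads, MAX_POLITE_THREADS)  # never exceed the polite limit
    cmd = ["ccdown", "download", "-t", str(threads), "-r", str(retries)]
    if progress:
        cmd.append("-p")
    if flat:
        cmd.append("-f")
    if numbered:
        cmd.append("-n")
    if strict:
        cmd.append("-s")
    cmd += [str(paths_file), str(dest)]
    return cmd


cmd = build_download_command("./paths/warc.paths.gz", "./data",
                             threads=5, progress=True)
# To run it: import subprocess; subprocess.run(cmd, check=True)
```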

Python bindings

Install

pip install ccdown

Usage

from ccdown import Client

client = Client(threads=10, retries=1000, progress=True)

# Download the path manifest for a crawl
client.paths("CC-MAIN-2025-08", "warc").to("./paths")

# Download the actual data
client.download("./paths/warc.paths.gz").to("./data")

# Flat file output (no directory structure)
client.download("./paths/warc.paths.gz").files_only().to("./data")

# Numbered output + strict mode (abort on 401/403/404)
client.download("./paths/warc.paths.gz").numbered().strict().to("./data")

API

Client(threads=10, retries=1000, progress=False) — Create a client with shared config.

client.paths(snapshot, data_type) — Returns a builder. Call .to(dst) to download the path manifest.

client.download(path_file) — Returns a builder with chainable options:

  • .files_only() — flatten directory structure
  • .numbered() — enumerate output files (for Ungoliant)
  • .strict() — abort on unrecoverable HTTP errors
  • .to(dst) — execute the download
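
The chainable style above is a builder pattern: each option call returns a builder, and work only starts at .to(dst). The toy class below illustrates that idea in isolation. It is not ccdown's implementation; `DownloadPlan` and its fields are hypothetical, and its `to` only reports what would be done.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class DownloadPlan:
    """Toy builder: collects options, does nothing until to() is called."""
    path_file: str
    flat: bool = False
    number_files: bool = False
    abort_on_error: bool = False

    def files_only(self):
        return replace(self, flat=True)

    def numbered(self):
        return replace(self, number_files=True)

    def strict(self):
        return replace(self, abort_on_error=True)

    def to(self, dst: str) -> dict:
        # A real client would start downloading here; the toy returns the plan.
        return {
            "src": self.path_file,
            "dst": dst,
            "flat": self.flat,
            "numbered": self.number_files,
            "strict": self.abort_on_error,
        }


plan = DownloadPlan("./paths/warc.paths.gz").numbered().strict()
summary = plan.to("./data")
```

Because each option returns a new object, a partially configured builder can be reused for several destinations without the calls interfering with each other.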

License

MIT OR Apache-2.0
