A polite and user-friendly downloader for Common Crawl data.
A polite downloader for Common Crawl data, written in Rust.
Install
cargo install ccdown
Other methods
From source
git clone https://github.com/4thel00z/ccdown.git
cd ccdown
cargo install --path .
Pre-built binaries
Grab the latest release for your platform from the releases page.
Usage
1. Download the path manifest for a crawl
ccdown download-paths CC-MAIN-2025-08 warc ./paths
Supported subsets: segment warc wat wet robotstxt non200responses cc-index cc-index-table
Crawl format: CC-MAIN-YYYY-WW or CC-NEWS-YYYY-MM
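A quick local check of the arguments can catch typos before any request is made. The sketch below validates a crawl ID and subset name against the formats listed above. The helper name and the regular expression are illustrative assumptions, not part of ccdown, and the tool's own validation may differ.

```python
import re

# Subsets listed in the usage notes above.
SUPPORTED_SUBSETS = {
    "segment", "warc", "wat", "wet", "robotstxt",
    "non200responses", "cc-index", "cc-index-table",
}

# Assumed shape: CC-MAIN-YYYY-WW or CC-NEWS-YYYY-MM with a two-digit week/month.
CRAWL_RE = re.compile(r"CC-(MAIN|NEWS)-\d{4}-\d{2}")


def check_download_paths_args(crawl: str, subset: str) -> None:
    """Raise ValueError if the crawl ID or subset looks malformed (hypothetical helper)."""
    if CRAWL_RE.fullmatch(crawl) is None:
        raise ValueError(f"unexpected crawl format: {crawl!r}")
    if subset not in SUPPORTED_SUBSETS:
        raise ValueError(f"unsupported subset: {subset!r}")
```

Calling `check_download_paths_args("CC-MAIN-2025-08", "warc")` passes silently, while a misspelled subset raises `ValueError`.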
2. Download the actual data
ccdown download ./paths/warc.paths.gz ./data
Options
| Flag | Description | Default |
|---|---|---|
| -t | Number of concurrent downloads | 10 |
| -r | Max retries per file | 1000 |
| -p | Show progress bars | off |
| -f | Flat file output (no directory structure) | off |
| -n | Numbered output (for Ungoliant Pipeline) | off |
| -s | Abort on unrecoverable errors (401, 403, 404) | off |
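To make the relationship between -r and -s concrete, here is a rough sketch of a per-response decision rule. It is not ccdown's implementation: the function name, the "skip" outcome in non-strict mode, and the handling of exhausted retries are all assumptions made for illustration. Only the status codes 401, 403, and 404 and the defaults come from the table above.

```python
# Status codes named in the -s description above.
UNRECOVERABLE = {401, 403, 404}


def next_action(status: int, attempt: int,
                max_retries: int = 1000, strict: bool = False) -> str:
    """Return 'done', 'retry', 'skip', or 'abort' for one response (illustrative policy)."""
    if 200 <= status < 300:
        return "done"
    if status in UNRECOVERABLE:
        # Assumption: strict mode stops the whole run, otherwise the file is skipped.
        return "abort" if strict else "skip"
    # Anything else is treated as transient until the retry budget runs out.
    return "retry" if attempt < max_retries else "skip"
```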
Example
ccdown download -p -t 5 ./paths/warc.paths.gz ./data
Note: Keep threads at 10 or below. Too many concurrent requests will get you
403'd by the server, and those errors are unrecoverable.
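When ccdown is driven from a script, the thread limit from the note above can be enforced before the process starts. The following is a minimal sketch, assuming that -t and -r each take a numeric value and that flags precede the positional arguments, as in the example; `build_download_command` and `MAX_POLITE_THREADS` are hypothetical names, not part of ccdown.

```python
from typing import List

MAX_POLITE_THREADS = 10  # ceiling recommended in the note above


def build_download_command(paths_file: str, dest: str, threads: int = 10,
                           retries: int = 1000, progress: bool = False,
                           strict: bool = False) -> List[str]:
    """Build an argument list for `ccdown download`, refusing impolite thread counts."""
    if not 1 <= threads <= MAX_POLITE_THREADS:
        raise ValueError(f"threads must be between 1 and {MAX_POLITE_THREADS}")
    cmd = ["ccdown", "download", "-t", str(threads), "-r", str(retries)]
    if progress:
        cmd.append("-p")  # show progress bars
    if strict:
        cmd.append("-s")  # abort on 401/403/404
    cmd += [paths_file, dest]
    return cmd
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` to launch the download.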
Python bindings
Install
pip install ccdown
Usage
from ccdown import Client
client = Client(threads=10, retries=1000, progress=True)
# Download the path manifest for a crawl
client.paths("CC-MAIN-2025-08", "warc").to("./paths")
# Download the actual data
client.download("./paths/warc.paths.gz").to("./data")
# Flat file output (no directory structure)
client.download("./paths/warc.paths.gz").files_only().to("./data")
# Numbered output + strict mode (abort on 401/403/404)
client.download("./paths/warc.paths.gz").numbered().strict().to("./data")
API
Client(threads=10, retries=1000, progress=False) — Create a client with shared config.
client.paths(snapshot, data_type) — Returns a builder. Call .to(dst) to download the path manifest.
client.download(path_file) — Returns a builder with chainable options:
- .files_only(): flatten directory structure
- .numbered(): enumerate output files (for Ungoliant)
- .strict(): abort on unrecoverable HTTP errors
- .to(dst): execute the download
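The two steps can be combined into one helper. The sketch below takes any object with the `paths` and `download` methods described above, so it also works with a stub in tests. The function name `fetch_subset` is hypothetical, and the manifest file name `<data_type>.paths.gz` is an assumption generalized from the `warc.paths.gz` example.

```python
def fetch_subset(client, snapshot: str, data_type: str,
                 paths_dir: str = "./paths", data_dir: str = "./data",
                 strict: bool = False) -> str:
    """Download the manifest, then the data; return the manifest path used."""
    # Step 1: fetch the path manifest for this crawl and subset.
    client.paths(snapshot, data_type).to(paths_dir)

    # Assumption: the manifest is saved as <data_type>.paths.gz,
    # matching the warc.paths.gz name used in the examples above.
    manifest = f"{paths_dir}/{data_type}.paths.gz"

    # Step 2: build the download, optionally in strict mode, and execute it.
    job = client.download(manifest)
    if strict:
        job = job.strict()
    job.to(data_dir)
    return manifest
```

With the real bindings this would be called as `fetch_subset(Client(threads=10), "CC-MAIN-2025-08", "warc")`.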
License
MIT OR Apache-2.0