A very fast website mirror script.
cuckooget
__ __
/\ \ /\ \__
___ __ __ ___\ \ \/'\ ___ ___ __ __\ \ ,_\
/'___\/\ \/\ \ /'___\ \ , < / __`\ / __`\ /'_ `\ /'__`\ \ \/
/\ \__/\ \ \_\ \/\ \__/\ \ \\`\ /\ \L\ \/\ \L\ \/\ \L\ \/\ __/\ \ \_
\ \____\\ \____/\ \____\\ \_\ \_\ \____/\ \____/\ \____ \ \____\\ \__\
\/____/ \/___/ \/____/ \/_/\/_/\/___/ \/___/ \/___L\ \/____/ \/__/
/\____/
\_/__/
What
A very fast website mirroring script built on a cuckoo hash table, xxhash, and a DAG. It still has many problems. It saddens me when websites disappear, and I keep thinking about ways to save them even faster.
Websites are our memories.
Let everyone rise up and preserve disappearing historical websites, leaving them for the future.
For all geeks and for those who love the internet. If you find an interesting website, please contact me.
With the -w option, you can also raise the priority of URLs based on the strings you specify. I don't think other website mirroring software has this feature.
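To illustrate the idea behind -w, here is a minimal sketch of a weighted URL queue. It is not cuckooget's actual implementation: the names `url_priority`, `UrlQueue`, and `drain_in_priority_order` are hypothetical, and the rule that a URL gains priority for each weight string it contains is an assumption made for the example.

```python
import heapq
from itertools import count


def url_priority(url, weights):
    """Return a sort key for a URL; lower values are fetched first.

    Assumption for this sketch: every weight string found in the URL
    raises its priority by one step.
    """
    matches = sum(1 for w in weights if w in url)
    return -matches


class UrlQueue:
    """Priority queue of URLs; URLs with equal priority keep insertion order."""

    def __init__(self, weights):
        self._weights = list(weights)
        self._heap = []
        self._order = count()  # tie-breaker so equal priorities stay FIFO

    def push(self, url):
        key = (url_priority(url, self._weights), next(self._order), url)
        heapq.heappush(self._heap, key)

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


def drain_in_priority_order(urls, weights):
    """Push all URLs, then pop them back in fetch order."""
    queue = UrlQueue(weights)
    for url in urls:
        queue.push(url)
    return [queue.pop() for _ in range(len(queue))]
```

With weights such as `["blog", "archive"]`, a URL containing both strings would be fetched before one containing only `blog`, and both before URLs containing neither.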
Hash values are generated by the ultra-fast xxhash, and collisions are avoided by the cuckoo hash table. The table uses xxh32 and xxh64 to produce two different hash values for each key.
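As a rough illustration of how a cuckoo hash table can track visited URLs with two hash functions, here is a minimal sketch. It is not the project's code: `CuckooSet` is a hypothetical name, and to keep the example runnable with only the standard library it uses two salted `hashlib.blake2b` digests (32-bit and 64-bit) as stand-ins for xxh32 and xxh64.

```python
import hashlib


def _h1(key):
    # Stand-in for xxh32: a 32-bit digest (assumption, not the real xxhash).
    digest = hashlib.blake2b(key.encode(), digest_size=4, salt=b"h1").digest()
    return int.from_bytes(digest, "big")


def _h2(key):
    # Stand-in for xxh64: a 64-bit digest with a different salt.
    digest = hashlib.blake2b(key.encode(), digest_size=8, salt=b"h2").digest()
    return int.from_bytes(digest, "big")


class CuckooSet:
    """Set of strings stored in two tables, one per hash function."""

    def __init__(self, capacity=16, max_kicks=32):
        self._capacity = capacity
        self._max_kicks = max_kicks
        self._t1 = [None] * capacity
        self._t2 = [None] * capacity
        self._size = 0

    def __len__(self):
        return self._size

    def __contains__(self, key):
        # A key can only live in one of two slots, so lookup is two probes.
        return (self._t1[_h1(key) % self._capacity] == key
                or self._t2[_h2(key) % self._capacity] == key)

    def add(self, key):
        """Insert key; return False if it was already present."""
        if key in self:
            return False
        self._insert(key)
        self._size += 1
        return True

    def _insert(self, key):
        current = key
        for _ in range(self._max_kicks):
            i = _h1(current) % self._capacity
            if self._t1[i] is None:
                self._t1[i] = current
                return
            # Slot taken: evict the occupant and try its other table.
            current, self._t1[i] = self._t1[i], current
            j = _h2(current) % self._capacity
            if self._t2[j] is None:
                self._t2[j] = current
                return
            current, self._t2[j] = self._t2[j], current
        # Too many evictions: grow both tables, rehash, then place the leftover.
        self._grow()
        self._insert(current)

    def _grow(self):
        old = [k for k in self._t1 + self._t2 if k is not None]
        self._capacity *= 2
        self._t1 = [None] * self._capacity
        self._t2 = [None] * self._capacity
        for k in old:
            self._insert(k)
```

A crawler could call `add(url)` before queuing each discovered link and skip the link when `add` returns False.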
DeepWiki: https://deepwiki.com/haturatu/cuckooget
Install
Experimental: curl-impersonate support
This version uses curl-impersonate to avoid getting blocked by websites. It mimics the TLS/JA3 fingerprint of a real browser.
To use this feature, check out the feat/curl-impersonate branch.
git checkout feat/curl-impersonate
make && make install
Dependencies:
curl https://pyenv.run | bash
pyenv install 3.12.3
python -m pip install maturin
GNU Make
I recommend installing it using GNU Make.
make
make install
For editable install:
make develop
Bash
chmod +x install.sh
./install.sh
Usage
$ ck -h
usage: ck [-h] [-c CONNECTIONS] [-w WEIGHTS [WEIGHTS ...]] [-v EXCLUDE [EXCLUDE ...]]
          [-f]
          url output_dir

Mirrors a website.

positional arguments:
  url                   URL of the website to mirror
  output_dir            Directory to save the mirrored files

options:
  -h, --help            show this help message and exit
  -c CONNECTIONS, --connections CONNECTIONS
                        Number of simultaneous connections (default: 50)
  -w WEIGHTS [WEIGHTS ...], --weights WEIGHTS [WEIGHTS ...]
                        Strings to set URL priorities (can specify multiple separated
                        by spaces)
  -v EXCLUDE [EXCLUDE ...], --exclude EXCLUDE [EXCLUDE ...]
                        URL patterns to exclude (can specify multiple separated by
                        spaces)
  -f, --force           Force re-download even if the download was already completed
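The -c option caps the number of simultaneous connections. One common way to enforce such a cap in Python is an `asyncio.Semaphore`. The sketch below only illustrates that technique and does not describe how cuckooget does it internally: `fetch_all` is a hypothetical name, and the built-in placeholder fetch just sleeps instead of making a network request.

```python
import asyncio


async def fetch_all(urls, connections=50, fetch=None):
    """Run fetch(url) for every URL with at most `connections` in flight.

    Returns (results, peak), where peak is the highest number of
    fetches that were running at the same time.
    """
    semaphore = asyncio.Semaphore(connections)
    in_flight = 0
    peak = 0

    async def placeholder_fetch(url):
        # Stand-in for a real HTTP request (assumption for this sketch).
        await asyncio.sleep(0.001)
        return url

    if fetch is None:
        fetch = placeholder_fetch

    async def worker(url):
        nonlocal in_flight, peak
        async with semaphore:  # waits here once the cap is reached
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await fetch(url)
            finally:
                in_flight -= 1

    # gather preserves the order of the input URLs in its results.
    results = await asyncio.gather(*(worker(u) for u in urls))
    return results, peak
```

Calling `asyncio.run(fetch_all(urls, connections=50))` mirrors the default of 50 shown in the help text above.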
File details
Details for the file cuckooget-0.1.0-cp312-cp312-manylinux_2_34_x86_64.whl.
File metadata
- Download URL: cuckooget-0.1.0-cp312-cp312-manylinux_2_34_x86_64.whl
- Upload date:
- Size: 309.8 kB
- Tags: CPython 3.12, manylinux: glibc 2.34+ x86-64
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ac67da52a998a8728e77adc574521f61995274bc13cccb9fa6a3d92b199c54a7 |
| MD5 | 4e4e442d9350bc2c2565171295435ee7 |
| BLAKE2b-256 | 95046bf211d26e26dc203eabd518089adcd331c0056018a755f9ac8594e4b62c |
Provenance
The following attestation bundles were made for cuckooget-0.1.0-cp312-cp312-manylinux_2_34_x86_64.whl:
Publisher: publish.yml on haturatu/cuckooget
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: cuckooget-0.1.0-cp312-cp312-manylinux_2_34_x86_64.whl
- Subject digest: ac67da52a998a8728e77adc574521f61995274bc13cccb9fa6a3d92b199c54a7
- Sigstore transparency entry: 1084554509
- Sigstore integration time:
- Permalink: haturatu/cuckooget@cecaab982d170cd854b3b2a07a419949596161ff
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/haturatu
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@cecaab982d170cd854b3b2a07a419949596161ff
- Trigger Event: push
-
Statement type: