A very fast website mirror script.

cuckooget

                      __                                      __      
                     /\ \                                    /\ \__   
  ___   __  __    ___\ \ \/'\     ___     ___      __      __\ \ ,_\  
 /'___\/\ \/\ \  /'___\ \ , <    / __`\  / __`\  /'_ `\  /'__`\ \ \/  
/\ \__/\ \ \_\ \/\ \__/\ \ \\`\ /\ \L\ \/\ \L\ \/\ \L\ \/\  __/\ \ \_ 
\ \____\\ \____/\ \____\\ \_\ \_\ \____/\ \____/\ \____ \ \____\\ \__\
 \/____/ \/___/  \/____/ \/_/\/_/\/___/  \/___/  \/___L\ \/____/ \/__/
                                                   /\____/            
                                                   \_/__/             

What

A very fast website mirroring script built on a cuckoo hash table, xxhash, and a DAG. It still has many problems. I feel sad about disappearing websites, and I’m looking for ways to save them even faster.

Websites are our memories.
Let everyone rise up and preserve disappearing historical websites, leaving them for the future.
For all geeks and for those who love the internet. If you find an interesting website, please contact me.

With the -w option, you can give higher priority to URLs that match the strings you specify. I don't think other website mirroring software has this feature.
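
As a rough illustration of URL-based priorities, here is a minimal sketch of a crawl frontier that pops URLs matching more weight strings first. The names `url_priority` and `Frontier` are hypothetical, and scoring by the number of matching strings is an assumption; cuckooget's actual scoring may differ.

```python
import heapq
from itertools import count


def url_priority(url: str, weights: list[str]) -> int:
    """Score a URL by how many weight strings it contains (assumed rule)."""
    return sum(1 for w in weights if w in url)


class Frontier:
    """Queue of pending URLs; higher-scoring URLs are popped first."""

    def __init__(self, weights: list[str]) -> None:
        self.weights = list(weights)
        self._heap: list[tuple[int, int, str]] = []
        self._order = count()  # tie-breaker: keeps insertion order for equal scores

    def push(self, url: str) -> None:
        score = url_priority(url, self.weights)
        # heapq is a min-heap, so the score is negated to pop the highest first.
        heapq.heappush(self._heap, (-score, next(self._order), url))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)
```

With weights such as `blog` and `2009`, a URL containing both strings would be fetched before one containing only `blog`, and both before a URL containing neither.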

Hash values are generated by the ultra-fast xxhash, and collisions are handled by the cuckoo hash table. The table uses xxh32 and xxh64 as its two different hash values.
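
To show how a cuckoo hash table with two hash values can track visited URLs, here is a minimal sketch. It is not cuckooget's implementation: the class name `CuckooSet`, the table sizes, and the eviction limit are assumptions. It uses the `xxhash` Python package when installed and falls back to standard-library stand-ins otherwise.

```python
import hashlib
import zlib

try:
    import xxhash  # third-party package; optional for this sketch

    def h1(data: bytes) -> int:
        return xxhash.xxh32(data).intdigest()

    def h2(data: bytes) -> int:
        return xxhash.xxh64(data).intdigest()
except ImportError:
    # Stand-ins (not xxhash) so the sketch still runs without the package.
    def h1(data: bytes) -> int:
        return zlib.crc32(data)

    def h2(data: bytes) -> int:
        digest = hashlib.blake2b(data, digest_size=8).digest()
        return int.from_bytes(digest, "little")


class CuckooSet:
    """Set of strings stored in two tables, one per hash function."""

    def __init__(self, capacity: int = 16, max_kicks: int = 32) -> None:
        self.capacity = capacity
        self.max_kicks = max_kicks
        self.tables = [[None] * capacity, [None] * capacity]

    def _slots(self, key: str) -> tuple[int, int]:
        data = key.encode("utf-8")
        return h1(data) % self.capacity, h2(data) % self.capacity

    def __contains__(self, key: str) -> bool:
        # A key can only live in one of its two candidate slots.
        i, j = self._slots(key)
        return self.tables[0][i] == key or self.tables[1][j] == key

    def add(self, key: str) -> bool:
        """Insert key; return False if it was already present."""
        if key in self:
            return False
        current = key
        for _ in range(self.max_kicks):
            for t in (0, 1):
                pos = self._slots(current)[t]
                # Place the key and pick up whatever occupied the slot.
                current, self.tables[t][pos] = self.tables[t][pos], current
                if current is None:
                    return True
        # Too many evictions: enlarge the tables, then place the leftover key.
        self._grow()
        self.add(current)
        return True

    def _grow(self) -> None:
        old = [k for table in self.tables for k in table if k is not None]
        self.capacity *= 2
        self.tables = [[None] * self.capacity, [None] * self.capacity]
        for k in old:
            self.add(k)
```

A crawler could call `add(url)` before scheduling a download and skip the URL when it returns `False`. Lookups inspect at most two slots, which is the main appeal of the scheme.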

DeepWiki: https://deepwiki.com/haturatu/cuckooget

Install

Experimental: curl-impersonate support

This version uses curl-impersonate to avoid getting blocked by websites. It mimics the TLS/JA3 fingerprint of a real browser.

To use this feature, check out the feat/curl-impersonate branch, then build and install:

git checkout feat/curl-impersonate
make && make install

Dependencies:

curl https://pyenv.run | bash

pyenv install 3.12.3
python -m pip install maturin

GNU Make

I recommend installing it using GNU Make.

make
make install

For editable install:

make develop

Bash

chmod +x install.sh
./install.sh

Usage

$ ck -h
usage: ck [-h] [-c CONNECTIONS] [-w WEIGHTS [WEIGHTS ...]] [-v EXCLUDE [EXCLUDE ...]]
          [-f]
          url output_dir

Mirrors a website.

positional arguments:
  url                   URL of the website to mirror
  output_dir            Directory to save the mirrored files

options:
  -h, --help            show this help message and exit
  -c CONNECTIONS, --connections CONNECTIONS
                        Number of simultaneous connections (default: 50)
  -w WEIGHTS [WEIGHTS ...], --weights WEIGHTS [WEIGHTS ...]
                        Strings to set URL priorities (can specify multiple separated
                        by spaces)
  -v EXCLUDE [EXCLUDE ...], --exclude EXCLUDE [EXCLUDE ...]
                        URL patterns to exclude (can specify multiple separated by
                        spaces)
  -f, --force           Force re-download even if the download was already completed
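
For readers who want to script around the CLI, the help text above can be approximated with an `argparse` parser. This is a reconstruction from the help output only, not the project's source; `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Approximate the `ck` interface as described by its help text."""
    parser = argparse.ArgumentParser(prog="ck", description="Mirrors a website.")
    parser.add_argument("url", help="URL of the website to mirror")
    parser.add_argument("output_dir", help="Directory to save the mirrored files")
    parser.add_argument(
        "-c", "--connections", type=int, default=50,
        help="Number of simultaneous connections (default: 50)",
    )
    parser.add_argument(
        "-w", "--weights", nargs="+",
        help="Strings to set URL priorities",
    )
    parser.add_argument(
        "-v", "--exclude", nargs="+",
        help="URL patterns to exclude",
    )
    parser.add_argument(
        "-f", "--force", action="store_true",
        help="Force re-download even if the download was already completed",
    )
    return parser


if __name__ == "__main__":
    # Example arguments; the URL and directory are placeholders.
    args = build_parser().parse_args(
        ["https://example.com", "mirror", "-c", "20", "-w", "blog", "news"]
    )
    print(args)
```

In this sketch the positional arguments are given before -w and -v. Because those options accept multiple values, placing them directly before the URL would let them consume the URL and output directory as well.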

Source Distributions

No source distribution files are available for this release.

Built Distribution

cuckooget-0.1.0-cp312-cp312-manylinux_2_34_x86_64.whl (309.8 kB)

CPython 3.12, manylinux: glibc 2.34+, x86-64

Hashes for cuckooget-0.1.0-cp312-cp312-manylinux_2_34_x86_64.whl:

Algorithm    Hash digest
SHA256       ac67da52a998a8728e77adc574521f61995274bc13cccb9fa6a3d92b199c54a7
MD5          4e4e442d9350bc2c2565171295435ee7
BLAKE2b-256  95046bf211d26e26dc203eabd518089adcd331c0056018a755f9ac8594e4b62c

Provenance

The following attestation bundles were made for cuckooget-0.1.0-cp312-cp312-manylinux_2_34_x86_64.whl:

Publisher: publish.yml on haturatu/cuckooget
