Download snapshots from the Wayback Machine
archive wayback downloader
Downloading archived web pages from the Wayback Machine.
The Internet Archive is a useful source of OSINT information. This script is a work in progress for querying and fetching archived web pages.
Installation
Pip
- Install the package
pip install pywaybackup
- Run the script
waybackup -h
Manual
- Clone the repository
git clone https://github.com/bitdruid/python-wayback-machine-downloader.git
- Install
pip install .
- Do this in a virtual env, or pass --break-system-packages to pip
Usage
This script allows you to download content from the Wayback Machine (archive.org). You can use it to download either the latest version or all versions of web page snapshots within a specified range.
Arguments
-h, --help: Show the help message and exit.
-a, --about: Show information about the script and exit.
Required Arguments
-u URL, --url URL: The URL of the web page to download. This argument is required.
Mode Selection (Choose One)
-c, --current: Download the latest version of each file snapshot. You will get a rebuild of the current website with all available files.
-f, --full: Download snapshots of all timestamps. You will get a folder per timestamp with the files available at that time.
-s, --save: Save a page to the Wayback Machine. (beta)
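The difference between the two download modes can be illustrated with a minimal Python sketch. It is not the script's actual implementation: the function names and the (timestamp, url) tuple layout are assumptions made for illustration only. It relies on the fact that full YYYYMMDDhhmmss timestamps sort correctly as plain strings.

```python
from collections import defaultdict


def select_current(snapshots):
    """Keep only the newest timestamp per URL (the idea behind -c / --current)."""
    latest = {}
    for timestamp, url in snapshots:
        # 14-digit timestamps compare correctly as strings
        if url not in latest or timestamp > latest[url]:
            latest[url] = timestamp
    return latest


def group_full(snapshots):
    """Group URLs by timestamp, one group per folder (the idea behind -f / --full)."""
    by_timestamp = defaultdict(list)
    for timestamp, url in snapshots:
        by_timestamp[timestamp].append(url)
    return dict(by_timestamp)


# Hypothetical snapshot list for illustration
snapshots = [
    ("20190101120000", "http://example.com/"),
    ("20200101120000", "http://example.com/"),
    ("20190101120000", "http://example.com/style.css"),
]
current = select_current(snapshots)
full = group_full(snapshots)
```

Here `current` holds one entry per file, while `full` holds one entry per timestamp.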
Optional Arguments
-l, --list: Only print the snapshots available within the specified range. Does not download the snapshots.
-e, --explicit: Only download the explicitly given URL. No wildcard subdomains or paths.
-o OUTPUT, --output OUTPUT: The folder where downloaded files will be saved.
Range Selection:
Specify the range in years, or give a specific timestamp for the start, the end, or both. If you specify the range argument, the start and end arguments will be ignored. Format for timestamps: YYYYMMDDhhmmss. You can give only a year, or increase specificity by extending the timestamp from the left
(year 2019, year+month 201901, year+month+day 20190101, year+month+day+hour 2019010112).
-r RANGE, --range RANGE: Specify the range in years for which to search and download snapshots.
--start: Timestamp to start searching.
--end: Timestamp to end searching.
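To make the partial timestamp format concrete, here is a small Python sketch that pads a partial timestamp to the full 14 digits. It shows one plausible interpretation (a start value is padded to the earliest moment, an end value to the latest); the helper name expand_timestamp is hypothetical and the script itself may treat partial values differently.

```python
import calendar
import re
from datetime import datetime

# Valid prefix lengths of YYYYMMDDhhmmss: year, +month, +day, +hour, +minute, +second
VALID_LENGTHS = {4, 6, 8, 10, 12, 14}


def expand_timestamp(partial, end=False):
    """Pad a partial timestamp to 14 digits (hypothetical helper, not part of waybackup)."""
    if not re.fullmatch(r"[0-9]+", partial) or len(partial) not in VALID_LENGTHS:
        raise ValueError(f"not a valid partial timestamp: {partial!r}")
    if not end:
        # Earliest moment: January 1st, 00:00:00
        full = partial + "00000101000000"[len(partial):]
    elif len(partial) == 4:
        full = partial + "1231235959"
    elif len(partial) == 6:
        year, month = int(partial[:4]), int(partial[4:6])
        if not 1 <= month <= 12:
            raise ValueError(f"invalid month in {partial!r}")
        # The last day of the month depends on month and leap year
        last_day = calendar.monthrange(year, month)[1]
        full = f"{partial}{last_day:02d}235959"
    else:
        # Day is already given: only the time part needs padding
        full = partial + "99991231235959"[len(partial):]
    # Raises ValueError if the result is not a real date and time
    datetime.strptime(full, "%Y%m%d%H%M%S")
    return full
```

For example, expand_timestamp("2019") gives the first second of 2019, and expand_timestamp("201902", end=True) gives the last second of February 2019.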
Additional
--csv: Save a CSV file with the list of snapshots inside the output folder.
--no-redirect: Do not follow redirects of snapshots. Archive.org sometimes redirects to a different snapshot for several reasons. Downloading redirects may lead to timestamp folders which contain some files with a different timestamp. This does not matter if you only want to download the latest version (-c).
--verbosity [LEVEL]: Set the verbosity level: json (print JSON response), progress (show a progress bar), or standard (default).
--retry [RETRY_FAILED]: Retry failed downloads. You can specify the number of retry attempts as an integer.
--worker [AMOUNT]: The number of workers to use for downloading (simultaneous downloads). Default is 1. A safe spot is about 10 workers. Beware: using too many workers will lead to refused connections from the Wayback Machine. Connections are then refused for about 1.5 minutes.
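The general idea behind a worker limit combined with retries can be sketched in Python as follows. This is a generic illustration using the standard library thread pool, not the script's own code; fetch_with_retry, download_all, and the fetch callback are names assumed for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_with_retry(fetch, item, retries):
    """Call fetch(item), retrying up to `retries` extra times on any exception."""
    for attempt in range(retries + 1):
        try:
            return fetch(item)
        except Exception:
            if attempt == retries:
                raise  # out of attempts: let the caller see the error


def download_all(fetch, items, workers=1, retries=0):
    """Run fetch over items with at most `workers` simultaneous calls.

    Returns (ok, failed): item -> result, and item -> final exception.
    """
    ok, failed = {}, {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {
            item: pool.submit(fetch_with_retry, fetch, item, retries)
            for item in items
        }
    # Leaving the with-block waits until all submitted jobs have finished
    for item, future in futures.items():
        error = future.exception()
        if error is None:
            ok[item] = future.result()
        else:
            failed[item] = error
    return ok, failed
```

Keeping the worker count small limits how many requests are in flight at once, which is the point of the warning above about refused connections.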
Examples
Download latest snapshot of all files:
waybackup -u http://example.com -c
Download latest snapshot of all files with retries:
waybackup -u http://example.com -c --retry 3
Download all snapshots sorted per timestamp with a specified range (redirects are followed unless --no-redirect is given):
waybackup -u http://example.com -f -r 5
Download all snapshots sorted per timestamp with a specified range and save to a specified folder with 3 workers:
waybackup -u http://example.com -f -r 5 -o /home/user/Downloads/snapshots --worker 3
List available snapshots per timestamp without downloading:
waybackup -u http://example.com -f -l
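If you want to drive the script from Python, the command lines above can be assembled programmatically. The following is a minimal sketch that only uses flags documented in this README; build_command and run are hypothetical helpers, and it assumes waybackup is installed and on your PATH. The save mode (-s) is left out because it is marked as beta.

```python
import subprocess

# Mode flags as documented under "Mode Selection"
MODES = {"current": "-c", "full": "-f"}


def build_command(url, mode="current", range_years=None, output=None,
                  workers=None, retry=None, list_only=False):
    """Build the argument list for a waybackup call (hypothetical helper)."""
    if mode not in MODES:
        raise ValueError(f"mode must be one of {sorted(MODES)}")
    cmd = ["waybackup", "-u", url, MODES[mode]]
    if list_only:
        cmd.append("-l")
    if range_years is not None:
        cmd += ["-r", str(range_years)]
    if output is not None:
        cmd += ["-o", str(output)]
    if workers is not None:
        cmd += ["--worker", str(workers)]
    if retry is not None:
        cmd += ["--retry", str(retry)]
    return cmd


def run(cmd):
    """Execute the command; raises CalledProcessError on a non-zero exit code."""
    return subprocess.run(cmd, check=True)
```

Calling build_command("http://example.com", mode="full", range_years=5, workers=3) produces the same arguments as typing them on the command line; pass the result to run() to start the download.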
Contributing
I'm always happy to receive feature requests that improve the usability of this script. Feel free to give suggestions and report issues. The project is still far from perfect.
Hashes for pywaybackup-0.8.0-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | d5d34367417977538996a88d9ddc4a456e2d498b20320f34d670753e54c7fcf6
MD5 | 25bcda602ea024a18d1f6cde10494f1e
BLAKE2b-256 | a4dd3a4132440b204639f9ca4eac5ec6f6fc56055a79ca2105fd78d99f9f113a