Download snapshots from the Wayback Machine

archive wayback downloader

Downloading archived web pages from the Wayback Machine.

The Internet Archive is a useful source of OSINT information. This script is a work in progress for querying and fetching archived web pages.

Installation

Pip

  1. Install the package
    pip install pywaybackup
  2. Run the script
    waybackup -h

Manual

  1. Clone the repository
    git clone https://github.com/bitdruid/python-wayback-machine-downloader.git
  2. Install (from inside the cloned directory)
    pip install .
    • Install in a virtual environment, or pass --break-system-packages to pip.

Usage

This script allows you to download content from the Wayback Machine (archive.org). You can use it to download either the latest version or all versions of web page snapshots within a specified range.

Arguments

  • -h, --help: Show the help message and exit.
  • -a, --about: Show information about the script and exit.

Required Arguments

  • -u URL, --url URL: The URL of the web page to download. This argument is required.

Mode Selection (Choose One)

  • -c, --current: Download the latest version of each file snapshot. You will get a rebuild of the current website with all available files.
  • -f, --full: Download snapshots of all timestamps. You will get a folder per timestamp with the files available at that time.
  • -s, --save: Save a page to the Wayback Machine. (beta)
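
The difference between the two download modes can be sketched in a few lines of Python. This is only an illustration of the idea, not the script's actual code: the snapshot records and the function names select_current and group_full are hypothetical.

```python
from collections import defaultdict

# Hypothetical snapshot records: (timestamp, original URL).
snapshots = [
    ("20190101120000", "http://example.com/index.html"),
    ("20200315080000", "http://example.com/index.html"),
    ("20190101120000", "http://example.com/style.css"),
]


def select_current(records):
    """Keep only the newest snapshot of each URL (the idea behind -c)."""
    latest = {}
    for timestamp, url in records:
        # Fixed-width YYYYMMDDhhmmss strings sort chronologically.
        if url not in latest or timestamp > latest[url]:
            latest[url] = timestamp
    return latest


def group_full(records):
    """Group every snapshot under its timestamp (the idea behind -f)."""
    by_timestamp = defaultdict(list)
    for timestamp, url in records:
        by_timestamp[timestamp].append(url)
    return dict(by_timestamp)


print(select_current(snapshots))
print(group_full(snapshots))
```

With -c each file appears once, at its newest timestamp; with -f the same file can appear in several timestamp folders.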

Optional Arguments

  • -l, --list: Only print the snapshots available within the specified range. Does not download the snapshots.

  • -e, --explicit: Only download the explicitly given URL. No wildcard subdomains or paths.

  • -o OUTPUT, --output OUTPUT: The folder where downloaded files will be saved.

  • Range Selection:
    Limit the search either to a range in years or to specific timestamps for the start, the end, or both. If you specify the range argument, the start and end arguments are ignored. Timestamp format: YYYYMMDDhhmmss. You can give only a year, or increase precision by adding further components from left to right.
    (year 2019, year+month 201901, year+month+day 20190101, year+month+day+hour 2019010112)

    • -r RANGE, --range RANGE: Specify the range in years for which to search and download snapshots.
    • --start: Timestamp to start searching.
    • --end: Timestamp to end searching.
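
To make the left-to-right timestamp format concrete, here is a minimal Python sketch that expands a partial timestamp into a full 14-digit one. The helper expand_timestamp is hypothetical and is not part of pywaybackup; how the script itself interprets partial timestamps may differ.

```python
import calendar

# Allowed lengths: YYYY, +MM, +DD, +hh, +mm, +ss
VALID_LENGTHS = {4, 6, 8, 10, 12, 14}


def expand_timestamp(partial, upper=False):
    """Pad a partial YYYYMMDDhhmmss timestamp to 14 digits.

    upper=False gives the earliest matching moment (useful for --start),
    upper=True gives the latest matching moment (useful for --end).
    """
    if not (partial.isascii() and partial.isdigit()):
        raise ValueError("timestamp must contain digits only")
    if len(partial) not in VALID_LENGTHS:
        raise ValueError("timestamp must have 4, 6, 8, 10, 12 or 14 digits")
    year = int(partial[:4])
    if upper:
        # The last day depends on the month (and on leap years).
        month = int(partial[4:6]) if len(partial) >= 6 else 12
        last_day = calendar.monthrange(year, month)[1]
        fill = f"{year:04d}{month:02d}{last_day:02d}235959"
    else:
        fill = f"{year:04d}0101000000"
    # Keep the digits that were given and take the rest from the filler.
    return partial + fill[len(partial):]


print(expand_timestamp("2019"))
print(expand_timestamp("201902", upper=True))
```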

Additional

  • --csv: Save a csv file with the list of snapshots inside the output folder.
  • --no-redirect: Do not follow redirects of snapshots. Archive.org sometimes redirects to a different snapshot for several reasons. Downloading redirects may lead to timestamp-folders which contain some files with a different timestamp. This does not matter if you only want to download the latest version (-c).
  • --verbosity [LEVEL]: Set the verbosity: json (print json response), progress (show progress bar) or standard (default).
  • --retry [RETRY_FAILED]: Retry failed downloads. You can specify the number of retry attempts as an integer.
  • --worker [AMOUNT]: The number of workers to use for downloading (simultaneous downloads). Default is 1. About 10 workers is a safe value. Beware: using too many workers will cause the Wayback Machine to refuse connections for about 1.5 minutes.
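
How --worker and --retry interact can be pictured as a small thread pool in which every download is retried a fixed number of times. The following is a rough sketch under that assumption, not the script's implementation: download_all, download_with_retry and the fetch callable are hypothetical names, and the sketch does not pause between attempts.

```python
from concurrent.futures import ThreadPoolExecutor


def download_with_retry(fetch, url, retries):
    """Call fetch(url); on a connection error try up to `retries` more times."""
    last_error = None
    for _attempt in range(retries + 1):
        try:
            return fetch(url)
        except OSError as error:  # includes ConnectionRefusedError
            last_error = error
    raise last_error


def download_all(fetch, urls, workers=1, retries=0):
    """Run downloads on `workers` threads; map each URL to a result or error."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {
            url: pool.submit(download_with_retry, fetch, url, retries)
            for url in urls
        }
    # Leaving the with-block waits until all submitted downloads have finished.
    results = {}
    for url, future in futures.items():
        try:
            results[url] = future.result()
        except OSError as error:
            results[url] = error
    return results
```

Any callable that takes a URL and returns its content can be passed as fetch. A real client would probably also wait before retrying after a refused connection.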

Examples

Download latest snapshot of all files:
waybackup -u http://example.com -c

Download latest snapshot of all files with retries:
waybackup -u http://example.com -c --retry 3

Download all snapshots sorted per timestamp with a specified range, following redirects (the default, so no extra flag is needed):
waybackup -u http://example.com -f -r 5

Download all snapshots sorted per timestamp with a specified range and save them to a specified folder using 3 workers:
waybackup -u http://example.com -f -r 5 -o /home/user/Downloads/snapshots --worker 3

List available snapshots per timestamp without downloading:
waybackup -u http://example.com -f -l
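
If you want to start the commands above from another Python program, one possible approach is to assemble the argument list and hand it to subprocess. The sketch below uses only flags documented in this README; the helper names build_command and run_waybackup are hypothetical, and it assumes the waybackup command is installed and on your PATH.

```python
import subprocess


def build_command(url, mode="-c", output=None, workers=None, retry=None):
    """Assemble a waybackup argument list from the documented flags."""
    if mode not in ("-c", "-f", "-s"):
        raise ValueError("mode must be -c, -f or -s")
    command = ["waybackup", "-u", url, mode]
    if output is not None:
        command += ["-o", str(output)]
    if workers is not None:
        command += ["--worker", str(workers)]
    if retry is not None:
        command += ["--retry", str(retry)]
    return command


def run_waybackup(*args, **kwargs):
    """Run waybackup and raise CalledProcessError if it exits with an error."""
    return subprocess.run(build_command(*args, **kwargs), check=True)


# Example call (requires the waybackup command to be installed):
# run_waybackup("http://example.com", mode="-f", workers=3)
```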

Contributing

I'm always happy to receive feature requests that improve the usability of this script. Feel free to make suggestions and report issues. The project is still far from perfect.
