
Download snapshots from the Wayback Machine

Project description

python wayback machine downloader

License: MIT

Downloading archived web pages from the Wayback Machine.

The Internet Archive is a useful source of OSINT information. This tool is a work in progress for querying and fetching archived web pages.

This tool allows you to download content from the Wayback Machine (archive.org). You can use it to download either the latest version or all versions of web page snapshots within a specified range.

Installation

Pip

  1. Install the package
    pip install pywaybackup
  2. Run the tool
    waybackup -h

Manual

  1. Clone the repository
    git clone https://github.com/bitdruid/python-wayback-machine-downloader.git
  2. Install
    pip install .
    • in a virtual env, or use --break-system-packages

Usage information - important notes

  • Linux recommended: On Windows machines, the path length is limited. This can only be overcome by editing the registry. Files that exceed the path length will not be downloaded.
  • If you query an explicit file (e.g. login.html or a query string such as ?query=this), the --explicit argument is recommended, because a wildcard query may return an empty result.
  • The tool warns you if your query matches a very large number of snapshots, which could exhaust system memory and cause a crash. Consider splitting the query into smaller jobs by specifying a range, e.g. --start 2023 --end 2024 or --range 1.
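The job splitting suggested in the last note can be sketched in Python. This is only an illustration: the helper name yearly_jobs is hypothetical, and it relies solely on the -u, -f, --start and --end arguments and the YYYYMMDDhhmmss timestamp format documented on this page. It builds the argument lists but does not run them.

```python
from typing import List


def yearly_jobs(url: str, first_year: int, last_year: int) -> List[List[str]]:
    """Build one waybackup argument list per calendar year (both years inclusive).

    Full 14-digit timestamps are used so that the jobs do not overlap.
    """
    jobs = []
    for year in range(first_year, last_year + 1):
        jobs.append([
            "waybackup", "-u", url, "-f",
            "--start", f"{year}0101000000",  # first second of the year
            "--end", f"{year}1231235959",    # last second of the year
        ])
    return jobs


if __name__ == "__main__":
    # Print the commands; each list could also be passed to subprocess.run().
    for job in yearly_jobs("http://example.com", 2020, 2022):
        print(" ".join(job))
```

Running the jobs one after another keeps each CDX result smaller than a single query over the whole period would be.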

Arguments

  • -h, --help: Show the help message and exit.
  • -a, --about: Show information about the tool and exit.

Required

  • -u, --url:
    The URL of the web page to download. This argument is required.

Mode Selection (Choose One)

  • -c, --current:
    Download the latest snapshot of each file. The result is a rebuild of the website from all available files. It does not correspond to any single original state, because old and new versions are mixed.
  • -f, --full:
    Download snapshots of all timestamps. You will get a folder per timestamp with the files available at that time.
  • -s, --save:
    Save a page to the Wayback Machine. (beta)

Optional query parameters

  • -l, --list:
    Only print the snapshots available within the specified range. Does not download the snapshots.

  • -e, --explicit:
    Only download the explicitly given URL. No wildcard subdomains or paths. Use it e.g. to get root-only snapshots. This is recommended for explicit files like login.html or ?query=this.

  • --filetype <filetype>:
    Specify the filetypes to download, separated by commas. Default is all filetypes. Example: --filetype jpg,css,js. A filter produces a filtered cdx-file, so to download all files later you need to query again without the filter. Filetypes are matched as they appear in the snapshot, so if a path contains no explicit html file (common practice), you can't filter for it.

  • --limit <count>:
    Limits the number of snapshots to query from the CDX server. If an existing CDX file is injected (with --cdxinject or --auto), the limit has no effect.

  • Range Selection:
    Specify the range in years, or give a specific timestamp for the start, the end, or both. If you specify the range argument, the start and end arguments are ignored. Format for timestamps: YYYYMMDDhhmmss. You can give only a year, or increase precision by adding further parts of the timestamp from left to right.
    (year 2019, year+month 201901, year+month+day 20190101, year+month+day+hour 2019010112)

    • -r, --range:
      Specify the range in years for which to search and download snapshots.
    • --start:
      Timestamp to start searching.
    • --end:
      Timestamp to end searching.
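As a rough illustration of the timestamp format, the following Python sketch validates a partial timestamp before it is passed to --start or --end. The function name parse_partial_timestamp is hypothetical, and filling missing parts with their lowest value (January, day 1, 00:00:00) is an assumption made for this example; how the tool itself pads partial timestamps is not stated here.

```python
from datetime import datetime

# strptime format for each accepted number of digits (left to right).
_FORMATS = {
    4: "%Y",
    6: "%Y%m",
    8: "%Y%m%d",
    10: "%Y%m%d%H",
    12: "%Y%m%d%H%M",
    14: "%Y%m%d%H%M%S",
}


def parse_partial_timestamp(value: str) -> datetime:
    """Parse a YYYY[MM[DD[hh[mm[ss]]]]] string; raise ValueError if malformed."""
    if not (value.isascii() and value.isdigit()) or len(value) not in _FORMATS:
        raise ValueError(f"not a valid partial timestamp: {value!r}")
    # Parts that are not given default to their lowest value (assumption).
    return datetime.strptime(value, _FORMATS[len(value)])


if __name__ == "__main__":
    for example in ("2019", "201901", "20190101", "2019010112"):
        print(example, "->", parse_partial_timestamp(example))
```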

Behavior manipulation

  • -o, --output:
    The folder where downloaded files will be saved. Defaults to waybackup_snapshots in the current directory.

  • --csv <path>:
    Path defaults to output-dir. Saves a CSV file with the json-response for successful downloads. If --list is set, the CSV contains the CDX list of snapshots. If --current or --full is set, the CSV contains the downloaded files. Named as waybackup_<sanitized_url>.csv.

  • --skip <path>:
    Path defaults to output-dir. Checks for an existing waybackup_<sanitized_url>.csv for URLs to skip downloading. Useful for interrupted downloads. Files are checked by their root-domain, ensuring consistency across queries. This means that if you download http://example.com/subdir1/ and later http://example.com, the second query will skip the first path.

  • --no-redirect:
    Disables following redirects of snapshots. Useful for preventing timestamp-folder mismatches caused by Archive.org redirects.

  • --verbosity <level>:
    Sets verbosity level. Options are json (prints JSON response) or progress (shows progress bar).

  • --log <path>:
    Path defaults to output-dir. Saves a log file with the output of the tool. Named as waybackup_<sanitized_url>.log.

  • --workers <count>:
    Sets the number of simultaneous download workers. Default is 1; up to about 10 is considered safe. Be cautious, as too many workers may lead to refused connections from the Wayback Machine.

  • --retry <attempts>:
    Specifies number of retry attempts for failed downloads.

  • --delay <seconds>:
    Specifies delay between download requests in seconds. Default is no delay (0).

CDX Query Result Handling:

  • --cdxbackup <path>:
    Path defaults to output-dir. Saves the result of CDX query as a file. Useful for later downloading snapshots and overcoming refused connections by CDX server due to too many queries. Named as waybackup_<sanitized_url>.cdx.

  • --cdxinject <path>:
    Path defaults to output-dir. Injects a CDX query file to download snapshots. Ensure the query matches the previous --url for correct folder structure. Named as waybackup_<sanitized_url>.cdx.

Auto:

  • --auto:
    If set, csv, skip and cdxbackup/cdxinject are handled automatically. Keep the files and folders as they are. Otherwise they will not be recognized when restarting a download.

Examples

Download latest snapshot of all files:
waybackup -u http://example.com -c

Download latest snapshot of a specific file:
waybackup -u http://example.com/subdir/file.html -c

Download all snapshots sorted per timestamp with a specified range and do not follow redirects:
waybackup -u http://example.com -f -r 5 --no-redirect

Download all snapshots sorted per timestamp with a specified range and save to a specified folder with 3 workers:
waybackup -u http://example.com -f -r 5 -o /home/user/Downloads/snapshots --workers 3

Download all snapshots from 2020 to 12th of December 2022 with 4 workers, save a csv and show a progress bar:
waybackup -u http://example.com -f --start 2020 --end 20221212 --workers 4 --csv --verbosity progress

Download all snapshots and output a json response:
waybackup -u http://example.com -f --verbosity json

List available snapshots per timestamp without downloading and save a csv file to a specified folder:
waybackup -u http://example.com -f -l --csv /home/user/Downloads

Output path structure

The output path is currently structured as shown below, using this query as an example:
http://example.com/subdir1/subdir2/assets/

For the current version (-c):

  • The requested path will only include all files/folders starting from your query-path.
your/path/waybackup_snapshots/
└── the_root_of_your_query/ (example.com/)
    └── subdir1/
        └── subdir2/
            └── assets/
                ├── image.jpg
                ├── style.css
                ...

For all versions (-f):

  • Currently, a folder named after the root of your query is created. Inside this folder, you will find one folder per timestamp, each containing the path you requested.
your/path/waybackup_snapshots/
└── the_root_of_your_query/ (example.com/)
    ├── yyyymmddhhmmss/
    │   ├── subdir1/
    │   │   └── subdir2/
    │   │       └── assets/
    │   │           ├── image.jpg
    │   │           └── style.css
    ├── yyyymmddhhmmss/
    │   ├── subdir1/
    │   │   └── subdir2/
    │   │       └── assets/
    │   │           ├── image.jpg
    │   │           └── style.css
    ...
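The two layouts above can be expressed as a small Python sketch that predicts where a snapshot of a given URL would end up. The function expected_path is hypothetical and only mirrors the trees shown here. How the tool sanitizes names, handles query strings, or names directory index files is not covered and may differ.

```python
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import urlsplit


def expected_path(output_dir: str, url_origin: str,
                  timestamp: Optional[str] = None) -> PurePosixPath:
    """Return <output_dir>/<host>/[<timestamp>/]<url path>.

    Without a timestamp this follows the -c layout, with one the -f layout.
    """
    parts = urlsplit(url_origin)
    root = PurePosixPath(output_dir) / parts.netloc
    if timestamp is not None:
        root = root / timestamp
    relative = parts.path.lstrip("/")
    return root / relative if relative else root


if __name__ == "__main__":
    url = "http://example.com/subdir1/subdir2/assets/image.jpg"
    print(expected_path("your/path/waybackup_snapshots", url))
    print(expected_path("your/path/waybackup_snapshots", url, "20240101120000"))
```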

JSON Response

For download queries:

[
   {
      "file": "/your/path/waybackup_snapshots/example.com/yyyymmddhhmmss/index.html",
      "id": 1,
      "redirect_timestamp": "yyyymmddhhmmss",
      "redirect_url": "http://web.archive.org/web/yyyymmddhhmmssid_/http://example.com/",
      "response": 200,
      "timestamp": "yyyymmddhhmmss",
      "url_archive": "http://web.archive.org/web/yyyymmddhhmmssid_/http://example.com/",
      "url_origin": "http://example.com/"
   },
    ...
]
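A response of this shape can be post-processed with the standard json module. The sketch below groups downloaded files by snapshot timestamp. The sample records are made-up values that only reuse the keys shown above, and the helper files_by_timestamp is hypothetical.

```python
import json
from collections import defaultdict
from typing import Dict, List

# Illustrative data only: same keys as the response above, invented values.
SAMPLE = """
[
  {"file": "/out/example.com/20200101000000/index.html", "id": 1,
   "redirect_timestamp": "20200101000000",
   "redirect_url": "http://web.archive.org/web/20200101000000id_/http://example.com/",
   "response": 200, "timestamp": "20200101000000",
   "url_archive": "http://web.archive.org/web/20200101000000id_/http://example.com/",
   "url_origin": "http://example.com/"},
  {"file": "/out/example.com/20200101000000/style.css", "id": 2,
   "redirect_timestamp": "20200101000000",
   "redirect_url": "http://web.archive.org/web/20200101000000id_/http://example.com/style.css",
   "response": 200, "timestamp": "20200101000000",
   "url_archive": "http://web.archive.org/web/20200101000000id_/http://example.com/style.css",
   "url_origin": "http://example.com/style.css"},
  {"file": "/out/example.com/20210101000000/index.html", "id": 3,
   "redirect_timestamp": "20210101000000",
   "redirect_url": "http://web.archive.org/web/20210101000000id_/http://example.com/",
   "response": 200, "timestamp": "20210101000000",
   "url_archive": "http://web.archive.org/web/20210101000000id_/http://example.com/",
   "url_origin": "http://example.com/"}
]
"""


def files_by_timestamp(records: List[dict]) -> Dict[str, List[str]]:
    """Map each snapshot timestamp to the list of files saved for it."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for record in records:
        grouped[record["timestamp"]].append(record["file"])
    return dict(grouped)


if __name__ == "__main__":
    for stamp, files in files_by_timestamp(json.loads(SAMPLE)).items():
        print(stamp, len(files), "file(s)")
```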

For list queries:

[
   {
      "digest": "DIGESTOFSNAPSHOT",
      "id": 1,
      "mimetype": "text/html",
      "status": "200",
      "timestamp": "yyyymmddhhmmss",
      "url": "http://example.com/"
   },
   ...
]

CSV Output

The csv contains the json response in a table format.
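If the CSV has a header row whose column names match the JSON keys (an assumption, since the exact CSV layout is not described here), it can be read with Python's csv module. The sketch below collects the origin URLs that were already downloaded, similar in spirit to what --skip is described as doing. It is not the tool's own implementation, and the helper name downloaded_origins is hypothetical.

```python
import csv
import io
from typing import Iterable, Set


def downloaded_origins(csv_lines: Iterable[str]) -> Set[str]:
    """Return the url_origin values of rows whose response column is 200.

    Assumes a header row with at least 'response' and 'url_origin' columns.
    """
    reader = csv.DictReader(csv_lines)
    return {row["url_origin"] for row in reader if row.get("response") == "200"}


if __name__ == "__main__":
    # Invented sample rows; a real file would be opened with
    # open(path, newline="") instead of io.StringIO.
    sample = io.StringIO(
        "file,id,response,timestamp,url_origin\n"
        "/out/example.com/index.html,1,200,20200101000000,http://example.com/\n"
        "/out/example.com/a.css,2,404,20200101000000,http://example.com/a.css\n"
    )
    print(downloaded_origins(sample))
```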

Debugging

Exceptions will be written into waybackup_error.log (each run overwrites the file).

Contributing

Feature requests that improve the usability of this tool are always welcome. Feel free to make suggestions and report issues. The project is still far from perfect.


Download files


Source Distribution

pywaybackup-1.5.7.tar.gz (24.0 kB)

Uploaded Source

Built Distribution

pywaybackup-1.5.7-py3-none-any.whl (24.8 kB)

Uploaded Python 3

File details

Details for the file pywaybackup-1.5.7.tar.gz.

File metadata

  • Download URL: pywaybackup-1.5.7.tar.gz
  • Upload date:
  • Size: 24.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.0 CPython/3.12.5

File hashes

Hashes for pywaybackup-1.5.7.tar.gz
Algorithm Hash digest
SHA256 ad4e1a5168b2046d4ebbbe93a1edb5c9d152e4c30ba4d8cbed0cc1a3bc0d2113
MD5 76e0d69ddcf235521f85af277707e15e
BLAKE2b-256 1e344fb7a0dc0a3815899b85c17c1530aaff8d61ed7893c31364e84711e8037d


File details

Details for the file pywaybackup-1.5.7-py3-none-any.whl.

File metadata

  • Download URL: pywaybackup-1.5.7-py3-none-any.whl
  • Upload date:
  • Size: 24.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.0 CPython/3.12.5

File hashes

Hashes for pywaybackup-1.5.7-py3-none-any.whl
Algorithm Hash digest
SHA256 d7a563c12e259cd6d6d162e6de1905261763be4034aebee51941f4ce8872b913
MD5 85363b6c1de7626511f3c05a4b964bf4
BLAKE2b-256 50de87888e2e847abbaf200bcc0fa22996b6a61c932ed509f801acb6ee8eac48
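A downloaded distribution file can be checked against the SHA256 digests listed above with Python's hashlib. This is a generic sketch: the helper names sha256_of and matches are hypothetical, and the file is assumed to have been downloaded to a local path.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: str, expected_hex: str) -> bool:
    """Compare a file's digest with an expected value, ignoring letter case."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, matches("pywaybackup-1.5.7.tar.gz", "<SHA256 from the list above>") should return True for an intact download of the source distribution.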

