Download snapshots from the Wayback Machine

python wayback machine downloader

Downloading archived web pages from the Wayback Machine.

The Internet Archive is a useful source of OSINT information. This tool is a work in progress for querying and fetching archived web pages.

This tool allows you to download content from the Wayback Machine (archive.org). You can use it to download either the latest version or all versions of web page snapshots within a specified range.

Installation

Pip

Install the package
pip install pywaybackup
Run the tool
waybackup -h

Manual

Clone the repository
git clone https://github.com/bitdruid/python-wayback-machine-downloader.git
Install
pip install .
- in a virtual env or use --break-system-packages

Usage information - important notes

Linux recommended: On Windows machines, the path length is limited. This can only be overcome by editing the registry. Files that exceed the path length will not be downloaded.
If you query an explicit file (e.g. a query-string ?query=this or login.html), the --explicit argument is recommended, as a wildcard query may return an empty result.
The tool uses an SQLite database to handle snapshots. The database persists only while the download is running.
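
To illustrate the path-length note, here is a minimal sketch that flags target paths likely to exceed the classic Windows limit. The 260-character value is the traditional MAX_PATH limit; the helper name is hypothetical, and how the tool itself detects overlong paths is not stated here, so this is an assumption for illustration only.

```python
from pathlib import PureWindowsPath

# Classic Windows MAX_PATH limit (includes the terminating null character).
MAX_PATH = 260


def exceeds_windows_limit(output_dir: str, relative_file: str) -> bool:
    """Return True if the joined path would not fit within MAX_PATH.

    Hypothetical helper: the tool's own check may differ.
    """
    full_path = PureWindowsPath(output_dir) / relative_file
    # One slot is reserved for the terminating null, leaving 259 usable characters.
    return len(str(full_path)) > MAX_PATH - 1


short = exceeds_windows_limit(r"C:\snapshots", "example.com/index.html")
long_ = exceeds_windows_limit(r"C:\snapshots", "example.com/" + "a" * 300 + ".html")
print(short, long_)
```

Running such a check on a planned output folder before starting a large job can show early whether a shorter -o path is worth choosing.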

Arguments

-h, --help: Show the help message and exit.
-a, --about: Show information about the tool and exit.

Required

-u, --url:
The URL of the web page to download. This argument is required.

Mode Selection (Choose One)

-c, --current:
Download the latest version of each file snapshot. The result is a rebuild of the current website from all available files, but not any single original state, because new and old file versions are mixed.
-f, --full:
Download snapshots of all timestamps. You will get a folder per timestamp with the files available at that time.
-s, --save:
Save a page to the Wayback Machine. (beta)

Optional query parameters

-e, --explicit:
Only download the explicitly given URL, without wildcard subdomains or paths. Use this e.g. to get root-only snapshots. Recommended for explicit files like login.html or ?query=this.
--filetype <filetype>:
Specify filetypes to download. Default is all filetypes. Separate multiple filetypes with a comma. Example: --filetype jpg,css,js. A filter results in a filtered cdx-file, so if you want to download all files later, you need to query again without the filter. Filetypes are matched as they appear in the snapshot, so if the path contains no explicit html file (common practice), you can't filter for html.
--limit <count>:
Limits the number of snapshots to query from the CDX server. If an existing CDX file is injected, the limit will have no effect. So you would need to set --keep.
Range Selection:
Specify the range in years, or a specific start timestamp, end timestamp, or both. If you specify the range argument, the start and end arguments will be ignored. Format for timestamps: YYYYMMDDhhmmss. You can give only a year, or increase specificity by extending the timestamp from left to right.
(year 2019, year+month 201901, year+month+day 20190101, year+month+day+hour 2019010112)
- -r, --range:
  Specify the range in years for which to search and download snapshots.
- --start:
  Timestamp to start searching.
- --end:
  Timestamp to end searching.
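
The partial-timestamp rule above can be sketched as a small validator. This is an illustration only: the helper name is hypothetical, it checks the digit layout but not calendar validity, and the tool may validate or expand timestamps differently.

```python
import re

# A timestamp is a left-anchored prefix of YYYYMMDDhhmmss:
# 4 digits for the year, then up to five optional 2-digit groups
# (month, day, hour, minute, second).
_PARTIAL_TS = re.compile(r"\d{4}(\d{2}){0,5}")


def is_valid_partial_timestamp(value: str) -> bool:
    """Return True if value consists of 4, 6, 8, 10, 12 or 14 digits."""
    return _PARTIAL_TS.fullmatch(value) is not None


for candidate in ("2019", "201901", "20190101", "2019010112", "20191", "19"):
    print(candidate, is_valid_partial_timestamp(candidate))
```

The first four candidates match the examples given for --start and --end; the last two are rejected because they stop in the middle of a field.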

Behavior manipulation

-o, --output:
The folder where downloaded files will be saved. Defaults to waybackup_snapshots in the current directory.

--log:
Saves a log file into the output-dir. Named as waybackup_<sanitized_url>.log.
--progress:
Shows a progress bar instead of the default output.
--workers <count>:
Sets the number of simultaneous download workers. Default is 1; up to about 10 is considered safe. Be cautious, as too many workers may lead to refused connections from the Wayback Machine.
--no-redirect:
Disables following redirects of snapshots. Useful for preventing timestamp-folder mismatches caused by Archive.org redirects.
--retry <attempts>:
Specifies number of retry attempts for failed downloads.
--delay <seconds>:
Specifies delay between download requests in seconds. Default is no delay (0).
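
As a rough illustration of how retry and delay settings typically interact, the sketch below retries a failing operation a fixed number of times with a pause between attempts. It is a generic pattern, not the tool's code; `fetch_with_retry`, `flaky_fetch` and the `fetch` callable are hypothetical names.

```python
import time
from typing import Callable, Optional


def fetch_with_retry(fetch: Callable[[], bytes], retries: int = 3, delay: float = 0.0) -> bytes:
    """Call fetch(); on failure, retry up to `retries` more times.

    Sleeps `delay` seconds between attempts (assumed behaviour, for illustration).
    """
    last_error: Optional[Exception] = None
    for attempt in range(retries + 1):
        try:
            return fetch()
        except OSError as error:  # covers connection-type failures
            last_error = error
            if attempt < retries:
                time.sleep(delay)
    raise RuntimeError(f"gave up after {retries + 1} attempts") from last_error


# Stand-in fetch function that fails twice, then succeeds.
calls = {"count": 0}


def flaky_fetch() -> bytes:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("refused")
    return b"snapshot content"


result = fetch_with_retry(flaky_fetch, retries=3, delay=0.0)
print(result, calls["count"])
```

A small delay combined with a modest worker count is the gentler choice when the Wayback Machine starts refusing connections.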

Special:

--reset:
If set, the job will be reset, and any existing cdx, db, csv files will be deleted. This allows you to start the job from scratch without considering previously downloaded data.
--keep:
If set, all files will be kept after the job is finished. This includes the cdx and db file. Without this argument, they will be deleted if the job finished successfully.

Examples

Download the latest snapshot of all available files:
waybackup -u http://example.com -c

Download the latest snapshot of a specific file (e.g., a login page):
waybackup -u http://example.com/login.html -c --explicit

Download all snapshots within the last 5 years and prevent redirects:
waybackup -u http://example.com -f -r 5 --no-redirect

Download all snapshots from a specific range (2020 to December 12, 2022) with 4 workers, and show a progress bar:
waybackup -u http://example.com -f --start 2020 --end 20221212 --workers 4 --progress

Download all snapshots and save the output in a specific folder with 3 workers:
waybackup -u http://example.com -f -r 5 -o /home/user/Downloads/snapshots --workers 3

Download all snapshots but only images and CSS files, filtering for specific filetypes (jpg, css):
waybackup -u http://example.com -f --filetype jpg,css

Download all timestamps but start over and ignore existing progress, log the output, and retry 3 times if any error occurs:
waybackup -u http://example.com -f --log --retry 3 --reset

Download the latest snapshot, follow no redirects but keep the database and cdx-file:
waybackup -u http://example.com -c --no-redirect --keep
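
The filetype filter described under the optional query parameters matches extensions as they appear in the snapshot. The short sketch below shows why a page served from a path without an explicit .html suffix cannot be selected that way. The function is hypothetical and not taken from the tool.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def matches_filetype(url: str, filetypes: list[str]) -> bool:
    """Return True if the URL path ends in one of the given extensions."""
    suffix = PurePosixPath(urlsplit(url).path).suffix.lstrip(".").lower()
    return suffix in {ft.lower() for ft in filetypes}


wanted = ["jpg", "css"]
urls = [
    "http://example.com/assets/image.jpg",
    "http://example.com/assets/style.css",
    "http://example.com/about/",  # an HTML page, but no .html in the path
]
selected = [u for u in urls if matches_filetype(u, wanted)]
print(selected)
```

Note that http://example.com/about/ would not match even a filter of html, because its path carries no extension at all.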

Output path structure

The output path is currently structured as shown below, using the following query as an example:
http://example.com/subdir1/subdir2/assets/

For the current version (-c):

The output will only include files and folders starting from your query path.

your/path/waybackup_snapshots/
└── the_root_of_your_query/ (example.com/)
    └── subdir1/
        └── subdir2/
            └── assets/
                ├── image.jpg
                ├── style.css
                ...

For all versions (-f):

Currently creates a folder named after the root of your query. Inside this folder you will find one folder per timestamp, each containing the path you requested.

your/path/waybackup_snapshots/
└── the_root_of_your_query/ (example.com/)
    ├── yyyymmddhhmmss/
    │   ├── subdir1/
    │   │   └── subdir2/
    │   │       └── assets/
    │   │           ├── image.jpg
    │   │           └── style.css
    ├── yyyymmddhhmmss/
    │   ├── subdir1/
    │   │   └── subdir2/
    │   │       └── assets/
    │   │           ├── image.jpg
    │   │           └── style.css
    ...
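
The two layouts above can be summarised with a small path-building sketch. It mirrors the trees shown here; the function name is hypothetical, and any sanitising the tool applies to URLs is not reproduced.

```python
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import urlsplit


def snapshot_path(output_dir: str, url: str, timestamp: Optional[str] = None) -> PurePosixPath:
    """Build the target path for one file.

    Without a timestamp (-c layout): output/domain/path
    With a timestamp (-f layout):    output/domain/timestamp/path
    """
    parts = urlsplit(url)
    base = PurePosixPath(output_dir) / parts.netloc
    if timestamp is not None:
        base = base / timestamp
    return base / parts.path.lstrip("/")


url = "http://example.com/subdir1/subdir2/assets/image.jpg"
current = snapshot_path("waybackup_snapshots", url)
full = snapshot_path("waybackup_snapshots", url, "20190101120000")
print(current)
print(full)
```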

CSV Output

Each snapshot is stored with the following keys/values. These are stored in an SQLite database while the download is running and saved to a CSV file after the download has finished.

For download queries:

[
   {
      "file": "/your/path/waybackup_snapshots/example.com/yyyymmddhhmmss/index.html",
      "id": 1,
      "redirect_timestamp": "yyyymmddhhmmss",
      "redirect_url": "http://web.archive.org/web/yyyymmddhhmmssid_/http://example.com/",
      "response": 200,
      "timestamp": "yyyymmddhhmmss",
      "url_archive": "http://web.archive.org/web/yyyymmddhhmmssid_/http://example.com/",
      "url_origin": "http://example.com/"
   },
    ...
]
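
The url_archive value in the record above follows a visible pattern: the Wayback Machine base URL, the 14-digit timestamp with an id_ suffix, then the original URL. A minimal sketch of composing it, using a hypothetical helper name:

```python
def build_archive_url(timestamp: str, url_origin: str) -> str:
    """Compose a snapshot URL in the form shown in the record above."""
    if len(timestamp) != 14 or not timestamp.isdigit():
        raise ValueError("timestamp must be 14 digits (YYYYMMDDhhmmss)")
    return f"http://web.archive.org/web/{timestamp}id_/{url_origin}"


archive_url = build_archive_url("20190101120000", "http://example.com/")
print(archive_url)
```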

For list queries:

[
   {
      "digest": "DIGESTOFSNAPSHOT",
      "id": 1,
      "mimetype": "text/html",
      "status": "200",
      "timestamp": "yyyymmddhhmmss",
      "url": "http://example.com/"
   },
   ...
]
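
Once a job has finished, the CSV can be post-processed with the standard library. The sketch below assumes the column names match the keys shown for download queries and uses inline sample data in place of a real file. The actual CSV file name, column order and delimiter are not stated here, so treat them as assumptions.

```python
import csv
import io

# Sample rows standing in for a real CSV file
# (column names taken from the download-query keys above).
sample_csv = io.StringIO(
    "id,timestamp,url_origin,response,file\n"
    "1,20190101120000,http://example.com/,200,/out/example.com/20190101120000/index.html\n"
    "2,20190102120000,http://example.com/a.css,404,\n"
)

rows = list(csv.DictReader(sample_csv))
# Keep only snapshots that were downloaded successfully.
downloaded = [row for row in rows if row["response"] == "200"]
for row in downloaded:
    print(row["timestamp"], row["url_origin"], row["file"])
```

To work on a real result, replace the StringIO object with an opened file handle for the CSV kept in the output folder.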

Debugging

Exceptions will be written into waybackup_error.log (each run overwrites the file).

Known ToDos

Currently there is no logic to handle the case where both an http and an https version of a page are available.

Contributing

I'm always happy to receive feature requests that improve the usability of this tool. Feel free to give suggestions and report issues. The project is still far from perfect.

Project details

This version

2.0.0

Oct 19, 2024
