Download and rewrite archived websites from the Internet Archive Wayback Machine.
Wayback Machine Downloader
A Python port of the original Ruby wayback-machine-downloader, built for users who prefer a Python-based workflow for downloading archived websites from the Internet Archive Wayback Machine.
This tool helps recover, mirror, and archive old websites from Wayback Machine snapshots. It is useful for digital preservation, website recovery, static site restoration, OSINT research, historical web analysis, and rebuilding sites that are no longer online.
This Python version includes a number of extra fixes, improvements, and quality-of-life changes over the original Ruby implementation.
Highlights
- Download the latest capture of each file for a target
- Download every timestamped capture with timestamp-prefixed file IDs
- Build a composite snapshot as of a point in time
- Resume interrupted runs using .cdx.json and .downloaded.txt
- Rewrite archived links for local browsing with --local
- Discover linked page assets with --page-requisites
- Recursively mirror subdomains with --recursive-subdomains
- Keep the implementation dependency-light and fully testable offline
Requirements
- Python 3.10 or newer
Installation
Install the published package from PyPI:
python -m pip install wayback-machine-downloader
The PyPI distribution name is wayback-machine-downloader; the import package
remains wayback_downloader.
Install the package in editable mode while developing:
python -m pip install -e .
Or run it directly from the repository:
python -m wayback_downloader --help
The package also exposes a console script after installation:
wayback-machine-downloader --help
Quick Start
Download the latest version of every file for a site:
python -m wayback_downloader https://example.com
List the planned captures without downloading:
python -m wayback_downloader --list https://example.com
Download all historical captures:
python -m wayback_downloader --all-timestamps https://example.com
Build a composite snapshot as of a specific timestamp:
python -m wayback_downloader --snapshot-at 20130101000000 https://example.com
Rewrite an existing downloaded tree for local browsing:
python -m wayback_downloader --local-only ./websites/example.com
Output Layout
By default, downloads are written under:
./websites/<backup-name>/
<backup-name> is usually the target host. For example:
websites/example.com/
The downloader also uses two state files in the output directory:
- .cdx.json: Cached snapshot listing fetched from the CDX API.
- .downloaded.txt: Logical file IDs that have been written successfully.
These files let later runs resume instead of starting from scratch. Use
--reset to delete them before a run, or --keep to preserve them after a
successful run.
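A minimal sketch of how resuming from a state file can work, assuming .downloaded.txt holds one logical file ID per line. The on-disk format is not specified here, and the helper names load_done, mark_done, and pending are hypothetical, not part of the package.

```python
import tempfile
from pathlib import Path


def load_done(state_file: Path) -> set:
    """Return the file IDs already recorded, or an empty set on a first run."""
    if not state_file.exists():
        return set()
    lines = state_file.read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def mark_done(state_file: Path, file_id: str) -> None:
    """Append one file ID after its download has been written successfully."""
    with state_file.open("a", encoding="utf-8") as fh:
        fh.write(file_id + "\n")


def pending(file_ids: list, state_file: Path) -> list:
    """Keep only the IDs that still need downloading, preserving order."""
    done = load_done(state_file)
    return [fid for fid in file_ids if fid not in done]


# Demo in a throwaway directory: a later pass skips what an earlier one recorded.
with tempfile.TemporaryDirectory() as tmp:
    state = Path(tmp) / ".downloaded.txt"
    mark_done(state, "index.html")
    print(pending(["index.html", "css/app.css"], state))
```

Appending one line per completed file means an interrupted run loses at most the file that was in flight.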
Common Workflows
Download only one exact URL:
python -m wayback_downloader --exact-url https://example.com/index.html
Limit by timestamp range:
python -m wayback_downloader --from 20060101 --to 20071231 https://example.com
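For context, the public Wayback CDX API accepts from and to parameters holding timestamps of 1 to 14 digits, which is presumably where --from and --to end up. The sketch below builds such a query URL with the standard library. The function name build_cdx_url and the chosen field list are illustrative assumptions, not this package's internals.

```python
from typing import Optional
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_url(target: str,
                  from_ts: Optional[str] = None,
                  to_ts: Optional[str] = None) -> str:
    """Build a prefix-match CDX query, optionally bounded by a timestamp range."""
    params = {
        "url": target,
        "matchType": "prefix",       # match every capture under the target
        "output": "json",
        "fl": "timestamp,original",  # assumed field selection for this sketch
    }
    # Only send the range bounds that were actually requested.
    if from_ts:
        params["from"] = from_ts
    if to_ts:
        params["to"] = to_ts
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


print(build_cdx_url("example.com/", "20060101", "20071231"))
```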
Filter URLs:
python -m wayback_downloader --only "/\\.(css|js|png)$/i" https://example.com
python -m wayback_downloader --exclude admin https://example.com
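The two commands above use different filter forms: a slash-delimited pattern with an i flag, and a bare word. One plausible reading, following the convention of the original Ruby tool, is that /.../ values are treated as regular expressions and anything else as a plain substring. The sketch below implements that reading. make_filter is a hypothetical helper and the real option parser may behave differently.

```python
import re
from typing import Callable

# "/pattern/" or "/pattern/i"; only the i flag is handled in this sketch.
_REGEX_LITERAL = re.compile(r"^/(.*)/(i?)$", re.DOTALL)


def make_filter(spec: str) -> Callable[[str], bool]:
    """Turn a filter string into a predicate over URLs."""
    match = _REGEX_LITERAL.match(spec)
    if match:
        flags = re.IGNORECASE if match.group(2) == "i" else 0
        pattern = re.compile(match.group(1), flags)
        return lambda url: pattern.search(url) is not None
    # Anything that is not a /regex/ literal is matched as a substring.
    return lambda url: spec in url


only_assets = make_filter(r"/\.(css|js|png)$/i")
exclude_admin = make_filter("admin")

urls = ["http://example.com/STYLE.CSS", "http://example.com/admin/login"]
print([u for u in urls if only_assets(u)])
print([u for u in urls if not exclude_admin(u)])
```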
Download pages and immediately queue linked assets:
python -m wayback_downloader --page-requisites --local https://example.com
Recursively mirror discovered subdomains:
python -m wayback_downloader --recursive-subdomains --subdomain-depth 2 https://example.com
Snapshot Selection Modes
The downloader supports three selection strategies:
- Latest per logical file: default behavior. For each logical file ID, the newest capture wins.
- All timestamps: enabled with --all-timestamps. The timestamp becomes part of the logical file ID so every capture is kept.
- Composite snapshot: enabled with --snapshot-at. For each file, choose the newest capture at or before the requested timestamp.
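The three strategies can be sketched over a plain list of captures. This is an illustration of the selection rules only: the Capture type, the function names, and the timestamp/file_id key format are assumptions made for the example.

```python
from typing import Dict, List, NamedTuple


class Capture(NamedTuple):
    timestamp: str  # 14-digit YYYYMMDDhhmmss, so string order equals time order
    file_id: str


def latest_per_file(captures: List[Capture]) -> Dict[str, Capture]:
    """Default mode: for each file ID, the newest capture wins."""
    chosen: Dict[str, Capture] = {}
    for cap in captures:
        best = chosen.get(cap.file_id)
        if best is None or cap.timestamp > best.timestamp:
            chosen[cap.file_id] = cap
    return chosen


def all_timestamps(captures: List[Capture]) -> Dict[str, Capture]:
    """Keep every capture by making the timestamp part of the key."""
    return {f"{cap.timestamp}/{cap.file_id}": cap for cap in captures}


def composite_at(captures: List[Capture], cutoff: str) -> Dict[str, Capture]:
    """Newest capture per file at or before the cutoff timestamp."""
    eligible = [cap for cap in captures if cap.timestamp <= cutoff]
    return latest_per_file(eligible)


captures = [
    Capture("20120101000000", "index.html"),
    Capture("20140101000000", "index.html"),
    Capture("20130601000000", "style.css"),
]
print(composite_at(captures, "20130101000000"))
```

In the composite case a file whose first capture is later than the cutoff is simply absent from the result, as style.css is here.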
URL and Filename Behavior
Several implementation details are worth knowing because they influence the output tree:
- Host and trailing-slash directory targets are normalized into CDX prefix queries unless --exact-url is used.
- Query strings are folded into filenames using a short digest, such as app__q12ab34cd56ef.css.
- Directory-like captures are stored as .../index.html.
- If a file blocks a needed directory later in the run, it is moved to index.html so both captures can coexist.
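A rough sketch of the query-string and directory rules above. The shape of the result follows the app__q12ab34cd56ef.css example, but the digest algorithm (SHA-1 truncated to 12 hex characters) and the function name local_path are assumptions. The package may derive its digest differently.

```python
import hashlib
import posixpath
from urllib.parse import urlsplit


def local_path(url: str) -> str:
    """Map an archived URL to a relative path inside the output directory."""
    parts = urlsplit(url)
    path = parts.path.lstrip("/")

    # Directory-like URLs (empty path or trailing slash) become .../index.html.
    if path == "" or path.endswith("/"):
        path += "index.html"

    # Fold the query string into the filename so variants do not collide.
    if parts.query:
        digest = hashlib.sha1(parts.query.encode("utf-8")).hexdigest()[:12]
        stem, ext = posixpath.splitext(path)
        path = f"{stem}__q{digest}{ext}"

    return path


for url in ("http://example.com/blog/", "http://example.com/app.css?v=3"):
    print(url, "->", local_path(url))
```

Placing the digest before the extension keeps the file type recognizable to browsers and editors.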
Local Rewriting
The --local option rewrites archived absolute URLs into local relative
references after files are saved. It handles:
- Wayback-hosted rewritten URLs
- direct absolute HTTP/HTTPS links
- HTML attributes such as href, src, and action
- CSS url(...) references
- JavaScript string literals containing absolute URLs
--local-only performs only the rewrite phase on an existing directory and
does not contact the archive.
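To illustrate the kind of transformation involved, the sketch below strips the web.archive.org/web/<timestamp>/ wrapper from a link and turns a same-site absolute URL into a path relative to the page that contains it. The helper names unwrap_wayback and to_relative are hypothetical, and the real rewriter covers more cases than this.

```python
import posixpath
import re
from urllib.parse import urlsplit

# Matches e.g. https://web.archive.org/web/20130101000000im_/ before the
# original URL; the optional suffix (im_, cs_, js_, ...) is a Wayback modifier.
_WAYBACK_PREFIX = re.compile(
    r"^https?://web\.archive\.org/web/\d{1,14}[a-z_]*/", re.IGNORECASE
)


def unwrap_wayback(url: str) -> str:
    """Remove the Wayback wrapper, leaving the originally archived URL."""
    return _WAYBACK_PREFIX.sub("", url, count=1)


def to_relative(link: str, page_path: str, site_host: str) -> str:
    """Rewrite a same-site absolute link relative to the page that holds it.

    Links to other hosts are returned unwrapped but otherwise unchanged.
    Query strings are ignored here for brevity.
    """
    original = unwrap_wayback(link)
    parts = urlsplit(original)
    if parts.scheme not in ("http", "https") or parts.hostname != site_host:
        return original
    target = parts.path.lstrip("/")
    if target == "" or target.endswith("/"):
        target += "index.html"
    start = posixpath.dirname(page_path) or "."
    return posixpath.relpath(target, start=start)


link = "https://web.archive.org/web/20130101000000/http://example.com/css/app.css"
print(to_relative(link, "blog/post.html", "example.com"))
```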
Publishing
Build and validate the distribution archives:
python -m pip install --upgrade build twine
python -m build
python -m twine check dist/*
Upload to TestPyPI first:
python -m twine upload --repository testpypi dist/*
Upload to PyPI:
python -m twine upload dist/*
Use an API token when Twine prompts for credentials:
- username: __token__
- password: your pypi-... token
Testing
Run the test suite:
python -B -m unittest discover -s tests -t .
Compile modules as a quick import sanity check:
python -m compileall wayback_downloader tests
The tests use fake transports and temporary directories, so they do not depend
on live access to web.archive.org.
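The general pattern of testing against a fake transport looks roughly like the sketch below. FakeTransport and fetch_all are hypothetical stand-ins written for this example. They are not classes or functions from the package's test suite.

```python
import unittest


class FakeTransport:
    """Stand-in for a network client: returns canned bodies keyed by URL."""

    def __init__(self, responses):
        self.responses = responses
        self.requested = []  # records every URL asked for, in order

    def get(self, url):
        self.requested.append(url)
        return self.responses[url]


def fetch_all(transport, urls):
    """Hypothetical code under test: fetch each URL via the injected transport."""
    return {url: transport.get(url) for url in urls}


class FetchAllTest(unittest.TestCase):
    def test_uses_injected_transport(self):
        fake = FakeTransport({"http://example.com/": b"<html></html>"})
        result = fetch_all(fake, ["http://example.com/"])
        self.assertEqual(result["http://example.com/"], b"<html></html>")
        self.assertEqual(fake.requested, ["http://example.com/"])
```

Because the transport is passed in rather than created inside the function, a test like this runs without any network access and can be discovered by python -m unittest.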
Download files
Source Distribution
Built Distribution
File details
Details for the file wayback_machine_downloader-0.1.0.tar.gz.
File metadata
- Download URL: wayback_machine_downloader-0.1.0.tar.gz
- Upload date:
- Size: 42.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | fc8a62c503d798f815a2c9614fb3f4cfe6f8bc8887dd5088438b11a52dc3c555 |
| MD5 | 4cbf8c11784dfda91b03d14f98e53eb0 |
| BLAKE2b-256 | 94fad8b70e3ea74c4f8dcb52b62686190d70eb472a371a9d1ad62f0a94a3357a |
File details
Details for the file wayback_machine_downloader-0.1.0-py3-none-any.whl.
File metadata
- Download URL: wayback_machine_downloader-0.1.0-py3-none-any.whl
- Upload date:
- Size: 36.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c8fb8ce2eb6158b429b6a53238d4876b431d373de0d357dcfb3a6904a1d0f3eb |
| MD5 | 429c3c136bf689cc44fe3fdc95c876d0 |
| BLAKE2b-256 | 2f7562b90ff19e04b5b63c69e58de5c3db89b7c1c735ca6a0aca78f0fec80f98 |