Download and rewrite archived websites from the Internet Archive Wayback Machine.

Wayback Machine Downloader

A Python port of the original Ruby wayback-machine-downloader, built for users who prefer a Python-based workflow for downloading archived websites from the Internet Archive Wayback Machine.

This tool helps recover, mirror, and archive old websites from Wayback Machine snapshots. It is useful for digital preservation, website recovery, static site restoration, OSINT research, historical web analysis, and rebuilding sites that are no longer online.

This Python version includes a number of extra fixes, improvements, and quality-of-life changes over the original Ruby implementation.

Highlights

Download the latest capture of each file for a target
Download every timestamped capture with timestamp-prefixed file IDs
Build a composite snapshot as of a point in time
Resume interrupted runs using .cdx.json and .downloaded.txt
Rewrite archived links for local browsing with --local
Discover linked page assets with --page-requisites
Recursively mirror subdomains with --recursive-subdomains
Dependency-light implementation that is fully testable offline

Requirements

Python 3.10 or newer

Installation

Install the published package from PyPI:

python -m pip install wayback-machine-downloader

The PyPI distribution name is wayback-machine-downloader; the import package remains wayback_downloader.

Install the package in editable mode while developing:

python -m pip install -e .

Or run it directly from the repository:

python -m wayback_downloader --help

The package also exposes a console script after installation:

wayback-machine-downloader --help

Quick Start

Download the latest version of every file for a site:

python -m wayback_downloader https://example.com

List the planned captures without downloading:

python -m wayback_downloader --list https://example.com

Download all historical captures:

python -m wayback_downloader --all-timestamps https://example.com

Build a composite snapshot as of a specific timestamp:

python -m wayback_downloader --snapshot-at 20130101000000 https://example.com
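Wayback timestamps are 14-digit UTC strings in YYYYMMDDhhmmss form. As a minimal sketch, a value for --snapshot-at could be produced from a datetime like this; the helper name to_wayback_timestamp is illustrative and not part of the package, and naive datetimes are assumed to already be in UTC:

```python
from datetime import datetime, timezone


def to_wayback_timestamp(moment: datetime) -> str:
    """Format a datetime as a 14-digit Wayback timestamp."""
    # Aware datetimes are converted to UTC first; naive ones are
    # assumed (for this sketch) to be UTC already.
    if moment.tzinfo is not None:
        moment = moment.astimezone(timezone.utc)
    return moment.strftime("%Y%m%d%H%M%S")


print(to_wayback_timestamp(datetime(2013, 1, 1, tzinfo=timezone.utc)))
```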

Rewrite an existing downloaded tree for local browsing:

python -m wayback_downloader --local-only ./websites/example.com

Output Layout

By default, downloads are written under:

./websites/<backup-name>/

<backup-name> is usually the target host. For example:

websites/example.com/

The downloader also uses two state files in the output directory:

.cdx.json: Cached snapshot listing fetched from the CDX API.
.downloaded.txt: Logical file IDs that have been written successfully.

These files let later runs resume instead of starting from scratch. Use --reset to delete them before a run, or --keep to preserve them after a successful run.
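The resume idea can be sketched as follows. This is not the package's own code: the function names are hypothetical, and the sketch assumes .downloaded.txt holds one logical file ID per line, which the text above does not specify.

```python
from pathlib import Path


def load_completed_ids(output_dir: Path) -> set:
    """Return the file IDs recorded in .downloaded.txt, or an empty set."""
    state_file = output_dir / ".downloaded.txt"
    if not state_file.exists():
        return set()
    # Assumption: one logical file ID per line; blank lines are ignored.
    lines = state_file.read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def pending(planned_ids: list, output_dir: Path) -> list:
    """Keep only the planned IDs that have not been written yet."""
    done = load_completed_ids(output_dir)
    return [file_id for file_id in planned_ids if file_id not in done]
```

A resumed run would then only fetch what pending(...) returns, in the planned order.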

Common Workflows

Download only one exact URL:

python -m wayback_downloader --exact-url https://example.com/index.html

Limit by timestamp range:

python -m wayback_downloader --from 20060101 --to 20071231 https://example.com
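For orientation, a timestamp range like the one above maps naturally onto the from and to parameters of the Wayback CDX API. The sketch below builds such a query URL; build_cdx_url is a hypothetical helper, and the exact query the downloader sends is an assumption here, not something this README documents.

```python
from typing import Optional
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_url(target: str,
                  start: Optional[str] = None,
                  end: Optional[str] = None,
                  exact: bool = False) -> str:
    """Build a CDX listing URL for a target, optionally bounded in time."""
    params = {"url": target, "output": "json"}
    if not exact:
        # Prefix matching lists every capture under the target.
        params["matchType"] = "prefix"
    if start:
        params["from"] = start
    if end:
        params["to"] = end
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


print(build_cdx_url("example.com/", start="20060101", end="20071231"))
```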

Filter URLs:

python -m wayback_downloader --only "/\\.(css|js|png)$/i" https://example.com
python -m wayback_downloader --exclude admin https://example.com
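The two filter forms above suggest a simple rule: a value written as /pattern/flags is a regular expression, and anything else is a plain substring. The sketch below implements that rule as an assumption about the behavior; compile_filter is a hypothetical name, and only the i flag is handled.

```python
import re


def compile_filter(spec: str):
    """Return a predicate that tests a URL against a filter string."""
    # Assumption: "/pattern/flags" means regex, anything else is a substring.
    match = re.fullmatch(r"/(.*)/([a-z]*)", spec, flags=re.DOTALL)
    if match:
        pattern, flag_text = match.groups()
        flags = re.IGNORECASE if "i" in flag_text else 0
        regex = re.compile(pattern, flags)
        return lambda url: regex.search(url) is not None
    return lambda url: spec in url


only_assets = compile_filter(r"/\.(css|js|png)$/i")
exclude_admin = compile_filter("admin")
print(only_assets("https://example.com/static/SITE.CSS"))
print(exclude_admin("https://example.com/admin/login"))
```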

Download pages and immediately queue linked assets:

python -m wayback_downloader --page-requisites --local https://example.com

Recursively mirror discovered subdomains:

python -m wayback_downloader --recursive-subdomains --subdomain-depth 2 https://example.com

Snapshot Selection Modes

The downloader supports three selection strategies:

Latest per logical file: Default behavior. For each logical file ID, the newest capture wins.
All timestamps: Enabled with --all-timestamps. The timestamp becomes part of the logical file ID so every capture is kept.
Composite snapshot: Enabled with --snapshot-at. For each file, choose the newest capture at or before the requested timestamp.
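The three strategies can be illustrated with a short sketch over an in-memory list of captures. The Capture type, the function names, and the "timestamp/file_id" separator are assumptions made for illustration; they are not taken from the package.

```python
from typing import NamedTuple


class Capture(NamedTuple):
    timestamp: str  # 14-digit Wayback timestamp
    file_id: str    # logical file ID, e.g. a path within the site


def latest_per_file(captures, cutoff=None):
    """Newest capture per file ID, optionally at or before a cutoff."""
    chosen = {}
    for cap in captures:
        # Equal-length digit strings compare in chronological order.
        if cutoff is not None and cap.timestamp > cutoff:
            continue
        best = chosen.get(cap.file_id)
        if best is None or cap.timestamp > best.timestamp:
            chosen[cap.file_id] = cap
    return chosen


def all_timestamps(captures):
    """Keep every capture by making the timestamp part of the ID."""
    return {f"{cap.timestamp}/{cap.file_id}": cap for cap in captures}
```

Calling latest_per_file(captures) corresponds to the default mode, latest_per_file(captures, cutoff="20130101000000") to a composite snapshot, and all_timestamps(captures) to --all-timestamps.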

URL and Filename Behavior

Several implementation details are worth knowing because they influence the output tree:

Host and trailing-slash directory targets are normalized into CDX prefix queries unless --exact-url is used.
Query strings are folded into filenames using a short digest, such as app__q12ab34cd56ef.css.
Directory-like captures are stored as .../index.html.
If an already-saved file occupies a path that a later capture needs as a directory, the file is moved to index.html inside that directory so both captures can coexist.
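A rough sketch of the filename rules above follows. The helper local_path_for is hypothetical, and the digest choice (SHA-1 truncated to 12 hex characters) is an assumption picked to match the shape of the example name; the real digest may differ.

```python
import hashlib
from urllib.parse import urlsplit


def local_path_for(url: str) -> str:
    """Map an archived URL to a relative path in the output tree."""
    parts = urlsplit(url)
    path = parts.path.lstrip("/")
    # Directory-like captures are stored as .../index.html.
    if path == "" or path.endswith("/"):
        path += "index.html"
    if parts.query:
        # Assumption: SHA-1 of the query string, first 12 hex characters.
        digest = hashlib.sha1(parts.query.encode("utf-8")).hexdigest()[:12]
        stem, dot, ext = path.rpartition(".")
        if dot and "/" not in ext:
            # Insert the digest before the file extension.
            path = f"{stem}__q{digest}.{ext}"
        else:
            path = f"{path}__q{digest}"
    return path


print(local_path_for("https://example.com/blog/"))
print(local_path_for("https://example.com/app.css?v=3"))
```

Folding the query into the name keeps app.css?v=3 and app.css?v=4 from overwriting each other while preserving the extension.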

Local Rewriting

The --local option rewrites archived absolute URLs into local relative references after files are saved. It handles:

Wayback-hosted rewritten URLs
Direct absolute HTTP/HTTPS links
HTML attributes such as href, src, and action
CSS url(...) references
JavaScript string literals containing absolute URLs

--local-only performs only the rewrite phase on an existing directory and does not contact the archive.
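The core of such a rewrite is turning an absolute link into a path relative to the file that contains it. The sketch below shows one way to do that for a single link; to_relative and the regular expression are illustrative assumptions, not the package's implementation, and off-site links are simply left unchanged.

```python
import posixpath
import re
from urllib.parse import urlsplit

# Matches Wayback replay prefixes such as
# https://web.archive.org/web/20130101000000im_/
WAYBACK_PREFIX = re.compile(
    r"^https?://web\.archive\.org/web/\d{1,14}[a-z_]*/", re.IGNORECASE
)


def to_relative(link: str, site_host: str, current_file: str) -> str:
    """Rewrite an absolute link as a path relative to current_file."""
    original = WAYBACK_PREFIX.sub("", link)
    parts = urlsplit(original)
    if parts.netloc.lower() != site_host.lower():
        return link  # off-site link: leave untouched in this sketch
    target = parts.path.lstrip("/")
    if target == "" or target.endswith("/"):
        target += "index.html"
    start = posixpath.dirname(current_file) or "."
    return posixpath.relpath(target, start)


print(to_relative(
    "https://web.archive.org/web/20130101000000cs_/http://example.com/css/site.css",
    "example.com",
    "blog/post.html",
))
```

Here current_file is the path of the page being rewritten, relative to the root of the downloaded tree.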

Publishing

Build and validate the distribution archives:

python -m pip install --upgrade build twine
python -m build
python -m twine check dist/*

Upload to TestPyPI first:

python -m twine upload --repository testpypi dist/*

Upload to PyPI:

python -m twine upload dist/*

Use an API token when Twine prompts for credentials:

username: __token__
password: your pypi-... token

Testing

Run the test suite:

python -B -m unittest discover -s tests -t .

Compile modules as a quick import sanity check:

python -m compileall wayback_downloader tests

The tests use fake transports and temporary directories, so they do not depend on live access to web.archive.org.
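The pattern looks roughly like the sketch below. FakeTransport and save_capture are hypothetical stand-ins written for this example; the project's real test helpers and function names may differ.

```python
import tempfile
import unittest
from pathlib import Path


class FakeTransport:
    """Hypothetical stand-in for the network layer: returns canned bytes."""

    def __init__(self, responses):
        self.responses = responses
        self.requested = []

    def fetch(self, url: str) -> bytes:
        self.requested.append(url)
        return self.responses[url]


def save_capture(transport, url: str, destination: Path) -> None:
    """Hypothetical code under test: fetch a URL and write it to disk."""
    destination.parent.mkdir(parents=True, exist_ok=True)
    destination.write_bytes(transport.fetch(url))


class SaveCaptureTest(unittest.TestCase):
    def test_writes_fetched_bytes(self):
        url = "https://web.archive.org/web/20130101000000id_/http://example.com/"
        transport = FakeTransport({url: b"<html></html>"})
        # The temporary directory is removed when the block exits.
        with tempfile.TemporaryDirectory() as tmp:
            target = Path(tmp) / "example.com" / "index.html"
            save_capture(transport, url, target)
            self.assertEqual(target.read_bytes(), b"<html></html>")
        self.assertEqual(transport.requested, [url])
```

Because the transport is injected, the test never opens a socket and leaves nothing behind on disk.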

Project details

This version: 0.1.0 (Jun 3, 2026)

Download files

Source Distribution

wayback_machine_downloader-0.1.0.tar.gz (42.0 kB)

Uploaded Jun 3, 2026 Source

Built Distribution

wayback_machine_downloader-0.1.0-py3-none-any.whl (36.3 kB)

Uploaded Jun 3, 2026 Python 3

File details

Details for the file wayback_machine_downloader-0.1.0.tar.gz.

File metadata

Download URL: wayback_machine_downloader-0.1.0.tar.gz
Upload date: Jun 3, 2026
Size: 42.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.5

File hashes

Hashes for wayback_machine_downloader-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`fc8a62c503d798f815a2c9614fb3f4cfe6f8bc8887dd5088438b11a52dc3c555`
MD5	`4cbf8c11784dfda91b03d14f98e53eb0`
BLAKE2b-256	`94fad8b70e3ea74c4f8dcb52b62686190d70eb472a371a9d1ad62f0a94a3357a`

File details

Details for the file wayback_machine_downloader-0.1.0-py3-none-any.whl.

File metadata

Download URL: wayback_machine_downloader-0.1.0-py3-none-any.whl
Upload date: Jun 3, 2026
Size: 36.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.5

File hashes

Hashes for wayback_machine_downloader-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c8fb8ce2eb6158b429b6a53238d4876b431d373de0d357dcfb3a6904a1d0f3eb`
MD5	`429c3c136bf689cc44fe3fdc95c876d0`
BLAKE2b-256	`2f7562b90ff19e04b5b63c69e58de5c3db89b7c1c735ca6a0aca78f0fec80f98`
