wayback-archive

A comprehensive tool for downloading and archiving websites from the Wayback Machine

These details have not been verified by PyPI

Project links

Homepage

Project description

Download complete websites from the Wayback Machine for offline viewing.

Python 3.8+

Wayback-Archive is a Python tool that downloads archived websites from the Wayback Machine and reconstructs them for fully functional offline viewing. It preserves all assets -- HTML, CSS, JavaScript, images, and fonts -- rewrites URLs to relative paths, and cleans up Wayback Machine artifacts so the result looks like the original site.

Quick Start

# Install
git clone https://github.com/GeiserX/Wayback-Archive.git
cd Wayback-Archive
pip install -r config/requirements.txt

# Run
export WAYBACK_URL="https://web.archive.org/web/20250417203037/http://example.com/"
python3 -m wayback_archive.cli

# Preview
cd output && python3 -m http.server 8000
# Open http://localhost:8000

Features

Core

Full website download -- HTML, CSS, JS, images, fonts, and all linked assets
Recursive link discovery -- Automatically follows links in HTML, CSS, and JS files
Smart URL rewriting -- Converts all links to relative paths for local serving
Timeframe fallback -- Searches nearby Wayback Machine timestamps when a resource returns 404
Real-time progress logging -- Displays download status and file processing as it happens
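The timeframe fallback can be pictured as generating candidate snapshot URLs around the requested timestamp. The sketch below is an illustration only: `nearby_timestamps`, `fallback_urls`, and the day offsets are hypothetical names and values, not taken from the project's code. It relies only on the 14-digit `YYYYMMDDhhmmss` timestamp format visible in Wayback Machine URLs.

```python
from datetime import datetime, timedelta

# Wayback Machine snapshot timestamps use the form YYYYMMDDhhmmss.
WAYBACK_TS_FORMAT = "%Y%m%d%H%M%S"


def nearby_timestamps(timestamp, day_offsets=(1, 7, 30)):
    """Return timestamps shifted before and after the original, closest first.

    The offsets are an assumption made for this sketch.
    """
    base = datetime.strptime(timestamp, WAYBACK_TS_FORMAT)
    candidates = []
    for days in day_offsets:
        for sign in (-1, 1):
            shifted = base + timedelta(days=sign * days)
            candidates.append(shifted.strftime(WAYBACK_TS_FORMAT))
    return candidates


def fallback_urls(timestamp, original_url, day_offsets=(1, 7, 30)):
    """Build candidate snapshot URLs to try after a 404."""
    return [
        f"https://web.archive.org/web/{ts}/{original_url}"
        for ts in nearby_timestamps(timestamp, day_offsets)
    ]
```

A caller would request each candidate in turn and stop at the first one that returns a usable response.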

Asset Handling

Google Fonts support -- Downloads Google Fonts CSS and font files locally, fixing CORS issues
Font corruption detection -- Identifies and removes corrupted font files (HTML error pages served as fonts)
CDN fallback -- Automatic fallback to CDN for critical libraries (e.g., jQuery) when Wayback Machine fails
Data attribute processing -- Processes data-* attributes containing URLs (videos, images, etc.)
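As a rough illustration of data attribute processing, the following sketch collects `data-*` values that look like URLs or paths using only the standard library. `DataAttrCollector` and `find_data_urls` are hypothetical names, and the "looks like a URL" test is an assumption; the project itself depends on beautifulsoup4 and may work differently.

```python
from html.parser import HTMLParser


class DataAttrCollector(HTMLParser):
    """Collect values of data-* attributes that look like URLs or paths."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # Skip non-data attributes and attributes without a value.
            if not name.startswith("data-") or not value:
                continue
            # Assumed heuristic: absolute, protocol-relative, or root-relative.
            if value.startswith(("http://", "https://", "//", "/")):
                self.urls.append(value)


def find_data_urls(html):
    """Return URL-like data-* attribute values in document order."""
    collector = DataAttrCollector()
    collector.feed(html)
    return collector.urls
```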

Preservation

Icon group preservation -- Preserves all links in icon groups (social media, contact icons)
Button link preservation -- Maintains styling and functionality of button links
Cookie consent preservation -- Keeps cookie consent popups and functionality intact

Optimization

HTML minification -- Uses minify-html (Python 3.14+ compatible)
JS/CSS minification -- Optional JavaScript and CSS minification via rjsmin and cssmin
Image compression -- Optional image optimization with Pillow
Tracker/ad removal -- Strips analytics, ads, and external iframes
Link cleanup -- Configurable external link removal with anchor preservation options
www/non-www normalization -- Normalizes www and non-www variants of the domain automatically
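One way to picture www/non-www normalization is as a small rewrite of the host part of each URL. This is a sketch under stated assumptions: `normalize_host` is a hypothetical helper, and it drops any user:password part of the URL for simplicity.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_host(url, make_www=False):
    """Strip a leading "www." from the host, or add one when make_www is True."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if not host:
        return url  # Relative URL or no host: nothing to normalize.
    if make_www:
        if not host.startswith("www."):
            host = "www." + host
    elif host.startswith("www."):
        host = host[len("www."):]
    # Keep an explicit port if the URL had one.
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))
```

With the default arguments this mirrors the direction of `MAKE_NON_WWW`; passing `make_www=True` mirrors `MAKE_WWW`.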

Why Wayback-Archive?

Capability	Wayback-Archive	wget	httrack
Wayback Machine URL rewriting	Yes	No	No
Wayback artifact cleanup	Yes	No	No
Timeframe fallback for 404s	Yes	No	No
Google Fonts localization	Yes	No	No
Font corruption detection	Yes	No	No
CDN fallback	Yes	No	No
HTML/CSS/JS minification	Yes	No	No
Tracker and ad removal	Yes	No	No
`data-*` attribute processing	Yes	No	No

General-purpose tools like wget --mirror or httrack can download live websites, but they do not understand Wayback Machine URL structures, cannot clean up archive artifacts, and lack the specialized asset recovery that Wayback-Archive provides.
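To make the "Wayback Machine URL structure" concrete: a snapshot URL embeds a timestamp and the original URL after `/web/`. The sketch below splits the two apart. `split_wayback_url` and the regular expression are illustrative assumptions, not the project's actual parser; the optional two-letter modifier (for example `im_` or `cs_`) is handled because it can appear after the timestamp on asset URLs.

```python
import re

# Timestamp (up to 14 digits), optional modifier such as "im_", then the original URL.
WAYBACK_RE = re.compile(
    r"^https?://web\.archive\.org/web/"
    r"(?P<timestamp>\d{1,14})(?P<modifier>[a-z]{2}_)?/(?P<original>.+)$"
)


def split_wayback_url(url):
    """Return (timestamp, original_url), or None if url is not a snapshot URL."""
    match = WAYBACK_RE.match(url)
    if match is None:
        return None
    return match.group("timestamp"), match.group("original")
```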

Installation

Prerequisites

Python 3.8 or higher
pip

From Source

git clone https://github.com/GeiserX/Wayback-Archive.git
cd Wayback-Archive

# Optional: create a virtual environment
python3 -m venv venv
source venv/bin/activate  # macOS/Linux
# venv\Scripts\activate   # Windows

pip install -r config/requirements.txt

As a Package

cd Wayback-Archive
pip install -e .
wayback-archive  # Available as a CLI command after installation

Configuration

All options are set through environment variables, which can be exported in the shell or placed in a .env file.

Required

Variable	Description
`WAYBACK_URL`	The Wayback Machine URL to download

Output

Variable	Default	Description
`OUTPUT_DIR`	`./output`	Output directory for downloaded files

Optimization

Variable	Default	Description
`OPTIMIZE_HTML`	`true`	Minify HTML
`OPTIMIZE_IMAGES`	`false`	Compress images
`MINIFY_JS`	`false`	Minify JavaScript
`MINIFY_CSS`	`false`	Minify CSS

Content Removal

Variable	Default	Description
`REMOVE_TRACKERS`	`true`	Remove analytics and trackers
`REMOVE_ADS`	`true`	Remove advertisements
`REMOVE_CLICKABLE_CONTACTS`	`true`	Remove `tel:` and `mailto:` links
`REMOVE_EXTERNAL_IFRAMES`	`false`	Remove external iframes

Link Handling

Variable	Default	Description
`REMOVE_EXTERNAL_LINKS_KEEP_ANCHORS`	`true`	Remove external links, keep anchor text
`REMOVE_EXTERNAL_LINKS_REMOVE_ANCHORS`	`false`	Remove external links and anchor elements
`MAKE_INTERNAL_LINKS_RELATIVE`	`true`	Convert internal links to relative paths
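To illustrate what converting an internal link to a relative path involves, here is a minimal sketch. `relative_link` is a hypothetical helper; it assumes both URLs are on the same host and that directory URLs are saved as `index.html`, neither of which is confirmed by the project.

```python
import posixpath
from urllib.parse import urlsplit


def relative_link(page_url, target_url):
    """Return target_url expressed relative to the page that links to it."""
    page_path = urlsplit(page_url).path or "/"
    target = urlsplit(target_url)
    target_path = target.path or "/"

    # A page path ending in "/" is itself the directory; otherwise use its parent.
    if page_path.endswith("/"):
        page_dir = page_path
    else:
        page_dir = posixpath.dirname(page_path)

    # Assumption: directory URLs are stored on disk as index.html.
    if target_path.endswith("/"):
        target_path += "index.html"

    rel = posixpath.relpath(target_path, start=page_dir)
    if target.fragment:
        rel += "#" + target.fragment
    return rel
```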

Domain

Variable	Default	Description
`MAKE_NON_WWW`	`true`	Convert www to non-www
`MAKE_WWW`	`false`	Convert non-www to www
`KEEP_REDIRECTIONS`	`false`	Keep redirect pages

Testing

Variable	Default	Description
`MAX_FILES`	unlimited	Limit number of files to download
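The options in the tables above are plain strings in the environment, so they need to be parsed into booleans and integers. The sketch below shows one way that could be done. `env_flag`, `env_int`, and the set of accepted spellings are assumptions; the source only shows `true` and `false` as values.

```python
import os

# Assumed spellings for "enabled"; anything else counts as disabled.
TRUE_VALUES = {"1", "true", "yes", "on"}


def env_flag(name, default, environ=None):
    """Read a boolean option, falling back to default when unset or empty."""
    environ = os.environ if environ is None else environ
    raw = environ.get(name)
    if raw is None or raw.strip() == "":
        return default
    return raw.strip().lower() in TRUE_VALUES


def env_int(name, default=None, environ=None):
    """Read an integer option such as MAX_FILES; None stands for unlimited."""
    environ = os.environ if environ is None else environ
    raw = environ.get(name, "").strip()
    return int(raw) if raw else default
```

Passing a dictionary as `environ` keeps the helpers easy to test without touching the real environment.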

Usage

macOS / Linux

export WAYBACK_URL="https://web.archive.org/web/20250417203037/http://example.com/"
export OUTPUT_DIR="./my_website"
export REMOVE_CLICKABLE_CONTACTS="false"  # Keep email/phone links

python3 -m wayback_archive.cli

Windows (PowerShell)

$env:WAYBACK_URL = "https://web.archive.org/web/20250417203037/http://example.com/"
$env:OUTPUT_DIR = ".\my_website"
$env:REMOVE_CLICKABLE_CONTACTS = "false"

python -m wayback_archive.cli

Windows (CMD)

set WAYBACK_URL=https://web.archive.org/web/20250417203037/http://example.com/
set OUTPUT_DIR=.\my_website
set REMOVE_CLICKABLE_CONTACTS=false

python -m wayback_archive.cli

Quick Test

Download a limited number of files to verify everything works:

export WAYBACK_URL="https://web.archive.org/web/20250417203037/http://example.com/"
export MAX_FILES=5
python3 -m wayback_archive.cli

How It Works

Initial download -- Fetches the main page from the Wayback Machine
Link extraction -- Parses HTML to find all referenced assets (links, images, CSS, JS)
CSS processing -- Extracts font URLs, background images, and @import statements; downloads Google Fonts locally; detects corrupted font files
JS processing -- Extracts dynamically loaded resources from JavaScript
Data attributes -- Scans data-* attributes for additional asset URLs
Iterative crawling -- Continues discovering and downloading resources until the queue is empty
Timeframe fallback -- For 404 responses, searches nearby Wayback Machine timestamps
URL rewriting -- Converts all URLs to relative paths for offline serving
Preservation -- Maintains icon groups, button links, and cookie consent functionality
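The iterative crawling step above can be sketched as a queue that is drained until no new URLs are discovered. This is a simplified illustration rather than the code in downloader.py: `crawl`, `fetch`, and `extract_links` are hypothetical names, and fetching and link extraction are passed in as functions so the loop can be shown without network access.

```python
from collections import deque


def crawl(start_url, fetch, extract_links, max_files=None):
    """Download start_url and everything reachable from it.

    fetch(url) returns the body, or None on failure.
    extract_links(url, body) returns the URLs referenced by that body.
    """
    queue = deque([start_url])
    seen = {start_url}   # URLs already queued, to avoid duplicates and loops
    downloaded = {}      # url -> body, in download order

    while queue:
        if max_files is not None and len(downloaded) >= max_files:
            break
        url = queue.popleft()
        body = fetch(url)
        if body is None:
            continue  # A fallback step would be tried here.
        downloaded[url] = body
        for link in extract_links(url, body):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return downloaded
```

The `seen` set is what guarantees termination: each URL enters the queue at most once, even when pages link back to each other.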

Project Structure

Wayback-Archive/
  wayback_archive/          # Main package
    __init__.py
    __main__.py
    cli.py                  # CLI entry point
    config.py               # Environment variable configuration
    downloader.py           # Core download and processing engine
  config/
    requirements.txt        # Runtime dependencies
    requirements-dev.txt    # Development dependencies
    setup.py                # Package setup
    pytest.ini              # Test configuration
  tests/                    # Test suite
  docs/                     # Documentation
  LICENSE                   # GPL-3.0
  README.md

Testing

pip install -r config/requirements-dev.txt

# Run tests
pytest

# Run tests with coverage
pytest --cov=wayback_archive

Troubleshooting

Port Already in Use

If port 8000 is already taken, serve the output on another port:

python3 -m http.server 8080  # Use a different port

Font Loading Issues

Google Fonts: Downloaded automatically to avoid CORS issues
Corrupted fonts: Detected and removed from CSS automatically
Missing fonts: Some fonts may not exist in the Wayback Machine archive

See Font Loading Research Notes for details.
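A corrupted font in this sense is usually an HTML error page saved under a font file name. One plausible way to detect that is to check the first bytes of the file against well-known font signatures. The sketch below is an assumption about how such a check might look, not the project's implementation; `looks_like_font` and `looks_like_html` are hypothetical names, and EOT files are not covered.

```python
# Four-byte signatures at the start of common web font formats.
FONT_SIGNATURES = (
    b"wOFF",              # WOFF
    b"wOF2",              # WOFF2
    b"OTTO",              # OpenType with CFF outlines
    b"\x00\x01\x00\x00",  # TrueType
    b"true",              # TrueType (legacy Apple variant)
    b"ttcf",              # TrueType collection
)


def looks_like_font(data):
    """Return True if data starts with a known font signature."""
    return data[:4] in FONT_SIGNATURES


def looks_like_html(data):
    """Return True if data appears to be an HTML page, e.g. an error page."""
    head = data[:512].lstrip().lower()
    return head.startswith((b"<!doctype", b"<html", b"<head", b"<body"))
```

A downloaded file that fails `looks_like_font` and passes `looks_like_html` is a strong candidate for removal from the CSS.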

Missing Links or Icons

Icon groups (social media, contacts) are preserved automatically
Button links with sppb-btn or btn classes are preserved
Set REMOVE_CLICKABLE_CONTACTS=false to keep tel: and mailto: links

jQuery or Libraries Not Loading

The tool includes automatic CDN fallback for critical libraries such as jQuery. If a file fails to download from the Wayback Machine, the tool attempts to fetch it from a CDN instead.
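The fallback behaviour can be sketched as "try the archive first, then a known CDN copy". Everything named here is illustrative: `fetch_with_fallback` and `CDN_URLS` are hypothetical, and the jQuery version in the mapping is just an example, not necessarily the one the tool uses.

```python
import posixpath
from urllib.parse import urlsplit

# Illustrative mapping from file name to a CDN copy (assumed, not from the project).
CDN_URLS = {
    "jquery.min.js": "https://code.jquery.com/jquery-3.7.1.min.js",
}


def fetch_with_fallback(url, fetch, cdn_urls=CDN_URLS):
    """Return (body, source_url), trying the CDN when the archive fetch fails.

    fetch(url) returns the body, or None on failure.
    """
    body = fetch(url)
    if body is not None:
        return body, url

    # Match on the file name only, ignoring any query string.
    filename = posixpath.basename(urlsplit(url).path)
    cdn_url = cdn_urls.get(filename)
    if cdn_url is None:
        return None, None

    body = fetch(cdn_url)
    return (body, cdn_url) if body is not None else (None, None)
```

Note that a CDN copy may differ in version from the one the archived site originally used, so pages that rely on removed APIs can still break.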

Dependencies

Package	Purpose
requests	HTTP client
beautifulsoup4	HTML parsing
lxml	Fast HTML/XML parser
minify-html	HTML minification
cssmin	CSS minification
rjsmin	JS minification
Pillow	Image optimization
python-dotenv	`.env` file support

Contributing

Contributions are welcome. Please feel free to submit a Pull Request.

Related Web Archiving Tools

Way-CMS -- Simple web CMS for editing archived HTML/CSS files
Wayback-Diff -- Web page comparison with Wayback Machine support
web-mirror -- Mirror any webpage for offline access
media-download -- Download all media files from any web page

License

This project is licensed under the GNU General Public License v3.0 (GPL-3.0).

Release history

1.4.1 -- May 4, 2026
1.4.0 -- Apr 8, 2026
1.3.3 -- Apr 6, 2026 (this version)
1.3.2 -- Apr 6, 2026
1.3.1 -- Apr 6, 2026

Download files

Source Distribution

wayback_archive-1.3.3.tar.gz (5.4 kB) -- uploaded Apr 6, 2026, Source

Built Distribution

wayback_archive-1.3.3-py3-none-any.whl (30.1 kB) -- uploaded Apr 6, 2026, Python 3

File details

Details for the file wayback_archive-1.3.3.tar.gz.

File metadata

Download URL: wayback_archive-1.3.3.tar.gz
Upload date: Apr 6, 2026
Size: 5.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.15

File hashes

Hashes for wayback_archive-1.3.3.tar.gz
Algorithm	Hash digest
SHA256	`93d6613eac228dd69f6d58e2399fb0fd01364865a1bd3384b98703039ab539ae`
MD5	`05adc94d531c43db9476f8ff87d73f3a`
BLAKE2b-256	`fbf0de818d84d1cc8a48211cd1179b776a1017ed2b45c56fc56c79b5adb8320d`

File details

Details for the file wayback_archive-1.3.3-py3-none-any.whl.

File metadata

Download URL: wayback_archive-1.3.3-py3-none-any.whl
Upload date: Apr 6, 2026
Size: 30.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.15

File hashes

Hashes for wayback_archive-1.3.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`707a81a49d649aa39a015bcfa1b568989f2cfe6e160289cc56ca9af97cb22853`
MD5	`064323b31d9eca31755cbd1b0645a6f3`
BLAKE2b-256	`16d90ea94fbcb9b4b38edfe4ee5a0104fd7f79901f3c6062f416ea61d43b64c8`
