A comprehensive tool for downloading and archiving websites from the Wayback Machine
Download complete websites from the Wayback Machine for offline viewing.
Wayback-Archive is a Python tool that downloads archived websites from the Wayback Machine and reconstructs them for fully functional offline viewing. It preserves all assets -- HTML, CSS, JavaScript, images, and fonts -- rewrites URLs to relative paths, and cleans up Wayback Machine artifacts so the result looks like the original site.
Quick Start
# Install
git clone https://github.com/GeiserX/Wayback-Archive.git
cd Wayback-Archive
pip install -r config/requirements.txt
# Run
export WAYBACK_URL="https://web.archive.org/web/20250417203037/http://example.com/"
python3 -m wayback_archive.cli
# Preview
cd output && python3 -m http.server 8000
# Open http://localhost:8000
Features
Core
- Full website download -- HTML, CSS, JS, images, fonts, and all linked assets
- Recursive link discovery -- Automatically follows links in HTML, CSS, and JS files
- Smart URL rewriting -- Converts all links to relative paths for local serving
- Timeframe fallback -- Searches nearby Wayback Machine timestamps when a resource returns 404
- Real-time progress logging -- Displays download status and file processing as it happens
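As a rough illustration of the timeframe fallback idea, the sketch below generates candidate Wayback Machine URLs at timestamps around the one originally requested. The helper name `nearby_snapshot_urls` and the particular day offsets are assumptions made for this example; the source does not state which offsets the tool actually tries.

```python
from datetime import datetime, timedelta

WAYBACK_PREFIX = "https://web.archive.org/web/"
TIMESTAMP_FORMAT = "%Y%m%d%H%M%S"  # 14-digit Wayback Machine timestamp


def nearby_snapshot_urls(timestamp, original_url, day_offsets=(1, -1, 7, -7, 30, -30)):
    """Yield candidate Wayback URLs at timestamps around the requested one.

    The offsets are illustrative only; a real tool may search differently.
    """
    base = datetime.strptime(timestamp, TIMESTAMP_FORMAT)
    for days in day_offsets:
        candidate = base + timedelta(days=days)
        yield f"{WAYBACK_PREFIX}{candidate.strftime(TIMESTAMP_FORMAT)}/{original_url}"


candidates = list(nearby_snapshot_urls("20250417203037", "http://example.com/style.css"))
print(candidates[0])
```

A caller would request each candidate in turn and stop at the first one that does not return 404.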
Asset Handling
- Google Fonts support -- Downloads Google Fonts CSS and font files locally, fixing CORS issues
- Font corruption detection -- Identifies and removes corrupted font files (HTML error pages served as fonts)
- CDN fallback -- Automatic fallback to CDN for critical libraries (e.g., jQuery) when Wayback Machine fails
- Data attribute processing -- Processes data-* attributes containing URLs (videos, images, etc.)
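To make the data attribute step concrete, here is a minimal sketch that collects URL-like values from data-* attributes. It uses the standard library's `html.parser` so that it stays self-contained, although the project lists beautifulsoup4 as a dependency. The class name `DataAttributeCollector` and the prefix heuristic are assumptions for illustration, not the tool's actual logic.

```python
from html.parser import HTMLParser


class DataAttributeCollector(HTMLParser):
    """Collect data-* attribute values that look like URLs or absolute paths."""

    # Heuristic (an assumption): only values with these prefixes count as URLs,
    # so bare relative paths such as "img/a.png" are not picked up.
    URL_PREFIXES = ("http://", "https://", "//", "/")

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name.startswith("data-") and value and value.startswith(self.URL_PREFIXES):
                self.urls.append(value)


collector = DataAttributeCollector()
collector.feed(
    '<div data-video="/media/intro.mp4" data-count="3"></div>'
    '<img data-src="https://example.com/a.png">'
)
print(collector.urls)
```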
Preservation
- Icon group preservation -- Preserves all links in icon groups (social media, contact icons)
- Button link preservation -- Maintains styling and functionality of button links
- Cookie consent preservation -- Keeps cookie consent popups and functionality intact
Optimization
- HTML minification -- Uses minify-html (Python 3.14+ compatible)
- JS/CSS minification -- Optional JavaScript and CSS minification via rjsmin and cssmin
- Image compression -- Optional image optimization with Pillow
- Tracker/ad removal -- Strips analytics, ads, and external iframes
- Link cleanup -- Configurable external link removal with anchor preservation options
- www/non-www normalization -- Normalize domain variations automatically
Why Wayback-Archive?
| Capability | Wayback-Archive | wget | httrack |
|---|---|---|---|
| Wayback Machine URL rewriting | Yes | No | No |
| Wayback artifact cleanup | Yes | No | No |
| Timeframe fallback for 404s | Yes | No | No |
| Google Fonts localization | Yes | No | No |
| Font corruption detection | Yes | No | No |
| CDN fallback | Yes | No | No |
| HTML/CSS/JS minification | Yes | No | No |
| Tracker and ad removal | Yes | No | No |
| data-* attribute processing | Yes | No | No |
General-purpose tools like wget --mirror or httrack can download live websites, but they do not understand Wayback Machine URL structures, cannot clean up archive artifacts, and lack the specialized asset recovery that Wayback-Archive provides.
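For readers unfamiliar with the URL structure mentioned above: a Wayback Machine URL embeds a capture timestamp and the original URL after the /web/ prefix, sometimes with a short modifier such as im_ or cs_ appended to the timestamp. The sketch below splits such a URL; the function name `split_wayback_url` is hypothetical and the pattern is a simplified assumption, not the tool's own parser.

```python
import re

# https://web.archive.org/web/<timestamp><optional modifier>/<original URL>
WAYBACK_RE = re.compile(
    r"^https?://web\.archive\.org/web/"
    r"(?P<timestamp>\d{1,14})(?P<modifier>[a-z]{2}_)?/"
    r"(?P<original>.+)$"
)


def split_wayback_url(url):
    """Return (timestamp, original_url), or None if url is not a Wayback URL."""
    match = WAYBACK_RE.match(url)
    if match is None:
        return None
    return match.group("timestamp"), match.group("original")


print(split_wayback_url("https://web.archive.org/web/20250417203037/http://example.com/"))
```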
Installation
Prerequisites
- Python 3.8 or higher
- pip
From Source
git clone https://github.com/GeiserX/Wayback-Archive.git
cd Wayback-Archive
# Optional: create a virtual environment
python3 -m venv venv
source venv/bin/activate # macOS/Linux
# venv\Scripts\activate # Windows
pip install -r config/requirements.txt
As a Package
cd Wayback-Archive
pip install -e .
wayback-archive # Available as a CLI command after installation
Configuration
All options are set via environment variables. You can also use a .env file.
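The following sketch shows one way environment-variable configuration of this kind could be read in Python. The helper names `env_bool` and `load_settings`, and the set of strings accepted as true, are assumptions for illustration; the variable names and defaults are the ones documented in this section.

```python
import os

# Strings treated as "true" (an assumption; the tool may accept others).
TRUE_VALUES = {"1", "true", "yes", "on"}


def env_bool(name, default, environ=os.environ):
    """Read a boolean option, falling back to default when unset or empty."""
    raw = environ.get(name)
    if raw is None or raw.strip() == "":
        return default
    return raw.strip().lower() in TRUE_VALUES


def load_settings(environ=os.environ):
    """Build a settings dict from a mapping of environment variables."""
    url = environ.get("WAYBACK_URL")
    if not url:
        raise ValueError("WAYBACK_URL is required")
    return {
        "wayback_url": url,
        "output_dir": environ.get("OUTPUT_DIR", "./output"),
        "optimize_html": env_bool("OPTIMIZE_HTML", True, environ),
        "remove_clickable_contacts": env_bool("REMOVE_CLICKABLE_CONTACTS", True, environ),
    }


# An explicit mapping stands in for the real environment here.
settings = load_settings({
    "WAYBACK_URL": "https://web.archive.org/web/20250417203037/http://example.com/",
    "REMOVE_CLICKABLE_CONTACTS": "false",
})
print(settings["output_dir"], settings["remove_clickable_contacts"])
```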
Required
| Variable | Description |
|---|---|
| WAYBACK_URL | The Wayback Machine URL to download |
Output
| Variable | Default | Description |
|---|---|---|
| OUTPUT_DIR | ./output | Output directory for downloaded files |
Optimization
| Variable | Default | Description |
|---|---|---|
| OPTIMIZE_HTML | true | Minify HTML |
| OPTIMIZE_IMAGES | false | Compress images |
| MINIFY_JS | false | Minify JavaScript |
| MINIFY_CSS | false | Minify CSS |
Content Removal
| Variable | Default | Description |
|---|---|---|
| REMOVE_TRACKERS | true | Remove analytics and trackers |
| REMOVE_ADS | true | Remove advertisements |
| REMOVE_CLICKABLE_CONTACTS | true | Remove tel: and mailto: links |
| REMOVE_EXTERNAL_IFRAMES | false | Remove external iframes |
Link Handling
| Variable | Default | Description |
|---|---|---|
| REMOVE_EXTERNAL_LINKS_KEEP_ANCHORS | true | Remove external links, keep anchor text |
| REMOVE_EXTERNAL_LINKS_REMOVE_ANCHORS | false | Remove external links and anchor elements |
| MAKE_INTERNAL_LINKS_RELATIVE | true | Convert internal links to relative paths |
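To illustrate what converting internal links to relative paths involves, the sketch below computes the path from one page to another resource on the same site. The function name `relative_link` and the choice to map directory URLs to index.html are assumptions for this example; the tool's own rewriting rules may differ.

```python
import posixpath
from urllib.parse import urlsplit


def relative_link(from_page, to_resource):
    """Return the path to to_resource relative to the directory of from_page.

    Assumes both URLs are on the same host; only the path component is used.
    """
    from_path = urlsplit(from_page).path or "/"
    to_path = urlsplit(to_resource).path or "/"
    # Assumption: a URL ending in "/" is saved locally as index.html.
    if from_path.endswith("/"):
        from_path += "index.html"
    if to_path.endswith("/"):
        to_path += "index.html"
    return posixpath.relpath(to_path, start=posixpath.dirname(from_path))


print(relative_link("http://example.com/blog/post.html", "http://example.com/css/site.css"))
```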
Domain
| Variable | Default | Description |
|---|---|---|
| MAKE_NON_WWW | true | Convert www to non-www |
| MAKE_WWW | false | Convert non-www to www |
| KEEP_REDIRECTIONS | false | Keep redirect pages |
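The www and non-www options can be pictured as a small host rewrite. The sketch below is a hypothetical helper, `normalize_host`, written only to show the idea; in particular, letting the non-www option take precedence when both flags are set is an assumption, since the source does not say how the tool resolves that case.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_host(url, make_non_www=True, make_www=False):
    """Strip or add a leading "www." on the host of url.

    Note: any user:password@ part of the URL is dropped by this sketch.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    if make_non_www:
        if host.startswith("www."):
            host = host[len("www."):]
    elif make_www:
        if host and not host.startswith("www."):
            host = "www." + host
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


print(normalize_host("http://www.example.com/a?b=1"))
```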
Testing
| Variable | Default | Description |
|---|---|---|
| MAX_FILES | unlimited | Limit number of files to download |
Usage
macOS / Linux
export WAYBACK_URL="https://web.archive.org/web/20250417203037/http://example.com/"
export OUTPUT_DIR="./my_website"
export REMOVE_CLICKABLE_CONTACTS="false" # Keep email/phone links
python3 -m wayback_archive.cli
Windows (PowerShell)
$env:WAYBACK_URL = "https://web.archive.org/web/20250417203037/http://example.com/"
$env:OUTPUT_DIR = ".\my_website"
$env:REMOVE_CLICKABLE_CONTACTS = "false"
python -m wayback_archive.cli
Windows (CMD)
set WAYBACK_URL=https://web.archive.org/web/20250417203037/http://example.com/
set OUTPUT_DIR=.\my_website
set REMOVE_CLICKABLE_CONTACTS=false
python -m wayback_archive.cli
Quick Test
Download a limited number of files to verify everything works:
export WAYBACK_URL="https://web.archive.org/web/20250417203037/http://example.com/"
export MAX_FILES=5
python3 -m wayback_archive.cli
How It Works
- Initial download -- Fetches the main page from the Wayback Machine
- Link extraction -- Parses HTML to find all referenced assets (links, images, CSS, JS)
- CSS processing -- Extracts font URLs, background images, and @import statements; downloads Google Fonts locally; detects corrupted font files
- JS processing -- Extracts dynamically loaded resources from JavaScript
- Data attributes -- Scans data-* attributes for additional asset URLs
- Iterative crawling -- Continues discovering and downloading resources until the queue is empty
- Timeframe fallback -- For 404 responses, searches nearby Wayback Machine timestamps
- URL rewriting -- Converts all URLs to relative paths for offline serving
- Preservation -- Maintains icon groups, button links, and cookie consent functionality
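The steps above amount to a queue-driven crawl. A minimal sketch of that loop follows, with the network and the parsers replaced by caller-supplied functions so it can run on its own. The names `crawl`, `fetch`, and `extract_links` are hypothetical, and this is not the code in downloader.py.

```python
from collections import deque


def crawl(start_url, fetch, extract_links, max_files=None):
    """Breadth-first download loop.

    fetch(url) returns the body, or None on failure.
    extract_links(url, body) returns the URLs referenced by that body.
    """
    queue = deque([start_url])
    seen = {start_url}
    downloaded = {}
    while queue:
        if max_files is not None and len(downloaded) >= max_files:
            break
        url = queue.popleft()
        body = fetch(url)
        if body is None:
            continue  # a real tool might try a timeframe fallback here
        downloaded[url] = body
        for link in extract_links(url, body):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return downloaded


# Tiny in-memory "site" standing in for the Wayback Machine.
site = {
    "/": ["/a.css", "/page2"],
    "/page2": ["/a.css", "/missing.js"],
    "/a.css": [],
}
result = crawl("/", fetch=lambda u: site.get(u), extract_links=lambda u, body: body)
print(sorted(result))
```

The `seen` set is what keeps the loop finite: each URL is queued at most once, so the crawl ends when no new links are discovered or when the `max_files` limit is reached.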
Project Structure
Wayback-Archive/
wayback_archive/ # Main package
__init__.py
__main__.py
cli.py # CLI entry point
config.py # Environment variable configuration
downloader.py # Core download and processing engine
config/
requirements.txt # Runtime dependencies
requirements-dev.txt # Development dependencies
setup.py # Package setup
pytest.ini # Test configuration
tests/ # Test suite
docs/ # Documentation
LICENSE # GPL-3.0
README.md
Testing
pip install -r config/requirements-dev.txt
# Run tests
pytest
# Run tests with coverage
pytest --cov=wayback_archive
Troubleshooting
Port Already in Use
python3 -m http.server 8080 # Use a different port
Font Loading Issues
- Google Fonts: Downloaded automatically to avoid CORS issues
- Corrupted fonts: Detected and removed from CSS automatically
- Missing fonts: Some fonts may not exist in the Wayback Machine archive
See Font Loading Research Notes for details.
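One plausible way to detect a corrupted font (an HTML error page saved with a font extension) is to check the first bytes of the file against known font signatures. The sketch below does that for WOFF, WOFF2, TrueType, and OpenType files; the function names are hypothetical, EOT files are not covered, and the tool's actual check may work differently.

```python
# Leading four bytes of common web font formats.
FONT_SIGNATURES = (
    b"wOFF",              # WOFF
    b"wOF2",              # WOFF2
    b"\x00\x01\x00\x00",  # TrueType
    b"OTTO",              # OpenType with CFF outlines
    b"true",              # Apple TrueType
    b"ttcf",              # TrueType collection
)


def looks_like_font(data):
    """Return True if data starts with a known font signature."""
    return data[:4] in FONT_SIGNATURES


def looks_like_html(data):
    """Return True if data appears to be an HTML document."""
    head = data[:512].lstrip().lower()
    return head.startswith((b"<!doctype", b"<html"))


error_page = b"\n<!DOCTYPE html><html><body>404 Not Found</body></html>"
print(looks_like_font(error_page), looks_like_html(error_page))
```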
Missing Links or Icons
- Icon groups (social media, contacts) are preserved automatically
- Button links with sppb-btn or btn classes are preserved
- Set REMOVE_CLICKABLE_CONTACTS=false to keep tel: and mailto: links
jQuery or Libraries Not Loading
The tool includes automatic CDN fallback for critical libraries. If a file fails to download from the Wayback Machine, it will attempt to fetch it from a CDN.
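As an illustration of how such a fallback could map a failed download to a CDN address, the sketch below derives a code.jquery.com URL from a versioned jQuery filename. The source does not say which CDN or matching rules the tool uses, so the function name `jquery_cdn_url`, the regular expression, and the choice of CDN are all assumptions.

```python
import re

# Matches filenames such as jquery-3.6.0.min.js or jquery.1.12.4.js at the end of a URL.
# URLs with a query string after the filename are not handled by this sketch.
JQUERY_RE = re.compile(r"jquery[-.](\d+\.\d+(?:\.\d+)?)(\.min)?\.js$", re.IGNORECASE)


def jquery_cdn_url(original_url):
    """Return a CDN URL for a versioned jQuery file, or None if not recognized."""
    match = JQUERY_RE.search(original_url)
    if match is None:
        return None
    version = match.group(1)
    minified = ".min" if match.group(2) else ""
    return f"https://code.jquery.com/jquery-{version}{minified}.js"


print(jquery_cdn_url("http://example.com/js/jquery-3.6.0.min.js"))
```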
Dependencies
| Package | Purpose |
|---|---|
| requests | HTTP client |
| beautifulsoup4 | HTML parsing |
| lxml | Fast HTML/XML parser |
| minify-html | HTML minification |
| cssmin | CSS minification |
| rjsmin | JS minification |
| Pillow | Image optimization |
| python-dotenv | .env file support |
Contributing
Contributions are welcome. Please feel free to submit a Pull Request.
Related Web Archiving Tools
- Way-CMS — Simple web CMS for editing archived HTML/CSS files
- Wayback-Diff — Web page comparison with Wayback Machine support
- web-mirror — Mirror any webpage for offline access
- media-download — Download all media files from any web page
License
This project is licensed under the GNU General Public License v3.0 (GPL-3.0).
Download files
File details
Details for the file wayback_archive-1.3.3.tar.gz.
File metadata
- Download URL: wayback_archive-1.3.3.tar.gz
- Upload date:
- Size: 5.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.15
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 93d6613eac228dd69f6d58e2399fb0fd01364865a1bd3384b98703039ab539ae |
| MD5 | 05adc94d531c43db9476f8ff87d73f3a |
| BLAKE2b-256 | fbf0de818d84d1cc8a48211cd1179b776a1017ed2b45c56fc56c79b5adb8320d |
File details
Details for the file wayback_archive-1.3.3-py3-none-any.whl.
File metadata
- Download URL: wayback_archive-1.3.3-py3-none-any.whl
- Upload date:
- Size: 30.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.15
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 707a81a49d649aa39a015bcfa1b568989f2cfe6e160289cc56ca9af97cb22853 |
| MD5 | 064323b31d9eca31755cbd1b0645a6f3 |
| BLAKE2b-256 | 16d90ea94fbcb9b4b38edfe4ee5a0104fd7f79901f3c6062f416ea61d43b64c8 |