A comprehensive tool for comparing web pages with Wayback Machine support
Detect meaningful differences between web pages -- with Wayback Machine artifact cleaning, visual comparison, and significance scoring.
Why Wayback-Diff?
Comparing web pages sounds simple until you deal with Wayback Machine injection artifacts, insignificant whitespace noise, and visual regressions invisible to the DOM. Wayback-Diff is a purpose-built CLI that solves all three:
- Wayback Machine cleaning -- automatically strips banners, analytics scripts, playback code, and URL rewrites so you compare actual content.
- Significance scoring -- every change is tagged High, Medium, or Low so you focus on what matters.
- Multi-browser visual comparison -- captures screenshots in Chrome, Firefox, Edge, and Opera, then generates pixel-diff images.
- CI/CD-ready exit codes -- integrate directly into pipelines (0 = no changes, 1 = low/medium, 2 = high).
Table of Contents
- Quick Start
- Installation
- Usage
- Visual Comparison
- Markdown Reports
- CI/CD Integration
- How It Works
- Output Formats
- Comparison with Similar Tools
- Contributing
- License
Quick Start
pip install wayback-diff
# Compare two pages
wayback-diff https://example.com/old https://example.com/new
# Compare a Wayback snapshot with the live site
wayback-diff https://web.archive.org/web/20230101/https://example.com/ https://example.com/
# Full report: visual diff + markdown
wayback-diff https://old.example.com https://new.example.com --visual --markdown
Installation
From PyPI
pip install wayback-diff
# With visual comparison support
pip install wayback-diff[visual]
From source
git clone https://github.com/GeiserX/Wayback-Diff.git
cd Wayback-Diff
python3 -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
pip install -r requirements.txt
pip install -e .
For visual comparison support:
pip install -e ".[visual]"
Docker
docker build -t wayback-diff .
docker run --rm wayback-diff https://example.com/a https://example.com/b
Usage
Basic comparison
wayback-diff https://example.com/page1 https://example.com/page2
Wayback Machine support
The tool automatically detects Wayback Machine URLs and cleans injection artifacts before comparing:
# Archive vs. live site
wayback-diff https://web.archive.org/web/20230101/https://example.com/ https://example.com/
# Two archive snapshots
wayback-diff \
https://web.archive.org/web/20230101/https://example.com/ \
https://web.archive.org/web/20230601/https://example.com/
Output formats
# Save to file
wayback-diff url1 url2 -o diff.txt
# JSON (for programmatic consumption)
wayback-diff url1 url2 --format json
# Unified diff
wayback-diff url1 url2 --format unified
Site-wide traversal
# Crawl and compare across linked pages (depth-limited)
wayback-diff url1 url2 --traverse --depth 2
Advanced options
| Flag | Description |
|---|---|
| --no-clean-wayback | Disable Wayback Machine artifact removal |
| --no-ignore-whitespace | Treat whitespace changes as significant |
| --timeout N | Set HTTP timeout in seconds (default: 30) |
| --verbose | Enable detailed logging |
Visual Comparison
Take screenshots in one or more browsers and generate side-by-side difference images:
# Auto-detect all installed browsers
wayback-diff url1 url2 --visual
# Specific browsers
wayback-diff url1 url2 --visual --browsers chrome firefox edge opera
# Custom viewport
wayback-diff url1 url2 --visual --viewport-width 1280 --viewport-height 720
# Non-headless mode (for debugging)
wayback-diff url1 url2 --visual --no-headless
# Custom screenshot output
wayback-diff url1 url2 --visual --screenshot-dir ./my-screenshots
Visual comparison generates:
- Screenshots of both pages per browser
- Side-by-side comparison images
- Pixel-level difference highlighting (red overlay marks changes)
Markdown Reports
Generate comprehensive Markdown reports that include everything in a single reviewable document:
wayback-diff url1 url2 --visual --markdown --report-dir ./reports
Each report contains:
- Executive summary with change statistics
- Visual comparison screenshots (when --visual is used)
- Changes grouped by significance (High / Medium / Low)
- Site-wide results (when --traverse is used)
- Actionable recommendations
CI/CD Integration
Wayback-Diff returns meaningful exit codes designed for pipeline gates:
| Exit Code | Meaning |
|---|---|
| 0 | No differences detected |
| 1 | Low or medium significance changes |
| 2 | High significance changes detected |
GitHub Actions example
name: Visual Regression Check
on:
  pull_request:
jobs:
  diff:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install Wayback-Diff
        run: |
          pip install -r requirements.txt
          pip install -e ".[visual]"
      - name: Compare staging vs production
        run: |
          wayback-diff \
            https://staging.example.com \
            https://production.example.com \
            --visual --markdown --format json -o diff.json
      - name: Upload report
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: diff-report
          path: reports/
Shell script gate
wayback-diff "$OLD_URL" "$NEW_URL" --format json -o result.json
EXIT_CODE=$?
if [ "$EXIT_CODE" -eq 2 ]; then
    echo "BLOCKING: high-significance changes detected"
    exit 1
elif [ "$EXIT_CODE" -eq 1 ]; then
    echo "WARNING: minor changes detected"
fi
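For pipelines scripted in Python rather than shell, the same gate might look like the sketch below. It relies only on the exit codes documented above; the helper names (`describe_exit_code`, `should_block`, `run_gate`) are hypothetical and not part of Wayback-Diff. Unlike the shell version, it also blocks on undocumented exit codes, which is an assumption about what a strict gate should do.

```python
import subprocess

# Exit codes as documented in the exit code table.
EXIT_MEANINGS = {
    0: "no differences detected",
    1: "low or medium significance changes",
    2: "high significance changes detected",
}


def describe_exit_code(code: int) -> str:
    """Map a wayback-diff exit code to a human-readable meaning."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def should_block(code: int) -> bool:
    """Block on high-significance changes and on any undocumented code."""
    return code not in (0, 1)


def run_gate(old_url: str, new_url: str) -> bool:
    """Run wayback-diff (must be installed and on PATH); True if the gate passes."""
    result = subprocess.run(
        ["wayback-diff", old_url, new_url, "--format", "json", "-o", "result.json"]
    )
    print(describe_exit_code(result.returncode))
    return not should_block(result.returncode)
```

Calling `run_gate(old, new)` returns False when the pipeline should stop, so a wrapper script can exit non-zero in that case.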
How It Works
Wayback Machine cleaning
When a Wayback Machine URL is detected, the tool automatically:
- Removes header artifacts -- strips analytics scripts, playback scripts, and banner CSS injected by the Wayback Machine.
- Removes footer comments -- removes archival metadata and copyright notices.
- Restores URLs -- converts URLs prefixed with web.archive.org/web/…/ back to their originals.
- Normalizes content -- handles whitespace and formatting differences introduced by archival.
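The URL restoration step can be illustrated with a small regular expression. This is a minimal sketch, not the tool's actual implementation: the pattern and the function names `restore_url` and `restore_urls_in_html` are assumptions. It assumes the common replay form, a numeric timestamp optionally followed by a two-letter modifier such as `im_` or `js_`, and does not handle root-relative `/web/…` links.

```python
import re

# Matches a replay prefix such as "https://web.archive.org/web/20230101000000im_/".
# Assumed shape: 1-14 timestamp digits, then an optional two-letter modifier and "_".
WAYBACK_PREFIX = re.compile(
    r"(?:https?:)?//web\.archive\.org/web/\d{1,14}(?:[a-z]{2}_)?/",
    re.IGNORECASE,
)


def restore_url(url: str) -> str:
    """Strip a leading Wayback Machine replay prefix from a single URL."""
    match = WAYBACK_PREFIX.match(url)
    return url[match.end():] if match else url


def restore_urls_in_html(html: str) -> str:
    """Strip the replay prefix wherever it appears in an HTML string."""
    return WAYBACK_PREFIX.sub("", html)
```

URLs that do not carry the prefix pass through unchanged, so the helpers are safe to apply to live pages as well.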
Significance scoring
Every detected change is categorized:
| Level | Examples |
|---|---|
| High | Structural changes, content text, meta tags, scripts, stylesheets |
| Medium | Attribute changes, inline styling, div/span modifications |
| Low | Whitespace, comments, minor formatting |
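To make the three levels concrete, here is a rough sketch of how a change between two HTML fragments could be bucketed. The heuristics, regular expressions, and the `classify` function are all hypothetical, chosen only to mirror the table above; the rules Wayback-Diff actually applies may differ.

```python
import re

COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
TAG = re.compile(r"<\s*/?\s*([a-zA-Z][\w-]*)[^>]*>")
HIGH_TAGS = {"meta", "script", "link", "style"}  # meta tags, scripts, stylesheets
NEUTRAL_TAGS = {"div", "span"}                   # wrapper changes count as medium


def _normalize(fragment: str) -> str:
    """Drop comments and collapse runs of whitespace."""
    return " ".join(COMMENT.sub("", fragment).split())


def _tags(fragment: str) -> list[str]:
    """Lower-cased tag names in document order, comments excluded."""
    return [t.lower() for t in TAG.findall(COMMENT.sub("", fragment))]


def _text(fragment: str) -> str:
    """Visible text with comments and tags removed."""
    return " ".join(TAG.sub(" ", COMMENT.sub("", fragment)).split())


def classify(old: str, new: str) -> str:
    """Return "high", "medium", or "low" for a change from old to new."""
    if _normalize(old) == _normalize(new):
        return "low"  # only whitespace or comments differ
    old_tags, new_tags = _tags(old), _tags(new)
    structure_old = [t for t in old_tags if t not in NEUTRAL_TAGS]
    structure_new = [t for t in new_tags if t not in NEUTRAL_TAGS]
    if structure_old != structure_new:
        return "high"  # structural change
    if set(old_tags + new_tags) & HIGH_TAGS:
        return "high"  # meta tag, script, or stylesheet touched
    if _text(old) != _text(new):
        return "high"  # content text changed
    return "medium"    # same text and structure: attributes or div/span wrappers
```

For example, `classify('<div class="a">Hi</div>', '<div class="b">Hi</div>')` lands in the medium bucket because only an attribute differs.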
Intelligent comparison
The diff engine:
- Focuses on meaningful content changes
- Ignores noise like timestamps and auto-generated IDs
- Provides context around each change
- Groups results by significance for fast review
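The idea of ignoring noise while keeping context can be sketched with Python's standard `difflib`. This is an illustration under stated assumptions, not the tool's diff engine: the noise patterns (ISO-style timestamps, `id` attributes ending in four or more digits) and the names `normalize` and `meaningful_diff` are hypothetical.

```python
import difflib
import re

# Hypothetical noise patterns; the real tool's rules may differ.
TIMESTAMP = re.compile(
    r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:?\d{2})?"
)
GENERATED_ID = re.compile(r'\bid="[A-Za-z_-]*\d{4,}"')


def normalize(line: str) -> str:
    """Mask timestamps and generated IDs, then collapse whitespace."""
    line = TIMESTAMP.sub("<TIMESTAMP>", line)
    line = GENERATED_ID.sub('id="<ID>"', line)
    return " ".join(line.split())


def meaningful_diff(old: str, new: str, context: int = 2) -> list[str]:
    """Unified diff of the normalized lines, with `context` lines around changes."""
    old_lines = [normalize(line) for line in old.splitlines()]
    new_lines = [normalize(line) for line in new.splitlines()]
    return list(
        difflib.unified_diff(old_lines, new_lines, "old", "new", n=context, lineterm="")
    )
```

Two pages that differ only in masked values produce an empty diff, while a real content change shows up with its surrounding lines.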
Output Formats
Text (default)
Summary statistics, significance breakdown, and detailed changes with context lines.
JSON
Structured output for programmatic processing:
{
"summary": {
"total_changes": 15,
"added": 5,
"removed": 3,
"modified": 7,
"high_significance": 2,
"medium_significance": 8,
"low_significance": 5
},
"changes": [
{
"type": "modified",
"old_text": "...",
"new_text": "...",
"significance": "high"
}
]
}
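A short sketch of consuming this output from Python, assuming the report has exactly the fields shown above. The helper names `high_significance_changes` and `summarize` are hypothetical, and the embedded sample stands in for a file written with `-o`.

```python
import json

# Sample mirroring the documented structure; in practice, load the file
# written by `wayback-diff ... --format json -o diff.json`.
SAMPLE = """
{
  "summary": {
    "total_changes": 15, "added": 5, "removed": 3, "modified": 7,
    "high_significance": 2, "medium_significance": 8, "low_significance": 5
  },
  "changes": [
    {"type": "modified", "old_text": "...", "new_text": "...", "significance": "high"}
  ]
}
"""


def high_significance_changes(report: dict) -> list[dict]:
    """Return only the changes tagged as high significance."""
    return [c for c in report.get("changes", []) if c.get("significance") == "high"]


def summarize(report: dict) -> str:
    """One-line summary built from the report's summary block."""
    s = report["summary"]
    return (
        f"{s['total_changes']} changes: {s['high_significance']} high, "
        f"{s['medium_significance']} medium, {s['low_significance']} low"
    )


report = json.loads(SAMPLE)
print(summarize(report))
for change in high_significance_changes(report):
    print(change["type"], change["old_text"], "->", change["new_text"])
```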
Unified diff
Standard unified diff format, compatible with patch and code review tools.
Comparison with Similar Tools
| Feature | Wayback-Diff | htmldiff | diff2html | BackstopJS | Percy |
|---|---|---|---|---|---|
| HTML-aware semantic diff | Yes | Yes | No | No | No |
| Wayback Machine artifact cleaning | Yes | No | No | No | No |
| Significance scoring | Yes | No | No | No | No |
| Visual (screenshot) comparison | Yes | No | No | Yes | Yes |
| Multi-browser support | Yes | N/A | N/A | Yes | Yes |
| Site-wide crawl and compare | Yes | No | No | Yes | No |
| Markdown report generation | Yes | No | No | No | No |
| CI/CD exit codes | Yes | No | No | Yes | Yes |
| Self-hosted / no SaaS | Yes | Yes | Yes | Yes | No |
| Free and open source | GPL-3.0 | MIT | MIT | MIT | Freemium |
Testing
pip install -r requirements-dev.txt
# Run tests
pytest tests/ -v
# Run with coverage
pytest tests/ -v --cov=wayback_diff --cov-report=html
Contributing
Contributions are welcome. To get started:
- Fork the repository
- Create a feature branch (git checkout -b feature/my-feature)
- Add tests for new functionality
- Ensure all tests pass: pytest tests/ -v
- Submit a Pull Request
Related Web Archiving Tools
- Wayback-Archive — Download complete websites from the Wayback Machine
- Way-CMS — Simple web CMS for editing archived HTML/CSS files
- web-mirror — Mirror any webpage for offline access
- media-download — Download all media files from any web page
License
This project is licensed under the GNU General Public License v3.0 (GPL-3.0). See the LICENSE file for details.
This software is not intended for commercial use.
File details
Details for the file wayback_diff-1.1.0.tar.gz.
File metadata
- Download URL: wayback_diff-1.1.0.tar.gz
- Upload date:
- Size: 34.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ec85f3af2ddf9c0598ef24078b8ba64bf0cd68919f4e0b6aa68612c418f3d198 |
| MD5 | 055f1b03a4fcda537d2cfd496f4e3c00 |
| BLAKE2b-256 | 8b4b89431edb301867f29fccf31654528faea9c8a4c463fee8f8c4fbe7741314 |
File details
Details for the file wayback_diff-1.1.0-py3-none-any.whl.
File metadata
- Download URL: wayback_diff-1.1.0-py3-none-any.whl
- Upload date:
- Size: 35.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 50c4cb6aa51bc90839c553607dea06f24b1928998a2668ec0fa3ef4f28a58368 |
| MD5 | 56ef0e2d40cc49b6b51274e29804699f |
| BLAKE2b-256 | 639f015a91b8a1ffe5f190b690afb64be79c550c7aaca712270892d1ddeb0755 |