Skip to main content

A comprehensive tool for comparing web pages with Wayback Machine support

Project description

Wayback-Diff Banner

Detect meaningful differences between web pages -- with Wayback Machine artifact cleaning, visual comparison, and significance scoring.

Python Versions PyPI Version License: GPL-3.0 Docker


Why Wayback-Diff?

Comparing web pages sounds simple until you deal with Wayback Machine injection artifacts, insignificant whitespace noise, and visual regressions invisible to the DOM. Wayback-Diff is a purpose-built CLI that solves all three:

  • Wayback Machine cleaning -- automatically strips banners, analytics scripts, playback code, and URL rewrites so you compare actual content.
  • Significance scoring -- every change is tagged High, Medium, or Low so you focus on what matters.
  • Multi-browser visual comparison -- captures screenshots in Chrome, Firefox, Edge, and Opera, then generates pixel-diff images.
  • CI/CD-ready exit codes -- integrate directly into pipelines (0 = no changes, 1 = low/medium, 2 = high).

Table of Contents


Quick Start

pip install wayback-diff

# Compare two pages
wayback-diff https://example.com/old https://example.com/new

# Compare a Wayback snapshot with the live site
wayback-diff https://web.archive.org/web/20230101/https://example.com/ https://example.com/

# Full report: visual diff + markdown
wayback-diff https://old.example.com https://new.example.com --visual --markdown

Installation

From PyPI

pip install wayback-diff

# With visual comparison support
pip install wayback-diff[visual]

From source

git clone https://github.com/GeiserX/Wayback-Diff.git
cd Wayback-Diff
python3 -m venv venv
source venv/bin/activate      # Windows: venv\Scripts\activate
pip install -r requirements.txt
pip install -e .

For visual comparison support:

pip install -e ".[visual]"

Docker

docker build -t wayback-diff .
docker run --rm wayback-diff https://example.com/a https://example.com/b

Usage

Basic comparison

wayback-diff https://example.com/page1 https://example.com/page2

Wayback Machine support

The tool automatically detects Wayback Machine URLs and cleans injection artifacts before comparing:

# Archive vs. live site
wayback-diff https://web.archive.org/web/20230101/https://example.com/ https://example.com/

# Two archive snapshots
wayback-diff \
  https://web.archive.org/web/20230101/https://example.com/ \
  https://web.archive.org/web/20230601/https://example.com/

Output formats

# Save to file
wayback-diff url1 url2 -o diff.txt

# JSON (for programmatic consumption)
wayback-diff url1 url2 --format json

# Unified diff
wayback-diff url1 url2 --format unified

Site-wide traversal

# Crawl and compare across linked pages (depth-limited)
wayback-diff url1 url2 --traverse --depth 2

Advanced options

Flag Description
--no-clean-wayback Disable Wayback Machine artifact removal
--no-ignore-whitespace Treat whitespace changes as significant
--timeout N Set HTTP timeout in seconds (default: 30)
--verbose Enable detailed logging

Visual Comparison

Take screenshots in one or more browsers and generate side-by-side difference images:

# Auto-detect all installed browsers
wayback-diff url1 url2 --visual

# Specific browsers
wayback-diff url1 url2 --visual --browsers chrome firefox edge opera

# Custom viewport
wayback-diff url1 url2 --visual --viewport-width 1280 --viewport-height 720

# Non-headless mode (for debugging)
wayback-diff url1 url2 --visual --no-headless

# Custom screenshot output
wayback-diff url1 url2 --visual --screenshot-dir ./my-screenshots

Visual comparison generates:

  • Screenshots of both pages per browser
  • Side-by-side comparison images
  • Pixel-level difference highlighting (red overlay marks changes)

Markdown Reports

Generate comprehensive Markdown reports that include everything in a single reviewable document:

wayback-diff url1 url2 --visual --markdown --report-dir ./reports

Each report contains:

  • Executive summary with change statistics
  • Visual comparison screenshots (when --visual is used)
  • Changes grouped by significance (High / Medium / Low)
  • Site-wide results (when --traverse is used)
  • Actionable recommendations

CI/CD Integration

Wayback-Diff returns meaningful exit codes designed for pipeline gates:

Exit Code Meaning
0 No differences detected
1 Low or medium significance changes
2 High significance changes detected

GitHub Actions example

name: Visual Regression Check
on:
  pull_request:

jobs:
  diff:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install Wayback-Diff
        run: |
          pip install -r requirements.txt
          pip install -e ".[visual]"

      - name: Compare staging vs production
        run: |
          wayback-diff \
            https://staging.example.com \
            https://production.example.com \
            --visual --markdown --format json -o diff.json

      - name: Upload report
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: diff-report
          path: reports/

Shell script gate

wayback-diff "$OLD_URL" "$NEW_URL" --format json -o result.json
EXIT_CODE=$?

if [ $EXIT_CODE -eq 2 ]; then
  echo "BLOCKING: high-significance changes detected"
  exit 1
elif [ $EXIT_CODE -eq 1 ]; then
  echo "WARNING: minor changes detected"
fi

How It Works

Wayback Machine cleaning

When a Wayback Machine URL is detected, the tool automatically:

  1. Removes header artifacts -- strips analytics scripts, playback scripts, and banner CSS injected by the Wayback Machine.
  2. Removes footer comments -- removes archival metadata and copyright notices.
  3. Restores URLs -- converts web.archive.org/web/…/ prefixed URLs back to their originals.
  4. Normalizes content -- handles whitespace and formatting differences introduced by archival.

Significance scoring

Every detected change is categorized:

Level Examples
High Structural changes, content text, meta tags, scripts, stylesheets
Medium Attribute changes, inline styling, div/span modifications
Low Whitespace, comments, minor formatting

Intelligent comparison

The diff engine:

  • Focuses on meaningful content changes
  • Ignores noise like timestamps and auto-generated IDs
  • Provides context around each change
  • Groups results by significance for fast review

Output Formats

Text (default)

Summary statistics, significance breakdown, and detailed changes with context lines.

JSON

Structured output for programmatic processing:

{
  "summary": {
    "total_changes": 15,
    "added": 5,
    "removed": 3,
    "modified": 7,
    "high_significance": 2,
    "medium_significance": 8,
    "low_significance": 5
  },
  "changes": [
    {
      "type": "modified",
      "old_text": "...",
      "new_text": "...",
      "significance": "high"
    }
  ]
}

Unified diff

Standard unified diff format, compatible with patch and code review tools.


Comparison with Similar Tools

Feature Wayback-Diff htmldiff diff2html BackstopJS Percy
HTML-aware semantic diff Yes Yes No No No
Wayback Machine artifact cleaning Yes No No No No
Significance scoring Yes No No No No
Visual (screenshot) comparison Yes No No Yes Yes
Multi-browser support Yes N/A N/A Yes Yes
Site-wide crawl and compare Yes No No Yes No
Markdown report generation Yes No No No No
CI/CD exit codes Yes No No Yes Yes
Self-hosted / no SaaS Yes Yes Yes Yes No
Free and open source GPL-3.0 MIT MIT MIT Freemium

Testing

pip install -r requirements-dev.txt

# Run tests
pytest tests/ -v

# Run with coverage
pytest tests/ -v --cov=website_diff --cov-report=html

Contributing

Contributions are welcome. To get started:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/my-feature)
  3. Add tests for new functionality
  4. Ensure all tests pass: pytest tests/ -v
  5. Submit a Pull Request

Related Web Archiving Tools

  • Wayback-Archive — Download complete websites from the Wayback Machine
  • Way-CMS — Simple web CMS for editing archived HTML/CSS files
  • web-mirror — Mirror any webpage for offline access
  • media-download — Download all media files from any web page

License

This project is licensed under the GNU General Public License v3.0 (GPL-3.0). See the LICENSE file for details.

This software is not intended for commercial use.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

wayback_diff-1.0.0.tar.gz (46.7 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

wayback_diff-1.0.0-py3-none-any.whl (47.7 kB view details)

Uploaded Python 3

File details

Details for the file wayback_diff-1.0.0.tar.gz.

File metadata

  • Download URL: wayback_diff-1.0.0.tar.gz
  • Upload date:
  • Size: 46.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for wayback_diff-1.0.0.tar.gz
Algorithm Hash digest
SHA256 ad82b415a4d4fbe297ca972b707d82d5d0444863351006d6f90ce3b8aee1cf33
MD5 7dd570a2da5b6e2886739d3ff3bbcdc8
BLAKE2b-256 b76dace3a7224a9757bbb310c186c1e2783550ded038ecc59cc7d3ebb8f5d570

See more details on using hashes here.

File details

Details for the file wayback_diff-1.0.0-py3-none-any.whl.

File metadata

  • Download URL: wayback_diff-1.0.0-py3-none-any.whl
  • Upload date:
  • Size: 47.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for wayback_diff-1.0.0-py3-none-any.whl
Algorithm Hash digest
SHA256 012365df6eb6e8e366ea1efe9ac39e9a0cacb99b91634470327087aadf9ce514
MD5 10ec4b0081688c043b9c45f2432a4f8d
BLAKE2b-256 294d4035683f2fcc53fdc26371c1851ff98a5afb285ff7807d7fb5a2504206e9

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page