Wayback Machine utils (web.archive.org)

wayback_utils.py

This module provides a Python interface to the Wayback Machine web page archiving service (web.archive.org). It lets you save URLs, check the status of archiving jobs, and verify whether a URL has already been indexed.

Based on the Save Page Now 2 (SPN2) public API docs.

Main classes:

  • WayBackStatus: Represents the status of an archiving job.
  • WayBackSave: Represents the response when requesting to archive a URL.
  • WayBack: Main class to interact with the Wayback Machine API.

Installation

pip install wayback_utils

  • You need valid access keys (ACCESS_KEY and SECRET_KEY) to use the archiving API.
  • You can provide an on_confirmation callback function to save() to receive the final archiving status asynchronously.
  • The module uses requests and threading.

Basic usage:

Note: You can obtain your ACCESS_KEY and SECRET_KEY from archive.org.

  1. Initialize the WayBack class with your access keys:
    from wayback_utils import WayBack, WayBackStatus, WayBackSave
    
    wb = WayBack(ACCESS_KEY="your_access_key", SECRET_KEY="your_secret_key")
  2. Save a URL:
    result = wb.save("https://example.com")
  3. Check the status of a job:
    status = wb.status(result.job_id)
  4. Verify if a URL is already indexed:
    is_indexed = wb.indexed("https://example.com")

You can also pass a callback function to save() using the on_confirmation parameter. This callback will be called asynchronously with the final result of the archiving operation:

def my_callback(result):
    print("Archiving finished:", result.status)

result = wb.save("https://example.com", on_confirmation=my_callback)
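
Because the callback runs asynchronously, a short script may exit before it fires. The following is a minimal sketch of one way to block until the confirmation arrives, using `threading.Event` from the standard library. The helper name `save_and_wait` and its `wait_seconds` parameter are hypothetical and not part of wayback_utils; the sketch only assumes that `save()` accepts `on_confirmation` as shown above.

```python
import threading


def save_and_wait(wb, url, wait_seconds=300.0):
    """Call wb.save() and block until on_confirmation fires or wait_seconds pass.

    `wb` is expected to be a WayBack instance. This helper is illustrative
    and is not provided by wayback_utils.
    """
    done = threading.Event()
    final = {}

    def on_confirmation(result):
        # Runs on whatever thread the library uses for the confirmation.
        final["result"] = result
        done.set()

    save_response = wb.save(url, on_confirmation=on_confirmation)
    done.wait(wait_seconds)  # returns early as soon as the callback sets the event
    # The second element is None if the callback did not fire in time.
    return save_response, final.get("result")
```

With a real client this would be called as `save_and_wait(wb, "https://example.com")`; the second element of the returned tuple is whatever object the library passes to the callback.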

Warning: URLs archived with the Wayback Machine may take up to 12 hours to become fully indexed and discoverable.

save() parameters:

The save() method takes the URL to archive plus several optional parameters that customize the capture process:

  • url: The URL to be archived.
  • timeout: Maximum time (in seconds) to wait for the archiving operation to complete.
  • capture_all: Set to 1 to capture web pages even if they return HTTP errors (4xx/5xx). By default, only status 200 pages are captured.
  • capture_outlinks: Set to 1 to automatically capture outlinks found on the page (including PDF, JSON, RSS, MRSS).
  • capture_screenshot: Set to 1 to capture a full-page PNG screenshot, stored as a separate capture.
  • delay_wb_availability: Set to 1 to delay capture availability in the Wayback Machine by ~12 hours, reducing system load.
  • force_get: Set to 1 to force a simple HTTP GET request for capture, overriding the default HEAD-based logic.
  • skip_first_archive: Set to 1 to skip checking if this is the first archive, speeding up the process.
  • outlinks_availability: Set to 1 to return the timestamp of the last capture for all outlinks.
  • email_result: Set to 1 to receive an email report of the captured URLs.
  • on_confirmation: A callback function that will be called asynchronously with the final result of the archiving operation.
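
As an illustration of combining these options, here is a small sketch that turns boolean choices into the 1-valued flags described above. The helper `build_save_kwargs` is hypothetical and not part of wayback_utils, and it assumes that `save()` accepts the listed names as keyword arguments.

```python
# Flag names copied from the parameter list above.
SAVE_FLAGS = {
    "capture_all",
    "capture_outlinks",
    "capture_screenshot",
    "delay_wb_availability",
    "force_get",
    "skip_first_archive",
    "outlinks_availability",
    "email_result",
}


def build_save_kwargs(timeout=None, **flags):
    """Return keyword arguments for save(), keeping only the enabled flags.

    Hypothetical helper: it rejects names that are not in the documented
    list and maps each truthy flag to 1.
    """
    unknown = set(flags) - SAVE_FLAGS
    if unknown:
        raise ValueError(f"Unknown save() flags: {sorted(unknown)}")
    kwargs = {name: 1 for name, enabled in flags.items() if enabled}
    if timeout is not None:
        kwargs["timeout"] = timeout
    return kwargs
```

Under the keyword-argument assumption, a call might look like `wb.save("https://example.com", **build_save_kwargs(timeout=30, capture_all=True, capture_screenshot=True))`.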

status() parameters:

The status() method checks the status of an archiving job.

  • job_id: The unique identifier of the archiving job to check.
  • timeout: Maximum time in seconds to wait for the status response.

Returns a WayBackStatus object with details about the job's progress or result.
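
If you prefer polling over the on_confirmation callback, a loop along the following lines may work. This is a sketch under assumptions: it assumes the returned WayBackStatus exposes a `status` attribute and that an unfinished job reports the string "pending", as in the SPN2 API; check the module source before relying on either. The helper `wait_for_job` and its parameters are hypothetical.

```python
import time


def wait_for_job(wb, job_id, poll_interval=5.0, max_polls=60, sleep=time.sleep):
    """Poll wb.status(job_id) until the job is no longer pending.

    Returns the last status object, or None if max_polls is exhausted.
    Assumes (not confirmed by the README) that status.status == "pending"
    means the job is still running.
    """
    for _ in range(max_polls):
        status = wb.status(job_id)
        if getattr(status, "status", None) != "pending":
            return status
        sleep(poll_interval)  # wait before asking again
    return None
```

With a real client this would be called as `wait_for_job(wb, result.job_id)` after `result = wb.save(...)`.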

indexed() parameters:

The indexed() method checks whether a given URL has already been archived and indexed by the Wayback Machine.

  • url: The URL to check for existing archives.
  • timeout: Maximum time in seconds to wait for the response.

Returns True if the URL has at least one valid (HTTP 2xx or 3xx) archived snapshot, otherwise False.
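
One common use of indexed() is to avoid requesting a capture for a URL that already has one. A minimal sketch, using only the `indexed()` and `save()` calls shown earlier (the helper name `archive_if_missing` is hypothetical):

```python
def archive_if_missing(wb, url):
    """Request a capture only when the URL has no indexed snapshot yet.

    Returns the save() response, or None if the URL was already indexed.
    """
    if wb.indexed(url):
        return None
    return wb.save(url)
```

Since new captures may take up to 12 hours to be indexed, a URL saved recently can still be reported as not indexed, so this check may trigger a repeat capture within that window.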

License:

MIT license.
