Python package that interfaces with the Internet Archive's Wayback Machine APIs. Archive pages and retrieve archived pages easily.

A Python package & CLI tool that interfaces with the Wayback Machine API


⭐️ Introduction

Waybackpy is a Python package and a CLI tool that interfaces with the Wayback Machine API.

The Wayback Machine has three client-side APIs: the Save API (also known as SavePageNow), the Availability API, and the CDX Server API.

All three can be accessed through waybackpy, either by importing it in a script or from the CLI.

🏗 Installation

Using pip, from PyPI (recommended):

pip install waybackpy

Install directly from this git repository (NOT recommended):

pip install git+https://github.com/akamhy/waybackpy.git

🐳 Docker Image

Docker Hub: https://hub.docker.com/r/secsi/waybackpy

The Docker image is automatically updated on every release by Regularly and Automatically Updated Docker Images (RAUDI).

RAUDI is a tool by SecSI (https://secsi.io), an Italian cybersecurity startup.

🚀 Usage

As a Python package

Save API aka SavePageNow

>>> from waybackpy import WaybackMachineSaveAPI
>>> url = "https://github.com"
>>> user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
>>>
>>> save_api = WaybackMachineSaveAPI(url, user_agent)
>>> save_api.save()
https://web.archive.org/web/20220118125249/https://github.com/
>>> save_api.cached_save
False
>>> save_api.timestamp()
datetime.datetime(2022, 1, 18, 12, 52, 49)
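The archive URL returned by `save()` and the value returned by `timestamp()` carry the same information: the 14-digit segment of the URL is the capture time in `YYYYMMDDhhmmss` form. A minimal sketch of recovering that time from an archive URL with the standard library only is shown below. The helper name `archive_timestamp` is hypothetical and is not part of waybackpy.

```python
import re
from datetime import datetime

# Matches the 14-digit capture timestamp that follows "/web/" in an archive URL.
_ARCHIVE_TS = re.compile(r"^https?://web\.archive\.org/web/(\d{14})/")


def archive_timestamp(archive_url: str) -> datetime:
    """Parse the YYYYMMDDhhmmss segment of an archive URL (hypothetical helper)."""
    match = _ARCHIVE_TS.match(archive_url)
    if match is None:
        raise ValueError(f"not a Wayback Machine archive URL: {archive_url!r}")
    return datetime.strptime(match.group(1), "%Y%m%d%H%M%S")


# The URL is the one printed by save_api.save() in the session above.
print(archive_timestamp("https://web.archive.org/web/20220118125249/https://github.com/"))
```

This is handy when only the archive URL was stored and the original `save_api` object is no longer available.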

Availability API

>>> from waybackpy import WaybackMachineAvailabilityAPI
>>>
>>> url = "https://google.com"
>>> user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
>>>
>>> availability_api = WaybackMachineAvailabilityAPI(url, user_agent)
>>>
>>> availability_api.oldest()
https://web.archive.org/web/19981111184551/http://google.com:80/
>>>
>>> availability_api.newest()
https://web.archive.org/web/20220118150444/https://www.google.com/
>>>
>>> availability_api.near(year=2010, month=10, day=10, hour=10)
https://web.archive.org/web/20101010101708/http://www.google.com/
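For readers curious about what a `near()` lookup corresponds to on the wire, here is a rough sketch of building a query against the public Availability API endpoint (`https://archive.org/wayback/available`, with `url` and `timestamp` parameters). The helper `availability_query` is hypothetical, and the defaults it uses for unspecified date parts are an assumption of this sketch, not necessarily how waybackpy fills them in.

```python
from datetime import datetime
from urllib.parse import urlencode

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(url, year, month=1, day=1, hour=0, minute=0):
    """Build an Availability API query URL (hypothetical helper, not waybackpy's code)."""
    # datetime() validates the parts, e.g. it rejects month=13.
    moment = datetime(year, month, day, hour, minute)
    # Minute-level precision: a 12-digit YYYYMMDDhhmm timestamp.
    params = {"url": url, "timestamp": moment.strftime("%Y%m%d%H%M")}
    return f"{AVAILABILITY_ENDPOINT}?{urlencode(params)}"


print(availability_query("https://google.com", year=2010, month=10, day=10, hour=10))
```

The function only builds the URL and makes no network request, so it can be used to inspect or log the request before sending it with any HTTP client.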

CDX API aka CDXServerAPI

>>> from waybackpy import WaybackMachineCDXServerAPI
>>> url = "https://pypi.org"
>>> user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
>>> cdx = WaybackMachineCDXServerAPI(url, user_agent, start_timestamp=2016, end_timestamp=2017)
>>> for item in cdx.snapshots():
...     print(item.archive_url)
...
https://web.archive.org/web/20160110011047/http://pypi.org/
https://web.archive.org/web/20160305104847/http://pypi.org/
.
. # URLS REDACTED FOR READABILITY
.
https://web.archive.org/web/20171127171549/https://pypi.org/
https://web.archive.org/web/20171206002737/http://pypi.org:80/
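As an illustration of what the `start_timestamp` and `end_timestamp` arguments above bound, the sketch below builds a request URL for the public CDX server endpoint, which accepts `from` and `to` range parameters. The helper `cdx_query` is hypothetical, and the mapping from waybackpy's argument names to these query parameters is an assumption of the sketch rather than a description of the library's internals.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_query(url, start_timestamp=None, end_timestamp=None):
    """Build a CDX server query URL (hypothetical helper, not waybackpy's code)."""
    params = {"url": url}
    # Timestamps may be partial (e.g. just a year); they are sent as digit strings.
    if start_timestamp is not None:
        params["from"] = str(start_timestamp)
    if end_timestamp is not None:
        params["to"] = str(end_timestamp)
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


print(cdx_query("https://pypi.org", start_timestamp=2016, end_timestamp=2017))
```

No request is sent here; the resulting URL can be opened in a browser to compare the raw CDX rows with the `archive_url` values printed by `cdx.snapshots()`.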

Documentation is at https://github.com/akamhy/waybackpy/wiki/Python-package-docs.

As a CLI tool

Saving a webpage:

waybackpy --save --url "https://en.wikipedia.org/wiki/Social_media" --user_agent "my-unique-user-agent"
Archive URL:
https://web.archive.org/web/20220121193801/https://en.wikipedia.org/wiki/Social_media
Cached save:
False

Retrieving the oldest archive and also printing the JSON response of the availability API:

waybackpy --oldest --json --url "https://en.wikipedia.org/wiki/Humanoid" --user_agent "my-unique-user-agent"
Archive URL:
https://web.archive.org/web/20040415020811/http://en.wikipedia.org:80/wiki/Humanoid
JSON response:
{"url": "https://en.wikipedia.org/wiki/Humanoid", "archived_snapshots": {"closest": {"status": "200", "available": true, "url": "http://web.archive.org/web/20040415020811/http://en.wikipedia.org:80/wiki/Humanoid", "timestamp": "20040415020811"}}, "timestamp": "199401212126"}
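When the `--json` output is consumed by a script, the archive URL and capture timestamp sit under `archived_snapshots.closest`. A small sketch of extracting them with the standard `json` module follows. The helper `closest_snapshot` is hypothetical, and the payload is the response printed above, copied verbatim.

```python
import json

# The JSON response printed by the CLI command above, copied verbatim.
response_text = (
    '{"url": "https://en.wikipedia.org/wiki/Humanoid", '
    '"archived_snapshots": {"closest": {"status": "200", "available": true, '
    '"url": "http://web.archive.org/web/20040415020811/http://en.wikipedia.org:80/wiki/Humanoid", '
    '"timestamp": "20040415020811"}}, "timestamp": "199401212126"}'
)


def closest_snapshot(payload):
    """Return (archive_url, timestamp) or None if no snapshot is reported (hypothetical helper)."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest["url"], closest["timestamp"]


print(closest_snapshot(json.loads(response_text)))
```

Returning `None` for a missing or unavailable snapshot is a design choice of this sketch; it lets callers distinguish "nothing archived" from a malformed response, which would raise instead.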

Retrieving the archive closest to a given time (minute-level precision is supported):

waybackpy --url google.com --user_agent "my-unique-user-agent" --near --year 2008 --month 8 --day 8
Archive URL:
https://web.archive.org/web/20080808014003/http://www.google.com:80/

CLI documentation is at https://github.com/akamhy/waybackpy/wiki/CLI-docs.

🛡 License

License: MIT

Copyright (c) 2020-2022 Akash Mahanty et al.

Released under the MIT License. See license for details.

Download files

Source Distribution

waybackpy-3.0.2.tar.gz (17.7 kB)

Uploaded Source

Built Distribution

waybackpy-3.0.2-py3-none-any.whl (19.2 kB)

Uploaded Python 3
