Skip to main content

Python package that interfaces with the Internet Archive's Wayback Machine APIs. Archive pages and retrieve archived pages easily.

Project description


A Python package & CLI tool that interfaces with the Wayback Machine API

Unit Tests codecov pypi Downloads Codacy Badge GitHub lastest commit PyPI - Python Version Code style: black


Introduction

Waybackpy is a Python package and a CLI tool that interfaces with the Wayback Machine APIs.

Wayback Machine has 3 client side APIs.

  • SavePageNow or Save API
  • CDX Server API
  • Availability API

These three APIs can be accessed via the waybackpy either by importing it from a python file/module or from the command-line interface.

Installation

Using pip, from PyPI (recommended):

pip install waybackpy

Using conda, from conda-forge (recommended):

See also waybackpy feedstock, maintainers are @rafaelrdealmeida, @labriunesp and @akamhy.

conda install -c conda-forge waybackpy

Install directly from this git repository (NOT recommended):

pip install git+https://github.com/akamhy/waybackpy.git

Docker Image

Docker Hub: hub.docker.com/r/secsi/waybackpy

Docker image is automatically updated on every release by Regulary and Automatically Updated Docker Images (RAUDI).

RAUDI is a tool by SecSI, an Italian cybersecurity startup.

Usage

As a Python package

Save API aka SavePageNow

>>> from waybackpy import WaybackMachineSaveAPI
>>> url = "https://github.com"
>>> user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
>>>
>>> save_api = WaybackMachineSaveAPI(url, user_agent)
>>> save_api.save()
https://web.archive.org/web/20220118125249/https://github.com/
>>> save_api.cached_save
False
>>> save_api.timestamp()
datetime.datetime(2022, 1, 18, 12, 52, 49)

CDX API aka CDXServerAPI

>>> from waybackpy import WaybackMachineCDXServerAPI
>>> url = "https://google.com"
>>> user_agent = "my new app's user agent"
>>> cdx_api = WaybackMachineCDXServerAPI(url, user_agent)
oldest
>>> cdx_api.oldest()
com,google)/ 19981111184551 http://google.com:80/ text/html 200 HOQ2TGPYAEQJPNUA6M4SMZ3NGQRBXDZ3 381
>>> oldest = cdx_api.oldest()
>>> oldest
com,google)/ 19981111184551 http://google.com:80/ text/html 200 HOQ2TGPYAEQJPNUA6M4SMZ3NGQRBXDZ3 381
>>> oldest.archive_url
'https://web.archive.org/web/19981111184551/http://google.com:80/'
>>> oldest.original
'http://google.com:80/'
>>> oldest.urlkey
'com,google)/'
>>> oldest.timestamp
'19981111184551'
>>> oldest.datetime_timestamp
datetime.datetime(1998, 11, 11, 18, 45, 51)
>>> oldest.statuscode
'200'
>>> oldest.mimetype
'text/html'
newest
>>> newest = cdx_api.newest()
>>> newest
com,google)/ 20220217234427 http://@google.com/ text/html 301 Y6PVK4XWOI3BXQEXM5WLLWU5JKUVNSFZ 563
>>> newest.archive_url
'https://web.archive.org/web/20220217234427/http://@google.com/'
>>> newest.timestamp
'20220217234427'
near
>>> near = cdx_api.near(year=2010, month=10, day=10, hour=10, minute=10)
>>> near.archive_url
'https://web.archive.org/web/20101010101435/http://google.com/'
>>> near
com,google)/ 20101010101435 http://google.com/ text/html 301 Y6PVK4XWOI3BXQEXM5WLLWU5JKUVNSFZ 391
>>> near.timestamp
'20101010101435'
>>> near.timestamp
'20101010101435'
>>> near = cdx_api.near(wayback_machine_timestamp=2008080808)
>>> near.archive_url
'https://web.archive.org/web/20080808051143/http://google.com/'
>>> near = cdx_api.near(unix_timestamp=1286705410)
>>> near
com,google)/ 20101010101435 http://google.com/ text/html 301 Y6PVK4XWOI3BXQEXM5WLLWU5JKUVNSFZ 391
>>> near.archive_url
'https://web.archive.org/web/20101010101435/http://google.com/'
>>> 
snapshots
>>> from waybackpy import WaybackMachineCDXServerAPI
>>> url = "https://pypi.org"
>>> user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
>>> cdx = WaybackMachineCDXServerAPI(url, user_agent, start_timestamp=2016, end_timestamp=2017)
>>> for item in cdx.snapshots():
...     print(item.archive_url)
...
https://web.archive.org/web/20160110011047/http://pypi.org/
https://web.archive.org/web/20160305104847/http://pypi.org/
.
. # URLS REDACTED FOR READABILITY
.
https://web.archive.org/web/20171127171549/https://pypi.org/
https://web.archive.org/web/20171206002737/http://pypi.org:80/

Availability API

It is recommended to not use the availability API due to performance issues. All the methods of availability API interface class, WaybackMachineAvailabilityAPI, are also implemented in the CDX server API interface class, WaybackMachineCDXServerAPI. Also note that the newest() method of WaybackMachineAvailabilityAPI can be more recent than WaybackMachineCDXServerAPI's same method.

>>> from waybackpy import WaybackMachineAvailabilityAPI
>>>
>>> url = "https://google.com"
>>> user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
>>>
>>> availability_api = WaybackMachineAvailabilityAPI(url, user_agent)
oldest
>>> availability_api.oldest()
https://web.archive.org/web/19981111184551/http://google.com:80/
newest
>>> availability_api.newest()
https://web.archive.org/web/20220118150444/https://www.google.com/
near
>>> availability_api.near(year=2010, month=10, day=10, hour=10)
https://web.archive.org/web/20101010101708/http://www.google.com/

Documentation is at https://github.com/akamhy/waybackpy/wiki/Python-package-docs.

As a CLI tool

Demo video on asciinema.org, you can copy the text from video:

asciicast

CLI documentation is at https://github.com/akamhy/waybackpy/wiki/CLI-docs.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

waybackpy-3.0.6.tar.gz (29.9 kB view hashes)

Uploaded Source

Built Distribution

waybackpy-3.0.6-py3-none-any.whl (34.9 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page