A simple Python wrapper for archive.org's "Save Page Now" capturing service
Installation
$ pipenv install savepagenow
Python Usage
Import it.
>>> import savepagenow
Capture a URL.
>>> archive_url = savepagenow.capture("http://www.example.com/")
See where it’s stored.
>>> print(archive_url)
https://web.archive.org/web/20161018203554/http://www.example.com/
If a URL has been recently cached, archive.org may return the URL to that page rather than conduct a new capture. When that happens, the capture method will raise a CachedPage exception.
This is likely to happen if you request the same URL twice within a few seconds.
>>> savepagenow.capture("http://www.example.com/")
'https://web.archive.org/web/20161019062637/http://www.example.com/'
>>> savepagenow.capture("http://www.example.com/")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "savepagenow/__init__.py", line 36, in capture
    archive_url
savepagenow.exceptions.CachedPage: archive.org returned a cached version of this page: https://web.archive.org/web/20161019062637/http://www.example.com/
You can craft your code to catch that exception yourself, or use the built-in capture_or_cache method, which will return the URL provided by archive.org along with a boolean indicating if it is a fresh capture (True) or from the cache (False).
>>> savepagenow.capture_or_cache("http://www.example.com/")
('https://web.archive.org/web/20161019062832/http://www.example.com/', True)
>>> savepagenow.capture_or_cache("http://www.example.com/")
('https://web.archive.org/web/20161019062832/http://www.example.com/', False)
There’s no accounting for taste, but you could unpack that return value in a single line like so:
>>> url, captured = savepagenow.capture_or_cache("http://www.example.com/")
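The cache-versus-fresh logic above can be exercised without hitting archive.org at all. The sketch below simulates the service's behavior with local stand-ins: `fake_capture`, `fake_capture_or_cache`, the `CachedPage` stub, and the five-second cooldown are all illustrative assumptions, not part of the savepagenow API.

```python
import time

class CachedPage(Exception):
    """Stand-in for savepagenow.exceptions.CachedPage in this sketch."""

# url -> (capture time, archived URL); module-level state for simplicity
_recent_captures = {}
COOLDOWN_SECONDS = 5.0  # illustrative only; archive.org's real window differs

def fake_capture(url):
    """Simulate savepagenow.capture: a repeat request inside the
    cooldown raises CachedPage carrying the cached snapshot URL."""
    now = time.monotonic()
    if url in _recent_captures:
        captured_at, archive_url = _recent_captures[url]
        if now - captured_at < COOLDOWN_SECONDS:
            raise CachedPage(archive_url)
    # A real capture would hit archive.org; here we fabricate the URL.
    archive_url = "https://web.archive.org/web/20161019062832/" + url
    _recent_captures[url] = (now, archive_url)
    return archive_url

def fake_capture_or_cache(url):
    """Simulate savepagenow.capture_or_cache: (archive_url, is_fresh)."""
    try:
        return fake_capture(url), True
    except CachedPage as exc:
        return str(exc), False
```

Calling `fake_capture_or_cache` twice in quick succession returns `True` and then `False` for the freshness flag, mirroring the behavior of the real API shown above.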
Command-line usage
The Python library is also installed as a command-line interface. You can run it from your terminal like so:
$ savepagenow http://www.example.com/
The command has the same options as the Python API, which you can learn about from its help output.
$ savepagenow --help
Usage: savepagenow [OPTIONS] URL
Archives the provided URL using the archive.org Wayback Machine.
Raises a CachedPage exception if archive.org declines to conduct a new
capture and returns a previous snapshot instead.
Options:
-ua, --user-agent TEXT User-Agent header for the web request
-c, --accept-cache Accept and return cached URL
--help Show this message and exit.
Customizing the user agent
In an effort to be transparent and polite to the Internet Archive, all requests made by savepagenow carry a custom user agent that identifies itself as "savepagenow (https://github.com/pastpages/savepagenow)".
You can further customize this setting by using the optional arguments to our API.
Here’s how to do it in Python:
>>> savepagenow.capture("http://www.example.com/", user_agent="my user agent here")
And here’s how to do it from the command line:
$ savepagenow http://www.example.com/ --user-agent "my user agent here"
Hashes for savepagenow-1.1.0-py2.py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | b85c3baf9edbae36e5b6c6823dfbb98b9ea23b399d0884493afb1c5216988d25
MD5 | 9a3b7aeb4116d8f4750486b06c4b2baa
BLAKE2b-256 | a0febf8b66805dab8d94da07412e1700aa7bfed8cbcd491b9db5b9c7ff9a766c