Tool for analysing links within a website archived by archive.org
Project description
Metarchive grew out of the need to analyse sites archived on http://archive.org in order to compute some metrics. Certain procedures repeated themselves over and over again, so I decided to put them into a package. In the spirit of open source I made the project public; maybe someone out there can save some time by using it. If so, great, and feel free to let me know.
Please be reasonable when accessing pages archived by archive.org. It's a great project, and there is no need to put excessive load on its servers.
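One way to stay polite is to space requests out. The following is a minimal sketch, not part of metarchive: the helper name `throttled`, its parameters, and the two-second default are assumptions chosen for illustration, and `fetch` stands for whatever callable you use to process a URL.

```python
import time
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def throttled(
    urls: Iterable[str],
    fetch: Callable[[str], T],
    min_interval: float = 2.0,  # assumed default, pick what suits your use
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[tuple[str, T]]:
    """Call fetch(url) for each URL, leaving at least min_interval
    seconds between the starts of consecutive calls."""
    last_call = None
    for url in urls:
        if last_call is not None:
            # Sleep only for the part of the interval that has not passed yet.
            remaining = min_interval - (clock() - last_call)
            if remaining > 0:
                sleep(remaining)
        last_call = clock()
        yield url, fetch(url)
```

With the analyser from the sample invocation, this could be used as `throttled(snapshot_urls, lambda url: list(analyse(url)))`, where `snapshot_urls` is your own list of archived page URLs. The `sleep` and `clock` parameters exist only so the pacing can be tested without real waiting.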
Below is a sample invocation of the library. For more information, please have a look at the documentation.
from metarchive import create_url_analyser, AnalyseURL, ArchiveLink
analyse: AnalyseURL = create_url_analyser()
links: list[ArchiveLink] = list(
    analyse(
        'http://web.archive.org/web/20210101000012/http://example.com/'
    )
)
print(links)
# Will print something like
[
ArchiveLink(timestamp=datetime.datetime(2020, 11, 30, 23, 42, 54),
original_url=AnyUrl('https://example.com/', scheme='https', host='example.com', tld='com',
host_type='domain', path='/'), category=None),
ArchiveLink(timestamp=datetime.datetime(2021, 2, 1, 0, 5, 5),
original_url=AnyUrl('https://example.com/', scheme='https', host='example.com', tld='com',
host_type='domain', path='/'), category=None),
ArchiveLink(timestamp=datetime.datetime(2020, 12, 30, 23, 54, 41),
original_url=AnyUrl('http://example.com/', scheme='http', host='example.com', tld='com',
host_type='domain', path='/'), category=None),
ArchiveLink(timestamp=datetime.datetime(2021, 1, 2, 0, 7, 38),
original_url=AnyUrl('http://example.com/', scheme='http', host='example.com', tld='com',
host_type='domain', path='/'), category=None),
ArchiveLink(timestamp=datetime.datetime(2019, 12, 31, 23, 45, 1),
original_url=AnyUrl('http://example.com/', scheme='http', host='example.com', tld='com',
host_type='domain', path='/'), category=None),
ArchiveLink(timestamp=datetime.datetime(2022, 1, 1, 0, 19, 39),
original_url=AnyUrl('https://example.com/', scheme='https', host='example.com', tld='com',
host_type='domain', path='/'), category=None),
ArchiveLink(timestamp=datetime.datetime(2021, 1, 1, 0, 0, 12),
original_url=AnyUrl('http://web.archive.org/screenshot/http://example.com/', scheme='http',
host='web.archive.org', tld='org', host_type='domain',
path='/screenshot/http://example.com/'), category=None),
ArchiveLink(timestamp=datetime.datetime(2021, 1, 1, 0, 0, 12),
original_url=AnyUrl('http://example.com/', scheme='http', host='example.com', tld='com',
host_type='domain', path='/'), category=None),
ArchiveLink(timestamp=datetime.datetime(2021, 1, 1, 0, 0, 12),
original_url=AnyUrl('http://example.com/', scheme='http', host='example.com', tld='com',
host_type='domain', path='/'), category=None),
ArchiveLink(timestamp=datetime.datetime(2021, 1, 1, 0, 0, 12),
original_url=AnyUrl('https://www.iana.org/domains/example', scheme='https', host='www.iana.org',
tld='org', host_type='domain', path='/domains/example'), category=None)
]
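Since the stated goal is to derive metrics from archived pages, here is a small sketch of how a result list like the one above might be tallied. It does not import metarchive: `Link` is a hypothetical stand-in that only mirrors the `timestamp`, `original_url` and `category` fields visible in the output, and the function names are made up for this example.

```python
from collections import Counter
from datetime import datetime
from typing import Iterable, NamedTuple, Optional
from urllib.parse import urlsplit


class Link(NamedTuple):
    """Hypothetical stand-in for the ArchiveLink fields shown above."""
    timestamp: datetime
    original_url: str
    category: Optional[str] = None


def links_per_host(links: Iterable[Link]) -> Counter:
    # str() keeps this working if original_url is a URL object, not a str.
    return Counter(urlsplit(str(link.original_url)).hostname for link in links)


def snapshots_per_year(links: Iterable[Link]) -> Counter:
    return Counter(link.timestamp.year for link in links)


# Three entries copied from the sample output above.
sample = [
    Link(datetime(2020, 11, 30, 23, 42, 54), "https://example.com/"),
    Link(datetime(2021, 1, 1, 0, 0, 12),
         "http://web.archive.org/screenshot/http://example.com/"),
    Link(datetime(2021, 1, 1, 0, 0, 12),
         "https://www.iana.org/domains/example"),
]

hosts = links_per_host(sample)
years = snapshots_per_year(sample)
```

Because the functions only read attributes, they should also accept real `ArchiveLink` objects, assuming those expose the fields under the same names as in the printed representation.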
Download files
File details
Details for the file metarchive-0.6.1.tar.gz.
File metadata
- Download URL: metarchive-0.6.1.tar.gz
- Upload date:
- Size: 23.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.1.13 CPython/3.10.2 Linux/5.13.0-35-generic
File hashes
Algorithm | Hash digest
---|---
SHA256 | ec1e64bde8d8c46dc04fc8fbb62c27b9f91b6bcb200dfd9e7055a9c2800de47e
MD5 | 7ae6f82f4264e6cb56ffe68e9bf7eb5c
BLAKE2b-256 | 1ff3ba72f0e46a47bda34c92c3d0108e58d29a3ec26c51f82700ee2f8b9c545e
File details
Details for the file metarchive-0.6.1-py3-none-any.whl.
File metadata
- Download URL: metarchive-0.6.1-py3-none-any.whl
- Upload date:
- Size: 40.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.1.13 CPython/3.10.2 Linux/5.13.0-35-generic
File hashes
Algorithm | Hash digest
---|---
SHA256 | 3d448d99e98d1a4cef52f8318d47b410bed475a53875df1f3b356f0267606515
MD5 | 950c59c343e71a15ee30a55f7bc9f8d3
BLAKE2b-256 | 0e4495edbb5fd1bad0337021471c558badb46c36f12eeb72852e3a3e1e4dad2f