Tool for analysing links within a website archived by archive.org

Project description

Metarchive grew out of the need to analyse several sites archived at http://archive.org and derive metrics from them. Certain procedures repeated themselves over and over, so I put them into a package. In the spirit of open source I made the project public; maybe someone out there can save some effort by using it. If so, great. Let me know if you like.

Please be reasonable when accessing pages archived by archive.org. It's a great project, and there is no need to put excessive load on their servers.
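
One way to stay polite when analysing many snapshots in a row is to space the requests out. The following is only a minimal sketch, not part of metarchive: the helper name `throttled`, its parameters, and the five-second default are assumptions chosen for illustration.

```python
import time
from typing import Callable, Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")


def throttled(
    items: Iterable[T],
    min_interval: float = 5.0,  # assumed default, pick what seems fair
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[T]:
    """Yield items so that consecutive items are at least
    min_interval seconds apart (hypothetical helper)."""
    last: Optional[float] = None
    for item in items:
        if last is not None:
            # Wait only for the part of the interval that has not passed yet.
            wait = min_interval - (clock() - last)
            if wait > 0:
                sleep(wait)
        last = clock()
        yield item
```

Wrapping a list of snapshot URLs in `throttled(...)` and analysing each URL inside the loop keeps the pause between requests, as long as each iteration triggers one request. The `sleep` and `clock` parameters exist only so the helper can be tested without real waiting.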

Below is a sample invocation of the library. For more information, please have a look at the documentation.

from metarchive import create_url_analyser, AnalyseURL, ArchiveLink

analyse: AnalyseURL = create_url_analyser()
links: list[ArchiveLink] = list(
    analyse(
        'http://web.archive.org/web/20210101000012/http://example.com/'
    )
)
print(links)
# Prints something like:
[
    ArchiveLink(timestamp=datetime.datetime(2020, 11, 30, 23, 42, 54),
                original_url=AnyUrl('https://example.com/', scheme='https', host='example.com', tld='com',
                                    host_type='domain', path='/'), category=None),
    ArchiveLink(timestamp=datetime.datetime(2021, 2, 1, 0, 5, 5),
                original_url=AnyUrl('https://example.com/', scheme='https', host='example.com', tld='com',
                                    host_type='domain', path='/'), category=None),
    ArchiveLink(timestamp=datetime.datetime(2020, 12, 30, 23, 54, 41),
                original_url=AnyUrl('http://example.com/', scheme='http', host='example.com', tld='com',
                                    host_type='domain', path='/'), category=None),
    ArchiveLink(timestamp=datetime.datetime(2021, 1, 2, 0, 7, 38),
                original_url=AnyUrl('http://example.com/', scheme='http', host='example.com', tld='com',
                                    host_type='domain', path='/'), category=None),
    ArchiveLink(timestamp=datetime.datetime(2019, 12, 31, 23, 45, 1),
                original_url=AnyUrl('http://example.com/', scheme='http', host='example.com', tld='com',
                                    host_type='domain', path='/'), category=None),
    ArchiveLink(timestamp=datetime.datetime(2022, 1, 1, 0, 19, 39),
                original_url=AnyUrl('https://example.com/', scheme='https', host='example.com', tld='com',
                                    host_type='domain', path='/'), category=None),
    ArchiveLink(timestamp=datetime.datetime(2021, 1, 1, 0, 0, 12),
                original_url=AnyUrl('http://web.archive.org/screenshot/http://example.com/', scheme='http',
                                    host='web.archive.org', tld='org', host_type='domain',
                                    path='/screenshot/http://example.com/'), category=None),
    ArchiveLink(timestamp=datetime.datetime(2021, 1, 1, 0, 0, 12),
                original_url=AnyUrl('http://example.com/', scheme='http', host='example.com', tld='com',
                                    host_type='domain', path='/'), category=None),
    ArchiveLink(timestamp=datetime.datetime(2021, 1, 1, 0, 0, 12),
                original_url=AnyUrl('http://example.com/', scheme='http', host='example.com', tld='com',
                                    host_type='domain', path='/'), category=None),
    ArchiveLink(timestamp=datetime.datetime(2021, 1, 1, 0, 0, 12),
                original_url=AnyUrl('https://www.iana.org/domains/example', scheme='https', host='www.iana.org',
                                    tld='org', host_type='domain', path='/domains/example'), category=None)
]
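
A result list like the one above can be turned into simple metrics with the standard library. The sketch below counts links per host. It does not import metarchive: `LinkRecord` is a hypothetical stand-in that only mirrors the `timestamp` and `original_url` fields visible in the output, and it assumes that `str()` of the real `original_url` value gives the plain URL.

```python
from collections import Counter
from datetime import datetime
from typing import Iterable, NamedTuple
from urllib.parse import urlsplit


class LinkRecord(NamedTuple):
    """Hypothetical stand-in with the two fields shown in the sample output."""
    timestamp: datetime
    original_url: str


def count_hosts(links: Iterable) -> Counter:
    """Count how many links point at each host."""
    # str() is used so that URL objects and plain strings are handled alike.
    return Counter(urlsplit(str(link.original_url)).hostname for link in links)


sample = [
    LinkRecord(datetime(2021, 1, 1, 0, 0, 12), "http://example.com/"),
    LinkRecord(datetime(2021, 1, 1, 0, 0, 12), "https://www.iana.org/domains/example"),
    LinkRecord(datetime(2020, 11, 30, 23, 42, 54), "https://example.com/"),
]
host_counts = count_hosts(sample)
print(host_counts.most_common())
```

The same pattern works for other groupings, for example counting by `link.timestamp.year` instead of by host.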

Download files

Source Distribution

metarchive-0.6.1.tar.gz (23.1 kB view details)

Uploaded Source

Built Distribution

metarchive-0.6.1-py3-none-any.whl (40.2 kB view details)

Uploaded Python 3

File details

Details for the file metarchive-0.6.1.tar.gz.

File metadata

  • Download URL: metarchive-0.6.1.tar.gz
  • Size: 23.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.1.13 CPython/3.10.2 Linux/5.13.0-35-generic

File hashes

Hashes for metarchive-0.6.1.tar.gz
Algorithm Hash digest
SHA256 ec1e64bde8d8c46dc04fc8fbb62c27b9f91b6bcb200dfd9e7055a9c2800de47e
MD5 7ae6f82f4264e6cb56ffe68e9bf7eb5c
BLAKE2b-256 1ff3ba72f0e46a47bda34c92c3d0108e58d29a3ec26c51f82700ee2f8b9c545e
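
A downloaded file can be checked against the published SHA256 digest with Python's `hashlib`. This is a minimal sketch; the function names `sha256_of` and `matches_published` are made up for illustration, and the file path is whatever location the archive was saved to.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of(path, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # Reading in chunks keeps memory use flat for large files.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published hex digest."""
    return hmac.compare_digest(sha256_of(path), expected_hex.strip().lower())
```

Calling `matches_published("metarchive-0.6.1.tar.gz", "ec1e64bde8d8c46dc04fc8fbb62c27b9f91b6bcb200dfd9e7055a9c2800de47e")` on the downloaded source distribution should return `True` if the file is intact.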

File details

Details for the file metarchive-0.6.1-py3-none-any.whl.

File metadata

  • Download URL: metarchive-0.6.1-py3-none-any.whl
  • Size: 40.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.1.13 CPython/3.10.2 Linux/5.13.0-35-generic

File hashes

Hashes for metarchive-0.6.1-py3-none-any.whl
Algorithm Hash digest
SHA256 3d448d99e98d1a4cef52f8318d47b410bed475a53875df1f3b356f0267606515
MD5 950c59c343e71a15ee30a55f7bc9f8d3
BLAKE2b-256 0e4495edbb5fd1bad0337021471c558badb46c36f12eeb72852e3a3e1e4dad2f
