Turn URLs in Markdown files into archive.org snapshots

archive-md-urls: Turn URLs into archive.org snapshots in Markdown

archive-md-urls scans Markdown files for URLs and if possible turns them into links to snapshots from archive.org. If a publication date can be extracted from the file (more info), the snapshots closest to this date will be used. If no date can be found, the latest available snapshots are used instead.
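As a rough illustration of the date-based lookup, the sketch below builds a query for the public Wayback Machine availability endpoint and reads the closest snapshot from a decoded response. Whether archive-md-urls uses this exact endpoint is an assumption; `build_query`, `closest_snapshot` and the sample response are made up for illustration.

```python
from datetime import date
from typing import Optional
from urllib.parse import urlencode

# Public Wayback Machine availability endpoint (an assumption here; the tool
# may query the Wayback Machine differently).
AVAILABILITY_API = "https://archive.org/wayback/available"


def build_query(url: str, published: Optional[date] = None) -> str:
    """Return an availability query, optionally pinned to a publication date."""
    params = {"url": url}
    if published is not None:
        # The endpoint accepts timestamps such as 20131106 (YYYYMMDD).
        params["timestamp"] = published.strftime("%Y%m%d")
    return f"{AVAILABILITY_API}?{urlencode(params)}"


def closest_snapshot(response: dict) -> Optional[str]:
    """Extract the snapshot URL from a decoded JSON response, if there is one."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


# Hypothetical response, shaped like the endpoint's JSON output.
sample = {
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20131106211912/http://www.example.com/",
            "timestamp": "20131106211912",
            "status": "200",
        }
    }
}

query = build_query("http://www.example.com/", date(2013, 11, 6))
snapshot = closest_snapshot(sample)
```

Leaving out the publication date simply omits the timestamp parameter, which corresponds to the "latest available snapshot" case described above.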

This is useful when your personal homepage is built with a static site generator that supports Markdown for blog posts and pages, e.g. Pelican, Jekyll or Hugo. Content published years ago is likely to suffer from link rot: links that are broken or that now point to a different target than when you wrote the content. Ideally, archive-md-urls not only fixes these URLs but also links to a snapshot that shows what a website or social media profile/post looked like when you wrote the content.

archive-md-urls tries to be smart and does not simply replace every URL it finds. Instead, it uses a list of URLs which are considered 'stable' and are therefore ignored: URLs that already point to archive.org snapshots, intra-site links (e.g. a link to another blogpost on the same homepage) and URLs that contain persistent identifiers.
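A minimal sketch of such a stability check might look like the following. The host list and the intra-site prefixes are hypothetical and only cover the cases named above; the tool's actual list is not reproduced here.

```python
from urllib.parse import urlparse

# Hypothetical hosts whose URLs are treated as stable (archive.org snapshots
# and one persistent-identifier resolver).
STABLE_HOSTS = ("archive.org", "doi.org")

# Hypothetical markers for intra-site links in Pelican, Jekyll and Hugo.
INTRA_SITE_PREFIXES = ("{filename}", "{static}", "{%", "{{")


def is_stable(url: str) -> bool:
    """Return True if the URL should be left untouched."""
    if url.startswith(INTRA_SITE_PREFIXES):
        return True
    # hostname is None for URLs without a scheme, so fall back to "".
    host = urlparse(url).hostname or ""
    return any(host == h or host.endswith("." + h) for h in STABLE_HOSTS)
```

Matching on the host name rather than on a substring avoids treating a URL as stable just because "archive.org" appears somewhere in its path.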

Example showcase

Input file example_blogpost.md:

Title: Example blog post
author: Stefan
date: 2013-11-06

This fake blog post from 2013 links to [example.com](http://www.example.com/), a homepage that has dramatically changed in the meantime.

But it also links to URLs which can be considered 'stable':

- [here](https://web.archive.org/web/20000622042643/http://www.google.com/) we already link to an archive.org snapshot
- [here](https://doi.org/10.1080/32498327493.2014.358732798) the link contains a persistent identifier
- and [here]({filename}/blog/2012/2012-02-05-an-even-older-blogpost.md) we link to a different post on our own homepage (Pelican format, Jekyll and Hugo intra-site links are supported too)

In addition, google.com is mentioned but not explicitly linked.

And finally, [here](www.some-madeup-link-that-hasnt-been-archived.com) we link to a homepage that doesn't have any corresponding archive.org snapshots.

Output from archive-md-urls example_blogpost.md:

Title: Example blog post
author: Stefan
date: 2013-11-06

This fake blog post from 2013 links to [example.com](http://web.archive.org/web/20131106211912/http://www.example.com/), a homepage that has dramatically changed in the meantime.

But it also links to URLs which can be considered 'stable':

- [here](https://web.archive.org/web/20000622042643/http://www.google.com/) we already link to an archive.org snapshot
- [here](https://doi.org/10.1080/32498327493.2014.358732798) the link contains a persistent identifier
- and [here]({filename}/blog/2012/2012-02-05-an-even-older-blogpost.md) we link to a different post on our own homepage (Pelican format, Jekyll and Hugo intra-site links are supported too)

In addition, google.com is mentioned but not explicitly linked.

And finally, [here](www.some-madeup-link-that-hasnt-been-archived.com) we link to a homepage that doesn't have any corresponding archive.org snapshots.

Note how only the first link to example.com has been altered and points to a snapshot close in time to the publication date of this fake blog post (6th November 2013). Also note that URLs which are mentioned but not explicitly linked are ignored.
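The distinction between linked and merely mentioned URLs can be sketched with a regular expression that only matches inline links of the form [text](url). This is a simplified illustration, not the tool's actual parser; `replace_links` and the `snapshots` mapping are hypothetical names.

```python
import re

# Matches inline links such as [text](url). Bare URLs in running text are
# deliberately not matched. Simplified: no link titles, no nested brackets.
INLINE_LINK = re.compile(r"\[([^\]]*)\]\(([^)\s]+)\)")


def replace_links(markdown: str, snapshots: dict) -> str:
    """Swap link targets for which a snapshot is known; leave the rest alone."""

    def swap(match):
        text, url = match.group(1), match.group(2)
        # Fall back to the original URL when no snapshot is available.
        return f"[{text}]({snapshots.get(url, url)})"

    return INLINE_LINK.sub(swap, markdown)


source = (
    "See [example.com](http://www.example.com/), google.com "
    "and [here](www.some-madeup-link-that-hasnt-been-archived.com)."
)
snapshots = {
    "http://www.example.com/": "http://web.archive.org/web/20131106211912/http://www.example.com/"
}
updated = replace_links(source, snapshots)
```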

Install

You can install archive-md-urls via pip:

> pip install archive-md-urls

However, installing with pipx is recommended:

> pipx install archive-md-urls

Usage

Important: archive-md-urls modifies your files in place. Keep the files you want to change under version control so you can review the changes.

Once installed, you can pass any number of Markdown files or directories containing Markdown files to archive-md-urls:

# Update two files
> archive-md-urls my-file.md another-file.md
Updated 13 links in 2 files.
# Update files in a directory
> archive-md-urls myblog/content/blog/2014
Updated 97 links in 20 files.
# You can also combine files and directories
> archive-md-urls myblog/content/blog/2014 my-file.md
Updated 103 links in 21 files.

By default, directories are not searched recursively for Markdown files. For recursive search, use the -r flag (use this with caution!):

# Update URLs in all Markdown files of myblog
> archive-md-urls -r myblog/content
Updated 160 links in 32 files.

Note that Markdown files are identified by the file ending .md; files with other endings are ignored.
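The file selection rules above can be sketched with pathlib. This is an illustration under the stated rules, not the tool's own code; `collect_markdown_files` is a hypothetical helper.

```python
from pathlib import Path
from typing import Iterable, List


def collect_markdown_files(paths: Iterable[str], recursive: bool = False) -> List[Path]:
    """Return the .md files named directly or found in the given directories."""
    found: List[Path] = []
    for raw in paths:
        path = Path(raw)
        if path.is_dir():
            # "*.md" stays in the directory itself; "**/*.md" also descends
            # into subdirectories.
            pattern = "**/*.md" if recursive else "*.md"
            found.extend(sorted(p for p in path.glob(pattern) if p.is_file()))
        elif path.is_file() and path.suffix == ".md":
            found.append(path)
    return found
```

Calling `collect_markdown_files(["myblog/content"], recursive=True)` would correspond to the -r example above.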

How publication dates are detected

First, archive-md-urls checks for a date field in Markdown metadata blocks (see Python Markdown's format and YAML front matter). If that fails, it tries to extract a date from the name of the file, following Jekyll's naming convention where blog posts are named YEAR-MONTH-DAY-title.

Currently, archive-md-urls recognizes dates in the formats YYYY-MM-DD and YYYY-MM-DD hh:mm.
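The two-step detection can be sketched as follows: look for a date field first, then fall back to a Jekyll-style file name. This is a simplified illustration with hypothetical helper names; for brevity it searches the whole text for a "date:" line instead of parsing only the metadata block.

```python
import re
from datetime import datetime
from typing import Optional

# A date with or without a time of day.
DATE_FORMATS = ("%Y-%m-%d %H:%M", "%Y-%m-%d")
METADATA_DATE = re.compile(r"^date:[ \t]*(.+?)[ \t]*$", re.IGNORECASE | re.MULTILINE)
FILENAME_DATE = re.compile(r"^(\d{4}-\d{2}-\d{2})-")


def parse_date(value: str) -> Optional[datetime]:
    """Try each supported format in turn; return None if none matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    return None


def detect_date(text: str, filename: str) -> Optional[datetime]:
    """Prefer a date field in the text, then fall back to the file name."""
    match = METADATA_DATE.search(text)
    if match:
        parsed = parse_date(match.group(1))
        if parsed is not None:
            return parsed
    match = FILENAME_DATE.match(filename)
    return parse_date(match.group(1)) if match else None
```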

A note about speed

archive-md-urls uses asyncio with HTTPX to make asynchronous API calls. However, do not expect fast results, especially (but not only) when you try to change a large number of URLs. The Wayback Machine API can be slow or even unavailable. If archive-md-urls has to cancel the operation because of that, just re-run it on the same files later. Links that have already been updated will be skipped because archive.org links are considered stable.
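The general pattern of issuing many lookups concurrently can be sketched with asyncio alone. In this illustration `lookup_snapshot` is a stand-in that simulates latency instead of calling the Wayback Machine (a real version would use an HTTP client such as HTTPX), and the concurrency limit is an assumption, not a documented setting of the tool.

```python
import asyncio
from typing import Dict, List, Optional


async def lookup_snapshot(url: str) -> Optional[str]:
    """Stand-in for a real HTTP request; returns a fake snapshot URL or None."""
    await asyncio.sleep(0.01)  # simulate network latency
    if "example.com" in url:
        return f"http://web.archive.org/web/20131106211912/{url}"
    return None


async def lookup_all(urls: List[str], limit: int = 5) -> Dict[str, Optional[str]]:
    """Look up all URLs concurrently, with at most `limit` lookups in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(url: str) -> Optional[str]:
        async with semaphore:
            return await lookup_snapshot(url)

    # gather() returns results in the same order as the input URLs.
    results = await asyncio.gather(*(bounded(u) for u in urls))
    return dict(zip(urls, results))


results = asyncio.run(
    lookup_all(["http://www.example.com/", "http://unarchived.invalid/"])
)
```

Bounding the number of simultaneous requests is one way to be gentle with a slow API; total run time is still dominated by how quickly the remote service answers.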

Download files

Source Distribution

archive-md-urls-0.0.1.tar.gz (16.0 kB view details)

Uploaded Source

Built Distribution

archive_md_urls-0.0.1-py3-none-any.whl (16.1 kB view details)

Uploaded Python 3

File details

Details for the file archive-md-urls-0.0.1.tar.gz.

File metadata

  • Download URL: archive-md-urls-0.0.1.tar.gz
  • Upload date:
  • Size: 16.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/4.6.1 pkginfo/1.7.1 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.61.2 CPython/3.9.6

File hashes

Hashes for archive-md-urls-0.0.1.tar.gz
Algorithm Hash digest
SHA256 3187001d18204ddf5444410083966f34956a83075dd7ab6f21c2946d95d33b6e
MD5 67947d375f1de115daa5ea039f141654
BLAKE2b-256 f2df341464cd856141b4071e1b5fad5a9fe2929fefbbf3240c88e9cdae1f52ca

File details

Details for the file archive_md_urls-0.0.1-py3-none-any.whl.

File metadata

  • Download URL: archive_md_urls-0.0.1-py3-none-any.whl
  • Upload date:
  • Size: 16.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/4.6.1 pkginfo/1.7.1 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.61.2 CPython/3.9.6

File hashes

Hashes for archive_md_urls-0.0.1-py3-none-any.whl
Algorithm Hash digest
SHA256 3da0e9692755e0406959ac1e205e413ba5d940bfb5036da10e5bf02226c07213
MD5 fe0006404177a5c70dc463375100b43b
BLAKE2b-256 b58a71508d612f6b1c58616c1fa53965eb1c04b94ec67f8fa178e3d7bb2a9d1c
