archive-md-urls: Turn URLs in Markdown files into archive.org snapshots
archive-md-urls scans Markdown files for URLs and, where possible, turns them into links to snapshots from archive.org. If a publication date can be extracted from the file, the snapshots closest to this date are used. If no date can be found, the latest available snapshots are used instead.
This is very useful when you use a static site generator that supports Markdown for writing blog posts and pages on your personal homepage, e.g. Pelican, Jekyll, Hugo or Zola. Content published years ago is likely to suffer from link rot: links that are broken or that now point to a different target than when you wrote the content. In an ideal scenario, archive-md-urls will not only fix these URLs, but also link to a snapshot that shows what a website or social media profile/post looked like when you wrote the content.
archive-md-urls tries to be smart and does not simply replace every URL it finds. Instead, it uses a list of URLs which are considered 'stable' and are therefore ignored: URLs that already point to archive.org snapshots, intra-site links (e.g. a link to another blogpost on the same homepage) and URLs that contain persistent identifiers.
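The idea of a 'stable' URL can be illustrated with a minimal sketch. This is not the tool's actual implementation: the function name `is_stable`, the host list `STABLE_HOSTS` and the prefix check for intra-site links are assumptions made for illustration, and the real list of stable URLs is likely longer.

```python
from urllib.parse import urlparse

# Hypothetical list of hosts treated as 'stable'; the tool's real list may differ.
STABLE_HOSTS = {"web.archive.org", "archive.org", "doi.org"}


def is_stable(url: str) -> bool:
    """Return True if the URL should be left untouched (illustrative only)."""
    # Intra-site links often start with a template marker such as Pelican's
    # {filename}, or with "/" or "#". Only these few forms are covered here.
    if url.startswith(("{", "/", "#")):
        return True
    # Links without a scheme (e.g. "www.example.com") are parsed as hosts.
    parsed = urlparse(url if "://" in url else f"//{url}")
    host = (parsed.hostname or "").lower()
    # Match the listed hosts and their subdomains.
    subdomain_suffixes = tuple("." + h for h in STABLE_HOSTS)
    return host in STABLE_HOSTS or host.endswith(subdomain_suffixes)
```

With this sketch, `is_stable("https://doi.org/10.1080/32498327493.2014.358732798")` is true, while `is_stable("http://www.example.com/")` is false and the link would be a candidate for replacement.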
Example showcase
Input file example_blogpost.md:
Title: Example blog post
author: Stefan
date: 2013-11-06
This fake blog post from 2013 links to [example.com](http://www.example.com/), a homepage that has dramatically changed in the meantime.
But it also links to URLs which can be considered 'stable':
- [here](https://web.archive.org/web/20000622042643/http://www.google.com/) we already link to an archive.org snapshot
- [here](https://doi.org/10.1080/32498327493.2014.358732798) the link contains a persistent identifier
- and [here]({filename}/blog/2012/2012-02-05-an-even-older-blogpost.md) we link to a different post on our own homepage (Pelican format, Jekyll, Hugo and Zola intra-site links are supported too)
In addition, google.com is mentioned but not explicitly linked.
And finally, [here](www.some-madeup-link-that-hasnt-been-archived.com) we link to a homepage that doesn't have any corresponding archive.org snapshots.
Output from archive-md-urls example_blogpost.md:
Title: Example blog post
author: Stefan
date: 2013-11-06
This fake blog post from 2013 links to [example.com](http://web.archive.org/web/20131106211912/http://www.example.com/), a homepage that has dramatically changed in the meantime.
But it also links to URLs which can be considered 'stable':
- [here](https://web.archive.org/web/20000622042643/http://www.google.com/) we already link to an archive.org snapshot
- [here](https://doi.org/10.1080/32498327493.2014.358732798) the link contains a persistent identifier
- and [here]({filename}/blog/2012/2012-02-05-an-even-older-blogpost.md) we link to a different post on our own homepage (Pelican format, Jekyll, Hugo and Zola intra-site links are supported too)
In addition, google.com is mentioned but not explicitly linked.
And finally, [here](www.some-madeup-link-that-hasnt-been-archived.com) we link to a homepage that doesn't have any corresponding archive.org snapshots.
Note how only the first link to example.com has been altered and points to a snapshot close in time to the publication date of this fake blog post (6th November 2013). Also note that URLs which are mentioned but not explicitly linked are ignored.
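The date-based lookup in the example above can be sketched roughly as follows. This is an illustration under assumptions, not the tool's code: it assumes the date appears as a `date: YYYY-MM-DD` metadata line and that the public Wayback Machine availability endpoint is queried. The function names are hypothetical, and no network request is made here.

```python
import re
from datetime import date
from typing import Optional
from urllib.parse import urlencode

# Public Wayback Machine availability endpoint.
AVAILABILITY_API = "https://archive.org/wayback/available"

# Matches a metadata line such as "date: 2013-11-06" (assumed format).
DATE_RE = re.compile(
    r"^date:\s*(\d{4})-(\d{2})-(\d{2})", re.IGNORECASE | re.MULTILINE
)


def extract_timestamp(markdown: str) -> Optional[str]:
    """Return the publication date as YYYYMMDD, or None if none is found."""
    match = DATE_RE.search(markdown)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    return date(year, month, day).strftime("%Y%m%d")


def build_query(url: str, timestamp: Optional[str]) -> str:
    """Build the availability query; without a timestamp the API is
    expected to return the most recent snapshot."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return f"{AVAILABILITY_API}?{urlencode(params)}"


def closest_snapshot(response: dict) -> Optional[str]:
    """Pick the snapshot URL out of a decoded JSON response, if any."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None
```

For the example post, `extract_timestamp` yields `20131106`, which is passed as the `timestamp` parameter. A URL without any snapshot produces an empty `archived_snapshots` object, so `closest_snapshot` returns `None` and the link stays unchanged.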
Install
Since you probably won't need archive-md-urls frequently, the recommended way to use it is via uvx or pipx run. Both methods install and run packages in a temporary virtual environment that will be discarded automatically after some time. If you're unfamiliar with these tools, I recommend using uv. Here is how you would convert URLs in a single file with each of them (see the Usage section of this readme for details):
# uv
uvx archive-md-urls file.md
# pipx
pipx run archive-md-urls file.md
You can also install archive-md-urls permanently using these tools:
# uv
uv tool install archive-md-urls
# pipx
pipx install archive-md-urls
If you don't want to use any additional tool for installing Python packages, you can simply install archive-md-urls via pip (preferably in a virtual environment!):
python -m pip install archive-md-urls
Usage
Important: archive-md-urls modifies your files directly in-place. It is recommended that the files you want to change are under version control so you can review the changes.
Once installed, you can pass any number of Markdown files or directories containing Markdown files to archive-md-urls:
# Update two files
archive-md-urls my-file.md another-file.md
Updated 13 links in 2 files.
# Update files in a directory
archive-md-urls myblog/content/blog/2014
Updated 97 links in 20 files.
# You can also combine files and directories
archive-md-urls myblog/content/blog/2014 my-file.md
Updated 103 links in 21 files.
By default, directories are not searched recursively for Markdown files. For recursive search, use the -r flag (use this with caution!):
# Update URLs in all Markdown files of myblog
archive-md-urls -r myblog/content
Updated 160 links in 32 files.
Note that Markdown files are identified by the file ending .md; files with other endings are ignored.
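The file selection rules described above (explicit files, directories, optional recursion, `.md` only) could be sketched like this. The function name `collect_markdown_files` and its parameters are hypothetical and are not taken from the tool's source.

```python
from pathlib import Path
from typing import Iterable, List, Set


def collect_markdown_files(paths: Iterable[str], recursive: bool = False) -> List[Path]:
    """Resolve files and directories to a sorted list of .md files."""
    found: Set[Path] = set()
    for raw in paths:
        path = Path(raw)
        if path.is_file() and path.suffix == ".md":
            # An explicitly named Markdown file is taken as-is.
            found.add(path)
        elif path.is_dir():
            # "*.md" only looks at the directory itself,
            # "**/*.md" also descends into subdirectories.
            pattern = "**/*.md" if recursive else "*.md"
            found.update(p for p in path.glob(pattern) if p.is_file())
    return sorted(found)
```

Using a set means a file passed both directly and through its directory is only processed once.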
A note about speed
archive-md-urls uses asyncio with HTTPX to make asynchronous API calls. However, do not expect fast results, especially (but not only) when you try to change a large number of URLs. The Wayback Machine API can be slow or even unavailable. If archive-md-urls has to cancel the operation because of that, just re-run it on the same files later. Links that have already been updated will be skipped because archive.org links are considered stable.
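To give a rough idea of what concurrent lookups look like, here is a minimal asyncio sketch. It deliberately uses a stand-in coroutine (`fake_lookup`) instead of real HTTPX requests so that it runs without network access. All names are hypothetical. Note that this sketch records a failed lookup as `None` and continues, whereas the tool as described above may cancel the whole operation.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Optional

# A lookup takes a URL and returns a snapshot URL, or None if there is none.
Lookup = Callable[[str], Awaitable[Optional[str]]]


async def lookup_all(
    urls: List[str], lookup: Lookup, max_concurrent: int = 5
) -> Dict[str, Optional[str]]:
    """Run lookups concurrently, limiting how many are in flight at once."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(url: str) -> Optional[str]:
        async with semaphore:
            try:
                return await lookup(url)
            except Exception:
                # A failed or timed-out request leaves the link unchanged,
                # so a later re-run can try it again.
                return None

    results = await asyncio.gather(*(guarded(url) for url in urls))
    return dict(zip(urls, results))


async def fake_lookup(url: str) -> Optional[str]:
    """Stand-in for a real API call; the return value is a placeholder."""
    await asyncio.sleep(0)
    if "unavailable" in url:
        raise TimeoutError("simulated API timeout")
    return f"snapshot-of:{url}"
```

Calling `asyncio.run(lookup_all(urls, fake_lookup))` returns a mapping from each URL to its placeholder snapshot, with `None` for the simulated failure.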
Contributing
If you would like to contribute to this project, please create a pull request from a fork.
To set up a local development environment, clone your fork and set up a virtual environment with your preferred tool. For example:
# This clones the main repository; replace the URL with your fork's URL
git clone https://github.com/sbaack/archive-md-urls.git
cd archive-md-urls
python -m venv .venv && source .venv/bin/activate
Install an editable version of archive-md-urls:
make setup
# OR, if you can't use GNU Make:
python -m pip install -U pip -Ue .
In addition, you'll need hatch to run tests. If you don't already have hatch installed via pipx, conda etc., install it in the project venv:
python -m pip install -U hatch
Tests should pass before submitting a pull request:
make test
# OR:
hatch run tests:test