Self-hosted internet archive


Maintainers

LunarWatcher


MIArchive[^1]

A quick and dirty archival system meant as a replacement for my use of ArchiveBox, featuring:

undetected-geckodriver with ublock by default. Self-hosted archives have been especially vulnerable to aggressive Cloudflare configurations that block anything that maybe perhaps vaguely looks like it could be an AI slop scraper.
A more archive.org-like interface, where recapturing sites isn't a second-class activity shoehorned in after the fact.

Unlike ArchiveBox, MIArchive is intentionally designed to not store as many formats. Though certain additional downloaders exist, for websites, the goal is to store websites. If you want to download YouTube videos, there's a perfectly good program for that.

Also unlike ArchiveBox, MIA is Linux-only, largely to take advantage of some Linux-only features.

Untitled tangent section

The main target pages for the archiver are relatively simple pages, meaning pages where things load fairly easily. Archiving a full SPA, for example, is never going to be as good as archiving a more conventional website. Large amounts of dynamic content heavily dependent on API requests that aren't called during the page load never work well, at least not with any archives I'm aware of.

Trying to support these to archive every single website in the greatest detail possible simply isn't a goal of this archiver. That's part of why I have no plans to support WARC. That and WARC doesn't seem to trivially integrate with selenium, from a few very quick searches. I care more about preserving information than preserving every cursed website setup there is. Instead, MIA focuses more on actually getting to the content. Ads can fuck right off, and Cloudflare can too. This may be an unacceptable tradeoff for a good few kinds of archivists out there, but it's perfectly acceptable for my use.

ArchiveBox, as far as I can tell, has plans to handle stuff like this better, but its goals are also very different from MIA's goals. Unfortunately, development has halted for the foreseeable future, as its main developer had to earn money. That is what caused this project to start existing; the internet is growing increasingly locked-down, which makes third-party archival increasingly difficult. At the same time, centralised archives (notably archive.org) are under immense pressure from anti-archival capitalists with sizeable lawyer funds. I don't have months to wait for ArchiveBox to maybe become usable again.

Why not support WARC?

WARC is designed to reproduce websites down to the request level, while MIA is designed to store websites in a usable format.

MIA only cares about three categories of HTTP status codes:

Redirects, because these are displayed specially in the UI
Various OKs, basically the entire 100- and 200-series
Errors, which are either ignored or result in archival errors
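
The three buckets above can be sketched roughly like this. It is an illustration only, not MIA's actual code: the `Outcome` enum and `classify_status` function are hypothetical names, and treating the whole 300-series as redirects is an assumption.

```python
from enum import Enum


class Outcome(Enum):
    REDIRECT = "redirect"  # displayed specially in the UI
    OK = "ok"              # stored as a normal capture
    ERROR = "error"        # ignored, or reported as an archival error


def classify_status(code: int) -> Outcome:
    """Map an HTTP status code to one of the three buckets described above."""
    if 300 <= code < 400:
        return Outcome.REDIRECT
    if 100 <= code < 300:
        return Outcome.OK
    # Everything else (4xx, 5xx, and anything out of range) counts as an error.
    return Outcome.ERROR
```

Whether an `ERROR` is ignored or fails the whole capture is a separate decision that this sketch leaves out.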

If you need precision archival, there are other tools more suited for that, and I do not understand enough to even begin to approximate an implementation.

Why roll your own?

Two main reasons, the first being that I can. I need this kind of software, and with ArchiveBox no longer being an option, this was the only way.

The second, and the arguably better and more broadly applicable reason, is that archival appears to have been somewhat underprioritised in open-source.

Archive.org, archive.is, and many of the other major archives currently in use are closed-source. Archive.org is the more legal one of these[^2], and in spite of this, several industries have gone after archive.org with lawsuits. Lawsuits are expensive, especially when the two parties are a donation-driven non-profit, and for-profit organisations with seemingly bottomless lawyer funds and an almost certainly unhealthy love for litigation.

The combination of a few huge, closed-source archives and lawsuits means archival is constantly being threatened. If these actors disappear (read: get sued into oblivion), publicly available archival software also risks disappearing.

Though MIA will never be able to operate at that scale, it is at least designed to work well for private archival use. You won't find an (officially endorsed) public MIA instance anywhere. It is also one of only two archival tools I'm aware of meeting this particular niche, with the other being ArchiveBox. ArchiveBox has a list of competitors, and of these, precisely 0 are the same kind of software as ArchiveBox and MIA. There's WARC, there's bookmark managers (quite a few actually, and they're only archiving in the sense that they try to create a local copy), notetaking tools, and other archival utilities, but no full, proper archives. Precisely 0 of these archives then go on to work around aggressive Cloudflare configurations that block private archives, but not private access.

But there should be more full archives on the list. MIA is my contribution - and I do hope much better alternatives appear eventually.

Implementation technicalities

Cloudflare and `robots.txt`

When Stack Exchange drastically increased how aggressively they configured Cloudflare, it resulted in people being locked out, or forced to go through very regular Cloudflare checks due to browser configuration details. Legitimate users were inconvenienced or blocked by systems meant to block AI scrapers - they were collateral damage. The same goes for many of the scripts by power users that make the site possible to moderate or use efficiently.

This has happened in quite a few places around the web, and unfortunately, it means that private archives are heavily affected. Archive.org often gets a pass because it's a big, centralised instance on several whitelists. Cloudflare maintains a list of "verified bots"[^3], which the Internet Archive is on, but getting on that list is a Whole Thing™. The policy requires a minimum of 1000 requests per day, and that the IP(s) provided for the service are used exclusively for that service[^4]. If all you're doing is running a self-hosted archive so you don't lose access to the sites you care about, you're probably going to fail both of these criteria. I self-host my instance, and my IP is used for all kinds of things, so I would be breaching that policy even if I did somehow manage 1000 requests/day.

In lieu of being able to say "I'm a bot operating on the explicit instructions of a human", the decision was made to apply various techniques for avoiding Cloudflare. undetected-geckodriver-lw is used to force the browser not to identify itself as automated, and certain very basic steps are taken to automatically resolve Cloudflare checks if they're encountered. robots.txt is not respected either, since the services with very aggressive CF configurations don't respect me as a user in a non-automated context anyway.

As for robots.txt in particular, archive.is has a similar rationale. The crawls are primarily intended to be either triggered by a human, or triggered by a human by proxy, so robots.txt doesn't need to be respected. This behaviour is consistent with many other archival and non-archival applications.
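
For reference, this is roughly what honouring robots.txt would look like with Python's standard `urllib.robotparser`. MIA deliberately skips this step; the sketch only shows what is being skipped. The rules, the `is_allowed` helper, the user agent string and the URLs are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would fetch it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://example.com/public", "https://example.com/private/page"):
        print(url, is_allowed(ROBOTS_TXT, "example-archiver", url))
```

A crawler that respects robots.txt would drop the second URL; a human-triggered capture, by the rationale above, would fetch both.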

I apply a similar rationale to Cloudflare. The requests are manually made for someone who self-hosts MIA, so within reason, trying to work around Cloudflare is not something I see as a problem. The attempted workaround is simply clicking the checkbox and seeing if that's good enough for the captcha, which is what a human would've done if they sat interacting with the archival process anyway. Since the goal is to implement an extended version of the browser's <Ctrl-S> that automatically stores the page in a sane format, it's not far-fetched that this is something someone could do manually anyway.

Requirements and setup

To set up MIArchive, you need:

A Linux-based server
Python 3.10+
PostgreSQL, not necessarily installed on the same machine
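
The first two requirements can be checked up front with a few lines of Python. This is a hypothetical preflight sketch, not something MIA ships: `check_environment` and `MIN_PYTHON` are names invented here, and PostgreSQL is not checked because it may live on another machine.

```python
import sys

MIN_PYTHON = (3, 10)


def check_environment(platform: str = sys.platform,
                      version: tuple = sys.version_info[:2]) -> list:
    """Return a list of human-readable problems; an empty list means the basics look fine."""
    problems = []
    if not platform.startswith("linux"):
        problems.append(f"unsupported platform: {platform} (Linux is required)")
    if version < MIN_PYTHON:
        problems.append(
            f"Python {version[0]}.{version[1]} is too old (3.10+ is required)"
        )
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print(problem)
```

The parameters default to the running interpreter, and can be overridden to see what the check would say about another setup.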

For development setup, see CONTRIBUTING.md. The README only details how to install MIArchive for production use.

Automated setup

[^1]: Yes, this is a pun on Missing In Action and Archival/Archive (Missing In Archive is functionally the canonical full name). Yes, I thought I was funny. Yes, I'm already regretting my decision (mostly, it does at least give nice, shortly typed mia commands). Yes, I still think I'm funny several days later, even after needing to fork selenium-wire-2.

[^2]: I'm not saying the others are illegal, but as far as I know, archive.org goes a lot further than many other archives in making sure the content hosted is legal. This unfortunately means archive.org is fairly quick to take down content, which makes it hard to actually preserve the historical record.

[^3]: This list is absolute bullshit. It includes OpenAI, Google's slop bot, Anthropic, and Meta (Facebook), all of whom have questionable relations to respecting requests not to steal data, and questionable relations to basic copyright law.

[^4]: https://developers.cloudflare.com/bots/concepts/bot/verified-bots/policy/

Release history

- 0.0.3: Jul 9, 2025
- 0.0.2: Jul 9, 2025
- 0.0.1: Jul 6, 2025 (this version)

Download files

Source Distribution

miarchive-0.0.1.tar.gz (7.0 kB), uploaded Jul 6, 2025, Source

Built Distribution

miarchive-0.0.1-py3-none-any.whl (6.5 kB), uploaded Jul 6, 2025, Python 3

File details

Details for the file miarchive-0.0.1.tar.gz.

File metadata

Download URL: miarchive-0.0.1.tar.gz
Upload date: Jul 6, 2025
Size: 7.0 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for miarchive-0.0.1.tar.gz
Algorithm	Hash digest
SHA256	`f5b0d0b8807957e1d1c03f9eb86f1dd0392fa39f1d3a0e5489687976f0d5879d`
MD5	`1fd9eb7df59949c7448457bcb2d66c4c`
BLAKE2b-256	`bda9d0499511d64bb6a3e977215d9980f16b786de1212efe52891d46262351d0`
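
A downloaded copy of the file can be checked against the SHA256 digest above with Python's standard `hashlib`. This is a minimal sketch: `sha256_of` and `matches` are hypothetical helper names, and the file path is assumed to be wherever you saved the download.

```python
import hashlib
from pathlib import Path

# SHA256 digest listed above for miarchive-0.0.1.tar.gz.
EXPECTED_SHA256 = "f5b0d0b8807957e1d1c03f9eb86f1dd0392fa39f1d3a0e5489687976f0d5879d"


def sha256_of(path: Path) -> str:
    """Hash the file in chunks so large downloads don't need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: Path, expected: str = EXPECTED_SHA256) -> bool:
    """Return True if the file's SHA256 digest equals the expected hex string."""
    return sha256_of(path) == expected.lower()
```

Calling `matches(Path("miarchive-0.0.1.tar.gz"))` in the download directory should return `True` for an intact file.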

Provenance

The following attestation bundles were made for miarchive-0.0.1.tar.gz:

Publisher: release.yml on LunarWatcher/MIArchive

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: miarchive-0.0.1.tar.gz
- Subject digest: f5b0d0b8807957e1d1c03f9eb86f1dd0392fa39f1d3a0e5489687976f0d5879d
- Sigstore transparency entry: 264888210
- Sigstore integration time: Jul 6, 2025
Source repository:
- Permalink: LunarWatcher/MIArchive@82a4f3847e82fb9ba66d8225daf158700e59eeac
- Branch / Tag: refs/tags/v0.0.1
- Owner: https://github.com/LunarWatcher
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@82a4f3847e82fb9ba66d8225daf158700e59eeac
- Trigger Event: release

File details

Details for the file miarchive-0.0.1-py3-none-any.whl.

File metadata

Download URL: miarchive-0.0.1-py3-none-any.whl
Upload date: Jul 6, 2025
Size: 6.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for miarchive-0.0.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`49d6166d6ac1a0da285b01750b4a7324b9b8bd151b740349bf57b565fe31707b`
MD5	`ff6b99ebd71404d33993ff8668845880`
BLAKE2b-256	`48936e1d4439c5adb6dae8078d40c477bf2e05dfa0dae869a19d2c7c990e61e1`

Provenance

The following attestation bundles were made for miarchive-0.0.1-py3-none-any.whl:

Publisher: release.yml on LunarWatcher/MIArchive

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: miarchive-0.0.1-py3-none-any.whl
- Subject digest: 49d6166d6ac1a0da285b01750b4a7324b9b8bd151b740349bf57b565fe31707b
- Sigstore transparency entry: 264888211
- Sigstore integration time: Jul 6, 2025
Source repository:
- Permalink: LunarWatcher/MIArchive@82a4f3847e82fb9ba66d8225daf158700e59eeac
- Branch / Tag: refs/tags/v0.0.1
- Owner: https://github.com/LunarWatcher
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@82a4f3847e82fb9ba66d8225daf158700e59eeac
- Trigger Event: release
