iagitup

Archive GitHub repositories to the Internet Archive.

iagitup clones a GitHub repository, creates a portable git bundle, and uploads it to the Internet Archive with rich metadata. If the repository has a wiki, that is bundled and uploaded too. A companion command, archive-watchlist, continuously archives the most-starred repositories on GitHub.


Features

  • Full-fidelity snapshots -- every branch, tag, and ref is preserved in a single git bundle.
  • Wiki archiving -- wiki repositories are detected and bundled automatically.
  • Rich IA metadata -- description, README, topics, language, stars, and more are attached to each item.
  • Duplicate prevention -- two layers (local state cache + IA item check) ensure the same snapshot is never uploaded twice.
  • Bulk archiving -- archive-watchlist fetches and archives the top-N most-starred GitHub repos on a schedule.
  • Parallel workers -- configurable concurrency for bulk runs.
  • Custom metadata -- pass extra key:value pairs to enrich any upload.

Installation

From PyPI

pip install iagitup

This installs two commands: iagitup and archive-watchlist.

From source

git clone https://github.com/gdamdam/iagitup.git
cd iagitup
pip install .

Prerequisites


Quick Start

Archive a single repository

iagitup https://github.com/torvalds/linux
:: Downloading https://github.com/torvalds/linux ...
:: Cloning https://github.com/torvalds/linux.git ...
:: Uploading bundle: torvalds-linux_-_2026-02-28_10-00-00.bundle
:: Upload FINISHED.
   Identifier:          github.com-torvalds-linux_-_2026-02-28_10-00-00
   Archived repository: https://archive.org/details/github.com-torvalds-linux_-_2026-02-28_10-00-00
   Git bundle:          https://archive.org/download/github.com-torvalds-linux_-_2026-02-28_10-00-00/torvalds-linux_-_2026-02-28_10-00-00.bundle

Bulk-archive top starred repos

# Preview the top 10 without uploading
archive-watchlist --dry-run --top-n 10

# Full run with 8 parallel workers
archive-watchlist --workers 8

Usage

iagitup

iagitup [options] <github_repo_url>
Flag          Short   Default      Description
github_url    --      (required)   GitHub repository URL to archive
--metadata    -m      --           Custom metadata fields (see Custom Metadata)
--version     -v      --           Print version and exit

archive-watchlist

archive-watchlist [options]
Flag                Default                  Description
--top-n N           100                      Number of top repositories to fetch and check (max 100)
--workers N         4                        Number of parallel archive workers
--dry-run           off                      Preview what would be archived -- no uploads, no state changes
--state-file PATH   ./watchlist_state.json   Path to the persistent state cache

Examples:

# Use a custom state file
archive-watchlist --state-file /var/lib/iagitup/state.json

Configuration

GitHub Authentication

Unauthenticated GitHub API calls are rate-limited to 60 requests/hour. Set GITHUB_TOKEN to raise this to 5,000/hour:

export GITHUB_TOKEN=ghp_your_token_here
iagitup https://github.com/user/repo

Generate a token at https://github.com/settings/tokens -- no specific scopes are required for public repositories.
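
As an illustration of how a script might pick up the token, here is a minimal sketch that builds GitHub REST API request headers from the environment. The function name github_headers is hypothetical and is not part of iagitup; only the GITHUB_TOKEN variable comes from the text above.

```python
import os


def github_headers(env=None):
    """Build headers for GitHub REST API requests (hypothetical helper)."""
    env = os.environ if env is None else env
    headers = {"Accept": "application/vnd.github+json"}
    token = env.get("GITHUB_TOKEN", "").strip()
    if token:
        # Authenticated requests are counted against the higher rate limit.
        headers["Authorization"] = f"Bearer {token}"
    return headers


# Without a token only the Accept header is sent (unauthenticated limit applies).
print(github_headers({}))
# With a token an Authorization header is added.
print(sorted(github_headers({"GITHUB_TOKEN": "ghp_your_token_here"})))
```

Passing the environment as a parameter keeps the helper easy to test without touching the real process environment.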

Internet Archive Credentials

On first run, if no credentials are found, iagitup will prompt you to run ia configure interactively. Credentials are stored in ~/.ia or ~/.config/ia.ini and reused on subsequent runs.

You can also configure them manually:

ia configure

Or create ~/.ia directly:

[s3]
access = YOUR_ACCESS_KEY
secret = YOUR_SECRET_KEY

Find your keys at https://archive.org/account/s3.php.
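
To check that a credentials file has the shape shown above, something like the following sketch can help. It only parses the [s3] section with Python's standard configparser; the helper name read_s3_keys is an assumption for illustration and says nothing about how iagitup or the ia tool read the file internally.

```python
import configparser


def read_s3_keys(ini_text):
    """Return (access, secret) from an ia-style config, or None if incomplete."""
    # RawConfigParser avoids '%' interpolation, which could mangle secret keys.
    parser = configparser.RawConfigParser()
    parser.read_string(ini_text)
    if not parser.has_section("s3"):
        return None
    access = parser.get("s3", "access", fallback="").strip()
    secret = parser.get("s3", "secret", fallback="").strip()
    return (access, secret) if access and secret else None


sample = """[s3]
access = YOUR_ACCESS_KEY
secret = YOUR_SECRET_KEY
"""
print(read_s3_keys(sample))
print(read_s3_keys("[s3]\naccess = only_half\n"))
```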


Custom Metadata

Pass additional Internet Archive metadata fields as comma-separated key:value pairs:

iagitup --metadata="subject:python;cli,creator:myorg" https://github.com/user/repo

Custom fields are merged into the default metadata. Any key that matches a default field will override it.
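
The merge behaviour described above can be sketched as follows. This is an illustrative reading of the --metadata format, not iagitup's actual parser: the function name parse_metadata is hypothetical, and splitting each pair on its first colon is an assumption.

```python
def parse_metadata(arg):
    """Parse 'key:value,key:value' into a dict (hypothetical helper)."""
    fields = {}
    for pair in arg.split(","):
        if not pair.strip():
            continue
        # Split on the first colon only, so values may themselves contain colons.
        key, sep, value = pair.partition(":")
        if not sep or not key.strip():
            raise ValueError(f"expected key:value, got {pair!r}")
        fields[key.strip()] = value.strip()
    return fields


# A few of the default fields listed in this README.
defaults = {
    "mediatype": "software",
    "collection": "open_source_software",
    "subject": "GitHub;code;software;git",
}

custom = parse_metadata("subject:python;cli,creator:myorg")
# Custom keys win over defaults with the same name.
merged = {**defaults, **custom}
print(merged["subject"])
print(merged["creator"])
```

In this sketch the custom subject replaces the default subject entirely rather than being appended to it, which matches the "override" wording above.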

Default metadata fields

Field           Value
mediatype       software
collection      open_source_software
creator         GitHub owner login
title           IA item identifier
date            Last push date (YYYY-MM-DD)
year            Last push year
subject         GitHub;code;software;git
originalurl     GitHub repository URL
pushed_date     Full push timestamp (YYYY-MM-DD HH:MM:SS)
uploaded_with   iagitup-vX.X.X
description     HTML: repo description + README + restore instructions

Extra fields added by archive-watchlist

Field            Value
stars_count      Stargazer count at time of archive
forks_count      Fork count
watchers_count   Watcher count
language         Primary programming language
topics           Semicolon-joined topic list
github_rank      Position in the top-N list
subject          Extended: base tags + language + topics

How It Works

Single repository (iagitup)

  1. Fetches metadata from the GitHub API (pushed_at, description, owner, topics, language, etc.).
  2. Checks for duplicates -- the IA item identifier is derived from the repo name and pushed_at timestamp (github.com-{owner}-{repo}_-_{YYYY-MM-DD_HH-MM-SS}). If an item with that identifier already exists, iagitup exits early.
  3. Clones the repository in full (all branches and tags).
  4. Downloads the owner's avatar as a cover image (cover.jpg), concurrently with wiki cloning.
  5. Creates git bundles (git bundle create <file> --all) for the repository and, if present, the wiki.
  6. Builds an HTML description from the repo description, README (.md or .txt), and restore instructions.
  7. Uploads the bundle(s) and cover image to the Internet Archive.
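
The naming scheme from step 2 can be sketched in a few lines. The helper snapshot_names is hypothetical; the identifier pattern comes from step 2, while the bundle file name and download URL patterns are inferred from the Quick Start output. The sketch also assumes the timestamp is used exactly as the GitHub API reports it (UTC), which may differ from what iagitup does internally.

```python
from datetime import datetime


def snapshot_names(owner, repo, pushed_at):
    """Return (identifier, bundle file name, download URL) for one snapshot."""
    # GitHub reports pushed_at as ISO 8601 in UTC, e.g. 2026-02-28T10:00:00Z.
    stamp = datetime.strptime(pushed_at, "%Y-%m-%dT%H:%M:%SZ")
    suffix = stamp.strftime("%Y-%m-%d_%H-%M-%S")
    identifier = f"github.com-{owner}-{repo}_-_{suffix}"
    bundle = f"{owner}-{repo}_-_{suffix}.bundle"
    url = f"https://archive.org/download/{identifier}/{bundle}"
    return identifier, bundle, url


identifier, bundle, url = snapshot_names("torvalds", "linux", "2026-02-28T10:00:00Z")
print(identifier)
print(bundle)
print(url)
```

Because the timestamp is part of the identifier, two snapshots of the same repository taken after different pushes map to different IA items.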

Each archived repository becomes a single IA item containing:

File                        Description
<bundle_name>.bundle        Full git bundle (all branches + tags)
cover.jpg                   Repository owner's avatar
<bundle_name>_wiki.bundle   Wiki git bundle (if wiki exists)

Bulk archiving (archive-watchlist)

  1. Fetches the top-N repos from the GitHub Search API (sorted by stars).
  2. Compares each repo's pushed_at against a local state cache.
  3. Skips unchanged repos instantly (no network, no IA call).
  4. Archives new or updated repos via iagitup, enriched with popularity metadata.
  5. Repos are processed in parallel across a configurable worker pool.
  6. State is saved to disk after each archive -- a crash mid-run loses at most one item.
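
The loop above might look roughly like the sketch below. It is not the archive-watchlist source: run_watchlist, needs_archive, and the archive and save_state callables are hypothetical stand-ins, and treating "changed" as "pushed_at differs from the cached value" is an assumption. The repo dictionaries use the full_name and pushed_at fields that the GitHub API returns.

```python
import json
import threading
from concurrent.futures import ThreadPoolExecutor


def needs_archive(repo, state):
    """True if the repo is new or its pushed_at differs from the cached one."""
    entry = state.get(repo["full_name"])
    return entry is None or entry.get("pushed_at") != repo["pushed_at"]


def run_watchlist(repos, state, archive, save_state, workers=4):
    """Archive changed repos in parallel, saving state after each one."""
    lock = threading.Lock()
    pending = [r for r in repos if needs_archive(r, state)]

    def work(repo):
        identifier = archive(repo)  # stand-in for the real clone/bundle/upload
        with lock:  # one writer at a time, so the saved state stays consistent
            state[repo["full_name"]] = {
                "pushed_at": repo["pushed_at"],
                "ia_identifier": identifier,
            }
            save_state(state)
        return repo["full_name"]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sorted(pool.map(work, pending))


repos = [
    {"full_name": "a/one", "pushed_at": "2026-02-01T12:00:00Z"},
    {"full_name": "b/two", "pushed_at": "2026-02-03T08:30:00Z"},
]
state = {"a/one": {"pushed_at": "2026-02-01T12:00:00Z"}}
saved = []  # collects each serialized state instead of writing to disk
done = run_watchlist(
    repos,
    state,
    archive=lambda r: "stub-" + r["full_name"],
    save_state=lambda s: saved.append(json.dumps(s)),
)
print(done)  # ['b/two']
```

Saving inside the lock after every completed repo is what limits the loss on a crash to work that was still in flight.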

Duplicate prevention

Two independent layers prevent the same snapshot from being uploaded twice:

Layer           Where                  How
Local cache     archive_watchlist.py   Compares pushed_at to the state file -- instant skip, zero network traffic
IA item check   iagitup.upload_ia      Checks item.exists on IA before any heavy work -- safe even if the state file is deleted

A new push changes pushed_at, generates a new IA item identifier, and triggers a fresh archive -- preserving the full history of snapshots.
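
A compact way to picture the two layers is a decision function that consults the cheap local check first and the remote check second. The sketch below is illustrative only: decide is a hypothetical name, and item_exists is an injected callable standing in for the Internet Archive lookup so the example runs without network access.

```python
def decide(full_name, pushed_at, identifier, state, item_exists):
    """Return what to do with one snapshot: skip (and why) or archive."""
    # Layer 1: local state cache, no network traffic.
    cached = state.get(full_name)
    if cached is not None and cached.get("pushed_at") == pushed_at:
        return "skip: local cache"
    # Layer 2: ask IA whether the item already exists (works without a state file).
    if item_exists(identifier):
        return "skip: IA item exists"
    return "archive"


existing_items = {"github.com-owner-repo_-_2026-02-01_12-00-00"}
state = {"owner/repo": {"pushed_at": "2026-02-01T12:00:00Z"}}

# Unchanged repo: caught by the cache.
print(decide("owner/repo", "2026-02-01T12:00:00Z",
             "github.com-owner-repo_-_2026-02-01_12-00-00",
             state, existing_items.__contains__))
# State file lost: caught by the IA check.
print(decide("owner/repo", "2026-02-01T12:00:00Z",
             "github.com-owner-repo_-_2026-02-01_12-00-00",
             {}, existing_items.__contains__))
# New push: new identifier, so it is archived.
print(decide("owner/repo", "2026-02-05T09:00:00Z",
             "github.com-owner-repo_-_2026-02-05_09-00-00",
             state, existing_items.__contains__))
```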

Cron setup

Add to your crontab (crontab -e) to run daily at 03:00:

0 3 * * * archive-watchlist >> watchlist.log 2>&1

Set GITHUB_TOKEN in the cron environment to avoid rate limiting:

GITHUB_TOKEN=ghp_your_token_here
0 3 * * * archive-watchlist >> watchlist.log 2>&1

State file

The state file (watchlist_state.json) tracks the last-seen snapshot of each repository:

{
  "owner/repo": {
    "pushed_at": "2026-02-01T12:00:00Z",
    "archived_at": "2026-02-02T03:00:00Z",
    "ia_identifier": "github.com-owner-repo_-_2026-02-01_12-00-00",
    "stars": 470000
  }
}

To force a re-archive of a specific repo, delete its entry from the file or change pushed_at to an old value.
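
Deleting an entry by hand is easy to get wrong in a large JSON file, so a small script can help. The following is a sketch under stated assumptions: forget_repo is a hypothetical helper, not an iagitup command, and it assumes the state file has the layout shown above and that archive-watchlist is not running while the file is edited.

```python
import json
import os
import tempfile
from pathlib import Path


def forget_repo(state_path, full_name):
    """Remove one 'owner/repo' entry from the state file; return True if removed."""
    path = Path(state_path)
    state = json.loads(path.read_text(encoding="utf-8"))
    removed = state.pop(full_name, None) is not None
    if removed:
        # Write to a temporary file first, then swap it in, so an interrupted
        # write cannot leave a truncated state file behind.
        tmp = path.with_suffix(".tmp")
        tmp.write_text(json.dumps(state, indent=2), encoding="utf-8")
        os.replace(tmp, path)
    return removed


# Demonstration against a throwaway copy of the example state file.
with tempfile.TemporaryDirectory() as workdir:
    demo = Path(workdir) / "watchlist_state.json"
    demo.write_text(json.dumps({
        "owner/repo": {
            "pushed_at": "2026-02-01T12:00:00Z",
            "archived_at": "2026-02-02T03:00:00Z",
            "ia_identifier": "github.com-owner-repo_-_2026-02-01_12-00-00",
            "stars": 470000,
        }
    }), encoding="utf-8")
    print(forget_repo(demo, "owner/repo"))    # True
    print(forget_repo(demo, "owner/repo"))    # False, already gone
    print(json.loads(demo.read_text(encoding="utf-8")))  # {}
```

Note that removing the local entry only bypasses the cache layer; if the IA item for that snapshot already exists, the IA item check described earlier would still skip the upload.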


Restoring an Archived Repository

  1. Find the .bundle file in the archived IA item.
  2. Download it:
wget https://archive.org/download/<identifier>/<bundle>.bundle
  3. Clone from the bundle:
git clone <bundle>.bundle my-repo

All branches and tags are preserved in the bundle.


Contributing

Contributions are welcome. Please open an issue or submit a pull request at https://github.com/gdamdam/iagitup.

The project uses GitHub Actions for automated linting and testing across Python 3.10, 3.11, 3.12, and 3.13.


License

GPLv3 -- Copyright (C) 2018-2026 Giovanni Damiola

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License v3.0 as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
