Tool for extracting archived websites from the Internet Archive and saving them as WARC files.

Project description

InternetArchiveExtractor

This repository extracts archived content from the Wayback Machine and converts the collected metadata and downloaded snapshot files into compressed WARC files. The project currently supports three primary modes of operation: downloading snapshots from the Internet Archive, combining and cleaning the CSV metadata produced by Wayback backup tools, and converting that metadata and the downloaded files into a WARC-GZ archive.

What this does (short)

Download mode: reads a CSV of Internet Archive (Wayback) URLs, determines snapshot ranges, and uses pywaybackup to download snapshots (the downloads are stored in the local waybackup_snapshots/ folder by default).
Convert mode: combines CSV files (from a directory) into a single CSV and then converts that CSV into a compressed WARC (.warc.gz) using warcio.
Full mode: runs the download step, then the combine and convert steps, to produce a WARC in one run.

Requirements

Install the Python dependencies from the repository requirements.txt:

pip install -r requirements.txt

Notable packages used:

pywaybackup — downloads Wayback snapshots
pandas — CSV handling and merging when combining multiple CSVs
warcio — writing WARC records

See requirements.txt for the exact pinned versions used in this repository.

Project layout (important files)

src/main.py — command-line entry point that exposes download, convert, and full modes.
src/internet_archive_downloader.py — logic that reads an input CSV of Internet Archive URLs and runs pywaybackup to download snapshots.
src/waybackup_to_warc.py — functions to combine CSV files, clean URLs (remove :80), and produce a .warc.gz from a CSV of records.
resources/ — example CSVs (e.g. curated_urls.csv) useful for quick testing.
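
The URL cleaning mentioned for src/waybackup_to_warc.py (removing :80) could look roughly like the sketch below. The function name strip_default_port is hypothetical and the repository's actual implementation may differ; this version only drops a trailing :80 from the host part and leaves every other port alone.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_default_port(url: str) -> str:
    """Remove an explicit :80 port from a URL (hypothetical helper).

    Only the network-location part is inspected, so a ":80" that appears
    in the path or query string is left untouched.
    """
    parts = urlsplit(url)
    if parts.netloc.endswith(":80"):
        # Drop the last three characters (":80") from the host:port pair.
        parts = parts._replace(netloc=parts.netloc[:-3])
    return urlunsplit(parts)


print(strip_default_port("http://example.com:80/page"))
print(strip_default_port("http://example.com:8080/page"))
```

Note that ":8080" does not match, because the check looks at the final three characters of the host:port pair only.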

How to run

Usage pattern for the main runner (src/main.py):

python src/main.py <mode> <input> [--output OUTPUT] [--column_name COLUMN] [--period DAY|WEEK] [--reset]
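
For orientation, the usage pattern above maps onto an argparse parser along the following lines. This is a minimal sketch reconstructed from the usage line and the documented defaults, not the code in src/main.py; the function name build_parser and the help strings are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented usage line (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="main.py",
        description="Download Wayback snapshots and convert them to WARC.",
    )
    parser.add_argument("mode", choices=["download", "convert", "full"])
    parser.add_argument("input", help="CSV file (download/full) or CSV directory (convert)")
    parser.add_argument("--output", default=None, help="base name of the resulting WARC")
    parser.add_argument("--column_name", default="Internet_Archive_URL")
    parser.add_argument("--period", choices=["DAY", "WEEK"], default="DAY")
    parser.add_argument("--reset", action="store_true")
    return parser


# Parse a fixed argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(
    ["download", "resources/curated_urls.csv", "--period", "WEEK"]
)
print(args.mode, args.input, args.column_name, args.period, args.reset)
```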

Modes and example usage:

Download mode: download snapshots listed in a CSV
- Description: Reads a CSV containing full Wayback URLs such as https://web.archive.org/web/20251002062751/https://example.com/page and downloads snapshots for a small period around the archived date.
- Required input: path to the CSV file to read (e.g. resources/curated_urls.csv). The default column name expected is Internet_Archive_URL.
- Example:
```
python src/main.py download resources/curated_urls.csv --column_name Internet_Archive_URL --period DAY
```
- Flags:
  - --period — DAY (default) or WEEK. Controls whether the downloader fetches snapshots ±1 day or ±1 week around the archived date.
  - --reset — if present, passes reset=True to pywaybackup (useful to force re-download).
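
To make the --period behaviour concrete, the sketch below splits a Wayback URL into its 14-digit timestamp and original URL, then computes the ±1 day or ±1 week window. The names WAYBACK_RE and snapshot_window are hypothetical, and how the repository actually passes the range to pywaybackup is not shown here.

```python
import re
from datetime import datetime, timedelta

# Matches https://web.archive.org/web/<14-digit timestamp>[modifier]/<original URL>
WAYBACK_RE = re.compile(r"^https?://web\.archive\.org/web/(\d{14})[a-z_]*/(.+)$")
STAMP_FORMAT = "%Y%m%d%H%M%S"


def snapshot_window(wayback_url: str, period: str = "DAY"):
    """Return (original_url, start, end) around the archived date (illustrative)."""
    match = WAYBACK_RE.match(wayback_url)
    if match is None:
        raise ValueError(f"not a Wayback URL: {wayback_url!r}")
    if period not in {"DAY", "WEEK"}:
        raise ValueError(f"unsupported period: {period!r}")
    stamp, original_url = match.groups()
    archived = datetime.strptime(stamp, STAMP_FORMAT)
    delta = timedelta(days=1) if period == "DAY" else timedelta(weeks=1)
    start = (archived - delta).strftime(STAMP_FORMAT)
    end = (archived + delta).strftime(STAMP_FORMAT)
    return original_url, start, end


print(snapshot_window(
    "https://web.archive.org/web/20251002062751/https://example.com/page"
))
```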

Convert mode: combine CSVs and produce a WARC
- Description: Combine all .csv files from the specified directory into a single CSV (written to combined_output.csv by default) and convert that CSV to a WARC-GZ.
- Required input: path to a directory that contains CSV files to combine (e.g. waybackup_snapshots/ or any folder with CSV exports).
- Output: pass --output to name the resulting WARC file (the code appends .warc.gz).
- Example:
```
python src/main.py convert waybackup_snapshots --output mysite_archive
```
- Notes: The script combines CSV files using pandas.concat and writes the combined CSV to combined_output.csv (value of COMBINED_CSV_PATH). The combined CSV is then read and converted into output/<output>.warc.gz.
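
The combining step can be pictured with the sketch below. It deliberately uses the standard library csv module instead of pandas.concat so that it runs without third-party packages; the function name combine_csvs is hypothetical, and the real script's handling of column order or duplicate rows may differ.

```python
import csv
from pathlib import Path


def combine_csvs(directory, combined_path="combined_output.csv") -> int:
    """Merge every .csv in `directory` into one file; return the row count.

    Columns are the union of all input headers, in first-seen order.
    Cells missing from a given input file are written as empty strings.
    """
    directory = Path(directory)
    combined_path = Path(combined_path)
    fieldnames, rows = [], []
    for csv_file in sorted(directory.glob("*.csv")):
        # Never re-read a previous combined output that lives in the same folder.
        if csv_file.resolve() == combined_path.resolve():
            continue
        with csv_file.open(newline="", encoding="utf-8") as handle:
            reader = csv.DictReader(handle)
            for name in reader.fieldnames or []:
                if name not in fieldnames:
                    fieldnames.append(name)
            rows.extend(reader)
    with combined_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=fieldnames, restval="", extrasaction="ignore"
        )
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Calling combine_csvs("waybackup_snapshots") would write combined_output.csv in the current directory, matching the default path described above.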

Full mode: download then convert
- Description: Downloads snapshots from the input CSV, then combines CSVs (from waybackup_snapshots) and converts them into a WARC.
- Example:
```
python src/main.py full resources/curated_urls.csv --output combined_site_archive
```

Important implementation notes

Default combined CSV file path: combined_output.csv (the module-level COMBINED_CSV_PATH in src/waybackup_to_warc.py).
The CSVs read by the converter are expected to contain columns like url_origin, url_archive, file, timestamp, and response (see src/waybackup_to_warc.py for required field names used when creating WARC records).
The converter skips entries whose file path does not exist and prints a warning. It also emits simple 404/500 WARC entries when those response codes are encountered.
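
The skip and placeholder rules above can be sketched as a small planning step that runs before any WARC record is written. The helper names warc_date and plan_record, the returned dictionary shape, and the order of the checks are all assumptions; the sketch also assumes the timestamp column holds 14-digit Wayback timestamps, which WARC expects to be expressed as UTC ISO 8601 dates.

```python
import os
from datetime import datetime, timezone


def warc_date(timestamp) -> str:
    """Convert a 14-digit Wayback timestamp to a WARC-Date string (UTC)."""
    parsed = datetime.strptime(str(timestamp), "%Y%m%d%H%M%S")
    return parsed.replace(tzinfo=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def plan_record(row: dict):
    """Decide what to do with one CSV row (hypothetical helper).

    Returns None when the row should be skipped, otherwise a dict that
    describes the record to write.
    """
    status = str(row.get("response", "")).strip()
    base = {
        "uri": row["url_origin"],
        "date": warc_date(row["timestamp"]),
        "status": status,
    }
    if status in {"404", "500"}:
        # Error responses get a simple placeholder record without a payload file.
        return {**base, "payload_path": None}
    path = row.get("file") or ""
    if not os.path.isfile(path):
        print(f"Warning: skipping {row['url_origin']}: file not found: {path!r}")
        return None
    return {**base, "payload_path": path}


print(warc_date("20251002062751"))
```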

Example quick workflow

Create or obtain a CSV of Wayback URLs (column name Internet_Archive_URL), e.g. resources/curated_urls.csv.
Download snapshots for those URLs:
```
python src/main.py download resources/curated_urls.csv --column_name Internet_Archive_URL
```
This writes per-site CSVs and downloaded files into waybackup_snapshots/ (and related subfolders) using pywaybackup.

Combine CSVs and convert to WARC:

python src/main.py convert waybackup_snapshots --output archived_site

The resulting WARC will be written to output/archived_site.warc.gz.

Troubleshooting

If --output is not provided for convert or full mode, the conversion step may receive None as the output filename. Always provide --output when converting.
If the script can't find expected CSV columns, inspect the CSV(s) created by pywaybackup and ensure the required column names (file, timestamp, response, url_origin) are present.
If downloads fail, try rerunning with --reset to force re-downloads.

Next steps / Improvements

Add argument validation to require --output for convert and full modes.
Add unit tests for CSV combining and WARC creation edge cases (missing files, bad timestamps).
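
As a starting point for the first improvement, the check below rejects a missing --output for the two modes that write a WARC and builds the output/<output>.warc.gz path described earlier. The function name resolve_warc_path and the out_dir parameter are hypothetical; in a real CLI the same condition could be reported through argparse's parser.error instead of an exception.

```python
from pathlib import Path


def resolve_warc_path(mode: str, output, out_dir: str = "output"):
    """Validate --output and return the target WARC path (illustrative).

    Download mode writes no WARC, so it returns None.
    """
    if mode not in {"convert", "full"}:
        return None
    if not output:
        raise ValueError("--output is required for convert and full modes")
    return Path(out_dir) / f"{output}.warc.gz"


print(resolve_warc_path("convert", "archived_site"))
```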

Project details

Release history

0.0.11: Mar 10, 2026
0.0.10: Jan 15, 2026
0.0.9: Jan 13, 2026
0.0.8: Dec 19, 2025 (this version)
0.0.7: Oct 22, 2025
0.0.6: Oct 22, 2025
0.0.5: Oct 22, 2025
0.0.4: Oct 21, 2025
0.0.3: Oct 16, 2025
0.0.2: Oct 16, 2025
0.0.1: Oct 16, 2025

Download files

Source Distribution

internet_archive_extractor-0.0.8.tar.gz (10.8 kB)

Uploaded Dec 19, 2025 Source

Built Distribution

internet_archive_extractor-0.0.8-py3-none-any.whl (12.5 kB)

Uploaded Dec 19, 2025 Python 3

File details

Details for the file internet_archive_extractor-0.0.8.tar.gz.

File metadata

Download URL: internet_archive_extractor-0.0.8.tar.gz
Upload date: Dec 19, 2025
Size: 10.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for internet_archive_extractor-0.0.8.tar.gz
Algorithm	Hash digest
SHA256	`78941f696e38b976ac935ee8ae2569c6e36afbafe421e0fcb799c5f34affc025`
MD5	`b06a0b4430277c326c5a661750fae9ff`
BLAKE2b-256	`0b15edf78cc799aa3a992451c683e3913098093c8851f9452ffc573113976f2c`

File details

Details for the file internet_archive_extractor-0.0.8-py3-none-any.whl.

File metadata

Download URL: internet_archive_extractor-0.0.8-py3-none-any.whl
Upload date: Dec 19, 2025
Size: 12.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for internet_archive_extractor-0.0.8-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a5b75aa11a286c3a4304128c0170d0488512e795e98c747dc9d3724bd91c6a93`
MD5	`ad9d23c0e19d8628f0ac0577895c2a6c`
BLAKE2b-256	`3582ed19d1b0f8df36058d1b7e4fb7df3f2e0318ce1b62b6affe53ff9b0c29e0`

internet-archive-extractor 0.0.8
