Tool for extracting archived websites from the Internet Archive and saving them as WARC files.

InternetArchiveExtractor

This repository extracts archived content from the Wayback Machine and converts collected metadata and downloaded snapshot files into compressed WARC files. The project supports two primary modes of operation: downloading snapshots from the Internet Archive and converting CSV metadata (produced by pywaybackup) into WARC-GZ archives.

What this does (short)

Download mode: Reads a CSV of Internet Archive (Wayback) URLs and uses pywaybackup to download snapshots. For each URL processed, it automatically converts the downloaded snapshots to a WARC file and, when --clean is given, cleans up temporary files.
Convert mode: Combines CSV files (from a directory) into a single CSV and then converts that CSV into a compressed WARC (.warc.gz) using warcio.

Requirements

Install the Python dependencies from the repository requirements.txt:

pip install -r requirements.txt

Notable packages used:

pywaybackup — downloads Wayback snapshots
pandas — CSV handling and merging when combining multiple CSVs
warcio — writing WARC records

See requirements.txt for the exact pinned versions used in this repository.

Project layout (important files)

src/main.py — command-line entry point that exposes download and convert modes.
src/internet_archive_downloader.py — logic that reads an input CSV of Internet Archive URLs and runs pywaybackup to download snapshots. After each URL is downloaded, it automatically converts the resulting CSV and snapshot files to WARC and, when --clean is given, cleans up temporary files.
src/waybackup_to_warc.py — functions to combine CSV files, clean URLs (remove :80), and produce a .warc.gz from a CSV of records.
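
The URL cleaning step mentioned above (removing an explicit :80 port) could look roughly like the following. This is a minimal sketch using only the standard library, not the code in src/waybackup_to_warc.py; the function name strip_default_port is hypothetical, and it only touches the port in the URL's own host part.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_default_port(url: str) -> str:
    """Return url without an explicit ':80' port; any other port is kept."""
    parts = urlsplit(url)
    netloc = parts.netloc
    # Only the exact suffix ':80' is removed, so ':8080' or ':180' stay intact.
    if netloc.endswith(":80"):
        netloc = netloc[: -len(":80")]
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


if __name__ == "__main__":
    print(strip_default_port("http://example.com:80/page?x=1"))
```

Working on the parsed host part rather than doing a plain string replace avoids accidentally editing a ":80" that appears in the path or query string.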

How to run

Usage pattern for the main runner (src/main.py):

# Download mode
python src/main.py download <input> [--column_name COLUMN] [--period PERIOD] [--reset] [--start_time START] [--end_time END] [--snapshot-folder FOLDER] [--warc-output FOLDER] [--workers N] [--clean]

# Convert mode
python src/main.py convert <input> --output OUTPUT [--warc-output FOLDER]
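
For orientation, the two usage patterns above map naturally onto argparse subcommands. The sketch below is an assumption about how such a command line could be declared, not a copy of src/main.py; the function name build_parser is hypothetical, and the defaults are taken from the flag descriptions in this README. It marks --output as required for convert mode, which this README lists as a planned improvement rather than current behavior.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare 'download' and 'convert' subcommands with the documented flags."""
    parser = argparse.ArgumentParser(prog="main.py")
    modes = parser.add_subparsers(dest="mode", required=True)

    download = modes.add_parser("download", help="download snapshots listed in a CSV")
    download.add_argument("input", help="CSV file containing Wayback URLs")
    download.add_argument("--column_name", default="Internet_Archive_URL")
    download.add_argument("--period", choices=["DAY", "WEEK", "FULL", "CUSTOM"], default="DAY")
    download.add_argument("--start_time", help="YYYYMMDDHHMMSS, used with --period CUSTOM")
    download.add_argument("--end_time", help="YYYYMMDDHHMMSS, used with --period CUSTOM")
    download.add_argument("--reset", action="store_true")
    download.add_argument("--snapshot-folder", default="./waybackup_snapshots")
    download.add_argument("--warc-output", default="./output")
    download.add_argument("--workers", type=int, default=5)
    download.add_argument("--clean", action="store_true")

    convert = modes.add_parser("convert", help="combine CSVs and produce a WARC")
    convert.add_argument("input", help="directory containing CSV files")
    # Assumption: enforced here, although the README only plans this validation.
    convert.add_argument("--output", required=True)
    convert.add_argument("--warc-output", default="./output")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

argparse turns dashes in option names into underscores, so --snapshot-folder is read back as args.snapshot_folder.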

Modes and example usage:

Download mode — download snapshots listed in a CSV

Description: Reads a CSV containing full Wayback URLs such as https://web.archive.org/web/20251002062751/https://example.com/page and downloads snapshots for a specified period around the archived date. After downloading each URL, the tool automatically:

Converts the downloaded snapshots to a WARC file (saved in output/ directory, or custom location via --warc-output)
Cleans up temporary files from waybackup_snapshots/ directory (if --clean flag is used)

Note: In download mode, WARC filenames are automatically generated from the URL. The --output flag is not used in this mode.

Required input: Path to the CSV file to read (e.g. resources/curated_urls.csv). The default column name expected is Internet_Archive_URL.
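
A full Wayback URL of the form shown above carries both the capture timestamp and the original URL. The following sketch shows one way to split the two; it is an illustration only, and the names WAYBACK_RE, WaybackRef and parse_wayback_url are hypothetical rather than part of this tool or of pywaybackup.

```python
import re
from datetime import datetime
from typing import NamedTuple

# 14-digit timestamp, an optional Wayback modifier such as "id_", then the original URL.
WAYBACK_RE = re.compile(r"^https?://web\.archive\.org/web/(\d{14})[a-z_]*/(.+)$")


class WaybackRef(NamedTuple):
    timestamp: datetime
    original_url: str


def parse_wayback_url(url: str) -> WaybackRef:
    """Split a full Wayback URL into its capture time and original URL."""
    match = WAYBACK_RE.match(url.strip())
    if match is None:
        raise ValueError(f"not a full Wayback URL: {url!r}")
    stamp, original = match.groups()
    return WaybackRef(datetime.strptime(stamp, "%Y%m%d%H%M%S"), original)


if __name__ == "__main__":
    ref = parse_wayback_url("https://web.archive.org/web/20251002062751/https://example.com/page")
    print(ref.timestamp.isoformat(), ref.original_url)
```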

Example:

python src/main.py download resources/curated_urls.csv --column_name Internet_Archive_URL --period DAY

Flags:

--column_name — Name of the CSV column containing Wayback URLs (default: Internet_Archive_URL)
--period — Download period options:
- DAY (default) — Downloads snapshots ±1 day around the archived date
- WEEK — Downloads snapshots ±1 week around the archived date
- FULL — Downloads all snapshots from 1995-2005
- CUSTOM — Downloads snapshots within a custom date range (requires --start_time and --end_time)
--start_time — Start time for CUSTOM period in YYYYMMDDHHMMSS format
--end_time — End time for CUSTOM period in YYYYMMDDHHMMSS format
--reset — If present, forces re-download by passing reset=True to pywaybackup
--snapshot-folder — Path to the folder where pywaybackup stores downloaded snapshots (default: ./waybackup_snapshots)
--warc-output — Path to the folder where WARC files will be saved (default: ./output)
--workers — Number of worker threads for parallel downloading (default: 5)
--clean — If present, deletes intermediate CSV, DB, and CDX files after processing
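
The period options above each translate into a start and end timestamp in YYYYMMDDHHMMSS format. The sketch below illustrates that arithmetic under stated assumptions: the function name period_range is hypothetical, and the exact bounds used for FULL (1 January 1995 to 31 December 2005) are a guess based on the "1995-2005" description, not values read from the source code.

```python
from datetime import datetime, timedelta
from typing import Optional, Tuple

FMT = "%Y%m%d%H%M%S"


def period_range(
    archived: str,
    period: str = "DAY",
    start_time: Optional[str] = None,
    end_time: Optional[str] = None,
) -> Tuple[str, str]:
    """Return (start, end) timestamps for a capture time and a period name."""
    if period == "FULL":
        # Assumed bounds for "all snapshots from 1995-2005".
        return "19950101000000", "20051231235959"
    if period == "CUSTOM":
        if not (start_time and end_time):
            raise ValueError("CUSTOM requires both start_time and end_time")
        # strptime raises ValueError if either value is not a valid timestamp.
        datetime.strptime(start_time, FMT)
        datetime.strptime(end_time, FMT)
        return start_time, end_time
    if period == "DAY":
        delta = timedelta(days=1)
    elif period == "WEEK":
        delta = timedelta(weeks=1)
    else:
        raise ValueError(f"unknown period: {period!r}")
    moment = datetime.strptime(archived, FMT)
    return (moment - delta).strftime(FMT), (moment + delta).strftime(FMT)


if __name__ == "__main__":
    print(period_range("20251002062751", "DAY"))
```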

Example with CUSTOM period:

python src/main.py download resources/curated_urls.csv --period CUSTOM --start_time 20000101000000 --end_time 20001231235959

Example with custom snapshot and WARC output folders:

python src/main.py download resources/curated_urls.csv --snapshot-folder /data/snapshots --warc-output /data/warcs

Convert mode — combine CSVs and produce a WARC

Description: Combine all .csv files from the specified directory into a single CSV (written to combined_output.csv by default) and convert that CSV to a WARC-GZ.

Required input: Path to a directory that contains CSV files to combine (e.g. waybackup_snapshots/ or any folder with CSV exports).

Required --output: Base filename for the resulting WARC file (without extension). The tool will append -0001.warc.gz, -0002.warc.gz, etc.

Example:

python src/main.py convert waybackup_snapshots --output mysite_archive

Optional flags:

--warc-output — Path to the folder where WARC files will be saved (default: ./output)

Example with custom WARC output folder:

python src/main.py convert waybackup_snapshots --output mysite_archive --warc-output /data/warcs

Notes:

The script combines CSV files using pandas.concat and writes the combined CSV to combined_output.csv.
The combined CSV is then read and converted into <warc-output>/<output>.warc.gz.
The CSVs are expected to contain columns: url_origin, url_archive, file, timestamp, and response.

Important implementation notes

Automatic workflow in Download mode: When downloading, each URL is processed individually:
1. Downloads snapshots using pywaybackup to the snapshot folder (default: waybackup_snapshots/, configurable via --snapshot-folder)
2. Generates a CSV file with snapshot metadata
3. Automatically creates a WARC file from the downloaded data (saved to the WARC output folder, default: output/, configurable via --warc-output)
4. Cleans up temporary files and subdirectories from the snapshot folder (if --clean flag is used)
Expected CSV columns: The CSVs read by the converter must contain url_origin, url_archive, file, timestamp, and response. These columns are created by the pywaybackup package.
Missing files: The converter skips entries whose file path does not exist and prints a warning.
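
The missing-file behavior described above can be pictured as a filter over the CSV rows. The following is a standard-library sketch of that idea, not the converter's implementation; the function name iter_existing_rows and the wording of the warning are assumptions.

```python
import csv
import os
import sys
from typing import Dict, Iterator


def iter_existing_rows(csv_path: str) -> Iterator[Dict[str, str]]:
    """Yield CSV rows whose 'file' path exists on disk; warn about the rest."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            path = row.get("file") or ""
            if not os.path.isfile(path):
                # Skip the entry instead of failing the whole conversion.
                print(f"Warning: skipping missing file {path!r}", file=sys.stderr)
                continue
            yield row
```

Because the function is a generator, rows are checked one at a time as the caller consumes them, so a large CSV does not have to be loaded into memory first.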

Example workflow

Create or obtain a CSV of Wayback URLs (column name Internet_Archive_URL), e.g. resources/small_test.csv.

Run download mode. For each URL this downloads the snapshots, converts them to a WARC file, and, if --clean is passed, cleans up temporary files:

python src/main.py download resources/curated_urls.csv --column_name Internet_Archive_URL --period DAY

The resulting WARC files will be in the output/ directory (or your custom --warc-output directory), named after each URL (e.g., output/http_www_example_com_page.warc.gz).
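
The exact rule used to derive a filename from a URL is not documented here. As an illustration only, the sketch below reproduces the example name above by collapsing every run of non-alphanumeric characters into an underscore; the function name warc_basename, the length limit and the fallback name are assumptions.

```python
import re


def warc_basename(url: str, max_length: int = 150) -> str:
    """Turn a URL into a filesystem-friendly base name (without extension)."""
    name = re.sub(r"[^A-Za-z0-9]+", "_", url).strip("_")
    # Fall back to a fixed name if nothing usable is left.
    return name[:max_length] or "snapshot"


if __name__ == "__main__":
    print(warc_basename("http://www.example.com/page") + ".warc.gz")
```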

Advanced workflow with custom directories:

python src/main.py download resources/curated_urls.csv \
  --column_name Internet_Archive_URL \
  --period WEEK \
  --snapshot-folder /mnt/data/snapshots \
  --warc-output /mnt/data/archives \
  --workers 10 \
  --clean

This will:

Download snapshots to /mnt/data/snapshots/
Save WARC files to /mnt/data/archives/
Use 10 parallel workers for faster downloads
Clean up temporary files after each URL is processed

Troubleshooting

Missing CSV columns: If the script can't find expected CSV columns, inspect the CSV(s) created by pywaybackup and ensure the required column names (file, timestamp, response, url_origin, url_archive) are present.
Download failures: If downloads fail, try rerunning with --reset to force re-downloads.
Custom period errors: When using --period CUSTOM, both --start_time and --end_time must be provided in YYYYMMDDHHMMSS format.
Database index errors: The tool handles SQLAlchemy OperationalError exceptions about existing database indexes gracefully. These are reported as warnings, not fatal errors.
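
When diagnosing the missing-column case above, a quick header check can save time. This is a small standalone sketch, not part of the tool; the name missing_columns is hypothetical, and the required column names are the ones listed in this README.

```python
import csv
from typing import Set

REQUIRED_COLUMNS = {"url_origin", "url_archive", "file", "timestamp", "response"}


def missing_columns(csv_path: str) -> Set[str]:
    """Return the required column names that are absent from the CSV header."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        header = next(csv.reader(handle), [])
    return REQUIRED_COLUMNS - {name.strip() for name in header}
```

An empty result means the header contains every column the converter expects.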

Next steps / Improvements

Add argument validation to require --output for convert mode
Add unit tests for CSV combining and WARC creation edge cases (missing files, bad timestamps)

Project details

Release history

0.0.11 (this version): Mar 10, 2026
0.0.10: Jan 15, 2026
0.0.9: Jan 13, 2026
0.0.8: Dec 19, 2025
0.0.7: Oct 22, 2025
0.0.6: Oct 22, 2025
0.0.5: Oct 22, 2025
0.0.4: Oct 21, 2025
0.0.3: Oct 16, 2025
0.0.2: Oct 16, 2025
0.0.1: Oct 16, 2025


File details

Details for the file internet_archive_extractor-0.0.11.tar.gz.

File metadata

Download URL: internet_archive_extractor-0.0.11.tar.gz
Upload date: Mar 10, 2026
Size: 17.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for internet_archive_extractor-0.0.11.tar.gz
Algorithm	Hash digest
SHA256	`5230c63ef58191f53e899ef9fb9370b991b23ec70db3a50fefe0a052565750ea`
MD5	`246cc21d329194ebbd1531edef1318ac`
BLAKE2b-256	`bb850f7a95a08a01ea6a3b91d4bf18807408e5e2ed03a16cfb9b505ef584c566`


File details

Details for the file internet_archive_extractor-0.0.11-py3-none-any.whl.

File metadata

Download URL: internet_archive_extractor-0.0.11-py3-none-any.whl
Upload date: Mar 10, 2026
Size: 17.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for internet_archive_extractor-0.0.11-py3-none-any.whl
Algorithm	Hash digest
SHA256	`fa1ab8a22fe7b9add01374747b52f27f6de372d5b80215737306763b7e9f7114`
MD5	`d7c22c725bfb6a2aacf006b676bdfa9f`
BLAKE2b-256	`aabacac57263d9ccf247f458f1ea66c5771b66eff38d9f2db2683585b65b2c35`
