Tool for extracting archived websites from the Internet Archive and saving them as WARC files.
InternetArchiveExtractor
This repository extracts archived content from the Wayback Machine and converts collected metadata and downloaded snapshot files into compressed WARC files. The project currently supports three primary modes of operation: downloading snapshots from the Internet Archive, combining/cleaning CSV metadata produced by Wayback backup tools, and converting that metadata + downloaded files into a WARC-GZ archive.
What this does (short)
- Download mode: reads a CSV of Internet Archive (Wayback) URLs, determines snapshot ranges, and uses pywaybackup to download snapshots (the downloads are stored in the local waybackup_snapshots/ folder by default).
- Convert mode: combines CSV files (from a directory) into a single CSV and then converts that CSV into a compressed WARC (.warc.gz) using warcio.
- Full mode: runs download, then combine and convert, to produce a WARC in one run.
Requirements
Install the Python dependencies from the repository requirements.txt:
pip install -r requirements.txt
Notable packages used:
- pywaybackup — downloads Wayback snapshots
- pandas — CSV handling and merging when combining multiple CSVs
- warcio — writing WARC records
See requirements.txt for the exact pinned versions used in this repository.
Project layout (important files)
- src/main.py: command-line entry point that exposes download, convert, and full modes.
- src/internet_archive_downloader.py: logic that reads an input CSV of Internet Archive URLs and runs pywaybackup to download snapshots.
- src/waybackup_to_warc.py: functions to combine CSV files, clean URLs (remove :80), and produce a .warc.gz from a CSV of records.
- resources/: example CSVs (e.g. curated_urls.csv) useful for quick testing.
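The URL cleanup mentioned for src/waybackup_to_warc.py (removing :80) could look roughly like the sketch below. The helper name strip_port_80 and the regular expression are assumptions made for illustration; the repository's own cleaning function may be named and implemented differently.

```python
import re

# Matches "://host:80" only when the port is followed by "/" or the end of
# the string, so other ports such as 8080 are left untouched.
_PORT_80 = re.compile(r"(://[^/:]+):80(?=/|$)")


def strip_port_80(url: str) -> str:
    """Remove an explicit :80 port from a URL (hypothetical helper)."""
    return _PORT_80.sub(r"\1", url)


print(strip_port_80("http://example.com:80/page"))  # http://example.com/page
```

Because the pattern anchors on "://", it also cleans an original URL embedded inside a Wayback archive URL.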
How to run
Usage pattern for the main runner (src/main.py):
python src/main.py <mode> <input> [--output OUTPUT] [--column_name COLUMN] [--period DAY|WEEK] [--reset]
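For orientation, the usage line above corresponds to an argument parser along the following lines. This is a minimal sketch built from the documented flags and defaults only; the function name build_parser and the help strings are assumptions, and the real parser in src/main.py may differ.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented usage line (illustrative only)."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("mode", choices=["download", "convert", "full"])
    parser.add_argument("input", help="CSV file (download/full) or directory of CSVs (convert)")
    parser.add_argument("--output", default=None, help="WARC base name; .warc.gz is appended")
    parser.add_argument("--column_name", default="Internet_Archive_URL")
    parser.add_argument("--period", choices=["DAY", "WEEK"], default="DAY")
    parser.add_argument("--reset", action="store_true")
    return parser


args = build_parser().parse_args(
    ["download", "resources/curated_urls.csv", "--period", "WEEK"]
)
print(args.mode, args.input, args.period, args.reset)
```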
Modes and example usage:
- Download mode: download snapshots listed in a CSV
  - Description: Reads a CSV containing full Wayback URLs such as https://web.archive.org/web/20251002062751/https://example.com/page and downloads snapshots for a small period around the archived date.
  - Required input: path to the CSV file to read (e.g. resources/curated_urls.csv). The default column name expected is Internet_Archive_URL.
  - Example:
    python src/main.py download resources/curated_urls.csv --column_name Internet_Archive_URL --period DAY
  - Flags:
    - --period: DAY (default) or WEEK. Controls whether the downloader fetches snapshots ±1 day or ±1 week around the archived date.
    - --reset: if present, passes reset=True to pywaybackup (useful to force re-download).
- Convert mode: combine CSVs and produce a WARC
  - Description: Combines all .csv files from the specified directory into a single CSV (written to combined_output.csv by default) and converts that CSV to a WARC-GZ.
  - Required input: path to a directory that contains CSV files to combine (e.g. waybackup_snapshots/ or any folder with CSV exports).
  - --output should be provided to name the resulting WARC file (the code will append .warc.gz).
  - Example:
    python src/main.py convert waybackup_snapshots --output mysite_archive
  - Notes: The script combines CSV files using pandas.concat and writes the combined CSV to combined_output.csv (the value of COMBINED_CSV_PATH). The combined CSV is then read and converted into output/<output>.warc.gz.
- Full mode: download then convert
  - Description: Downloads snapshots from the input CSV, then combines CSVs (from waybackup_snapshots) and converts them into a WARC.
  - Example:
    python src/main.py full resources/curated_urls.csv --output combined_site_archive
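To make the --period behaviour in download mode concrete, the sketch below splits a Wayback URL into its 14-digit timestamp and original URL, then computes a ±1 day or ±1 week window around the capture time. The names WAYBACK_RE and snapshot_window are hypothetical, and how the resulting range is handed to pywaybackup is not shown here; the actual logic in src/internet_archive_downloader.py may differ.

```python
import re
from datetime import datetime, timedelta

# https://web.archive.org/web/<14-digit timestamp>[optional modifier such as id_]/<original URL>
WAYBACK_RE = re.compile(r"^https?://web\.archive\.org/web/(\d{14})[a-z_]*/(.+)$")
TIMESTAMP_FORMAT = "%Y%m%d%H%M%S"


def snapshot_window(wayback_url: str, period: str = "DAY"):
    """Return (original_url, start, end) with start/end as Wayback timestamps."""
    match = WAYBACK_RE.match(wayback_url)
    if match is None:
        raise ValueError(f"not a Wayback snapshot URL: {wayback_url}")
    captured = datetime.strptime(match.group(1), TIMESTAMP_FORMAT)
    # Assumption: WEEK means seven days either side, anything else one day.
    delta = timedelta(weeks=1) if period == "WEEK" else timedelta(days=1)
    start = (captured - delta).strftime(TIMESTAMP_FORMAT)
    end = (captured + delta).strftime(TIMESTAMP_FORMAT)
    return match.group(2), start, end


print(snapshot_window(
    "https://web.archive.org/web/20251002062751/https://example.com/page"
))
# ('https://example.com/page', '20251001062751', '20251003062751')
```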
Important implementation notes
- Default combined CSV file path: combined_output.csv (the module-level COMBINED_CSV_PATH in src/waybackup_to_warc.py).
- The CSVs read by the converter are expected to contain columns like url_origin, url_archive, file, timestamp, and response (see src/waybackup_to_warc.py for required field names used when creating WARC records).
- The converter will skip entries whose file path does not exist and prints a warning. It also emits simple 404/500 WARC entries when those response codes are encountered.
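The per-row decisions described in the notes above can be sketched without any WARC writing at all. In this illustration, plan_records is a hypothetical helper that labels each CSV row as a normal response record, a simple 404/500 error record, or a skipped entry. Checking the response code before the file path is an assumption; the converter in src/waybackup_to_warc.py may order these checks differently.

```python
import csv
import io
import os


def plan_records(rows, file_exists=os.path.exists):
    """Label each row with the action a converter would take (illustrative only)."""
    plan = []
    for row in rows:
        url = row["url_origin"]
        status = str(row.get("response", "")).strip()
        if status in ("404", "500"):
            # Error responses get a simple record and need no downloaded file.
            plan.append(("error_record", url, status))
        elif not row.get("file") or not file_exists(row["file"]):
            print(f"Warning: file missing for {url}, skipping")
            plan.append(("skip", url, status))
        else:
            plan.append(("response_record", url, status))
    return plan


sample = io.StringIO(
    "url_origin,file,response\n"
    "http://example.com/,snapshots/index.html,200\n"
    "http://example.com/gone,,404\n"
)
# file_exists is stubbed so the example does not depend on the local disk.
for action in plan_records(csv.DictReader(sample), file_exists=lambda path: True):
    print(action)
```

Passing file_exists as a parameter keeps the decision logic testable without creating files.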
Example quick workflow
- Create or obtain a CSV of Wayback URLs (column name Internet_Archive_URL), e.g. resources/curated_urls.csv.
- Download snapshots for those URLs:
  python src/main.py download resources/curated_urls.csv --column_name Internet_Archive_URL
  This writes per-site CSVs and downloaded files into waybackup_snapshots/ (and related subfolders) using pywaybackup.
- Combine CSVs and convert to WARC:
  python src/main.py convert waybackup_snapshots --output archived_site
- The resulting WARC will be written to output/archived_site.warc.gz.
Troubleshooting
- If --output is not provided for convert/full, the conversion step may attempt to use a None filename. Always provide --output when converting.
- If the script can't find expected CSV columns, inspect the CSV(s) created by pywaybackup and ensure the required column names (file, timestamp, response, url_origin) are present.
- If downloads fail, try rerunning with --reset to force re-downloads.
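A quick way to check a CSV for the required columns before converting is sketched below using only the standard library. The helper name missing_columns is an assumption, and the column set is taken from the troubleshooting note above rather than from the converter source.

```python
import csv
import io

REQUIRED_COLUMNS = {"file", "timestamp", "response", "url_origin"}


def missing_columns(csv_file) -> set:
    """Return the required column names that are absent from the CSV header."""
    reader = csv.DictReader(csv_file)
    return REQUIRED_COLUMNS - set(reader.fieldnames or [])


# In practice csv_file would be open(path, newline=""); a string buffer
# keeps the example self-contained.
sample = io.StringIO("url_origin,url_archive,file,timestamp\n")
print(sorted(missing_columns(sample)))  # ['response']
```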
Next steps / Improvements
- Add argument validation to require --output for convert and full modes.
- Add unit tests for CSV combining and WARC creation edge cases (missing files, bad timestamps).
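The first improvement could be as small as the check sketched here. The function name require_output is hypothetical and this is not code from the repository; a command-line entry point could call it right after parsing and report the message through argparse's parser.error.

```python
def require_output(mode: str, output):
    """Raise ValueError when convert/full is requested without an output name."""
    if mode in ("convert", "full") and not output:
        raise ValueError(f"--output is required for {mode} mode")
    return output


try:
    require_output("convert", None)
except ValueError as exc:
    print(exc)  # --output is required for convert mode
```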