Tool for extracting archived websites from the Internet Archive and saving them as WARC files.
InternetArchiveExtractor
This repository extracts archived content from the Wayback Machine and converts collected metadata and downloaded snapshot files into compressed WARC files. The project supports two primary modes of operation: downloading snapshots from the Internet Archive and converting CSV metadata (produced by pywaybackup) into WARC-GZ archives.
What this does (short)
- Download mode: Reads a CSV of Internet Archive (Wayback) URLs and uses pywaybackup to download snapshots. For each URL processed, it automatically converts the downloaded snapshots to a WARC file and, when --clean is given, cleans up temporary files.
- Convert mode: Combines the CSV files in a directory into a single CSV, then converts that CSV into a compressed WARC (.warc.gz) using warcio.
Requirements
Install the Python dependencies from the repository requirements.txt:
pip install -r requirements.txt
Notable packages used:
- pywaybackup: downloads Wayback snapshots
- pandas: CSV handling and merging when combining multiple CSVs
- warcio: writing WARC records
See requirements.txt for the exact pinned versions used in this repository.
Project layout (important files)
- src/main.py: command-line entry point that exposes the download and convert modes.
- src/internet_archive_downloader.py: logic that reads an input CSV of Internet Archive URLs and runs pywaybackup to download snapshots. After each URL is downloaded, it automatically converts the CSV to WARC and cleans up temporary files.
- src/waybackup_to_warc.py: functions to combine CSV files, clean URLs (remove :80), and produce a .warc.gz from a CSV of records.
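The URL cleaning step mentioned for src/waybackup_to_warc.py can be illustrated with a small sketch. This is not the repository's code: the function name `strip_default_port` is hypothetical, and the approach (dropping a trailing `:80` from the host part only) is an assumption about what "remove :80" means.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_default_port(url: str) -> str:
    """Return url without an explicit ':80' port on the host (hypothetical helper)."""
    parts = urlsplit(url)
    netloc = parts.netloc
    if not netloc.endswith(":80"):
        # Nothing to clean: no port, or a port other than 80.
        return url
    # Drop only the trailing ':80'; path, query, and fragment are untouched.
    cleaned = netloc[: -len(":80")]
    return urlunsplit((parts.scheme, cleaned, parts.path, parts.query, parts.fragment))


print(strip_default_port("http://example.com:80/page?a=1"))
```

Working on the parsed host rather than calling `str.replace(":80", "")` on the whole URL avoids mangling ports such as 8080 or a `:80` that happens to appear in the path.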
How to run
Usage pattern for the main runner (src/main.py):
# Download mode
python src/main.py download <input> [--column_name COLUMN] [--period PERIOD] [--reset] [--start_time START] [--end_time END] [--snapshot-folder FOLDER] [--warc-output FOLDER] [--workers N] [--clean]
# Convert mode
python src/main.py convert <input> --output OUTPUT [--warc-output FOLDER]
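For readers who want to see how a command line of this shape could be declared, here is a minimal argparse sketch. It mirrors the flags and defaults documented in this README, but it is an illustration only: the function name `build_parser` is hypothetical and the actual src/main.py may be organised differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with 'download' and 'convert' sub-commands (illustrative only)."""
    parser = argparse.ArgumentParser(prog="main.py")
    modes = parser.add_subparsers(dest="mode", required=True)

    download = modes.add_parser("download", help="download snapshots and write WARC files")
    download.add_argument("input", help="CSV file containing Wayback URLs")
    download.add_argument("--column_name", default="Internet_Archive_URL")
    download.add_argument("--period", choices=["DAY", "WEEK", "FULL", "CUSTOM"], default="DAY")
    download.add_argument("--reset", action="store_true")
    download.add_argument("--start_time", help="YYYYMMDDHHMMSS, used with --period CUSTOM")
    download.add_argument("--end_time", help="YYYYMMDDHHMMSS, used with --period CUSTOM")
    download.add_argument("--snapshot-folder", default="./waybackup_snapshots")
    download.add_argument("--warc-output", default="./output")
    download.add_argument("--workers", type=int, default=5)
    download.add_argument("--clean", action="store_true")

    convert = modes.add_parser("convert", help="combine CSVs and write a WARC file")
    convert.add_argument("input", help="directory containing CSV files")
    # Marked required here so argparse rejects a convert call without --output.
    convert.add_argument("--output", required=True)
    convert.add_argument("--warc-output", default="./output")
    return parser


args = build_parser().parse_args(["download", "urls.csv", "--period", "WEEK", "--workers", "10"])
print(args.mode, args.period, args.workers, args.snapshot_folder)
```

argparse maps `--snapshot-folder` and `--warc-output` to the attributes `snapshot_folder` and `warc_output`.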
Modes and example usage:
Download mode — download snapshots listed in a CSV
Description: Reads a CSV containing full Wayback URLs such as https://web.archive.org/web/20251002062751/https://example.com/page and downloads snapshots for a specified period around the archived date. After downloading each URL, the tool automatically:
- Converts the downloaded snapshots to a WARC file (saved in the output/ directory, or a custom location via --warc-output)
- Cleans up temporary files from the waybackup_snapshots/ directory (if the --clean flag is used)
Note: In download mode, WARC filenames are automatically generated from the URL. The --output flag is not used in this mode.
Required input: Path to the CSV file to read (e.g. resources/curated_urls.csv). The default column name expected is Internet_Archive_URL.
Example:
python src/main.py download resources/curated_urls.csv --column_name Internet_Archive_URL --period DAY
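Each row of the input CSV holds a full Wayback URL, which embeds a 14-digit timestamp followed by the original URL. The sketch below shows one way to split the two apart. The names `WAYBACK_RE`, `WaybackRef`, and `parse_wayback_url` are hypothetical, and the tool itself may rely on pywaybackup for this step instead.

```python
import re
from datetime import datetime
from typing import NamedTuple

# 14-digit timestamp, an optional two-letter modifier such as "id_", then the original URL.
WAYBACK_RE = re.compile(r"^https?://web\.archive\.org/web/(\d{14})(?:[a-z]{2}_)?/(.+)$")


class WaybackRef(NamedTuple):
    timestamp: datetime
    original_url: str


def parse_wayback_url(url: str) -> WaybackRef:
    """Split a full Wayback URL into its archive timestamp and original URL."""
    match = WAYBACK_RE.match(url.strip())
    if match is None:
        raise ValueError(f"not a full Wayback URL: {url!r}")
    stamp, original = match.groups()
    return WaybackRef(datetime.strptime(stamp, "%Y%m%d%H%M%S"), original)


ref = parse_wayback_url("https://web.archive.org/web/20251002062751/https://example.com/page")
print(ref.timestamp.isoformat(), ref.original_url)
```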
Flags:
- --column_name: Name of the CSV column containing Wayback URLs (default: Internet_Archive_URL)
- --period: Download period. Options:
  - DAY (default): Downloads snapshots ±1 day around the archived date
  - WEEK: Downloads snapshots ±1 week around the archived date
  - FULL: Downloads all snapshots from 1995-2005
  - CUSTOM: Downloads snapshots within a custom date range (requires --start_time and --end_time)
- --start_time: Start time for the CUSTOM period in YYYYMMDDHHMMSS format
- --end_time: End time for the CUSTOM period in YYYYMMDDHHMMSS format
- --reset: If present, forces re-download by passing reset=True to pywaybackup
- --snapshot-folder: Path to the folder where pywaybackup stores downloaded snapshots (default: ./waybackup_snapshots)
- --warc-output: Path to the folder where WARC files will be saved (default: ./output)
- --workers: Number of worker threads for parallel downloading (default: 5)
- --clean: If present, deletes intermediate CSV, DB, and CDX files after processing
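The period options above amount to computing a start and end timestamp around the archived date. A rough sketch of that arithmetic follows. The function `period_window` is hypothetical, and the exact bounds used for FULL are an assumption, since this README only states "1995-2005".

```python
from datetime import datetime, timedelta
from typing import Optional, Tuple

STAMP_FORMAT = "%Y%m%d%H%M%S"
# Assumed bounds for FULL; the README only says "1995-2005".
FULL_RANGE = ("19950101000000", "20051231235959")


def _parse(stamp: str) -> datetime:
    """Parse a strict 14-digit YYYYMMDDHHMMSS timestamp."""
    if len(stamp) != 14 or not stamp.isdigit():
        raise ValueError(f"expected YYYYMMDDHHMMSS, got {stamp!r}")
    return datetime.strptime(stamp, STAMP_FORMAT)


def period_window(
    archived: str,
    period: str,
    start_time: Optional[str] = None,
    end_time: Optional[str] = None,
) -> Tuple[str, str]:
    """Return (start, end) timestamps for the given period option."""
    if period == "FULL":
        return FULL_RANGE
    if period == "CUSTOM":
        if start_time is None or end_time is None:
            raise ValueError("CUSTOM requires both --start_time and --end_time")
        if _parse(start_time) > _parse(end_time):
            raise ValueError("--start_time must not be after --end_time")
        return start_time, end_time
    deltas = {"DAY": timedelta(days=1), "WEEK": timedelta(weeks=1)}
    if period not in deltas:
        raise ValueError(f"unknown period: {period!r}")
    centre = _parse(archived)
    start = centre - deltas[period]
    end = centre + deltas[period]
    return start.strftime(STAMP_FORMAT), end.strftime(STAMP_FORMAT)


print(period_window("20251002062751", "DAY"))
```

Checking the CUSTOM bounds up front gives a clear error before any download starts, which matches the troubleshooting advice for custom period errors.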
Example with CUSTOM period:
python src/main.py download resources/curated_urls.csv --period CUSTOM --start_time 20000101000000 --end_time 20001231235959
Example with custom snapshot and WARC output folders:
python src/main.py download resources/curated_urls.csv --snapshot-folder /data/snapshots --warc-output /data/warcs
Convert mode — combine CSVs and produce a WARC
Description: Combine all .csv files from the specified directory into a single CSV (written to combined_output.csv by default) and convert that CSV to a WARC-GZ.
Required input: Path to a directory that contains CSV files to combine (e.g. waybackup_snapshots/ or any folder with CSV exports).
Required --output: Base filename for the resulting WARC file (without extension). The tool will append -0001.warc.gz, -0002.warc.gz, etc.
Example:
python src/main.py convert waybackup_snapshots --output mysite_archive
Optional flags:
--warc-output— Path to the folder where WARC files will be saved (default:./output)
Example with custom WARC output folder:
python src/main.py convert waybackup_snapshots --output mysite_archive --warc-output /data/warcs
Notes:
- The script combines CSV files using pandas.concat and writes the combined CSV to combined_output.csv.
- The combined CSV is then read and converted into <warc-output>/<output>.warc.gz.
- The CSVs are expected to contain the columns url_origin, url_archive, file, timestamp, and response.
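To make the conversion step more concrete, the sketch below writes one gzip-compressed WARC response record with warcio. It is not the repository's implementation. It assumes that the response column holds an HTTP status code, that timestamp is a 14-digit Wayback timestamp, and that the content type is known to the caller. The helper names are hypothetical, and warcio must be installed for `write_response_record` to run.

```python
import io
from datetime import datetime
from http import HTTPStatus


def warc_date(timestamp: str) -> str:
    """Convert a 14-digit Wayback timestamp to the ISO 8601 form used by WARC-Date."""
    parsed = datetime.strptime(timestamp, "%Y%m%d%H%M%S")
    return parsed.strftime("%Y-%m-%dT%H:%M:%SZ")


def status_line(code: int) -> str:
    """Build an HTTP status line such as '200 OK' from a numeric code."""
    try:
        return f"{code} {HTTPStatus(code).phrase}"
    except ValueError:
        # Non-standard code: fall back to the bare number.
        return str(code)


def write_response_record(out_path, url, timestamp, status, body,
                          content_type="text/html"):
    """Append one response record to a .warc.gz file (content_type is an assumption)."""
    # Imported lazily so the two helpers above work without warcio installed.
    from warcio.statusandheaders import StatusAndHeaders
    from warcio.warcwriter import WARCWriter

    with open(out_path, "ab") as stream:
        writer = WARCWriter(stream, gzip=True)
        http_headers = StatusAndHeaders(
            status_line(status), [("Content-Type", content_type)], protocol="HTTP/1.1"
        )
        record = writer.create_warc_record(
            url,
            "response",
            payload=io.BytesIO(body),
            http_headers=http_headers,
            warc_headers_dict={"WARC-Date": warc_date(timestamp)},
        )
        writer.write_record(record)


print(warc_date("20251002062751"), status_line(200))
```

Setting WARC-Date from the snapshot timestamp, rather than letting the writer use the current time, keeps the capture date of the original snapshot in the record.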
Important implementation notes
- Automatic workflow in Download mode: When downloading, each URL is processed individually:
  - Downloads snapshots using pywaybackup to the snapshot folder (default: waybackup_snapshots/, configurable via --snapshot-folder)
  - Generates a CSV file with snapshot metadata
  - Automatically creates a WARC file of the downloaded data (saved to the WARC output folder, default: output/, configurable via --warc-output)
  - Cleans up temporary files and subdirectories from the snapshot folder (if the --clean flag is used)
- Expected CSV columns: The CSVs read by the converter must contain url_origin, url_archive, file, timestamp, and response, which are produced by the pywaybackup package.
- Missing files: The converter skips entries whose file path does not exist and prints a warning.
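The column check and the skip-on-missing-file behaviour described above could look roughly like this. The sketch uses only the standard library csv module instead of pandas, and `iter_convertible_rows` is a hypothetical name, so treat it as an illustration of the behaviour, not the converter's actual code.

```python
import csv
import sys
from pathlib import Path
from typing import Dict, Iterator

REQUIRED_COLUMNS = ("url_origin", "url_archive", "file", "timestamp", "response")


def iter_convertible_rows(csv_path: Path) -> Iterator[Dict[str, str]]:
    """Yield rows that have all required columns and an existing snapshot file."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        present = reader.fieldnames or []
        missing = [name for name in REQUIRED_COLUMNS if name not in present]
        if missing:
            raise ValueError(f"{csv_path}: missing columns {missing}")
        for row in reader:
            if not row["file"] or not Path(row["file"]).is_file():
                # Mirror the documented behaviour: warn and skip, do not abort.
                print(f"warning: skipping {row['url_origin']}: "
                      f"file not found: {row['file']!r}", file=sys.stderr)
                continue
            yield row
```

Because this is a generator, the column check runs when iteration starts, for example on `list(iter_convertible_rows(path))`.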
Example workflow
- Create or obtain a CSV of Wayback URLs (column name Internet_Archive_URL), e.g. resources/small_test.csv.
- Run download mode. For each URL this automatically downloads the snapshots, converts them to WARC, and, if --clean is passed, cleans up:
python src/main.py download resources/curated_urls.csv --column_name Internet_Archive_URL --period DAY
- The resulting WARC files will be in the output/ directory (or your custom --warc-output directory), named after each URL (e.g., output/http_www_example_com_page.warc.gz).
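The example filename above suggests that every run of non-alphanumeric characters in the URL is replaced by an underscore. The sketch below reproduces that one example under this assumption. The tool's real naming rule is not documented here and may differ, and `warc_path_for` is a hypothetical name.

```python
import re
from pathlib import Path


def warc_path_for(url: str, warc_output: str = "./output") -> Path:
    """Derive a WARC file path from a URL (assumed naming rule, not confirmed)."""
    # Collapse every run of characters outside [A-Za-z0-9] into a single underscore.
    stem = re.sub(r"[^A-Za-z0-9]+", "_", url).strip("_")
    return Path(warc_output) / f"{stem}.warc.gz"


print(warc_path_for("http://www.example.com/page"))
```

A rule like this can map different URLs to the same name and can produce very long filenames, so a production version would likely need truncation or a hash suffix.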
Advanced workflow with custom directories:
python src/main.py download resources/curated_urls.csv \
--column_name Internet_Archive_URL \
--period WEEK \
--snapshot-folder /mnt/data/snapshots \
--warc-output /mnt/data/archives \
--workers 10 \
--clean
This will:
- Download snapshots to /mnt/data/snapshots/
- Save WARC files to /mnt/data/archives/
- Use 10 parallel workers for faster downloads
- Clean up temporary files after each URL is processed
Troubleshooting
- Missing CSV columns: If the script can't find expected CSV columns, inspect the CSV(s) created by pywaybackup and ensure the required column names (file, timestamp, response, url_origin, url_archive) are present.
- Download failures: If downloads fail, try rerunning with --reset to force re-downloads.
- Custom period errors: When using --period CUSTOM, both --start_time and --end_time must be provided in YYYYMMDDHHMMSS format.
- Database index errors: The tool handles SQLAlchemy OperationalError exceptions about existing database indexes gracefully; these are warnings, not fatal errors.
Next steps / Improvements
- Add argument validation to require --output for convert mode
- Add unit tests for CSV combining and WARC creation edge cases (missing files, bad timestamps)