Query and download from archive.org as simply as possible.

Project description

python wayback machine downloader

Downloading archived web pages from the Wayback Machine.

The Internet Archive is a useful source of OSINT information. This tool is a work in progress for querying and fetching archived web pages.

This tool allows you to download content from the Wayback Machine (archive.org). You can use it to download either the latest version or all versions of web page snapshots within a specified range.

Content

➡️ Installation
➡️ notes / issues / hints
➡️ import
➡️ cli
➡️ Usage
➡️ Examples
➡️ Output
➡️ Contributing

Installation

Pip

Install the package
pip install pywaybackup
Run the tool
waybackup -h

Manual

Clone the repository
git clone https://github.com/bitdruid/python-wayback-machine-downloader.git
Install (from inside the cloned directory)
pip install .
- in a virtual env, or use --break-system-packages

notes / issues / hints

Linux recommended: On Windows machines, the path length is limited. Files whose paths exceed that limit will not be downloaded.
The tool uses a sqlite database to handle snapshots. By default, the database persists only while the download is running; use --keep to retain it.
If you query an explicit file (e.g. a query string ?query=this or login.html), the --explicit argument is recommended, as a wildcard query may lead to an empty result.
Downloading directly into a network share is not recommended. The sqlite locking mechanism may cause issues. If you need to download into a network share, set the --metadata argument to a local path.
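
The path-length note above can be checked before a download starts. This is a minimal sketch, assuming the classic 260-character Windows MAX_PATH limit (long-path support on newer systems may raise it). The helper name is hypothetical and not part of pywaybackup.

```python
from pathlib import PureWindowsPath

# Classic Windows MAX_PATH, which includes the terminating null character
# (assumption: long-path support is not enabled on the target machine).
WINDOWS_MAX_PATH = 260


def exceeds_windows_limit(output_dir: str, relative_file: str,
                          limit: int = WINDOWS_MAX_PATH) -> bool:
    """Return True if the joined path is too long for a default Windows setup."""
    full_path = PureWindowsPath(output_dir) / relative_file
    return len(str(full_path)) >= limit


is_short_too_long = exceeds_windows_limit("C:/backup", "example.com/index.html")
is_deep_too_long = exceeds_windows_limit("C:/backup", "example.com/" + "a" * 300 + ".html")
print(is_short_too_long, is_deep_too_long)
```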

import

You can import pywaybackup into your own scripts and run it. The arguments are the same as for the CLI.

Additional args:

silent (default False): If True, suppresses all output to the console.
debug (default True): If False, disables writing errors to the error log file.

Use:

run()
status()
paths()
stop()

from pywaybackup import PyWayBackup

backup = PyWayBackup(
  url="https://example.com",
  all=True,
  start="20200101",
  end="20201231",
  silent=False,
  debug=True,
  log=True,
  keep=True
)

backup.run()
backup_paths = backup.paths(rel=True)
print(backup_paths)

output:

{
  'snapshots': 'output/example.com',
  'cdxfile': 'output/waybackup_example.cdx',
  'dbfile': 'output/waybackup_example.com.db',
  'csvfile': 'output/waybackup_https.example.com.csv',
  'log': 'output/waybackup_example.com.log',
  'debug': 'output/waybackup_error.log'
}

Alternatively, run it asynchronously and print the current status, or stop it whenever needed.

import time
from pywaybackup import PyWayBackup

backup = PyWayBackup( ... )
backup.run(daemon=True)
print(backup.status())
time.sleep(10)
print(backup.status())
backup.stop()

output:

{
  'task': 'downloading snapshots',
  'current': 15,
  'total': 84,
  'progress': '18%'
}
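
The status dictionary above lends itself to a polling loop. The sketch below is only an illustration: FakeBackup is a hypothetical stand-in that mimics the shape of status() so the loop can run without network access, and treating current >= total as the completion signal is an assumption, not documented behavior.

```python
import time


class FakeBackup:
    """Hypothetical stand-in that mimics the status() dict shown above."""

    def __init__(self, total: int, step: int):
        self.current, self.total, self.step = 0, total, step

    def status(self) -> dict:
        # Advance a little on every call so the polling loop terminates.
        self.current = min(self.total, self.current + self.step)
        percent = int(self.current / self.total * 100)
        return {"task": "downloading snapshots", "current": self.current,
                "total": self.total, "progress": f"{percent}%"}


def wait_until_done(backup, poll_seconds: float = 1.0, max_polls: int = 100) -> dict:
    """Poll status() until current reaches total (assumed completion signal)."""
    status = backup.status()
    for _ in range(max_polls):
        if status["total"] and status["current"] >= status["total"]:
            break
        time.sleep(poll_seconds)
        status = backup.status()
    return status


final = wait_until_done(FakeBackup(total=84, step=30), poll_seconds=0.01)
print(final["progress"])
```

With a real job, the same loop would take the object returned by PyWayBackup(...) after run(daemon=True); max_polls guards against waiting forever if the assumed completion signal never appears.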

cli

-h, --help: Show the help message and exit.
-v, --version: Show information about the tool and exit.

Required

-u, --url:
The URL of the web page to download. This argument is required.

Mode Selection (Choose One)

-a, --all:
All timestamps. Gives one folder per timestamp.
-l, --last:
Last Version. Gives one folder containing the last version of each file within the specified --range.
-f, --first:
First Version. Gives one folder containing the first version of each file within the specified --range.

Optional query parameters

Parameters for archive.org CDX query. No effect on snapshot download itself.

-e, --explicit:
Only the explicit URL. No wildcard subdomains or paths. For example, get the root only (https://example.com) or a specific file (login.html, ?query=this).
--limit <count>:
Limits the snapshots fetched from archive.org CDX. (Has no effect on existing CDX files.)
Range Selection:
Set the query range in years (range) or by timestamp (start and/or end). If range is set, start and end are ignored. Timestamp format: YYYYMMDDhhmmss. A timestamp can be as specific as needed (year 2019, year+month+day 20190101, ...).
- -r, --range:
  Specify the range in years for which to search and download snapshots.
- --start:
  Timestamp to start searching.
- --end:
  Timestamp to end searching.
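
To illustrate what a partial timestamp covers, the following sketch pads it to the full YYYYMMDDhhmmss form, either to the first or to the last second of the period. The function name is hypothetical, and how pywaybackup or archive.org interpret partial timestamps internally is not stated here, so treat this as one plausible reading.

```python
import calendar
from datetime import datetime


def expand_timestamp(partial: str, end: bool = False) -> str:
    """Pad a partial timestamp to 14 digits (start or end of its period)."""
    if not partial.isdigit() or len(partial) not in (4, 6, 8, 10, 12, 14):
        raise ValueError(f"unsupported timestamp: {partial!r}")
    year = int(partial[0:4])
    month = int(partial[4:6]) if len(partial) >= 6 else (12 if end else 1)
    if len(partial) >= 8:
        day = int(partial[6:8])
    else:
        # Last day of the month for an end bound, first day otherwise.
        day = calendar.monthrange(year, month)[1] if end else 1
    time_part = partial[8:]
    time_pad = "235959" if end else "000000"
    time_part += time_pad[len(time_part):]
    full = f"{year:04d}{month:02d}{day:02d}{time_part}"
    datetime.strptime(full, "%Y%m%d%H%M%S")  # raises ValueError for impossible dates
    return full


print(expand_timestamp("2019"), expand_timestamp("2019", end=True))
```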
Filtering:
- --filetype <filetype>:
  Specify filetypes to download. Example: --filetype jpg,css,js. You can only filter filetypes that are stored by archive.org (.html is mostly not).
- --statuscode <statuscode>:
  Specify HTTP status codes to download. Example: --statuscode 200,301. PyWayBackup will always skip 404 and 301.
  Common status codes you may want to handle/filter:
  - 200 (OK)
  - 301 (Moved Permanently)
  - 404 (Not Found - snapshot seems to be empty)
  - 500 (Internal Server Error - snapshot is at least for now not available)
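
The query parameters above map naturally onto the public archive.org CDX server API. The sketch below only builds a query URL and sends nothing. It is an illustration of how such a query could look, not necessarily the request pywaybackup sends: the function name and the exact mapping of options to CDX parameters are assumptions, and only the path wildcard (not wildcard subdomains) is covered.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_url(url, explicit=False, start=None, end=None, limit=None,
                  filetypes=(), statuscodes=()):
    """Build a CDX query URL; a trailing /* requests everything below the path."""
    target = url if explicit else url.rstrip("/") + "/*"
    params = [("url", target), ("output", "json")]
    if start:
        params.append(("from", start))
    if end:
        params.append(("to", end))
    if limit:
        params.append(("limit", str(limit)))
    if statuscodes:
        # CDX filters have the form field:regex.
        params.append(("filter", "statuscode:" + "|".join(statuscodes)))
    if filetypes:
        params.append(("filter", r"original:.*\.(" + "|".join(filetypes) + ")$"))
    return CDX_ENDPOINT + "?" + urlencode(params)


print(build_cdx_url("example.com", start="2021", end="2023", statuscodes=["200"]))
```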

Optional Behavior Manipulation

Parameters will change the download behavior for snapshots.

-o, --output:
The folder where downloaded files will be saved. Defaults to waybackup_snapshots in the current directory.
-m, --metadata:
Folder where metadata will be saved (cdx/db/csv/log). If you are downloading into a network share, you SHOULD set this to a local path because the sqlite locking mechanism may cause issues with network shares.
-v, --verbose [level]:
Set verbosity level. Available levels:
- low (or quiet, minimal, min): Essential output only (same as no flag)
- default (or normal, verbose): Standard verbose output (default when flag is set)
- high (or debug, detailed, max): Detailed verbose output
Examples: --verbose, --verbose default, --verbose high, -v high
--log:
Saves a log file named waybackup_<sanitized_url>.log into the output dir.
--progress:
Shows a progress bar instead of the default output.
--workers <count>:
Number of simultaneous download workers. Default is 1; up to about 10 is considered safe. Too many workers may lead to refused connections by archive.org.
--no-redirect:
Disables following redirects of snapshots. Can prevent timestamp-folder mismatches caused by redirects.
--retry <attempts>:
Retry attempts for failed downloads.
--delay <seconds>:
Delay between download requests in seconds. Default is no delay (0).
--wait <seconds>:
Seconds to wait before renewing connection after HTTP errors or snapshot download errors. Default is 15 seconds.
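
The interplay of --retry and --wait can be pictured as a simple retry loop. This is a generic sketch of the pattern, not pywaybackup's implementation: the function name is hypothetical, and it assumes that --retry counts additional attempts after the first one. The sleep function is injectable so the demo runs instantly.

```python
import time
from typing import Callable


def download_with_retry(fetch: Callable[[], bytes], retries: int = 0,
                        wait: float = 15.0,
                        sleep: Callable[[float], None] = time.sleep) -> bytes:
    """Call fetch(); on a connection error wait, then retry up to `retries` times."""
    attempts = retries + 1
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except OSError:  # ConnectionError and timeouts are OSError subclasses
            if attempt == attempts:
                raise
            sleep(wait)
    raise RuntimeError("unreachable")


calls = {"n": 0}


def flaky() -> bytes:
    """Fails twice, then succeeds (stands in for a refused connection)."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("refused")
    return b"ok"


waits = []
result = download_with_retry(flaky, retries=2, wait=15, sleep=waits.append)
print(result, waits)
```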

Job Handling:

--reset:
If set, the job will be reset, and cdx, db, csv files will be deleted. This allows you to start the job from scratch.
--keep:
If set, cdx and db files will be kept after the job is finished. Otherwise they will be deleted.

Usage

Handling Interrupted Jobs

pywaybackup resumes interrupted jobs automatically and continues from where it left off.

A job is only resumed if:

- .cdx and .db files exist in the output dir
- the command is identical in URL, mode, and optional query parameters

Note: Changing the URL, mode selection, query parameters, or output prevents automatic resumption.
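
One way to picture the "identical command" condition is a fingerprint over the inputs that must match. How pywaybackup actually matches an interrupted job is not described here; the function below is a hypothetical illustration of the idea only.

```python
import hashlib
import json


def job_fingerprint(url: str, mode: str, query_params: dict) -> str:
    """Stable short hash over the inputs that must match for a job to resume."""
    payload = json.dumps({"url": url, "mode": mode, "query": query_params},
                         sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


first_run = job_fingerprint("https://example.com", "all",
                            {"start": "20210101", "end": "20211231"})
same_job = job_fingerprint("https://example.com", "all",
                           {"end": "20211231", "start": "20210101"})
other_mode = job_fingerprint("https://example.com", "last",
                             {"start": "20210101", "end": "20211231"})
print(first_run == same_job, first_run == other_mode)
```

Sorting the keys makes the fingerprint independent of argument order, so only a real change to URL, mode, or query parameters produces a different value.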

Examples

Download a specific single snapshot of all available files (starting from root):
waybackup -u https://example.com -a --start 20210101000000 --end 20210101000000
Download a specific single snapshot of all available files (starting from a subdirectory):
waybackup -u https://example.com/subdir1/subdir2/assets/ -a --start 20210101000000 --end 20210101000000
Download a specific single snapshot of the exact given URL (no subdirs):
waybackup -u https://example.com -a --start 20210101000000 --end 20210101000000 --explicit
Download all snapshots of all available files in the given range:
waybackup -u https://example.com -a --start 20210101000000 --end 20231122000000
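
When calling the CLI from a script, the commands above can be assembled as an argument list. A small sketch using only the flags documented in this README; the helper name is hypothetical.

```python
import shlex


def build_waybackup_command(url, mode="all", start=None, end=None, explicit=False):
    """Assemble a waybackup argument list from the documented flags."""
    mode_flags = {"all": "-a", "last": "-l", "first": "-f"}
    command = ["waybackup", "-u", url, mode_flags[mode]]
    if start:
        command += ["--start", start]
    if end:
        command += ["--end", end]
    if explicit:
        command.append("--explicit")
    return command


cmd = build_waybackup_command("https://example.com", start="20210101000000",
                              end="20210101000000", explicit=True)
print(shlex.join(cmd))
```

To execute the command, the list can be passed to subprocess.run(cmd, check=True), assuming waybackup is installed and on the PATH.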

Output

Path Structure

The output path is currently structured as follows, shown for the example query:
http://example.com/subdir1/subdir2/assets/

For the first and last version (-f or -l):

Includes only the files/folders starting from your query path.

your/path/waybackup_snapshots/
└── the_root_of_your_query/ (example.com/)
    └── subdir1/
        └── subdir2/
            └── assets/
                ├── image.jpg
                ├── style.css
                ...

For all versions (-a):

Creates a folder named after the root of your query. Inside it, there is one folder per timestamp, each containing the path you requested.

your/path/waybackup_snapshots/
└── the_root_of_your_query/ (example.com/)
    ├── yyyymmddhhmmss/
    │   ├── subdir1/
    │   │   └── subdir2/
    │   │       └── assets/
    │   │           ├── image.jpg
    │   │           └── style.css
    ├── yyyymmddhhmmss/
    │   ├── subdir1/
    │   │   └── subdir2/
    │   │       └── assets/
    │   │           ├── image.jpg
    │   │           └── style.css
    ...
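
The two layouts above can be expressed as a small path-mapping function. This sketch reproduces the trees as drawn; the function name is hypothetical, and mapping a directory URL to index.html is an assumption based on the CSV example in this README.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def snapshot_path(output_dir: str, url: str, timestamp: str, mode: str) -> str:
    """Map a snapshot URL to its local path for mode 'all', 'first' or 'last'."""
    parts = urlsplit(url)
    relative = parts.path.lstrip("/")
    if not relative or relative.endswith("/"):
        relative += "index.html"  # assumption for directory-like URLs
    root = PurePosixPath(output_dir) / parts.netloc
    if mode == "all":
        # -a adds one folder per timestamp below the query root.
        root = root / timestamp
    return str(root / relative)


url = "http://example.com/subdir1/subdir2/assets/style.css"
print(snapshot_path("waybackup_snapshots", url, "20240225193302", "all"))
print(snapshot_path("waybackup_snapshots", url, "20240225193302", "last"))
```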

CSV

The CSV contains a snapshot per row:

[
   {
      "file": "/your/path/waybackup_snapshots/example.com/yyyymmddhhmmss/index.html",
      "id": 1,
      "redirect_timestamp": "yyyymmddhhmmss",
      "redirect_url": "http://web.archive.org/web/yyyymmddhhmmssid_/http://example.com/",
      "response": 200,
      "timestamp": "yyyymmddhhmmss",
      "url_archive": "http://web.archive.org/web/yyyymmddhhmmssid_/http://example.com/",
      "url_origin": "http://example.com/"
   },
    ...
]
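
The CSV can be post-processed with the standard library. The sketch below filters for successfully downloaded snapshots. The sample rows are made-up illustration data, and the column order and comma delimiter are assumptions; only the column names come from the listing above.

```python
import csv
import io

# Made-up sample rows using the column names listed above.
SAMPLE_CSV = """file,id,redirect_timestamp,redirect_url,response,timestamp,url_archive,url_origin
/tmp/example.com/20240101000000/index.html,1,,,200,20240101000000,http://web.archive.org/web/20240101000000id_/http://example.com/,http://example.com/
,2,,,404,20240102000000,http://web.archive.org/web/20240102000000id_/http://example.com/missing,http://example.com/missing
"""


def successful_snapshots(csv_text: str) -> list:
    """Return the rows whose response column is 200."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row["response"] == "200"]


for row in successful_snapshots(SAMPLE_CSV):
    print(row["timestamp"], row["url_origin"], row["file"])
```

For a real run, replace io.StringIO(csv_text) with an open file handle on the csvfile path returned by paths().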

Log

Verbose:

-----> Worker: 2 - Attempt: [1/1] Snapshot ID: [23/81]
SUCCESS   -> 200 OK
          -> URL:  https://web.archive.org/web/20240225193302id_/https://example.com/assets/css/custom-styles.css
          -> FILE: /home/manjaro/Stuff/python-wayback-machine-downloader/waybackup_snapshots/example.com/20240225193302id_/assets/css/custom-styles.css

Non-verbose:

55/81 - W:2 - SUCCESS - 20240225193302 - https://example.com/assets/css/custom-styles.css
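
The non-verbose line format is regular enough to parse. A minimal sketch, assuming every line follows exactly the layout of the sample above; the field names are chosen here for illustration.

```python
import re

# counter/total - W:worker - RESULT - timestamp - url
LOG_LINE = re.compile(
    r"^(?P<current>\d+)/(?P<total>\d+) - W:(?P<worker>\d+) - (?P<result>\w+) - "
    r"(?P<timestamp>\d{14}) - (?P<url>\S+)$"
)


def parse_log_line(line: str):
    """Return the fields of a non-verbose log line, or None if it does not match."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    for key in ("current", "total", "worker"):
        entry[key] = int(entry[key])
    return entry


sample = ("55/81 - W:2 - SUCCESS - 20240225193302 - "
          "https://example.com/assets/css/custom-styles.css")
print(parse_log_line(sample))
```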

Debugging

Exceptions will be written into waybackup_error.log (each run overwrites the file).

Future ideas (long run)

More module functionality
Docker UI

Contributing

Feature requests that improve the usability of this tool are always welcome. Feel free to make suggestions and report issues. The project is still far from perfect.

Project details

Release history

This version

4.1.6

Mar 15, 2026

4.1.5

Mar 1, 2026

4.1.4

Jan 29, 2026

4.1.3

Jan 15, 2026

4.1.2

Nov 25, 2025

4.1.1

Oct 20, 2025

4.1.0

Oct 2, 2025

4.0.0

Sep 1, 2025

3.4.1

Jul 25, 2025

3.3.1

May 9, 2025

3.3.0

May 8, 2025

3.2.1

May 7, 2025

3.1.0

Feb 19, 2025

3.0.4

Jan 31, 2025

3.0.2

Jan 2, 2025

3.0.1

Dec 23, 2024

2.0.3

Nov 8, 2024

2.0.2

Nov 3, 2024

2.0.1

Oct 31, 2024

2.0.0

Oct 19, 2024

1.5.7

Sep 11, 2024

1.5.6

Sep 9, 2024

1.5.5

Sep 8, 2024

1.5.4

Sep 5, 2024

1.5.3

Sep 2, 2024

1.5.1

Aug 25, 2024

1.5.0

Aug 24, 2024

1.4.2

Aug 4, 2024

1.4.1

Jul 30, 2024

1.4.0

Jul 25, 2024

1.3.2

Jul 8, 2024

1.3.1

Jul 2, 2024

1.3.0

Jun 29, 2024

1.2.6

Jun 29, 2024

1.2.5

Jun 27, 2024

1.2.4

Jun 25, 2024

1.2.3

Jun 15, 2024

1.2.2

Jun 11, 2024

1.2.1

Jun 9, 2024

1.2.0

Jun 8, 2024

1.1.0

Jun 4, 2024

1.0.3

Jun 3, 2024

1.0.2

May 31, 2024

1.0.1

Apr 22, 2024

0.8.1

Apr 12, 2024

0.8.0

Apr 8, 2024

0.7.1

Apr 3, 2024

Download files

Source Distribution

pywaybackup-4.1.6.tar.gz (35.7 kB)

Uploaded Mar 15, 2026 Source

Built Distribution

pywaybackup-4.1.6-py3-none-any.whl (37.0 kB)

Uploaded Mar 15, 2026 Python 3

File details

Details for the file pywaybackup-4.1.6.tar.gz.

File metadata

Download URL: pywaybackup-4.1.6.tar.gz
Upload date: Mar 15, 2026
Size: 35.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for pywaybackup-4.1.6.tar.gz
Algorithm	Hash digest
SHA256	`9f65048517eb8998445de8c3e0e9ec34f9307a094de487eb00935a39dd29def7`
MD5	`b4d4cff3c67556ffddfe06f9b121054f`
BLAKE2b-256	`320a6f0815a2ed23f76ab5240bfd797df6a14f9cc019683eda90273207feb58a`

File details

Details for the file pywaybackup-4.1.6-py3-none-any.whl.

File metadata

Download URL: pywaybackup-4.1.6-py3-none-any.whl
Upload date: Mar 15, 2026
Size: 37.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for pywaybackup-4.1.6-py3-none-any.whl
Algorithm	Hash digest
SHA256	`515a122e7251163bdbb45f99f59c9be61ba28e644a68fb4b2fc44e02da13c23a`
MD5	`83eafc8b6403a7851fe07ec6f3527909`
BLAKE2b-256	`e417e815164cc3979214203470a57054f2bc209f6cace06254409f24eb1158f9`
