Checks the provenance of a URL in the Wayback Machine

Project description

waybackprov

Give waybackprov a URL and it will summarize which Internet Archive collections have archived that URL. This information can sometimes provide insight into why a particular web resource, or set of web resources, was archived from the web.

Run

If you have uv installed, you can run waybackprov without installing anything:

uvx waybackprov

Otherwise you'll probably want to install it with pip:

pip install waybackprov

Basic Usage

To check a particular URL, pass it as the only argument:

waybackprov https://twitter.com/EPAScottPruitt

crawls collections
   364 https://archive.org/details/focused_crawls
   306 https://archive.org/details/edgi_monitor
   151 https://archive.org/details/www3.epa.gov
    60 https://archive.org/details/epa.gov4
    47 https://archive.org/details/epa.gov5

The first column contains the number of crawls for a particular URL, and the second column contains the URL for the Internet Archive collection that added it.

When evaluating the counts, remember that collections can be contained in other collections, so a single crawl can be counted under more than one of them. For example, epa.gov4 in the output above is part of the edgi_monitor collection.
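
If you want to work with the summary programmatically, the two-column text output shown above can be parsed with a few lines of Python. This is only a sketch: the `parse_summary` helper and the `SAMPLE` string are illustrative names, not part of waybackprov, and the sample is copied from the example output above.

```python
import re

# Sample text copied from the example output above.
SAMPLE = """\
crawls collections
   364 https://archive.org/details/focused_crawls
   306 https://archive.org/details/edgi_monitor
   151 https://archive.org/details/www3.epa.gov
    60 https://archive.org/details/epa.gov4
    47 https://archive.org/details/epa.gov5
"""

# A data row is a count followed by a collection URL.
ROW = re.compile(r"^\s*(\d+)\s+https://archive\.org/details/(\S+)\s*$")


def parse_summary(text):
    """Return {collection_id: crawl_count} from the text summary."""
    counts = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:  # the header row and blank lines do not match
            counts[m.group(2)] = int(m.group(1))
    return counts


counts = parse_summary(SAMPLE)
```

Because collections can nest, adding the counts together may count the same crawl more than once.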

Time

By default, waybackprov only looks at the current year. To examine a range of years, use the --start and --end options:

waybackprov --start 2016 --end 2018 https://twitter.com/EPAScottPruitt

Multiple Pages

To look at all URLs under a particular URL prefix, use the --prefix option:

waybackprov --prefix https://twitter.com/EPAScottPruitt

This uses the Internet Archive's CDX API to also include URLs that extend the URL you supply, for example:

https://twitter.com/EPAScottPruitt/status/1309839080398339

But it can also include things you may not want, such as:

https://twitter.com/EPAScottPruitt/status/1309839080398339/media/1
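
For readers unfamiliar with the CDX API, a prefix lookup can be sketched as follows. The parameters used here (`url`, `matchType`, `from`, `to`, `output`) belong to the public Wayback CDX server API; whether waybackprov sends exactly this query is an assumption, and `cdx_prefix_query` is a hypothetical helper that only builds the URL without making a request.

```python
from urllib.parse import urlencode

# Public Wayback Machine CDX endpoint.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_prefix_query(url, start_year, end_year):
    """Build a CDX query URL for every capture whose URL starts with `url`."""
    params = {
        "url": url,
        "matchType": "prefix",    # include URLs that extend the given URL
        "from": str(start_year),  # CDX accepts partial timestamps such as a year
        "to": str(end_year),
        "output": "json",
    }
    return CDX_ENDPOINT + "?" + urlencode(params)


query = cdx_prefix_query("https://twitter.com/EPAScottPruitt", 2016, 2018)
```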

To further limit the URLs, use the --match parameter to specify a regular expression so that only matching URLs are checked. Narrowing down the URLs you are interested in is highly recommended, since it prevents lots of lookups for the CSS, JavaScript and image files that are components of the resource that was initially crawled.

waybackprov --prefix --match 'status/\d+$' https://twitter.com/EPAScottPruitt
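
To see what the pattern in the command above selects, you can try it against the example URLs from this section. This sketch assumes the expression is applied with search semantics (a match anywhere in the URL, anchored here by the trailing `$`); how waybackprov applies it internally is not stated in this document.

```python
import re

# The same pattern passed to --match above.
pattern = re.compile(r"status/\d+$")

urls = [
    "https://twitter.com/EPAScottPruitt",
    "https://twitter.com/EPAScottPruitt/status/1309839080398339",
    "https://twitter.com/EPAScottPruitt/status/1309839080398339/media/1",
]

# Keep only URLs that end in "status/<digits>".
matched = [u for u in urls if pattern.search(u)]
```

Only the second URL is kept: the profile URL contains no status segment, and the media URL does not end in digits directly after `status/`.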

Collections

One thing to remember when interpreting this data is that collections can contain other collections. For example, the edgi_monitor collection is a sub-collection of focused_crawls.

If you use the --collapse option, only the most specific collection will be reported for a given crawl. So if coll1 is part of coll2, which is part of coll3, only coll1 will be reported instead of coll1, coll2 and coll3. This requires collection metadata lookups against the Internet Archive API, so it slows performance significantly.
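
The collapsing rule described above can be illustrated with a small sketch. The `parents` mapping is hypothetical stand-in data for what the collection metadata lookups would return, and `collapse` is an illustrative function, not waybackprov's implementation.

```python
def collapse(collections, parents):
    """Drop every collection that is an ancestor of another one in the list."""

    def ancestors(coll):
        # Walk up the parent links, collecting every enclosing collection.
        seen = set()
        stack = list(parents.get(coll, ()))
        while stack:
            parent = stack.pop()
            if parent not in seen:
                seen.add(parent)
                stack.extend(parents.get(parent, ()))
        return seen

    covered = set()
    for coll in collections:
        covered |= ancestors(coll)
    # What remains are the most specific collections.
    return [c for c in collections if c not in covered]


# Hypothetical hierarchy: coll1 is part of coll2, which is part of coll3.
parents = {"coll1": ["coll2"], "coll2": ["coll3"]}
result = collapse(["coll1", "coll2", "coll3"], parents)
```

With this hierarchy, `result` contains only `coll1`, matching the behaviour described for --collapse.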

JSON and CSV

If you would rather see the raw data as JSON or CSV, use the --format option. With either of these formats you will see the metadata for each crawl, rather than a summary.

Log

If you would like to see detailed information about what waybackprov is doing, use the --log option to supply a file path to log to:

waybackprov --log waybackprov.log https://example.com/

Test

To run the tests, first install pytest and then run:

uv run pytest test.py

Project details

Release history

0.1.1 (this version): Feb 5, 2026
0.1.0: Feb 4, 2026
0.0.9: May 19, 2022
0.0.8: Jan 23, 2021
0.0.7: Jul 30, 2018
0.0.6: Jul 24, 2018
0.0.5: Jul 24, 2018
0.0.4: Jul 23, 2018
0.0.3: Jul 21, 2018
0.0.2: Jul 12, 2018
0.0.1: Jul 12, 2018

Download files

Source Distribution

waybackprov-0.1.1.tar.gz (4.9 kB view details)

Uploaded Feb 5, 2026 Source

Built Distribution

waybackprov-0.1.1-py3-none-any.whl (5.8 kB view details)

Uploaded Feb 5, 2026 Python 3

File details

Details for the file waybackprov-0.1.1.tar.gz.

File metadata

Download URL: waybackprov-0.1.1.tar.gz
Upload date: Feb 5, 2026
Size: 4.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.9.8

File hashes

Hashes for waybackprov-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`67b004d527c5b4dadc2f852db511508089103c0e8f78fae9016b39705a75c7dc`
MD5	`85d87aecd1541b3d4b2c27703fe57924`
BLAKE2b-256	`4f2f4b107a71754067799a954e970f61abd0f0c435b83663ad59a9ffb4646589`

File details

Details for the file waybackprov-0.1.1-py3-none-any.whl.

File metadata

Download URL: waybackprov-0.1.1-py3-none-any.whl
Upload date: Feb 5, 2026
Size: 5.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.9.8

File hashes

Hashes for waybackprov-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`9d3a109a3c21f502e78acdce804aa5bdaa9d3596c6796805dc250566ee33ef78`
MD5	`c40e06a250cc9fbc8ab2757196ec9ca1`
BLAKE2b-256	`8ba18dbcf64da26aa9801af3329a9b468e3aae083c94e003b5ad95e2bb00e2c2`
