# Archiveorg

An [Internet Archive](https://archive.org) search tool.

This project is not affiliated with the Internet Archive.

The Internet Archive is a non-profit project that provides a public service.
If you are using this package (or the Internet Archive generally) on an ongoing basis, please consider [donating](https://archive.org/donate/) to them.

# Installation

Archiveorg requires Python 3.6+.

To install:

```sh
pip install archiveorg
```

# Usage

Archiveorg contains a single object, `Search`.

```python
from archiveorg import Search
```

Pass in all search parameters when creating the search object.
See the Internet Archive's [search API](https://archive.org/help/aboutsearch.htm) for details on which parameters exist:

```python
search = Search(mediatype='image', collection='maps_usgs')
```

When the search is created, it does an initial check of how many results exist:

```python
>>> search.num_items
10000
>>> search.num_pages
10
```

**NOTE**: By default, the number of items per page is 1,000; this can be changed using the `rows` parameter.
As per the Internet Archive's API, a maximum of 10,000 items will be returned.
However, you can specify up to 100,000,000 rows, at which point results will not be paginated or sorted in any way.
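
The relationship between `num_items`, `rows`, and `num_pages` shown above can be sketched with a little arithmetic. This is only an illustration: `expected_pages` is a hypothetical helper, not part of Archiveorg, and it assumes the page count is simply the item count divided by `rows`, rounded up.

```python
import math


def expected_pages(num_items, rows=1000):
    """Return how many pages are needed for num_items at rows items per page.

    Hypothetical helper: assumes pages = ceil(num_items / rows).
    """
    if rows <= 0:
        raise ValueError("rows must be positive")
    return math.ceil(num_items / rows)


print(expected_pages(10000))       # 10
print(expected_pages(10000, 500))  # 20
```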

Once your search has been created, you can iterate over results:

```python
for result in search:
    print(result['identifier'])
```

The result will be a dictionary representing the `JSON` search output.
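
Because a search can return thousands of results, it may be handy to look at only the first few. The sketch below uses `itertools.islice`, which works on any iterable; `first_identifiers` is a hypothetical helper, and the sample dictionaries stand in for real results.

```python
from itertools import islice


def first_identifiers(results, limit):
    """Return the identifiers of at most `limit` results."""
    return [result['identifier'] for result in islice(results, limit)]


# Stand-in data for illustration; a real search object would go here instead.
sample = [{'identifier': 'item-a'}, {'identifier': 'item-b'}, {'identifier': 'item-c'}]
print(first_identifiers(sample, 2))  # ['item-a', 'item-b']
```

Assuming the search object iterates as described above, `first_identifiers(search, 5)` should give the first five identifiers.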

You can also use the explicit `iterate` method if you want to start from an offset, or if you want to limit results to items which have files of a certain format:

```python
for result in search.iterate(offset=100, format_regex=r'^TIFF$'):
    ...  # only results with TIFF files, starting from the 101st result
```

You can download files using the `get_files` method:

```python
from pathlib import Path

for result in search:
    # Use a directory named after the result's identifier.
    directory = Path(result['identifier'])

    file_list = search.get_files(result, directory)
```

Each result may have multiple files.
You can use the `format_regex` parameter to filter files based on file format.
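
To see why a pattern such as `^TIFF$` is anchored, the sketch below applies a regex to a list of format names with Python's `re` module. The format names are made up for illustration, `matching_formats` is a hypothetical helper, and the use of `re.search` is an assumption; Archiveorg's own matching may differ.

```python
import re

# Hypothetical format names, for illustration only.
formats = ["TIFF", "JPEG", "Processed TIFF ZIP", "Metadata"]


def matching_formats(format_regex, names):
    """Return the names that the regular expression matches anywhere."""
    pattern = re.compile(format_regex)
    return [name for name in names if pattern.search(name)]


print(matching_formats(r"^TIFF$", formats))  # ['TIFF']
print(matching_formats(r"TIFF", formats))    # ['TIFF', 'Processed TIFF ZIP']
```

The anchored pattern matches only the exact name, while the unanchored one also matches any name that merely contains it.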

## Random access

You can use the `random_item` method to return a single random result.
You can use the `format_regex` parameter to ensure the result contains the file type you want.

The method will return `None` if no result can be found.
You can use the `max_attempts` parameter to adjust how many attempts are made to find a matching result.
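
A minimal sketch of handling the `None` case follows. `random_identifier` is a hypothetical wrapper, and it assumes `random_item` accepts `format_regex` and `max_attempts` as keyword arguments; check the package source if in doubt.

```python
def random_identifier(search, format_regex, max_attempts):
    """Return the identifier of a random matching result, or None if none was found."""
    # Assumption: both parameters can be passed to random_item by keyword.
    result = search.random_item(format_regex=format_regex, max_attempts=max_attempts)
    if result is None:
        # No matching result was found within max_attempts tries.
        return None
    return result['identifier']
```

With a search object created as in the Usage section, a call might look like `random_identifier(search, r'^TIFF$', 20)`.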

# License

Archiveorg is available under an AGPL License.
