Concurrently retrieve metadata from Archive.org items.

Internet Archive data mining tools.

What Is IA Mine?

IA Mine is a command line tool and Python 3 library for data mining the Internet Archive.

How Do I Get Started?

Command Line Interface

The IA Mine command line tool should work on any Unix-like operating system that has Python 3 installed. To start using ia-mine, download one of the latest binaries from https://archive.org/details/iamine-pex.

# Download ia-mine and make it executable.
$ curl -L https://archive.org/download/iamine-pex/ia-mine-0.3.2.pex > ia-mine
$ chmod +x ia-mine
$ ./ia-mine -v
0.3.2

Usage:

$ ia-mine --help
Concurrently retrieve metadata from Archive.org items.

usage: ia-mine (<itemlist> | -) [--debug] [--workers WORKERS] [--cache]
               [--retries RETRIES] [--secure] [--hosts HOSTS]
       ia-mine [--all | --search QUERY] [[--info | --info --field FIELD...]
               |--num-found | --mine-ids | --field FIELD... | --itemlist]
               [--debug] [--rows ROWS] [--workers WORKERS] [--cache]
               [--retries RETRIES] [--secure] [--hosts HOSTS]
       ia-mine [-h | --version | --configure]

positional arguments:
  itemlist              A file containing Archive.org identifiers, one per
                        line, for which to retrieve metadata. If no
                        itemlist is provided, identifiers will be read from
                        stdin.

optional arguments:
  -h, --help            Show this help message and exit.
  -v, --version         Show program's version number and exit.
  --configure           Configure ia-mine to use your Archive.org credentials.
  -d, --debug           Turn on verbose logging [default: False]
  -a, --all             Mine all indexed items.
  -s, --search QUERY    Mine search results. For help formatting your query,
                        see: https://archive.org/advancedsearch.php
  -m, --mine-ids        Mine items returned from search results.
                        [default: False]
  -i, --info            Print search result response header to stdout and exit.
  -f, --field FIELD     Fields to include in search results.
  -i, --itemlist        Print identifiers only to stdout. [default: False]
  -n, --num-found       Print the number of items found for the given search
                        query.
  --rows ROWS           The number of rows to return for each request made to
                        the Archive.org Advancedsearch API. On slower networks,
                        it may be useful to use a lower value, and on faster
                        networks, a higher value. [default: 50]
  -w, --workers WORKERS
                        The maximum number of tasks to run at once.
                        [default: 100]
  -c, --cache           Cache item metadata on Archive.org. Items are not
                        cached by default.
  -r, --retries RETRIES
                        The maximum number of retries for each item.
                        [default: 10]
  --secure              Use HTTPS. HTTP is used by default.
  -H, --hosts HOSTS     A file containing a list of hosts to shuffle through.

Python Library

The IA Mine Python library can be installed with pip:

# Create a Python 3 virtualenv, and install iamine.
$ virtualenv --python=python3 venv
$ . venv/bin/activate
$ pip install iamine

This will also install the ia-mine command line tool into your virtualenv:

$ which ia-mine
/home/user/venv/bin/ia-mine

Data Mining with IA Mine and jq

ia-mine retrieves metadata and search results concurrently from Archive.org, writing the returned JSON to stdout and any error messages to stderr. The JSON on stdout can then be mined with a tool such as jq. jq binaries can be downloaded at http://stedolan.github.io/jq/download/.
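
For readers who would rather stay in Python than use jq, the same kind of filtering can be sketched with the standard library. The sketch below assumes that ia-mine writes one JSON document per line and that each document follows the Archive.org Metadata API layout, with a top-level "metadata" object holding "identifier" and "title". The function name iter_titles is hypothetical and not part of iamine.

```python
import json


def iter_titles(lines):
    """Yield (identifier, title) pairs from newline-delimited JSON records.

    Assumption: each line is one JSON document shaped like an Archive.org
    Metadata API response, i.e. {"metadata": {"identifier": ..., ...}, ...}.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not valid JSON
        if not isinstance(record, dict):
            continue  # skip JSON values that are not objects
        metadata = record.get("metadata")
        if not isinstance(metadata, dict):
            continue  # skip records without a metadata object
        identifier = metadata.get("identifier")
        if identifier is not None:
            # title is None when the item has no title field
            yield identifier, metadata.get("title")
```

Passing sys.stdin as the lines argument would let the function read ia-mine output from a pipe, under the same one-document-per-line assumption.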

ia-mine can mine Archive.org search results, the items returned from search results, or items provided via an itemlist or stdin.
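
These three modes correspond to the usage synopsis printed by ia-mine --help. As a rough sketch, the helper below assembles the argument list for each mode using only flags that appear in that help text. build_command is a hypothetical name, not part of iamine, and the sketch assumes an ia-mine executable is available on PATH. It only builds the command and leaves running it to the caller.

```python
def build_command(mode, target, executable="ia-mine"):
    """Return an argv list for one of the three mining modes.

    mode "search":   mine the search results for the query in `target`
    mode "mine-ids": mine the items returned by the query in `target`
    mode "itemlist": mine the identifiers listed in the file `target`
                     (pass "-" to read identifiers from stdin)
    """
    if mode == "search":
        return [executable, "--search", target]
    if mode == "mine-ids":
        return [executable, "--search", target, "--mine-ids"]
    if mode == "itemlist":
        return [executable, target]
    raise ValueError("unknown mode: {!r}".format(mode))
```

The resulting list can be handed to subprocess.run, for example subprocess.run(build_command("itemlist", "itemlist.txt"), check=True), where itemlist.txt is a placeholder filename.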

Developers

Please report any bugs or issues on GitHub: https://github.com/jjjake/iamine

Release History

0.3.3 (2015-08-04)

Bugfixes

  • Added HISTORY.rst to MANIFEST.in to fix pip install iamine.

0.3.2 (2015-08-03)

Bugfixes

  • asyncio.JoinableQueue was deprecated in Python 3.4.4. iamine.core.Miner now uses asyncio.Queue on Python 3.4.4 and newer, and asyncio.JoinableQueue on older versions. asyncio.Queue cannot be used on all versions because asyncio.Queue.join() was only added in 3.4.4.

  • SearchMiner.get_search_info() is no longer a coroutine (it now uses urllib). Fixed a bug in iamine.api.search where it was still being called as a coroutine.
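
The queue selection described in the 0.3.2 notes can be illustrated roughly as follows. This is a sketch of the version check only, not iamine's actual code. pick_queue_class and drain are hypothetical names, and the drain demo uses async/await and asyncio.run, so it needs a modern Python 3 even though the check itself targets older interpreters.

```python
import asyncio
import sys


def pick_queue_class():
    """Return a queue class that supports join() and task_done()."""
    if sys.version_info >= (3, 4, 4):
        return asyncio.Queue
    # Before 3.4.4 only JoinableQueue offered join() and task_done().
    return asyncio.JoinableQueue


async def drain(items):
    """Queue the items, consume them in a worker, and wait on join()."""
    queue = pick_queue_class()()
    seen = []

    async def worker():
        while True:
            item = await queue.get()
            seen.append(item)
            queue.task_done()  # lets join() know this item is finished

    task = asyncio.ensure_future(worker())
    for item in items:
        await queue.put(item)
    await queue.join()  # returns once every queued item is marked done
    task.cancel()  # the worker loops forever, so stop it explicitly
    return seen


if __name__ == "__main__":
    print(asyncio.run(drain(["item-a", "item-b"])))
```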
