Concurrently retrieve metadata from Archive.org items.
Internet Archive data mining tools.
What Is IA Mine?
IA Mine is a command line tool and Python 3 library for data mining the Internet Archive.
How Do I Get Started?
Command Line Interface
The IA Mine command line tool should work on any Unix-like operating system that has Python 3 installed on it. To start using ia-mine, simply download one of the latest binaries from https://archive.org/details/iamine-pex.
# Download ia-mine and make it executable.
$ curl -LO https://archive.org/download/iamine-pex/ia-mine
$ chmod +x ia-mine
$ ./ia-mine --help
...
Usage:
$ ia-mine --help
Concurrently retrieve metadata from Archive.org items.
usage: ia-mine (<itemlist> | -) [--debug] [--workers WORKERS] [--cache]
[--retries RETRIES] [--secure] [--hosts HOSTS]
ia-mine [--all | --search QUERY] [[--info | --info --field FIELD...]
|--num-found | --mine-ids | --field FIELD... | --itemlist]
[--debug] [--rows ROWS] [--workers WORKERS] [--cache]
[--retries RETRIES] [--secure] [--hosts HOSTS]
ia-mine [-h | --version | --configure]
positional arguments:
itemlist A file containing Archive.org identifiers, one per
line, for which to retrieve metadata. If no
itemlist is provided, identifiers will be read from
stdin.
optional arguments:
-h, --help Show this help message and exit.
-v, --version Show program's version number and exit.
--configure Configure ia-mine to use your Archive.org credentials.
-d, --debug Turn on verbose logging [default: False]
-a, --all Mine all indexed items.
-s, --search QUERY Mine search results. For help formatting your query,
see: https://archive.org/advancedsearch.php
-m, --mine-ids Mine items returned from search results.
[default: False]
-i, --info Print search result response header to stdout and exit.
-f, --field FIELD Fields to include in search results.
-i, --itemlist Print identifiers only to stdout. [default: False]
-n, --num-found Print the number of items found for the given search
query.
--rows ROWS The number of rows to return for each request made to
the Archive.org Advancedsearch API. On slower networks,
it may be useful to use a lower value, and on faster
networks, a higher value. [default: 50]
-w, --workers WORKERS
The maximum number of tasks to run at once.
[default: 100]
-c, --cache Cache item metadata on Archive.org. Items are not
cached by default.
-r, --retries RETRIES
The maximum number of retries for each item.
[default: 10]
--secure Use HTTPS. HTTP is used by default.
-H, --hosts HOSTS A file containing a list of hosts to shuffle through.
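For scripted runs, the options listed above can be assembled programmatically. The following is a minimal sketch, not part of iamine itself: the helper name build_command and its defaults are hypothetical, and it only uses flags that appear in the usage text (--workers, --retries, --secure, --cache).

```python
def build_command(itemlist="-", workers=100, retries=10, secure=False, cache=False):
    """Return an argv list for ia-mine (hypothetical helper, not part of iamine).

    An itemlist of "-" tells ia-mine to read identifiers from stdin,
    as described in the usage text.
    """
    cmd = ["ia-mine", itemlist, "--workers", str(workers), "--retries", str(retries)]
    if secure:
        cmd.append("--secure")  # use HTTPS instead of the default HTTP
    if cache:
        cmd.append("--cache")   # ask Archive.org to cache item metadata
    return cmd


print(" ".join(build_command("itemlist.txt", workers=20, secure=True)))
# prints: ia-mine itemlist.txt --workers 20 --retries 10 --secure
```

The returned list can be passed to subprocess.run() once ia-mine is on your PATH.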
Python Library
The IA Mine Python library can be installed with pip:
# Create a Python 3 virtualenv, and install iamine.
$ virtualenv --python=python3 venv
$ . venv/bin/activate
$ pip install iamine
This will also install the ia-mine command line tool into your virtualenv:
$ which ia-mine
/home/user/venv/bin/ia-mine
Data Mining with IA Mine and jq
ia-mine simply retrieves metadata and search results concurrently from Archive.org, dumping the JSON returned to stdout and any error messages to stderr. The JSON dumped to stdout can then be mined with a tool such as jq. jq binaries can be downloaded at http://stedolan.github.io/jq/download/.
ia-mine can mine Archive.org search results, the items returned from search results, or items provided via an itemlist or stdin.
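If jq is not available, the same line-by-line processing can be sketched in Python. This example assumes that each line of output is one JSON object with a top-level "metadata" object containing an "identifier" field; the function name iter_identifiers and the sample records are made up for illustration.

```python
import json


def iter_identifiers(lines):
    """Yield the metadata identifier from each JSON line (assumed record layout)."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        identifier = record.get("metadata", {}).get("identifier")
        if identifier is not None:
            yield identifier


# Made-up sample records standing in for ia-mine output.
sample = [
    '{"metadata": {"identifier": "example-item-1", "title": "First"}}',
    "",
    '{"metadata": {"identifier": "example-item-2"}}',
]
print(list(iter_identifiers(sample)))
# prints: ['example-item-1', 'example-item-2']
```

In a pipeline, pass sys.stdin instead of the sample list, for example by piping the output of ia-mine into the script.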
Developers
Please report any bugs or issues on GitHub: https://github.com/jjjake/iamine
Release History
0.3.5 (2016-05-24)
Bugfixes
All output from ia-mine should be JSONL. Some responses from the Metadata API contain unescaped newlines, which causes many issues when using jq or when parsing JSON line by line. To address this, JSON responses returned from the server are now parsed and dumped back to JSON before being printed to stdout.
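The parse-and-dump step described in this entry can be illustrated as follows. This is a sketch of the general technique only, not the code used in iamine; the function name to_jsonl_line is hypothetical, and strict=False is an assumption made here so that raw newlines inside strings are accepted by the decoder.

```python
import json


def to_jsonl_line(raw):
    """Re-serialize a JSON document so that it occupies exactly one line."""
    # strict=False lets the decoder accept raw control characters
    # (such as newlines) inside JSON strings.
    return json.dumps(json.loads(raw, strict=False))


# A made-up response with a newline between tokens and one inside a string.
raw = '{"metadata": {\n  "description": "line one\nline two"}}'
line = to_jsonl_line(raw)
print(line)
print("\n" in line)
```

After the round trip, the newline inside the string is escaped and the whitespace between tokens is gone, so the record can be parsed line by line.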
0.3.5 (2016-05-24)
Features and Improvements
Fixed "Exception ignored in:" errors.
Added support for custom config files.
0.3.3 (2015-08-04)
Bugfixes
Added HISTORY.rst to MANIFEST.in to fix pip install iamine.
0.3.2 (2015-08-03)
Bugfixes
asyncio.JoinableQueue was deprecated in Python 3.4.4. iamine.core.Miner now uses asyncio.Queue for Python 3.4.4 and newer, and asyncio.JoinableQueue for older versions (asyncio.Queue cannot be used for all versions because asyncio.Queue.join() was only added in version 3.4.4).
SearchMiner.get_search_info() is no longer a coroutine (it now uses urllib). Fixed a bug in iamine.api.search where it was still being called as a coroutine.
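The version-dependent queue selection described in the 0.3.2 notes can be sketched roughly as below. This is an illustration under assumptions, not the code in iamine.core: the names pick_queue_class and demo are hypothetical, and the demo uses asyncio.run(), which requires a much newer Python than the versions discussed in that entry.

```python
import asyncio
import sys


def pick_queue_class(version_info=sys.version_info):
    """Return a queue class that provides join() and task_done()."""
    if tuple(version_info[:3]) >= (3, 4, 4):
        return asyncio.Queue
    # Only reached on interpreters older than 3.4.4, where this name exists.
    return asyncio.JoinableQueue


async def demo():
    queue = pick_queue_class()()
    done = []

    async def worker():
        while True:
            item = await queue.get()
            done.append(item)
            queue.task_done()  # lets queue.join() know the item is finished

    task = asyncio.ensure_future(worker())
    for item in ("a", "b", "c"):
        queue.put_nowait(item)
    await queue.join()  # blocks until every queued item has been processed
    task.cancel()
    return done


print(asyncio.run(demo()))
# prints: ['a', 'b', 'c']
```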