A python interface to archive.org.
This package installs a CLI tool named ia for using archive.org from the command line. It also installs the internetarchive Python module for programmatic access to archive.org. Please report all bugs and issues on GitHub.
Installation
You can install this module via pip:
pip install internetarchive
Alternatively, you can install a few extra dependencies to help speed things up a bit:
pip install "internetarchive[speedups]"
This will install ujson for faster JSON parsing, and gevent for concurrent downloads.
If you want to install this module globally on your system instead of inside a virtualenv, use sudo:
sudo pip install internetarchive
Command-Line Usage
Help is available by typing ia --help. You can also get help on a command: ia <command> --help. Available subcommands are configure, metadata, upload, download, search, mine, and catalog.
Downloading
To download the entire TripDown1905 item:
$ ia download TripDown1905
ia download usage examples:
#download just the mp4 files using ``--glob``
$ ia download TripDown1905 --glob='*.mp4'
#download only the 512Kb MPEG4 files using ``--format``:
$ ia download TripDown1905 --format='512Kb MPEG4'
#download multiple formats from an item:
$ ia download TripDown1905 --format='512Kb MPEG4' --format='Ogg Video'
#list all the formats in an item:
$ ia metadata --formats TripDown1905
#download a single file from an item:
$ ia download TripDown1905 TripDown1905_512kb.mp4
#download multiple files from an item:
$ ia download TripDown1905 TripDown1905_512kb.mp4 TripDown1905.ogv
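To preview which files a pattern such as *.mp4 would select, the matching can be sketched with Python's fnmatch module. This assumes that --glob follows shell-style wildcard rules, which the text above does not confirm. The helper name filter_by_glob is hypothetical, and the file names are the ones used in the examples above.

```python
import fnmatch


def filter_by_glob(filenames, pattern):
    """Return the names that match a shell-style wildcard pattern."""
    # fnmatchcase compares case-sensitively on every platform.
    return [name for name in filenames if fnmatch.fnmatchcase(name, pattern)]


files = ["TripDown1905_512kb.mp4", "TripDown1905.ogv"]
selected = filter_by_glob(files, "*.mp4")
print(selected)
```

Quoting the pattern on the command line, as in the --glob example, keeps the shell from expanding it before ia sees it.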
Uploading
You can use the provided ia command-line tool to upload items. You need to supply your IAS3 credentials in environment variables in order to upload. You can retrieve S3 keys from https://archive.org/account/s3.php
$ export AWS_ACCESS_KEY_ID='xxx'
$ export AWS_SECRET_ACCESS_KEY='yyy'
#upload files:
$ ia upload <identifier> file1 file2 --metadata="title:foo" --metadata="blah:arg"
#upload from `stdin`:
$ curl http://dumps.wikimedia.org/kywiki/20130927/kywiki-20130927-pages-logging.xml.gz |
ia upload <identifier> - --remote-name=kywiki-20130927-pages-logging.xml.gz --metadata="title:Uploaded from stdin."
Metadata
You can use the ia command-line tool to download item metadata in JSON format:
$ ia metadata TripDown1905
You can also modify metadata. Be sure that the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables are set.
$ ia metadata <identifier> --modify="foo:bar" --modify="baz:foooo"
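When many fields need to change, the --modify arguments can be generated from a dictionary instead of typed by hand. The following is a minimal sketch: build_modify_command is a hypothetical helper, not part of the package, and it only assembles the argument list using the key:value form shown above.

```python
def build_modify_command(identifier, changes):
    """Build an argv list for `ia metadata <identifier> --modify=key:value ...`."""
    argv = ["ia", "metadata", identifier]
    # Sort the keys so the generated command is reproducible.
    for key in sorted(changes):
        argv.append("--modify={}:{}".format(key, changes[key]))
    return argv


command = build_modify_command("my_identifier", {"foo": "bar", "baz": "foooo"})
print(command)
```

The resulting list can be passed to subprocess.call(), which avoids shell quoting problems for values that contain spaces.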
Data Mining
If you have the Python library gevent installed, you can use the ia mine command. gevent is automatically installed if you installed ia via pip install "internetarchive[speedups]". You can also install gevent like so:
$ pip install cython git+git://github.com/surfly/gevent.git@1.0rc2#egg=gevent
ia mine can be used to concurrently retrieve metadata for items via the IA Metadata API.
# Create an itemlist to be used as input for your ``ia mine`` command.
$ ia search 'collection:IRS990' > itemlist.txt
# Print metadata to stdout (each item's metadata is separated by a "\n" character).
$ ia mine itemlist.txt
# Download all metadata for each item contained in itemlist.txt.
$ ia mine itemlist.txt --cache
# Download all metadata for each item into a single file (each item's metadata is separated by a "\n" character).
$ ia mine itemlist.txt --output irs990_metadata.json
ia mine can be a very powerful command when used with jq, a command-line JSON processor. For instance, items in the IRS990 collection have extra metadata that does not get indexed by the Archive.org search engine. Using ia mine and jq, you can quickly parse through this metadata using adhoc jq queries to find what you are looking for.
For instance, let’s find all of the 990 forms whose foundation has the keyword “CANCER” in its name:
$ ia mine itemlist.txt | jq 'if .manifest then (.manifest[] | select(contains({foundation: "CANCER"}))) else empty end'
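If jq is not available, roughly the same filter can be written with the Python standard library. The sketch below assumes the newline-separated JSON output described above, and that manifest, when present, is a list of objects with a string foundation field. It approximates the jq query rather than reproducing its exact semantics. The function name find_foundations and the sample records are made up for illustration.

```python
import io
import json


def find_foundations(lines, keyword):
    """Yield manifest entries whose 'foundation' string contains keyword."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        item = json.loads(line)
        # Items without a manifest are skipped, like the `else empty` branch.
        for entry in item.get("manifest") or []:
            foundation = entry.get("foundation")
            if isinstance(foundation, str) and keyword in foundation:
                yield entry


# Made-up sample input: one JSON document per line.
sample = io.StringIO(
    '{"manifest": [{"foundation": "EXAMPLE CANCER FUND"}, {"foundation": "OTHER TRUST"}]}\n'
    '{"metadata": {"identifier": "no_manifest_here"}}\n'
)
matches = list(find_foundations(sample, "CANCER"))
print(matches)
```

To run it against real data, pass an open file handle for a file written with ia mine itemlist.txt --output in place of the sample.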
Searching
You can search using the provided ia command-line script:
$ ia search 'subject:"market street" collection:prelinger'
Parallel Downloading
If you have the GNU parallel tool installed, then you can combine ia search and ia metadata to quickly retrieve data for many items in parallel:
$ ia search 'subject:"market street" collection:prelinger' | parallel -j40 'ia metadata {} > {}_meta.json'
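Where GNU parallel is not installed, a similar fan-out can be sketched with a thread pool from the Python standard library. This is an illustration, not part of the package: fetch_metadata and fetch_all are hypothetical names, and fetch_metadata simply shells out to the ia metadata command shown above. The demonstration at the end uses a stand-in fetcher so that it runs without network access.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def fetch_metadata(identifier):
    """Run `ia metadata <identifier>` and return its stdout as bytes."""
    return subprocess.check_output(["ia", "metadata", identifier])


def fetch_all(identifiers, fetch=fetch_metadata, max_workers=40):
    """Call fetch for every identifier concurrently; return {identifier: result}."""
    identifiers = list(identifiers)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with identifiers.
        outputs = list(pool.map(fetch, identifiers))
    return dict(zip(identifiers, outputs))


# Stand-in fetcher for demonstration; the identifiers are made up.
demo = fetch_all(["item_a", "item_b"], fetch=lambda ident: ident.upper(), max_workers=2)
print(demo)
```

The default of 40 workers mirrors the -j40 setting in the parallel example; a smaller number is gentler on the remote service.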
Python module usage
Below is a brief overview of the internetarchive Python library. Please refer to the API documentation for more specific details.
Downloading
The Internet Archive stores data in items. You can query the archive using an item identifier:
>>> import internetarchive
>>> item = internetarchive.Item('stairs')
>>> print item.metadata
Items contain files. You can download the entire item:
>>> item.download()
or you can download just a particular file:
>>> f = item.file('glogo.png')
>>> f.download() #writes to disk
>>> f.download('/foo/bar/some_other_name.png')
You can iterate over files:
>>> for f in item.files():
... print f.name, f.sha1
Uploading
You can use the Internet Archive’s S3-like interface to upload files to an item. You need to supply your IAS3 credentials in environment variables in order to upload. You can retrieve S3 keys from https://archive.org/account/s3.php
>>> import os
>>> os.environ['AWS_ACCESS_KEY_ID']='x'
>>> os.environ['AWS_SECRET_ACCESS_KEY']='y'
>>> item = internetarchive.Item('new_identifier')
>>> item.upload('/path/to/image.jpg', metadata=dict(mediatype='image', creator='Jake Johnson'))
Item-level metadata must be supplied with the first file uploaded to an item.
You can upload additional files to an existing item:
>>> item = internetarchive.Item('existing_identifier')
>>> item.upload(['/path/to/image2.jpg', '/path/to/image3.jpg'])
You can also upload file-like objects:
>>> import StringIO
>>> fh = StringIO.StringIO('hello world')
>>> fh.name = 'hello_world.txt'
>>> item.upload(fh)
Modifying Metadata
You can modify metadata for existing items, using the item.modify_metadata() function. This uses the IA Metadata API under the hood and requires your IAS3 credentials.
>>> import os
>>> os.environ['AWS_ACCESS_KEY_ID']='x'
>>> os.environ['AWS_SECRET_ACCESS_KEY']='y'
>>> item = internetarchive.Item('my_identifier')
>>> md = dict(blah='one', foo=['two', 'three'])
>>> item.modify_metadata(md)
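Because both uploading and modifying metadata depend on the two credential variables being set, a script can check for them before making any request. The following is a minimal sketch using only the standard library. missing_credentials is a hypothetical helper, not part of the internetarchive module.

```python
import os

REQUIRED_KEYS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_credentials(env=None):
    """Return the names of required credential variables that are unset or empty."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key)]


# Example with an explicit mapping instead of the real environment.
print(missing_credentials({"AWS_ACCESS_KEY_ID": "x"}))
```

Calling missing_credentials() with no argument inspects the real environment; an empty list means both variables are present.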
Searching
You can search for items using the archive.org advanced search engine:
>>> import internetarchive
>>> search = internetarchive.Search('collection:nasa')
>>> print search.num_found
186911
You can iterate over your results:
>>> for result in search.results:
... print result['identifier']
Source Distribution
File details
Details for the file internetarchive-0.4.0.tar.gz.
File metadata
- Download URL: internetarchive-0.4.0.tar.gz
- Upload date:
- Size: 35.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest
---|---
SHA256 | 9ea38ae6b9ff5b2f34e916b8e108cf9bca1faf05525f77ef9e798242a2ef8e31
MD5 | 2d69951b3587b61df3b95e981c26fcd2
BLAKE2b-256 | 9c418983a7f670b654b69bb1fa8721f6a1c87c7db3b19e3e0258b91bdeaca5b0