A python interface to archive.org.
Project description
A python interface to archive.org
This package installs a CLI tool named ia for using archive.org from the command-line. It also installs the internetarchive python module for programatic access to archive.org. Please report all bugs and issues on Github.
Installation
You can install this module via pip:
pip install internetarchive
Alternatively, you can install a few extra dependencies to help speed things up a bit:
pip install "internetarchive[speedups]"
This will install ujson for faster JSON parsing, and gevent for concurrent downloads.
If you want to install this module globally on your system instead of inside a virtualenv, use sudo:
sudo pip install internetarchive
Command-Line Usage
Help is available by typing ia --help. You can also get help on a command: ia <command> --help. Available subcommands are configure, metadata, upload, download, search, mine, and catalog.
Downloading
To download the entire TripDown1905 item:
$ ia download TripDown1905
ia download usage examples:
#download just the mp4 files using ``--glob``
$ ia download TripDown1905 --glob='*.mp4'
#download all the mp4 files using ``--formats``:
$ ia download TripDown1905 --format='512Kb MPEG4'
#download multiple formats from an item:
$ ia download TripDown1905 --format='512Kb MPEG4' --format='Ogg Video'
#list all the formats in an item:
$ ia metadata --formats TripDown1905
#download a single file from an item:
$ ia download TripDown1905 TripDown1905_512kb.mp4
#download multiple files from an item:
$ ia download TripDown1905 TripDown1905_512kb.mp4 TripDown1905.ogv
Uploading
You can use the provided ia command-line tool to upload items:
$ export AWS_ACCESS_KEY_ID='xxx'
$ export AWS_SECRET_ACCESS_KEY='yyy'
$ ia upload <identifier> file1 file2 --metadata="title:foo" --metadata="blah:arg"
Metadata
You can use the ia command-line tool down to download item metadata in JSON format:
$ia metadata TripDown1905
You can also modify metadata. Be sure that the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables are set.
$ ia metadata <identifier> --modify="foo:bar" --modify="baz:foooo"
Searching
You can search using the provided ia command-line script:
$ ia search 'subject:"market street" collection:prelinger'
Parallel Downloading
If you have the GNU parallel tool intalled, then you can combine ia search and ia metadata to quickly retrieve data for many items in parallel:
$ia search 'subject:"market street" collection:prelinger' | parallel -j40 'ia metadata {} > {}_meta.json'
Python module usage
Downloading
The Internet Archive stores data in items. You can query the archive using an item identifier:
>>> import internetarchive
>>> item = internetarchive.Item('stairs')
>>> print item.metadata
Items contains files. You can download the entire item:
>>> item.download()
or you can download just a particular file:
>>> f = item.file('glogo.png')
>>> f.download() #writes to disk
>>> f.download('/foo/bar/some_other_name.png')
You can iterate over files:
>>> for f in item.files():
... print f.name, f.sha1
Uploading
You can use the IA’s S3-like interface to upload files to an item. You need to supply your IAS3 credentials in environment variables in order to upload. You can retrieve S3 keys from https://archive.org/account/s3.php
>>> import os
>>> os.environ['AWS_ACCESS_KEY_ID']='x'
>>> os.environ['AWS_SECRET_ACCESS_KEY']='y'
>>> item = internetarchive.Item('new_identifier')
>>> item.upload('/path/to/image.jpg', dict(mediatype='image', creator='Jake Johnson'))
Item-level metadata must be supplied with the first file uploaded to an item.
You can upload additional files to an existing item:
>>> item = internetarchive.Item('existing_identifier')
>>> item.upload(['/path/to/image2.jpg', '/path/to/image3.jpg'])
You can also upload file-like objects:
>>> import StringIO
>>> fh = StringIO.StringIO('hello world')
>>> fh.name = 'hello_world.txt
>>> item.upload(fh)
Modifying Metadata
You can modify metadata for existing items, using the item.modify_metadata() function. This uses the IA Metadata API under the hood and requires your IAS3 credentials.
>>> import os
>>> os.environ['AWS_ACCESS_KEY_ID']='x'
>>> os.environ['AWS_SECRET_ACCESS_KEY']='y'
>>> item = internetarchive.Item('my_identifier')
>>> md = dict(blah='one', foo=['two', 'three'])
>>> item.modify_metadata(md)
Searching
You can search for items using the archive.org advanced search engine:
>>> import internetarchive
>>> search = internetarchive.Search('collection:nasa')
>>> print search.num_found
186911
You can iterate over your results:
>>> for result in search.results:
... print result['identifier']
A note about uploading items with mixed-case names
The Internet Archive allows mixed-case item identifiers, but Amazon S3 does not allow mixed-case bucket names. The internetarchive python module is built on top of the boto S3 module. boto disallows creation of mixed-case buckets, but allows you to download from existing mixed-case buckets. If you wish to upload a new item to the Internet Archive with a mixed-case item identifier, you will need to monkey-patch the boto.s3.connection.check_lowercase_bucketname function:
>>> import boto
>>> def check_lowercase_bucketname(n):
... return True
>>> boto.s3.connection.check_lowercase_bucketname = check_lowercase_bucketname
>>> item = internetarchive.Item('TestUpload_pythonapi_20130812')
>>> item.upload('file.txt', dict(mediatype='texts', creator='Internet Archive'))
True
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.