A python interface to archive.org.
Project description
This package installs a CLI tool named ia for using archive.org from the command-line. It also installs the internetarchive Python module for programmatic access to archive.org. Please report all bugs and issues on GitHub.
Installation
You can install this module via pip:
pip install internetarchive
Alternatively, you can install a few extra dependencies to help speed things up a bit:
pip install "internetarchive[speedups]"
This will install ujson for faster JSON parsing, and gevent for concurrent downloads.
If you want to install this module globally on your system instead of inside a virtualenv, use sudo:
sudo pip install internetarchive
Configuring
You can configure both the ia command-line tool and the Python interface from the command-line:
$ ia configure
You will be prompted to enter your Archive.org login credentials. If authorization is successful, a config file containing your Archive.org S3 keys for uploading and modifying metadata will be saved on your computer.
Command-Line Usage
Help is available by typing ia --help. You can also get help on a command: ia <command> --help. Available subcommands are configure, metadata, upload, download, search, delete, list, and catalog.
Downloading
To download the entire TripDown1905 item:
$ ia download TripDown1905
ia download usage examples:
#download just the mp4 files using ``--glob``
$ ia download TripDown1905 --glob='*.mp4'
#download all the mp4 files using ``--format``:
$ ia download TripDown1905 --format='512Kb MPEG4'
#download multiple formats from an item:
$ ia download TripDown1905 --format='512Kb MPEG4' --format='Ogg Video'
#list all the formats in an item:
$ ia metadata --formats TripDown1905
#download a single file from an item:
$ ia download TripDown1905 TripDown1905_512kb.mp4
#download multiple files from an item:
$ ia download TripDown1905 TripDown1905_512kb.mp4 TripDown1905.ogv
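If you drive ia download from a script, the invocations above can be assembled programmatically. The following is a minimal sketch and not part of the internetarchive package: build_download_command is a hypothetical helper that only constructs the argument list, using the --glob and --format options shown above. Actually running the result (for example with subprocess.run) assumes ia is installed and on your PATH.

```python
def build_download_command(identifier, files=(), glob=None, formats=()):
    """Return an argv list for `ia download` (hypothetical helper)."""
    cmd = ['ia', 'download', identifier]
    # Explicit file names come right after the identifier.
    cmd.extend(files)
    if glob is not None:
        cmd.append('--glob={}'.format(glob))
    # --format may be repeated, once per format name.
    for fmt in formats:
        cmd.append('--format={}'.format(fmt))
    return cmd


# Build (but do not run) the "multiple formats" example from above.
command = build_download_command(
    'TripDown1905', formats=['512Kb MPEG4', 'Ogg Video'])
```

Passing the list directly to subprocess avoids shell quoting problems with values such as '512Kb MPEG4' that contain spaces.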
Uploading
You can use the provided ia command-line tool to upload items. After configuring ia, you can upload files like so:
#upload files:
$ ia upload <identifier> file1 file2 --metadata="title:foo" --metadata="blah:arg"
#upload from `stdin`:
$ curl http://dumps.wikimedia.org/kywiki/20130927/kywiki-20130927-pages-logging.xml.gz |
ia upload <identifier> - --remote-name=kywiki-20130927-pages-logging.xml.gz --metadata="title:Uploaded from stdin."
Metadata
You can use the ia command-line tool to download item metadata in JSON format:
$ ia metadata TripDown1905
You can also modify metadata after configuring ia.
$ ia metadata <identifier> --modify="foo:bar" --modify="baz:foooo"
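Both --metadata and --modify take key:value pairs, while the Python interface takes a dictionary. The sketch below shows one way to convert from the first form to the second. parse_pairs is a hypothetical helper, not part of the package, and splitting on the first colon only is an assumption made so that values may themselves contain colons.

```python
def parse_pairs(pairs):
    """Turn ['key:value', ...] strings into a metadata dict (hypothetical helper)."""
    metadata = {}
    for pair in pairs:
        # Split on the first colon only, so 'url:http://example.com' keeps its value intact.
        key, sep, value = pair.partition(':')
        if not sep or not key:
            raise ValueError('expected key:value, got {!r}'.format(pair))
        metadata[key] = value
    return metadata


md = parse_pairs(['foo:bar', 'baz:foooo'])
```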
Data Mining
IA Mine can be used for data mining Archive.org metadata and search results: https://github.com/jjjake/iamine.
Searching
You can search using the provided ia command-line script:
$ ia search 'subject:"market street" collection:prelinger'
Parallel Downloading
If you have the GNU parallel tool installed, then you can combine ia search and ia metadata to quickly retrieve data for many items in parallel:
$ ia search 'subject:"market street" collection:prelinger' | parallel -j40 'ia metadata {} > {}_meta.json'
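If GNU parallel is not available, a similar fan-out can be sketched in Python. This is an illustration rather than part of the package: ia_metadata and fetch_all are hypothetical helpers, a thread pool stands in for GNU parallel, and the ia command is assumed to be on your PATH.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def ia_metadata(identifier):
    """Run `ia metadata <identifier>` and return its output as text."""
    output = subprocess.check_output(['ia', 'metadata', identifier])
    return output.decode('utf-8')


def fetch_all(identifiers, fetch=ia_metadata, workers=40):
    """Call fetch() for every identifier concurrently; return {identifier: result}."""
    identifiers = list(identifiers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results line up with identifiers.
        results = pool.map(fetch, identifiers)
        return dict(zip(identifiers, results))
```

The fetch argument is injectable so the fan-out logic can be exercised without network access; workers=40 mirrors the -j40 setting in the command above.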
Python module usage
Below is a brief overview of the internetarchive Python library. Please refer to the API documentation for more specific details.
Downloading from Python
The Internet Archive stores data in items. You can query the archive using an item identifier:
>>> from internetarchive import get_item
>>> item = get_item('stairs')
>>> print(item.metadata)
Items contain files. You can download the entire item:
>>> item.download()
or you can download just a particular file:
>>> f = item.get_file('glogo.png')
>>> f.download() #writes to disk
>>> f.download('/foo/bar/some_other_name.png')
You can iterate over files:
>>> for f in item.iter_files():
... print(f.name, f.sha1)
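Because each file exposes a sha1 attribute, a downloaded copy can be checked against it. The helper below is a minimal sketch using only the standard library; sha1_of is a hypothetical name and not part of the internetarchive package.

```python
import hashlib


def sha1_of(path, chunk_size=1024 * 1024):
    """Return the hex SHA-1 digest of the file at path, read in chunks."""
    digest = hashlib.sha1()
    with open(path, 'rb') as fh:
        # Read fixed-size chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()
```

For a downloaded file you could then compare sha1_of(local_path) with f.sha1. Where the file lands on disk depends on how it was downloaded, so local_path is left to the caller.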
Uploading from Python
You can use the IA’s S3-like interface to upload files to an item after configuring the internetarchive library.
>>> from internetarchive import get_item
>>> item = get_item('new_identifier')
>>> md = dict(mediatype='image', creator='Jake Johnson')
>>> item.upload('/path/to/image.jpg', metadata=md)
Item-level metadata must be supplied with the first file uploaded to an item.
You can upload additional files to an existing item:
>>> item = get_item('existing_identifier')
>>> item.upload(['/path/to/image2.jpg', '/path/to/image3.jpg'])
You can also upload file-like objects:
>>> import StringIO
>>> fh = StringIO.StringIO('hello world')
>>> fh.name = 'hello_world.txt'
>>> item.upload(fh)
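The snippet above uses the Python 2 StringIO module. On Python 3, a comparable named file-like object can be built with io.BytesIO, as sketched below. Whether item.upload() in this release accepts such an object under Python 3 is an assumption you should verify before relying on it.

```python
import io

# An in-memory binary file; upload interfaces generally expect bytes, not text.
fh = io.BytesIO(b'hello world')
# Give the object a name so it can be used as the remote file name.
fh.name = 'hello_world.txt'
```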
Modifying Metadata from Python
You can modify metadata for existing items using the item.modify_metadata() method. This uses the IA Metadata API under the hood and requires your IAS3 credentials, so make sure you have the internetarchive library configured.
>>> from internetarchive import get_item
>>> item = get_item('my_identifier')
>>> md = dict(blah='one', foo=['two', 'three'])
>>> item.modify_metadata(md)
Searching from Python
You can search for items using the archive.org advanced search engine:
>>> from internetarchive import search_items
>>> search = search_items('collection:nasa')
>>> print(search.num_found)
186911
You can iterate over your results:
>>> for result in search:
... print(result['identifier'])
Release History
0.9.7 (2015-11-05)
Bugfixes
Cleanup partially downloaded files when download() fails.
Features and Improvements
Added --format option to ia delete.
Refactored download() and ia download to behave more like rsync. Files are now clobbered by default; ignore_existing and --ignore-existing now skip over files already downloaded without making a request.
Added retry support to download() and ia download.
Added files kwarg to Item.download() for downloading specific files.
Added ignore_errors option to File.download() for ignoring (but logging) exceptions.
Added default timeouts to metadata and download requests.
Less verbose output in ia download by default. Use ia download --verbose for old style output.
0.9.6 (2015-10-12)
Bugfixes
Removed sync-db features for now, as lazytable is not playing nicely with setup.py.
0.9.5 (2015-10-12)
Features and Improvements
Added skip based on mtime and length if no other clobber/skip options specified in download() and ia download.
0.9.4 (2015-10-01)
Features and Improvements
Added internetarchive.api.get_username() for retrieving a username with an S3 key-pair.
Added ability to sync downloads via an sqlite database.
0.9.3 (2015-09-28)
Features and Improvements
Added ability to download items from an itemlist or search query in ia download.
Made ia configure Python 3 compatible.
Bugfixes
Fixed bug in ia upload where uploading an item with more than one collection specified caused the collection check to fail.
0.9.2 (2015-08-17)
Bugfixes
Added error message for failed ia configure calls due to invalid creds.
0.9.1 (2015-08-13)
Bugfixes
Updated docopt to v0.6.2 and PyYAML to v3.11.
Updated setup.py to automatically pull version from __init__.
0.8.5 (2015-07-13)
Bugfixes
Fixed UnicodeEncodeError in ia metadata --append.
Features and Improvements
Added configuration documentation to readme.
Updated requests to v2.7.0
0.8.4 (2015-06-18)
Features and Improvements
Added check to ia upload to see if the collection being uploaded to exists. Also added an option to override this check.
0.8.3 (2015-05-18)
Features and Improvements
Fixed append to work like a standard metadata update if the metadata field does not yet exist for the given item.
0.8.0 (2015-03-09)
Bugfixes
Encode filenames in upload URLs.
0.7.9 (2015-01-26)
Bugfixes
Fixed bug in internetarchive.config.get_auth_config (i.e. ia configure) where logged-in cookies returned expired within hours. Cookies should now be valid for about one year.
0.7.8 (2014-12-23)
Output error message when downloading non-existing files in ia download rather than raising Python exception.
Fixed IOError in ia search when using head, tail, etc.
Simplified ia search to output only JSON, rather than doing any special formatting.
Added experimental support for creating pex binaries of ia in Makefile.
0.7.7 (2014-12-17)
Simplified ia configure. It now only asks for Archive.org email/password and automatically adds S3 keys and Archive.org cookies to config. See internetarchive.config.get_auth_config().
0.7.6 (2014-12-17)
Write metadata to stdout rather than stderr in ia mine.
Added options to search archive.org/v2.
Added destdir option to download files/itemdirs to a given destination dir.
0.7.5 (2014-10-08)
Fixed typo.
0.7.4 (2014-10-08)
Fixed missing “import” typo in internetarchive.iacli.ia_upload.
0.7.3 (2014-10-08)
Added progress bar to ia mine.
Fixed unicode metadata support for upload().
0.7.2 (2014-09-16)
Suppress KeyboardInterrupt exceptions and exit with status code 130.
Added ability to skip downloading files based on checksum in ia download, Item.download(), and File.download().
ia download is now verbose by default. Output can be suppressed with the --quiet flag.
Added an option to not download into item directories, but rather the current working directory (i.e. ia download --no-directories <id>).
Added/fixed support for modifying different metadata targets (i.e. files/logo.jpg).
0.7.1 (2014-08-25)
Added Item.s3_is_overloaded() method for S3 status check. This method is now also used on retries in the upload method, which avoids uploading any data if a 503 is expected. If a 503 is still returned, retries are attempted.
Added --status-check option to ia upload for S3 status check.
Added --source parameter to ia list for returning files matching IA source (i.e. original, derivative, metadata, etc.).
Added support to ia upload for setting remote-name if only a single file is being uploaded.
Derive tasks are now only queued after the last file has been uploaded.
File URLs are now quoted in File objects, for downloading files with special characters in their filenames.
0.7.0 (2014-07-23)
Added support for retry on S3 503 SlowDown errors.
0.6.9 (2014-07-15)
Added support for \n and \r characters in upload headers.
Added support for reading filenames from stdin when using the ia delete command.
0.6.8 (2014-07-11)
The delete ia subcommand is now verbose by default.
Added glob support to the delete ia subcommand (i.e. ia delete --glob='*jpg').
Changed indexed metadata elements to clobber values instead of insert.
AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are now deprecated. IAS3_ACCESS_KEY and IAS3_SECRET_KEY must be used if setting IAS3 keys via environment variables.
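As an illustration of the environment variable names mentioned above, the sketch below reads the two IAS3 variables from a mapping. ias3_keys_from_env is a hypothetical helper and does not show how the library itself loads its configuration. It deliberately ignores the deprecated AWS_* names.

```python
import os


def ias3_keys_from_env(environ=None):
    """Return (access_key, secret_key) from the IAS3_* variables, or None if unset."""
    environ = os.environ if environ is None else environ
    access = environ.get('IAS3_ACCESS_KEY')
    secret = environ.get('IAS3_SECRET_KEY')
    # Both values are required; a partial pair is treated as "not configured".
    if access and secret:
        return access, secret
    return None
```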
Source Distribution
File details
Details for the file internetarchive-0.9.7.tar.gz.
File metadata
- Download URL: internetarchive-0.9.7.tar.gz
- Upload date:
- Size: 56.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest
---|---
SHA256 | af52e3d5fa8283b845555c5cfa2939ac2ec4c51c7190432389ff82a9f3768d68
MD5 | a021247ae2c54f4b4436eb39534e2a3d
BLAKE2b-256 | be596634ae23731a67203a8c497ab034ebd80f57808edf961608af5a35888165