A python interface to archive.org.
Project description
A python interface to archive.org
---------------------------------
.. image:: https://travis-ci.org/jjjake/internetarchive.svg
:target: https://travis-ci.org/jjjake/internetarchive
.. image:: https://img.shields.io/pypi/dm/internetarchive.svg
:target: https://pypi.python.org/pypi/internetarchive
This package installs a CLI tool named ``ia`` for using archive.org from the command-line.
It also installs the ``internetarchive`` python module for programatic access to archive.org.
Please report all bugs and issues on `Github <https://github.com/jjjake/ia-wrapper/issues>`__.
.. contents:: Table of Contents:
Installation
~~~~~~~~~~~~
You can install this module via pip:
``pip install internetarchive``
Alternatively, you can install a few extra dependencies to help speed things up a bit:
``pip install "internetarchive[speedups]"``
This will install `ujson <https://pypi.python.org/pypi/ujson>`__ for faster JSON parsing,
and `gevent <https://pypi.python.org/pypi/gevent>`__ for concurrent downloads.
If you want to install this module globally on your system instead of inside a ``virtualenv``, use sudo:
``sudo pip install internetarchive``
Configuring
~~~~~~~~~~~
You can configure both the ``ia`` command-line tool and the Python interface from the command-line:
.. code:: bash
$ ia configure
You will be prompted to enter your Archive.org login credentials. If authorization is successful a config file will be saved
on your computer that contains your Archive.org S3 keys for uploading and modifying metadata.
Command-Line Usage
------------------
Help is available by typing ``ia --help``. You can also get help on a command: ``ia <command> --help``.
Available subcommands are ``configure``, ``metadata``, ``upload``, ``download``, ``search``, ``delete``, ``list``, and ``catalog``.
Downloading
~~~~~~~~~~~
To download the entire `TripDown1905 <https://archive.org/details/TripDown1905>`__ item:
.. code:: bash
$ ia download TripDown1905
``ia download`` usage examples:
.. code:: bash
#download just the mp4 files using ``--glob``
$ ia download TripDown1905 --glob='*.mp4'
#download all the mp4 files using ``--formats``:
$ ia download TripDown1905 --format='512Kb MPEG4'
#download multiple formats from an item:
$ ia download TripDown1905 --format='512Kb MPEG4' --format='Ogg Video'
#list all the formats in an item:
$ ia metadata --formats TripDown1905
#download a single file from an item:
$ ia download TripDown1905 TripDown1905_512kb.mp4
#download multiple files from an item:
$ ia download TripDown1905 TripDown1905_512kb.mp4 TripDown1905.ogv
Uploading
~~~~~~~~~
You can use the provided ``ia`` command-line tool to upload items. After `configuring ia <https://github.com/jjjake/internetarchive#configuring>`__,
you can upload files like so:
.. code:: bash
#upload files:
$ ia upload <identifier> file1 file2 --metadata="title:foo" --metadata="blah:arg"
#upload from `stdin`:
$ curl http://dumps.wikimedia.org/kywiki/20130927/kywiki-20130927-pages-logging.xml.gz |
ia upload <identifier> - --remote-name=kywiki-20130927-pages-logging.xml.gz --metadata="title:Uploaded from stdin."
Metadata
~~~~~~~~
You can use the ``ia`` command-line tool to download item metadata in JSON format:
.. code:: bash
$ ia metadata TripDown1905
You can also modify metadata after `configuring ia <https://github.com/jjjake/internetarchive#configuring>`__.
.. code:: bash
$ ia metadata <identifier> --modify="foo:bar" --modify="baz:foooo"
Data Mining
~~~~~~~~~~~
IA Mine can be used for data mining Archive.org metadata and search results: `https://github.com/jjjake/iamine <https://github.com/jjjake/iamine>`__.
Searching
~~~~~~~~~
You can search using the provided ``ia`` command-line script:
.. code:: bash
$ ia search 'subject:"market street" collection:prelinger'
Parallel Downloading
~~~~~~~~~~~~~~~~~~~~
If you have the GNU ``parallel`` tool intalled, then you can combine ``ia search`` and ``ia metadata`` to quickly retrieve data for many items in parallel:
.. code:: bash
$ia search 'subject:"market street" collection:prelinger' | parallel -j40 'ia metadata {} > {}_meta.json'
Python module usage
-------------------
Below is brief overview of the ``internetarchive`` Python library.
Please refer to the `API documentation <http://ia-wrapper.readthedocs.org/en/latest/>`__ for more specific details.
Downloading from Python
~~~~~~~~~~~~~~~~~~~~~~~
The Internet Archive stores data in
`items <http://blog.archive.org/2011/03/31/how-archive-org-items-are-structured/>`__.
You can query the archive using an item identifier:
.. code:: python
>>> from internetarchive import get_item
>>> item = get_item('stairs')
>>> print(item.metadata)
Items contains files. You can download the entire item:
.. code:: python
>>> item.download()
or you can download just a particular file:
.. code:: python
>>> f = item.get_file('glogo.png')
>>> f.download() #writes to disk
>>> f.download('/foo/bar/some_other_name.png')
You can iterate over files:
.. code:: python
>>> for f in item.iter_files():
... print(f.name, f.sha1)
Uploading from Python
~~~~~~~~~~~~~~~~~~~~~
You can use the IA's S3-like interface to upload files to an item after
`configuring the internetarchive library <https://github.com/jjjake/internetarchive#configuring>`__.
.. code:: python
>>> from internetarchive import get_item
>>> item = get_item('new_identifier')
>>> md = dict(mediatype='image', creator='Jake Johnson')
>>> item.upload('/path/to/image.jpg', metadata=md)
Item-level metadata must be supplied with the first file uploaded to an
item.
You can upload additional files to an existing item:
.. code:: python
>>> item = internetarchive.Item('existing_identifier')
>>> item.upload(['/path/to/image2.jpg', '/path/to/image3.jpg'])
You can also upload file-like objects:
.. code:: python
>>> import StringIO
>>> fh = StringIO.StringIO('hello world')
>>> fh.name = 'hello_world.txt'
>>> item.upload(fh)
Modifying Metadata from Python
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
You can modify metadata for existing items, using the ``item.modify_metadata()`` function. This uses the `IA Metadata
API <http://blog.archive.org/2013/07/04/metadata-api/>`__ under the hood and requires your IAS3 credentials. So, once
again make sure you have the `internetarchive library configured <https://github.com/jjjake/internetarchive#configuring>`__.
.. code:: python
>>> from internetarchive import get_item
>>> item = get_item('my_identifier')
>>> md = dict(blah='one', foo=['two', 'three'])
>>> item.modify_metadata(md)
Searching from Python
~~~~~~~~~~~~~~~~~~~~~
You can search for items using the `archive.org advanced search
engine <https://archive.org/advancedsearch.php>`__:
.. code:: python
>>> from internetarchive import search_items
>>> search = search_items('collection:nasa')
>>> print(search.num_found)
186911
You can iterate over your results:
.. code:: python
>>> for result in search:
... print(result['identifier'])
.. :changelog:
Release History
---------------
0.8.9 (2015-08-13)
++++++++++++++++++
**Bugfixes**
- Updated docopt to v0.6.2 and PyYAML to v3.11.
- Updated setup.py to automatically pull version from `__init__`.
0.8.5 (2015-07-13)
++++++++++++++++++
**Bugfixes**
- Fixed UnicodeEncodeError in `ia metadata --append`.
**Features and Improvements**
- Added configuration documentation to readme.
- Updated requests to v2.7.0
0.8.4 (2015-06-18)
++++++++++++++++++
**Features and Improvements**
- Added check to `ia upload` to see if the collection being uploaded to exists.
Also added an option to override this check.
0.8.3 (2015-05-18)
++++++++++++++++++
**Features and Improvements**
- Fixed append to work like a standard metadata update if the metadata field
does not yet exist for the given item.
0.8.0 2015-03-09
++++++++++++++++
**Bugfixes**
- Encode filenames in upload URLs.
0.7.9 (2015-01-26)
++++++++++++++++++
**Bugfixes**
- Fixed bug in `internetarchive.config.get_auth_config` (i.e. `ia configure`)
where logged-in cookies returned expired within hours. Cookies should now be
valid for about one year.
0.7.8 (2014-12-23)
++++++++++++++++++
- Output error message when downloading non-existing files in `ia download` rather
than raising Python exception.
- Fixed IOError in `ia search` when using `head`, `tail`, etc..
- Simplified `ia search` to output only JSON, rather than doing any special
formatting.
- Added experimental support for creating pex binaries of ia in `Makefile`.
0.7.7 (2014-12-17)
++++++++++++++++++
- Simplified `ia configure`. It now only asks for Archive.org email/password and
automatically adds S3 keys and Archive.org cookies to config.
See `internetarchive.config.get_auth_config()`.
0.7.6 (2014-12-17)
++++++++++++++++++
- Write metadata to stdout rather than stderr in `ia mine`.
- Added options to search archive.org/v2.
- Added destdir option to download files/itemdirs to a given destination dir.
0.7.5 (2014-10-08)
++++++++++++++++++
- Fixed typo.
0.7.4 (2014-10-08)
++++++++++++++++++
- Fixed missing "import" typo in `internetarchive.iacli.ia_upload`.
0.7.3 (2014-10-08)
++++++++++++++++++
- Added progress bar to `ia mine`.
- Fixed unicode metadata support for `upload()`.
0.7.2 (2014-09-16)
++++++++++++++++++
- Suppress `KeyboardInterrupt` exceptions and exit with status code 130.
- Added ability to skip downloading files based on checksum in `ia download`,
`Item.download()`, and `File.download()`.
- `ia download` is now verbose by default. Output can be suppressed with the `--quiet`
flag.
- Added an option to not download into item directories, but rather the current working
directory (i.e. `ia download --no-directories <id>`).
- Added/fixed support for modifying different metadata targets (i.e. files/logo.jpg).
0.7.1 (2014-08-25)
++++++++++++++++++
- Added `Item.s3_is_overloaded()` method for S3 status check. This method is now used on
retries in the upload method now as well. This will avoid uploading any data if a 503
is expected. If a 503 is still returned, retries are attempted.
- Added `--status-check` option to `ia upload` for S3 status check.
- Added `--source` parameter to `ia list` for returning files matching IA source (i.e.
original, derivative, metadata, etc.).
- Added support to `ia upload` for setting remote-name if only a single file is being
uploaded.
- Derive tasks are now only queued after the last file has been uploaded.
- File URLs are now quoted in `File` objects, for downloading files with specail
characters in their filenames
0.7.0 (2014-07-23)
++++++++++++++++++
- Added support for retry on S3 503 SlowDown errors.
0.6.9 (2014-07-15)
++++++++++++++++++
- Added support for \n and \r characters in upload headers.
- Added support for reading filenames from stdin when using the `ia delete` command.
0.6.8 (2014-07-11)
++++++++++++++++++
- The delete `ia` subcommand is now verbose by default.
- Added glob support to the delete `ia` subcommand (i.e. `ia delete --glob='*jpg'`).
- Changed indexed metadata elements to clobber values instead of insert.
- AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are now deprecated.
IAS3_ACCESS_KEY and IAS3_SECRET_KEY must be used if setting IAS3
keys via environment variables.
---------------------------------
.. image:: https://travis-ci.org/jjjake/internetarchive.svg
:target: https://travis-ci.org/jjjake/internetarchive
.. image:: https://img.shields.io/pypi/dm/internetarchive.svg
:target: https://pypi.python.org/pypi/internetarchive
This package installs a CLI tool named ``ia`` for using archive.org from the command-line.
It also installs the ``internetarchive`` python module for programatic access to archive.org.
Please report all bugs and issues on `Github <https://github.com/jjjake/ia-wrapper/issues>`__.
.. contents:: Table of Contents:
Installation
~~~~~~~~~~~~
You can install this module via pip:
``pip install internetarchive``
Alternatively, you can install a few extra dependencies to help speed things up a bit:
``pip install "internetarchive[speedups]"``
This will install `ujson <https://pypi.python.org/pypi/ujson>`__ for faster JSON parsing,
and `gevent <https://pypi.python.org/pypi/gevent>`__ for concurrent downloads.
If you want to install this module globally on your system instead of inside a ``virtualenv``, use sudo:
``sudo pip install internetarchive``
Configuring
~~~~~~~~~~~
You can configure both the ``ia`` command-line tool and the Python interface from the command-line:
.. code:: bash
$ ia configure
You will be prompted to enter your Archive.org login credentials. If authorization is successful a config file will be saved
on your computer that contains your Archive.org S3 keys for uploading and modifying metadata.
Command-Line Usage
------------------
Help is available by typing ``ia --help``. You can also get help on a command: ``ia <command> --help``.
Available subcommands are ``configure``, ``metadata``, ``upload``, ``download``, ``search``, ``delete``, ``list``, and ``catalog``.
Downloading
~~~~~~~~~~~
To download the entire `TripDown1905 <https://archive.org/details/TripDown1905>`__ item:
.. code:: bash
$ ia download TripDown1905
``ia download`` usage examples:
.. code:: bash
#download just the mp4 files using ``--glob``
$ ia download TripDown1905 --glob='*.mp4'
#download all the mp4 files using ``--formats``:
$ ia download TripDown1905 --format='512Kb MPEG4'
#download multiple formats from an item:
$ ia download TripDown1905 --format='512Kb MPEG4' --format='Ogg Video'
#list all the formats in an item:
$ ia metadata --formats TripDown1905
#download a single file from an item:
$ ia download TripDown1905 TripDown1905_512kb.mp4
#download multiple files from an item:
$ ia download TripDown1905 TripDown1905_512kb.mp4 TripDown1905.ogv
Uploading
~~~~~~~~~
You can use the provided ``ia`` command-line tool to upload items. After `configuring ia <https://github.com/jjjake/internetarchive#configuring>`__,
you can upload files like so:
.. code:: bash
#upload files:
$ ia upload <identifier> file1 file2 --metadata="title:foo" --metadata="blah:arg"
#upload from `stdin`:
$ curl http://dumps.wikimedia.org/kywiki/20130927/kywiki-20130927-pages-logging.xml.gz |
ia upload <identifier> - --remote-name=kywiki-20130927-pages-logging.xml.gz --metadata="title:Uploaded from stdin."
Metadata
~~~~~~~~
You can use the ``ia`` command-line tool to download item metadata in JSON format:
.. code:: bash
$ ia metadata TripDown1905
You can also modify metadata after `configuring ia <https://github.com/jjjake/internetarchive#configuring>`__.
.. code:: bash
$ ia metadata <identifier> --modify="foo:bar" --modify="baz:foooo"
Data Mining
~~~~~~~~~~~
IA Mine can be used for data mining Archive.org metadata and search results: `https://github.com/jjjake/iamine <https://github.com/jjjake/iamine>`__.
Searching
~~~~~~~~~
You can search using the provided ``ia`` command-line script:
.. code:: bash
$ ia search 'subject:"market street" collection:prelinger'
Parallel Downloading
~~~~~~~~~~~~~~~~~~~~
If you have the GNU ``parallel`` tool intalled, then you can combine ``ia search`` and ``ia metadata`` to quickly retrieve data for many items in parallel:
.. code:: bash
$ia search 'subject:"market street" collection:prelinger' | parallel -j40 'ia metadata {} > {}_meta.json'
Python module usage
-------------------
Below is brief overview of the ``internetarchive`` Python library.
Please refer to the `API documentation <http://ia-wrapper.readthedocs.org/en/latest/>`__ for more specific details.
Downloading from Python
~~~~~~~~~~~~~~~~~~~~~~~
The Internet Archive stores data in
`items <http://blog.archive.org/2011/03/31/how-archive-org-items-are-structured/>`__.
You can query the archive using an item identifier:
.. code:: python
>>> from internetarchive import get_item
>>> item = get_item('stairs')
>>> print(item.metadata)
Items contains files. You can download the entire item:
.. code:: python
>>> item.download()
or you can download just a particular file:
.. code:: python
>>> f = item.get_file('glogo.png')
>>> f.download() #writes to disk
>>> f.download('/foo/bar/some_other_name.png')
You can iterate over files:
.. code:: python
>>> for f in item.iter_files():
... print(f.name, f.sha1)
Uploading from Python
~~~~~~~~~~~~~~~~~~~~~
You can use the IA's S3-like interface to upload files to an item after
`configuring the internetarchive library <https://github.com/jjjake/internetarchive#configuring>`__.
.. code:: python
>>> from internetarchive import get_item
>>> item = get_item('new_identifier')
>>> md = dict(mediatype='image', creator='Jake Johnson')
>>> item.upload('/path/to/image.jpg', metadata=md)
Item-level metadata must be supplied with the first file uploaded to an
item.
You can upload additional files to an existing item:
.. code:: python
>>> item = internetarchive.Item('existing_identifier')
>>> item.upload(['/path/to/image2.jpg', '/path/to/image3.jpg'])
You can also upload file-like objects:
.. code:: python
>>> import StringIO
>>> fh = StringIO.StringIO('hello world')
>>> fh.name = 'hello_world.txt'
>>> item.upload(fh)
Modifying Metadata from Python
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
You can modify metadata for existing items, using the ``item.modify_metadata()`` function. This uses the `IA Metadata
API <http://blog.archive.org/2013/07/04/metadata-api/>`__ under the hood and requires your IAS3 credentials. So, once
again make sure you have the `internetarchive library configured <https://github.com/jjjake/internetarchive#configuring>`__.
.. code:: python
>>> from internetarchive import get_item
>>> item = get_item('my_identifier')
>>> md = dict(blah='one', foo=['two', 'three'])
>>> item.modify_metadata(md)
Searching from Python
~~~~~~~~~~~~~~~~~~~~~
You can search for items using the `archive.org advanced search
engine <https://archive.org/advancedsearch.php>`__:
.. code:: python
>>> from internetarchive import search_items
>>> search = search_items('collection:nasa')
>>> print(search.num_found)
186911
You can iterate over your results:
.. code:: python
>>> for result in search:
... print(result['identifier'])
.. :changelog:
Release History
---------------
0.8.9 (2015-08-13)
++++++++++++++++++
**Bugfixes**
- Updated docopt to v0.6.2 and PyYAML to v3.11.
- Updated setup.py to automatically pull version from `__init__`.
0.8.5 (2015-07-13)
++++++++++++++++++
**Bugfixes**
- Fixed UnicodeEncodeError in `ia metadata --append`.
**Features and Improvements**
- Added configuration documentation to readme.
- Updated requests to v2.7.0
0.8.4 (2015-06-18)
++++++++++++++++++
**Features and Improvements**
- Added check to `ia upload` to see if the collection being uploaded to exists.
Also added an option to override this check.
0.8.3 (2015-05-18)
++++++++++++++++++
**Features and Improvements**
- Fixed append to work like a standard metadata update if the metadata field
does not yet exist for the given item.
0.8.0 2015-03-09
++++++++++++++++
**Bugfixes**
- Encode filenames in upload URLs.
0.7.9 (2015-01-26)
++++++++++++++++++
**Bugfixes**
- Fixed bug in `internetarchive.config.get_auth_config` (i.e. `ia configure`)
where logged-in cookies returned expired within hours. Cookies should now be
valid for about one year.
0.7.8 (2014-12-23)
++++++++++++++++++
- Output error message when downloading non-existing files in `ia download` rather
than raising Python exception.
- Fixed IOError in `ia search` when using `head`, `tail`, etc..
- Simplified `ia search` to output only JSON, rather than doing any special
formatting.
- Added experimental support for creating pex binaries of ia in `Makefile`.
0.7.7 (2014-12-17)
++++++++++++++++++
- Simplified `ia configure`. It now only asks for Archive.org email/password and
automatically adds S3 keys and Archive.org cookies to config.
See `internetarchive.config.get_auth_config()`.
0.7.6 (2014-12-17)
++++++++++++++++++
- Write metadata to stdout rather than stderr in `ia mine`.
- Added options to search archive.org/v2.
- Added destdir option to download files/itemdirs to a given destination dir.
0.7.5 (2014-10-08)
++++++++++++++++++
- Fixed typo.
0.7.4 (2014-10-08)
++++++++++++++++++
- Fixed missing "import" typo in `internetarchive.iacli.ia_upload`.
0.7.3 (2014-10-08)
++++++++++++++++++
- Added progress bar to `ia mine`.
- Fixed unicode metadata support for `upload()`.
0.7.2 (2014-09-16)
++++++++++++++++++
- Suppress `KeyboardInterrupt` exceptions and exit with status code 130.
- Added ability to skip downloading files based on checksum in `ia download`,
`Item.download()`, and `File.download()`.
- `ia download` is now verbose by default. Output can be suppressed with the `--quiet`
flag.
- Added an option to not download into item directories, but rather the current working
directory (i.e. `ia download --no-directories <id>`).
- Added/fixed support for modifying different metadata targets (i.e. files/logo.jpg).
0.7.1 (2014-08-25)
++++++++++++++++++
- Added `Item.s3_is_overloaded()` method for S3 status check. This method is now used on
retries in the upload method now as well. This will avoid uploading any data if a 503
is expected. If a 503 is still returned, retries are attempted.
- Added `--status-check` option to `ia upload` for S3 status check.
- Added `--source` parameter to `ia list` for returning files matching IA source (i.e.
original, derivative, metadata, etc.).
- Added support to `ia upload` for setting remote-name if only a single file is being
uploaded.
- Derive tasks are now only queued after the last file has been uploaded.
- File URLs are now quoted in `File` objects, for downloading files with specail
characters in their filenames
0.7.0 (2014-07-23)
++++++++++++++++++
- Added support for retry on S3 503 SlowDown errors.
0.6.9 (2014-07-15)
++++++++++++++++++
- Added support for \n and \r characters in upload headers.
- Added support for reading filenames from stdin when using the `ia delete` command.
0.6.8 (2014-07-11)
++++++++++++++++++
- The delete `ia` subcommand is now verbose by default.
- Added glob support to the delete `ia` subcommand (i.e. `ia delete --glob='*jpg'`).
- Changed indexed metadata elements to clobber values instead of insert.
- AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are now deprecated.
IAS3_ACCESS_KEY and IAS3_SECRET_KEY must be used if setting IAS3
keys via environment variables.