azure-datalake-store
====================
.. image:: https://travis-ci.org/Azure/azure-data-lake-store-python.svg?branch=dev
    :target: https://travis-ci.org/Azure/azure-data-lake-store-python

.. image:: https://coveralls.io/repos/github/Azure/azure-data-lake-store-python/badge.svg?branch=master
    :target: https://coveralls.io/github/Azure/azure-data-lake-store-python?branch=master

azure-datalake-store is a Python filesystem client library for Azure Data
Lake Store.
To install from source instead of from pip (for local testing and development):

.. code-block:: bash

    > pip install -r dev_requirements.txt
    > python setup.py develop

To run the tests, you must set the following environment variables:
azure_tenant_id, azure_username, azure_password, azure_data_lake_store_name
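
For example, these variables can be read back in Python to supply the
credentials used in the starting point below (a minimal sketch using the
variable names above):

.. code-block:: python

    import os

    # Credentials and store name, taken from the environment
    # variables described above.
    tenant_id = os.environ['azure_tenant_id']
    username = os.environ['azure_username']
    password = os.environ['azure_password']
    store_name = os.environ['azure_data_lake_store_name']
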
To play with the code, here is a starting point:

.. code-block:: python

    from azure.datalake.store import core, lib, multithread
    token = lib.auth(tenant_id, username, password)
    adl = core.AzureDLFileSystem(token, store_name=store_name)

    # typical operations
    adl.ls('')
    adl.ls('tmp/', detail=True)
    adl.ls('tmp/', detail=True, invalidate_cache=True)
    adl.cat('littlefile')
    adl.head('gdelt20150827.csv')

    # file-like object
    with adl.open('gdelt20150827.csv', blocksize=2**20) as f:
        print(f.readline())
        print(f.readline())
        print(f.readline())

    # could have passed f to any function requiring a file object:
    # pandas.read_csv(f)

    with adl.open('anewfile', 'wb') as f:
        # data is written on flush/close, or when the buffer is bigger
        # than blocksize
        f.write(b'important data')
    adl.du('anewfile')

    # recursively download the whole directory tree with 5 threads and
    # 16MB chunks
    multithread.ADLDownloader(adl, "", 'my_temp_dir', 5, 2**24)

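As the comment above suggests, the open file handle can be passed to any
function expecting a file object. For example, reading the remote CSV with
pandas (a sketch, assuming pandas is installed and the file parses as CSV):

.. code-block:: python

    import pandas as pd

    # Stream the remote CSV straight into a DataFrame through the
    # file-like object returned by adl.open().
    with adl.open('gdelt20150827.csv', blocksize=2**20) as f:
        df = pd.read_csv(f, nrows=100)  # nrows keeps the example small
    print(df.shape)
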
Progress can be tracked using a callback function with the signature
`track(current, total)`. When passed, the callback is invoked on each
completed chunk with the number of bytes transferred so far and the total
number of bytes. Here's an example using the Azure CLI progress controller
as the `progress_callback`:

.. code-block:: python

    from cli.core.application import APPLICATION

    def _update_progress(current, total):
        hook = APPLICATION.get_progress_controller(det=True)
        hook.add(message='Alive', value=current, total_val=total)
        if total == current:
            hook.end()

    ...

    ADLUploader(client, destination_path, source_path, thread_count,
                overwrite=overwrite,
                chunksize=chunk_size,
                buffersize=buffer_size,
                blocksize=block_size,
                progress_callback=_update_progress)

This will output a progress bar to stdout:

.. code-block:: text

    Alive[######################### ] 40.0881%
    ...
    Finished[#############################################################] 100.0000%

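A self-contained alternative that does not depend on Azure CLI internals is
a plain printing callback. This is a minimal sketch; it assumes
`ADLDownloader` accepts the same `progress_callback` parameter shown for
`ADLUploader` above:

.. code-block:: python

    import sys

    def print_progress(current, total):
        # Called after each completed chunk with the bytes transferred so far.
        pct = 100.0 * current / total if total else 100.0
        sys.stdout.write('\rTransferred {}/{} bytes ({:.4f}%)'.format(
            current, total, pct))
        sys.stdout.flush()
        if current == total:
            sys.stdout.write('\n')  # finish the line when the transfer completes

    multithread.ADLDownloader(adl, '', 'my_temp_dir', 5, 2**24,
                              progress_callback=print_progress)
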
Command Line Sample Usage
-------------------------
To interact with the API at a higher level, you can use the provided
command-line interface in "samples/cli.py". You will need to set
the appropriate environment variables as described above to connect to the
Azure Data Lake Store. Below is a simple example; more details follow.

.. code-block:: bash

    > python samples/cli.py ls -l

Execute the program without arguments to access the documentation.
To start the CLI in interactive mode, run "python samples/cli.py"
and then type "help" to see all available commands (similar to Unix utilities):

.. code-block:: bash

    > python samples/cli.py
    azure> help

    Documented commands (type help <topic>):
    ========================================
    cat    chmod  close  du      get   help  ls     mv   quit  rmdir  touch
    chgrp  chown  df     exists  head  info  mkdir  put  rm    tail

    azure>

While still in interactive mode, you can run "ls -l" to list the entries in the
home directory ("help ls" will show the command's usage details). If you're not
familiar with the Unix/Linux "ls" command, the columns represent 1) permissions,
2) file owner, 3) file group, 4) file size, 5-7) file's modification time, and
8) file name.

.. code-block:: bash

    > python samples/cli.py
    azure> ls -l
    drwxrwx--- 0123abcd 0123abcd         0 Aug 02 12:44 azure1
    -rwxrwx--- 0123abcd 0123abcd   1048576 Jul 25 18:33 abc.csv
    -r-xr-xr-x 0123abcd 0123abcd        36 Jul 22 18:32 xyz.csv
    drwxrwx--- 0123abcd 0123abcd         0 Aug 03 13:46 tmp
    azure> ls -l --human-readable
    drwxrwx--- 0123abcd 0123abcd   0B Aug 02 12:44 azure1
    -rwxrwx--- 0123abcd 0123abcd   1M Jul 25 18:33 abc.csv
    -r-xr-xr-x 0123abcd 0123abcd  36B Jul 22 18:32 xyz.csv
    drwxrwx--- 0123abcd 0123abcd   0B Aug 03 13:46 tmp
    azure>

To download a remote file, run "get remote-file [local-file]". The second
argument, "local-file", is optional. If not provided, the local file will be
named after the remote file minus the directory path.

.. code-block:: bash

    > python samples/cli.py
    azure> ls -l
    drwxrwx--- 0123abcd 0123abcd         0 Aug 02 12:44 azure1
    -rwxrwx--- 0123abcd 0123abcd   1048576 Jul 25 18:33 abc.csv
    -r-xr-xr-x 0123abcd 0123abcd        36 Jul 22 18:32 xyz.csv
    drwxrwx--- 0123abcd 0123abcd         0 Aug 03 13:46 tmp
    azure> get xyz.csv
    2016-08-04 18:57:48,603 - ADLFS - DEBUG - Creating empty file xyz.csv
    2016-08-04 18:57:48,604 - ADLFS - DEBUG - Fetch: xyz.csv, 0-36
    2016-08-04 18:57:49,726 - ADLFS - DEBUG - Downloaded to xyz.csv, byte offset 0
    2016-08-04 18:57:49,734 - ADLFS - DEBUG - File downloaded (xyz.csv -> xyz.csv)
    azure>

It is also possible to run in command-line mode, allowing any available command
to be executed separately without remaining in the interpreter.
For example, listing the entries in the home directory:

.. code-block:: bash

    > python samples/cli.py ls -l
    drwxrwx--- 0123abcd 0123abcd         0 Aug 02 12:44 azure1
    -rwxrwx--- 0123abcd 0123abcd   1048576 Jul 25 18:33 abc.csv
    -r-xr-xr-x 0123abcd 0123abcd        36 Jul 22 18:32 xyz.csv
    drwxrwx--- 0123abcd 0123abcd         0 Aug 03 13:46 tmp
    >

Also, downloading a remote file:

.. code-block:: bash

    > python samples/cli.py get xyz.csv
    2016-08-04 18:57:48,603 - ADLFS - DEBUG - Creating empty file xyz.csv
    2016-08-04 18:57:48,604 - ADLFS - DEBUG - Fetch: xyz.csv, 0-36
    2016-08-04 18:57:49,726 - ADLFS - DEBUG - Downloaded to xyz.csv, byte offset 0
    2016-08-04 18:57:49,734 - ADLFS - DEBUG - File downloaded (xyz.csv -> xyz.csv)
    >

.. :changelog:
Release History
===============
0.0.15 (2017-07-26)
-------------------
* Enable Data Lake Store progress controller callback #174
* Fix file state incorrectly being marked as "errored" if it contains chunks in "pending" state #182
* Fix race condition due to `transfer` future `done_callback` #177
0.0.14 (2017-07-10)
-------------------
* Fix an issue where common prefixes in paths for upload and download were collapsed into only unique paths.
0.0.13 (2017-06-28)
-------------------
* Add support for automatic refreshing of service principal credentials
0.0.12 (2017-06-20)
-------------------
* Fix a regression with ls returning the top level folder if it has no contents. It now properly returns an empty array if a folder has no children.
0.0.11 (2017-06-02)
-------------------
* Update to name incomplete file downloads with a `.inprogress` suffix. This suffix is removed when the download completes successfully.
0.0.10 (2017-05-24)
-------------------
* Allow users to explicitly use or invalidate the internal, local cache of the filesystem that is built up from previous `ls` calls. It is now set to always call the service instead of the cache by default.
* Update to properly create the wheel package during build to ensure all pip packages are available.
* Update folder upload/download to properly throw early in the event that the destination files exist and overwrite was not specified. NOTE: target folder existence (or sub folder existence) does not automatically cause failure. Only leaf node existence will result in failure.
* Fix a bug that caused file not found errors when attempting to get information about the root folder.
0.0.9 (2017-05-09)
------------------
* Enforce basic SSL utilization to ensure performance (see `GitHub issue 625 <https://github.com/pyca/pyopenssl/issues/625>`_)
0.0.8 (2017-04-26)
------------------
* Fix server-side throttling retry support. This is not a guarantee that if the server is throttling the upload (or download) it will eventually succeed, but there is now a back-off retry in place to make it more likely.
0.0.7 (2017-04-19)
------------------
* Update the build process to more efficiently handle multi-part namespaces for pip.
0.0.6 (2017-03-15)
------------------
* Fix an issue with path caching that should drastically improve performance for download
0.0.5 (2017-03-01)
------------------
* Fix for downloader to ensure there is access to the source path before creating destination files
* Fix for credential objects to inherit from msrest.authentication for more universal authentication support
* Add support for the following:

  * set_expiry: allows for setting expiration on files
  * ACL management:

    * set_acl: allows for the full replacement of an ACL on a file or folder
    * set_acl_entries: allows for "patching" an existing ACL on a file or folder
    * get_acl_status: retrieves the ACL information for a file or folder
    * remove_acl_entries: removes the specified entries from an ACL on a file or folder
    * remove_acl: removes all non-default ACL entries from a file or folder
    * remove_default_acl: removes all default ACL entries from a folder

* Remove unsupported and unused "TRUNCATE" operation.
* Added API version support, defaulting to the latest API version (2016-11-01).
0.0.4 (2017-02-07)
------------------
* Fix for folder upload to properly delete folders with contents when overwrite specified.
* Fix to set verbose output to False/Off by default. This removes progress tracking output by default but drastically improves performance.
0.0.3 (2017-02-02)
------------------
* Fix to setup.py to include the HISTORY.rst file. No other changes.
0.0.2 (2017-01-30)
------------------
* Addresses an issue with lib.auth() not properly defaulting to 2FA
* Fixes an issue with Overwrite for ADLUploader sometimes not being honored.
* Fixes an issue with empty files not properly being uploaded and resulting in a hang in progress tracking.
* Addition of a samples directory showcasing examples of how to use the client and upload and download logic.
* General cleanup of documentation and comments.
* This is still based on API version 2016-11-01
0.0.1 (2016-11-21)
------------------
* Initial preview release. Based on API version 2016-11-01.
* Includes initial ADLS filesystem functionality and extended upload and download support.