Azure Data Lake Store Filesystem Client Library for Python
Project description
azure-datalake-store
azure-datalake-store is a file-system management library in Python for Azure Data Lake Store.
To install from source instead of pip (for local testing and development):
> pip install -r dev_requirements.txt
> python setup.py develop
To run the tests, you must set the following environment variables: azure_tenant_id, azure_username, azure_password, azure_data_lake_store_name.
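If you have already set those variables for testing, one convenient way to bind the names used in the snippets below is to read them from the environment (a minimal sketch; this is ordinary os.environ access, not something the library requires):

import os

# Bind the names used by the examples below to the test environment variables.
tenant_id = os.environ['azure_tenant_id']
username = os.environ['azure_username']
password = os.environ['azure_password']
store_name = os.environ['azure_data_lake_store_name']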
To play with the code, here is a starting point:
from azure.datalake.store import core, lib, multithread
token = lib.auth(tenant_id, username, password)
adl = core.AzureDLFileSystem(token, store_name=store_name)

# typical operations
adl.ls('')
adl.ls('tmp/', detail=True)
adl.cat('littlefile')
adl.head('gdelt20150827.csv')

# file-like object
with adl.open('gdelt20150827.csv', blocksize=2**20) as f:
    print(f.readline())
    print(f.readline())
    print(f.readline())
    # f could have been passed to any function requiring a file object,
    # e.g. pandas.read_csv(f)

with adl.open('anewfile', 'wb') as f:
    # data is written on flush/close, or when the buffer exceeds blocksize
    f.write(b'important data')

adl.du('anewfile')

# recursively download the whole directory tree with 5 threads and
# 16MB chunks
multithread.ADLDownloader(adl, "", 'my_temp_dir', 5, 2**24)
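Uploads mirror downloads. The release notes below mention multithread.ADLUploader; the sketch here assumes its argument order mirrors ADLDownloader (remote path, then local path), so verify the exact signature against the samples directory before relying on it:

# recursively upload a local directory tree with 5 threads and 16MB chunks
# (assumes ADLUploader(adl, remote_path, local_path, nthreads, chunksize))
multithread.ADLUploader(adl, 'remote_dir', 'my_temp_dir', 5, 2**24)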
Command Line Sample Usage
To interact with the API at a higher level, you can use the provided command-line interface in samples/cli.py. You will need to set the appropriate environment variables as described above to connect to the Azure Data Lake Store. Below is a simple example, with more detail further on.
python samples/cli.py ls -l
Execute the program without arguments to access documentation.
To start the CLI in interactive mode, run “python samples/cli.py” and then type “help” to see all available commands (similar to Unix utilities):
> python samples/cli.py
azure> help
Documented commands (type help <topic>):
========================================
cat chmod close du get help ls mv quit rmdir touch
chgrp chown df exists head info mkdir put rm tail
azure>
While still in interactive mode, you can run “ls -l” to list the entries in the home directory (“help ls” will show the command’s usage details). If you’re not familiar with the Unix/Linux “ls” command, the columns represent 1) permissions, 2) file owner, 3) file group, 4) file size, 5-7) file’s modification time, and 8) file name.
> python samples/cli.py
azure> ls -l
drwxrwx--- 0123abcd 0123abcd 0 Aug 02 12:44 azure1
-rwxrwx--- 0123abcd 0123abcd 1048576 Jul 25 18:33 abc.csv
-r-xr-xr-x 0123abcd 0123abcd 36 Jul 22 18:32 xyz.csv
drwxrwx--- 0123abcd 0123abcd 0 Aug 03 13:46 tmp
azure> ls -l --human-readable
drwxrwx--- 0123abcd 0123abcd 0B Aug 02 12:44 azure1
-rwxrwx--- 0123abcd 0123abcd 1M Jul 25 18:33 abc.csv
-r-xr-xr-x 0123abcd 0123abcd 36B Jul 22 18:32 xyz.csv
drwxrwx--- 0123abcd 0123abcd 0B Aug 03 13:46 tmp
azure>
To download a remote file, run “get remote-file [local-file]”. The second argument, “local-file”, is optional. If not provided, the local file will be named after the remote file minus the directory path.
> python samples/cli.py
azure> ls -l
drwxrwx--- 0123abcd 0123abcd 0 Aug 02 12:44 azure1
-rwxrwx--- 0123abcd 0123abcd 1048576 Jul 25 18:33 abc.csv
-r-xr-xr-x 0123abcd 0123abcd 36 Jul 22 18:32 xyz.csv
drwxrwx--- 0123abcd 0123abcd 0 Aug 03 13:46 tmp
azure> get xyz.csv
2016-08-04 18:57:48,603 - ADLFS - DEBUG - Creating empty file xyz.csv
2016-08-04 18:57:48,604 - ADLFS - DEBUG - Fetch: xyz.csv, 0-36
2016-08-04 18:57:49,726 - ADLFS - DEBUG - Downloaded to xyz.csv, byte offset 0
2016-08-04 18:57:49,734 - ADLFS - DEBUG - File downloaded (xyz.csv -> xyz.csv)
azure>
It is also possible to run in command-line mode, allowing any available command to be executed separately without remaining in the interpreter.
For example, listing the entries in the home directory:
> python samples/cli.py ls -l
drwxrwx--- 0123abcd 0123abcd 0 Aug 02 12:44 azure1
-rwxrwx--- 0123abcd 0123abcd 1048576 Jul 25 18:33 abc.csv
-r-xr-xr-x 0123abcd 0123abcd 36 Jul 22 18:32 xyz.csv
drwxrwx--- 0123abcd 0123abcd 0 Aug 03 13:46 tmp
>
Also, downloading a remote file:
> python samples/cli.py get xyz.csv
2016-08-04 18:57:48,603 - ADLFS - DEBUG - Creating empty file xyz.csv
2016-08-04 18:57:48,604 - ADLFS - DEBUG - Fetch: xyz.csv, 0-36
2016-08-04 18:57:49,726 - ADLFS - DEBUG - Downloaded to xyz.csv, byte offset 0
2016-08-04 18:57:49,734 - ADLFS - DEBUG - File downloaded (xyz.csv -> xyz.csv)
>
Release History
0.0.4 (2017-02-07)
Fix folder upload to properly delete existing folders with contents when overwrite is specified.
Fix to turn verbose output off by default. This removes progress-tracking output by default but drastically improves performance.
0.0.3 (2017-02-02)
Fix to setup.py to include the HISTORY.rst file. No other changes.
0.0.2 (2017-01-30)
Addresses an issue with lib.auth() not properly defaulting to 2FA.
Fixes an issue with Overwrite for ADLUploader sometimes not being honored.
Fixes an issue with empty files not properly being uploaded and resulting in a hang in progress tracking.
Addition of a samples directory showcasing examples of how to use the client and upload and download logic.
General cleanup of documentation and comments.
This release is still based on API version 2016-11-01.
0.0.1 (2016-11-21)
Initial preview release. Based on API version 2016-11-01.
Includes initial ADLS filesystem functionality and extended upload and download support.
Download files
Download the file for your platform.
Source Distribution
Hashes for azure-datalake-store-0.0.4.tar.gz
Algorithm   | Hash digest
------------|-----------------------------------------------------------------
SHA256      | eac922b8bb85200cc93e7dac571bf013477ccb39d4b3e77e268c914d52e16ad3
MD5         | fcc5109152aad5f56e3aae6e5b0f34f1
BLAKE2b-256 | 05988762dd336beddedbf2b67237861790d2bf02302aa7190ddbcf73c017e07f
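To verify a downloaded archive against the SHA256 digest above, a standard-library sketch (the filename assumes the default name of the source distribution):

import hashlib

# Compute the SHA256 of the downloaded source distribution in chunks.
expected = 'eac922b8bb85200cc93e7dac571bf013477ccb39d4b3e77e268c914d52e16ad3'
h = hashlib.sha256()
with open('azure-datalake-store-0.0.4.tar.gz', 'rb') as f:
    for chunk in iter(lambda: f.read(8192), b''):
        h.update(chunk)
assert h.hexdigest() == expected, 'hash mismatch: file may be corrupted'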