
Access Azure Datalake Gen1 and Gen2 with fsspec and Dask

Project description

Dask interface to Azure Datalake Gen1 and Gen2 Storage

Warning: this code is experimental and untested.

Quickstart

This package is on PyPI and can be installed using:

pip install adlfs

In your code, first import fsspec's registry of known implementations:

from fsspec.registry import known_implementations

To use the Gen1 filesystem:

known_implementations['adl'] = {'class': 'adlfs.AzureDatalakeFileSystem'}

To use the Gen2 filesystem:

known_implementations['abfs'] = {'class': 'adlfs.AzureBlobFileSystem'}

This allows operations such as:

import dask.dataframe as dd

STORAGE_OPTIONS = {
    'tenant_id': TENANT_ID,
    'client_id': CLIENT_ID,
    'client_secret': CLIENT_SECRET,
    'storage_account': STORAGE_ACCOUNT,
    'filesystem': FILESYSTEM,
}

dd.read_csv('abfs://folder/file.csv', storage_options=STORAGE_OPTIONS)

Details

The package includes pythonic filesystem implementations for both Azure Datalake Gen1 and Azure Datalake Gen2, facilitating interaction with both Datalake generations from Dask via the intake/filesystem_spec base class.

Operations against both Gen1 and Gen2 datalakes currently require an Azure Service Principal with suitable credentials to perform operations on the resources of choice.
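Once the implementations are registered, a filesystem object can also be instantiated directly through fsspec. Below is a minimal sketch, assuming the Gen2 filesystem accepts the same keyword arguments as the storage_options dict in the Quickstart above (TENANT_ID and friends are placeholders for your Service Principal credentials):

import fsspec

# Sketch: keyword arguments mirror the Quickstart storage_options;
# fsspec passes them through to the registered filesystem class.
fs = fsspec.filesystem(
    'abfs',
    tenant_id=TENANT_ID,
    client_id=CLIENT_ID,
    client_secret=CLIENT_SECRET,
    storage_account=STORAGE_ACCOUNT,
    filesystem=FILESYSTEM,
)
fs.ls('folder/')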

Operations on the Azure Gen1 Datalake are implemented through multiple inheritance from both fsspec.AbstractFileSystem and the Azure Python Gen1 Filesystem library, while operations against the Azure Gen2 Datalake are implemented by subclassing fsspec.AbstractFileSystem and calling the Azure Datalake Gen2 API. Note that the Azure Datalake Gen2 API allows calls over either the 'http://' or 'https://' protocol, designated by an 'abfs[s]://' prefix. Under the hood, adlfs always uses 'https://', via the requests library.
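For illustration, here is a minimal sketch of the Gen2 subclassing pattern. The class name and method body are hypothetical, not adlfs's actual implementation; only the fsspec.AbstractFileSystem base class comes from the text above:

from fsspec import AbstractFileSystem

class SketchAzureFileSystem(AbstractFileSystem):
    # Hypothetical skeleton: the real adlfs class wires methods like
    # ls() to the Azure Datalake Gen2 REST API over https.
    protocol = 'abfs'

    def __init__(self, storage_account, filesystem, **kwargs):
        super().__init__(**kwargs)
        self.storage_account = storage_account
        self.filesystem = filesystem

    def ls(self, path, detail=True, **kwargs):
        # A real implementation would list 'path' via the Gen2 API here.
        raise NotImplementedError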

An Azure Datalake Gen2 URL takes the following form, which is replicated in the adlfs library for the sake of consistency: 'abfs[s]://{storage_account}/{filesystem}/{folder}/{file}'
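As a sketch, the components of such a URL can be recovered with ordinary string handling (the example URL and variable names here are hypothetical):

# Split a hypothetical Gen2 URL into its documented components.
url = 'abfs://mystorageaccount/myfilesystem/folder/file.csv'

path = url.split('://', 1)[1]
storage_account, filesystem, *parts = path.split('/')
file_path = '/'.join(parts)  # 'folder/file.csv'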

Currently, when using either the 'adl://' or 'abfs://' protocols in a Dask operation, the storage_options must be declared explicitly, as described in the Dask documentation. The intent is to eliminate this requirement for (at a minimum) Gen2 operations, by having the adlfs library parse the filesystem name from the URL itself.
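For the Gen1 'adl://' protocol, a comparable call might look like the following sketch; the URL form and the storage_options keys for Gen1 are assumed here to mirror the Service Principal credentials above and are not confirmed by this README:

import dask.dataframe as dd

# Hypothetical: Gen1 credential keyword names are assumed, not documented here.
STORAGE_OPTIONS = {
    'tenant_id': TENANT_ID,
    'client_id': CLIENT_ID,
    'client_secret': CLIENT_SECRET,
}

dd.read_csv('adl://folder/file.csv', storage_options=STORAGE_OPTIONS)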


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

adlfs-0.0.5a0.tar.gz (7.0 kB)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

adlfs-0.0.5a0-py3-none-any.whl (8.2 kB)

Uploaded Python 3

File details

Details for the file adlfs-0.0.5a0.tar.gz.

File metadata

  • Download URL: adlfs-0.0.5a0.tar.gz
  • Upload date:
  • Size: 7.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.14.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.35.0 CPython/2.7.15rc1

File hashes

Hashes for adlfs-0.0.5a0.tar.gz
  • SHA256: 5138a0754d0b1e73c9e23061aa1a181512361f0a6a2a82add74f38a327b26356
  • MD5: 64ea415defac229acd84f69726f6881d
  • BLAKE2b-256: 708ef50ed15048b0eeb4a44c1f84812e3c8445b69a072e03535ea55acd033fd6


File details

Details for the file adlfs-0.0.5a0-py3-none-any.whl.

File metadata

  • Download URL: adlfs-0.0.5a0-py3-none-any.whl
  • Upload date:
  • Size: 8.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.14.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.35.0 CPython/2.7.15rc1

File hashes

Hashes for adlfs-0.0.5a0-py3-none-any.whl
  • SHA256: eba6f4032da4bbe836c63c7cd4b1d3b9f8b301e1cceddfb2807ae6238c476310
  • MD5: 1f8f763bb240c951bd7c875c1a377013
  • BLAKE2b-256: d5dca0f44e890f93a34276637959f00ce99e79ea33f32e72d2dfb53cf9b3d7e5

