Use pyarrow with Azure Data Lake gen2

Project description

pyarrowfs-adlgen2

pyarrowfs-adlgen2 is an implementation of a pyarrow filesystem for Azure Data Lake Gen2.

It allows you to use pyarrow and pandas to read (and write) parquet datasets directly in Azure, without copying files to local storage first.

Reading datasets

Example usage with pandas dataframe:

import azure.identity
import pandas as pd
import pyarrow.fs
import pyarrowfs_adlgen2

handler = pyarrowfs_adlgen2.AccountHandler.from_account_name(
    'YOUR_ACCOUNT_NAME', azure.identity.DefaultAzureCredential())
fs = pyarrow.fs.PyFileSystem(handler)
df = pd.read_parquet('container/dataset.parq', filesystem=fs)
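
pd.read_parquet also accepts a columns argument, so you can restrict the read to a subset of columns instead of downloading the whole dataset. A minimal sketch, with hypothetical column names:

df = pd.read_parquet(
    'container/dataset.parq',
    columns=['a', 'b'],  # hypothetical column names
    filesystem=fs)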

Example usage with arrow tables:

import azure.identity
import pyarrow.dataset
import pyarrow.fs
import pyarrowfs_adlgen2

handler = pyarrowfs_adlgen2.AccountHandler.from_account_name(
    'YOUR_ACCOUNT_NAME', azure.identity.DefaultAzureCredential())
fs = pyarrow.fs.PyFileSystem(handler)
ds = pyarrow.dataset.dataset('container/dataset.parq', filesystem=fs)
table = ds.to_table()
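
The dataset API also supports column projection and row filtering at read time. A minimal sketch, assuming hypothetical value and year columns:

# read one column, skipping rows (and, via parquet statistics,
# whole row groups) that fail the filter
table = ds.to_table(
    columns=['value'],
    filter=pyarrow.dataset.field('year') >= 2020)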

Writing datasets

As of pyarrow version 1.0.1, pyarrow.parquet.ParquetWriter does not support pyarrow.fs.PyFileSystem, but data can be written to open files:

with fs.open_output_stream('container/out.parq') as out:
    df.to_parquet(out) 

Or with arrow tables:

import pyarrow.parquet

with fs.open_output_stream('container/out.parq') as out:
    pyarrow.parquet.write_table(table, out)

Accessing only a single container/file-system

If you do not want to access, or cannot access, the whole storage account as a single filesystem, you can use pyarrowfs_adlgen2.FileSystemHandler to view a single file system within an account:

import azure.identity
import pyarrowfs_adlgen2

handler = pyarrowfs_adlgen2.FileSystemHandler.from_account_name(
    "STORAGE_ACCOUNT", "FS_NAME", azure.identity.DefaultAzureCredential())

All access is done through the file system within the storage account.
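
The resulting handler is wrapped in pyarrow.fs.PyFileSystem just like the account-level handler. A minimal sketch, assuming paths are then given relative to the chosen file system (no container prefix):

import pandas as pd
import pyarrow.fs

fs = pyarrow.fs.PyFileSystem(handler)
# 'dataset.parq' resolves inside FS_NAME
df = pd.read_parquet('dataset.parq', filesystem=fs)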

Running tests

To run the integration tests, you need:

  • An Azure Storage Account V2 with hierarchical namespace enabled (a Data Lake Gen2 account)
  • Azure login configured (e.g. run $ az login or set environment variables; see azure.identity.DefaultAzureCredential)
  • pytest installed (e.g. pip install pytest)

NB! All data in the storage account is deleted during testing. USE AN EMPTY ACCOUNT.

AZUREARROWFS_TEST_ACT=thestorageaccount pytest

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pyarrowfs-adlgen2-0.1.0.tar.gz (7.4 kB)

Built Distribution

pyarrowfs_adlgen2-0.1.0-py3-none-any.whl (7.8 kB)

File details

Details for the file pyarrowfs-adlgen2-0.1.0.tar.gz.

File metadata

  • Download URL: pyarrowfs-adlgen2-0.1.0.tar.gz
  • Upload date:
  • Size: 7.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/50.0.3 requests-toolbelt/0.9.1 tqdm/4.48.2 CPython/3.7.8

File hashes

Hashes for pyarrowfs-adlgen2-0.1.0.tar.gz

  • SHA256: 6388429b41a351c820a178b83ca2539395c1b0cfa9b1c60d2d4eaec68901da43
  • MD5: 5dd284fd0a5aac9e6e227e3b2643285a
  • BLAKE2b-256: e64981a1fd1bba4256f2117f40b3321bd84f3f83ac1e4680d590833ff6e83cc5



File details

Details for the file pyarrowfs_adlgen2-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: pyarrowfs_adlgen2-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 7.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/50.0.3 requests-toolbelt/0.9.1 tqdm/4.48.2 CPython/3.7.8

File hashes

Hashes for pyarrowfs_adlgen2-0.1.0-py3-none-any.whl

  • SHA256: 832543ebdab3431683937ed206e9911d8b9b5551cc6bffec2babccb21dde97ea
  • MD5: e49a8962bb4b6e9721f8427e9e297d03
  • BLAKE2b-256: ff49988d5ef4274167aefda5596b68d28b9daf2e7a08ef299168cd9bda800afd


