Use pyarrow with Azure Data Lake gen2

pyarrowfs-adlgen2

pyarrowfs-adlgen2 is an implementation of a pyarrow filesystem for Azure Data Lake Gen2.

It allows you to use pyarrow and pandas to read parquet datasets directly from Azure without the need to copy files to local storage first.

Installation

pip install pyarrowfs-adlgen2

Reading datasets

Example usage with a pandas DataFrame:

import azure.identity
import pandas as pd
import pyarrow.fs
import pyarrowfs_adlgen2

handler = pyarrowfs_adlgen2.AccountHandler.from_account_name(
    'YOUR_ACCOUNT_NAME', azure.identity.DefaultAzureCredential())
fs = pyarrow.fs.PyFileSystem(handler)
df = pd.read_parquet('container/dataset.parq', filesystem=fs)

Example usage with arrow tables:

import azure.identity
import pyarrow.dataset
import pyarrow.fs
import pyarrowfs_adlgen2

handler = pyarrowfs_adlgen2.AccountHandler.from_account_name(
    'YOUR_ACCOUNT_NAME', azure.identity.DefaultAzureCredential())
fs = pyarrow.fs.PyFileSystem(handler)
ds = pyarrow.dataset.dataset('container/dataset.parq', filesystem=fs)
table = ds.to_table()

Configuring timeouts

Timeouts are passed on to the azure-storage-file-datalake SDK methods and are given in seconds.

import azure.identity
import pyarrowfs_adlgen2

handler = pyarrowfs_adlgen2.AccountHandler.from_account_name(
    'YOUR_ACCOUNT_NAME',
    azure.identity.DefaultAzureCredential(),
    timeouts=pyarrowfs_adlgen2.Timeouts(file_system_timeout=10)
)
# or mutate it:
handler.timeouts.file_client_timeout = 20

Writing datasets

With pyarrow version 3 or greater, you can write datasets from arrow tables:

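import pyarrow as pa
import pyarrow.dataset
import pyarrow.fs

# `table` is a pyarrow.Table and `handler` is a handler created as in the examples above.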

pyarrow.dataset.write_dataset(
    table,
    'name.pq',
    format='parquet',
    partitioning=pyarrow.dataset.partitioning(
        schema=pyarrow.schema([('year', pa.int32())]), flavor='hive'
    ),
    filesystem=pyarrow.fs.PyFileSystem(handler)
)

With earlier versions, files must be opened and written one at a time. As of pyarrow version 1.0.1, pyarrow.parquet.ParquetWriter does not support pyarrow.fs.PyFileSystem, but data can be written to open files:

with fs.open_output_stream('container/out.parq') as out:
    df.to_parquet(out)

Or with arrow tables:

import pyarrow.parquet

with fs.open_output_stream('container/out.parq') as out:
    pyarrow.parquet.write_table(table, out)

Accessing only a single container/file-system

If you do not want to, or cannot, access the whole storage account as a single filesystem, you can use pyarrowfs_adlgen2.FilesystemHandler to view a single file system within an account:

import azure.identity
import pyarrowfs_adlgen2

handler = pyarrowfs_adlgen2.FilesystemHandler.from_account_name(
    "STORAGE_ACCOUNT", "FS_NAME", azure.identity.DefaultAzureCredential())

All access is done through the file system within the storage account.
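The examples above address files as 'container/dataset.parq' when using AccountHandler, so the first path segment names the file system. The sketch below shows one way to split such a path into the two parts you would need when switching to FilesystemHandler. It assumes that FilesystemHandler paths are relative to the chosen file system, and split_account_path is a hypothetical helper, not part of pyarrowfs_adlgen2.

```python
from typing import Tuple


def split_account_path(path: str) -> Tuple[str, str]:
    """Split 'container/dir/file' into ('container', 'dir/file').

    Hypothetical helper for illustration; not part of pyarrowfs_adlgen2.
    """
    # Drop leading/trailing slashes, then cut at the first remaining slash.
    file_system, _, rest = path.strip("/").partition("/")
    if not file_system:
        raise ValueError("path must start with a file system (container) name")
    return file_system, rest


# 'container' would be passed as the file system name, and
# 'dataset.parq' would be the path used with the resulting filesystem
# (assumption: FilesystemHandler paths are relative to the file system).
file_system, relative_path = split_account_path("container/dataset.parq")
```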

Setting HTTP headers on files (pyarrow >= 5)

You can set headers for any output files by using the metadata argument to handler.open_output_stream:

import pyarrowfs_adlgen2

fs = pyarrowfs_adlgen2.AccountHandler.from_account_name("theaccount").to_fs()
metadata = {"content_type": "application/json"}
with fs.open_output_stream("container/data.json", metadata) as out:
    out.write(b"{}")  # pyarrow output streams take bytes-like data

Note that the keys are spelled differently from the HTTP header names: content_type rather than Content-Type, for example. For a list of valid keys, see ContentSettings.

You can do this for pyarrow >= 5 when using pyarrow.fs.PyFileSystem, and for any pyarrow if using the handlers from pyarrowfs_adlgen2 directly.
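Because the metadata keys use underscore spelling rather than HTTP header spelling, a small conversion step can help avoid typos. The following is a minimal sketch, not part of pyarrowfs_adlgen2: the function names are hypothetical, and the set of keys is assumed to match the attribute names of ContentSettings in the Azure SDK. Whether the handler honors every one of them is not confirmed here.

```python
# Attribute names of ContentSettings, as assumed for this sketch.
VALID_KEYS = frozenset({
    "content_type",
    "content_encoding",
    "content_language",
    "content_disposition",
    "cache_control",
    "content_md5",
})


def header_to_metadata_key(header_name: str) -> str:
    """Convert an HTTP header name such as 'Content-Type' to 'content_type'."""
    key = header_name.strip().lower().replace("-", "_")
    if key not in VALID_KEYS:
        raise ValueError(f"{header_name!r} has no matching ContentSettings key")
    return key


def headers_to_metadata(headers: dict) -> dict:
    """Build a metadata dict with converted keys and unchanged values."""
    return {header_to_metadata_key(name): value for name, value in headers.items()}


metadata = headers_to_metadata({
    "Content-Type": "application/json",
    "Cache-Control": "max-age=3600",
})
```

The resulting dict could then be passed as the metadata argument to open_output_stream, as in the example above.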

Running tests

To run the integration tests, you need:

  • An Azure Storage Account V2 with hierarchical namespace enabled (a Data Lake Gen2 account)
  • A configured Azure login (for example, run $ az login or set up environment variables; see azure.identity.DefaultAzureCredential)
  • pytest installed, for example with pip install pytest

NB! All data in the storage account is deleted during testing. USE AN EMPTY ACCOUNT.

AZUREARROWFS_TEST_ACT=thestorageaccount pytest
