Use pyarrow with Azure Data Lake gen2

pyarrowfs-adlgen2

pyarrowfs-adlgen2 is an implementation of a pyarrow filesystem for Azure Data Lake Gen2.

It allows you to use pyarrow and pandas to read parquet datasets directly from Azure without the need to copy files to local storage first.

Compared with adlfs, you may see better performance when reading datasets with many files, because pyarrowfs-adlgen2 uses the datalake gen2 SDK, which has fast directory listing, unlike the blob SDK used by adlfs.

pyarrowfs-adlgen2 is stable software with a small API, and no major features are planned.

Installation

pip install pyarrowfs-adlgen2

Reading datasets

Example usage with pandas dataframe:

import azure.identity
import pandas as pd
import pyarrow.fs
import pyarrowfs_adlgen2

handler = pyarrowfs_adlgen2.AccountHandler.from_account_name(
    'YOUR_ACCOUNT_NAME', azure.identity.DefaultAzureCredential())
fs = pyarrow.fs.PyFileSystem(handler)
df = pd.read_parquet('container/dataset.parq', filesystem=fs)

Example usage with arrow tables:

import azure.identity
import pyarrow.dataset
import pyarrow.fs
import pyarrowfs_adlgen2

handler = pyarrowfs_adlgen2.AccountHandler.from_account_name(
    'YOUR_ACCOUNT_NAME', azure.identity.DefaultAzureCredential())
fs = pyarrow.fs.PyFileSystem(handler)
ds = pyarrow.dataset.dataset('container/dataset.parq', filesystem=fs)
table = ds.to_table()

Configuring timeouts

Timeouts are passed to azure-storage-file-datalake SDK methods. They are specified in seconds.

import azure.identity
import pyarrowfs_adlgen2

handler = pyarrowfs_adlgen2.AccountHandler.from_account_name(
    'YOUR_ACCOUNT_NAME',
    azure.identity.DefaultAzureCredential(),
    timeouts=pyarrowfs_adlgen2.Timeouts(file_system_timeout=10)
)
# or mutate it:
handler.timeouts.file_client_timeout = 20

Writing datasets

With pyarrow version 3 or greater, you can write datasets from arrow tables:

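import pyarrow as pa
import pyarrow.dataset
import pyarrow.fs

# `table` is a pyarrow.Table and `handler` is a handler created as shown above.
pyarrow.dataset.write_dataset(
    table,
    'name.pq',
    format='parquet',
    # Write one hive-style directory (year=...) per distinct year value.
    partitioning=pyarrow.dataset.partitioning(
        schema=pyarrow.schema([('year', pa.int32())]), flavor='hive'
    ),
    filesystem=pyarrow.fs.PyFileSystem(handler)
)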

With earlier versions, files must be opened and written one at a time. As of pyarrow version 1.0.1, pyarrow.parquet.ParquetWriter does not support pyarrow.fs.PyFileSystem, but data can be written to open files. With a pandas dataframe:

with fs.open_output_stream('container/out.parq') as out:
    df.to_parquet(out)

Or with arrow tables:

import pyarrow.parquet

with fs.open_output_stream('container/out.parq') as out:
    pyarrow.parquet.write_table(table, out)

Accessing only a single container/file-system

If you do not want to, or cannot, access the whole storage account as a single filesystem, you can use pyarrowfs_adlgen2.FilesystemHandler to view a single file system within an account:

import azure.identity
import pyarrowfs_adlgen2

handler = pyarrowfs_adlgen2.FilesystemHandler.from_account_name(
    "STORAGE_ACCOUNT", "FS_NAME", azure.identity.DefaultAzureCredential())

All access is done through the file system within the storage account.
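The account-level examples address files as 'container/dataset.parq', where the first path segment names the container. With FilesystemHandler the container is fixed up front, so paths are presumably given relative to it; that is an assumption, not something stated above. The helper below (split_account_path is a hypothetical name, not part of pyarrowfs_adlgen2) sketches how an account-level path could be split when moving code from one handler to the other:

```python
def split_account_path(path):
    """Split 'container/dir/file' into ('container', 'dir/file').

    Hypothetical helper: it only manipulates strings and does not
    call pyarrowfs_adlgen2 or Azure.
    """
    container, _, rest = path.strip("/").partition("/")
    if not container:
        raise ValueError("path must start with a container name")
    return container, rest


container, relative_path = split_account_path("container/dataset.parq")
print(container, relative_path)
```

Under that assumption, container would be passed as the file system name to FilesystemHandler.from_account_name, and relative_path would be used when reading or writing.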

Set HTTP headers for files (pyarrow >= 5)

You can set headers for any output files by using the metadata argument to handler.open_output_stream:

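import pyarrowfs_adlgen2

fs = pyarrowfs_adlgen2.AccountHandler.from_account_name("theaccount").to_fs()
metadata = {"content_type": "application/json"}
# Pass metadata by keyword: on a pyarrow filesystem, the second positional
# parameter of open_output_stream is compression, not metadata.
with fs.open_output_stream("container/data.json", metadata=metadata) as out:
    out.write(b"{}")  # write bytes rather than str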

Note that the keys are not spelled like HTTP header names: the example uses content_type, not Content-Type. For a list of valid keys, see ContentSettings.

You can do this for pyarrow >= 5 when using pyarrow.fs.PyFileSystem, and for any pyarrow if using the handlers from pyarrowfs_adlgen2 directly.
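As a small illustration of the spelling difference, the sketch below converts HTTP-style header names into the lowercase, underscore-separated form used in the example above. The function names are hypothetical and not part of pyarrowfs_adlgen2, and the conversion is only a naming convention: whether a given key is accepted still depends on ContentSettings.

```python
def header_to_metadata_key(header_name):
    """Turn an HTTP header name such as 'Content-Type' into 'content_type'."""
    return header_name.strip().lower().replace("-", "_")


def headers_to_metadata(headers):
    """Convert a dict of HTTP-style headers into a metadata dict."""
    return {header_to_metadata_key(name): value for name, value in headers.items()}


metadata = headers_to_metadata({
    "Content-Type": "application/json",
    "Cache-Control": "no-cache",
})
print(metadata)
```

The resulting dict has the same shape as the metadata dict passed to open_output_stream above.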

Running tests

To run the integration tests, you need:

An Azure Storage Account V2 with hierarchical namespace enabled (a Data Lake Gen2 account)
Azure login configured (e.g. run $ az login or set up environment variables; see azure.identity.DefaultAzureCredential)
pytest installed, e.g. pip install pytest

NB! All data in the storage account is deleted during testing. USE AN EMPTY ACCOUNT.

AZUREARROWFS_TEST_ACT=thestorageaccount pytest
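The command above passes the storage account name through the AZUREARROWFS_TEST_ACT environment variable. How the test suite consumes it is not shown here. The sketch below is one way a script could read the variable and refuse to continue when it is missing; account_url_from_env is a hypothetical helper, and the URL assumes the standard Data Lake Gen2 endpoint format https://<account>.dfs.core.windows.net.

```python
import os


def account_url_from_env(environ=None):
    """Return the Data Lake endpoint URL for the configured test account.

    Hypothetical helper; only the variable name AZUREARROWFS_TEST_ACT
    comes from the test command above.
    """
    environ = os.environ if environ is None else environ
    account = environ.get("AZUREARROWFS_TEST_ACT", "").strip()
    if not account:
        # Fail loudly: the tests delete all data in the account they run against.
        raise RuntimeError(
            "Set AZUREARROWFS_TEST_ACT to the name of an EMPTY test storage account")
    return f"https://{account}.dfs.core.windows.net"


print(account_url_from_env({"AZUREARROWFS_TEST_ACT": "thestorageaccount"}))
```

Requiring the variable explicitly, instead of falling back to a default account, makes it harder to point the destructive tests at the wrong account by accident.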

Performance

Here is an informal comparison test against adlfs, done against a copy of the NYC taxi dataset.

The test setup was as follows:

Create an Azure Data Lake Gen2 storage account with a container. I clicked through the portal to do this step. Grant yourself the Storage Blob Data Owner role on the account.
Upload the NYC taxi dataset to the container. Do this with azcopy or the az CLI, or it will take a long time. Here's the command I used, which only took a few seconds: az storage copy -s https://azureopendatastorage.blob.core.windows.net/nyctlc/yellow --recursive -d https://benchpyarrowfs.blob.core.windows.net/taxi/
Set up a venv for the test and install the dependencies: python -m venv venv && source venv/bin/activate && pip install pyarrowfs-adlgen2 pandas pyarrow adlfs azure-identity
Log in with az login and select the correct subscription with az account set -s playground-sub

That's the entire test setup. Now we can run some commands against the dataset and time them. Let's see how long it takes to read the passengerCount and tripDistance columns for one month of data (October 2014) using pyarrowfs-adlgen2 and the pyarrow dataset API:

$ time python adlg2_taxi.py 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14227692 entries, 0 to 14227691
Data columns (total 2 columns):
 #   Column          Dtype  
---  ------          -----  
 0   passengerCount  int32  
 1   tripDistance    float64
dtypes: float64(1), int32(1)
memory usage: 162.8 MB

real	0m11,000s
user	0m2,018s
sys	0m1,605s
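The adlg2_taxi.py script itself is not shown. The following is only a guess at what such a script might look like, built from the calls used in the Reading datasets section. The dataset path 'taxi/yellow' and the partition column names puYear and puMonth are assumptions about how the copied dataset is laid out, and read_month is a hypothetical function name.

```python
def read_month(account_name, dataset_path, year, month, columns):
    """Read selected columns for one month of a hive-partitioned dataset."""
    # Imported inside the function so that defining it needs no Azure setup.
    import azure.identity
    import pyarrow.dataset as ds
    import pyarrow.fs
    import pyarrowfs_adlgen2

    handler = pyarrowfs_adlgen2.AccountHandler.from_account_name(
        account_name, azure.identity.DefaultAzureCredential())
    fs = pyarrow.fs.PyFileSystem(handler)
    dataset = ds.dataset(
        dataset_path, filesystem=fs, format='parquet', partitioning='hive')
    # ASSUMPTION: the partition columns are named puYear and puMonth.
    month_filter = (ds.field('puYear') == year) & (ds.field('puMonth') == month)
    # Only the requested columns and the matching partitions are read.
    return dataset.to_table(columns=columns, filter=month_filter).to_pandas()


# Hypothetical usage that would produce output like the listing above:
# df = read_month('benchpyarrowfs', 'taxi/yellow', 2014, 10,
#                 ['passengerCount', 'tripDistance'])
# df.info()
```

Running the usage lines requires the storage account and login described in the test setup.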

Now let's do the same with adlfs:

$ time python adlfs_taxi.py 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14227692 entries, 0 to 14227691
Data columns (total 2 columns):
 #   Column          Dtype  
---  ------          -----  
 0   passengerCount  int32  
 1   tripDistance    float64
dtypes: float64(1), int32(1)
memory usage: 162.8 MB

real	0m31,985s
user	0m3,204s
sys	0m2,110s

For this dataset, pyarrowfs-adlgen2 is about 3 times faster than adlfs (11.0 s versus 32.0 s of wall-clock time), and the difference is not due to bandwidth or compute limitations. This matches my own experience using both professionally. I believe the difference is primarily because adlfs uses the blob storage SDK, which is slow at listing directories, and the NYC taxi dataset has a lot of files and directory structure. adlfs has to parse the blob listing to recover that structure, whereas pyarrowfs-adlgen2 gets it for free from the datalake gen2 SDK.
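The speedup figure can be checked from the two "real" timings reported by time. The short sketch below parses them (the comma is the decimal separator in the output shown) and computes the ratio; parse_time is a hypothetical helper written for this check.

```python
def parse_time(value):
    """Convert a `time` field such as '0m31,985s' to seconds."""
    minutes, _, seconds = value.rstrip("s").partition("m")
    # The output above uses a decimal comma, so normalise it first.
    return int(minutes) * 60 + float(seconds.replace(",", "."))


adlgen2_real = parse_time("0m11,000s")  # pyarrowfs-adlgen2 run
adlfs_real = parse_time("0m31,985s")    # adlfs run
speedup = adlfs_real / adlgen2_real
print(f"{speedup:.2f}x")  # prints 2.91x
```

This compares single runs of wall-clock time only, so it is an informal figure rather than a benchmark result.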

Project details

Release history

0.2.5 (this version): Jun 27, 2024
0.2.4: Mar 5, 2023
0.2.3: Dec 23, 2021
0.2.2: Nov 3, 2021
0.2.1: Oct 6, 2021
0.2.0: Feb 11, 2021
0.1.4: Jan 25, 2021
0.1.3: Jan 16, 2021
0.1.2: Nov 30, 2020
0.1.1: Oct 9, 2020
0.1.0: Sep 3, 2020

Download files

Source Distribution

pyarrowfs_adlgen2-0.2.5.tar.gz (14.1 kB)

Uploaded Jun 27, 2024 Source

Built Distribution

pyarrowfs_adlgen2-0.2.5-py3-none-any.whl (11.5 kB)

Uploaded Jun 27, 2024 Python 3

File details

Details for the file pyarrowfs_adlgen2-0.2.5.tar.gz.

File metadata

Download URL: pyarrowfs_adlgen2-0.2.5.tar.gz
Upload date: Jun 27, 2024
Size: 14.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.11.9

File hashes

Hashes for pyarrowfs_adlgen2-0.2.5.tar.gz
Algorithm	Hash digest
SHA256	`b5ae09d58b21f48f45d538cd1842a93b67eb0950b58ff31d5101bf5bb665ac41`
MD5	`a9891173543c15946fd95e0c2f1aeabf`
BLAKE2b-256	`3b1578f6d21d046db717074c5e530305d5248ac74935c864711d4b54be0f1849`

See more details on using hashes here.
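A downloaded file can be checked against the published SHA256 digest with the standard library alone. The sketch below is a minimal example; the function names are hypothetical, and the local file name in the usage comment assumes the source distribution was saved under its original name.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, expected_hex):
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# SHA256 digest of the source distribution, copied from the table above.
SDIST_SHA256 = "b5ae09d58b21f48f45d538cd1842a93b67eb0950b58ff31d5101bf5bb665ac41"

# Hypothetical usage, assuming the file is in the current directory:
# print(matches_published_digest("pyarrowfs_adlgen2-0.2.5.tar.gz", SDIST_SHA256))
```

pip can perform the same check automatically when requirements are pinned with hashes.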

File details

Details for the file pyarrowfs_adlgen2-0.2.5-py3-none-any.whl.

File metadata

Download URL: pyarrowfs_adlgen2-0.2.5-py3-none-any.whl
Upload date: Jun 27, 2024
Size: 11.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.11.9

File hashes

Hashes for pyarrowfs_adlgen2-0.2.5-py3-none-any.whl
Algorithm	Hash digest
SHA256	`dad4be87a7268cfd27e89125f76501f4655fbcef36527145956ccbbfd65b8d23`
MD5	`ea4620118ee9b1364686f823b816122b`
BLAKE2b-256	`25da0637e54224a6bf3458939a180d6dcfc50d1bba16d67d5e8fd4a31094f783`
