Use pyarrow with Azure Data Lake gen2
pyarrowfs-adlgen2
pyarrowfs-adlgen2 is an implementation of a pyarrow filesystem for Azure Data Lake Gen2.
It allows you to use pyarrow and pandas to read and write parquet datasets directly from Azure, without first copying files to local storage.
Installation
pip install pyarrowfs-adlgen2
Reading datasets
Example usage with a pandas DataFrame:
import azure.identity
import pandas as pd
import pyarrow.fs
import pyarrowfs_adlgen2
handler = pyarrowfs_adlgen2.AccountHandler.from_account_name(
    'YOUR_ACCOUNT_NAME', azure.identity.DefaultAzureCredential())
fs = pyarrow.fs.PyFileSystem(handler)
df = pd.read_parquet('container/dataset.parq', filesystem=fs)
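pd.read_parquet accepts the usual pandas parameters as well, so you can, for example, read only a subset of columns. A minimal sketch (the column names are hypothetical):

# Read only the columns you need; 'timestamp' and 'value' are example names
df = pd.read_parquet(
    'container/dataset.parq',
    filesystem=fs,
    columns=['timestamp', 'value'])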
Example usage with arrow tables:
import azure.identity
import pyarrow.dataset
import pyarrow.fs
import pyarrowfs_adlgen2
handler = pyarrowfs_adlgen2.AccountHandler.from_account_name(
    'YOUR_ACCOUNT_NAME', azure.identity.DefaultAzureCredential())
fs = pyarrow.fs.PyFileSystem(handler)
ds = pyarrow.dataset.dataset('container/dataset.parq', filesystem=fs)
table = ds.to_table()
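pyarrow.dataset can also project columns and filter rows while scanning, which can reduce how much data is read from Azure. A minimal sketch (the column names are hypothetical):

# Only materialize selected columns and rows matching the filter expression
table = ds.to_table(
    columns=['year', 'value'],
    filter=pyarrow.dataset.field('year') >= 2020)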
Writing datasets
As of pyarrow version 1.0.1, pyarrow.parquet.ParquetWriter does not support pyarrow.fs.PyFileSystem, but data can be written to open files:
with fs.open_output_stream('container/out.parq') as out:
    df.to_parquet(out)
Or with arrow tables:
import pyarrow.parquet
with fs.open_output_stream('container/out.parq') as out:
    pyarrow.parquet.write_table(table, out)
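write_table also accepts the usual parquet options, for example a compression codec; a minimal sketch:

with fs.open_output_stream('container/out.parq') as out:
    # Same as above, but explicitly choosing the compression codec
    pyarrow.parquet.write_table(table, out, compression='snappy')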
Accessing only a single container/file-system
If you do not want to, or cannot, access the whole storage account as a single filesystem, you can use pyarrowfs_adlgen2.FilesystemHandler to view a single file system within an account:
import azure.identity
import pyarrowfs_adlgen2
handler = pyarrowfs_adlgen2.FilesystemHandler.from_account_name(
    "STORAGE_ACCOUNT", "FS_NAME", azure.identity.DefaultAzureCredential())
All access is done through the file system within the storage account.
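The handler is wrapped in pyarrow.fs.PyFileSystem just like the account handler; paths are then given relative to the chosen file system rather than prefixed with a container name. A minimal sketch (the path 'dataset.parq' is hypothetical):

import pandas as pd
import pyarrow.fs

fs = pyarrow.fs.PyFileSystem(handler)
# The path is relative to FS_NAME, not prefixed with the container name
df = pd.read_parquet('dataset.parq', filesystem=fs)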
Running tests
To run the integration tests, you need:
- An Azure Storage Account V2 with hierarchical namespace enabled (a Data Lake gen2 account)
- A configured Azure login (e.g. run az login, or set up environment variables, see azure.identity.DefaultAzureCredential)
- pytest installed, e.g. pip install pytest
NB! All data in the storage account is deleted during testing, USE AN EMPTY ACCOUNT
AZUREARROWFS_TEST_ACT=thestorageaccount pytest