Dask-DeltaTable
Reading from and writing to Delta Lake using the Dask engine.
Installation
dask-deltatable is available on PyPI:
pip install dask-deltatable
And conda-forge:
conda install -c conda-forge dask-deltatable
Features:
- Read parquet files from Delta Lake and parallelize with Dask
- Write Dask dataframes to Delta Lake (limited support)
- Supports multiple filesystems (s3, azurefs, gcsfs)
- Subset of Delta Lake features:
  - Time travel
  - Schema evolution
  - Parquet filters:
    - row filter
    - partition filter
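As a hedged sketch of what the parquet-filter features could look like in use: pyarrow-style predicates are plain lists of `(column, op, value)` tuples. The column names below are hypothetical, and the `filter=` keyword is an assumption; check the library's docstring for the actual parameter name.

```python
# Pyarrow-style filter predicates: lists of (column, op, value) tuples.
# Row filters prune rows inside each file; partition filters can skip
# entire partitions before any data is read.
row_filter = [("temperature", ">", 20.0)]   # hypothetical column
partition_filter = [("year", "==", 2021)]   # hypothetical partition column

# Assumed usage (keyword name unverified; see the dask-deltatable docs):
# import dask_deltatable as ddt
# df = ddt.read_deltalake("delta_path", filter=row_filter + partition_filter)
```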
Not supported
- Writing to Delta Lake is still in development.
- The optimize API, which runs a bin-packing operation on a Delta Table.
Reading from Delta Lake
import dask_deltatable as ddt
# read delta table
df = ddt.read_deltalake("delta_path")
# with specific version
df = ddt.read_deltalake("delta_path", version=3)
# with specific datetime
df = ddt.read_deltalake("delta_path", datetime="2018-12-19T16:39:57-08:00")
df is a Dask DataFrame that you can work with in the same way you normally would. See
the Dask DataFrame documentation for
available operations.
Accessing remote file systems
To read from S3, Azure, GCS, and other remote filesystems, ensure the
credentials are properly configured in environment variables or config
files. For AWS, you may need ~/.aws/credentials; for gcsfs, set
GOOGLE_APPLICATION_CREDENTIALS. Refer to your cloud provider's
documentation to configure these.
ddt.read_deltalake("s3://bucket_name/delta_path", version=3)
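Credentials can often also be passed programmatically. The `storage_options` keyword below is an assumption borrowed from the fsspec convention that many Dask readers follow; verify it against the dask-deltatable documentation before relying on it.

```python
# A hedged sketch: fsspec-based readers commonly accept a `storage_options`
# mapping for credentials. Whether read_deltalake takes this exact keyword
# should be checked against the library docs. Values are placeholders.
storage_options = {
    "key": "YOUR_AWS_ACCESS_KEY_ID",         # placeholder, not a real key
    "secret": "YOUR_AWS_SECRET_ACCESS_KEY",  # placeholder
}

# Assumed usage:
# import dask_deltatable as ddt
# df = ddt.read_deltalake("s3://bucket_name/delta_path",
#                         storage_options=storage_options)
```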
Accessing AWS Glue catalog
dask-deltatable can connect to AWS Glue catalog to read the delta table.
The method will look for AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY
environment variables, and if those are not available, fall back to
~/.aws/credentials.
Example:
ddt.read_deltalake(catalog="glue", database_name="science", table_name="physics")
Accessing Unity catalog
dask-deltatable can connect to Unity catalog to read the delta table.
The method will look for DATABRICKS_HOST and DATABRICKS_TOKEN environment
variables or try to find them as kwargs with the same name but lowercase.
Example:
ddt.read_unity_catalog(
catalog_name="projects",
schema_name="science",
table_name="physics"
)
Writing to Delta Lake
To write a Dask dataframe to Delta Lake, use the to_deltalake method.
import dask.dataframe as dd
import dask_deltatable as ddt
df = dd.read_csv("s3://bucket_name/data.csv")
# do some processing on the dataframe...
ddt.to_deltalake("s3://bucket_name/delta_path", df)
Writing to Delta Lake is still in development, so be aware that some features may not work.
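One feature worth sketching is partitioned writes. The underlying delta-rs writer supports partitioning by column, and the `partition_by` keyword below assumes dask-deltatable exposes something similar; treat this as an unverified sketch and check the library docs first.

```python
# Hedged sketch: partition the table on disk by the listed columns.
# The `partition_by` keyword is an assumption (delta-rs's write_deltalake
# uses this name); the column names are hypothetical.
partition_cols = ["year", "month"]

# Assumed usage:
# import dask_deltatable as ddt
# ddt.to_deltalake("s3://bucket_name/delta_path", df,
#                  partition_by=partition_cols)
```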
File details
Details for the file dask_deltatable-0.4.0.tar.gz.
File metadata
- Download URL: dask_deltatable-0.4.0.tar.gz
- Upload date:
- Size: 26.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.5.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d75af862b5d89d435bf425b5ee9c46ef557a8993fcaba6090013c3b139153d29 |
| MD5 | e4b2b0f839e1a5def0177e708e93567d |
| BLAKE2b-256 | f35d151216c49cb2978f3da65dd3867f580757a77e1f21bc6b9e359f3be33441 |
File details
Details for the file dask_deltatable-0.4.0-py3-none-any.whl.
File metadata
- Download URL: dask_deltatable-0.4.0-py3-none-any.whl
- Upload date:
- Size: 21.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.5.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 63da540706ac0d7eb3bf066e99d46b6891bd04fd34a35e43d0a13fc4ec5ee5e1 |
| MD5 | 74ae19696b336ef901931af67df14aad |
| BLAKE2b-256 | a3191143ef5e88855cd4d4e45d1cb530ebdfbe40081f2f1cfd9596fc72bfd011 |